NSF funds novel tools for combining datasets with missing information

Wednesday, August 14, 2019

Mauricio Sadinle, an assistant professor of biostatistics from the University of Washington School of Public Health, received a two-year, $150,000 grant from the National Science Foundation to develop tools to identify and link information on individuals who appear in different datasets. Sadinle’s methodologies will allow researchers to confidently combine pre-existing files to conduct more powerful, larger-scale analyses.

The research will be broadly relevant, with uses that “include, but are not limited to, merging data from public health surveillance systems with electronic medical records or with mortality registries,” Sadinle explained.

“It is increasingly common to find complementary information on individuals scattered across multiple data sources,” he said. For example, a study participant may have health outcome data in one file and biological measures in another. In other cases, researchers combining datasets need to identify the individuals who appear in both so that duplicate records do not skew their analysis.

“To take full advantage of such data sources, researchers need to be able to link information on the same individuals,” Sadinle said. “In many applications, however, there are no unique identifiers of the individuals in the datafiles,” and alternative methods of matching individuals, such as by name and date of birth, are prone to uncertainty. The research aims to reduce the uncertainty in the process of linking records while also developing methodologies that account for the remaining uncertainty in statistical analyses of the linked data.
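
To illustrate why matching without unique identifiers is uncertain, here is a minimal Python sketch of linking records on name and birth date. It is not Sadinle's methodology. The field names (`name`, `birth_date`), the weights and the threshold are all hypothetical choices made for this example, and the name comparison uses the standard library's `difflib.SequenceMatcher` as a stand-in for a proper string-similarity measure.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec_a: dict, rec_b: dict,
                name_weight: float = 0.6, dob_weight: float = 0.4) -> float:
    """Combine name similarity and birth-date agreement into one score.

    The weights are arbitrary illustrative values, not estimated from data.
    """
    name_sim = name_similarity(rec_a["name"], rec_b["name"])
    dob_agrees = 1.0 if rec_a["birth_date"] == rec_b["birth_date"] else 0.0
    return name_weight * name_sim + dob_weight * dob_agrees


def link_records(file_a: list, file_b: list, threshold: float = 0.85) -> list:
    """Return (index_a, index_b, score) for every pair scoring at or above threshold."""
    links = []
    for i, rec_a in enumerate(file_a):
        for j, rec_b in enumerate(file_b):
            score = match_score(rec_a, rec_b)
            if score >= threshold:
                links.append((i, j, round(score, 3)))
    return links


# Hypothetical records: the same person with a misspelled surname in the second file.
file_a = [{"name": "Maria Lopez", "birth_date": "1980-03-14"}]
file_b = [
    {"name": "Maria Lopes", "birth_date": "1980-03-14"},
    {"name": "John Smith", "birth_date": "1975-01-01"},
]

links = link_records(file_a, file_b)
```

The misspelled record is linked only because the score clears an arbitrary cutoff. A different threshold, or a typo in the birth date, would change the outcome, and that residual uncertainty is the kind the funded research seeks to reduce and to carry into later analyses.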

The funding will also allow Sadinle to share his expertise with the UW School of Public Health and across the broader research community. He plans to publish the results of his study in the form of manuscripts, software and tutorials on methods of combining data and analyzing combined data.