UW biostatisticians teach scientists how to mine massive genetic data for Precision Medicine
WRITTEN BY ASHLIE CHANDLER
ILLUSTRATIONS BY GABRIEL LOPEZ
Hundreds of geneticists and statisticians converge at the University of Washington every summer. Some travel from as far away as Japan and New Zealand, while others are based in one of Seattle’s world-renowned research institutes. All are on the hunt for the hottest data science tools that will open new frontiers in biological research.
Preeti Lakshman Kumar, a bioinformatician from the University of Alabama at Birmingham, came to the UW seeking solutions to challenges in her work to investigate the underlying genetic risk for a group of deadly lung diseases. She needed a more effective way to analyze her large data set to better visualize the true genetic patterns.
Lakshman Kumar found such an approach – and colleagues to network with – at the Summer Institute in Statistical Genetics, where UW School of Public Health instructors teach scientists how to analyze colossal and complex genomic data to unearth the roots of disease and pave the way for precision medicine.
She was among 67 researchers at a July workshop who dug into whole-genome sequencing data, which contain the entire genomes of study participants. The group tested analytical techniques and got their hands on one-of-a-kind computer programs that can be adapted to their specific research needs.
“The workshop will enhance my ability to analyze and visualize data in a more significant way,” Lakshman Kumar said. “I got practical experience using the UW’s package of computer programs. These are tools that work easily and efficiently with our data, showing us the outcomes we are looking for.”
The short course is part of a suite of resources provided by the School’s Department of Biostatistics, which hosts the national Data Coordinating Center for the Trans-Omics for Precision Medicine (TOPMed) program, the largest whole-genome sequencing project in the world. The center supports more than 1,000 investigators tapping into TOPMed’s treasure trove of rich and diverse data. Lakshman Kumar is among these investigators, working on the COPDGene Study.
Over five years, the TOPMed project has grown to include the genomic data of 150,000 participants from more than 80 studies, including the Framingham Heart Study and the Women’s Health Initiative.
TOPMed is run by the National Heart, Lung and Blood Institute and is part of the Precision Medicine Initiative, which aims to provide disease treatments – and ultimately cures – tailored to an individual’s unique genes and environment. TOPMed contributes to the initiative by generating scientific resources that improve the understanding of genetic risk for heart, lung, blood and sleep disorders.
“We’re at the heart of this huge project,” said Professor Bruce Weir, who co-leads the TOPMed Data Coordinating Center with Professors Kenneth Rice and Bruce Psaty and Associate Professor Timothy Thornton. Cathy Laurie is the center’s director of operations.
“I remember the days when I was excited to find four traits associated with disease,” Weir said. “TOPMed has billions of bits of information so far. It’s mindboggling.”
The UW center works closely with TOPMed’s Informatics Research Center, based at the University of Michigan School of Public Health, to match genomic data with other biological and clinical data – allowing investigators to gain insights into key population health issues. Scientists at the Broad Institute of MIT and Harvard, for instance, have found a genetic mutation that may have a strong impact on heart health. Researchers at the Fred Hutchinson Cancer Research Center in Seattle have linked genetic variants in African Americans to lower levels of vitamin B12.
“These studies inform us about new biology, bringing us closer to predicting an individual’s risk for diseases,” said Rice, the project’s principal investigator. “No one has done this before at this scale. To make these data work together and turn it into knowledge is incredibly challenging, but we’re uniquely equipped to do the job.”
One challenge is that every study collects data on phenotypes, or clinical traits, differently. UW biostatisticians developed a process to “harmonize” phenotypes, ensuring that all the data are in the same units so they can be analyzed together.
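To make the idea of harmonization concrete, here is a minimal sketch in Python of converting one trait, height, into a single unit before pooling records. It is only an illustration under stated assumptions: the study names, record layout and the `harmonize_height` function are hypothetical and are not taken from the TOPMed pipeline, which the article does not describe in detail.

```python
# Hypothetical example: each study reports height in its own unit.
# Conversion factors to a common unit (centimeters).
TO_CM = {"cm": 1.0, "m": 100.0, "in": 2.54}


def harmonize_height(records):
    """Return a new list of records with height expressed in centimeters."""
    harmonized = []
    for rec in records:
        unit = rec["unit"]
        if unit not in TO_CM:
            # Refuse to guess: an unknown unit must be resolved by a person.
            raise ValueError(f"unknown unit: {unit!r}")
        harmonized.append({
            "study": rec["study"],
            "participant": rec["participant"],
            "height_cm": round(rec["value"] * TO_CM[unit], 1),
        })
    return harmonized


# Made-up records from three imaginary studies (not real data).
records = [
    {"study": "study_a", "participant": "a1", "value": 170.0, "unit": "cm"},
    {"study": "study_b", "participant": "b1", "value": 1.65, "unit": "m"},
    {"study": "study_c", "participant": "c1", "value": 70.0, "unit": "in"},
]

pooled = harmonize_height(records)
for row in pooled:
    print(row["study"], row["participant"], row["height_cm"])
```

Once every record carries the same unit, the pooled list can be analyzed as a single data set. Real harmonization also has to reconcile differences in how and when traits were measured, which this sketch does not attempt.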
For researchers, working with data at the scale of TOPMed can be cumbersome, costly and time-consuming. Plans are underway to make TOPMed data more widely available and to support scientific research using the data set. A cloud-based ecosystem called DataSTAGE gives each user access to very specific human subject data they can use to do statistical analyses without the need for their own high-performance computers. DataSTAGE also supports the broad availability of other data – without revealing a person’s name or racial or ethnic origin – to train students and researchers.
The cloud environment not only democratizes the data, but also catalyzes meaningful research, according to Laurie, the operations director. “There’s a lot of useful information in there to mine, so the more people mining it, the more will come out of it in the end.”