Efficient Genome-wide Association in Biobanks Using Topic Modeling Identifies Multiple Novel Disease Loci

Thomas H McCoy, Jr, Victor M Castro, Leslie A Snapper, Kamber L Hart, and Roy H Perlis
Biobanks and national registries represent a powerful tool for genomic discovery, but rely on diagnostic codes that can be unreliable and fail to capture relationships between related diagnoses. We developed an efficient means of conducting genome-wide association studies using combinations of diagnostic codes from electronic health records for 10,845 participants in a biobanking program at two large academic medical centers. Specifically, we applied latent Dirichlet allocation to fit 50 disease topics based on diagnostic codes, then conducted a genome-wide common-variant association analysis for each topic. In sensitivity analysis, these results were contrasted with those obtained from traditional single-diagnosis phenome-wide association analysis, as well as those in which only a subset of diagnostic codes was included per topic. In meta-analysis across three biobank cohorts, we identified 23 disease-associated loci with p < 1e-15, including previously associated autoimmune disease loci. In all cases, observed significant associations were of greater magnitude than those for single phenome-wide diagnostic codes, and incorporation of less strongly loading diagnostic codes enhanced association. This strategy provides a more efficient means of identifying phenome-wide associations in biobanks with coded clinical data.
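The second step described above, testing each topic for association with a common variant, can be illustrated with a minimal sketch. It assumes the topic loadings have already been fitted upstream (for example with an off-the-shelf latent Dirichlet allocation implementation) and are available as one number per participant. The function name `linear_association`, the simulated dosages and loadings, and the choice of simple linear regression with a normal-approximation p-value are all assumptions made for illustration; the abstract does not specify the regression model or covariates used.

```python
import math
import random


def linear_association(dosages, loadings):
    """Regress a quantitative topic loading on allele dosage (0, 1 or 2).

    Hypothetical helper: returns (slope, z statistic, two-sided p-value),
    with the p-value taken from a normal approximation.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(loadings) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, loadings))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance gives the standard error of the slope.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(dosages, loadings))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return slope, z, p


# Simulated data only (an assumption, not the study's data).
random.seed(0)
n_participants = 2000
allele_freq = 0.3
dosages = [
    (random.random() < allele_freq) + (random.random() < allele_freq)
    for _ in range(n_participants)
]
# One stand-in topic loading that depends on dosage, and one that does not.
topic_assoc = [0.5 * d + random.gauss(0.0, 1.0) for d in dosages]
topic_null = [random.gauss(0.0, 1.0) for _ in dosages]

slope_assoc, z_assoc, p_assoc = linear_association(dosages, topic_assoc)
slope_null, z_null, p_null = linear_association(dosages, topic_null)
print(f"associated topic: slope={slope_assoc:.3f}, p={p_assoc:.3g}")
print(f"null topic:       slope={slope_null:.3f}, p={p_null:.3g}")
```

In a genome-wide setting the same test would be repeated for every variant and each of the 50 topics, so the significance threshold would need to account for both sources of multiple testing.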
Date Published
August 31, 2017
Article PDF
17_100_McCoy.pdf (1255 KB)
Supplemental Data
Files.zip (3354 KB)
McCoy, Castro, Snapper, Hart, Perlis, biobanks, genome-wide association, electronic health centers, genetics, hepatology, liver disease, informatics, functional analysis, diagnostic codes
Article Type
Research Article