Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2.

Title: Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2.
Publication Type: Journal Article
Year of Publication: 2020
Authors: Shen, F; Kidd, JM
Journal: Genes (Basel)
Date Published: 2020 01 29
Keywords: Algorithms; Computational Biology; DNA Copy Number Variations; Evolution, Molecular; Gene Duplication; Genome, Human; Humans; Sequence Analysis, DNA

Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two and 92 genes where the majority of samples have a copy number other than two, and we describe rare copy-number variation affecting multiple genes at the APOBEC3 locus.
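
The general idea of tabulating unique k-mers and turning their depth into a copy-number estimate can be illustrated with a minimal sketch. This is not the QuicK-mer2 implementation: the function names, the toy reads, and the median-based normalization against a control region are all assumptions made for illustration, and the sketch ignores reverse-complement strands and any depth correction a real pipeline would need.

```python
from statistics import median


def count_unique_kmers(reads, unique_kmers, k):
    """Tabulate how often each predefined unique k-mer occurs in the reads.

    `unique_kmers` is assumed to be a precomputed set of k-mers that each
    occur exactly once in the reference (hypothetical input).
    """
    counts = {kmer: 0 for kmer in unique_kmers}
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts


def estimate_copy_number(counts, control_kmers, region_kmers, ploidy=2):
    """Scale the median k-mer depth of a region by that of a control region.

    The control region is assumed to be present at `ploidy` copies.
    """
    control_depth = median(counts[kmer] for kmer in control_kmers)
    if control_depth == 0:
        raise ValueError("control k-mers have zero depth")
    region_depth = median(counts[kmer] for kmer in region_kmers)
    return ploidy * region_depth / control_depth


if __name__ == "__main__":
    # Toy data: the region k-mer is seen 1.5 times as often as the control.
    reads = ["ACGA"] * 4 + ["TTGA"] * 6
    counts = count_unique_kmers(reads, {"ACG", "TTG"}, k=3)
    print(estimate_copy_number(counts, ["ACG"], ["TTG"]))
```

Because each k-mer in the set is unique to one paralog, depth measured this way can be assigned to a specific copy without mapping reads, which is the property the abstract describes.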

Alternate Journal: Genes (Basel)
PubMed ID: 32013076
PubMed Central ID: PMC7073954
Grant List: R01 GM103961 / GM / NIGMS NIH HHS / United States
DP5 OD009154 / OD / NIH HHS / United States
UM1 HG008901 / HG / NHGRI NIH HHS / United States