Optimized sample selection for cost-efficient long-read population sequencing.

Ranallo-Benavidez, T.R.; Lemmon, Z.; Soyk, S.; Aganezov, S.; Salerno, W.J.; McCoy, R.C.; Lippman, Z.B.; Schatz, M.C.; Sedlazeck, F.J.

doi:10.1101/gr.264879.120

Details

Request a copy. Under indefinite embargo.
Restricted access: UNIL
Status: Public
Version: Final published version
License: CC BY-NC 4.0

Serval ID

serval:BIB_9450E1C359BB

Type

Article: article from a journal or magazine.

Collection

Publications

Institution

UNIL/CHUV

Title

Optimized sample selection for cost-efficient long-read population sequencing.

Journal

Genome Research

Authors

Ranallo-Benavidez T.R., Lemmon Z., Soyk S., Aganezov S., Salerno W.J., McCoy R.C., Lippman Z.B., Schatz M.C., Sedlazeck F.J.

ISSN

1549-5469 (Electronic)

ISSN-L

1088-9051

Editorial status

Published

Publication date

05/2021

Peer-reviewed

Yes

Volume

Issue

Pages

910-918

Language

English

Notes

Publication types: Journal Article ; Research Support, N.I.H., Extramural ; Research Support, U.S. Gov't, Non-P.H.S.
Publication Status: ppublish

Abstract

An increasingly important scenario in population genetics is when a large cohort has been genotyped using a low-resolution approach (e.g., microarrays, exome capture, short-read WGS), from which a few individuals are resequenced using a more comprehensive approach, especially long-read sequencing. The subset of individuals selected should ensure that the captured genetic diversity is fully representative and includes variants across all subpopulations. For example, human variation has historically focused on individuals with European ancestry, but this represents a small fraction of the overall diversity. Addressing this, SVCollector identifies the optimal subset of individuals for resequencing by analyzing population-level VCF files from low-resolution genotyping studies. It then computes a ranked list of samples that maximizes the total number of variants present within a subset of a given size. To solve this optimization problem, SVCollector implements a fast, greedy heuristic and an exact algorithm using integer linear programming. We apply SVCollector on simulated data, 2504 human genomes from the 1000 Genomes Project, and 3024 genomes from the 3000 Rice Genomes Project and show the rankings it computes are more representative than alternative naive strategies. When selecting an optimal subset of 100 samples in these cohorts, SVCollector identifies individuals from every subpopulation, whereas naive methods yield an unbalanced selection. Finally, we show the number of variants present in cohorts selected using this approach follows a power-law distribution that is naturally related to the population genetic concept of the allele frequency spectrum, allowing us to estimate the diversity present with increasing numbers of samples.
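
The greedy heuristic mentioned in the abstract can be illustrated with a small sketch. This is not SVCollector's code; it is a generic greedy maximum-coverage routine, written under the assumption that the genotype calls have already been reduced to a mapping from sample name to the set of variant IDs that sample carries. The function name `greedy_select`, the input layout, and the alphabetical tie-break are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(carriers: Dict[str, Set[str]], k: int) -> List[Tuple[str, int]]:
    """Rank up to k samples by how many not-yet-covered variants each adds.

    carriers: hypothetical input mapping sample name -> set of variant IDs
              carried by that sample (e.g. extracted from a population VCF).
    Returns a list of (sample, newly_covered_variant_count) in selection order.
    """
    # Work on copies so the caller's sets are left untouched.
    remaining = {sample: set(variants) for sample, variants in carriers.items()}
    ranking: List[Tuple[str, int]] = []

    for _ in range(min(k, len(remaining))):
        # Pick the sample with the most uncovered variants.
        # Iterating over sorted names makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s]))
        gained = remaining.pop(best)
        ranking.append((best, len(gained)))

        # Variants now covered no longer count toward any other sample.
        for sample in remaining:
            remaining[sample] -= gained

    return ranking


if __name__ == "__main__":
    toy = {
        "A": {"v1", "v2", "v3"},
        "B": {"v3", "v4"},
        "C": {"v1", "v2"},
        "D": {"v5"},
    }
    for sample, gained in greedy_select(toy, 3):
        print(sample, gained)
```

On the toy input, sample C is never chosen even though it carries two variants, because both are already covered once A is selected. That is the behavior the abstract contrasts with naive strategies: each pick is scored by what it adds to the subset, not by its own variant count.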

Keywords

Exome/genetics, Gene Frequency, Genetics, Population, Genome, Human, Humans, Polymorphism, Single Nucleotide, Sequence Analysis, DNA/methods

OAI-PMH

oai:serval.unil.ch:BIB_9450E1C359BB

DOI

10.1101/gr.264879.120

Pubmed

33811084

Web of science

000646804400001

Open Access

Yes