μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data.

Cozzi, D.; Rossi, M.; Rubinacci, S.; Gagie, T.; Köppl, D.; Boucher, C.; Bonizzoni, P.

doi:10.1093/bioinformatics/btad552

Details

Download: 37688560_BIB_A05A8B57A5AB.pdf (767.77 KB)
Status: Public
Version: Final published version
License: CC BY 4.0

Serval ID

serval:BIB_A05A8B57A5AB

Type

Article: article from a journal or magazine.

Collection

Publications

Institution

UNIL/CHUV

Title

μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data.

Journal

Bioinformatics

Authors

Cozzi D., Rossi M., Rubinacci S., Gagie T., Köppl D., Boucher C., Bonizzoni P.

ISSN

1367-4811 (Electronic)

ISSN-L

1367-4803

Editorial status

Published

Publication date

2 September 2023

Peer-reviewed

Yes

Volume

Issue

Language

English

Notes

Publication types: Journal Article ; Research Support, U.S. Gov't, Non-P.H.S. ; Research Support, Non-U.S. Gov't ; Research Support, N.I.H., Extramural
Publication Status: ppublish

Abstract

The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over biobank panels that consist of several millions of haplotypes when an index of the haplotypes must be kept entirely in memory.
In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and for computing set maximal match (SMEM) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets from the 1000 Genomes Project and the UK Biobank. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT adapts techniques developed for the run-length compressed BWT to the PBWT (RLPBWT), and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of SMEMs over the original panel.
Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.
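To illustrate why run-length encoding suits the PBWT, the following minimal Python sketch builds the PBWT columns of a small biallelic panel and run-length encodes them. It is not the μ-PBWT implementation: the function names, the list-of-lists panel layout, and the toy panel are assumptions made for this example only, and the r-index structures that support SMEM queries are not shown.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical layout: panel[i][k] is the allele (0 or 1) of haplotype i at site k.
Panel = List[List[int]]


def pbwt_columns(panel: Panel) -> Iterator[List[int]]:
    """Yield, for each site k, the alleles at k read in positional-prefix order."""
    h = len(panel)
    w = len(panel[0]) if h else 0
    order = list(range(h))  # ordering before the first site: input order
    for k in range(w):
        yield [panel[i][k] for i in order]
        # Stable partition by the allele at site k gives the ordering for k + 1.
        zeros = [i for i in order if panel[i][k] == 0]
        ones = [i for i in order if panel[i][k] != 0]
        order = zeros + ones


def run_length_encode(column: List[int]) -> Tuple[int, List[int]]:
    """Encode a biallelic column as (first allele, lengths of alternating runs)."""
    if not column:
        return 0, []
    lengths = [1]
    for prev, cur in zip(column, column[1:]):
        if cur == prev:
            lengths[-1] += 1
        else:
            lengths.append(1)
    return column[0], lengths


def run_length_decode(first: int, lengths: List[int]) -> List[int]:
    """Inverse of run_length_encode for 0/1 columns."""
    column: List[int] = []
    allele = first
    for n in lengths:
        column.extend([allele] * n)
        allele = 1 - allele  # runs alternate between 0 and 1
    return column


def total_runs(columns: Iterable[List[int]]) -> int:
    """Total number of runs over all columns (the quantity an r-index scales with)."""
    return sum(len(run_length_encode(col)[1]) for col in columns)


# Toy panel (made up for this example): 6 haplotypes, 4 sites.
DEMO_PANEL: Panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
]

if __name__ == "__main__":
    raw_columns = [list(col) for col in zip(*DEMO_PANEL)]
    print("runs in raw columns: ", total_runs(raw_columns))
    print("runs in PBWT columns:", total_runs(pbwt_columns(DEMO_PANEL)))
```

Because haplotypes that share a long match ending at a site are adjacent in the positional-prefix order, PBWT columns tend to contain far fewer runs than the raw columns on real panels; a toy panel this small shows only a modest difference.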

Keywords

Biological Specimen Banks, Haplotypes, Whole Genome Sequencing, United Kingdom

URN

urn:nbn:ch:serval-BIB_A05A8B57A5AB4

OAI-PMH

oai:serval.unil.ch:BIB_A05A8B57A5AB

DOI

10.1093/bioinformatics/btad552

Pubmed

37688560

Web of Science

001066398900002