OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

Details

Resource 1: Request a copy. Under indefinite embargo.
Access restricted to UNIL
Status: Public
Version: Author's version
Licence: CC BY 4.0
Serval ID
serval:BIB_E9D32F404F19
Type
Article: article from a journal or magazine.
Subtype
Review: analysis of a published work.
Collection
Publications
Institution
Title
OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.
Journal
Bioinformatics
Authors
Rossier V., Vesztrocy A.W., Robinson-Rechavi M., Dessimoz C.
ISSN
1367-4811 (Electronic)
ISSN-L
1367-4803
Publication state
Published
Publication date
31/03/2021
Peer-reviewed
Yes
Scientific editor
Birol Inanc
Volume
37
Pages
2866–2873
Language
English
Notes
Publication types: Journal Article
Publication Status: aheadofprint
Abstract
Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive.
Here, we first show that in multiple animal and plant datasets, 18 to 62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily-informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.
OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.
Supplementary data are available at Bioinformatics online.
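The idea of mapping k-mers to ancestral subfamilies, described in the abstract above, can be illustrated with a toy sketch. This is not OMAmer's algorithm or its API: the hierarchy `PARENT`, the reference sequences, the value of `K`, and the functions `build_index` and `assign` are all hypothetical and chosen only for illustration. Each k-mer is indexed under the most specific subfamily that contains all of its occurrences, so a query whose k-mers point to both hemoglobin alpha and beta is placed in their common ancestor rather than in either one.

```python
# Toy illustration only: NOT OMAmer's implementation. All names and data are invented.

# Hypothetical hierarchy: subfamily -> parent (None marks the family root).
PARENT = {
    "globin": None,
    "hemoglobin": "globin",
    "hb_alpha": "hemoglobin",
    "hb_beta": "hemoglobin",
}

# Invented reference sequences, one per leaf subfamily.
REFERENCES = [("hb_alpha", "MVLSPADKTNV"), ("hb_beta", "MVHLTPEEKTNV")]
K = 3


def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(a, b):
    """Lowest common ancestor of two subfamilies in PARENT."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("subfamilies do not share a root")


def kmers(seq, k):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the most specific subfamily covering all its occurrences."""
    index = {}
    for subfamily, seq in references:
        for kmer in kmers(seq, k):
            index[kmer] = lca(index[kmer], subfamily) if kmer in index else subfamily
    return index


def assign(query, index, k):
    """Assign a query without alignment; return None when no k-mer matches."""
    hits = {index[kmer] for kmer in kmers(query, k) if kmer in index}
    if not hits:
        return None  # no shared k-mer: reject the family-level assignment
    # Keep the most specific hits, i.e. those that are not an ancestor of another hit.
    tips = [h for h in hits if not any(h != o and h in ancestors(o) for o in hits)]
    result = tips[0]
    for tip in tips[1:]:
        result = lca(result, tip)  # conflicting evidence falls back to the ancestor
    return result


INDEX = build_index(REFERENCES, K)
print(assign("VLSPAD", INDEX, K))  # hb_alpha
print(assign("LSPEEK", INDEX, K))  # hemoglobin
print(assign("WWWW", INDEX, K))    # None
```

The second query shares k-mers with both reference sequences, so this sketch returns the ancestral `hemoglobin` subfamily instead of forcing a choice between alpha and beta. A closest-sequence rule would instead pick one of the two, which is the over-specific assignment the abstract warns about.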
Keywords
Statistics and Probability, Computational Theory and Mathematics, Biochemistry, Molecular Biology, Computational Mathematics, Computer Science Applications
Open Access
Yes
Record created
31/03/2021 20:18
Record last modified
08/03/2022 7:33