OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

Details

Resource 1: under indefinite embargo (copy available on request).
UNIL restricted access
State: Public
Version: author
License: CC BY 4.0
Serval ID
serval:BIB_E9D32F404F19
Type
Article: article from journal or magazine.
Publication sub-type
Minutes: analysis of a published work.
Collection
Publications
Institution
Title
OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.
Journal
Bioinformatics
Author(s)
Rossier V., Vesztrocy A.W., Robinson-Rechavi M., Dessimoz C.
ISSN
1367-4811 (Electronic)
ISSN-L
1367-4803
Publication state
Published
Issued date
31/03/2021
Peer-reviewed
Yes
Editor
Birol Inanc
Volume
37
Pages
2866–2873
Language
English
Notes
Publication types: Journal Article
Publication Status: aheadofprint
Abstract
Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive.
Here, we first show that in multiple animal and plant datasets, 18 to 62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily-informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.
OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.
Supplementary data are available at Bioinformatics online.
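The idea summarised in the abstract, mapping k-mers to ancestral subfamilies so that a query is not forced into an over-specific subfamily, can be illustrated with a small sketch. This is not OMAmer's implementation or API: the hierarchy, the sequences (made-up strings, not real proteins), the k-mer length, the `min_hits` threshold and all function names are hypothetical, chosen only to show why a query that mostly shares family-wide k-mers is better placed at the family level than in the subfamily of its closest sequence.

```python
from collections import defaultdict

K = 3  # toy k-mer length (assumption, for illustration only)

# Hypothetical hierarchy, child -> parent; "globin" is the root family.
PARENT = {"alpha": "globin", "beta": "globin", "globin": None}
ROOT = "globin"

# Made-up reference sequences labelled with their subfamily.
REFERENCES = [("AAACCCDDD", "alpha"), ("AAACCCEEE", "beta")]


def kmers(seq, k=K):
    """Return the set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ancestors(node):
    """Path from a node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(a, b):
    """Lowest common ancestor of two nodes in the toy hierarchy."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same family")


def build_index(references):
    """Map each k-mer to the deepest node covering every subfamily that contains it."""
    index = {}
    for seq, subfamily in references:
        for km in kmers(seq):
            index[km] = lca(index[km], subfamily) if km in index else subfamily
    return index


def assign(query, index, min_hits=2):
    """Place a query at family level, descending only with enough subfamily-specific k-mers."""
    counts = defaultdict(int)
    for km in kmers(query):
        if km in index:
            counts[index[km]] += 1
    if sum(counts.values()) < min_hits:
        return None  # too little shared signal: treat as non-homologous
    children = [c for c, p in PARENT.items() if p == ROOT]
    best = max(children, key=lambda c: counts[c])
    return best if counts[best] >= min_hits else ROOT


INDEX = build_index(REFERENCES)
```

With this toy index, `assign("AACCCDDD", INDEX)` descends to `"alpha"`, whereas `assign("AAACCCDFF", INDEX)` stays at `"globin"`: the query shares one k-mer with the alpha reference only, so a closest-sequence rule would call it alpha, but the subfamily-specific support is below the threshold. A query with no shared k-mers, such as `"WWWYYY"`, is rejected with `None`.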
Keywords
Statistics and Probability, Computational Theory and Mathematics, Biochemistry, Molecular Biology, Computational Mathematics, Computer Science Applications
PubMed
Web of Science
Open Access
Yes
Create date
31/03/2021 20:18
Last modification date
08/03/2022 7:33