OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

Rossier, V.; Warwick Vesztrocy, A.; Robinson-Rechavi, M.; Dessimoz, C.

doi:10.1093/bioinformatics/btab219

Details

Download: 33787851_BIB_E9D32F404F19.pdf (1015.97 KB)
State: Public
Version: Final published version
License: CC BY 4.0

Serval ID

serval:BIB_E9D32F404F19

Type

Article: article from journal or magazine.

Collection

Publications

Institution

UNIL/CHUV

Title

OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

Journal

Bioinformatics

Author(s)

Rossier V., Warwick Vesztrocy A., Robinson-Rechavi M., Dessimoz C.

ISSN

1367-4811 (Electronic)

ISSN-L

1367-4803

Publication state

Published

Issued date

29/09/2021

Peer-reviewed

Yes

Editor

Birol Inanc

Volume

Number

Pages

2866-2873

Language

English

Notes

Publication types: Journal Article ; Research Support, Non-U.S. Gov't
Publication Status: ppublish

Abstract

Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive.
Here, we first show that in multiple animal and plant datasets, 18-62% of assignments by closest sequence are incorrect, typically pointing to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer uses evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, whilst able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.
OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.
Supplementary data are available at Bioinformatics online.
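
Illustrative sketch (not part of the published record): the hemoglobin example in the abstract can be made concrete with a toy script. It is a minimal sketch under stated assumptions, not OMAmer's actual algorithm or API. The sequences, the two-level hierarchy, the k-mer length, the min_specific threshold and every function name below are hypothetical. Each k-mer is attached to the most specific subfamily that contains all of its occurrences, and a query descends into a subfamily only when enough subfamily-specific k-mers support it. Closest-sequence search is approximated here by shared k-mer counts rather than by an alignment.

```python
from collections import Counter

K = 3  # toy k-mer length (assumption, chosen for short example strings)

# Hypothetical hierarchy: child -> parent (None marks the family root).
PARENT = {"Hb": None, "Hb-alpha": "Hb", "Hb-beta": "Hb"}

# Invented toy sequences, not real hemoglobins: name -> (subfamily, sequence).
REFERENCES = {
    "alpha_1": ("Hb-alpha", "MVLSPADKTNGKKVAAW"),
    "alpha_2": ("Hb-alpha", "MVLSGADKTNGKKVAAW"),
    "beta_1": ("Hb-beta", "MVHLTPEEKSGKKVTALW"),
    "beta_2": ("Hb-beta", "MVHLTAEEKSGKKVTALW"),
}


def kmers(seq, k=K):
    """Return the set of distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ancestors(node):
    """Return the node followed by its ancestors, ending at the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lca(a, b):
    """Most specific node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes belong to different families")


def build_index(references):
    """Map each k-mer to the most specific subfamily covering all its occurrences."""
    index = {}
    for label, seq in references.values():
        for km in kmers(seq):
            index[km] = lca(index[km], label) if km in index else label
    return index


def assign(query, index, min_specific=2):
    """Descend from the family root while a child has enough specific k-mers.

    Only k-mers mapped exactly to a child are counted, which is sufficient
    for this two-level toy hierarchy.
    """
    hits = Counter(index[km] for km in kmers(query) if km in index)
    if not hits:
        return None  # no shared k-mer: reject as non-homologous
    node = ancestors(max(hits, key=hits.get))[-1]  # root of the family
    while True:
        children = [c for c, p in PARENT.items() if p == node]
        best = max(children, key=lambda c: hits[c], default=None)
        if best is None or hits[best] < min_specific:
            return node
        node = best


def closest_label(query, references):
    """Naive baseline: subfamily of the reference sharing the most k-mers."""
    q = kmers(query)
    name = max(references, key=lambda n: len(q & kmers(references[n][1])))
    return references[name][0]


INDEX = build_index(REFERENCES)
early_query = "MAQLSPEDRGKKVLLW"  # toy stand-in for a pre-duplication hemoglobin

print(closest_label(early_query, REFERENCES))  # Hb-alpha (over-specific)
print(assign(early_query, INDEX))              # Hb
print(assign("MVLSAADKTNGKKVAAW", INDEX))      # Hb-alpha
print(assign("WWWWWWWW", INDEX))               # None
```

In this toy setting the baseline places the early-branching query in Hb-alpha because of a single shared k-mer, whereas the hierarchy-aware assignment stops at the ancestral Hb level. The published tool itself is installed from the Python Package Index under the name omamer, as stated in the abstract.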

Keywords

Animals, Algorithms, Sequence Alignment, Software, Proteins/genetics, Biological Evolution, Phylogeny

URN

urn:nbn:ch:serval-BIB_E9D32F404F199

OAI-PMH

oai:serval.unil.ch:BIB_E9D32F404F19

DOI

10.1093/bioinformatics/btab219

Pubmed

33787851

Web of science

000732709000009