Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.

Details

Resource 1 Download: Genome Res.-2019-Mudge-gr.246462.118.pdf (1464.71 KB)
Status: Public
Version: Author's accepted manuscript
Licence: CC BY 4.0
Serval ID
serval:BIB_24414116FADD
Type
Article: article in a journal or magazine.
Collection
Publications
Title
Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.
Journal
Genome Research
Author(s)
Mudge J.M., Jungreis I., Hunt T., Gonzalez J.M., Wright J.C., Kay M., Davidson C., Fitzgerald S., Seal R., Tweedie S., He L., Waterhouse R.M., Li Y., Bruford E., Choudhary J.S., Frankish A., Kellis M.
ISSN
1549-5469 (Electronic)
ISSN-L
1088-9051
Editorial status
In Press
Peer-reviewed
Yes
Language
English
Notes
Publication types: Journal Article
Publication Status: aheadofprint
Abstract
The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely-used tool to identify evolutionary signatures of protein-coding regions using multi-species genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine-learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyse over 1000 high-scoring human PhyloCSF regions, and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously-annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously-undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic datasets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions, by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein-altering. Altogether, our PhyloCSF datasets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterisation.
Open Access
Yes
Record created
21/09/2019 17:31
Record last modified
24/09/2019 5:11