What is hidden in the darkness? Deep-learning assisted large-scale protein family curation uncovers novel protein families and folds

Durairaj, Janani and Waterhouse, Andrew M. and Mets, Toomas and Brodiazhenko, Tetiana and Abdullah, Minhal and Studer, Gabriel and Akdel, Mehmet and Andreeva, Antonina and Bateman, Alex and Tenson, Tanel and Hauryliuk, Vasili and Schwede, Torsten and Pereira, Joana. (2023) What is hidden in the darkness? Deep-learning assisted large-scale protein family curation uncovers novel protein families and folds.

PDF - Submitted Version
Available under License CC BY (Attribution).


Official URL: https://edoc.unibas.ch/94398/



Driven by the development and upscaling of fast genome sequencing and assembly pipelines, the number of protein-coding sequences deposited in public protein sequence databases is increasing exponentially. Recently, the dramatic success of deep learning-based approaches to protein structure prediction has driven a similar increase in the number of available protein structures. We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most of the catalogued natural proteins, including those that are difficult to annotate for function or putative biological role using standard, homology-based approaches. In this work, we quantified how much of this "dark matter" of the natural protein universe was structurally illuminated by AlphaFold2 and modelled this diversity as an interactive sequence similarity network that can be navigated at https://uniprot3d.org/atlas/AFDB90v4 . In the process, we discovered multiple novel protein families by searching for novelties from sequence, structure, and semantic perspectives. We added a number of them to Pfam, and experimentally demonstrated that one of these belongs to a novel superfamily of toxin-antitoxin systems, TumE-TumA. This work highlights the role of large-scale, evolution-driven protein comparison efforts, in combination with structural similarities, genomic context conservation, and deep learning-based function prediction tools, in the identification of novel protein families, aiding not only annotation and classification efforts but also the curation and prioritisation of target proteins for experimental characterisation.
Faculties and Departments: 05 Faculty of Science > Departement Biozentrum > Computational & Systems Biology > Bioinformatics (Schwede)
UniBasel Contributors: Schwede, Torsten and Durairaj, Janani and Waterhouse, Andrew and Studer, Gabriel
Item Type: Preprint
Publisher: Cold Spring Harbor Laboratory
Number of Pages: 23
Note: Publication type according to Uni Basel Research Database: Discussion paper / Internet publication
Identification Number:
edoc DOI:
Last Modified: 14 Jun 2023 12:44
Deposited On: 26 Apr 2023 07:27
