
Large-Scale Language Modeling for Life Science

Last modified Feb 8, 2022

 

Project Summary

Humans are very good at transferring knowledge from one task to a different, yet related task. This generalization of knowledge and skills allows humans to perform well even in situations they have never encountered before. This project relies on the concept of transfer learning to train deep learning models that can later be applied to data-scarce problems. These models extract information from large unlabelled data sets, e.g. Wikipedia in the field of Natural Language Processing (NLP), and compress it by adjusting the weights of a neural-network-based model. Once the model has completed training on unlabelled data, the compressed knowledge can be transferred to any supervised classification task. By sharing the pre-trained model, other researchers and start-ups can benefit from the computational resources spent to train it, allowing them to accelerate their own machine learning training.

Transfer learning has brought large improvements in both performance and speed to computer vision and NLP. This project aims at porting these benefits to the life sciences and software development. The most demanding aspect of transfer learning is the initial training of the models, which requires huge amounts of GPU computing power. However, once this computationally demanding pre-training is done, the trained models are easily deployed and operated by anyone, even on dated hardware. Bioinformatics provides a perfect use-case for learning a compressed representation of data, since the amount of information (protein sequences in databases) grows exponentially every year.

To this end, we have already successfully applied the idea of data compression via transfer learning in a small proof-of-concept study using a highly non-redundant dataset of protein sequences. This resulted in a model that generalized information from about 35 million protein sequences. Simple machine learning models trained on the representations extracted by the unsupervised generalization model can distinguish important aspects of protein function and structure, approaching the performance of state-of-the-art predictors. However, state-of-the-art methods in this field are much less applicable to large-scale use-cases, and even simple predictions may require unfeasible amounts of time and computing resources. This is, for instance, the case for the systematic study of protein mutation effects, i.e. whether a certain mutation in a protein will cause it to lose its function, potentially leading to disease. A speed-up in methods that assess mutation effects would benefit not only researchers, but potentially physicians in clinical settings.
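As a hedged illustration of how such representations feed into downstream prediction (the file names, dimensions and choice of classifier are hypothetical placeholders, not the actual pilot-study setup), a simple supervised model can be trained on per-protein embeddings with standard tooling:

```python
# Minimal sketch: train a simple classifier on fixed-size protein embeddings
# produced by a pre-trained language model. File names and label scheme are
# hypothetical placeholders, not the actual data used in the pilot study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# embeddings.npy: (n_proteins, embedding_dim), labels.npy: (n_proteins,)
X = np.load("embeddings.npy")          # e.g. one 1024-d vector per protein
y = np.load("labels.npy")              # e.g. subcellular localization classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```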

After the initial success of applying natural language models to protein sequences, the next phase of this project is to scale the model up to datasets that better represent the variety of protein sequences present in life. At minimum, we will be able to:

  1. Compress the information from large protein sequence databases, i.e. MetaClust30 with 2.5B protein sequences (1.5TB), into a single deep learning model (1GB).

  2. Capture relevant biological information on protein sequences, ranging from single amino acids, through structural features and function, to whole organisms. The information is provided in the form of distributed vector representations, which allows easy incorporation into existing machine learning workflows.

  3. Capture information which is not directly accessible to current sequence alignment methods, e.g. structural similarity between proteins below a certain sequence similarity.

  4. Publish the trained models so that researchers and start-ups can make the best use of the pre-trained models.

After evaluating the results from training models on protein sequences via the DD grant, we plan to apply for the INCITE grant to expand our results to the other use-cases for which we are already building small proofs of concept, i.e. small molecules and software engineering. Furthermore, to cover these different use-cases we seek to build a strategic and long-term partnership with life science teams at ORNL, such as the one led by Prof. Tourassi.

Overview and Significance of Research:

Recent literature suggests that many diseases, ranging from obesity to cancer to neurodegenerative diseases, can be associated with the human gut microbiome. Understanding the mechanistic relationships in the complex landscape of bacterial genomes that make up this microbiome requires the analysis of a huge number of proteins involved in the communication and function of the bacteria and the host organism. Unfortunately, all current approaches that predict aspects of proteins rely on explicit searches for evolutionary information, which is very expensive and time-consuming to gather, sometimes unavailable, and becomes increasingly harder to obtain as additional protein sequences are added to protein databases every year.

Here, we propose a novel approach to harness evolutionary information and distill it into the form of a neural network. This is possible due to recent breakthroughs in the field of natural language processing (NLP). These novel approaches have the advantage of learning from large, unlabelled data sets. For example, they can be trained on all articles in Wikipedia by optimizing the underlying model to predict the next word in a sentence, given all previous words in the same sentence. This allows the model to learn the structure of the underlying language, i.e. its syntax and semantics. The demanding training needs to be done only once, in a centralized fashion. Afterwards, the pre-trained model can be distributed and used by researchers and industry on consumer hardware. Put simply, the pre-trained model acts as a feature extractor that creates machine-readable vector representations from single protein sequences. This allows the resource-intensive evolutionary information extraction phase of traditional protein bioinformatics pipelines to be skipped. The resulting representations can be used for all types of prediction or classification tasks involving proteins, e.g. predicting the effect of mutations or the interactions among proteins.
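To make the training objective concrete, the following minimal sketch shows next-token prediction over amino-acid sequences in PyTorch; the tiny LSTM, toy sequences and hyperparameters are illustrative assumptions that stand in for the much larger ELMo/Transformer-XL/XLNet models discussed below.

```python
# Minimal sketch of the self-supervised objective: predict the next amino acid
# given all previous ones. The tiny LSTM stands in for the much larger models
# named in the proposal; vocabulary and hyperparameters are illustrative only.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYXU"             # 20 standard + extras
stoi = {a: i for i, a in enumerate(AMINO_ACIDS)}

class NextTokenLM(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):                          # x: (batch, seq_len)
        h, _ = self.rnn(self.embed(x))
        return self.head(h)                        # logits per position

def encode(seq):
    return torch.tensor([stoi[a] for a in seq])

seqs = ["MKTAYIAKQR", "MGDVEKGKKIF"]               # toy protein sequences
model = NextTokenLM(len(AMINO_ACIDS))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for seq in seqs:
    x = encode(seq)[:-1].unsqueeze(0)              # inputs: all but last residue
    y = encode(seq)[1:].unsqueeze(0)               # targets: shifted by one
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```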

 

Research Objectives and Milestones:

The main goal of this project is to use transfer learning to harness the wealth of information in today’s biological databases and compress it into a single model. A preliminary study shows that such models are able to capture different aspects of proteins, e.g. amino acid properties, structural features, function and subcellular localization.

Milestones:

  1. Pre-process MetaClust30, currently the largest redundancy-reduced set of protein sequences, for the proposed model.

  2. Test the proposed deep learning model on Summit in a small-scale experiment to ensure correct functioning before scaling up.

  3. Adjust hyperparameters, e.g. batch size and learning rate, according to the results of these tests.

  4. Train the model on MetaClust30 at large scale, saving checkpoints at regular intervals to avoid loss of training progress.

  5. Evaluate the pre-trained model on a variety of supervised tasks, e.g. structure classification (SCOPe), function prediction (E.C.), secondary structure prediction, subcellular localization prediction and effect of mutation.

  6. Repeat steps 1-5 with the next model if node hours remain available.

All of these models will be trained using MPI/Open MPI or NCCL backends, which scale to large numbers of nodes.

 

Proposed Computational Approach:

The deep learning models we plan to train are ELMo, Transformer-XL and XLNet. These models are implemented on top of the PyTorch and TensorFlow machine learning libraries (Python). The PyTorch-based models use PyTorch’s distributed communication package, which supports three different backends: MPI, Gloo and NCCL. The TensorFlow-based models use Horovod, which relies on Open MPI for distributed training. Both PyTorch and TensorFlow use the CUDA library for GPU training.
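As a minimal sketch (assuming a standard per-GPU process launcher that sets RANK, WORLD_SIZE, MASTER_ADDR and LOCAL_RANK; the model here is a placeholder), the PyTorch side of this setup looks roughly as follows:

```python
# Minimal sketch of distributed data-parallel setup in PyTorch. Assumes the
# script is started with a standard launcher (one process per GPU, with
# RANK/WORLD_SIZE/MASTER_ADDR/LOCAL_RANK set); model and data are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(backend="nccl"):                 # "nccl", "gloo" or "mpi"
    dist.init_process_group(backend=backend)
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return local_rank

local_rank = setup()
model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in for the LM
model = DDP(model, device_ids=[local_rank])

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device=torch.device("cuda", local_rank))
loss = model(x).pow(2).mean()              # dummy loss; gradients are all-reduced
loss.backward()
opt.step()
```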

We will start training with the Transformer-XL model, then XLNet, then ELMo. In each training stage we will first test the training run on 2 nodes for a few minutes to confirm correct functionality, and then scale it up to the maximum possible number of nodes. During each training stage we will monitor the training in case the model diverges. Furthermore, after a configurable number of training steps, each model evaluates itself against the validation dataset and stores the best checkpoint, from which training can be resumed if it is interrupted.
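A minimal sketch of this monitoring loop is given below; train_step, evaluate and the checkpoint path are hypothetical placeholders rather than the actual training code.

```python
# Minimal sketch of the monitoring loop described above: periodically evaluate
# on a validation set, keep the best checkpoint, and abort on divergence.
# train_step/evaluate and the file path are hypothetical placeholders.
import math
import torch

def train(model, optimizer, train_step, evaluate, eval_every=1000,
          max_steps=100_000, ckpt_path="best_model.pt"):
    best_val = float("inf")
    for step in range(1, max_steps + 1):
        loss = train_step(model, optimizer)          # one optimization step
        if not math.isfinite(loss):                  # divergence check
            raise RuntimeError(f"training diverged at step {step}")
        if step % eval_every == 0:
            val_loss = evaluate(model)               # validation pass
            if val_loss < best_val:                  # keep only the best model
                best_val = val_loss
                torch.save({"step": step,
                            "model": model.state_dict(),
                            "optimizer": optimizer.state_dict()}, ckpt_path)
```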

The training data comprises around 10 TB of textual data. To accelerate training, the dataset will be divided into chunks at the start of each run and loaded separately into each node’s main memory; the GPUs then read the data from main memory. Since moving 10 TB of data to separate nodes takes time, high-speed storage disks are needed to reduce the loading time. Moreover, during training the gradient updates are shared across nodes, and in our experience the communication overhead is reduced considerably by a high-speed interconnect between nodes such as InfiniBand.
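The chunked loading strategy could look roughly like the following sketch; the chunk file layout and tokenized tensor format are assumptions, and the process group is assumed to have been initialized as in the earlier sketch.

```python
# Minimal sketch of the chunked loading strategy: each process reads only its
# own shard of the pre-split dataset into main memory, and the GPU consumes
# batches from that in-memory shard. Chunk naming and tensor format are
# illustrative assumptions; torch.distributed must already be initialized.
import glob
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset

rank, world_size = dist.get_rank(), dist.get_world_size()

# Assign every Nth pre-tokenized chunk file to this rank.
chunk_files = sorted(glob.glob("metaclust30_chunks/*.pt"))[rank::world_size]

# Load this rank's chunks into host memory once, at the start of training.
tensors = [torch.load(f) for f in chunk_files]        # each: (n_seqs, seq_len)
shard = TensorDataset(torch.cat(tensors, dim=0))

loader = DataLoader(shard, batch_size=32, shuffle=True, pin_memory=True)
for (batch,) in loader:
    batch = batch.cuda(non_blocking=True)             # GPU reads from host RAM
    # ... forward/backward pass here ...
    break
```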

At the end of the project, the total storage required will be around 10 TB for the dataset and around 50 GB for model checkpoints and logs.

 

Parallel Performance:

The Transformer-XL model was tested on 3 clusters:

  1. 34x Nvidia V100 GPUs using 17 nodes on the LRZ cluster.

  2. 128x Nvidia V100 GPUs using 16 nodes on Amazon Web Services.

  3. 240x Nvidia V100 GPUs using 30 nodes on an Nvidia internal cluster.

These tests showed that the model can scale to hundreds and even thousands of GPUs with low communication overhead relative to computation, provided there is a high-speed network connection among the nodes.

For example, on Amazon Web Services the following scale-ups were achieved using the NCCL backend:

Number of nodes    Number of GPUs    Scale-up
1                  8                 1x
2                  16                1.9x
4                  32                3.9x
8                  64                6.8x
16                 128               13.2x

These Amazon machines were connected with a network throughput of 93 Gbps and achieved a maximum of 50 Gbps using all-reduce; furthermore, each individual network flow is limited to 10 Gbps.

The Nvidia internal cluster showed a much better scale-up because it had a faster network connection using InfiniBand, as shown in the following table:

Number of nodes    Number of GPUs    Global batch time (ms)
1                  1                 665
1                  8                 712
4                  32                740
30                 240               840

 

The current PyTorch code is optimized to run on hundreds to thousands of nodes using MPI, Gloo or NCCL, and can approach the theoretically achievable speed-up on any cluster system.
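For reference, the parallel efficiencies implied by the two tables above can be checked with a short calculation; interpreting the Nvidia batch times as weak scaling (constant work per GPU) is our assumption.

```python
# Back-of-the-envelope check of the scaling numbers above. The AWS efficiency
# uses the reported speed-ups directly; the Nvidia numbers are interpreted as
# weak scaling (constant work per GPU), which is an assumption on our part.

# AWS: parallel efficiency = measured scale-up / ideal scale-up (nodes)
aws = {1: 1.0, 2: 1.9, 4: 3.9, 8: 6.8, 16: 13.2}
for nodes, speedup in aws.items():
    print(f"{nodes:2d} nodes: {100 * speedup / nodes:.1f}% efficiency")

# Nvidia cluster: weak-scaling efficiency = t(1 node, 8 GPUs) / t(N nodes)
nvidia = {1: 712.0, 4: 740.0, 30: 840.0}             # nodes -> batch time (ms)
for nodes, t in nvidia.items():
    print(f"{nodes:2d} nodes: {100 * nvidia[1] / t:.1f}% weak-scaling efficiency")
```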

 
