Using Large Language Models (LLMs) to Unravel the Mysteries of Intrinsically Disordered Proteins (IDPs)

May 02, 2023
Intrinsically Disordered Proteins Large Language Models Molecular Dynamics Simulation

A gentle introduction to Intrinsically Disordered Proteins (IDPs)


Intrinsically Disordered Proteins (IDPs) are a fascinating class of proteins that lack a well-defined three-dimensional structure under physiological conditions. Unlike traditional proteins that fold into stable structures, IDPs exhibit high flexibility and dynamic properties, allowing them to perform various essential biological functions. These proteins have attracted significant attention in recent years due to their unique structural and functional characteristics.

Examples of Intrinsically Disordered Proteins:

1. p53: p53 is a well-known tumor suppressor protein involved in regulating cell cycle progression and preventing the formation of cancer. It contains a disordered region known as the transactivation domain (TAD) that interacts with other proteins and DNA to control gene expression.
2. α-Synuclein: α-Synuclein is a protein associated with Parkinson's disease. It exists in a disordered state and undergoes structural transitions to form aggregates, which are believed to play a role in the pathogenesis of the disease.
3. β-Amyloid: Amyloid-β is a peptide that accumulates in the brains of individuals with Alzheimer's disease. In its monomeric form, it is intrinsically disordered. However, it can undergo conformational changes and form aggregates, contributing to the development of amyloid plaques.
4. Prothymosin α: ProTα is essential for cell proliferation and survival. It is involved in chromatin remodeling and proapoptotic activity.

Compositions of IDPs


IDPs differ from traditional proteins in their amino acid composition. Their sequences are believed to be enriched in proline (P), glycine (G), and glutamine (Q), which are known as disorder-promoting residues. In addition, IDPs contain a high proportion of charged residues such as arginine (R), lysine (K), and glutamic acid (E). These residues are thought to play an important role in the formation of disordered regions.

What are the compositions of IDPs?

MobiDB is a database of intrinsically disordered proteins that contains information about their sequences. We have downloaded IDP sequences from the database with lengths ranging from 30 to 150 amino acids. Now, let's take a look at the amino acid compositions of IDPs.

Indeed, we find that the two most frequently occurring amino acids are arginine (R) and lysine (K), which are positively charged, followed by glycine (G) and serine (S).
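As a sketch of how such a composition profile can be computed (the helper and the toy sequences below are illustrative, not the actual MobiDB data):

```python
from collections import Counter

def aa_composition(sequences):
    """Return the relative frequency of each amino acid across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# Toy example; the real input would be the downloaded IDP sequences
comp = aa_composition(["KRGSKR", "GGKRSS"])
print(comp["K"])  # 3 of 12 residues are lysine -> 0.25
```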

Large Language Models


Large language models are sophisticated computational systems designed to understand and generate human-like text. They leverage vast amounts of data to learn patterns, relationships, and structures within natural language. These models have revolutionized various fields, including natural language processing, text generation, and even human-computer interaction.

One class of language models is known as N-gram models, which are widely used for language processing tasks. N-grams refer to contiguous sequences of N words or characters within a text. For example, in the sentence "The cat sat on the mat", the 2-grams (or bigrams) would be "The cat", "cat sat", "sat on", and so on. In the same way, the 3-grams (or trigrams) would be "The cat sat", "cat sat on", "sat on the", and so on.
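The sliding windows above take only a few lines of Python (nltk offers ready-made helpers for this; here is the same idea in plain Python as a self-contained sketch):

```python
def ngrams(tokens, n):
    """All contiguous n-token windows, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "The cat sat on the mat".split()
print(ngrams(words, 2)[:3])  # [('The', 'cat'), ('cat', 'sat'), ('sat', 'on')]
print(ngrams(words, 3)[0])   # ('The', 'cat', 'sat')
```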

N-gram models capture the statistical relationships between these N-grams in a given corpus of text. By analyzing the frequencies of N-grams and their co-occurrences, these models can make predictions about the likelihood of certain word sequences or generate new text based on learned patterns.
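Concretely, the co-occurrence counts can be normalized into conditional probabilities. A minimal sketch on a toy corpus (the corpus and context here are illustrative):

```python
from collections import defaultdict

# Count trigram continuations: how often does w3 follow the pair (w1, w2)?
counts = defaultdict(lambda: defaultdict(int))
corpus = "the cat sat on the mat the cat ran".split()
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1

# Normalize the counts for one context into P(w3 | w1, w2)
context = ("the", "cat")
total = sum(counts[context].values())
probs = {w: c / total for w, c in counts[context].items()}
print(probs)  # {'sat': 0.5, 'ran': 0.5}
```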

One of the key advantages of N-gram models is their simplicity and efficiency. They can be easily implemented and scaled to handle large datasets. Additionally, N-gram models have been extensively used in applications such as machine translation, spell checking, speech recognition, and information retrieval.

However, N-gram models have some limitations. They rely solely on local context and do not capture long-range dependencies or global semantics. For instance, a trigram model might predict the word "eaten" after seeing the sequence "I have," but it may not consider that "eaten" is unlikely in the context of a medical document.

To address these limitations, more advanced language models, such as recurrent neural networks (RNNs) and transformer models, have been developed. These models can capture complex patterns and semantic relationships by incorporating contextual information from a wider context window.

Despite the rise of more sophisticated models, N-gram models still find utility in certain scenarios, especially when dealing with resource-constrained environments or for specific language modeling tasks where local context is sufficient.

Can we apply LLM to generate synthetic IDPs?


Predicting IDP sequences can contribute to various fields, including drug discovery, protein engineering, and the understanding of disease mechanisms, while also uncovering new therapeutic targets and helping design molecules that modulate IDP functions. Let's now dive into the fascinating process of generating IDP sequences using a language model. For this demonstration, we will use a Python code snippet that builds a trigram model from a dataset of known IDP sequences.

Import the required libraries:


import random
import pandas as pd
from nltk import trigrams
from collections import defaultdict

Create a placeholder for the model:

 
model = defaultdict(lambda: defaultdict(int))

Read the downloaded sequence data from MobiDB:


df = pd.read_csv('idp_mobidB.csv')
df.columns = ['ID', 'Origin', 'Sequence']

Build the trigram model by counting how often each amino acid follows each pair of preceding residues:


for idp in df['Sequence']:
    for w1, w2, w3 in trigrams(idp, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

Select two random amino acids as a seed sequence:


AA_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
            'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

# Select two random amino acids
a1 = random.choice(AA_list)
a2 = random.choice(AA_list)

# Seed sequence
seq = [a1, a2]

Set the maximum number of amino acids needed in the generated IDP sequence:


length_input = int(input('Enter the length of the IDP to be generated: '))
max_length = length_input - 2  # the seed already contributes two residues


Generate the sequence based on the trigram model:


for _ in range(max_length):
    a1, a2 = seq[-2], seq[-1]

    # Stop if the model has never seen this residue pair
    if not model[(a1, a2)]:
        break

    # Sort the possible continuations by descending frequency
    sorted_sequences = sorted(model[(a1, a2)].items(), key=lambda x: x[1], reverse=True)

    # Select a random index among the (up to) four most frequent amino acids
    r = random.randint(0, min(len(sorted_sequences), 4) - 1)

    if sorted_sequences[r][0] is None:
        # None is the padding token marking a sequence end; substitute a random amino acid
        seq.append(random.choice(AA_list))
    else:
        # Append the selected amino acid
        seq.append(sorted_sequences[r][0])

Print the generated IDP sequence:


predicted_idp = ''.join(seq)
print(len(predicted_idp), predicted_idp)

By running the code, one can specify the desired length of the IDP sequence to be generated. The trigram model, trained on the dataset of known IDP sequences, then generates a novel sequence that exhibits characteristics similar to real IDPs. The resulting sequence can be further analyzed and explored in the context of IDP research.
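The loop above samples uniformly from the top four continuations. An alternative worth noting (a sketch, not part of the original code; the helper name and toy model are hypothetical) is to sample in proportion to the trigram counts via random.choices:

```python
import random

def sample_next(model, a1, a2, aa_list):
    """Sample the next residue in proportion to trigram counts,
    falling back to a uniform choice when the context is unseen."""
    # Drop the None padding token before sampling
    options = {aa: c for aa, c in model.get((a1, a2), {}).items() if aa is not None}
    if not options:
        return random.choice(aa_list)
    residues, weights = zip(*options.items())
    return random.choices(residues, weights=weights, k=1)[0]

# Toy model: after the pair ('A', 'G'), 'K' was seen 3 times
toy_model = {('A', 'G'): {'K': 3, None: 1}}
print(sample_next(toy_model, 'A', 'G', ['A', 'K']))  # 'K'
```

Weighted sampling preserves the full learned distribution instead of flattening the top four continuations to equal probability.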


A few generated IDP sequences from the trigram model:


NDEPKKKAEEGDKAAKGKSKLVAPRRKGRPSPGGEGESKLQAGRKGKSSEEG
EEAAEEDDVGPGPGGSSSQQPAPPKPEEEAEEGEGASKLEAGGGRPPKG
LGTARRKGRPKRTSSGGRKRGGRGAGGRRRKVVPKRPRPRGGGRRGAGDEDKP
WMASASGGGSKETQRKGKAKTKRPPSASGESISQQQVVQPLASKNGDGGSGEKQPPPPPPPSPGGSSGGSLEEADAAEEGARKGGRRKVKDESASSQQVRRKRKRKGGESKEEGDAPPPPPP


How realistic are the trigram synthetic IDPs?


The immediate question that arises is how to evaluate the generated sequences and whether they are similar to real IDPs. One can use the previous figure (the distributions of amino acids in real IDPs) as a reference and compare the generated sequences against it. We generated 1000 sequences with lengths between 30 and 150. Let's see how the generated sequences compare with the real IDPs.
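One way to quantify this comparison (a sketch; the frequency values below are illustrative placeholders, not the measured compositions) is an L1 distance between the two composition distributions:

```python
def l1_distance(p, q):
    """Total absolute difference between two frequency distributions.
    0 means identical; 2 is the maximum for fully disjoint distributions."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Illustrative (not measured) compositions for a few residues
real_comp = {"R": 0.12, "K": 0.11, "G": 0.10, "S": 0.09}
gen_comp  = {"R": 0.09, "K": 0.09, "G": 0.14, "S": 0.09}
print(l1_distance(real_comp, gen_comp))  # ~0.09
```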

We can see that glycine (G) is overrepresented in the generated sequences relative to arginine (R) and lysine (K), which are the most prevalent residues in real IDPs. Still, the distribution weights are not far off from the real IDPs. This gives us confidence that the trigram model is a step in the right direction for generating synthetic IDPs. However, the rank order of amino acid propensities needs to be accurate in order to generate more realistic synthetic IDPs. A few remedies that could improve the model are discussed below.

What to do next?


The trigram model is a simple model that can be used to generate synthetic IDPs. However, it is not perfect and can be improved. One can use the trigram model as a starting point and build more complex models that generate more realistic IDPs. One improvement is a higher-order n-gram model, such as a 4-gram or 5-gram model. Another avenue is a recurrent neural network model that can capture long-range effects. One can also use a richer dataset of IDPs, such as the IDP-IM dataset, which contains more than 100,000 IDPs.
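The trigram counting step generalizes directly to higher orders. A minimal sketch of an n-gram builder (the function name and toy sequence are illustrative), where the context grows from 2 residues to n-1:

```python
from collections import defaultdict

def build_ngram_model(sequences, n=4):
    """Count n-gram continuations: a context of n-1 residues -> next residue."""
    model = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            context, nxt = tuple(seq[i:i + n - 1]), seq[i + n - 1]
            model[context][nxt] += 1
    return model

# Toy sequence: the 3-residue context ('K', 'R', 'G') is followed by 'S' twice
model = build_ngram_model(["KRGSKRGS"], n=4)
print(dict(model[("K", "R", "G")]))  # {'S': 2}
```

Longer contexts sharpen predictions but make unseen contexts far more common, so a fallback (e.g. backing off to the trigram model) becomes more important.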

Disclaimer: The Python code provided in this article is for demonstration purposes only and should be used responsibly and within legal and ethical boundaries. The figures should not be copied or reproduced without the author's consent.