Can GPT be used for Genomics?
- Yasin Uzun, MSc, PhD
- May 2
- 2 min read
Updated: May 25
GPT took almost every industry by storm. Can it also transform genomics for drug discovery?

Large language models (LLMs) have found applications in many areas of industry, but their impact in the field of genomics has been mostly limited. The obvious limitation is that the input to LLMs consists of “words” or “tokens”. In genomics, however, the input is usually numerical, such as gene expression, activity, mutational, or epigenomic signatures.
LLMs do not perform at their best on this traditional type of learning. Hence, there are not many industrial applications of transformers or LLMs in genomics. One possible exception is using the DNA nucleotide sequence directly as the input to a transformer, which would have interesting implications for biomedicine.
LLMs split text into words and punctuation marks, which are called “tokens”. Nucleotides are like letters in an alphabet. The challenge is how to extract tokens from DNA/RNA, since the genomic sequence is continuous (without gaps).
But there are some alternatives. One possibility is to use exons as tokens, but in that case the resulting “text” may be too short. A more feasible option is to use codons, or the amino acids they encode, as tokens. This option provides sufficient depth and complexity as a language. In fact, AlphaFold, which was the basis for the 2024 Nobel Prize in Chemistry, is built on a transformer architecture just like LLMs and uses the amino acid sequence as input.
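As a minimal sketch of the codon idea, the snippet below splits a coding sequence into 3-letter codon tokens and maps them to amino acids. The sequence and the (deliberately partial) genetic-code table are illustrative toy data, not a real gene:

```python
# Toy excerpt of the standard genetic code, for illustration only.
CODON_TABLE = {
    "ATG": "M", "TTT": "F", "TTC": "F",
    "GGA": "G", "AAA": "K", "TGA": "*",  # '*' marks a stop codon
}

def codon_tokens(cds: str) -> list[str]:
    """Split a coding sequence into 3-mer codon tokens (reading frame 0)."""
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]

def translate(cds: str) -> str:
    """Map codon tokens to one-letter amino acids; stop at a stop codon."""
    residues = []
    for codon in codon_tokens(cds):
        aa = CODON_TABLE.get(codon, "X")  # 'X' = not in this toy table
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)

seq = "ATGTTTGGAAAATGA"
print(codon_tokens(seq))  # ['ATG', 'TTT', 'GGA', 'AAA', 'TGA']
print(translate(seq))     # MFGK
```

Note that this only works because a coding sequence has a known reading frame starting at the start codon; the resulting amino-acid tokens are exactly the kind of input AlphaFold consumes.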
There is one catch, though: about 98% of the genome is non-coding, continuous sequence without spacers, gaps, or delimiters, which makes it very hard to tokenize. Alternative approaches, such as using sequence patterns like DNA-binding motifs or k-mers, are also challenging, because there is no specific reference/starting point and these patterns often overlap. Hence, it remains highly challenging to employ tools like GPT on non-coding DNA sequences.
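To make the k-mer difficulty concrete, here is a small sketch (with a made-up sequence) of overlapping k-mer tokenization. Because there is no delimiter, a window can start at every position, so adjacent tokens share k-1 characters, and shifting the starting point by one produces a completely different non-overlapping tokenization:

```python
def kmer_tokens(seq: str, k: int = 4) -> list[str]:
    """Overlapping k-mers: slide a window of size k across seq, step 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def chunk_tokens(seq: str, k: int = 4, offset: int = 0) -> list[str]:
    """Non-overlapping k-mers starting at a chosen offset (step k)."""
    return [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]

seq = "ACGTACGT"
print(kmer_tokens(seq))            # ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT']
print(chunk_tokens(seq, offset=0))  # ['ACGT', 'ACGT']
print(chunk_tokens(seq, offset=1))  # ['CGTA']
```

The overlapping scheme inflates the token count and makes neighboring tokens highly redundant, while the non-overlapping scheme depends entirely on an arbitrary starting offset: both symptoms of the missing reference point described above.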


