OpenMed has developed a comprehensive AI pipeline for protein engineering that spans three stages: predicting protein structures, designing amino acid sequences, and optimizing codon usage for efficient expression in target organisms. The codon optimization stage received the most attention: multiple transformer models were trained on a dataset of 250,000 coding sequences from E. coli, later scaled to 381,000 sequences across 25 species. The standout model, CodonRoBERTa-large-v2, achieved a perplexity of 4.10 and a Spearman correlation of 0.40 with the Codon Adaptation Index (CAI), significantly outperforming alternatives such as ModernBERT.
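To make the modeling setup concrete: codon-level language modeling treats each non-overlapping 3-mer of a coding sequence as one token, giving a core vocabulary of 64 codons plus the usual special tokens. The sketch below illustrates this tokenization; the function name and vocabulary layout are illustrative assumptions, not OpenMed's actual code.

```python
# Illustrative sketch of codon-level tokenization for masked language
# modeling. Assumes non-overlapping 3-mer (codon) tokens; not OpenMed's code.

def tokenize_codons(cds: str) -> list[str]:
    """Split a coding sequence into codon tokens."""
    assert len(cds) % 3 == 0, "coding sequence length must be a multiple of 3"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

# The 64 possible codons form the core vocabulary, plus MLM special tokens.
VOCAB = ["<s>", "</s>", "<pad>", "<mask>"] + [
    a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"
]

print(tokenize_codons("ATGGCTAAA"))  # ['ATG', 'GCT', 'AAA']
```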
The architecture exploration revealed that pre-trained NLP weights do not transfer well to codon-level language modeling. OpenMed tested several architectures, ultimately favoring RoBERTa given its track record in protein sequence modeling. ModernBERT, despite its more advanced design, underperformed dramatically, reaching a perplexity of 26.24 against RoBERTa's 4.01. The gap was attributed to the inductive biases baked into weights pre-trained on natural language, which carry over poorly to biological sequence data.
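For context on what those perplexity numbers mean, MLM perplexity is simply the exponential of the mean cross-entropy over masked tokens. The helper below makes the relationship concrete; it is a minimal sketch, assuming per-token losses in nats as in typical Hugging Face setups.

```python
import math

def perplexity(masked_token_nlls: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood (in nats)
    over masked positions."""
    return math.exp(sum(masked_token_nlls) / len(masked_token_nlls))

# RoBERTa's reported perplexity of 4.01 corresponds to a mean cross-entropy
# of ln(4.01) ≈ 1.39 nats per masked codon, versus ln(26.24) ≈ 3.27 for
# ModernBERT, i.e. a far sharper predictive distribution over codons.
```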
Training ran on four A100 GPUs with a masked language modeling objective, and the same evaluation protocol was applied to every tested model. The results show a clear advantage for RoBERTa in modeling codon sequences, with measurable gains in synonymous-codon recovery and CAI correlation. Beyond informing OpenMed's future work, these experiments offer the broader protein-AI field a reminder that architecture selection and training strategies must be tailored to biological data.
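The two evaluation metrics can be stated compactly. Synonymous-codon recovery presumably measures how often a predicted codon encodes the same amino acid as the masked one, and CAI is the standard Sharp–Li geometric mean of relative codon adaptiveness. The sketch below illustrates both; the codon-table excerpt and the weight dictionary are placeholder assumptions for illustration.

```python
from math import prod

# Excerpt of the standard codon table; a real implementation covers all 61
# sense codons. Placeholder for illustration.
CODON_TO_AA = {"GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A"}

def synonymous_recovery(predicted: list[str], true: list[str]) -> float:
    """Fraction of masked positions where the predicted codon encodes the
    same amino acid as the true codon (assumed definition)."""
    hits = 0
    for p, t in zip(predicted, true):
        aa_p, aa_t = CODON_TO_AA.get(p), CODON_TO_AA.get(t)
        hits += aa_p is not None and aa_p == aa_t
    return hits / len(true)

def cai(codons: list[str], w: dict[str, float]) -> float:
    """Codon Adaptation Index: geometric mean of relative adaptiveness
    w[c] = freq(c) / max freq among its synonymous codons, with weights
    derived from a reference set of highly expressed genes."""
    ws = [w[c] for c in codons if c in w]
    return prod(ws) ** (1 / len(ws))
```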