Foundation Model

Abstract

The rapid advancement of pathogen genomics demands efficient and scalable computational tools to interpret complex genomic data. We introduce Pathogen Encoder Base, a transformer encoder foundation model designed for pathogen genomics. With 350 million parameters, it has a more compact footprint than comparable large-scale models while retaining robust performance. The model was pretrained on a subset of an open-source, multi-species genome dataset totaling 13 billion tokens. It was developed using Keras 3 with multi-backend support (PyTorch, JAX, TensorFlow), which broadens reproducibility and accessibility across varied research environments, and was trained on a single consumer-grade RTX 4090 GPU. Pretraining proceeded in two distinct stages: a masked language modeling objective, followed by a shuffled + random identification task to enhance genomic sequence understanding. Despite being trained on less than 5% of the tokens and having 14% of the parameters of current state-of-the-art models, Pathogen Encoder Base achieves approximately 94% of their performance on benchmark tasks. This efficiency, combined with its modest size and cost-effective training, makes it a formidable foundation model for pathogen genomics research. We present the methodology, training details, and comparative results, demonstrating Pathogen Encoder Base’s effectiveness in genomic analysis and its applicability to diverse downstream tasks.

Introduction

Pathogen genomics continues to generate massive datasets that require advanced analytical methods to unlock meaningful insights. Foundation models—large-scale neural networks trained on extensive corpora—have emerged as a robust strategy for meeting this need. Their ability to learn generalized representations from vast amounts of unlabeled data makes them particularly valuable for diverse genomic applications.

We present Pathogen Encoder Base, a transformer-based encoder tailored for the analysis of multi-species pathogen genomic sequences. With 350 million parameters, it provides strong performance in benchmark tasks while requiring significantly fewer computational resources than larger alternatives. By leveraging a curated subset of an open-source, multi-species genomic dataset, Pathogen Encoder Base is pre-trained to capture critical sequence patterns and features.

The training pipeline employs Keras 3 with multi-backend support, offering flexibility in reproducibility and deployment. Unlike current large-scale models, our solution uses fewer tokens and can be trained effectively on readily available, consumer-grade hardware, underscoring its cost-effectiveness. The model’s ability to produce meaningful learned representations at a reduced scale is facilitated by a two-stage pretraining approach combining masked language modeling with a shuffled and random identification task.

In this technical report, we outline the model’s underlying design, describe the training methodology, and present empirical results that underscore its utility. We demonstrate Pathogen Encoder Base’s efficiency, highlight its applicability to downstream pathogen genomics tasks, and discuss the broader implications for research settings that operate under limited resources.

Average Matthews Correlation Coefficient (MCC) Across 18 Genomic Benchmarks

Performance

Pathogen Encoder Base was evaluated against several open-source models, including the leading NT-Multispecies (2.5B). Despite having only 350 million parameters—about 14% of the size of NT-Multispecies (2.5B)—and being trained on a fraction of its token count, Pathogen Encoder Base consistently demonstrates competitive performance. On average, it achieves around 92% of the top model’s results while requiring significantly fewer computational resources.

Pathogen Encoder Base outperforms BPNet (original) and Enformer in 17 of 18 benchmarks (94.4%), HyenaDNA–1KB in 15 of 18 (83.3%), and HyenaDNA–32KB in 14 of 18 (77.8%). In comparison with InstaDeep’s Nucleotide Transformer series (all trained on 300B tokens), Pathogen Encoder Base surpasses NT–HumanRef (500M) in 5 of 18 tasks (27.8%), NT–1000G (500M) in 3 of 18 (16.7%), and matches or exceeds NT–1000G (2.5B) in 3 of 18 (16.7%). These results underscore that a carefully designed, compact foundation model can still deliver robust outcomes on demanding genomics tasks.

Pathogen AI Encoder Base Across Open Source Genomic Benchmarks

Architecture

The Pathogen Encoder Base model is built on an encoder-only transformer architecture. First, an embedding layer maps each input token to a dense embedding vector. To incorporate positional information, learnable positional encodings are added to these embeddings, accommodating sequence lengths of up to 350 tokens (2,100 bp). We use 6-mer tokens, so each token covers 6 bp, balancing overall sequence length against embedding dimensionality. This approach follows the configuration described by InstaDeep’s Nucleotide Transformer.
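As a minimal illustration of the 6-mer tokenization described above, the sketch below splits a DNA sequence into non-overlapping 6-mers and maps them to IDs. The vocabulary ordering and special tokens here are hypothetical; the actual tokenizer follows the Nucleotide Transformer convention.

```python
from itertools import product

# Hypothetical 6-mer vocabulary: 4^6 = 4096 k-mers plus a few special tokens.
# The real tokenizer (Nucleotide Transformer style) may assign IDs differently.
KMER = 6
VOCAB = ["<pad>", "<mask>", "<unk>"] + ["".join(p) for p in product("ACGT", repeat=KMER)]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(sequence: str, max_tokens: int = 350) -> list[int]:
    """Split a DNA sequence into non-overlapping 6-mers and map them to token IDs."""
    sequence = sequence.upper()
    ids = [
        TOKEN_TO_ID.get(sequence[i:i + KMER], TOKEN_TO_ID["<unk>"])
        for i in range(0, len(sequence) - KMER + 1, KMER)
    ]
    return ids[:max_tokens]  # up to 350 tokens, i.e. 2,100 bp

print(tokenize("ACGTACGTACGT"))  # -> two token IDs
```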

The resulting embeddings then pass through a series of 28 transformer blocks. Each block begins with layer normalization, followed by a multi-head self-attention mechanism. The self-attention output is added to the block’s input via a residual connection, then undergoes a second layer normalization. Subsequently, it is processed by a feed-forward layer with GELU activations.

Overall, the model contains 350 million parameters. Each transformer block has an embedding dimension of 1024, a hidden layer size of 4096, and 16 attention heads. For self-supervised pre-training, the final representation from the top layer is fed to a classification head, which outputs a probability distribution over the valid class space.
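A minimal Keras 3 sketch of this configuration is shown below, using the dimensions listed in the table that follows. Layer names, the vocabulary size, and the residual connection around the feed-forward sub-layer are our own assumptions rather than the model's actual implementation.

```python
import keras
from keras import layers, ops

EMBED_DIM, FF_DIM, NUM_HEADS, NUM_LAYERS, SEQ_LEN = 1024, 4096, 16, 28, 350
VOCAB_SIZE = 4104  # illustrative: 4096 possible 6-mers plus special tokens


class TokenAndPositionEmbedding(layers.Layer):
    """Token embeddings plus learnable positional encodings."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.tok = layers.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos = layers.Embedding(SEQ_LEN, EMBED_DIM)

    def call(self, x):
        positions = ops.arange(0, ops.shape(x)[-1], 1)
        return self.tok(x) + self.pos(positions)


def encoder_block(x):
    # Pre-norm self-attention with a residual connection.
    h = layers.LayerNormalization()(x)
    h = layers.MultiHeadAttention(NUM_HEADS, key_dim=EMBED_DIM // NUM_HEADS)(h, h)
    x = layers.Add()([x, h])
    # Second layer normalization, then the GELU feed-forward sub-layer.
    # (A residual around the feed-forward path is assumed here.)
    h = layers.LayerNormalization()(x)
    h = layers.Dense(FF_DIM, activation="gelu")(h)
    h = layers.Dense(EMBED_DIM)(h)
    return layers.Add()([x, h])


inputs = keras.Input(shape=(SEQ_LEN,), dtype="int32")
x = TokenAndPositionEmbedding()(inputs)
for _ in range(NUM_LAYERS):
    x = encoder_block(x)
# Per-token classification head over the valid class space.
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```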

Pathogen AI Encoder Base
Vocabulary: 6-mers
Number of Layers: 28
Embedding Dimension: 1024
Feed Forward Dimension: 4096
Number of Attention Heads: 16
Activation: GELU
Number of Parameters: 357,262,336
Tokens Trained On: 12.9 Billion
Dataset: Multispecies

Pre-Training

For Pathogen Encoder Base, we used a diverse multi-species genomic dataset from InstaDeep, streamed via Hugging Face. The pre-training process was divided into two phases, each focusing on a specialized self-supervised learning objective.

Phase 1: Masked Language Modeling (MLM)

Masked language modeling aims to predict tokens masked out of a sequence based on the surrounding context. We applied a 30% masking rate: each selected token was masked out and later predicted from the valid class space (the full tokenizer vocabulary). This phase used 7.6 billion tokens, allowing the model to learn fundamental genomic representations.
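The sketch below shows how 30% masking might be applied to a batch of token IDs. It uses NumPy only; the mask-token ID and the label convention (ignore index -100) are assumptions for illustration.

```python
import numpy as np

MASK_ID = 1          # assumed ID of the mask token
MASK_RATE = 0.30

def mask_tokens(token_ids: np.ndarray, rng: np.random.Generator):
    """Return (inputs, labels) for masked language modeling.

    30% of positions are replaced with the mask token; labels hold the
    original IDs at masked positions and -100 elsewhere (ignored by the loss).
    """
    mask = rng.random(token_ids.shape) < MASK_RATE
    inputs = np.where(mask, MASK_ID, token_ids)
    labels = np.where(mask, token_ids, -100)
    return inputs, labels

rng = np.random.default_rng(0)
ids = rng.integers(3, 4099, size=(2, 350))   # fake batch of 6-mer token IDs
x, y = mask_tokens(ids, rng)
```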

Phase 2: Shuffle + Random Objective

In the second phase, we introduced a three-class classification task. For each sequence, 15% of tokens were shuffled, another 15% were randomly replaced, and the rest remained unaltered. The model was then trained to identify whether each token was shuffled, randomly replaced, or original—over 5.3 billion tokens. This approach encouraged a deeper understanding of token identity and order.
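A rough sketch of the three-class corruption labeling is given below (0 = original, 1 = shuffled, 2 = random). The exact shuffling scheme is not detailed in this report, so permuting the selected positions among themselves is one plausible reading, not the confirmed procedure.

```python
import numpy as np

def corrupt_sequence(token_ids: np.ndarray, vocab_size: int, rng: np.random.Generator):
    """Return (corrupted_ids, labels) for the shuffle + random objective.

    15% of positions are shuffled among themselves, another 15% are replaced
    with random vocabulary IDs, and the rest remain unchanged.
    Labels: 0 = original, 1 = shuffled, 2 = random.
    """
    n = token_ids.shape[0]
    labels = np.zeros(n, dtype=np.int64)
    picks = rng.permutation(n)
    shuffled_idx = picks[: int(0.15 * n)]
    random_idx = picks[int(0.15 * n): int(0.30 * n)]

    corrupted = token_ids.copy()
    corrupted[shuffled_idx] = token_ids[rng.permutation(shuffled_idx)]  # reorder selected tokens
    corrupted[random_idx] = rng.integers(0, vocab_size, size=random_idx.shape)
    labels[shuffled_idx], labels[random_idx] = 1, 2
    return corrupted, labels

rng = np.random.default_rng(0)
ids = rng.integers(0, 4099, size=350)
x, y = corrupt_sequence(ids, 4099, rng)
```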

Optimization and Hyperparameters

We employed the Adam optimizer with a warmup–cosine decay learning rate schedule. During the first 16,000 steps, the learning rate increased linearly from 5×10⁻⁵ to 1×10⁻⁴, followed by a cosine decay for the remainder of training. The Adam parameters were β₁=0.9, β₂=0.999, and ϵ=1×10⁻⁸. Additionally, we used gradient accumulation of 100,000 tokens per step to achieve stable updates without requiring large-scale hardware resources.
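In Keras 3, this warmup-cosine schedule and optimizer can be expressed as sketched below. The total step count is illustrative, and the gradient-accumulation setting is an assumption about how the 100,000-tokens-per-step budget maps onto optimizer steps.

```python
import keras

TOTAL_STEPS = 200_000  # illustrative; set to the actual number of training steps

# Linear warmup from 5e-5 to 1e-4 over the first 16,000 steps, then cosine decay.
schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=5e-5,
    warmup_target=1e-4,
    warmup_steps=16_000,
    decay_steps=TOTAL_STEPS - 16_000,
)

optimizer = keras.optimizers.Adam(
    learning_rate=schedule,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
    # Accumulate gradients so that steps * tokens_per_batch ~= 100,000 tokens;
    # the value 8 is a placeholder that depends on the actual batch size.
    gradient_accumulation_steps=8,
)
```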

Fine-Tuning

During the fine-tuning stage, Pathogen Encoder Base is updated in its entirety, including all parameters, token embeddings, and positional encodings. Each benchmark task employs a tailored learning rate scheduler, selected to maximize performance based on task-specific constraints. Although these schedulers are tuned per benchmark, additional work—particularly for the warmup schedule and decay parameters—could lead to further gains.
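A hypothetical full fine-tuning setup is sketched below: all encoder weights stay trainable, a sequence-level head is attached, and a task-specific schedule is supplied. The pooling choice, head, and schedule values are ours, assuming the encoder is exposed without its pretraining head so its output is the top-layer representation.

```python
import keras
from keras import layers

def build_finetune_model(pretrained_encoder: keras.Model, num_classes: int) -> keras.Model:
    """Attach a classification head and fine-tune every parameter of the encoder."""
    pretrained_encoder.trainable = True        # embeddings, positions, and all blocks update
    inputs = pretrained_encoder.input
    hidden = pretrained_encoder.output         # assumed shape: (batch, seq_len, embed_dim)
    pooled = layers.GlobalAveragePooling1D()(hidden)
    outputs = layers.Dense(num_classes, activation="softmax")(pooled)
    model = keras.Model(inputs, outputs)

    # Task-specific schedule; these values are placeholders tuned per benchmark.
    schedule = keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=1e-6, warmup_target=2e-5, warmup_steps=500, decay_steps=10_000
    )
    model.compile(
        optimizer=keras.optimizers.Adam(schedule),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```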

Given practical constraints, including computation time and hardware availability, fine-tuning remains a balance between exploring optimal hyperparameters and completing training within limited resources. Consequently, the results presented here may understate the model’s full potential, and more extensive fine-tuning experiments could unlock further improvements in accuracy and efficiency.

Benchmark Comparison Between Pathogen AI Encoder Base and Nucleotide Transformers Multispecies (2.5B)

Applications

Embedding Model for Genomic Analysis

One major application of Pathogen Encoder Base is generating high-quality embeddings for genomic sequences. By extracting representations from the top layer, researchers can leverage these embeddings to perform similarity searches, clustering, or classification across diverse pathogen datasets. This approach simplifies downstream analyses by converting raw sequences into numerical representations that capture essential biological features.

t-SNE Plot of Splice Site Acceptors Training Set Embeddings
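One way to obtain such embeddings is sketched below: mean-pool the top-layer representations into a single vector per sequence and compare vectors with cosine similarity. The pooling strategy and the `encoder` handle are assumptions, not a documented API.

```python
import numpy as np
import keras

def embed_sequences(encoder: keras.Model, token_ids: np.ndarray) -> np.ndarray:
    """Mean-pool the encoder's top-layer outputs into one vector per sequence."""
    hidden = encoder.predict(token_ids, verbose=0)   # (batch, seq_len, embed_dim)
    return hidden.mean(axis=1)                       # (batch, embed_dim)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compare two sequence embeddings, e.g. for similarity search or clustering."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```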

Fine-Tuning for Enhanced Performance

Pre-trained foundation models, such as Pathogen Encoder Base, excel when adapted to specific downstream tasks through fine-tuning. These tasks can include pathogen classification, antimicrobial resistance prediction, or variant detection. Because the model’s underlying representations are already robust, fine-tuning typically requires fewer labeled examples, reduces overall training time, and achieves higher accuracy compared to training from scratch.

Sequence-to-Sequence Translation for Vaccine Development

Beyond encoder-only applications, Pathogen Encoder Base can serve as the encoder component in a sequence-to-sequence (Seq2Seq) model. In this setup, a decoder network attends to the encoder outputs to generate target sequences—particularly useful for analyzing virus-binding genes in vaccine research. By focusing on these specific genomic regions, the model aids in predicting how these viruses will evolve, ultimately accelerating the design of targeted immunization strategies.
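As a purely illustrative sketch (no such decoder ships with the model), one decoder block in this Seq2Seq setup could combine causal self-attention with cross-attention over the encoder outputs:

```python
import keras
from keras import layers

EMBED_DIM, NUM_HEADS, FF_DIM = 1024, 16, 4096

def decoder_block(decoder_seq, encoder_out):
    """One Transformer decoder block: causal self-attention, then cross-attention."""
    h = layers.MultiHeadAttention(NUM_HEADS, key_dim=EMBED_DIM // NUM_HEADS)(
        decoder_seq, decoder_seq, use_causal_mask=True)
    x = layers.LayerNormalization()(decoder_seq + h)
    # Cross-attention: queries from the decoder, keys/values from the encoder outputs.
    h = layers.MultiHeadAttention(NUM_HEADS, key_dim=EMBED_DIM // NUM_HEADS)(
        x, encoder_out)
    x = layers.LayerNormalization()(x + h)
    h = layers.Dense(FF_DIM, activation="gelu")(x)
    h = layers.Dense(EMBED_DIM)(h)
    return layers.LayerNormalization()(x + h)
```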

Conclusion

Pathogen Encoder Base is a robust model that provides strong performance at a fraction of the training cost. By leveraging a specialized encoder-only architecture and a two-phase pre-training strategy, the model effectively captures genomic patterns across diverse pathogen species. Its streamlined design, combined with efficient optimization techniques, enables robust downstream applications—from generating high-quality embeddings to fine-tuning for specific predictive tasks. Moreover, its flexibility as an encoder for sequence-to-sequence models opens new avenues for vaccine development and other translational research efforts. By balancing performance and resource efficiency, Pathogen Encoder Base offers a cost-effective, powerful solution that supports innovative research and practical advancements in pathogen genomics.

Results Per Benchmark

Normalized Results Per Benchmark
