arxiv:2507.13077

Continued domain-specific pre-training of protein language models for pMHC-I binding prediction

Published on Jul 16

Abstract

Continued pre-training of protein language models on HLA-associated peptides improves pMHC-I binding affinity prediction, especially for underrepresented alleles.

AI-generated summary

Predicting peptide–major histocompatibility complex I (pMHC-I) binding affinity remains challenging due to extreme allelic diversity (~30,000 HLA alleles), severe data scarcity for most alleles, and noisy experimental measurements. Current methods particularly struggle with underrepresented alleles and quantitative binding prediction. We test whether domain-specific continued pre-training of protein language models is beneficial for their application to pMHC-I binding affinity prediction. Starting from ESM Cambrian (300M parameters), we perform masked-language modeling (MLM)-based continued pre-training on HLA-associated peptides (epitopes), testing two input formats: epitope sequences alone versus epitopes concatenated with HLA heavy chain sequences. We then fine-tune for functional IC50 binding affinity prediction using only high-quality quantitative data, avoiding mass spectrometry biases that are inherited by existing methods.
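
The abstract describes a two-stage recipe: MLM-based continued pre-training of a protein language model on HLA-associated peptides (alone, or concatenated with the HLA heavy chain), followed by fine-tuning for quantitative IC50 binding prediction. The sketch below illustrates only the continued pre-training stage and rests on stated assumptions rather than the paper's actual setup: it substitutes the public ESM-2 150M checkpoint for ESM Cambrian (300M), uses placeholder peptide and HLA sequences, and reduces the epitope+HLA input format to plain string concatenation, since the paper's exact formatting is not given here.

```python
# Minimal sketch of MLM-based continued pre-training on HLA-associated peptides.
# Assumptions (not from the paper): the public ESM-2 150M checkpoint stands in
# for ESM Cambrian (300M), all sequences are illustrative placeholders, and the
# epitope+HLA concatenation format is simplified to plain string concatenation.
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "facebook/esm2_t30_150M_UR50D"  # stand-in for ESM Cambrian 300M
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Input format 1: epitope sequences alone.
epitopes = ["SIINFEKL", "GILGFVFTL"]  # placeholder peptides
# Input format 2: epitope concatenated with the HLA heavy chain sequence.
hla_heavy_chain = "GSHSMRYFFTSVSRPG"  # truncated placeholder HLA sequence
texts = [ep + hla_heavy_chain for ep in epitopes]

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=512)

dataset = (
    Dataset.from_dict({"sequence": texts})
    .map(tokenize, batched=True, remove_columns=["sequence"])
)

# Standard 15% token masking for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="esm_pmhc_continued_pretrain",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=1e-5,
    logging_steps=10,
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
).train()

# For the subsequent fine-tuning stage, the MLM head would be swapped for a
# single-output regression head trained on quantitative IC50 data, e.g. via
# AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1).
```
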
