AIRI (Russia) et al. | SEMA: Antigen B-cell conformational epitope prediction using deep transfer learning

2022-06-25 03:53:00 Zhiyuan community

【 Title 】SEMA: Antigen B-cell conformational epitope prediction using deep transfer learning

【 Authors 】Tatiana I. Shashkova, Dmitriy Umerenkov, Mikhail Salnikov, Pavel V. Strashnov, Alina V. Konstantinova, Ivan Lebed, Dmitrii N. Shcherbinin, Marina N. Asatryan, Olga L. Kardymon, Nikita V. Ivanisenko

【 Publication date 】2022/06/21

【 Institutions 】AIRI (Russia) and others

【 Paper link 】https://doi.org/10.1101/2022.06.20.496780

【 Code link 】https://github.com/AIRI-Institute/SEMAi

【 Web link 】http://sema.airi.net

A central task in vaccine design and immunotherapeutic drug development is predicting conformational B-cell epitopes, the primary antibody binding sites within the tertiary structure of an antigen. Many approaches to this problem have been proposed, but their accuracy is limited across a wide range of antigens. This paper applies transfer learning, building on pre-trained deep learning models to predict conformational B-cell epitopes from the primary antigen sequence and tertiary structure. The authors fine-tune the pre-trained protein language model ESM-1v and the inverse folding model ESM-IF1 to quantitatively predict antibody-antigen interaction features and to distinguish epitope from non-epitope residues. The resulting model, SEMA, achieves the best performance among peer-reviewed tools on an independent test set, with a ROC AUC of 0.76. The paper also shows that SEMA can quantitatively rank the immunodominant regions within the RBD domain of SARS-CoV-2.

The figure above shows how the epitope dataset was generated. The authors screened the PDB database to select epitope residues that interact with antibodies. For each antigen residue they computed a contact number feature: the number of contacts between the antigen residue and antibody residues within a distance radius R1. An antigen residue is considered part of an epitope if its distance to the interacting antibody is below the threshold R1. R1 was set to 4.5, 6.0, or 8.0 Å. The 4.5 Å cutoff reflects direct interaction with antibody residues, while the 6.0 Å and 8.0 Å radii also include residues involved in longer-range interactions.

Epitopes can be spatially distributed across the antigen structure, and in some cases this experimental information may be missing. To account for this, non-epitope residues were split by their distance R to the interacting antibody into "near" (R < R2) and "distant" (R > R2). R2 was set to 12.0, 14.0, or 16.0 Å to analyze how information about the epitope boundary region affects model accuracy.
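The labeling scheme described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `label_antigen_residues`, the input layout, and the choice to count antibody atoms (rather than antibody residues) within R1 are all assumptions made for the example.

```python
import math


def label_antigen_residues(antigen_atoms, antibody_atoms, r1=8.0, r2=16.0):
    """Label antigen residues by their distance to the antibody.

    antigen_atoms: dict mapping residue id -> list of (x, y, z) atom coordinates.
    antibody_atoms: list of (x, y, z) coordinates of all antibody atoms.
    Returns a dict mapping residue id -> (contact_number, label), where label is
    1 (epitope), 0 (near non-epitope) or -1 (distant, to be masked).
    """
    result = {}
    for res_id, atoms in antigen_atoms.items():
        contacts = 0
        min_dist = math.inf
        for ab in antibody_atoms:
            # Shortest distance from this antibody atom to any atom of the residue.
            d = min(math.dist(a, ab) for a in atoms)
            min_dist = min(min_dist, d)
            if d <= r1:
                # Assumption: contact number counts antibody atoms within R1.
                contacts += 1
        if min_dist <= r1:
            label = 1    # epitope residue
        elif min_dist <= r2:
            label = 0    # non-epitope residue near the epitope
        else:
            label = -1   # distant residue, ignored in metrics
        result[res_id] = (contacts, label)
    return result
```

With R1 = 8.0 Å and R2 = 16.0 Å, a residue 5 Å from the nearest antibody atom is labeled as an epitope, one about 11 Å away is a near non-epitope residue, and one 30 Å away is masked.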

The figure above shows the SEMA model. SEMA comprises a sequence-based model (SEMA-1D) and a structure-based model (SEMA-3D) that predict conformational B-cell epitopes and provide interpretable scores.

The figure above shows the immunodominant RBD epitopes predicted by SEMA.

The RBD domain of the SARS-CoV-2 S protein is one of the best structurally characterized antigens to date. The analysis was performed on the RBD domain rather than the full-length S protein to exclude possible effects of glycosylation, which SEMA does not currently take into account. To evaluate SEMA fairly, all sequences homologous to the S protein (sequence identity > 70%) were excluded from model training, in particular the S proteins of MERS and SARS-CoV. SEMA-3D was assessed on three tasks: (1) correctly assigning epitope and non-epitope residues; (2) correctly predicting the contact number feature; (3) predicting immunodominant epitope residues. Immunodominant RBD residues were identified from the fraction of RBD/antibody complexes in the PDB database in which the RBD residue is in direct contact with the antibody. The authors assume that this ratio estimates the immunogenicity of RBD residues, with high ratios corresponding to immunodominant residues.
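The immunogenicity ratio described above amounts to a per-residue frequency count over complexes. A minimal sketch, assuming each complex has already been reduced to the set of RBD residue ids in contact with its antibody; the function name and input format are hypothetical, and the contact sets in the usage note are toy data.

```python
def immunogenicity_scores(complex_contacts, residue_ids):
    """Fraction of complexes in which each residue contacts the antibody.

    complex_contacts: list of sets, one per RBD/antibody complex, each holding
        the ids of RBD residues in contact with the antibody in that complex.
    residue_ids: iterable of all RBD residue ids to score.
    Returns a dict mapping residue id -> ratio in [0, 1].
    """
    n_complexes = len(complex_contacts)
    if n_complexes == 0:
        raise ValueError("at least one complex is required")
    return {
        res_id: sum(res_id in contacts for contacts in complex_contacts) / n_complexes
        for res_id in residue_ids
    }
```

For example, with four toy complexes whose contact sets are `{417, 484}`, `{484, 501}`, `{484}` and `{501}`, residue 484 receives a ratio of 0.75 and would rank as the most immunodominant of the three.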

(A) The RBD domain of SARS-CoV-2 (PDB ID 7KS9, chain B) colored by SEMA prediction score (left), immunogenicity score (middle), and contact number (right). Residue colors range from brown (low values) to cyan (high values). The immunogenicity score is the fraction of RBD/antibody complexes in the PDB database in which the RBD residue is within 8.0 Å of the antibody.

(B) Correlation between SEMA scores and the antigen contact number feature.

(C) Correlation between SEMA scores and immunogenicity scores.

(D) ROC AUC values for epitope/non-epitope residue classification at different immunogenicity score thresholds.

SEMA-3D yields high correlation coefficients with both the contact number values and the estimated immunogenicity scores. In addition, ROC AUC was computed at different ratio thresholds to separate immunodominant residues (high ratio) from the remaining residues (low ratio). This gives a more reliable estimate of model performance, because most solvent-exposed residues of the RBD domain are labeled as epitopes, each interacting with an antibody in at least one complex. Across score cutoffs, SEMA-3D achieves an average ROC AUC of 0.75 on this task.
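The threshold sweep described above can be sketched in a few lines. This is an illustrative reconstruction under assumptions: the rank-based (Mann-Whitney) AUC, the function names, and the rule of skipping thresholds that leave only one class are choices made for the example, not details taken from the paper.

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC: probability that a positive outscores a negative.

    Ties count as 0.5. Returns None when only one class is present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_over_thresholds(immunogenicity, predictions, thresholds):
    """ROC AUC of the predictions at each immunogenicity-ratio threshold.

    Residues with ratio >= threshold are treated as immunodominant (positive).
    Thresholds that produce a single class are skipped.
    """
    aucs = {}
    for t in thresholds:
        labels = [1 if ratio >= t else 0 for ratio in immunogenicity]
        auc = roc_auc(labels, predictions)
        if auc is not None:
            aucs[t] = auc
    return aucs
```

Averaging the values returned by `auc_over_thresholds` gives a single summary figure comparable in spirit to the average ROC AUC reported for this task.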

 

Key innovations

  • The paper introduces a benchmark of antigens whose epitope residues are classified using two distance cutoffs. The first distance, R1, defines the positive (epitope) class; the second, R2, marks residues that are too far from the epitope and are therefore ignored in metric calculation. Limiting the R2 radius makes it possible to assess how well a model predicts epitope boundaries. In addition, a contact number feature is computed for each antigen residue, corresponding to the number of antibody atoms within radius R1 of the residue. This feature is used in model training and provides additional spatial information about the antibody-antigen interaction.
  • The paper shows that a fine-tuned protein language model (ESM-1v) and a fine-tuned inverse folding model (ESM-IF1) perform well at predicting conformational epitopes. Specifically, the models were fine-tuned on a non-redundant set of 783 antigen records, with epitope residues assigned from the antigen/antibody structures available in the PDB database and the selected R1 and R2 radii.
  • The final model, SEMA, comprises SEMA-1D (fine-tuned ESM-1v) and SEMA-3D (fine-tuned ESM-IF1) for sequence-based and structure-based conformational B-cell epitope prediction, respectively. SEMA performs well on all benchmark tasks and was trained on the masked dataset with R1 = 8.0 Å and R2 = 16.0 Å.
  • The paper also shows that SEMA can predict the immunogenicity of RBD domain residues. Here, the immunogenicity of a residue is estimated as the fraction of all available RBD/antibody complexes in which that residue is in direct contact with the antibody.
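Masking distant residues, as mentioned in the points above, means they contribute nothing to the training objective or the metrics. A minimal sketch of the idea, assuming a mean-squared-error objective on the contact number and a label of -1 for masked residues; both are assumptions for illustration, since this summary does not state the loss function used.

```python
def masked_mse(predicted, target, labels, mask_label=-1):
    """Mean squared error over residues that are not masked.

    predicted, target: per-residue contact number values.
    labels: per-residue labels; residues with label == mask_label
        (assumed here to mean "beyond R2") are excluded from the average.
    Returns 0.0 if every residue is masked.
    """
    kept = [
        (p, t)
        for p, t, label in zip(predicted, target, labels)
        if label != mask_label
    ]
    if not kept:
        return 0.0
    return sum((p - t) ** 2 for p, t in kept) / len(kept)
```

A large error on a masked residue therefore has no effect on the result, so a model is neither rewarded nor penalized for predictions far from any known epitope.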

 

 

 
