Paper interpretation: a convolutional neural network using a dinucleotide one-hot encoder for identifying N6-methyladenine sites in the rice genome
2022-07-23 12:08:00 【Windy Street】
A Convolutional Neural Network Using Dinucleotide One-hot Encoder for identifying DNA N6-Methyladenine Sites in the Rice Genome
Article URL: https://www.sciencedirect.com/science/article/abs/pii/S0925231220315137
DOI: https://doi.org/10.1016/j.neucom.2020.09.056
Journal: Neurocomputing (Zone 2)
Impact factor: 5.719
Published: 2 September 2020
Web server: http://iRicem6A-CNN.aibiochem.net
Data: 1,760-sample dataset (download link); 154,000-sample dataset (download link)
1. Overview
N6-methyladenine (m6A) is an important epigenetic modification involved in the regulation of various DNA processes. Genome-wide m6A analysis with traditional experimental methods is fundamental but time-consuming. The authors propose a new method, iRicem6A-CNN, for identifying m6A sites. It encodes each sequence with a dinucleotide (2-mer) one-hot encoder and feeds the resulting tensor to a convolutional neural network for prediction. The prediction accuracy (ACC) reached 93.82% in 5-fold cross-validation and 96.19% in independent testing, better than other available predictors. The experiments show that iRicem6A-CNN achieves high performance using dinucleotide one-hot encoding alone, and that it is more stable and more robust than the single-nucleotide (1-mer) one-hot model.
2. Background
N6-methyladenine (m6A) is an important chemical modification of DNA. It occurs widely in organisms from prokaryotes to eukaryotes and is associated with DNA replication, DNA repair, and transcriptional regulation. Next-generation sequencing technologies, and single-molecule real-time (SMRT) sequencing in particular, are increasingly used for genome-wide analysis of DNA methylation. As a result, the genome-wide distribution of m6A sites has been better characterized, which has led to a better understanding of its biological functions. For example, genome-wide studies of m6A sites have revealed that m6A has different regulatory functions in different eukaryotes, and have shown that in prokaryotes m6A serves as a mark for distinguishing invading DNA from host DNA.
In 2018, Zhou et al. used SMRT sequencing to show that 0.2% of adenines in the rice genome are m6A-methylated. Since that discovery, many machine-learning-based computational methods for identifying m6A sites in the rice genome have appeared. In 2019, Chen et al. developed im6A-Pred, a method based on a support vector machine (SVM) and trained on a benchmark dataset of 1,760 samples, which reached an accuracy (ACC) of 83.13%. Other methods built on traditional machine learning algorithms, such as SVM, random forest (RF), and Markov chain models, were then developed to identify m6A sites in the rice genome. These include im6A-DNCP, MM-m6Apred, SDM6A, iN6-methylat, and iDNAm6A-rice; among them, iDNAm6A-rice has the highest ACC, at 91.7%.
In this study, the authors developed a new method, iRicem6A-CNN, to improve the accuracy of predicting m6A sites in the rice genome. A dinucleotide one-hot encoding converts each DNA sequence into a tensor, which is then fed into a carefully designed and optimized CNN model. ACC is 93.82% in 5-fold cross-validation and 96.19% in independent testing. The experiments show that iRicem6A-CNN with dinucleotide one-hot encoding is more robust and more accurate than iRicem6A-CNN with single-nucleotide one-hot encoding. The metric comparison shows that iRicem6A-CNN performs well not only because it identifies positive samples stably, but also because it identifies negative samples more accurately.
3. Data
Chen et al. and Lv et al. established two widely used benchmark datasets for m6A sites in the rice genome, labeled Chen-rice-m6A and Lv-m6A-rice. Chen-rice-m6A consists of 1,760 samples, half positive and half negative, and has been widely used in reported models based on non-deep-learning algorithms. Lv-m6A-rice consists of 154,000 positive samples and 154,000 negative samples; it was used by Lv et al. for iDNAm6A-rice and by Yu et al. for SNNRicem6A. Every sequence in both datasets is 41 bases (bp) long, with an adenine (A) at the center. Because CNN models need a large amount of data, the authors trained the model on Lv-m6A-rice and used Chen-rice-m6A for independent testing, which also makes comparison with earlier methods easier.
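The sample format described above (41 bp, adenine at the center) can be checked with a few lines of Python. This is a minimal sketch: the function name is_valid_sample and the example sequences are made up for illustration and do not come from either dataset.

```python
SEQ_LEN = 41            # sequence length stated for both datasets
CENTER = SEQ_LEN // 2   # 0-based index 20, the middle position

def is_valid_sample(seq):
    """Return True if seq is 41 bp long and has adenine at its center."""
    seq = seq.upper()
    return len(seq) == SEQ_LEN and seq[CENTER] == "A"

# Made-up sequences for illustration only, not taken from either dataset.
good = "C" * 20 + "A" + "G" * 20
bad_center = "C" * 20 + "T" + "G" * 20
bad_length = "C" * 10 + "A" + "G" * 10

print([is_valid_sample(s) for s in (good, bad_center, bad_length)])
```

A check like this is useful before encoding, since a sequence of the wrong length would produce a tensor of the wrong shape.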
4. Method
4.1 Feature encoding
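The figure for this section is not preserved here. As a rough illustration of dinucleotide one-hot encoding, the sketch below maps each overlapping 2-mer of a sequence to a 16-dimensional one-hot vector, so a 41-bp sequence becomes a 40 x 16 array. The stride of 1, the lexicographic ordering of the 16 dinucleotides, the all-zero row for unknown symbols, and the names DINUC_INDEX and dinucleotide_one_hot are assumptions made for this example, not details confirmed by the paper.

```python
from itertools import product

BASES = "ACGT"
# Hypothetical index map: the 16 dinucleotides in lexicographic order (AA, AC, ..., TT).
DINUC_INDEX = {"".join(p): i for i, p in enumerate(product(BASES, repeat=2))}

def dinucleotide_one_hot(seq):
    """Encode a DNA sequence as a list of 16-dimensional one-hot rows,
    one row per overlapping dinucleotide (stride 1 is an assumption)."""
    seq = seq.upper()
    rows = []
    for i in range(len(seq) - 1):
        row = [0] * len(DINUC_INDEX)
        idx = DINUC_INDEX.get(seq[i:i + 2])
        if idx is not None:  # dinucleotides with N or other symbols stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows

# Made-up 41-bp sequence with a central A, for illustration only.
example = "G" * 20 + "A" + "C" * 20
encoded = dinucleotide_one_hot(example)
print(len(encoded), len(encoded[0]))  # 40 16
```

By comparison, a single-nucleotide (1-mer) one-hot encoding of the same sequence would give 41 rows of 4 values each; the 2-mer version also records which base follows each position.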

4.2 Model framework

5. Results
5.1 Model comparison based on different encoders
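The figures for this comparison are not preserved here. A minimal sketch of how such a comparison can be scored at several prediction probability thresholds is given below, assuming the standard definitions of accuracy (ACC), sensitivity (Sn), and specificity (Sp). This summary does not give the paper's exact formulas, and the labels, probabilities, and function names are made up for illustration.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN when a probability >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def metrics(y_true, y_prob, threshold=0.5):
    """Return ACC, Sn (recall on positives), and Sp (recall on negatives)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    total = tp + fp + tn + fn
    return {
        "ACC": (tp + tn) / total if total else 0.0,
        "Sn": tp / (tp + fn) if (tp + fn) else 0.0,
        "Sp": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

for t in (0.3, 0.5, 0.7):
    print(t, metrics(y_true, y_prob, t))
```

A model whose ACC, Sn, and Sp change little as the threshold moves can be described as robust to the threshold choice, which is the sense in which the two encoders are compared.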


5.2 Comparison with state-of-the-art models

6. Conclusion
Here, the authors developed a new deep-learning-based method, iRicem6A-CNN, for identifying m6A sites in the rice genome. The method first converts the input DNA sequence into a dinucleotide one-hot encoded tensor. The authors showed experimentally that the dinucleotide one-hot model outperforms the single-nucleotide one-hot model and is more robust across different prediction probability thresholds. Applied to m6A site detection in the rice genome, the model achieves high accuracy in both 5-fold cross-validation (93.82%) and independent testing (96.19%), making it one of the best predictors of m6A sites in the rice genome. The authors' analysis and comparison show that iRicem6A-CNN not only predicts m6A-positive samples accurately but also reduces the error rate on negative samples. In addition, a user-friendly web server is provided for iRicem6A-CNN.