Research on Identification of 6mA Sites in the Rice Genome Based on the XGBoost Algorithm

doi:10.11896/jsjkx.210700262

Computer Science, 2022, Vol. 49, Issue (6A): 309-313. doi: 10.11896/jsjkx.210700262

• Image Processing & Multimedia Technology •

Identification of 6mA Sites in Rice Genome Based on XGBoost Algorithm

SUN Fu-quan¹,², LIANG Ying¹

1 College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
2 School of Mathematics and Statistics, Northeastern University, Qinhuangdao, Hebei 066004, China

Online: 2022-06-10 Published: 2022-06-08
About author: SUN Fu-quan, born in 1964, Ph.D., professor. His main research interests include big data analysis and medical image processing.
LIANG Ying, born in 1996, postgraduate. Her main research interests include bioinformatics and genomic functional site recognition.
Supported by:
National Key Research and Development Project (2018YFB1402800), Hebei Higher Education Research and Practice Project (2018GJJG422) and Hebei Provincial High-level Talents Funding Project (A202101006).

Abstract

N6-methyladenine (6mA) sites play an important role in regulating gene expression in eukaryotic organisms. Accurate identification of 6mA sites helps in understanding the genome-wide distribution of 6mA and its biological functions. Various experimental methods have been used to identify 6mA sites in different species, but they are expensive and time-consuming. This paper proposes P6mA-Rice, a novel XGBoost-based method for identifying 6mA sites in the rice genome. First, a sequence-based DNA coding method that introduces and emphasizes position-specific information is employed to represent the given sequences. Effective feature extraction criteria are proposed from seven aspects to make the representation of DNA information more comprehensive. Then, the feature set PS6mA, selected according to XGBoost feature importance, is fed into the ensemble tree boosting algorithm XGBoost to construct the proposed model P6mA-Rice. A jackknife test on a benchmark dataset shows that P6mA-Rice achieves 90.55% sensitivity, 88.48% specificity, a Matthews correlation coefficient of 79.00%, and 89.49% accuracy. Extensive experiments validate the effectiveness of P6mA-Rice.
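The four evaluation measures named in the abstract (sensitivity, specificity, accuracy, and the Matthews correlation coefficient) follow from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper. The abstract reports the Matthews correlation coefficient as a percentage, which presumably corresponds to a value of about 0.79 on the usual [-1, 1] scale.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return Sn, Sp, Acc and MCC from binary confusion-matrix counts.

    tp/fn: true 6mA sites predicted as 6mA / as non-6mA.
    tn/fp: true non-6mA sites predicted as non-6mA / as 6mA.
    """
    sn = tp / (tp + fn)                    # sensitivity (recall on positives)
    sp = tn / (tn + fp)                    # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts for illustration only (not the paper's data).
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a jackknife (leave-one-out) test, each sample is predicted by a model trained on all the other samples, and the resulting predictions are pooled into a single confusion matrix before computing these measures.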

Key words: DNA, N6-methyladenine, Position specificity, Sequence, XGBoost

CLC Number: TP391

SUN Fu-quan, LIANG Ying. Identification of 6mA Sites in Rice Genome Based on XGBoost Algorithm[J]. Computer Science, 2022, 49(6A): 309-313.

