Abstract: RNA methylation plays a very important role in organisms, especially higher organisms, and is of considerable research value for obesity, reproductive development, the detection of malignant tumors, and disorders such as abnormal brain development. As more and more RNA sequences are discovered in the post-genomic era, determining by chemical experiment whether each sequence is methylated is time-consuming and laborious. There is therefore an urgent need for machine learning methods that let a computer process these RNA sequences and predict whether they are methylated.
This paper first describes in detail how RNA methylation is predicted with a single classifier, covering the benchmark dataset, feature extraction, the support vector machine (SVM), and the evaluation metrics. Feature extraction is given a chapter of its own, which introduces four ways of extracting features from RNA sequences: the RFH method, the CpG method, and the PseKNC method (comprising the PseDNC and PseTNC methods). To make the comparison straightforward, the paper then reproduces the PseDNC method proposed by Wei Chen et al.; the remaining methods use the same dataset and the same validation procedure as the PseDNC method. The experiments show that the SVM classifier trained on feature vectors extracted by the RFH method gives the best results for identifying RNA methylation, with an MCC of 0.47. The PseTNC and PseDNC methods come next, and the CpG method performs worst.
The main focus of the paper is classifier fusion. In Chapter 4, three fusion rules (arithmetic averaging, weighted averaging, and majority voting) are used to combine the four classifier models in order to obtain better classification results. The experiments show that weighted averaging gives the best fusion results: fusing PseTNC + RFH yields an MCC of 0.51, an improvement of 0.04 (4 percentage points) over the MCC of RFH, the best single classifier, under Jackknife validation.
Keywords: classifier fusion; RNA methylation; RNA feature extraction; Jackknife
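The three fusion rules named in the abstract, and the MCC used to compare them, can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that each classifier outputs a positive-class score in [0, 1], that a fused score of at least 0.5 means "methylated", and that a vote needs a strict majority. The abstract does not say how the weights are chosen, so they are passed in by the caller. All function names and example scores are hypothetical.

```python
import math


def arithmetic_mean_fusion(scores, threshold=0.5):
    """Average the scores of all classifiers per sample, then threshold.

    scores: one list per classifier, each holding one score per sample.
    """
    n = len(scores)
    return [int(sum(col) / n >= threshold) for col in zip(*scores)]


def weighted_mean_fusion(scores, weights, threshold=0.5):
    """Weighted average per sample; the weights are supplied by the caller."""
    if len(weights) != len(scores):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        int(sum(w * s for w, s in zip(weights, col)) / total >= threshold)
        for col in zip(*scores)
    ]


def majority_vote_fusion(scores, threshold=0.5):
    """Each classifier votes 0 or 1; a strict majority is needed for 1 (assumption)."""
    n = len(scores)
    return [
        int(2 * sum(s >= threshold for s in col) > n)
        for col in zip(*scores)
    ]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Made-up scores from three hypothetical classifiers on four samples.
    scores = [
        [0.9, 0.4, 0.2, 0.6],
        [0.8, 0.6, 0.1, 0.3],
        [0.3, 0.7, 0.4, 0.2],
    ]
    y_true = [1, 1, 0, 0]
    for name, pred in [
        ("arithmetic mean", arithmetic_mean_fusion(scores)),
        ("weighted mean", weighted_mean_fusion(scores, [2, 1, 1])),
        ("majority vote", majority_vote_fusion(scores)),
    ]:
        print(name, pred, mcc(y_true, pred))
```

Averaging uses the magnitude of each score, whereas voting only counts thresholded decisions, so one very confident classifier can sway the average but not the vote. Weighted averaging additionally lets a stronger model, for example one with a higher validation MCC, contribute more to the fused score.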
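Contents
Chapter 1 Introduction
1.1 Research background and significance
Source: Research on RNA Methylation Identification Based on Classifier Fusion (with source code): http://www.youerw.com/shuxue/lunwen_205050.html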