Research on RNA Sequence Methylation Site Identification Based on Statistical Features (with Source Code)

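Abstract: N6-methyladenosine (m6A) is a methylation modification at the sixth nitrogen atom of adenine in RNA molecules, and it takes part in many transcriptional regulation processes. Recent studies show that m6A plays a regulatory role in biological processes such as mRNA splicing, export, stability, immune tolerance, and transcription, so locating it accurately across the whole genome is of great importance to bioinformatics. Various experiments and rigorous cross-validation have shown that identifying methylation sites in RNA sequences is feasible. As RNA sequence data has grown in the post-genomic era, considerable effort has gone into developing vector representations of sequence features, because existing machine learning methods operate on vectors rather than on raw sequences, and because biological experiments are too time-consuming and expensive to meet the need for processing RNA sequences in batches.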

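To address these problems, this thesis constructs RNA methylation site samples from 1183 genes of the Saccharomyces cerevisiae genome and uses them as the benchmark dataset. Features are extracted with statistical methods, and an SVM classifier serves as the prediction engine, giving a predictor based on statistical features. The predictor is optimized by 10-fold cross-validation and its performance is measured with the jackknife test, with the aim of reducing experimental cost and improving prediction results. Common prediction methods rely either on physicochemical properties or on statistical features; this thesis takes the statistical approach and compares predictions based on position specificity, nucleotide composition, and accumulated nucleotide frequency. The experimental results show that combining these three feature types gives a Matthews correlation coefficient (MCC) of 0.5180, which is 0.228 higher than that of iRNA-Methyl, a method based on physicochemical properties, and 0.118 higher than that of pRNAm-PC, a method proposed this year. The proposed method therefore improves prediction performance markedly and is feasible, which suggests that the selected statistical features represent RNA sequences better than physicochemical properties do and give higher accuracy.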

Keywords: feature extraction; SVM classifier; position specificity; accumulated nucleotide frequency; PSNP; PSDP
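The three statistical feature types named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the thesis's exact definitions, so the code below assumes definitions commonly used in the literature: nucleotide composition as whole-sequence fractions, accumulated nucleotide frequency as the density of each nucleotide within its prefix, and position-specific nucleotide propensity (PSNP) as the per-position frequency difference between positive and negative training samples. All function names and the toy sequences are hypothetical.

```python
from collections import Counter

NUCLEOTIDES = "ACGU"


def nucleotide_composition(seq):
    """Fraction of A, C, G, U over the whole sequence (4 values)."""
    counts = Counter(seq)
    return [counts[n] / len(seq) for n in NUCLEOTIDES]


def accumulated_frequency(seq):
    """For position i (1-based), the density of seq[i-1] within seq[:i]."""
    seen = Counter()
    densities = []
    for i, nucleotide in enumerate(seq, start=1):
        seen[nucleotide] += 1
        densities.append(seen[nucleotide] / i)
    return densities


def position_specific_propensity(positives, negatives):
    """Per-position nucleotide frequency in positives minus that in negatives.

    Assumes all sequences have the same length (fixed window around a site).
    """
    table = []
    for pos in range(len(positives[0])):
        pos_counts = Counter(s[pos] for s in positives)
        neg_counts = Counter(s[pos] for s in negatives)
        table.append({
            n: pos_counts[n] / len(positives) - neg_counts[n] / len(negatives)
            for n in NUCLEOTIDES
        })
    return table


def encode(seq, table):
    """Concatenate the three feature groups into one numeric vector."""
    psnp = [table[i][n] for i, n in enumerate(seq)]
    return psnp + nucleotide_composition(seq) + accumulated_frequency(seq)


# Toy, made-up sequences purely for illustration (not thesis data).
toy_positives = ["GGACU", "AGACA"]
toy_negatives = ["CUACG", "UUACC"]
toy_table = position_specific_propensity(toy_positives, toy_negatives)
toy_vector = encode("GGACU", toy_table)
```

A vector produced this way has a fixed length for a fixed window size, which is what a vector-based classifier such as an SVM requires as input.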

Abstract N6-methyladenosine (m6A) is a modification that occurs at the sixth nitrogen atom of adenine. Recent studies have shown that this modification is involved in transcriptional regulation, including mRNA splicing, export, stability, immune tolerance, and transcription. To better understand the regulatory mechanism of m6A, it is necessary to identify m6A sites accurately, which is vital to bioinformatics. Various experiments and rigorous cross-validation have shown that recognizing methylation sites in RNA sequences is feasible. With the growth of RNA sequence data in the post-genomic era, many efforts have been made to develop feature vectors that represent RNA sequences, because current machine learning methods can process only vectors, not sequences, and because most traditional methods cost too much time and money to meet the need for batch processing of RNA sequences.

To solve these problems, this thesis extracts statistical features and constructs a classifier to detect methylation sites in RNA sequence samples, reducing cost and improving predictions. Common prediction methods are based either on physicochemical properties or on statistical features; the method used here is based on statistical features. Predicting m6A sites from position-specific features and accumulated nucleotide frequency yields an MCC of 0.5180, which is 0.228 higher than that of iRNA-Methyl and 0.118 higher than that of pRNAm-PC. The results show a significant improvement in prediction, which indicates that statistical features represent RNA sequences better than physicochemical properties do.

Keywords: feature extraction; SVM classifier; position specificity; accumulated nucleotide frequency; PSNP; PSDP
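The MCC (Matthews correlation coefficient) used to compare predictors in the abstract is computed from the four confusion-matrix counts. The sketch below uses the standard formula; the function name is hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here, not something stated in the thesis.

```python
import math


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Illustrative counts only (not results from the thesis).
perfect = matthews_corrcoef_from_counts(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef_from_counts(tp=25, tn=25, fp=25, fn=25)
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), and it remains informative when positive and negative samples are unbalanced.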
