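Abstract: N6-methyladenosine (m6A) is a methylation of the nitrogen atom at position 6 of adenine in RNA molecules, and it takes part in many kinds of transcriptional regulation. Recent studies show that m6A has a regulatory role in biological processes such as mRNA splicing, export, stability, immune tolerance, and transcription, so locating m6A sites accurately across the whole genome is of great importance to bioinformatics. Experiments and rigorous cross-validation have shown that recognizing methylation sites in RNA sequences is feasible. As RNA sequence data have grown in the post-genomic era, considerable effort has gone into vector representations of sequence features, because existing machine learning methods can process only vectors, not sequences, and because biological experimental methods cost too much time and money to meet the demand for batch processing of RNA sequences.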
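To address these problems, this thesis constructs RNA sequence methylation site samples from 1183 genes of the Saccharomyces cerevisiae genome and uses them as the benchmark dataset. Features are extracted with statistical methods and an SVM classifier serves as the prediction engine, giving a predictor based on statistical features. Parameters are tuned by 10-fold cross-validation, and the predictor's performance is measured by jackknife validation, with the aim of reducing experimental cost and improving prediction results. Common prediction methods are based either on physicochemical properties or on statistical features; this thesis takes the statistical approach and compares predictions made with position specificity, nucleotide composition, and accumulated nucleotide frequency. The experiments show that combining these three properties yields a Matthews correlation coefficient (MCC) of 0.5180, which is 0.228 higher than that of the physicochemical-property-based iRNA-Methyl method and 0.118 higher than that of pRNAm-PC, which was proposed only this year. The proposed method therefore improves prediction performance markedly and is feasible, which suggests that the selected statistical features are more representative than physicochemical properties and give higher accuracy.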
Thesis keywords: feature extraction; SVM classifier; position specificity; accumulated nucleotide frequency; PSNP; PSDP
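The two simplest statistical features named in the abstract can be illustrated as follows. This is a minimal sketch, not the thesis's implementation: it assumes the commonly used prefix-density definition of accumulated nucleotide frequency (at each position, the fraction of the sequence so far that matches the nucleotide at that position). The thesis's exact definitions, sequence window length, and the PSNP/PSDP features are not given here and are not reproduced. All function names are hypothetical.

```python
from collections import Counter

NUCLEOTIDES = "ACGU"  # RNA alphabet; the order fixes the feature layout


def nucleotide_composition(seq):
    """Return the fraction of A, C, G and U in seq (4 values)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    return [counts[n] / len(seq) for n in NUCLEOTIDES]


def accumulated_nucleotide_frequency(seq):
    """For each 1-based position i, return the fraction of seq[:i]
    that equals the nucleotide at position i (assumed definition)."""
    seen = Counter()
    densities = []
    for i, nt in enumerate(seq, start=1):
        seen[nt] += 1
        densities.append(seen[nt] / i)
    return densities


def feature_vector(seq):
    """Concatenate both feature groups into one flat vector
    that a classifier such as an SVM could consume."""
    return nucleotide_composition(seq) + accumulated_nucleotide_frequency(seq)
```

For the toy sequence `"AACGA"`, `nucleotide_composition` gives `[0.6, 0.2, 0.2, 0.0]` and `feature_vector` has 4 + 5 = 9 entries. Because the second group has one value per position, all samples would need the same length before being passed to a classifier.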
Abstract: N6-methyladenosine (m6A) is a methylation of the nitrogen atom at position 6 of adenine. Recent studies have shown that this modification is involved in transcriptional regulation, including mRNA splicing, export, stability, immune tolerance, and transcription. To better understand the regulatory mechanism of m6A, it is necessary to identify m6A sites accurately, which is of great importance to bioinformatics. Experiments and rigorous cross-validation have shown that recognizing methylation sites in RNA sequences is feasible. As RNA sequence data have grown in the post-genomic era, many efforts have been made to develop feature vectors that represent RNA sequences. The reason is that current machine learning methods can process only vectors, not sequences, and that most traditional experimental methods cost too much time and money to meet the need for batch processing of RNA sequences.
To solve these problems, this thesis extracts statistical features and constructs a classifier to detect methylation sites in RNA sequence samples, with the aim of reducing cost and improving prediction results. Common prediction methods are based either on physicochemical properties or on statistical features; the method used in this thesis is based on statistical features. It predicts m6A sites from position-specific and accumulated nucleotide frequency features and achieves an MCC of 0.5180, which is 0.228 higher than that of iRNA-Methyl and 0.118 higher than that of pRNAm-PC. The results show that this method improves prediction performance markedly, which suggests that statistical features represent RNA sequences better than physicochemical properties do.
Keywords: feature extraction; SVM classifier; position specificity; accumulated nucleotide frequency; PSNP; PSDP
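The MCC values quoted in the abstract are Matthews correlation coefficients, which are computed from the four cells of a binary confusion matrix. The sketch below shows the standard formula. The function name is hypothetical and the counts in the example are made up for illustration; they are not results from the thesis.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention
    for degenerate cases such as a classifier that predicts one class only.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Illustrative counts only (not taken from the thesis).
print(round(mcc_from_counts(tp=90, tn=80, fp=20, fn=10), 4))  # 0.7035
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, so it remains informative when positive and negative samples are unevenly distributed.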
Title: Research on RNA Sequence Methylation Recognition Methods Using Statistical Features (with source code): http://www.youerw.com/shuxue/lunwen_89891.html