摘要随着近些年来测序技术的飞速发展,人类产生了海量的生物序列数据,亟需通过有效的计算手段进行分析和处理。DNA序列的k-mer index问题是生物序列信息处理中一个非常重要的问题。目前国内外有许多学者提出了相应的模型与算法,这些模型建立在各式各样的理论基础上,但是实际效果却不是特别理想。41083
本文针对DNA序列中k-mer的索引问题开展了相关研究。提出了基于哈希表的DNA序列k-mer索引模型,但与传统的索引模型不同,该模型预先对于不同长度的k-mer片段分别建模,随后再将其通过计算得出的最近临界值合并模型,得到最终的索引模型。这样使得模型可以应对更宽的k-mer长度范围,也可应对更大的数据量,并能更高效的查找出所需的k-mer片段。
本文首先对哈希表进行了综述和理论研究;其次,论文对DNA序列与k-mer片段的特点进行分析;然后,对于不同长度的k-mer分别进行建模,并求出相应的时间与空间复杂度;最后,对于一组具体的数据,求出了两个模型的最佳临界值,并编程实现展示结果。 毕业论文关键字:k-mer索引; 时间复杂度; 空间复杂度; 哈希表
Abstract
In recent years, with the rapid development of sequencing technology, the massive biological sequence data have been generated and we urgently need effective calculation method to analysis and process these data. The k-mer index in DNA sequence is a very important issue in biological sequence information processing. At present, many scholars at home and abroad have proposed the corresponding model and algorithm based on a variety of theory, but the actual effect is not particularly desirable.
In this paper, we perform related research to the k-mer index issues in DNA sequence. Therefore, the k-mer indexing model in DNA sequence based on the hash table is proposed, but it is other than the traditional index models. Firstly, we build models to the k-mer fragments with different lengths separately, and then we combine model according to the calculated recent critical value. In the final, we get the final index model. The model is corresponding to the wider k-mer length range and a greater amount of data, what’s more, we can find required the k-mer fragments efficiently.
Firstly we perform overview and theoretical research to the hash table in this paper. Secondly, we analyses the features of DNA sequences and k-mer fragments. Then, for the k-mer fragments with different length, we build models separately and get the corresponding time and space complexity. Finally, for a set of specific data, we get the optimal cutoff of two models, and achieve the results by programming.
Keywords: k-mer index; time complexity; space complexity; hash table
目录
第一章 绪论 1
第一节 研究背景 1
第二节 主要研究内容 1
第三节 论文的组织架构 2
第二章 相关理论分析 2
第一节 哈希表与哈希函数 2
2.1.1哈希表的定义 2
2.1.2常用的哈希函数 2
第二节 哈希表中碰撞问题 3
2.2.1碰撞的意义 3
2.2.2 碰撞的常用解决办法 4
第三节 本章小节 4
第三章 基于哈希表的k-mer索引-查找模型 5
第一节 对于小片段k-mer索引-查找模型 5
3.1.1 小片段的k-mer索引模型建立 5
3.1.2 小片段的k-mer查找模型建立 6
第二节 对于大片段k-mer的改进模型 7
3.2.1 改进的k-mer索引模型建立 7
3.2.2 改进的k-mer查找模型建立 9