Abstract

In the Internet era, information is as boundless as the sea, and even the way people obtain it has changed: from leafing through books and dictionaries to querying search engines. The problem today is not too little information but too much, so much that it is hard to tell apart and hard to choose from. A tool that can scrape data from the web and analyze it automatically is therefore of considerable value.

Traditional search engines usually return information in the form of web pages. Such pages are natural for people to read, but difficult for a computer to process and reuse. The volume of results is also too large for the most relevant information to be picked out easily. Automatic keyword recognition can be used to filter the needed information out of this mass of data.

Taking "博客园" (cnblogs.com) as an example, this thesis uses C# to scrape the profile information and technical articles of well-known bloggers on the site. The Firefox browser is used to analyze the page layout and obtain the tag identifiers of the data; NSoup parses the downloaded HTML into a Document object, and the identifiers are used to locate the nodes that hold the data; an MSSQL database stores the captured data sets; MVC web development techniques are used to present the data from the database in a sensible layout; and finally the ECharts charting plugin is used to visualize the data.

Keywords: 博客园 (cnblogs.com); C#; NSoup; ECharts; ASP.NET MVC
Contents
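Chapter 1 Introduction
1.1 Research Background
1.2 Project Workflow
1.3 Project Architecture
Chapter 2 Website Analysis
2.1 Web Page Analysis
2.2 Page Analysis
2.3 Results of the Website Analysis
2.3.1 Extracting DOM Element Identifiers
2.3.2 Handling Special Blogger Styles
Chapter 3 Data Scraping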