Design and Implementation of a Multi-Search-Engine Information Collection and Analysis System

Abstract: As the largest data-sharing hubs on the Internet, search engines have become the main information source for many data applications: an application first collects data from a search engine, then analyzes and processes it before serving users. Each search engine has its own strengths, and the data they return are not identical. For example, Baidu covers websites in China well, while Google covers websites outside China more thoroughly. Collecting and analyzing data from several search engines at once therefore yields more comprehensive information and supports better service for users. To this end, this project designs and implements a multi-search-engine information collection and analysis system. The system consists of four functional modules: task management, information collection, search result analysis, and user management. The task management module handles creating, querying, and deleting search tasks. The information collection module runs each search task against the individual search engines and retrieves the results. The search result analysis module parses the results returned by each engine to extract the title, URL, image, content, and other fields of every result. The user management module manages basic user information. The system does not serve end users directly; it provides data services to other applications. For each search task it collects information from the search engines, analyzes the results, stores them in a database, and makes them available to other applications. Testing shows that the system runs well and meets its design goals.

Keywords: search engine; information collection; HTML analysis; regular expression
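
The search result analysis step named in the abstract and keywords (HTML analysis with regular expressions) can be illustrated with a minimal sketch. It is not the thesis's actual implementation: the HTML markup, the `parse_results` function, and the field names of the returned records are assumptions made for this example, and real search engines use different markup that changes over time.

```python
import re
from html import unescape

# Hypothetical result-page markup, invented for this sketch.
# Real search engines use different (and frequently changing) HTML.
SAMPLE_HTML = """
<div class="result">
  <h3><a href="http://example.com/a">First &amp; <em>best</em> result</a></h3>
  <img src="http://example.com/a.png">
  <p class="abstract">Summary of the <em>first</em> page.</p>
</div>
<div class="result">
  <h3><a href="http://example.org/b">Second result</a></h3>
  <p class="abstract">Summary of the second page.</p>
</div>
"""

# One pattern per field; re.S lets "." match across line breaks.
RESULT_RE = re.compile(r'<div class="result">(.*?)</div>', re.S)
TITLE_RE = re.compile(r'<h3><a href="([^"]+)">(.*?)</a></h3>', re.S)
IMAGE_RE = re.compile(r'<img src="([^"]+)"')
ABSTRACT_RE = re.compile(r'<p class="abstract">(.*?)</p>', re.S)
TAG_RE = re.compile(r'<[^>]+>')


def clean(fragment):
    """Strip inline tags, decode HTML entities, and collapse whitespace."""
    return " ".join(unescape(TAG_RE.sub("", fragment)).split())


def parse_results(html, engine):
    """Return one dict per search result found in the page."""
    records = []
    for block in RESULT_RE.findall(html):
        title = TITLE_RE.search(block)
        if title is None:
            continue  # skip blocks that do not look like a result
        image = IMAGE_RE.search(block)
        abstract = ABSTRACT_RE.search(block)
        records.append({
            "engine": engine,
            "url": title.group(1),
            "title": clean(title.group(2)),
            "image": image.group(1) if image else None,
            "content": clean(abstract.group(1)) if abstract else "",
        })
    return records


if __name__ == "__main__":
    for record in parse_results(SAMPLE_HTML, "demo-engine"):
        print(record)
```

In a multi-engine setting, each engine would presumably need its own set of patterns, while the record layout stays the same so that results from different engines can be stored in one database table. Regular expressions are fragile against markup changes, so the patterns would need maintenance whenever an engine changes its result page.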

1 Introduction

1.1 Research Background

1.2 Research Status at Home and Abroad and Existing Problems

1.3 Structure of This Thesis

2 System Requirements Analysis
