Abstract: The content returned by general-purpose search engines is broad and cluttered, which greatly reduces the efficiency with which users obtain useful information. To address this problem, this project designs a directional web crawler that crawls information within a website according to the user's specific needs. The crawler's main program is written in the Python programming language; the Scrapy framework in Python simplifies development, and multithreading is implemented in the crawler to improve crawling speed. A breadth-first search strategy is used so that the target website can be crawled as comprehensively as possible, and the extracted valid information is then stored in a database; MySQL, being free, open source, and powerful, is the best choice for this purpose. Finally, to facilitate later use, the data in the database is exported to a local text document for consolidation.
Keywords: web crawler; Python; database; PyCharm
A Web Crawler for Directional Crawling of Text Information
Abstract: The content returned by general-purpose search engines is too broad and miscellaneous, reducing the efficiency with which users obtain useful information. In order to solve this problem, this project designs a targeted web crawler that caters to the specific needs of users and gathers information within a given website. The crawler's main program is written in Python, whose Scrapy framework simplifies the development process. Meanwhile, multithreading is implemented to optimize the speed of information gathering. A breadth-first search strategy is applied so that as much information as possible is found on the target website, and the extracted valid information is then stored in a database. Among the available options, MySQL stands out as the best choice for its free, open-source access and powerful functions. In order to facilitate future use, the data in the database is finally exported to a local text document for integration.
Key words: web crawler; Python; database; PyCharm
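For illustration only, the following is a minimal sketch of how such a crawler might be organized with Scrapy and a MySQL pipeline; it is not the project's actual code. The target site URL, CSS selectors, table name, and connection parameters are hypothetical placeholders, and the breadth-first behavior is obtained through Scrapy's documented queue settings.

# A minimal sketch, assuming Scrapy and PyMySQL are installed and a MySQL
# database "spider_db" with table pages(url, title) already exists.
import pymysql
import scrapy


class TargetSpider(scrapy.Spider):
    name = "target"
    start_urls = ["https://example.com/"]          # hypothetical target website

    custom_settings = {
        # Breadth-first order: prioritize shallow pages and use FIFO queues
        "DEPTH_PRIORITY": 1,
        "SCHEDULER_DISK_QUEUE": "scrapy.squeues.PickleFifoDiskQueue",
        "SCHEDULER_MEMORY_QUEUE": "scrapy.squeues.FifoMemoryQueue",
        # Scrapy issues requests concurrently on top of its event loop
        "CONCURRENT_REQUESTS": 16,
        # Register the storage pipeline defined below (script-run path)
        "ITEM_PIPELINES": {"__main__.MySQLPipeline": 300},
    }

    def parse(self, response):
        # Extract the fields of interest (selectors are placeholders)
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow in-site links so the target website is covered as fully as possible
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)


class MySQLPipeline:
    """Store extracted items in MySQL (schema and credentials are assumptions)."""

    def open_spider(self, spider):
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="secret", database="spider_db",
                                    charset="utf8mb4")
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        self.cur.execute("INSERT INTO pages (url, title) VALUES (%s, %s)",
                         (item["url"], item["title"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()


if __name__ == "__main__":
    # Run the spider as a standalone script
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    process.crawl(TargetSpider)
    process.start()

Exporting the stored rows to a local text document can then be done separately with a simple SELECT over the table and a file write, independent of the crawl itself.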
Contents