Abstract
With the rapid growth of the Internet, demand for online information keeps increasing, while the volume of information on the Internet has become both enormous and complex. Since the first search engine appeared in 1990, finding information online has become much more convenient; well-known search engines today include Google and Baidu. Web crawling is an indispensable part of a search engine, helping it become faster, more accurate, and more convenient to use.
This thesis focuses on the design and implementation of a web crawler. It studies the main crawling strategies, the crawler workflow, and how a crawler is built. A single-threaded web crawler based on a depth-first strategy is implemented in Python; using the urlopen function imported from Python's urllib2 module, the program fetches the source code of a given web page. Building this crawler also serves as a way to learn the Python language, to understand how the relevant Internet protocols work, and to understand how web crawlers are constructed and how they operate.
The thesis first outlines the background of web crawlers, then introduces their working principles and the technologies used, and finally implements a simple web crawler. Experiments confirm that the program can retrieve the source code of a given web page.
Keywords: web crawler; Python; network protocol; source code; crawler
Contents
Chapter 1 Introduction
1.1 Background of Web Crawlers
1.2 Research Methods, Steps, and Measures
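Chapter 2 Related Technologies
2.1 Introduction to Python
2.2 Installing Python
2.3 Common Python Modules
2.3.1 The urllib2 Module in Detail
2.4 Common Python Libraries
2.5 17 Common Errors Encountered When Running Python
2.6 Principles of Web Crawlers
2.7 Web Crawler Strategies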
2.7.1 Crawler-Based Strategies