Web Information Extraction and Analysis Based on HTMLParser

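Abstract: Web information extraction and analysis operates on a specific web page chosen by the user: information is extracted from the page and then analyzed. Two approaches are common. The first fetches the chosen page directly from the network while online and analyzes it. The second first saves the page's HTML file locally and then analyzes the saved file. In both cases, the analyzed information is finally stored in HTML format.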

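This thesis first introduces Hypertext Markup Language (HTML), then briefly describes the principles, classification, and use of HTMLParser. The main part is a detailed study of how to extract page content with HTMLParser, with emphasis on the module design and the extraction workflow. Finally, the page extraction is debugged and implemented, and the analysis is completed.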

Thesis keywords: HTML, extraction, Parser, parsing

Graduation Project Report (Thesis): Foreign-Language Abstract

Title: Web Information Extraction and Analysis Based on HTMLParser

Abstract

Web information extraction and analysis operates on a specific web page chosen by the user: information is extracted from the page and then analyzed. There are two usual methods. One retrieves the chosen page directly from the network while online and analyzes it. The other first stores the page's HTML file locally and then analyzes that file. Finally, the analyzed information is stored in HTML format.

This thesis first introduces HTML, then briefly describes the principles, classification, and use of HTMLParser. It then outlines how web pages are extracted and analyzed with HTMLParser, describes the functional modules for page extraction and analysis, and finally verifies through testing that the system operates as intended.

Keywords: HTML, extraction, Parser, analysis
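
The two acquisition routes named in the abstract (fetch the page online, or analyze a locally saved copy) and the final step of storing the result as HTML can be illustrated with a minimal sketch. It is not the thesis's implementation: the thesis uses the Java HTMLParser library, whereas this sketch substitutes Python's standard-library `html.parser.HTMLParser` as a stand-in. The names `PageExtractor`, `load_from_url`, `load_from_file`, `extract`, and `to_html_report` are hypothetical, and the choice of what to extract (title, links, visible text) is an assumption made for illustration.

```python
import html
from html.parser import HTMLParser  # Python stdlib parser, a stand-in for the Java HTMLParser library
from pathlib import Path
from urllib.request import urlopen


class PageExtractor(HTMLParser):
    """Collects the title, link targets, and visible text of one page (assumed targets)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.links = []
        self.text = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        stripped = data.strip()
        if self._in_title:
            self.title += stripped
        elif stripped:
            self.text.append(stripped)


def load_from_url(url, encoding="utf-8"):
    """Route 1: fetch the page over the network."""
    with urlopen(url) as resp:
        return resp.read().decode(encoding, errors="replace")


def load_from_file(path, encoding="utf-8"):
    """Route 2: read a page that was saved to disk beforehand."""
    return Path(path).read_text(encoding=encoding, errors="replace")


def extract(markup):
    """Parse HTML text and return the populated extractor."""
    parser = PageExtractor()
    parser.feed(markup)
    parser.close()
    return parser


def to_html_report(page):
    """Store the extracted information as a small HTML document."""
    items = "\n".join(
        f'<li><a href="{html.escape(h, quote=True)}">{html.escape(h)}</a></li>'
        for h in page.links
    )
    paras = "\n".join(f"<p>{html.escape(t)}</p>" for t in page.text)
    return (
        "<html><head><title>" + html.escape(page.title) + "</title></head>\n"
        "<body>\n<ul>\n" + items + "\n</ul>\n" + paras + "\n</body></html>\n"
    )
```

Both routes feed the same `extract` function, so the analysis step does not depend on where the markup came from. For example, `to_html_report(extract(load_from_file("page.html")))` would process a saved copy, where `page.html` is a placeholder file name.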

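1 Introduction

1.1 Research purpose and significance

1.2 Research content of the thesis

1.3 Organization of the thesis

2 Related Principles and Technologies

2.1 The HTML language

2.1.1 The concept of HTML

2.1.2 Writing HTML documents and naming web page files

2.1.3 Basic structure of HTML

2.1.4 Characteristics of HTML

2.2 HTML parsers

2.2.1 The concept of a parser

2.2.2 Grammar and structure of HTMLParser

2.2.3 How HTMLParser processes HTML pages

3 Design of the HTMLParser-Based Web Information Extraction and Analysis System

3.1 System architecture design

3.2 Functional module design

3.2.1 Page fetching

3.2.2 Page parsing

3.2.3 Display module

3.2.4 File management

4 System Implementation and Operation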

4.1 System implementation