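Abstract: The Internet has entered the age of social media, exemplified by BBS forums, e-commerce sites, and Sina Weibo. Weibo data supports user preference analysis, topic analysis, mining of user relationship networks, sentiment analysis, hot-topic statistics, and public opinion analysis. However, on Sina Weibo anyone can speak and anyone can listen, so the platform holds a massive but fragmented body of data. How to extract useful text from this data for subsequent analysis, mining, and management has therefore become a focus of research. At present there is no mature system for filtering Weibo text by topic keyword. This thesis targets Sina Weibo: it analyzes the feasibility of collecting data through the API, designs and implements code that crawls Weibo through the PC site and the WAP site, compares the three strategies, and, focusing on the WAP site, designs and implements a web crawler and information acquisition system for Sina Weibo. With the system, users can look up Weibo posts that contain a given keyword within a given time period.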
Keywords: Sina Weibo; topic keywords; filtering; information extraction; information acquisition; Python
Web Crawler and Information Acquisition System Based on Sina Weibo
The Internet has entered the age of social media, represented by BBS forums, e-commerce websites, and Sina Weibo. Based on Sina Weibo data, we can analyze user preferences and topics, mine the relationship networks among users, and so on. However, on Sina Weibo everybody can both speak and listen to others, so Weibo holds a huge amount of fragmented information. Researchers therefore focus on how to extract useful information from the mass of Weibo data for follow-up analysis. There is as yet no mature system that extracts Weibo text according to keywords. In this thesis, I analyze the feasibility of collecting data through the API, write code that crawls Weibo data through the PC site and the WAP site, and compare the three strategies. Finally, I design and implement an information acquisition system for Sina Weibo based on the WAP strategy. Through the system, users can view Weibo content that matches given keywords within a given period of time.
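Contents
1 Introduction
1.1 Research Background and Significance
1.3 Research Objectives and Content
2 Related Technologies and Feasibility Analysis
2.1 Manual Copying
2.2 API Interface Technology
2.3 Web Crawler Technology
2.4 Summary and Analysis of Technologies
2.5 Development Tools
2.6 Feasibility Analysis
3 Crawler System Design and Implementation
3.1 Overall System Framework
3.2 Weibo Web Crawler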