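Abstract (translated from the Chinese)

Sina Weibo is an information sharing and communication platform that serves the public's entertainment and leisure needs. Because so many users gather there, it is an important focal point for monitoring public opinion. Sina Weibo is one of the best platforms for releasing exclusive information. In a sense it is a "press conference that never closes," and it has become an important source for media monitoring and for tracking breaking news.

However, the Internet is vast and its information is disorganized. Monitoring public opinion by hand alone is laborious and inefficient, so a public opinion monitoring system designed for Sina Weibo is necessary. This work designs and implements the web crawler component of such a system. The crawler stores statuses and comments related to keyword-defined topics in a database. To keep the system complete, JSP is used to present the data stored in the database. The main contents of this thesis are:

1. Several typical crawling strategies are studied and their characteristics analyzed. To address the shortcomings of general-purpose crawlers, a practical and efficient design method for a precise topic crawler for Sina Weibo is proposed. The method is tailored to the characteristics of Sina Weibo and uses template configuration to analyze and crawl its content.
2. The components used by the system are briefly introduced, together with the system's design ideas and the implementation of its specific modules.
3. The proposed public opinion prediction algorithm is tested in simulation, and the tests verify its feasibility.

The crawler implemented here not only meets the daily working needs of the development team but also outperforms general-purpose crawlers in performance, precision, and recall. The system can provide support for responding quickly to sudden online incidents.

Keywords: Sina Weibo; public opinion; public opinion system; topic crawler; information gathering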
Title Research on Information Gathering for Social Public Opinion
Abstract

Sina Weibo is a platform for sharing and exchanging information that provides the public with entertainment and everyday services. As a place where large numbers of people gather, it is a significant focal point for detecting public opinion. Sina Weibo is one of the best platforms for releasing exclusive information. In a sense it is a "press conference that never closes," and it has become one of the important sources for media monitoring and for tracking breaking news.

However, the Internet is enormous and its information is disordered. Detecting public opinion purely by manual inspection means a heavy workload and low efficiency, so designing a public opinion detection system for Sina Weibo is necessary. This thesis designs and implements the web crawler part of such a system. The crawler enters statuses and comments relevant to keyword topics into a database, and, to keep the system complete, JSP is used to present the data held in the database. The main contents of this thesis are:

(1) Several typical crawling strategies are studied and their characteristics analyzed. To address the shortcomings of ordinary crawlers, we propose a practical and efficient design method for a precise topic crawler for Sina Weibo. The method targets the characteristics of Sina Weibo and uses template configuration to analyze and crawl Sina Weibo information.
(2) The components used by the system are briefly introduced, along with the system's design concepts and the implementation of its specific modules.
(3) The proposed public opinion prediction algorithm is tested in simulation, and the tests verify the feasibility of the algorithm.

The crawler implemented here not only meets the daily working needs of the development team but is also superior to general-purpose crawlers in performance, precision, and recall. The system can provide support and help for responding quickly to online emergencies.
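
The template-configured topic crawler described in item (1) can be illustrated with a minimal sketch. It is not the thesis implementation: the page markup, the `TEMPLATE` dictionary, and the function names `extract_posts`, `is_on_topic`, and `crawl_page` are all assumptions made for illustration, and Python is used only for brevity. The idea shown is that a template tells the crawler where each field sits in a page, and a keyword test then keeps only the on-topic statuses.

```python
import re
from dataclasses import dataclass

# Hypothetical template: one regular expression per field. The markup it
# targets is invented for this sketch and is NOT Sina Weibo's real page layout.
TEMPLATE = {
    "item": r'<div class="post">(.*?)</div>',
    "author": r'<span class="author">(.*?)</span>',
    "text": r'<p class="text">(.*?)</p>',
}


@dataclass
class Post:
    author: str
    text: str


def extract_posts(html, template):
    """Apply the template to a page and return every post it can locate."""
    posts = []
    for block in re.findall(template["item"], html, flags=re.S):
        author = re.search(template["author"], block, flags=re.S)
        text = re.search(template["text"], block, flags=re.S)
        if author and text:  # skip blocks the template cannot fully parse
            posts.append(Post(author.group(1).strip(), text.group(1).strip()))
    return posts


def is_on_topic(post, keywords):
    """A post is on topic if its text contains any keyword (case-insensitive)."""
    lowered = post.text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def crawl_page(html, template, keywords):
    """Extract posts with the template, then keep only the on-topic ones."""
    return [p for p in extract_posts(html, template) if is_on_topic(p, keywords)]


# Invented sample page that matches the hypothetical template above.
SAMPLE_PAGE = """
<div class="post"><span class="author">alice</span><p class="text">Flood warning issued downtown</p></div>
<div class="post"><span class="author">bob</span><p class="text">Lunch was great today</p></div>
"""

if __name__ == "__main__":
    for post in crawl_page(SAMPLE_PAGE, TEMPLATE, ["flood"]):
        print(post.author, "-", post.text)
```

Keeping the field locations in a template rather than in the parsing code means that a change in page layout would only require editing the template. A real system would presumably also fetch pages, follow comment links, and write the retained posts to the database, none of which is shown here.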