XML URL Classification based on their semantic structure orientation for Web Mining Applications
Research on the Web is an emerging field. Examples include improving Web quality through usability testing, Web information extraction, browsing the Web on small-screen devices such as mobile phones and PDAs (Personal Digital Assistants), and tracking product opinions by analyzing user reviews. In general, this body of work is called Web Mining. According to the analysis target, Web Mining can be divided into three types: Web Usage Mining, Web Structure Mining and Web Content Mining.
The World Wide Web Consortium (W3C) has stated that HTML has many drawbacks: a limited set of defined tags, no case sensitivity, a semi-structured format, and a design intended only for displaying data with limited options. To overcome these difficulties, technologies such as XML and Flash (with good design options) were later introduced. Web developers therefore began migrating to these emerging Web technologies to give a better description of the semantic structure of Web page content. As a result, more Web pages today are developed using XML and Flash technologies [3].
These new technologies have opened many research fields. We previously proposed a dataset creation technique for XML URLs [4]. We then analyzed the dataset according to XML semantic structure orientation type and categorized it into four types: Pure XML Web pages, RSS XML Web pages, HTML Embedded XML Web pages, and Code Based/Sitemap XML Web pages. Fig. 1 depicts the XML URL categories. In this article we focus on XML URL classification, proposing a new method based on semantic orientation for future Web mining applications such as Web page segmentation, noise removal, Web page adaptation, and Search Engine Optimization (SEO).
Fig. 1 Dataset Analysis and Classification
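The four categories can be illustrated with a minimal sketch that inspects the parsed document. This is not the algorithm proposed in this paper: the function name `classify_xml` and every rule in it (root element `rss` for RSS pages, `urlset` or `sitemapindex` for sitemap pages, any element in the XHTML namespace for HTML Embedded XML pages) are assumptions chosen only to make the category boundaries concrete.

```python
import xml.etree.ElementTree as ET

# Assumption: embedded HTML is recognised by the XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def local_name(tag: str) -> str:
    """Strip a leading '{namespace}' from an ElementTree tag and lowercase it."""
    return tag.rsplit("}", 1)[-1].lower()


def classify_xml(text: str) -> str:
    """Assign one of the four category names using hypothetical rules."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "Unknown"  # not well-formed XML

    name = local_name(root.tag)
    if name == "rss":
        return "RSS XML"
    if name in {"urlset", "sitemapindex"}:
        return "Code Based/Sitemap XML"
    # Any element in the XHTML namespace counts as embedded HTML here.
    if name == "html" or any(el.tag.startswith(XHTML_NS) for el in root.iter()):
        return "HTML Embedded XML"
    return "Pure XML"


if __name__ == "__main__":
    print(classify_xml('<rss version="2.0"><channel/></rss>'))
    print(classify_xml("<note><to>A</to></note>"))
```

A real classifier would need rules derived from the dataset itself, since a root-element check alone cannot capture every page in these categories.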
Contribution: In light of the deficiency of the above-mentioned manual process, in this paper we propose an algorithm to classify XML URLs based on their semantic structure orientation. We then analyze the system's accuracy through extensive experiments using accuracy measures such as Precision and Recall. Experimental results show that the proposed method achieves an overall accuracy of 97.36%.
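For reference, per-category Precision and Recall can be computed as in the following minimal sketch. The helper name `precision_recall` and the toy labels are illustrative assumptions and are not the experimental data behind the reported accuracy.

```python
from collections import Counter


def precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for a multi-class labelling."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == predicted:
            tp[truth] += 1
        else:
            fp[predicted] += 1  # predicted this label, but it was wrong
            fn[truth] += 1      # missed the true label

    scores = {}
    for label in set(true_labels) | set(predicted_labels):
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / actual_total if actual_total else 0.0
        scores[label] = (precision, recall)
    return scores


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    truth = ["rss", "rss", "pure", "sitemap"]
    predicted = ["rss", "pure", "pure", "sitemap"]
    for label, (p, r) in sorted(precision_recall(truth, predicted).items()):
        print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

Precision is the fraction of URLs assigned to a category that truly belong to it, and Recall is the fraction of URLs in a category that the classifier recovers.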
Organization: After providing basic information about XML URLs and their relevance to research in Section 1, we present related work in Section 2. We present the knowledge base creation method for XML URL classification in Section 3. In Section 4, we describe the training and testing phases of the proposed system, and in Section 5 we present the results and analysis of experiments conducted on the proposed system using our XML URL dataset [4].
2. Related Works
In 2003, the Vision-based Page Segmentation (VIPS) algorithm [3] was proposed to extract the semantic structure of a Web page. The semantic structure is a hierarchical structure in which each node corresponds to a block and is assigned a value indicating its degree of coherence based on visual perception. It may not work well, and in many cases the weights of visual separators are measured inaccurately, because it does not take document object model (DOM) tree information into account and because blocks are sometimes not visibly different.
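The hierarchical structure described above can be pictured with a minimal data-structure sketch. The class `Block`, the helper `blocks_at_threshold`, the 0 to 1 coherence scale and the threshold-based cut are all assumptions made for illustration; this is not the VIPS implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """One node of the semantic structure: a page block and its coherence."""
    label: str
    coherence: float  # assumed scale: 0.0 (loose) to 1.0 (fully coherent)
    children: List["Block"] = field(default_factory=list)


def blocks_at_threshold(block: Block, threshold: float) -> List[Block]:
    """Return the coarsest blocks whose coherence reaches the threshold.

    Leaf blocks are returned even if they fall below the threshold,
    because they cannot be divided further.
    """
    if block.coherence >= threshold or not block.children:
        return [block]
    result: List[Block] = []
    for child in block.children:
        result.extend(blocks_at_threshold(child, threshold))
    return result


if __name__ == "__main__":
    page = Block("page", 0.2, [
        Block("header", 0.9),
        Block("body", 0.4, [Block("article", 0.8), Block("sidebar", 0.7)]),
    ])
    print([b.label for b in blocks_at_threshold(page, 0.6)])
```

Raising the threshold yields a finer segmentation, and lowering it keeps larger blocks intact.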
Gestalt theory [5] is a psychological theory that can explain the human visual perceptive process. Four basic laws (Proximity, Similarity, Closure and Simplicity) are drawn from Gestalt theory and implemented in a program to simulate how humans understand the layout of Web pages. A graph-theoretic approach [6] was introduced that uses the DOM tree to determine which nodes should be placed together. The authors of [7] proposed a novel Web page segmentation algorithm based on finding the Gomory-Hu tree in a planar graph. The algorithm first distils vision and structure information from a Web page to construct a weighted undirected graph, whose vertices are the leaf nodes of the DOM tree and whose edges represent the visible position relationships between vertices. It then partitions the graph with a Gomory-Hu tree based clustering algorithm. Since the graph is planar, the algorithm is very efficient.