Efficient Web Content Mining using DOM Intersection Method: A Website Template Detection Approach


Unsupervised extraction of web data from the millions of websites available on the World Wide Web is a challenging job. Many methodologies have been proposed in this area, but they show serious limitations when faced with the enormous growth of information on the internet, so a more robust mechanism is required. Many commercial web pages display information in which the user has no interest. Such information is called noise. Before actual data extraction can proceed, this noise must be removed so that extraction at later stages is efficient. This paper presents a Document Object Model (DOM) tree intersection technique for detecting the template of a particular website. Once the template is detected, the information common to the site's pages is removed and only the useful information remains. In this method, a training web page is taken from a particular website and its DOM tree is prepared. The DOM tree of the training page is then intersected with that of another web page of the same website, leaving only the actual data presented on that page. Once the web page template is detected, wrapper generation rules for data extraction can be formed easily, which avoids further intersection of web pages when grabbing information. The wrapper is applied directly to all other web pages of that website for data extraction. This method is more efficient than the web page segmentation process used by earlier researchers for automatic web data extraction.
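The intersection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a DOM node can be approximated by the pair (tag path from the root, text content), and that any such pair found in both the training page and the target page belongs to the template. The names PathCollector, dom_items and remove_template, the VOID_TAGS set, and the sample pages are all hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Records every non-empty text node with its tag path from the root."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.items = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def dom_items(html):
    """Flatten a page into a list of (tag path, text) pairs."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def remove_template(training_html, page_html):
    """Drop every node of page_html that also occurs in training_html.

    Assumption: nodes shared by both pages are template (noise);
    what remains is the page-specific data.
    """
    template = set(dom_items(training_html))
    return [item for item in dom_items(page_html) if item not in template]


# Hypothetical pages from the same website: shared menu and footer.
training_page = (
    "<html><body><div><a>Home</a><a>About</a></div>"
    "<p>Article one text</p><span>Copyright 2015</span></body></html>"
)
other_page = (
    "<html><body><div><a>Home</a><a>About</a></div>"
    "<p>Article two text</p><span>Copyright 2015</span></body></html>"
)

for path, text in remove_template(training_page, other_page):
    print(path, "->", text)
```

In this sketch the tag paths of the surviving nodes (here html/body/p) indicate where page-specific data lives, which is the kind of information a wrapper rule could reuse on other pages of the same site without repeating the intersection. A real system would need a proper tree matching step and would have to handle attributes, repeated siblings and malformed HTML.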

Published In : IJCAT Journal Volume 2, Issue 10

Date of Publication : October 2015

Pages : 419 - 426

Figures : 05

Tables : 05

Publication Link : Efficient Web Content Mining using DOM Intersection Method: A Website Template Detection Approach

Shaikh Phiroj Chhaware : Research Scholar, G.H. Raisoni College of Engineering, Nagpur 440019 (MS) INDIA

Dr. Mohammad Atique : Associate Professor, Dept. of Computer Science & Engineering, S.G.B. Amravati University, Amravati (MS) INDIA

Dr. L. G. Malik : Professor, Department of Computer Science & Engineering, G.H. Raisoni College of Engineering, Nagpur 440019 (MS) INDIA

Webpage Template

Web Data Extraction

Document Object Model Tree

Intersection, Wrapper Generation

Dynamic Web Pages

Structured Data

Tree Matching

Information Filtering

Noise Removal

The work presented here accomplishes web page template detection, thereby improving the results of subsequent content mining. The results achieved are largely correct, but many areas still need to be enhanced. With the enormous growth of websites, more commercial data is arriving and more web databases are being connected to the internet every day to serve people's demand for data. Hence more robust techniques for efficient web content mining are needed.
