Unsupervised extraction of web data from the millions of
websites available on the World Wide Web is a challenging
task. Many methodologies have been proposed in this area,
but they show serious limitations in the face of the
enormous growth of information on the internet, so a more
robust mechanism is required. Many commercial websites
display information on their pages in which the user is not
at all interested. Such information is called noise. This
noise must be removed before the actual data extraction so
that the later stages of extraction remain efficient. This
paper presents a Document Object Model (DOM) tree
intersection technique for detecting the template of a
particular website. Once the template has been detected, the
information common to the site's pages is removed and only
the useful information remains. In this method, a training
web page is taken from a particular website and its DOM tree
is prepared. The DOM tree of the training page is then
intersected with that of another web page from the same
website, leaving only the actual data presented on that
page. Once the web page template is detected, wrapper
generation rules for data extraction can be formed easily,
which avoids further intersection of web pages when grabbing
information. The wrapper is then applied directly to all
other web pages of that website for data extraction. This
method is more efficient than the web page segmentation
process used by earlier researchers for automatic web data
extraction.
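
The intersection idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that a DOM node can be approximated by its tag path plus its text, that template nodes are exactly those appearing in both the training page and the target page, and that a wrapper is simply the set of tag paths where page-specific data was found. The names `PathCollector`, `text_nodes`, `extract_content`, `build_wrapper` and `apply_wrapper` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.nodes = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    """Parse an HTML string and return its (path, text) pairs."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def extract_content(training_html, target_html):
    """Drop target nodes shared with the training page (assumed template)."""
    template = set(text_nodes(training_html))
    return [node for node in text_nodes(target_html) if node not in template]


def build_wrapper(content_nodes):
    """Assumed wrapper form: the set of tag paths that held real data."""
    return {path for path, _ in content_nodes}


def apply_wrapper(wrapper, html):
    """Extract data from a new page by path lookup, with no intersection."""
    return [text for path, text in text_nodes(html) if path in wrapper]


# Illustrative pages from one hypothetical site: same layout, different data.
PAGE = ("<html><body><div><a>Home</a><a>About</a></div>"
        "<h1>{name}</h1><p>Price: {price}</p>"
        "<div>Copyright</div></body></html>")
training_page = PAGE.format(name="Red Lamp", price=10)
target_page = PAGE.format(name="Blue Chair", price=25)
third_page = PAGE.format(name="Green Desk", price=40)

content = extract_content(training_page, target_page)
wrapper = build_wrapper(content)
print(content)
print(apply_wrapper(wrapper, third_page))
```

In this sketch the navigation links and the copyright notice occur in both pages and are filtered out as noise, while the heading and price survive. A data value that happens to be identical on both pages would be wrongly treated as template, which is one reason a real system would need a more careful tree matching step.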
Published In : IJCAT Journal Volume 2, Issue 10
Date of Publication : October 2015
Pages : 419 - 426
Figures : 05
Tables : 05
Publication Link : Efficient Web Content Mining using DOM Intersection
Method: A Website Template Detection Approach
Shaikh Phiroj Chhaware : Research Scholar, G.H. Raisoni College of Engineering,
Nagpur 440019 (MS) INDIA
Dr. Mohammad Atique : Associate Professor, Dept. of Computer Science & Engineering,
S.G.B. Amravati University, Amravati (MS) INDIA
Dr. L. G. Malik : Professor, Department of Computer Science & Engineering, G.H. Raisoni College of Engineering,
Nagpur 440019 (MS) INDIA
Webpage Template
Web Data Extraction
Document Object Model Tree Intersection
Wrapper Generation
Dynamic Web Pages
Structured Data
Tree Matching
Information Filtering
Noise Removal
The work presented here accomplishes web page template
detection, thereby improving the results of subsequent
content mining. The results achieved are substantially
correct, but there are many areas where the method needs to
be enhanced. With the enormous growth of websites, more
commercial data is being published and more web databases
are being connected to the internet day by day to serve
people's data demands. Hence, more robust techniques for
efficient web content mining are needed.