With the proliferation of online repositories (e.g.,
databases or document corpora) hidden behind proprietary web
interfaces, such as keyword- or form-based search and
hierarchical or graph-based browsing interfaces, efficient ways of
exploring the contents of such hidden repositories are of increasing
importance. There are two key challenges: properly
understanding the interfaces, and efficiently
exploring very large repositories through, e.g., crawling, sampling,
and analytical processing. In this tutorial, we focus on the
fundamental developments in the field, including web interface
understanding, crawling, sampling, and data analytics over
web repositories with various types of interfaces and
containing structured or unstructured data. Our goal is to
encourage the audience to initiate their own research in these
exciting areas.
We shall summarize how the challenging problems of
crawling, sampling, and analytics over hidden web
repositories require expertise in traditional query
processing, information retrieval (IR), social networks, data
mining, and algorithms. We shall conclude by identifying open
challenges.