Webpage Understanding: an Integrated Approach by Jun Zhu @VideoLectures

English

Recent work has shown that layout and tag-tree structure are effective for segmenting webpages and labeling HTML elements. However, how to segment and label the text inside HTML elements remains an open problem. Because much of the text on a webpage consists of fragments that are not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are not directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand the text content of webpages. We propose to segment and label both the page structure and the text content of a webpage in a single joint discriminative probabilistic model. In this model, semantic labels of the page structure help text content understanding, and semantic labels of text phrases help page structure understanding tasks such as data record detection. Integrating page structure and text content understanding in this way yields a unified solution to webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.
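The idea of labeling page structure and text jointly, rather than one after the other, can be illustrated with a toy sketch. This is not the model from the paper: the label sets, the scores, the compatibility table, and the brute-force search below are all hypothetical, chosen only to show how a structure label and phrase labels can influence each other when they are scored together.

```python
from itertools import product

# Hypothetical label sets for one HTML block and the tokens inside it.
BLOCK_LABELS = ("contact", "publication")
TOKEN_LABELS = ("NAME", "TITLE", "OTHER")


def joint_score(block_label, token_labels, block_scores, token_scores, compat):
    """Sum the layout score, the per-token text scores, and the
    block/token compatibility scores for one full labeling."""
    score = block_scores[block_label]
    for i, token_label in enumerate(token_labels):
        score += token_scores[i][token_label]
        score += compat[(block_label, token_label)]
    return score


def independent_decode(token_scores):
    """Label each token from its text score alone (no structure)."""
    return tuple(max(scores, key=scores.get) for scores in token_scores)


def joint_decode(block_scores, token_scores, compat):
    """Brute-force search for the highest-scoring joint labeling.
    Only practical for tiny examples; used here for clarity."""
    best = None
    for block_label in BLOCK_LABELS:
        for token_labels in product(TOKEN_LABELS, repeat=len(token_scores)):
            s = joint_score(block_label, token_labels,
                            block_scores, token_scores, compat)
            if best is None or s > best[0]:
                best = (s, block_label, token_labels)
    return best[1], best[2]


# Made-up scores: layout strongly suggests a contact block,
# while the first token's text score is ambiguous.
block_scores = {"contact": 2.0, "publication": 0.0}
token_scores = [
    {"NAME": 0.4, "TITLE": 0.5, "OTHER": 0.0},
    {"NAME": 1.0, "TITLE": 0.2, "OTHER": 0.0},
]
compat = {
    ("contact", "NAME"): 1.0, ("contact", "TITLE"): -1.0,
    ("contact", "OTHER"): 0.0,
    ("publication", "NAME"): -1.0, ("publication", "TITLE"): 1.0,
    ("publication", "OTHER"): 0.0,
}

print(independent_decode(token_scores))
print(joint_decode(block_scores, token_scores, compat))
```

With these made-up numbers, labeling the tokens independently picks TITLE for the first token, while the joint search labels the block as a contact block and both tokens as NAME, because the structure label and the phrase labels are scored together.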


Attribution: The Open Education Consortium
http://www.ocwconsortium.org/courses/view/00532fd78e8fc37d383de84d321b12a2/
Course Home http://videolectures.net/kdd07_zhu_wu/
