AAAI 1999
AI & the World Wide Web
Recognizing Structure in Web Pages Using Similarity Queries
Abstract
We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the structure ranked first by our method is "meaningful" (i.e., a structure that was used in a hand-coded "wrapper", or extraction program, for the page) nearly 70% of the time. This improves on a value of 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or for maintaining a wrapper as a Web page changes format. In these settings, the top-ranked structure is meaningful nearly 85% of the time.
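The abstract does not say which similarity measure is used. As an illustration only, the sketch below assumes a common information-retrieval choice, cosine similarity over TF-IDF term vectors, and ranks candidate text fragments against a query string. The function names (`tfidf_vectors`, `cosine`, `rank_by_similarity`) and the whitespace tokenizer are hypothetical and not taken from the paper or from WHIRL.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a dict) per document."""
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Log-scaled term frequency times inverse document frequency.
        vec = {t: (1 + math.log(c)) * math.log(n_docs / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def rank_by_similarity(query, candidates):
    """Rank candidate strings by TF-IDF cosine similarity to the query."""
    vectors = tfidf_vectors([query] + candidates)
    q, rest = vectors[0], vectors[1:]
    scored = [(cosine(q, v), c) for v, c in zip(rest, candidates)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    fragments = ["annual report of acme corp", "contact us", "acme home page"]
    for score, text in rank_by_similarity("acme corp annual report", fragments):
        print(f"{score:.3f}  {text}")
```

Terms that are rare across the collection get high weight, so fragments that share distinctive words with the query rank above fragments that share only common ones. A fragment with no terms in common scores exactly zero.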
Authors
Context
- Venue: AAAI Conference on Artificial Intelligence
- Archive span: 1980-2026
- Indexed papers: 28718
- Paper id: 728602220295830593