Programs that perform the task of IE are referred to as extractors or wrappers. A wrapper was originally defined as a component in an information integration system which aims at providing a single uniform query interface to access multiple information sources. In an information integration system, a wrapper is generally a program that “wraps” an information source (e.g. a database server, or a Web server) such that the information integration system can access that information source without changing its core query answering mechanism.
Wrapper induction (WI) or information extraction (IE) systems are software tools that are designed to generate wrappers.
Source: Survey of Web Information Extraction Systems
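The wrapper idea described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the `Wrapper` interface, the two source classes, the page template, and `integrated_query` are all hypothetical names invented here, not part of any system named in the survey. The point is that the integration layer calls one `query` method and never needs to know whether a source is a database or a Web page.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List


class Wrapper(ABC):
    """Hypothetical uniform interface used by the integration system."""

    @abstractmethod
    def query(self, keyword: str) -> List[Dict]:
        """Return records from the wrapped source that match the keyword."""


class InMemoryDbWrapper(Wrapper):
    """Wraps a database-like source (here simply a list of (title, price) rows)."""

    def __init__(self, rows):
        self._rows = rows

    def query(self, keyword):
        return [
            {"title": title, "price": price}
            for title, price in self._rows
            if keyword.lower() in title.lower()
        ]


class HtmlPageWrapper(Wrapper):
    """Wraps a Web source whose page follows an assumed fixed template."""

    # Assumed template: <li><b>TITLE</b> - $PRICE</li>
    _ROW = re.compile(r"<li><b>(.*?)</b> - \$(.*?)</li>")

    def __init__(self, html):
        self._html = html

    def query(self, keyword):
        return [
            {"title": title, "price": float(price)}
            for title, price in self._ROW.findall(self._html)
            if keyword.lower() in title.lower()
        ]


def integrated_query(wrappers, keyword):
    """The integration system's query logic is the same for every source."""
    results = []
    for wrapper in wrappers:
        results.extend(wrapper.query(keyword))
    return results
```

Adding a new source then means writing one more `Wrapper` subclass; `integrated_query` itself stays unchanged, which is the property the definition above emphasizes.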
Traditional IE aims at extracting data from totally unstructured free texts that are written in natural language. Web IE, in contrast, processes online documents that are semi-structured and usually generated automatically by a server-side application program. As a result, traditional IE usually takes advantage of NLP techniques such as lexicons and grammars, whereas Web IE usually applies machine learning and pattern mining techniques to exploit the syntactical patterns or layout structures of the template-based documents.
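To make "exploiting the syntactical patterns of template-based documents" concrete, here is a minimal sketch of delimiter-based extraction. The function name `extract_lr` and the sample page are assumptions made for illustration. The delimiters are written by hand here, whereas a wrapper induction system would learn them from example pages.

```python
def extract_lr(page, delimiters):
    """Extract tuples by scanning for (left, right) delimiter pairs in order.

    Each pair in `delimiters` marks one attribute; the pairs are applied
    repeatedly until the page contains no further match.
    """
    records = []
    if not delimiters:
        return records
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no more records on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records  # incomplete record, stop here
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Hypothetical template-generated page: each record is <b>NAME</b> <i>CODE</i>
page = "<html><body><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></body></html>"
rows = extract_lr(page, [("<b>", "</b>"), ("<i>", "</i>")])
```

No lexicon or grammar is involved: the extraction relies entirely on the markup that the server-side template repeats around every record.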
There are five main tasks defined for text IE: named entity recognition, coreference resolution, template element construction, template relation construction, and scenario template production.
RISE (Repository of Online Information Sources Used in Information Extraction Tasks).
Classification
Given the significance of the Message Understanding Conferences (MUCs), some researchers classify IE approaches as MUC approaches and post-MUC approaches.
MUC Approaches:
- AutoSlog
- LIEP
- PALKA
- HASTEN
- CRYSTAL
Post-MUC Approaches:
- WHISK
- RAPIER
- SRV
- WIEN
- SoftMealy
- STALKER
Chang classified IE tools by degree of automation into four distinct categories: systems that need programmers, systems that need annotation examples, annotation-free systems, and semi-supervised systems.
Muslea classified IE tools into 3 classes, according to what their extraction rules are based on:
- Syntactic/semantic constraints
- Delimiters
- Both syntactic/semantic constraints and delimiters
Laender proposed a taxonomy for data extraction tools based on the main technique used by each tool to generate a wrapper:
- Languages for wrapper development
- HTML-aware tools
- NLP-based tools
- Wrapper induction tools
- Modeling-based tools
- Ontology-based tools
Sarawagi classified HTML wrappers into 3 categories according to the kind of extraction task:
- Record-level wrappers: exploit regularities to discover record boundaries and then extract the elements of a single list of homogeneous records from a page.
- Page-level wrappers: extract elements of multiple kinds of records.
- Site-level wrappers: populate a database from the pages of a Web site.
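The record-level case in the list above can be sketched with Python's standard `html.parser`. This is a simplified illustration, not Sarawagi's method: it assumes that record boundaries coincide with `<tr>` elements and fields with `<td>` elements, whereas a real record-level wrapper would have to discover those boundaries from regularities in the page. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RecordLevelWrapper(HTMLParser):
    """Collects one list of homogeneous records from a single page."""

    def __init__(self, record_tag="tr", field_tag="td"):
        super().__init__()
        self.record_tag = record_tag  # assumed record boundary
        self.field_tag = field_tag    # assumed field boundary
        self.records = []
        self._current = None          # fields of the record being read
        self._in_field = False

    def handle_starttag(self, tag, attrs):
        if tag == self.record_tag:
            self._current = []
        elif tag == self.field_tag and self._current is not None:
            self._in_field = True
            self._current.append("")

    def handle_data(self, data):
        if self._in_field:
            self._current[-1] += data

    def handle_endtag(self, tag):
        if tag == self.field_tag:
            self._in_field = False
        elif tag == self.record_tag and self._current is not None:
            # Rows without any field (e.g. a header row of <th> cells) are skipped.
            if self._current:
                self.records.append([field.strip() for field in self._current])
            self._current = None
            self._in_field = False


def extract_records(html):
    """Run the wrapper over one page and return its records."""
    parser = RecordLevelWrapper()
    parser.feed(html)
    parser.close()
    return parser.records
```

A page-level wrapper would need several such record definitions for the same page, and a site-level wrapper would additionally have to follow links and load the results into a database.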