Friday, August 27, 2010

Information Extraction versus Information Retrieval

Information extraction (IE) differs from information retrieval (IR). IR concerns how to identify relevant documents in a document collection, whereas IE produces structured data ready for post-processing, which is crucial to many Web mining and search applications.

Programs that perform the task of IE are referred to as extractors or wrappers. A wrapper was originally defined as a component in an information integration system which aims at providing a single uniform query interface to access multiple information sources. In an information integration system, a wrapper is generally a program that “wraps” an information source (e.g. a database server, or a Web server) such that the information integration system can access that information source without changing its core query answering mechanism.
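
To make the "wrapping" idea concrete, here is a minimal sketch in Python. All class and function names (ListSourceWrapper, TextSourceWrapper, integrated_query) are made up for illustration, and the two sources are in-memory stand-ins for a database server and a Web server. The point is only that the integration layer calls the same query method on every source.

```python
class ListSourceWrapper:
    """Wraps an in-memory list of rows (stand-in for a database server)."""

    def __init__(self, rows):
        self._rows = rows

    def query(self, field, value):
        # Rows are already structured; just filter them.
        return [dict(r) for r in self._rows if r.get(field) == value]


class TextSourceWrapper:
    """Wraps 'title|author' text lines (stand-in for a Web server response)."""

    def __init__(self, text):
        self._text = text

    def query(self, field, value):
        # Turn the raw text into rows first, then filter like any other source.
        rows = []
        for line in self._text.strip().splitlines():
            title, author = line.split("|")
            rows.append({"title": title, "author": author})
        return [r for r in rows if r.get(field) == value]


def integrated_query(wrappers, field, value):
    """Hypothetical integration layer: one uniform call per source."""
    results = []
    for wrapper in wrappers:
        results.extend(wrapper.query(field, value))
    return results


sources = [
    ListSourceWrapper([{"title": "Book A", "author": "Smith"}]),
    TextSourceWrapper("Book B|Smith\nBook C|Jones"),
]
hits = integrated_query(sources, "author", "Smith")
print([h["title"] for h in hits])
```

Adding a new source then means writing one more wrapper class; integrated_query itself does not change.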

Wrapper induction (WI) systems, also called information extraction (IE) systems, are software tools designed to generate wrappers.

Source: Survey of Web Information Extraction Systems
Traditional IE aims at extracting data from totally unstructured free texts that are written in natural language. Web IE, in contrast, processes online documents that are semi-structured and usually generated automatically by a server-side application program. As a result, traditional IE usually takes advantage of NLP techniques such as lexicons and grammars, whereas Web IE usually applies machine learning and pattern mining techniques to exploit the syntactical patterns or layout structures of the template-based documents.
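
As a rough illustration of how Web IE can exploit the fixed layout of template-generated pages, the sketch below extracts records using only pairs of left and right delimiters. It is a simplified, hand-written rule in the spirit of delimiter-based wrappers, not the algorithm of any particular system; the page, the delimiters and the function name extract_records are assumptions made for the example.

```python
def extract_records(page, delimiters):
    """Pull one value per (left, right) delimiter pair, record after record."""
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records  # unterminated field; stop here
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Made-up page, as a server-side template might generate it.
page = (
    "<html><body>"
    "<li><b>Congo</b> <i>242</i></li>"
    "<li><b>Egypt</b> <i>20</i></li>"
    "<li><b>Spain</b> <i>34</i></li>"
    "</body></html>"
)
rules = [("<b>", "</b>"), ("<i>", "</i>")]
countries = extract_records(page, rules)
print(countries)
```

No lexicon or grammar is involved: the rule works only because every record is rendered by the same template. A wrapper induction system would learn delimiters like these from examples instead of having them written by hand.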

Five main tasks are defined for text IE: named entity recognition, coreference resolution, template element construction, template relation construction, and scenario template production.

RISE (Repository of Online Information Sources Used in Information Extraction Tasks).


Relative to the Message Understanding Conferences (MUCs), IE systems have been classified as MUC approaches and post-MUC approaches.

MUC Approaches:
  1. AutoSlog
  2. LIEP
  3. PALKA
Post MUC Approaches:
  1. WHISK
  2. SRV
  3. WIEN
  4. SoftMealy
Hsu and Dung classified wrappers into four categories: hand-crafted wrappers using general programming languages, specially designed programming languages or tools, heuristic-based wrappers, and WI approaches.

Chang classified IE tools by their degree of automation into four distinct categories: systems that need programmers, systems that need annotated examples, annotation-free systems, and semi-supervised systems.

Muslea classified IE tools into three classes, according to whether they are based on:
  1. Syntactic/Semantic Constraints
  2. Delimiters
  3. Both 1 and 2.
Kushmerick classified many of the IE tools into two distinct categories: finite-state tools and relational learning tools.

Laender proposed a taxonomy for data extraction tools based on the main technique each tool uses to generate a wrapper:
  1. Languages for Wrapper Development.
  2. HTML-Aware Tools.
  3. NLP-Based tools.
  4. Wrapper Induction tools.
  5. Modeling based tools.
  6. Ontology based tools.
Laender compared the tools using the following seven features: degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, and resilience and adaptiveness (counted as one feature).

Sarawagi classified HTML wrappers into three categories according to the kind of extraction task:
  1. Record-level wrappers exploit regularities to discover record boundaries and then extract the elements of a single list of homogeneous records from a page.
  2. Page-level wrappers extract elements of multiple kinds of records.
  3. Site-level wrappers populate a database from the pages of a Web site.
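
The record-level case can be sketched in a few lines of Python. The heuristic used here to find record boundaries (take the first tag, in document order, that occurs at least twice) is an assumption made for this example, not a method from Sarawagi or the survey. It also assumes the record tag is not nested inside itself. The function names and the sample page are made up.

```python
import re
from collections import Counter


def find_record_tag(page):
    """Guess the record boundary: first opening tag that repeats on the page."""
    tags = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", page)
    counts = Counter(tags)
    for tag in tags:
        if counts[tag] >= 2:
            return tag
    return None


def extract_records(page):
    """Split the page at the guessed record tag and list each record's fields."""
    tag = find_record_tag(page)
    if tag is None:
        return []
    pattern = r"<%s\b[^>]*>(.*?)</%s>" % (tag, tag)
    records = []
    for chunk in re.findall(pattern, page, flags=re.S):
        # Drop the inner markup; the remaining text pieces are the fields.
        fields = [t.strip() for t in re.split(r"<[^>]+>", chunk) if t.strip()]
        records.append(fields)
    return records


page = (
    "<table>\n"
    "<tr><td>Congo</td><td>242</td></tr>\n"
    "<tr><td>Egypt</td><td>20</td></tr>\n"
    "</table>"
)
print(find_record_tag(page))
print(extract_records(page))
```

On this page the repeated tr tag is taken as the record boundary, and each row becomes one homogeneous record. A page-level wrapper would need to tell several kinds of records apart. A site-level wrapper would also have to follow links between pages.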

