Friday, August 27, 2010

Information Extraction versus Information Retrieval

Information extraction (IE) differs from information retrieval (IR). IR concerns how to identify relevant documents in a document collection, whereas IE produces structured data ready for post-processing, which is crucial to many applications of Web mining and searching tools.

Programs that perform the task of IE are referred to as extractors or wrappers. A wrapper was originally defined as a component in an information integration system which aims at providing a single uniform query interface to access multiple information sources. In an information integration system, a wrapper is generally a program that “wraps” an information source (e.g. a database server, or a Web server) such that the information integration system can access that information source without changing its core query answering mechanism.

Wrapper induction (WI) or information extraction (IE) systems are software tools that are designed to generate wrappers.


Source: Survey of Web Information Extraction Systems
Traditional IE aims at extracting data from totally unstructured free texts that are written in natural language. Web IE, in contrast, processes online documents that are semi-structured and usually generated automatically by a server-side application program. As a result, traditional IE usually takes advantage of NLP techniques such as lexicons and grammars, whereas Web IE usually applies machine learning and pattern mining techniques to exploit the syntactical patterns or layout structures of the template-based documents.
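
To make the Web IE side concrete, here is a minimal Python sketch of a delimiter-based extractor for template-generated pages. The sample page, the field names, and the helpers extract_between and wrap are hypothetical and are not taken from the survey or from any named system. The sketch only illustrates how a fixed left/right delimiter pair can pull one field out of every record when the markup repeats.

```python
def extract_between(page, left, right):
    """Return every substring of page found between a left and a right delimiter."""
    values = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end].strip())
        pos = end + len(right)
    return values


# Hypothetical template-generated page: every record uses the same markup.
SAMPLE_PAGE = (
    "<ul>"
    "<li><b>Congo</b> <i>242</i></li>"
    "<li><b>Egypt</b> <i>20</i></li>"
    "<li><b>Spain</b> <i>34</i></li>"
    "</ul>"
)

# One (left, right) delimiter pair per field; the field names are assumptions.
RULES = {"country": ("<b>", "</b>"), "code": ("<i>", "</i>")}


def wrap(page, rules=RULES):
    """Apply every delimiter rule, then zip the resulting columns into records."""
    columns = {name: extract_between(page, l, r) for name, (l, r) in rules.items()}
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*(columns[n] for n in names))]


if __name__ == "__main__":
    for record in wrap(SAMPLE_PAGE):
        print(record)
```

Hand-written rules like these break as soon as the template changes, which is the maintenance burden that wrapper induction systems try to reduce by learning the delimiters from examples.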

There are five main tasks defined for text IE: named entity recognition, coreference resolution, template element construction, template relation construction, and scenario template production.

RISE (Repository of Online Information Sources Used in Information Extraction Tasks).

Classification

Relative to the Message Understanding Conferences (MUCs), IE systems have been classified as MUC approaches and post-MUC approaches.

MUC Approaches:
  1. AutoSlog
  2. LIEP
  3. PALKA
  4. HASTEN
  5. CRYSTAL
Post MUC Approaches:
  1. WHISK
  2. RAPIER
  3. SRV
  4. WIEN
  5. SoftMealy
  6. STALKER
Hsu and Dung classified wrappers into 4 categories: hand-crafted wrappers using general programming languages, specially designed programming languages or tools, heuristic-based wrappers, and WI approaches.

Chang classified IE tools by degree of automation into four distinct categories: systems that need programmers, systems that need annotation examples, annotation-free systems, and semi-supervised systems.

Muslea classified IE tools into 3 classes:
IE Tools based on
  1. Syntactic/Semantic Constraints
  2. Delimiters
  3. Both 1 and 2.
Kushmerick classified many of the IE tools into two distinct categories: finite-state tools and relational learning tools.

Laender proposed a taxonomy for data extraction tools based on the main technique each tool uses to generate a wrapper:
  1. Languages for Wrapper Development.
  2. HTML-Aware Tools.
  3. NLP-Based tools.
  4. Wrapper Induction tools.
  5. Modeling based tools.
  6. Ontology based tools.
Laender compared the tools using the following 7 features: degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, and resilience and adaptiveness.

Sarawagi classified HTML wrappers into 3 categories according to the kind of extraction tasks.
  1. Record-level wrappers - exploit regularities to discover record boundaries and then extract elements of a single list of homogeneous records from a page.
  2. Page-level wrappers - extract elements of multiple kinds of records.
  3. Site-level wrappers - populate a database from the pages of a Web site.
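
As an illustration of the record-level case, the following Python sketch uses the standard library's html.parser to treat each table row as a record boundary and each cell as a field. The sample table, the class name RecordLevelWrapper, and the choice of the tr and td tags as the regularity are assumptions made for this example. Sarawagi's classification does not prescribe any particular implementation.

```python
from html.parser import HTMLParser


class RecordLevelWrapper(HTMLParser):
    """Treat each <tr> as one record and each <td> inside it as one field."""

    def __init__(self):
        super().__init__()
        self.records = []   # completed records, one list of fields per row
        self._row = None    # fields of the row being read, None outside a row
        self._cell = None   # text chunks of the cell being read, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells, such as header rows
                self.records.append(self._row)
            self._row = None
            self._cell = None


def extract_records(page):
    """Return the list of records found in one page."""
    wrapper = RecordLevelWrapper()
    wrapper.feed(page)
    wrapper.close()
    return wrapper.records


# Hypothetical page holding a single list of homogeneous records.
SAMPLE = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Book A</td><td>12.50</td></tr>
  <tr><td>Book B</td><td>8.00</td></tr>
</table>
"""

if __name__ == "__main__":
    for record in extract_records(SAMPLE):
        print(record)
```

A page-level wrapper would need several such record patterns on the same page, and a site-level wrapper would additionally have to follow links and map the results to database tables.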





Friday, August 13, 2010

Github Configurations

GETTING A PUBLIC KEY

First, check whether an SSH key directory exists.

$ cd ~/.ssh
$ ls
config          id_rsa.pub
id_rsa          known_hosts
$ mkdir key_backup
$ cp id_rsa* key_backup
$ rm id_rsa*

Here we have an existing keypair, id_rsa and id_rsa.pub, which we’ve copied into ~/.ssh/key_backup before removing. By default, ssh will use keys in ~/.ssh that are named id_rsa, id_dsa or identity.

Generating a key

If you have an existing keypair you wish to use, you can skip this step.

Now that we’re certain ssh won’t use an existing key, it’s time to generate a new keypair. Let’s make an RSA keypair:

$ ssh-keygen -t rsa -C "tekkub@gmail.com"
Generating public/private rsa key pair.
Enter file in which to save the key (/home/tekkub/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/tekkub/.ssh/id_rsa.
Your public key has been saved in /home/tekkub/.ssh/id_rsa.pub.
The key fingerprint is:
01:0f:f4:3b:ca:85:d6:17:a1:7d:f0:68:9d:f0:a2:db tekkub@gmail.com

At the first prompt you can just hit enter to generate the key with the default name. You should use a good passphrase with your key. See Working with SSH key passphrases for more details on why you should use a passphrase and how to avoid re-entering it every time you use your key.

Note: If you don’t use the default key names, or store your keys in a different path, you may need to run ssh-add path/to/my_key so that ssh knows where to find your key.

Adding the key to your GitHub account

Now launch your browser and open the account page. In the “SSH Public Keys” section click “add another public key”, then paste your public key into the “key” field. If you leave the title blank the key comment (your email) will be used for the title.

REPOSITORY SETUP:

Global setup:

Download and install Git
git config --global user.name "Vamshi Krishna Reddy V"
git config --global user.email vamshi4001@gmail.com

Next steps:

mkdir Twinsight
cd Twinsight
git init
touch README
git add README
git commit -m 'first commit'
git remote add origin git@github.com:vamshi4001/Twinsight.git
git push origin master

Existing Git Repo?

cd existing_git_repo
git remote add origin git@github.com:vamshi4001/Twinsight.git
git push origin master

Import a Subversion Repository

  • If the repo you are importing is very large, your import may time out.
  • If your subversion repository contains a non-standard directory structure, this import process will probably not work for you.
  • This service currently only supports public subversion repositories.
  • You can find details on how to run a manual import here.


Thursday, August 12, 2010

Blog Related Stuff

http://blogcosm.com/ - A daily reference to the world of blogs.
Blog Classification NLP[pdf]

Twitter Bots

I love Twitter. Why? Quite simply, because of the amazing community of helpful, knowledgeable, and diverse people around Twitter. I’ve already met many interesting individuals, both on and offline, because we all participate in the Twitter community. And Twitter keeps getting better through innovation from the community. There is a growing trend here: the use of Twitter bots.

What are Twitter Bots? They are special Twitter accounts that perform a specific function and provide you with useful information. Twitter bots come in 2 basic flavors:

1) Push Bots - These bots don’t do anything fancy. Once you start to follow them, they broadcast messages to you. The most common uses of push bots to date have been by sports teams (for scoring updates) and weather forecasts.

2) Pull Bots – These bots are more sophisticated than push bots. You can interact with them by using Direct Message commands. The features of pull bots greatly exceed push bots. Pull bots operate as micro-applications behind the scenes, processing commands from a Twitter user, doing some work and then sending the result/data back to the original user via a Direct Message.

A simple example is the Timer bot. It’s Twitter’s version of a personal reminder service. Here’s how it works:

  • Follow Timer bot
  • Send a direct message to timer like this: “d timer 45 call mom”

The timer bot will save your request, wait 45 minutes, and then send a direct message reminder back to your Twitter account that says ‘call mom’.
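
The command handling inside a pull bot like this can be sketched in a few lines of Python. This is only a guess at the general shape, not the Timer bot's actual code. The names Reminder, parse_timer_command, and reminder_reply are hypothetical, the message format is inferred from the single example above, and no Twitter API calls are made: sending, receiving, and the actual waiting are left out.

```python
import re
from dataclasses import dataclass

# Assumed command shape: "d <bot> <minutes> <reminder text>".
COMMAND_RE = re.compile(
    r"^d\s+(?P<bot>\w+)\s+(?P<minutes>\d+)\s+(?P<text>.+)$", re.IGNORECASE
)


@dataclass
class Reminder:
    sender: str    # account that sent the direct message
    minutes: int   # delay before the reminder is sent back
    text: str      # reminder text to echo to the sender


def parse_timer_command(sender, message):
    """Parse a direct message addressed to the timer bot, or return None."""
    match = COMMAND_RE.match(message.strip())
    if match is None or match.group("bot").lower() != "timer":
        return None
    return Reminder(sender, int(match.group("minutes")), match.group("text").strip())


def reminder_reply(reminder):
    """Build the direct message the bot would send back after the delay."""
    return f"d {reminder.sender} {reminder.text}"


if __name__ == "__main__":
    request = parse_timer_command("alice", "d timer 45 call mom")
    print(request)
    print(reminder_reply(request))
```

A real bot would also need a scheduler that holds each Reminder for the requested number of minutes, plus some way to reject malformed commands with a helpful reply.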

Here is a list of some of the more popular Twitter Bots. If you have others you find useful, add them to the comments and I will update the list.

  • gCal - Add Google Calendar events
  • HappyTwitDay - Send Happy Birthday Wishes
  • WineTweets - Share what you’re drinking with other wine lovers on Twitter (check this out @garyvee)
  • TweetBeep - Track topics in Twitter and have them emailed to you
  • Tipr - Tells you the amount you should leave for a tip when dining out
Source: http://www.kenburbary.com/2008/06/twitter-bots-usage-steadily-growing/

Apache Cassandra

Cassandra was open sourced by Facebook in 2008, and is now developed by Apache committers and contributors from many companies.

SOME DEFINITIONS:


Stable releases

Cassandra stable releases are well tested and reasonably free of serious problems (or at least the problems are known and well documented). If you are setting up a production environment, a stable release is what you want.

Betas and release candidates

Betas are prototype releases considered ready for user testing, and release candidates have the potential to become the next stable release. These releases represent the state of the art, so they are often the best place to start, and since APIs and on-disk storage formats can change between major versions, this can also save you from an upgrade. Testing and feedback are also highly appreciated.

Nightly builds

Nightly builds represent the current state of development as of the time of the build. They contain all of the previous day's new features, fixes, and newly introduced bugs. The only guarantee they come with is that they successfully build and the unit tests pass. Nightly builds are a handy way of testing recent changes, or accessing the latest features and fixes not found in beta or release candidates, but there is some risk of them being buggy.

Tuesday, August 3, 2010

Virus - USB files seen as shortcuts

Virus Problem and Solution

Issue


I caught a virus on my flash drive at work and it appears to have changed all my files to shortcuts. I believe I've cleaned the virus, but how do I get my files back so that I can view them?

Solution

  • If you did not format your flash drive, then check whether the files are hidden.
  • Click on "Start" --> Run --> type cmd and click on OK.
  • Here I assume your pen drive is G:
  • Enter this command.
  • attrib -h -r -s /s /d g:\*.*
  • Note : Replace the letter g with your flash drive letter.