Entireweb Newsletter   *   January 30, 2007   *   ISSUE #305

Introduction to Information Retrieval - Search Engines

This article aims to provide readers with an overview of the very basics of information retrieval. Understanding these principles can help you to optimise your website content for the search engines and also help you to analyse search engine algorithm changes. However, the details in this article are not intended to describe how modern search engines work, as they use many additional factors, including link analysis.

Information retrieval (IR) is the science of searching for documents / within documents. Information retrieval techniques form some of the most fundamental elements of web search engine technology. This article will discuss information retrieval in the context of search engines.


It is unrealistic to remotely access documents in real-time when performing a search, as it would be exceptionally slow and unreliable. Therefore a local index is created, which for search engines is done by a crawler (aka spider). Thus, when you perform a search you are not actually searching the web, but are searching a version of the web as seen and stored by the crawler at some point in the past.

The index would not usually contain the whole document (this may, however, be stored in a separate document cache), but stores a representation of the terms relevant to the document that is quickly and easily searchable. There are various stages to this process (not all systems will include all of these stages):

1. Document
This is the document in its raw format with all text, structure and formatting.

2. Structure Analysis
Recognising headings, paragraphs, titles, bold text, lists, ..., etc.

3. Lexical Analysis
Converting the characters in the document into a list of words. This process may include analysing digits, hyphens, punctuation and the case of letters. Proper Noun Analysis can use the case and format of words/phrases to identify important information such as names, places, dates and organisations.

4. Stopwords Removal
The removal of words which occur very often and provide no ability to discriminate between documents. For example: "the", "it", "is". However, it can be seen that some search engines leave these words in the index and remove them at the user query level. This allows "+word" queries to be performed.

5. Stemming
This is a conflation procedure which reduces variations of a word into a single root. For example, both "worked" and "working" may be reduced to "work". The Porter Stemming Algorithm can be used to perform stemming.

After these processes have been performed we have a list of index terms for this particular document.

Index Term Weighting

We now need to calculate to what degree a term is relevant to a particular document. The following is an example of a weighting scheme:

* Index Term Frequency
This is the frequency of a term inside a document. The frequency is usually normalised within the particular document:
TermFrequency(term, document) = (no. occurrences of term in document) / (no. occurrence of term with max occurrences in document)

* Inverse Document Frequency
The inverse of the frequency of a term between all the documents in the set. Terms that appear in many documents are not very useful as they do not allow us to discriminate between documents.
IDF(term) = log([no. documents in collection] / [no. documents in collection containing term])

* Weight
This is the actual index term weight for a particular term in a particular document:
Weight(term, document) = TermFrequency(term, document) * IDF(term)

Other items may be a factor in deciding weight, such as: the terms position in the document, whether it was in the title, whether it was bold, whether it was in a list, ..., etc.

Reverse Index

We now have a list of terms (with their weights) for a given document. However, a list of documents that contain a particular word would be much more useful, rather than a list of words for a particular document. This is called a reverse index.

For example, if we had the following three documents:

1. This is a file about website search engine optimisation
2. A website design tutorial file
3. A file about bespoke software design and development

Then the index terms for each document may be as follows (weights would be in parentheses):

1. file(?), website(?), search(?), engine(?), optimisation(?)
2. website(?), design(?), tutorial(?), file(?)
3. file(?), bespoke(?), software(?), design(?), development(?)

However, the reverse index would be:

file: document1(?), document2(?), docuement3(?)
website: document1(?), document2(?)
search: document1(?)
engine: document1(?)
optimisation: document1(?)
design: document2(?), document3(?)
tutorial: document2(?)
bespoke: document3(?)
software: document3(?)
development: document3(?)

The reverse index then allows us to easily find the relevant documents for a particular word.

Similarity Matching

This is the process for computing the relevance of a document to a particular query. It can comprise:

* Query Term Weighting
Applies weights to each term in a query. For example, terms at the beginning of a query may be weighted more heavily.

* Similarity Coefficient
Uses the query term weights and document term weights to compute the similarity between a query and a document. The similarity could be calculated using the vector space model and calculating the cosine coefficient (this will not be discussed here).

Refreshing the Index

Documents can continually change, therefore the index needs to be continually refreshed. The crawler needs to decide how often to reindex particular documents, based on how often they are updated. If a document is not updated very often, then reindexing it very often would be a waste of resources. However, documents that are always changing need to be continually reindexed as they may no longer be relevant to terms they are currently indexed for.

Measuring Accuracy of IR Systems

Two of the simplest ways to assess the accuracy of a basic information retrieval system are Precision and Recall. These are calculated using the number of relevant documents and the number of retrieved documents (the documents perceived to be relevant by the system), the documents actually returned to the user are where these two sets of document overlap.

* Precision
Ratio of no. relevant documents returned to the total number of documents retrieved - i.e. the number of documents returned that are relevant.

* Recall
Ratio of no. relevant documents returned to the total number of relevant documents - i.e. the number of relevant documents that are returned.

The documents actually returned from the retrieved documents set will be decided using some form of ranking mechanism (discussion of this is beyond the scope of this article).

Generally, there is a compromise between precision and recall, as increasing the number of documents retrieved is likely to also increase the number of irrelevant documents in the set of retrieved documents.

Web Search Engines

Web search engines (such as Google, Yahoo! and MSN) usually combine information retrieval techniques with link structure analysis, as well as many other unknown techniques. Obviously, the above techniques are very easily spammed, so any useful search engine would need to try to filter out spamming where possible.

Michael Hawthornthwaite works at Acid Computer Services (Manchester) who specialize in web design, web development and bespoke software development.

News from Entireweb

SpeedyAds Contextual Ad Network
SpeedyAds, Entireweb's pay-per-click advertising solution, is a non-invasive, cost-efficient way of getting your site and products in front of thousands of people. We highly recommend you to try it out!

Express Inclusion for Sites
Entireweb recently introduced its most cost-efficient fast site inclusion program yet, Express Inclusion for Sites. We will crawl and list your entire site (up to 1,000 pages) in the Entireweb Network within 2 business days - for less than $0.40 US per page! If you have a medium to large website, this is an unbeatable submission method! Read more here!