Once the spiders have completed the task of finding information on web pages, the search engine
must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible
to web users:
The information stored with the data
The method by which the information is indexed
In the simplest scenario, a search engine could just store the word and the URL where it was
found. But in reality, this would make for an engine of limited use since there would be no way of telling whether the word was used in an
important way or in a trivial way on the page, whether the word was used once or many times, or whether the page contained links to
other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most
useful pages at the top of the list of search results.
To make for more useful results, most search engines store more than just the word and URL. An engine might
store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned
to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each
commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search
for the same word on different search engines will produce different lists, with the pages presented in different orders.
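The idea of storing a count and a weight alongside each word can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example URL and the weight values are all hypothetical, and only the title and sub-headings are given extra weight here. No real search engine's formula is being reproduced.

```python
import re
from collections import Counter

# Hypothetical weights: illustrative assumptions only,
# not any real search engine's formula.
TITLE_WEIGHT = 5.0
HEADING_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_page(url, title, headings, body):
    """Return {word: (url, count, weight)} for a single page."""
    counts = Counter()
    weights = Counter()
    # Each part of the page contributes its own weight per occurrence.
    for text, weight in (
        (title, TITLE_WEIGHT),
        (" ".join(headings), HEADING_WEIGHT),
        (body, BODY_WEIGHT),
    ):
        for word in tokenize(text):
            counts[word] += 1
            weights[word] += weight
    return {word: (url, counts[word], weights[word]) for word in counts}


entries = index_page(
    url="http://example.com/spiders",
    title="How Spiders Work",
    headings=["Crawling the web"],
    body="Spiders follow links. Spiders read the words on each page.",
)
print(entries["spiders"])  # ('http://example.com/spiders', 3, 7.0)
```

The word "spiders" appears once in the title and twice in the body, so its entry records three occurrences and a weight of 5.0 + 1.0 + 1.0 = 7.0. Changing the weight constants changes the ordering of results, which is one way two engines can return different lists for the same word.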
An index has a single purpose: It allows information to be found as quickly as possible. There are
quite a few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a formula is applied to
attach a numerical value to each word. The formula is designed to evenly distribute the entries across a predetermined number of divisions.
This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table's
effectiveness.
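As a rough sketch of hashing, the Python below applies a simple formula to each word to pick one of a fixed number of divisions. The formula (a polynomial hash with multiplier 31), the number of divisions and the function names are assumptions chosen for illustration; a production index would use a more carefully designed hash function.

```python
NUM_BUCKETS = 8  # hypothetical "predetermined number of divisions"


def bucket_for(word, num_buckets=NUM_BUCKETS):
    """Turn a word into a number, then into a bucket index."""
    value = 0
    for ch in word:
        # Simple polynomial hash; the multiplier 31 is an arbitrary choice.
        value = (value * 31 + ord(ch)) % 2**32
    return value % num_buckets


def build_table(words, num_buckets=NUM_BUCKETS):
    """Place each word in the bucket chosen by the hash formula."""
    table = [[] for _ in range(num_buckets)]
    for word in words:
        table[bucket_for(word, num_buckets)].append(word)
    return table


def contains(table, word):
    """Look a word up by examining only the one bucket it can be in."""
    return word in table[bucket_for(word, len(table))]


words = ["search", "spider", "site", "submit", "server", "score", "xylophone"]
table = build_table(words)
for number, bucket in enumerate(table):
    print(number, bucket)
print(contains(table, "spider"))  # True
```

Six of the seven sample words begin with "s", so an alphabetical layout would crowd them into one section. Because the bucket number depends on every letter of the word, words that share a first letter can land in different divisions, and a lookup only has to examine a single bucket.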
The Search Engine Program
The search engine software, or program, is the final part of a search engine. When a person requests a search on a
keyword or phrase, the search engine software searches the index for relevant information. The software then provides a report back to the
searcher with the most relevant web pages listed first.
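The lookup step described above might look something like the following Python sketch. The index contents, URLs and weights are made up for the example, and summing the weights of matching words is just one simple assumption about how relevance could be combined.

```python
# Hypothetical index: word -> {url: weight}. URLs and weights are made up.
INDEX = {
    "spider": {"http://example.com/a": 7.0, "http://example.com/b": 2.0},
    "index": {"http://example.com/b": 4.0, "http://example.com/c": 1.0},
}


def search(query, index=INDEX):
    """Return URLs matching any query word, highest total weight first."""
    totals = {}
    for word in query.lower().split():
        # Words missing from the index simply contribute nothing.
        for url, weight in index.get(word, {}).items():
            totals[url] = totals.get(url, 0.0) + weight
    return sorted(totals, key=lambda url: totals[url], reverse=True)


print(search("spider index"))
# ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
```

Page "a" scores 7.0, page "b" scores 2.0 + 4.0 = 6.0 and page "c" scores 1.0, so the report lists them in that order, with the most relevant page first.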
3.2 Top Search Engines
We have now seen how search engines work. An integral part of any Internet Marketing or Search Engine
Optimization campaign is to know exactly which search engines to target. This section discusses some of the top search engines
today.
Google
Google's popularity has increased tenfold over the past several years. It has gone from beta
testing to becoming the Internet's largest index of web pages in a very short time. Its spider, affectionately named "Googlebot", crawls
the web and provides updates to Google's index about once a month.
Google.com began as an academic search engine. Google has a very good algorithm for
ranking the pages returned for a search, which is probably one of the main reasons it has become so popular over the years. Google uses several methods
to determine how pages rank in the returned results.
Yahoo!
Yahoo! is one of the oldest web directories and portals on the Internet today; the site went
live in August of 1994. Yahoo! is a 100% human-edited directory and provides secondary search results using Google.
Yahoo! is also one of the largest traffic generators around, as far as web directories and search
engines go. Unfortunately, it is also one of the most difficult to get listed in, unless of course you pay to submit your site. Even
paying does not guarantee that you will get listed.
Either way, if you suggest a URL, it is "reviewed" by a Yahoo! editor and, if approved, will appear
in the next index update.