The good news about the Internet and its most visible
component, the World Wide Web, is that there are hundreds of
millions of pages available, waiting to present information on
an amazing variety of topics. The bad news about the Internet
is that there are hundreds of millions of pages available,
most of them titled according to the whim of their author,
almost all of them sitting on servers
with cryptic names. When you need to know about a particular
subject, how do you know which pages to read? If you're like
most people, you visit an Internet search engine.
Internet search engines are special sites on the Web that
are designed to help people find information stored on other
sites. There are differences in the ways various search
engines work, but they all perform three basic tasks:
- They search the Internet -- or select pieces of the
Internet -- based on important words.
- They keep an index of the words they find, and where
they find them.
- They allow users to look for words or combinations of
words found in that index.
Early search engines held an index of a few hundred
thousand pages and documents, and received maybe one or two
thousand inquiries each day. Today, a top search engine will
index hundreds of millions of pages, and respond to tens of
millions of queries per day. In this edition of HowStuffWorks,
we'll tell you how these major tasks are performed, and how
Internet search engines put the pieces together in order to
let you find the information you need on the Web.
Looking at the Web
When most people talk
about Internet search engines, they really mean World Wide Web
search engines. Before the Web became the most visible part of
the Internet, there were already search engines in place to
help people find information on the Net. Programs with names
like "gopher" and "Archie" kept indexes of files stored on
servers connected to the Internet, and dramatically reduced
the amount of time required to find programs and documents. In
the late 1980s, getting serious value from the Internet meant
knowing how to use gopher, Archie, Veronica and the rest.
Today, most Internet users limit their searches to the Web, so
we'll limit this article to search engines that focus on the
contents of Web pages.
Before a search engine can tell you where
a file or document is, it must be found. To find information
on the hundreds of millions of Web pages that exist, a search
engine employs special software robots, called spiders,
to build lists of the words found on Web sites. When a spider
is building its lists, the process is called Web
crawling. (There are some disadvantages to calling part of
the Internet the World Wide Web -- a large set of
arachnid-centric names for tools is one of them.) In order to
build and maintain a useful list of words, a search engine's
spiders have to look at a lot of pages.
How does any spider start its travels over the Web? The
usual starting points are lists of heavily used servers
and very popular pages. The spider will begin with a popular
site, indexing the words on its pages and following every link
found within the site. In this way, the spidering system
quickly begins to travel, spreading out across the most widely
used portions of the Web.
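To make that crawl loop concrete, here is a minimal sketch in Python. The seed list, the page limit and the fetching details are assumptions made for the example, not a description of any particular engine's crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: start from popular seed pages and follow every link."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                        # skip pages that fail to load
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        yield url, html                     # hand the page off to the indexer

# Example (assumed seed URL): crawl outward from one well-known starting point.
# for url, page in crawl(["https://example.com/"]):
#     print(url)
```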
"Spiders" take a Web page's content and
create key search words that enable online users to find
pages they're looking
began as an academic search engine. In the paper that
describes how the system was built, Sergey Brin and Lawrence
Page give an example of how quickly their spiders can work.
They built their initial system to use multiple spiders,
usually three at one time. Each spider could keep about 300
connections to Web pages open at a time. At its peak
performance, using four spiders, their system could crawl over
100 pages per second, generating around 600 kilobytes of data per second.
Keeping everything running quickly meant building a system
to feed necessary information to the spiders. The early Google
system had a server dedicated to providing URLs to the
spiders. Rather than depending on an Internet
service provider for the domain name
server (DNS) that translates a server's name into an address,
Google had its own DNS, in order to keep delays to a minimum.
When the Google spider looked at an HTML
page, it took note of two things:
- The words within the page
- Where the words were found
Words occurring in the title, subtitles, meta
tags and other positions of relative importance were
noted for special consideration during a subsequent user
search. The Google spider was built to index every significant
word on a page, leaving out the articles "a," "an" and "the."
Other spiders take different approaches.
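Before looking at those other approaches, the word-and-position bookkeeping described above can be sketched roughly as follows. The stop-word list and the crude split between title and body are assumptions made for the example.

```python
import re

STOP_WORDS = {"a", "an", "the"}    # the articles the Google spider left out

def index_page(title, body):
    """Note each significant word plus where on the page it was found."""
    entries = []
    for location, text in (("title", title), ("body", body)):
        words = re.findall(r"[a-z']+", text.lower())
        for position, word in enumerate(words):
            if word not in STOP_WORDS:
                entries.append((word, location, position))
    return entries

# Words found in the title are flagged so they can be weighted more heavily later.
print(index_page("How Search Engines Work", "The spider notes every significant word."))
```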
These different approaches usually attempt to make the
spider operate faster, allow users to search more efficiently,
or both. For example, some spiders will keep track of the
words in the title, sub-headings and links, along with the 100
most frequently used words on the page and each word in the
first 20 lines of text. Lycos
is said to use this approach to spidering the Web.
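Assuming the rule really is "the 100 most frequent words plus every word in the first 20 lines," a sketch of that selection might look like this; only those two numbers come from the description above, the rest is illustrative.

```python
from collections import Counter

def words_to_index(page_text, top_n=100, first_lines=20):
    """Keep the most frequent words plus every word near the top of the page."""
    all_words = page_text.lower().split()
    frequent = {word for word, _ in Counter(all_words).most_common(top_n)}
    early_lines = page_text.lower().splitlines()[:first_lines]
    early = {word for line in early_lines for word in line.split()}
    return frequent | early    # a word qualifies under either rule
```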
Other systems, such as AltaVista,
go in the other direction, indexing every single word on a
page, including "a," "an," "the" and other "insignificant"
words. The push to completeness in this approach is matched by
other systems in the attention given to the unseen portion of
the Web page, the meta tags.
Meta tags allow the owner of a page to specify key words and
concepts under which the page will be indexed. This can be
helpful, especially in cases in which the words on the page
might have double or triple meanings -- the meta
tags can guide the search engine in choosing which of the
several possible meanings for these words is correct. There
is, however, a danger in over-reliance on meta tags, because a
careless or unscrupulous page owner might add meta tags that
fit very popular topics but have nothing to do with the actual
contents of the page. To protect against this, spiders will
correlate meta tags with page content, rejecting the meta tags
that don't match the words on the page.
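One simple way to picture that cross-check: keep only the meta keywords that also show up in the page's own text. The function below is a rough sketch under that assumption, not how any particular engine does it.

```python
def trusted_meta_keywords(meta_keywords, page_text):
    """Keep only the meta keywords that actually appear in the page body."""
    body_words = set(page_text.lower().split())
    return [kw for kw in meta_keywords if kw.lower() in body_words]

# A gardening page that also claims popular but unrelated keywords:
print(trusted_meta_keywords(
    ["flower", "soil", "celebrities", "lottery"],
    "planting a flower bed starts with good soil and plenty of sunlight"))
# ['flower', 'soil'] -- the unrelated tags are rejected
```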
All of this assumes that the owner of a page actually wants
it to be included in the results of a search engine's
activities. Many times, the page's owner doesn't want it
showing up on a major search engine, or doesn't want the
activity of a spider accessing the page. Consider, for
example, a game that builds new, active pages each time
sections of the page are displayed or new links are followed.
If a Web spider accesses one of these pages, and begins
following all of the links for new pages, the game could
mistake the activity for a high-speed human player and spin
out of control. To avoid situations like this, the robot
exclusion protocol was developed. This protocol,
implemented in the meta-tag section at the beginning of a Web
page, tells a spider to leave the page alone -- to neither
index the words on the page nor try to follow its links.
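The meta-tag form of the robots exclusion protocol looks like <meta name="robots" content="noindex, nofollow">. A spider's check for it might be sketched as below; the simplified pattern assumes the attributes appear in that order.

```python
import re

def robots_directives(html):
    """Honor a <meta name="robots" content="..."> tag if the page carries one."""
    match = re.search(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if not match:
        return True, True              # no tag: free to index and follow links
    directives = {d.strip().lower() for d in match.group(1).split(",")}
    may_index = "noindex" not in directives
    may_follow = "nofollow" not in directives
    return may_index, may_follow

print(robots_directives('<head><meta name="robots" content="noindex, nofollow"></head>'))
# (False, False): neither index the words nor follow the links
```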
Building the Index
Once the spiders have
completed the task of finding information on Web pages (and we
should note that this is a task that is never actually
completed -- the constantly changing nature of the Web means
that the spiders are always crawling), the search engine must
store the information in a way that makes it useful. There are
two key components involved in making the gathered data
accessible to users:
- The information stored with the data
- The method by which the information is indexed
In the simplest case, a search engine could just store the
word and the URL where it was found. In reality, this would
make for an engine of limited use, since there would be no way
of telling whether the word was used in an important or a
trivial way on the page, whether the word was used once or
many times or whether the page contained links to other pages
containing the word. In other words, there would be no way of
building the ranking list that tries to present the
most useful pages at the top of the list of search results.
To make for more useful results, most search engines store
more than just the word and URL. An engine might store the
number of times that the word appears on a page. The engine
might assign a weight to each entry, with increasing
values assigned to words as they appear near the top of the
document, in sub-headings, in links, in the meta tags or in
the title of the page. Each commercial search engine has a
different formula for assigning weight to the words in its
index. This is one of the reasons that a search for the same
word on different search engines will produce different lists,
with the pages presented in different orders.
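Here is a rough sketch of that richer bookkeeping. The location weights are invented for the example; as noted above, every engine uses its own formula.

```python
from collections import Counter, defaultdict

# Illustrative weights only -- each engine's real formula is different.
LOCATION_WEIGHT = {"title": 5, "heading": 3, "link": 2, "body": 1}

index = defaultdict(list)    # word -> list of (url, count, weight) entries

def add_page(url, sections):
    """'sections' maps a location name ('title', 'body', ...) to its text."""
    counts, weights = Counter(), Counter()
    for location, text in sections.items():
        for word in text.lower().split():
            counts[word] += 1
            weights[word] += LOCATION_WEIGHT.get(location, 1)
    for word in counts:
        index[word].append((url, counts[word], weights[word]))

add_page("https://example.com/",
         {"title": "search engines", "body": "how search engines build an index"})
print(index["search"])    # [('https://example.com/', 2, 6)] -- count 2, weight 6
```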
Regardless of the precise combination of additional pieces
of information stored by a search engine, the data will be
encoded to save storage space. For example, the
original Google paper describes using 2 bytes, of 8 bits
each, to store information on weighting -- whether the word
was capitalized, its font size, position, and other
information to help in ranking the hit. Each factor might take
up 2 or 3 bits within the 2-byte grouping (8 bits = 1 byte).
As a result, a great deal of information can be stored in a
very compact form. After the information is compacted, it's
ready for indexing.
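A small sketch of that kind of bit-packing is below. The exact fields and their widths are assumptions chosen for the example; the original paper's layout differs, but the idea of squeezing several ranking clues into 2 bytes is the same.

```python
def pack_hit(capitalized, font_size, position):
    """Pack three ranking clues into 2 bytes: 1 bit + 3 bits + 12 bits."""
    assert font_size < 8 and position < 4096
    return (capitalized & 0x1) << 15 | (font_size & 0x7) << 12 | (position & 0xFFF)

def unpack_hit(packed):
    return (packed >> 15) & 0x1, (packed >> 12) & 0x7, packed & 0xFFF

packed = pack_hit(capitalized=1, font_size=3, position=42)
print(packed.to_bytes(2, "big"), unpack_hit(packed))   # 2 bytes -> (1, 3, 42)
```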
An index has a single purpose: It allows information to be
found as quickly as possible. There are quite a few ways for
an index to be built, but one of the most effective ways is to
build a hash table. In hashing,
a formula is applied to attach a numerical value to each word.
The formula is designed to evenly distribute the entries
across a predetermined number of divisions. This numerical
distribution is different from the distribution of words
across the alphabet, and that is the key to a hash table's effectiveness.
In English, there are some letters that begin many words,
while others begin fewer. You'll find, for example, that the
"M" section of the dictionary is much thicker than the "X"
section. This inequity means that finding a word beginning
with a very "popular" letter could take much longer than
finding a word that begins with a less popular one. Hashing
evens out the difference, and reduces the average time it
takes to find an entry. It also separates the index from the
actual entry. The hash table contains the hashed number along
with a pointer to the actual data, which can be sorted in
whichever way allows it to be stored most efficiently. The
combination of efficient indexing and effective storage makes
it possible to get results quickly, even when the user creates
a complicated search.
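A Python dictionary is itself a hash table, but a toy version with explicit buckets makes the idea visible: the hash formula picks a bucket, and the bucket holds a pointer to the actual entry rather than the entry itself. The formula below is deliberately simple and is not how a production index hashes words.

```python
BUCKET_COUNT = 8    # a real index would use far more divisions

def bucket_for(word):
    """A toy hash formula: spread words over the buckets regardless of spelling."""
    return sum(ord(ch) for ch in word) % BUCKET_COUNT

buckets = [[] for _ in range(BUCKET_COUNT)]

def insert(word, data_pointer):
    buckets[bucket_for(word)].append((word, data_pointer))

def lookup(word):
    for stored_word, data_pointer in buckets[bucket_for(word)]:
        if stored_word == word:
            return data_pointer    # pointer into the separately stored entry data
    return None

insert("search", 0)      # 0 = offset of this word's full entry in the data store
insert("engine", 1)
print(lookup("search"))  # 0
```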
Building a Search
Searching through an index
involves a user building a query and submitting it
to the search engine. The query can be quite simple, a
single word at minimum. Building a more complex query requires
the use of Boolean
operators that allow you to refine and extend the
terms of the search.
The Boolean operators most often seen are:
- AND - All the terms joined by "AND" must appear
in the pages or documents. Some search engines substitute
the operator "+" for the word AND.
- OR - At least one of the terms joined by "OR"
must appear in the pages or documents.
- NOT - The term or terms following "NOT" must not
appear in the pages or documents. Some search engines
substitute the operator "-" for the word NOT.
- FOLLOWED BY - One of the terms must be directly
followed by the other.
- NEAR - One of the terms must be within a
specified number of words of the other.
- Quotation Marks - The words between the quotation
marks are treated as a phrase, and that phrase must be found
within the document or file.
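To make the operators above concrete, here is a tiny sketch that treats each word as a set of pages and applies AND, OR and NOT with set operations; the index contents are made up for the example.

```python
# A tiny inverted index: word -> set of pages containing it (made-up data).
index = {
    "truck":  {"page1", "page3"},
    "bed":    {"page1", "page2", "page3"},
    "flower": {"page2"},
}

def pages(word):
    return index.get(word, set())

print(pages("truck") & pages("bed"))      # AND -> {'page1', 'page3'}
print(pages("truck") | pages("flower"))   # OR  -> pages with either word
print(pages("bed") - pages("flower"))     # NOT -> {'page1', 'page3'}
```

Phrase, FOLLOWED BY and NEAR searches also need the stored word positions, which is one more reason engines keep more than just a URL for each word.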
Searching for Sport
Search engines have become such an integral part
of our lives that at least one organized game has
evolved around this tool. In Googlewhacking, you
type two words into the Google
search engine in the hopes of receiving
exactly one result -- a single Web page on which
both of those words appear. This is a pure whack.
It's quite a difficult task -- pick two words that are at
all related and you'll get a whole lot more than one result,
yet many completely unrelated pairs return zero results.
If you achieve a pure whack, you can submit it to http://www.googlewhack.com/,
where it is posted in The Whack Stack (along with
your name, or whatever you want to call yourself) for
all to see. One pure whack currently in The Whack Stack
The searches defined by Boolean
operators are literal searches -- the engine looks
for the words or phrases exactly as they are entered. This can
be a problem when the entered words have multiple meanings.
"Bed," for example, can be a place to sleep, a place where
flowers are planted, the storage space of a truck or a place
where fish lay their eggs. If you're interested in only one of
these meanings, you might not want to see pages featuring all
of the others. You can build a literal search that tries to
eliminate unwanted meanings, but it's nice if the search
engine itself can help out.
One of the areas of search engine research is
concept-based searching. Some of this research involves
using statistical analysis on pages containing the words or
phrases you search for, in order to find other pages you might
be interested in. Obviously, the information stored about each
page is greater for a concept-based search engine, and far
more processing is required for each search. Still, many
groups are working to improve both results and performance of
this type of search engine. Others have moved on to another
area of research, called natural-language queries.
The idea behind natural-language queries is that you can
type a question in the same way you would ask it to a human
sitting beside you -- no need to keep track of Boolean
operators or complex query structures. The most popular
natural language query site today is AskJeeves.com,
which parses the query for keywords that it then applies to
the index of sites it has built. It only works with simple
queries, but competition is heavy to develop a
natural-language query engine that can accept a query of great complexity.
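As a very rough sketch of that keyword-parsing step, a query could be reduced to content words like this; the stop-word list is an assumption, and a real parser is far more sophisticated.

```python
QUESTION_WORDS = {"what", "is", "the", "of", "a", "an", "how", "do", "does",
                  "where", "who", "when", "why", "to", "in", "i", "can"}

def keywords_from_question(question):
    """Drop the question scaffolding and keep the content words for the index."""
    words = question.lower().strip("?!. ").split()
    return [word for word in words if word not in QUESTION_WORDS]

print(keywords_from_question("How do spiders crawl the Web?"))
# ['spiders', 'crawl', 'web']
```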
For more information on search engines and related topics,
check out the links on the next page.