What is a Search Engine Spider?
An explanation of what a search engine spider is with an example showing Googlebot in web stats and raw logs.
A spider, or search engine spider, is a program that automatically traverses the Web and requests documents from URLs. A spider usually starts from a historical list of URLs and retrieves the documents they reference. As it visits a website, it checks whether the site is already listed in its database. If the site is already listed, it usually updates the entry with any changes it finds.
Spiders are also commonly known as robots or crawlers. Other names sometimes used are webwalkers, wanderers, and worms.
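The traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular search engine works: the `crawl` function, its `fetch` parameter, and the `example.com` pages are all hypothetical, and links are found with a naive pattern rather than a real HTML parser.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Naive link pattern; good enough for a sketch, not a real HTML parser.
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def crawl(seed_urls, fetch, database=None, max_pages=100):
    """Visit pages breadth-first, starting from a list of known URLs.

    `fetch` is a hypothetical callable: URL -> document text, or None on failure.
    `database` maps URL -> last seen content; existing entries are refreshed.
    """
    database = {} if database is None else database
    queue = deque(seed_urls)
    queued = set(seed_urls)
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        visited += 1
        database[url] = body  # new listing, or update of an existing one
        for href in HREF_RE.findall(body):
            link = urljoin(url, href)  # resolve relative links
            if link not in queued:
                queued.add(link)
                queue.append(link)
    return database

# A tiny in-memory "web" stands in for real HTTP requests.
pages = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "no links here",
}
db = crawl(["http://example.com/"], pages.get)
print(sorted(db))
```

Passing `fetch` in as a parameter keeps the sketch runnable without network access. A real spider would also honor robots.txt and rate limits, which are left out here.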
How can I tell if a spider has visited my web site?
One way to see whether a spider or search engine crawler has visited your site is to view your Web statistics. Most complete statistics programs list this information under "User-Agents" or "Robots or Spiders". If you are using a hosting company, ask them how to access your web statistics. See Example 1 below:
Example 1 - Web stats showing search engine spiders like Googlebot
The stats above cover a set time period, during which the Googlebot spider visited the site 55 times. What this view does not show is when those visits occurred. Filters and options within Web stats programs will sometimes let you see this information from different perspectives, which can be helpful.
Example 2 - Finding Spiders in Server Log Files (User-Agent Logs)
It's sometimes necessary to view the web server's log file data for more complete usage information. These logs are commonly located in directories outside the Web root folder, such as /var/log, but the location may differ depending on the operating system and specific server configuration. Here I searched the log file for instances of "google", and this was the first result:
64.68.82.197 - - [16/Feb/2004:22:04:16 -0800] "GET /robots.txt HTTP/1.0" 302 291 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
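If you look closely you will see the IP address (64.68.82.197), the time of the request, the file requested ("robots.txt"), and the User-Agent requesting the file (Googlebot/2.1). The two numbers after the request are the HTTP status code (302) and the response size in bytes (291).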
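The same fields can be pulled out of a log programmatically. The sketch below assumes the log uses the Apache-style "combined" layout seen in the line above; other server configurations will need a different pattern. The names `LOG_RE`, `parse_line`, and `spider_hits` are made up for this example.

```python
import re
from datetime import datetime

# Pattern for one "combined"-style log line (an assumption about the format).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # Month names are parsed with the current locale; English is assumed here.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

def spider_hits(lines, needle="googlebot"):
    """Yield parsed entries whose User-Agent contains `needle` (case-insensitive)."""
    needle = needle.lower()
    for line in lines:
        entry = parse_line(line)
        if entry and needle in entry["agent"].lower():
            yield entry

sample = (
    '64.68.82.197 - - [16/Feb/2004:22:04:16 -0800] "GET /robots.txt HTTP/1.0" '
    '302 291 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"'
)
for hit in spider_hits([sample]):
    print(hit["ip"], hit["time"].isoformat(), hit["request"], hit["agent"])
```

To scan a real log, pass an open file object to `spider_hits` in place of the one-item list. Because each entry carries a parsed timestamp, this view also shows when a spider visited, which the summary stats in Example 1 do not.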