MonkeyDoIt! Simple Instructions



What is a Search Engine Spider?




An explanation of what a search engine spider is with an example showing Googlebot in web stats and raw logs.


A Spider or Search Engine Spider is a program that automatically traverses the Web and requests documents from URLs. A spider usually starts from a historical list of URLs and then retrieves the documents those pages reference. As it visits a website, it checks whether the site is already listed in its database; if it is, the spider usually updates its record with any changes it finds. Spiders are also commonly known as Robots or Crawlers. Other names sometimes used are Webwalkers, Wanderers, and Worms.
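
The traversal described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine works. The PAGES dictionary is a hypothetical stand-in for the Web (each URL maps to the URLs it links to), and the function name crawl is made up for this example. A real spider would request pages over HTTP instead of reading a dictionary.

```python
from collections import deque

# Hypothetical stand-in for the Web: each URL maps to the URLs it references.
PAGES = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/", "http://example.com/c"],
    "http://example.com/c": [],
}

def crawl(seed_urls, pages):
    """Visit every URL reachable from seed_urls, each one only once."""
    seen = set(seed_urls)      # URLs the spider already knows about
    queue = deque(seed_urls)   # URLs waiting to be requested
    visited = []               # order in which documents were retrieved
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in pages.get(url, []):  # documents referenced by this page
            if link not in seen:         # skip URLs already listed
                seen.add(link)
                queue.append(link)
    return visited

order = crawl(["http://example.com/"], PAGES)
print(order)
```

The seen set plays the role of the spider's database of known URLs, so a page that is linked from several places is still requested only once.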

How can I tell if a spider has visited my web site?

One way to see whether a spider or search engine crawler has visited your site is to view your Web statistics. Most full-featured statistics programs list this information under "User-Agents" or "Robots or Spiders". If you are using a hosting company, ask them how to access your Web statistics. See Example 1 below:

Example 1 - Web stats showing search engine spiders like Googlebot

[Image: Web stats report listing search engine spiders, including Googlebot]

The stats above cover a set time period, during which the Googlebot spider visited the site 55 times. What this view does not show is when those visits occurred. Filters and options within Web stats programs will sometimes let you see this information from different perspectives, which can be helpful.

Example 2 - Finding Spiders in Server Log Files (User-Agent Logs)

It is sometimes necessary to view the Web server's log file data for more complete usage information. These logs are commonly located in directories outside the Web root folder, such as /var/log, but the location may differ depending on the operating system and the specific server configuration. Here I searched the log file for instances of "google", and this was the first result:

64.68.82.197 - - [16/Feb/2004:22:04:16 -0800] "GET /robots.txt HTTP/1.0" 302 291 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

If you look closely you will see the IP address, the time of the request, the file requested ("robots.txt"), and the User-Agent requesting the file (Googlebot/2.1).
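
To pull those fields out automatically, a short Python sketch like the one below may help. It assumes the log uses the Apache-style "combined" layout, which the sample entry appears to match; other servers or configurations may need a different pattern. The names LOG_PATTERN, parse_line, and is_google are made up for this example.

```python
import re

# Assumed layout (Apache-style "combined" format):
# ip ident user [time] "method path protocol" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def is_google(entry):
    """Case-insensitive check for "google" in the User-Agent string."""
    return "google" in entry["agent"].lower()

# The log entry shown above, split across lines only for readability.
line = ('64.68.82.197 - - [16/Feb/2004:22:04:16 -0800] '
        '"GET /robots.txt HTTP/1.0" 302 291 "-" '
        '"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"')

entry = parse_line(line)
if entry and is_google(entry):
    print(entry["ip"], entry["time"], entry["path"], entry["agent"])
```

To scan a whole log file, you could call parse_line on each line and keep the entries for which is_google returns true. Note that a User-Agent string can be set to anything by the client, so a match here only shows what the visitor claimed to be.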