One thing I've wondered about is the syntax of the robots.txt file, where
it's placed, and how it's used. I knew it's used to keep spiders
away from your site, but that's about it. I had to look into it
recently because we're offering free memberships at work, and we don't want
the membership pages indexed by search engines. I've also wondered how we can exclude
certain areas, such as where we collate our site statistics, from these
engines.

As it turns out, it's dead simple. Create a robots.txt file in your
htmlroot (the top-level directory of your site, so that the file is served
as /robots.txt), and use the following syntax:

User-agent: *
Disallow: /path/
Disallow: /path/to/file

The User-agent line can name a specific spider or use the wildcard *,
which matches all of them; there are so many spiders out there that it's
probably safest to simply disallow all of them. Each Disallow line takes
exactly one path, which is matched as a prefix of the requested URL path,
but you can have multiple Disallow lines, so you can exclude any number
of paths or files.
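
If you want to sanity-check a set of rules before putting them live, here's a minimal sketch using the urllib.robotparser module from Python's standard library. The paths (/members/, /stats/report.html) and the spider names (ExampleBot, OtherBot) are made up for illustration; they are not our real setup, so swap in your own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ExampleBot is shut out entirely, while every other
# spider is only kept away from the member area and one statistics file.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /members/
Disallow: /stats/report.html
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Each check asks: may this spider fetch this path under the rules above?
checks = [
    ("OtherBot", "/index.html"),
    ("OtherBot", "/members/profile.html"),
    ("OtherBot", "/stats/report.html"),
    ("ExampleBot", "/index.html"),
]
for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:10} {path:25} {verdict}")
```

With these rules, only the first check should come back as allowed. Note that this only tells you how a rule-following spider would read the file; it says nothing about spiders that ignore robots.txt.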