Blocking spiders from crawling sites



GeorgeB
October 9th, 2011, 07:48
I had an issue with this on my forum site. Got that fixed, but I just thought about something. How do you stop certain spiders from crawling your site?

Sain Cai
October 9th, 2011, 09:24
# Block Google's image crawler completely
User-agent: Googlebot-Image
Disallow: /

# Block all spiders and bots from these two directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /pictures/

# Separate example: let Googlebot access everything except /cgi-bin/,
# let ia_archiver (alexa.com) access everything,
# and block all other bots entirely
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: ia_archiver
Allow: /

To keep every spider off the whole site, I would think you put this in robots.txt:
User-agent: *
Disallow: /
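
If you want to sanity-check rules like the Googlebot / ia_archiver / everyone-else example above before uploading them, one option is Python's standard urllib.robotparser module. This is only a rough sketch: the can_fetch helper and the bot name "SomeOtherBot" are made up for illustration, and real crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot / ia_archiver / everyone-else rules from the post.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: ia_archiver
Allow: /
"""


def can_fetch(rules: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical crawler name used only for this check.
    for agent in ("Googlebot", "ia_archiver", "SomeOtherBot"):
        for path in ("/index.html", "/cgi-bin/test.cgi"):
            print(agent, path, can_fetch(RULES, agent, path))
```

A bot that has its own User-agent group is judged by that group only, so the catch-all "Disallow: /" applies just to bots that are not named anywhere else in the file.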

sander k
October 9th, 2011, 10:56
That's right, but you could also add these meta tags to the <head> of your pages:


<meta name="googlebot" content="noarchive,noindex,nofollow,nosnippet" />
<meta name="robots" content="noarchive,noindex,nofollow" />

Aaron Gregory
October 9th, 2011, 11:22
Can you do that for certain pages as well? So it would index the homepage, but not the other sub-pages?

sander k
October 9th, 2011, 11:31
Sure, but why would you want to do that?
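
For the per-page case, the meta tag simply goes on the pages you want kept out of the index and is left off the rest. A minimal sketch of that idea, assuming your pages are generated by a script: the robots_meta function and the INDEXABLE_PATHS set are hypothetical names, not part of any forum software.

```python
# Hypothetical policy: only these paths should be indexed.
INDEXABLE_PATHS = {"/", "/index.html"}

# Same tag as in the post above.
NOINDEX_TAG = '<meta name="robots" content="noarchive,noindex,nofollow" />'


def robots_meta(path: str) -> str:
    """Return the robots meta tag to emit in <head> for `path`.

    Indexable pages get an empty string, every other page gets the noindex tag.
    """
    if path in INDEXABLE_PATHS:
        return ""
    return NOINDEX_TAG


if __name__ == "__main__":
    for path in ("/", "/forum/thread-1.html"):
        print(path, repr(robots_meta(path)))
```

Keep in mind that a crawler has to be able to fetch a page to see its meta tag, so a page handled this way should not also be disallowed in robots.txt.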

Seraphim
October 10th, 2011, 22:53
Either way, remember that bots and spiders don't always respect your meta tags or your robots.txt. Most legitimate ones, like the Google, MSN, and Yahoo search indexers, will obey, but less honest ones will often ignore your restrictions completely unless they are backed up by a hard limit such as an .htaccess rule or some type of browsing rate limiter.

Keeping pages out of the index can be beneficial for certain kinds of content, such as protecting an image gallery or file host from having the content leeched to death by hotlinking after getting indexed. You would also want to look into such methods if you have a forum with relatively personal content on it, for instance, that really doesn't need to be indexed to be found quickly by other people. I actually use the method myself to prevent certain files on my site from appearing in search engines. They're special-purpose scripts that hook into other APIs, so although they are public facing they need not be indexed, as only their corresponding APIs should ever need to access them.

Here's a piece of one of my robots.txt files. This blocks all clients that respect robots.txt from accessing the named folders or the named PHP file, which contains code I am developing to be implemented later in my billing system.


User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /source.php
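
As for the "hard limit" for bots that ignore robots.txt, here is a rough sketch of what a user-agent block plus a browsing rate limiter could look like at the application level. Everything in it is an assumption for illustration: the RequestGate class, the blocked substrings, and the limits are made-up values, and on a real site this job is often done by the web server (for example in .htaccess) rather than in your own code.

```python
import time
from collections import defaultdict, deque

# Hypothetical values for illustration only.
BLOCKED_AGENT_SUBSTRINGS = ("badbot", "scraper")
MAX_REQUESTS = 5        # requests allowed per window
WINDOW_SECONDS = 10.0   # sliding window length in seconds


class RequestGate:
    """Reject blocked user agents and clients that request too quickly."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        # client IP -> timestamps of that client's recent accepted requests
        self._hits = defaultdict(deque)

    def allow(self, client_ip, user_agent, now=None):
        """Return True if the request should be served."""
        now = time.monotonic() if now is None else now
        agent = user_agent.lower()
        if any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS):
            return False
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    gate = RequestGate(max_requests=2, window=10.0)
    for t in (0.0, 1.0, 2.0, 10.0):
        print(t, gate.allow("192.0.2.1", "Mozilla/5.0", now=t))
```

The user-agent check only stops bots that announce themselves honestly, which is why it is paired with the per-IP rate limit here.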