Method 1: The robots.txt file
This file gives instructions to the search engine robots that crawl websites. You can indicate which URLs should be followed and which must be ignored. You can even give a different set of instructions to each search engine.
Read this article: http://ccm.net/contents/web/robots-txt.php3
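As a rough illustration of per-robot instructions, the sketch below feeds a small, made-up robots.txt to Python's standard urllib.robotparser module and asks which URLs each robot may fetch. The paths /private/ and /drafts/ and the robot name SomeBot are assumptions chosen for the example; they do not come from this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one rule set for Googlebot,
# and a stricter default rule set for every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the rules above let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeBot"):
        for path in ("/index.html", "/drafts/post.html", "/private/a.html"):
            print(agent, path, allowed(agent, path))
```

Keep in mind that this only shows how a well-behaved robot would read the file; nothing forces a robot to run this check.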
Problems related to robots rules
The robots-rules method is not totally effective:
- Some engines ignore the robots rules.
- Some software can be configured to ignore these guidelines (e.g. HTTrack).
- Some very bad robots will even follow the URLs that you have asked them not to follow.
Method 2: A little programming
Here is a particularly effective method to prevent robots from indexing your whole site, but it requires modifying every page of the site.
Here's the trick:
- Include a 1x1 transparent GIF in all your pages (at a random location).
- Make this image a link to a special URL on your site (e.g. /dontclickme.php).
- In the dontclickme.php page, add the visitor's IP address to a list of banned IPs (e.g. in a MySQL table). You can ban the IP for 30 minutes or 1 hour.
- On every page of your site, when a request arrives, check that the IP is not in the banned list.
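The steps above can be sketched as follows. This is only a minimal, framework-free illustration in Python, not the author's implementation: the BanList class, the handle_request function, the /pixel.gif image path and the use of an in-memory dictionary instead of a MySQL table are all assumptions made for the example.

```python
import time

TRAP_PATH = "/dontclickme.php"   # trap URL named in the article
BAN_SECONDS = 30 * 60            # the article suggests 30 minutes or 1 hour

# Hypothetical markup for the invisible trap link placed on every page.
TRAP_HTML = (
    '<a href="/dontclickme.php">'
    '<img src="/pixel.gif" width="1" height="1" alt=""></a>'
)


class BanList:
    """In-memory stand-in for the banned-IP table (assumption: no database)."""

    def __init__(self, ban_seconds=BAN_SECONDS, clock=time.time):
        self.ban_seconds = ban_seconds
        self.clock = clock
        self._expires = {}  # maps IP address -> time at which the ban ends

    def ban(self, ip):
        self._expires[ip] = self.clock() + self.ban_seconds

    def is_banned(self, ip):
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expires[ip]  # ban has run out, forget the IP
            return False
        return True


def handle_request(bans, ip, path):
    """Return the HTTP status a page would send for this request."""
    if bans.is_banned(ip):
        return 403            # check done on every page
    if path == TRAP_PATH:
        bans.ban(ip)          # only a link-following robot ends up here
        return 403
    return 200


if __name__ == "__main__":
    bans = BanList()
    print(handle_request(bans, "203.0.113.5", "/index.html"))
    print(handle_request(bans, "203.0.113.5", TRAP_PATH))
    print(handle_request(bans, "203.0.113.5", "/index.html"))
```

Storing an expiry time rather than a simple flag is one way to get the temporary ban described above without a separate clean-up job.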
Using this method:
- Normal users will never click the tiny invisible image. They will not be banned and can navigate easily throughout your site.
- Robots trying to follow all the links will be detected and banned.
Note that this second method is not 100% reliable: it is always possible to configure the HTTrack software to avoid this type of page (dontclickme.php).
Published by deri58
Latest update on October 30, 2012 at 07:35 AM by deri58.