Webmaster - Prevent your website from being copied (robots, spiders and web crawlers)

December 2016

Method 1: The robots.txt file

This file gives instructions to the search engine robots that crawl websites. You can indicate which URLs may be followed and which must be ignored. You can even give a different set of instructions to each search engine.

Read this article: http://ccm.net/contents/web/robots-txt.php3
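
As an illustration of per-robot rules, here is a minimal sketch that uses Python's standard urllib.robotparser module to read a small robots.txt and report what each robot is allowed to fetch. The rules themselves, the robot name "BadBot" and the /private/ path are assumptions made up for this example; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every robot is kept out of /private/,
# and a robot calling itself "BadBot" is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("Googlebot", "/index.html"),
        ("Googlebot", "/private/report.html"),
        ("BadBot", "/index.html"),
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(agent, url) else "disallowed"
        print(f"{agent} -> {url}: {verdict}")
```

Note that this check runs on the robot's side: the file only states what you would like robots to do, and a robot has to choose to read and obey it.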

Problems related to robots rules

The robots-rules method is not totally effective:
  • Some engines ignore the robots rules.
  • Some software can be configured to ignore these guidelines (e.g. HTTrack).
  • Some very bad robots will deliberately follow the URLs that you have asked them not to follow.

Method 2: A little programming

Here is a particularly effective method to prevent robots from crawling your entire site, but it requires modifying every page of your site.

Here's the trick:
  • Include a 1x1 transparent GIF in every page (at a random location).
  • Make this image a link to a special URL of your site (e.g. /dontclickme.php).
  • In the dontclickme.php page, add the visitor's IP address to a list of banned IPs (e.g. stored in a MySQL table). You can ban the IP for 30 minutes or 1 hour.
  • On every page of your site, when a request arrives, check that the IP is not in the banned list, and refuse the request if it is.

Using this method:
  • Normal users will never click the tiny invisible image. They will not be banned and can navigate easily throughout your site.
  • Robots trying to follow all the links will be detected and banned.
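
To see why the trap catches robots and not people, here is a rough sketch of a page with the hidden link placed at a random position, read the way a naive link-following robot would read it. The file name /pixel.gif and the helpers add_trap and links_seen_by_robot are assumptions introduced for this example.

```python
import random
from html.parser import HTMLParser

# The trap: a 1x1 transparent GIF wrapped in a link to the special URL.
# "/pixel.gif" is a hypothetical file name.
TRAP_LINK = (
    '<a href="/dontclickme.php">'
    '<img src="/pixel.gif" width="1" height="1" alt=""></a>'
)


def add_trap(paragraphs, rng=random):
    """Return the page body with the trap placed at a random position."""
    parts = list(paragraphs)
    parts.insert(rng.randint(0, len(parts)), TRAP_LINK)
    return "\n".join(parts)


class LinkCollector(HTMLParser):
    """Collect every href found in <a> tags, visible on screen or not."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_seen_by_robot(html: str):
    """List the URLs a naive robot would queue after parsing the page."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    page = add_trap(['<p><a href="/news.html">News</a></p>', "<p>Welcome!</p>"])
    print(links_seen_by_robot(page))
```

The parser has no notion of what is visible on screen, so /dontclickme.php ends up in the robot's list of URLs to visit next to the normal links, whereas a person browsing the page has nothing visible to click on.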

Note that this second method is not 100% reliable: it is always possible to configure the HTTrack software to skip this type of page (dontclickme.php).


This document entitled « Webmaster - Prevent your website from being copied (robots, spiders and web crawlers) » from CCM (ccm.net) is made available under the Creative Commons license. You can copy and modify copies of this page, under the conditions stipulated by the license, as long as this note appears clearly.