Webmaster - Prevent your website from being copied (robots, spiders and web crawlers)

February 2017



Method 1: The robots.txt file


This file gives instructions to the search engine robots that crawl websites. You can indicate which URLs may be followed and which must be ignored. You can even give a different set of instructions to each search engine.

Read this article: http://ccm.net/contents/web/robots-txt.php3
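To make the idea of per-engine rules concrete, here is a minimal sketch that uses Python's standard urllib.robotparser module to read a sample robots.txt. The rules, the paths (/private/, /drafts/) and the robot name "OtherBot" are assumptions made up for illustration; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule set for Googlebot, a stricter
# default rule set for every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def build_parser(text):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot matches its own rule set, which does not block /drafts/.
print(parser.can_fetch("Googlebot", "/drafts/page.html"))

# Any other robot falls back to the "*" rule set, which blocks /drafts/.
print(parser.can_fetch("OtherBot", "/drafts/page.html"))
```

This is the check that a well-behaved robot performs before requesting a URL. Nothing forces a robot to perform it, which is the weakness described in the next section.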

Problems related to robots rules

The robots.txt method is not totally effective:
  • Some engines ignore the robots rules.
  • Some software can be configured to ignore these guidelines (e.g. HTTrack).
  • Some badly behaved robots will follow URLs that you have asked them not to follow.

Method 2: A little programming


Here is a particularly effective method to prevent robots from crawling your whole site, but it requires modifying every page of the site.

Here's the trick:
  • Include a 1x1 transparent GIF in all your pages (at a random location).
  • Make this image link to a special URL on your site (e.g. /dontclickme.php).
  • In the dontclickme.php page, add the IP address of the requesting client to a list of banned IP addresses (e.g. in a MySQL table). You can ban the IP address for 30 minutes or 1 hour.
  • On every page of your site, when a request arrives, check that the IP address is not on the banned list.



Using this method:
  • Normal users will never click the tiny invisible image. They will not be banned and can navigate easily throughout your site.
  • Robots trying to follow all the links will be detected and banned.


Note that this second method is not 100% reliable: it is always possible to configure the HTTrack software to avoid this type of page (dontclickme.php).

Published by deri58. Latest update on October 30, 2012 at 07:35 AM by deri58.
This document, titled "Webmaster - Prevent your website from being copied (robots, spiders and web crawlers)," is available under the Creative Commons license. Any copy, reuse, or modification of the content should be sufficiently credited to CCM (ccm.net).