How To Prevent Bots From Crawling Your Site

Last updated on September 3rd, 2024 at 04:08 am

Every website administrator needs to protect their site from malicious bots and users. You may also want to prevent bots from crawling your website if you don’t want its content indexed online, or even stolen, without your permission. Here’s how to prevent bots from crawling your site. You can also use these steps to stop spam bots and other malicious bots from crawling your website.

How To Prevent Bots From Crawling Your Site

Here are the different ways to prevent bots from crawling your site.

1. Using robots.txt

Robots.txt is a text file that contains crawling instructions for incoming bots. Search bots, spam bots and other bots look for this file before they crawl your website and proceed depending on the instructions it contains. Robots.txt must be served at the www.yourdomain.com/robots.txt URL. So if your website is www.helloworld.com, then robots.txt should be served at www.helloworld.com/robots.txt.

You can use robots.txt to tell search bots to not crawl your entire website, or specific folders and pages in it.
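You can quickly check that the file is being served by fetching it directly, for example with curl (using the example domain above):

curl https://www.helloworld.com/robots.txt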

There are quite a few rules available to instruct crawl bots. The most common ones are:

  • User-agent: Search bots use the User-agent attribute to identify themselves. You can allow or disallow specific crawl bots by mentioning their user-agent names.
  • Disallow: Specifies the files or folders that are not allowed to be crawled.
  • Crawl-delay: Specifies the number of seconds a bot should wait before crawling each page (see the example after this list).
  • Wildcard (*): Used to match all bots.
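
For example, a minimal robots.txt sketch that asks all bots to wait 10 seconds between requests could look like the following. Note that Crawl-delay is a non-standard rule: some crawlers such as Bingbot honor it, while Googlebot ignores it.

User-agent: *
Crawl-delay: 10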

Bonus Read : NGINX SSL Configuration (Step by Step)

We will look at a few examples of disallowing robots from crawling your site. Here are the user-agent names of common bots for your reference – Googlebot, Yahoo! Slurp, bingbot, AhrefsBot, Baiduspider, Ezooms, MJ12bot, YandexBot.

Disallow all search engines from crawling your website

Here’s what you need to add to your robots.txt file if you want to disallow all bots from crawling your website:

User-agent: *
Disallow: /

In the above configuration, we use the wildcard (*) in the User-agent rule to match all bots, and the root path (/) in the Disallow rule to cover the entire website.

In this case, we disallow all bots from crawling our entire website.

Bonus Read : Linux List All Processes by Name, User, PID

Allow all search engines to crawl your website

Here’s what you need to add to your robots.txt file if you want to allow all bots to crawl your website:

User-agent: *
Disallow:

In the above configuration, we use the wildcard (*) in the User-agent rule to match all crawl bots, and we leave the Disallow rule blank.

In this case, we allow all bots to crawl our entire website.

Bonus Read : How to Prevent Image Hotlinking in NGINX

Disallow One Specific Search Engine from crawling website

If you want to disallow only one specific crawl bot from crawling your website, mention its user-agent name in the User-agent rule:

User-agent: BaiduSpider
Disallow: /

Bonus Read : How to List all virtual hosts in Apache

Disallow All Search Engines from Crawling specific folders

If you want to disallow all search engines from crawling specific folders (e.g. /product, /uploads), list each one in a separate Disallow rule:

User-agent: *
Disallow: /uploads
Disallow: /product

Disallow All Search Engines from Crawling specific files

If you want to disallow all search engines from crawling specific files (e.g. /signup.html, /payment.php), list each one in a separate Disallow rule:

User-agent: *
Disallow: /signup.html
Disallow: /payment.php

You can always use a combination of the above configurations in your robots.txt file.
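
For example, here is a sketch that blocks Baiduspider completely while only keeping all other bots out of the /uploads folder (the folder name is just an illustration):

User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /uploads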

2. Using CAPTCHA

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It uses a programmatic approach to determine whether a website visitor is a human or a bot, by requiring the visitor to complete a task, such as solving a puzzle or typing distorted characters, that only a human can reliably perform. CAPTCHAs are very effective at blocking bots on websites.
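
As a rough illustration, embedding a CAPTCHA widget such as Google reCAPTCHA v2 into a form looks roughly like this (the site key and form action are placeholders, and the returned token still has to be verified on your server):

<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<form action="/signup" method="POST">
  <!-- your regular form fields go here -->
  <div class="g-recaptcha" data-sitekey="YOUR_SITE_KEY"></div>
  <input type="submit" value="Submit">
</form>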

3. By Blocking IP Address

In most cases, these bots operate from one or more specific IP addresses. If they are repeatedly hurting your site, open your server log using a log management tool, identify the suspicious IP addresses, and block them using a security plugin or firewall. If your website uses a CMS such as WordPress, Magento or Joomla, then you can use one of the many security plugins that are available out of the box.

You can also use firewall tools such as iptables and ufw, or intrusion-prevention tools such as fail2ban, to block malicious IP addresses. If you are using a cloud service such as AWS, then you can also block these IP addresses directly from the cloud provider’s management console.
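
For example, assuming the offending address is 203.0.113.10 (a placeholder from the documentation range), you could block it with ufw or iptables roughly like this:

# Block a single IP address with ufw
sudo ufw deny from 203.0.113.10

# Or block it with iptables (not persistent across reboots unless saved)
sudo iptables -A INPUT -s 203.0.113.10 -j DROP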

4. Using HTTP Authentication

HTTP Authentication is a basic authentication mechanism provided by almost every web server, including Apache and NGINX. Website administrators can enable it to require users to enter a username and password before accessing one or more web pages on their site. Unless the user enters valid login credentials, they will not be able to access the requested page. Since bots cannot supply credentials, this effectively blocks them as well.
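
As a minimal sketch on NGINX (the protected path, username and file locations here are just placeholders; on Debian/Ubuntu the htpasswd tool ships with the apache2-utils package), first create a password file:

sudo htpasswd -c /etc/nginx/.htpasswd admin

Then add the following inside the relevant server block of your NGINX configuration and reload NGINX:

location /admin/ {
    auth_basic "Restricted Area";
    auth_basic_user_file /etc/nginx/.htpasswd;
}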

5. Using .htaccess file

If your website or blog runs on an Apache web server, then you can use the .htaccess file to block web traffic from bots. This file is commonly used for redirecting traffic and rewriting URLs, and it lets you easily identify and block visitors based on different parameters in their requests. Most bots can be identified by the User-Agent attribute in the request header. For example, if you have already enabled mod_rewrite on your site, add the following lines to your .htaccess file to block Googlebot from crawling your site.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [F,L]
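
The [F] flag returns a 403 Forbidden response to the matching bot. Similarly, if you want to block several bots at once, you can chain RewriteCond lines with the [OR] flag. Here is a sketch using some of the bot names listed earlier:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]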

Conclusion

Hopefully, you can now easily disallow bots from crawling your website. In this article, we have covered several effective ways to identify and block bot traffic on your website. You may need to use more than one of them to stop it effectively.
