Sometimes you may need to prevent SEO bots from crawling your website if you don’t want its content indexed online. Here’s how to prevent SEO bots from crawling your site using a robots.txt file. You can also use these steps to stop spam bots and malicious bots from crawling your website.
How To Prevent SEO Bots From Crawling Your Site
Here are the steps to prevent SEO bots from crawling your site using a robots.txt file.
What is robots.txt?
Robots.txt is a text file that contains crawling instructions for incoming bots. Search bots, spam bots and other bots look for this file before they crawl your website, and proceed according to the instructions in it. Robots.txt must be served at the www.yourdomain.com/robots.txt URL. So if your website is www.helloworld.com, then robots.txt should be served at www.helloworld.com/robots.txt.
You can use robots.txt to tell search bots to not crawl your entire website, or specific folders and pages in it.
There are quite a few rules available to instruct crawl bots. The most common ones are:
- User-agent: Search bots use the User-agent field to identify themselves. You can allow/disallow crawl bots by mentioning their user agent names.
- Disallow: Specifies the files or folders that bots are not allowed to crawl.
- Crawl-delay: Specifies the number of seconds a bot should wait between requests.
- Wildcard (*): Used in the User-agent rule to mean all bots.
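You can try these rules out offline with Python’s standard-library urllib.robotparser, which is a reasonable approximation of how compliant crawlers interpret a robots.txt file. This is a sketch; the domain and paths are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt combining the rules above (example.com is hypothetical)
rules = """\
User-agent: *
Disallow: /private
Crawl-delay: 10
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot falls under the * wildcard group
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))         # True
print(rp.crawl_delay("Googlebot"))                                         # 10
```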
How to Prevent Search Bots from crawling your website
We will look at a few examples of disallowing robots from crawling your site. Here are the user-agent names of common bots for your reference: Googlebot, Slurp (Yahoo!), bingbot, AhrefsBot, Baiduspider, Ezooms, MJ12bot, YandexBot.
Disallow all search engines from crawling your website
Here’s what you need to add to your robots.txt file if you want to disallow all bots from crawling your website:
User-agent: *
Disallow: /
In the above configuration, we use the wildcard * in the User-agent rule to match all bots, and the root path (/) in the Disallow rule to cover the entire website. As a result, all bots are disallowed from crawling the entire website.
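You can sanity-check this configuration offline with Python’s urllib.robotparser (www.helloworld.com is just the example domain from earlier):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# The wildcard group applies to every bot, and / blocks every path
print(rp.can_fetch("Googlebot", "https://www.helloworld.com/any/page"))  # False
print(rp.can_fetch("AhrefsBot", "https://www.helloworld.com/"))          # False
```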
Allow all search engines to crawl your website
Here’s what you need to add to your robots.txt file if you want to allow all bots to crawl your website:
User-agent: *
Disallow:
In the above configuration, we use the wildcard * in the User-agent rule to match all crawl bots, and leave the Disallow rule blank. As a result, all bots are allowed to crawl the entire website.
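The same kind of offline check with urllib.robotparser confirms that a blank Disallow value blocks nothing:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow:"])

# An empty Disallow value means no path is blocked
print(rp.can_fetch("Googlebot", "https://www.helloworld.com/any/page"))  # True
```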
Disallow One Specific Search Engine from crawling website
If you want to disallow only one specific crawl bot from crawling your website, mention its user-agent name in the User-agent rule:
User-agent: Baiduspider
Disallow: /
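Checking again with urllib.robotparser: only the named bot is blocked, while bots not listed in the file remain unaffected (user-agent matching in robotparser is case-insensitive):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: Baiduspider", "Disallow: /"])

# Only the named bot is blocked; unlisted bots are unaffected
print(rp.can_fetch("Baiduspider", "https://www.helloworld.com/page"))  # False
print(rp.can_fetch("Googlebot", "https://www.helloworld.com/page"))    # True
```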
Disallow All Search Engines from Crawling specific folders
If you want to disallow all search engines from crawling specific folders (e.g. /product, /uploads), mention each one in a separate Disallow rule:
User-agent: *
Disallow: /uploads
Disallow: /product
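A quick urllib.robotparser sketch shows that paths under the listed folders are blocked while the rest of the site stays crawlable:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /uploads", "Disallow: /product"])

# Paths under the listed folders are blocked; everything else is allowed
print(rp.can_fetch("Googlebot", "https://www.helloworld.com/uploads/img.png"))  # False
print(rp.can_fetch("Googlebot", "https://www.helloworld.com/product/42"))       # False
print(rp.can_fetch("Googlebot", "https://www.helloworld.com/blog/post"))        # True
```

Note that Disallow uses prefix matching, so /uploads would also block a path like /uploads-archive; add a trailing slash (/uploads/) if you want the rule limited to that folder alone.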
Disallow All Search Engines from Crawling specific files
If you want to disallow all search engines from crawling specific files (e.g. /signup.html, /payment.php), mention each one in a separate Disallow rule:
User-agent: *
Disallow: /signup.html
Disallow: /payment.php
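And one last urllib.robotparser check for the per-file version: only the listed files are blocked:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /signup.html", "Disallow: /payment.php"])

# Only the listed files are blocked
print(rp.can_fetch("Googlebot", "https://www.helloworld.com/signup.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.helloworld.com/payment.php"))  # False
print(rp.can_fetch("Googlebot", "https://www.helloworld.com/index.html"))   # True
```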
You can always use a combination of the above configurations in your robots.txt file.
Hopefully, now you can easily disallow SEO bots from crawling your website.