Detecting and blocking bad bots

If there’s one constant online in recent months, it’s that security is a must.
Whether because of significant security breaches, such as the one at Yahoo!; concerns about state-sponsored privacy invasions, such as by the NSA or GCHQ; or state-sponsored hacking, we have to take security seriously.

This is all the more important given that we commonly expect a right to privacy, and expect that right to be protected.
But much as we’d like to believe that we should be able to live in a world free from these and other malicious activities, that kind of thinking represents a utopia which just doesn’t exist.

Given that, we have to be proactive in defending ourselves from malicious external forces.
In this article, I’m going to focus on one specific kind of malicious actor: the bad bot.

What is a Bad Bot?

There are all kinds of bots, serving all manner of purposes.
A number exist for beneficial purposes, such as the bots from the major search engines, including Google and Bing.

These provide helpful services, such as making our online content available for people to find, based on their search queries.
Bad bots, however, serve quite a different purpose.
They can be used for a range of malicious ends, including:

  • Scraping sites for information that can be repackaged by a competitor, such as user data
  • Mounting a distributed denial-of-service (DDoS) attack
  • Stealing information and re-posting it under another identity

In addition to the commercial and brand impacts, bots such as these also have other consequences, including:

  • Inflating search traffic results, effectively distorting site metrics
  • Increasing website traffic load, requiring sites to invest in unnecessary hardware infrastructure

Whatever the consequences, though, they have a direct impact on us; an impact that needs to be minimized, if not prevented outright.

So How Do You Identify and Stop Them?

That’s what I’m going to show you in the remainder of this article.
Specifically, I’m going to walk you through four potential options.
These are:

  1. Homegrown solutions
  2. Plugins, Tools, and Extensions
  3. Relying on Hosting & Solution Providers
  4. Proactive Solutions

Before we dive into them, I’d like you to bear something in mind: regardless of your budget, or the lack thereof, bad bots are not something you’re ever likely to defeat outright.
Yes, you can minimize the effect that they have, but doing so is a process you’ll have to stay continuously proactive about.

You can’t set up a solution once and then forget about it.
It doesn’t work that way.
To highlight this point, consider these five key takeaways from Distil Networks’ fourth annual Bad Bot Report:

  1. 40% of all web traffic in 2016 originated from bots.
  2. 97% of websites with proprietary content and/or pricing are being hit by unwanted scraping.
  3. 31% of websites with forms are hit by spam bots.
  4. 90% of websites were hit by bad bots that were behind the login page.
  5. 75% of bad bots were Advanced Persistent Bots (APBs). An Advanced Persistent Bot is one that can perform a sophisticated interaction with sites, such as by being able to “load JavaScript, hold onto cookies, and load up external resources, or persistent, in that they can randomize their IP address, headers, and user agents.”

And here’s another point to consider: while the 2016 Bad Bot Landscape Report notes that the frequency of bot attacks is diminishing, their sophistication is dramatically increasing.

CMSWire reported that:

88 percent of bots in the wild execute JavaScript and perform evasion techniques like rotating IPs and low-frequency attacks (i.e. fewer requests from far more machines). We’re observing the human actors behind the bot attacks responding to anti-bot technologies and improving their methods accordingly.

So how do you diminish their effectiveness?
Let’s find out.

Homegrown Solutions

One of the first, and potentially least cost-intensive, steps is creating your own solution.
This involves three broad steps:

  1. Observe the bots that aren’t respecting your robots.txt directives
  2. Log them
  3. Use web server directives to block them

This may seem like a simplistic solution.
So let’s step through it in more detail.

Decent, or well-designed, bots will respect a website’s robots.txt file, such as Sqreen’s, which looks like this:

User-agent: *
Disallow: /private-beta-gem
Disallow: /gems
Disallow: /blog

This says that any bot, regardless of origin and intent, is allowed to parse any part of Sqreen’s website, except for three areas: /private-beta-gem, /gems, and /blog.
Now you have an immediate way of differentiating good ones from bad ones.
A good bot will respect the rules and not browse the three disallowed areas.
A bad one will browse them regardless.
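To make the distinction concrete, here’s a minimal Python sketch, using the standard library’s robotparser, of how a well-behaved bot checks a path against the rules above before requesting it (the bot name is an illustrative assumption):

```python
from urllib import robotparser

# The robots.txt rules quoted above; a well-behaved bot parses these
# before requesting any URL on the site.
rules = """User-agent: *
Disallow: /private-beta-gem
Disallow: /gems
Disallow: /blog
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A good bot checks each path before fetching it.
print(parser.can_fetch("MyBot/1.0", "/pricing"))  # True: allowed
print(parser.can_fetch("MyBot/1.0", "/gems"))     # False: disallowed
```

A bad bot simply skips this check, which is exactly what the next two steps exploit.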

That’s step one complete.
Now for step two: finding out who’s browsing the disallowed areas, and logging them.

For that, there are two approaches you can take:

Parse your log files: When you review your log files, if you see requests for content in any of these three directories, you know that the bot may either be poorly developed or have malicious intent.
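For the first approach, here’s a minimal Python sketch that scans an Apache “combined” format access log and counts clients requesting disallowed paths. The log format, sample lines, and disallowed prefixes are assumptions; adapt them to your own configuration:

```python
import re
from collections import Counter

# Matches one line of Apache's "combined" log format, capturing the
# client IP, request path, and user agent string.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" \d+ \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
DISALLOWED = ("/private-beta-gem", "/gems", "/blog")

def suspects(lines):
    """Return a Counter of (ip, agent) pairs that hit disallowed paths."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path").startswith(DISALLOWED):
            hits[(m.group("ip"), m.group("agent"))] += 1
    return hits

sample = [
    '203.0.113.7 - - [10/Oct/2017:13:55:36 +0000] "GET /gems HTTP/1.1" 200 512 "-" "EvilBot/1.0"',
    '198.51.100.2 - - [10/Oct/2017:13:55:40 +0000] "GET /pricing HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(suspects(sample))  # counts only the /gems request from EvilBot
```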

Create a honey pot (or trap): This involves creating a link (normally on your site’s home page) to a section of your site that shouldn’t be traversed. It will be invisible to humans, but not to bots. If the bot makes a request to the link, then a script (written in your scripting language of choice) will record details about the bot to a data file. This area of your site should be blocked in your robots.txt file as well.
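As a sketch of the honey pot idea, here’s a minimal WSGI middleware in Python that records any client requesting the trap URL. The trap path, log file name, and record fields are all illustrative assumptions:

```python
import json
import time

# Hypothetical trap path: link it invisibly from your home page and
# remember to Disallow it in robots.txt as well.
TRAP_PATH = "/trap"
LOG_FILE = "bad_bots.jsonl"

def honeypot_middleware(app):
    """WSGI middleware sketch: log any client that requests the trap URL."""
    def wrapper(environ, start_response):
        if environ.get("PATH_INFO") == TRAP_PATH:
            record = {
                "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
                "ip": environ.get("REMOTE_ADDR", ""),
                "agent": environ.get("HTTP_USER_AGENT", ""),
            }
            # Append one JSON record per offending request.
            with open(LOG_FILE, "a") as fh:
                fh.write(json.dumps(record) + "\n")
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        # Anything else passes through to the real application.
        return app(environ, start_response)
    return wrapper
```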

Either way, you can now start collating a database of bots that you need to block from your site.

Now to step three: blocking them.
To block them, you need only turn to your web server’s configuration files and add directives that block requests matching the information in your database.

Perishable Press suggests four ways of blocking them:

  • By user agent: Helpful if bots identify themselves with a custom user agent string, different from the standard browsers and search engine bots.
  • By referrer: Helpful in cases where the referrer is a known bad bot or spam referrer.
  • By IP address: Helpful in cases where the IP address is known to be linked to bad bots.
  • By request URI: This is useful when none of the other three cases works.

And here are some examples that they provide of how to configure Apache 2.x to block bots using each approach.

By User Agent


RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (EvilBot|ScumSucker|FakeAgent) [NC]
RewriteRule (.*) - [F,L]

By Request URI


RewriteEngine On
RewriteCond %{QUERY_STRING} (evil|spam) [NC]
RewriteRule (.*) - [F,L]

By Referrer


RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://(.*)spam-referrer\.org [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.*)content-thief\.biz [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.*)bandwidth-leech\.net [NC]
RewriteRule (.*) - [F,L]

By IP Address


RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^123\.456\.789\.000 [OR]
RewriteCond %{REMOTE_ADDR} ^111\.222\.333\.000 [OR]
RewriteCond %{REMOTE_ADDR} ^444\.555\.777\.000
RewriteRule (.*) - [F,L]

In the user agent example above, if the user agent string contains EvilBot, ScumSucker, or FakeAgent, the request will be denied.
And that’s one way, albeit a simplistic one, of reducing the impact of bad bots on your website(s).

Use Plugins, Tools, and Extensions

The catch with the above homegrown solution, as with any homegrown solution, is that you have to dedicate resources to both creating and maintaining it.
Given that bots are becoming increasingly sophisticated, you’ll have to consistently update your bad bot database and the accompanying web server rules to keep them at bay.

As a result, a plugin or web server extension may be more efficient.
What’s more, depending on your skill set, they may be the better solution.

For example, if you run one or more websites using WordPress, then you can make use of plugins such as Blackhole for Bad Bots or StopBadBots.
Or if your site’s based on Drupal, you can use Badbot.

However, if you are either a seasoned systems administrator or have those resources available, then two other solutions may be more to your liking; these are:

bad-bot-blocker

This is a list of 223 Apache .htaccess rules for blocking bad bots.
It defines bad bots based on a range of categories, including:

  • E-mail harvesters
  • Content scrapers
  • Spam bots
  • Vulnerability scanners
  • Bots linked to viruses or malware
  • Government surveillance bots

While it’s quite similar to the homegrown solution, there’s a key difference: it isn’t something that you created yourself and that your team alone has to maintain.
It’s a solution developed and maintained by others (and by your team too, if you choose to help out).
As a result, for at least the same functionality, the investment is significantly less.
What’s more, it draws on a much wider array of experience and input, which can make it a better solution, for much less cost.

To install it, you can copy the .htaccess file to your application’s directory, or add the configuration to your existing .htaccess file.

Note: If you do use it, I’d encourage you to add the configuration to Apache’s core configuration, as .htaccess files, while effective, can significantly increase web server load.

apache-ultimate-bad-bot-blocker

This package is self-described as:

The Ultimate Bad Bot, User-Agent and Spam Referrer Blocker for Apache Web Servers (2.2 > 2.4+). It is designed to be an Apache include file and uses the Apache BrowserMatchNoCase directive

It aims to stop bots that are:

  • Bad Referrers
  • Spam Referrers
  • Linked to Lucrative Malware, Adware, and Ransomware Clickjacking Campaigns
  • Vulnerability scanners
  • Linked to Gambling and Porn Web Sites
  • E-mail harvesters
  • Image Hotlinking Sites and Image Thieves
  • Government surveillance bots
  • Part of Botnet Attack Networks, such as Mirai
  • SEO companies that your competitors use to try to improve their SEO
  • Sources of Google Analytics ghost spam

Installing it takes more time and expertise than bad-bot-blocker.
However, it’s much more feature-rich and comprehensive in its level of protection.
So if you want a solution that you don’t need to maintain, one that’s comprehensive, and one that’s regularly updated, then this is the choice to make.

Use Your Hosting Provider and CDN

Now that we’ve looked at actions that you can take directly, what about making use of your hosting provider or related hosting infrastructure, such as Fastly, KeyCDN, Cloudflare, or Amazon CloudFront, to reduce the impact of bad bots on your site?

For example, back in April of 2016, KeyCDN released a new feature designed to block bad bots.
The feature:

…uses a comprehensive list of known bad bots and blocks them based on their User-Agent string.

While not comprehensive, it is a handy feature that can help reduce the effects of bad bots on your website.
If you’re not with KeyCDN, what do your hosting providers offer that’s similar, or better?
Why not try Cloudflare? They have an excellent free starter option, which provides limited DDoS protection along with comprehensive user documentation.
If you are using Cloudflare (or will in the future), try this tutorial from DigitalOcean, which steps you through the process of mitigating DDoS attacks.

Web Application Firewalls

And finally, you could use a Web Application Firewall (WAF).
If you’re unfamiliar with them, a WAF is a special-purpose firewall for HTTP/S-based applications which:

Applies a set of rules to an HTTP conversation. These rules cover common attacks such as cross-site scripting (XSS) and SQL injection. While proxies generally protect clients, WAFs protect servers. A WAF is deployed to protect a specific web application or set of web applications. A WAF can be considered a reverse proxy.

That description aside, you can still use a WAF to protect your site against bots, based on the aforementioned criteria of user agents, IP address, geolocation and so on.
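To make those criteria concrete, here’s a toy Python sketch of WAF-style rule matching on user agent, IP address, and referrer. The block lists and function name are illustrative assumptions; real WAFs apply far richer, regularly updated rule sets:

```python
import ipaddress

# Illustrative block lists: substrings of known-bad user agents,
# IP ranges, and spam referrer domains.
BLOCKED_AGENTS = ("evilbot", "scumsucker", "fakeagent")
BLOCKED_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]
BLOCKED_REFERRERS = ("spam-referrer.org", "content-thief.biz")

def allow_request(ip, user_agent, referrer):
    """Return False if any rule matches the request."""
    if any(bad in user_agent.lower() for bad in BLOCKED_AGENTS):
        return False
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return False
    if any(bad in referrer for bad in BLOCKED_REFERRERS):
        return False
    return True

print(allow_request("203.0.113.9", "Mozilla/5.0", ""))   # True
print(allow_request("198.51.100.7", "Mozilla/5.0", ""))  # False: blocked network
print(allow_request("203.0.113.9", "EvilBot/2.0", ""))   # False: blocked agent
```

A commercial WAF wraps exactly this kind of check, plus signature and behavioral analysis, around every request before it reaches your application.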

The catch with WAFs, however, as I’ve covered previously, is that, like homegrown solutions, they may require more of an investment than you’re prepared or able to make.
What’s more, they have other shortcomings, such as generating false positives and negatives.

In Conclusion

We’ve stepped through a number of recommendations in this guide, so it may not be completely clear which one is the right one to choose. To make it easier, have a look through the following table.
It’s hard to be 100% precise. But it should give you an approximate indication of each solution, based on complexity, cost, ease of setup and maintenance (easiness), and effectiveness.

Method                             Complexity  Cost   Easiness  Effectiveness
Homegrown Solutions                ***         ***    **        ***
bad-bot-blocker                    *           *      *****     **
apache-ultimate-bad-bot-blocker    *           *****  ****      ***
Use Your Hosting Provider and CDN  **          ***    ***       ****
Web Application Firewalls          ****        ***    *         ****

And that concludes this guide to what bad bots are, along with a variety of ways to prevent them from detrimentally impacting your website(s).

I hope that you now have a better appreciation of just how significant bad bots can be to your website, business, brand, and reputation. I also hope that you are now both much better informed as well as in a much better position to reduce the impact of bots on your website.

Depending on your budget and resources, I’m confident that one, or a combination of the solutions outlined will help you improve the quality of your website’s traffic.

If you have any questions or want to share your experience, please do so in the comments.
I’d love to know what you think.

About the author

Matthew Setter is an independent software developer and technical writer. He specializes in creating test-driven applications and writing about modern software practices, including continuous development, testing, and security.

Robot Icon by Creaticca Creative Agency from the Noun Project