Bad bots are eating the world. But developers will save us all

Protection against scrapers and bad bots is a big topic for many company departments. However, it’s always developers or DevOps who end up implementing a bad bot mitigation solution. It’s a never-ending cat and mouse game against bots whose obfuscation mechanisms keep improving.

In this article, we will explain the technical implementation of efficient bot detection and protection, and show that there is a code-level alternative to classic network-based detection.

But first, what’s a bad bot?

What’s a (bad) bot?

Bots are used for a wide range of use cases.
Many of them are beneficial to you.

Good Bots

Search engine crawlers that respect the WWW::RobotRules, like:

  • GoogleBot
    • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    • Googlebot/2.1 (+http://www.google.com/bot.html)
    • Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Bingbot
    • Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
    • Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
    • Mozilla/5.0 (Windows Phone 8.1; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 530) like Gecko (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  • DuckDuckGo Bot
    • DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)

Website health checks launched at the user’s request, like:

  • PingDom
    • Pingdom.com_bot_version_1.4_(http://www.pingdom.com/);
  • Botify
    • Mozilla/5.0 (compatible; botify; http://botify.com)

Security bots launched at the user’s request, like:

  • Detectify
    • Mozilla/5.0 (compatible; Detectify) +https://detectify.com/bot/
  • NMAP
    • Mozilla/5.0 (compatible; Nmap Scripting Engine; http://nmap.org/book/nse.html)
  • Sucuri
    • Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 MSIE 7.0

But others are just malicious and should be blocked.

Bad Bots

Scrapers

  • WebZip
    • WebZIP

DDoS bots

  • Nitol
    • Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1)
  • Cyclone
    • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    • Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Vulnerability Scanning

  • Mars brute force attacks
    • Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
  • Blitz
    • Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)
  • Nessus
    • NESSUS::SOAP Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)

So how can we block these malicious bots?

How to block bad bots, the ineffective way

Blocking from basic network attributes

An old school technique to block bad bots is to use one or a combination of the following criteria:

  • User agents
  • Referrers
  • IP addresses
  • Request URIs

We’ve already detailed these techniques in a previous article. But they aren’t recommended, as they create false positives and, even more often, false negatives. If you just look at the user agents of the bad bots listed above, you can see that many of them aren’t really identifiable.

Of course, a request coming from a user agent that is known to be a vulnerability scanner should be blocked. But with all the new spoofing techniques, it’s impossible to build a powerful protection that only blocks from a list of user agents, referrers or any other static list. Bypassing such a rule takes less than 5 minutes to develop and deploy.
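To make this concrete, here is a minimal sketch of a static user-agent blocklist, checked against user agents listed earlier in this article. The `BLOCKED_PATTERNS` list and the `blocked_by_user_agent?` helper are hypothetical names chosen for illustration; they are not taken from any of the plugins mentioned here.

```ruby
# Hypothetical static blocklist, similar in spirit to what server plugins maintain.
BLOCKED_PATTERNS = [/WebZIP/i, /NESSUS/i, /Nmap Scripting Engine/i].freeze

# Returns true when the user agent matches one of the blocked patterns.
def blocked_by_user_agent?(user_agent)
  BLOCKED_PATTERNS.any? { |pattern| pattern.match?(user_agent.to_s) }
end

# A scraper that announces itself is caught.
webzip = "WebZIP"
puts blocked_by_user_agent?(webzip) # true

# False negative: the Cyclone DDoS bot presents a Googlebot user agent and passes.
cyclone = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
puts blocked_by_user_agent?(cyclone) # false

# False positive: an Nmap scan that you launched yourself is blocked.
nmap = "Mozilla/5.0 (compatible; Nmap Scripting Engine; http://nmap.org/book/nse.html)"
puts blocked_by_user_agent?(nmap) # true
```

The list only catches bots that are honest about who they are, which is exactly what a malicious bot will not be.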

[Image: changing the user agent from a terminal]

Whether you’re maintaining a blocklist on your server yourself or using a server plugin like Bad-Bot-Blocker or apache-ultimate-bad-bot-blocker, those solutions won’t help you stop badly behaving scrapers, bots or spiders.

Blocking via network redirection

An alternative that is still very popular today is to redirect your whole traffic to a third party that will block bad bots for you.

If you’re…

  • Not working in a privacy-sensitive domain and can just pass your whole traffic and data to a third party (think GDPR, HIPAA etc.)
  • Not in a website-performance-driven business that requires fast responses for optimized conversions (think e-commerce, media)
  • Not afraid of having a single point of failure on your whole infrastructure

… then you might be able to use this type of solution.

However, even if more powerful than a raw server plugin, the detection mechanism of DNS-redirection tools is still just based on flat network data. This approach is limited to the surface of your application and cannot work on your business logic – which is what we actually need to protect.

But, damn. Bot fingerprinting is hard. These bot protection solutions are easily bypassed at the network level: crawlers can fake a browser, change their user agent & headers, language, adapt their response & query time to simulate a human, even rely on botnets of hacked browsers to hide behind legit browsers & IP addresses…

From the code: a modern approach to bad bot mitigation

We will use a fictional case study to illustrate more advanced bad bot mitigation implementations.
It shows that an implementation closer to your business logic can solve the bot mitigation issue much more efficiently (this would be hard to achieve from HTTP headers, right?).

Consumption-based content scraping protection

Imagine your company is a content-intensive platform that centralizes a lot of data about other companies. Think LinkedIn, Capterra, Crunchbase, and others. You allow logged-in users to browse your website and read your data. The platform uses Ruby on Rails, with MongoDB as its database.

Blocking HTTP requests, at the end of the day, is not what you’re looking for. What you really want is to prevent your own data, stored in your MongoDB database, from being retrieved in bulk (and not just to block bots).

If you know that a typical user checks an average of 10 companies, then the logic you want to implement is:

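# Count this read, then compare it with the typical usage of 10 companies
x = count("company_info", current_user)
if x <= 10
    logger.debug("OK")
else
    raise "GO AWAY"
end
# Only reached when the user is within the threshold
Companies.find()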
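The logic above can be sketched as a small runnable class. This is only an illustration under stated assumptions: `ConsumptionGuard` and `QuotaExceeded` are hypothetical names, and the in-memory hash stands in for the counter that a real application would persist in its database.

```ruby
# Raised when a user reads more company pages than a typical user would.
class QuotaExceeded < StandardError; end

# Hypothetical in-memory stand-in for a counter persisted in MongoDB.
class ConsumptionGuard
  def initialize(threshold)
    @threshold = threshold
    @counts = Hash.new(0) # [action, user_id] => number of reads
  end

  # Increments the counter for this user and action, then raises
  # QuotaExceeded once the count goes past the threshold.
  def check!(action, user_id)
    count = (@counts[[action, user_id]] += 1)
    if count > @threshold
      raise QuotaExceeded, "#{user_id} exceeded #{@threshold} #{action} reads"
    end
    count
  end
end

guard = ConsumptionGuard.new(10)
10.times { guard.check!("company_info", "424242") } # typical usage, no error

begin
  guard.check!("company_info", "424242") # 11th read goes past the threshold
rescue QuotaExceeded => e
  puts e.message
end
```

Raising an exception keeps the check in front of the database query, so the data is never fetched for a user who is over the threshold.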

To implement this in your app, you could define something like:

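# Assumes a Mongoid model named Counter with an integer field named count
def count(what, user)
    counter = Counter.find_or_create_by(name: what, dimension: "user", value: user)
    counter.inc(count: 1)
    counter.count
end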

This counter is stored per user, and an action can be triggered every time the user passes a specific threshold.

<div class="g-recaptcha"
          data-sitekey="your_site_key"
          data-callback="onSubmit"
          data-size="invisible"></div>
IF count("company_info", current_user) > 10
THEN grecaptcha.execute();

The only “complexity” here is to gather those events, from your instances, in a single DB.

A malicious bot will get a Captcha that it won’t be able to solve, and its bypass method would certainly be to create a new account and restart the process.

If you add a similar code-level rule to prevent fake account creation, this problem should be solved.
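Such a rule could reuse the same counting idea on another dimension, for example the IP address behind each sign-up. The sketch below is an assumption-laden illustration: `SignupLimiter` and `MAX_SIGNUPS_PER_IP` are hypothetical names, the limit of 3 is arbitrary, and the in-memory hash again stands in for a shared database.

```ruby
# Hypothetical limit: the value 3 is an arbitrary example, not a recommendation.
MAX_SIGNUPS_PER_IP = 3

# Counts account creations per IP address.
class SignupLimiter
  def initialize(limit = MAX_SIGNUPS_PER_IP)
    @limit = limit
    @signups = Hash.new(0) # ip => number of sign-up attempts
  end

  # Records a sign-up attempt and returns true while the IP is within its limit.
  def allow?(ip)
    @signups[ip] += 1
    @signups[ip] <= @limit
  end
end

limiter = SignupLimiter.new
results = 4.times.map { limiter.allow?("203.0.113.7") }
p results # [true, true, true, false]
```

Keying on the IP address alone will not stop a botnet, so in practice you would likely combine several dimensions.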

You could, of course, increase the complexity of this model to offer the best user experience possible to your users.
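One possible refinement, sketched below, is to count reads inside a sliding time window and to escalate gradually: allow, then Captcha, then block. All names and values here (`SlidingWindowPolicy`, the one-hour window, the thresholds of 10 and 50) are assumptions made for illustration; you would tune them from your own usage data.

```ruby
# Hypothetical window and thresholds, chosen only for this example.
WINDOW_SECONDS    = 3600
CAPTCHA_THRESHOLD = 10
BLOCK_THRESHOLD   = 50

# Keeps the timestamps of recent reads per user and maps the number of
# reads inside the sliding window to an action.
class SlidingWindowPolicy
  def initialize
    @events = Hash.new { |hash, key| hash[key] = [] }
  end

  # Records one read at time `now` and returns :allow, :captcha or :block.
  def record(user_id, now = Time.now)
    events = @events[user_id]
    events << now
    events.reject! { |t| now - t >= WINDOW_SECONDS } # drop reads outside the window
    if events.size > BLOCK_THRESHOLD
      :block
    elsif events.size > CAPTCHA_THRESHOLD
      :captcha
    else
      :allow
    end
  end
end

policy = SlidingWindowPolicy.new
start = Time.at(0)
10.times { |i| policy.record("424242", start + i) } # ten reads, all allowed
p policy.record("424242", start + 10)   # 11th read in the window => :captcha
p policy.record("424242", start + 7200) # two hours later, window is empty => :allow
```

A regular user who comes back the next day starts fresh, while a scraper that keeps going is escalated from Captcha to block.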

Conclusion

What we illustrated with this example is that bot protection is a business-specific topic. To implement such business logic, you need to work deeper in your code. Relying on simple bot detection solutions that only look at network data isn’t enough. Of course, maintaining this in your code for every critical asset is a pain. The other big drawback of such implementations is the heavy performance overhead they can bring.

This is where Sqreen comes in: it lets you outsource this management by adding a simple SDK to your code.
You just need to add two lines to the business logic you want to protect.
To track an event, a single line of code does the job:

Sqreen.track("custom_event", {"user" => "424242", "ip" => "8.8.8.8"})

To ask Sqreen whether the dimension is triggering the alarm, you can just test the boolean:

Sqreen.alarm_triggered?("custom_event", {"user" => "424242", "ip" => "8.8.8.8"})

And in the interface you just say:

[Image: Sqreen security rule]

Yes – you understood that bot protection is a cat & mouse game, right? The more context you can use to work on business logic – the better you’ll be at catching those and reacting. So, no, developers will not save the world 😢. But they’ll definitely do way better than a regular expression…

Create your account today to implement bad bot protection in your app.