
Using Fail2Ban to protect a Varnished site from scrapers

I'm using Varnish to cache a fairly busy property site.  Varnish works like a bomb for normal users and has greatly improved our page load speed.

For bots that scrape the site, though, presumably to add the property listings to their own sites, the cache is next to useless, since they trawl sequentially through the whole site.

I decided to use fail2ban to block IPs that hit the site too often.

The first step was to enable a disk-based access log for Varnish so that fail2ban has something to work with.

This means setting up varnishncsa.  Add this to your /etc/rc.local file:

 varnishncsa -a -w /var/log/varnish/access.log -D -P /var/run/varnishncsa.pid  

This starts varnishncsa in daemon mode and appends each Varnish request to /var/log/varnish/access.log.
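
To check that entries are arriving you can tail the log. Assuming varnishncsa's default NCSA combined output format, a line looks roughly like the one below (the IP, path and timestamp are placeholders):

 tail -f /var/log/varnish/access.log  
 203.0.113.7 - - [12/Mar/2013:10:15:32 +0200] "GET /listings/123 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"  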

Now edit or create /etc/logrotate.d/varnish and make an entry to rotate this access log:

 /var/log/varnish/*log {  
     create 640 http log  
     compress  
     postrotate  
         /bin/kill -USR1 `cat /var/run/varnishncsa.pid 2>/dev/null` 2> /dev/null || true  
     endscript  
 }  
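
You can sanity-check the logrotate entry with a debug run, which prints what logrotate would do without actually rotating anything:

 logrotate -d /etc/logrotate.d/varnish  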

Install fail2ban. On Ubuntu:

 apt-get install fail2ban  

Edit /etc/fail2ban/jail.conf and add a block like this:

 [http-get-dos]  
 enabled = true  
 port = http,https  
 filter = http-get-dos  
 logpath = /var/log/varnish/access.log  
 maxretry = 300  
 findtime = 300  
 #ban for 10 minutes  
 bantime = 600  
 action = iptables[name=HTTP, port=http, protocol=tcp]  

This means that if a visitor makes 300 (maxretry) requests in 300 (findtime) seconds then a ban of 600 (bantime) seconds is applied. In other words, sustaining an average of more than one request per second over five minutes triggers the ban.
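
As an aside (not part of the original walkthrough): rather than editing jail.conf directly, you can put the same block in /etc/fail2ban/jail.local, which fail2ban reads after jail.conf and which package upgrades won't overwrite. The override is the same block, verbatim:

 # /etc/fail2ban/jail.local -- same settings, kept out of the packaged jail.conf  
 [http-get-dos]  
 enabled = true  
 port = http,https  
 filter = http-get-dos  
 logpath = /var/log/varnish/access.log  
 maxretry = 300  
 findtime = 300  
 bantime = 600  
 action = iptables[name=HTTP, port=http, protocol=tcp]  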

Next we need to create the filter in /etc/fail2ban/filter.d/http-get-dos.conf, which defines the pattern the jail matches against:

 # Fail2Ban configuration file  
 #  
 # Author: http://www.go2linux.org  
 #  
 [Definition]  
 # Option: failregex  
 # Note: This regex will match any GET entry in your logs, so basically both valid and invalid requests will match.  
 # You should set maxretry and findtime carefully in jail.conf in order to avoid false positives.  
 failregex = ^<HOST>.*"GET  
 # Option: ignoreregex  
 # Notes.: regex to ignore. If this regex matches, the line is ignored.  
 # Values: TEXT  
 #  
 ignoreregex =  

Now let's test the regex against the log file to see if it is correctly picking up the IP addresses of visitors:

fail2ban-regex /var/log/varnish/access.log /etc/fail2ban/filter.d/http-get-dos.conf 

You should see a list of IP addresses and times followed by summary statistics.
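
If the test shows that requests for static assets (images, CSS, JavaScript) are inflating the match counts, one optional tweak (not part of the original filter) is to exclude them with ignoreregex, something along these lines:

 # illustrative only: ignore requests for common static file extensions  
 ignoreregex = \.(css|js|png|jpe?g|gif|ico)  

Re-run fail2ban-regex afterwards to confirm the ignore pattern behaves the way you expect.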

When you restart fail2ban your scraper protection should be up and running.
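
Once it's running you can keep an eye on the jail with fail2ban-client; the jail name below is the one we configured and the IP is just a placeholder, and the unban command may differ slightly depending on your fail2ban version:

 fail2ban-client status http-get-dos  
 fail2ban-client set http-get-dos unbanip 203.0.113.7  

The first command shows the currently banned IPs and counters for the jail; the second lifts a ban manually, which is handy if you trip the limit yourself while testing.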
