
Friday, June 5, 2009

How to block robots.. before they hit robots.txt - ala: mod_security

As many of you know, robots (in their many forms) can be quite pesky when it comes to crawling your site, indexing things that you don't want indexed. Yes, there is the standard practice of putting a robots.txt in your webroot, but that is often not very effective. This is due to a number of factors, not the least of which is that robots tend to be poorly written to begin with and thus simply ignore robots.txt anyway.

This comes up because a friend of mine who runs a big e-com site recently asked me.. "J, how can I block everything from these robots? I simply don't want them crawling our site." My typical response to this was "you know that you will then block the search engines and keep them from indexing your site"... to wit: "yes, none of our sales are organic, they all come from referring partners and affiliate programs".... That's all that I needed to know... as long as it doesn't break anything that they need, heh.

After putting some thought into it, and deciding that there was no really easy way to do this on a firewall, I decided that the best way was to create some mod_security rules that look for known robots and return a 404 whenever any such monster hits the site. This made the most sense because they are running an Apache reverse proxy in front of their web application servers with mod_security (and some other fun).

A quick search on the internet found the robotstxt.org site, which contains a listing (http://www.robotstxt.org/db/all.txt) of quite a few common robots. Looking through this file, all that I really cared about was the robot-useragent value. As such, I quickly whipped up the following perl that automatically creates a file named modsecurity_crs_36_all_robots.conf. Simply place this file in the appropriate path (for me /usr/local/etc/apache/Includes/mod_security2/) and restart your apache... voila.. now (for the most part) only real users can browse your webserver. I'll not get into other complex setups, but you could also do this at a per-directory level from your httpd.conf and mimic robots.txt (except the robots can't ignore the 404 muahahaha).

#####################Begin Perl#######################
#!/usr/bin/perl

##
## Quick little routine to pull the user-agent string out of the
## all.txt file from the robots project, with the intention of creating
## regular expression block rules so that they can no longer crawl
## against the rules!
## Copyright JJ Cummings 2009
## cummingsj@gmail.com
##

use strict;
use warnings;
use File::Path;

my ($line,$orig);
my $c = 1000000;
my $file = "all.txt";
my $write = "modsecurity_crs_36_all_robots.conf";
open (DATA,"<$file") or die "Unable to open $file: $!\n";
my @lines = <DATA>;
close (DATA);

open (WRITE,">$write") or die "Unable to open $write for writing: $!\n";
print WRITE "#\n#\tQuick list of known robots that are parsable via http://www.robotstxt.org/db/all.txt\n";
print WRITE "#\tgenerated by robots.pl written by JJ Cummings \n\n";
foreach $line (@lines){
    if ($line=~/robot-useragent:/i){
        $line=~s/robot-useragent://;
        $line=~s/^\s+//;
        $line=~s/\s+$//;
        $orig=$line;
        # escape regex metacharacters so the user-agent string is matched literally
        $line=~s/\//\\\//g;
        #$line=~s/\s/\\ /g;
        $line=~s/\./\\\./g;
        $line=~s/\!/\\\!/g;
        $line=~s/\?/\\\?/g;
        $line=~s/\$/\\\$/g;
        $line=~s/\+/\\\+/g;
        $line=~s/\|/\\\|/g;
        $line=~s/\{/\\\{/g;
        $line=~s/\}/\\\}/g;
        $line=~s/\(/\\\(/g;
        $line=~s/\)/\\\)/g;
        $line=~s/\*/\\\*/g;
        # a literal 'X' in the database entry becomes a single-character wildcard
        $line=~s/X/\./g;
        $line=lc($line);
        chomp($line);
        # skip empty entries and the "no"/"none" style placeholders
        if (($line ne "") && ($line !~ "no") && ($line !~ /none/i)) {
            $c++;
            $orig=~s/'//g;
            $orig=~s/`//g;
            chomp($orig);
            print WRITE "SecRule REQUEST_HEADERS:User-Agent \"$line\" \\\n";
            print WRITE "\t\"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'$c',tag:'AUTOMATION/BOTS',severity:'2'\"\n";
        }
    }
}
close (WRITE);
$c=$c-1000000;
print "$c total robots\n";


#####################End Perl#######################

To use the above, save the all.txt file to the same directory as the perl.. and of course make sure you have +w permissions so that the perl can create the new file. This is a pretty basic routine... I wrote it in about 5 minutes (with a few extra minutes spent tweaking the ruleset output format, displayed below). So please, feel free to modify / enhance / whatever to fit your own needs as best you deem. **yes, I did shrink it so that it would format correctly here**
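
To see the whole thing end to end, the workflow is roughly the following (assuming the script above is saved as robots.pl, per its header comment, and using the Includes path mentioned earlier):

fetch http://www.robotstxt.org/db/all.txt
perl robots.pl
cp modsecurity_crs_36_all_robots.conf /usr/local/etc/apache/Includes/mod_security2/
apachectl graceful

(fetch is the FreeBSD way.. wget or curl -O will do the same job elsewhere)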

#####################Begin Example Output#######################
SecRule REQUEST_HEADERS:User-Agent "abcdatos botlink\/1\.0\.2 \(test links\)" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000001',tag:'AUTOMATION/BOTS',severity:'2'"
SecRule REQUEST_HEADERS:User-Agent "'ahoy\! the homepage finder'" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000002',tag:'AUTOMATION/BOTS',severity:'2'"
SecRule REQUEST_HEADERS:User-Agent "alkalinebot" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000003',tag:'AUTOMATION/BOTS',severity:'2'"
SecRule REQUEST_HEADERS:User-Agent "anthillv1\.1" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000004',tag:'AUTOMATION/BOTS',severity:'2'"
SecRule REQUEST_HEADERS:User-Agent "appie\/1\.1" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000005',tag:'AUTOMATION/BOTS',severity:'2'"

#####################End Example Output#######################

And that, folks, is how you destroy robots that you don't like.. you can modify the status code that gets returned to whatever suits you best.. 403, 404.....
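
As mentioned earlier, you can also scope this per directory / location instead of blocking site-wide. A rough sketch of that (the /private path and the Include path here are just placeholders for your own layout, and this assumes mod_security2 is already loaded for that vhost):

<Location /private>
    Include /usr/local/etc/apache/modsecurity_crs_36_all_robots.conf
</Location>

If you go this route, keep the generated file somewhere that isn't already being Included globally, otherwise the per-location scoping is moot.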

Cheers,
JJC

Thursday, January 15, 2009

New IDS/IPS technologies

Recently while perusing the intertubes I ran across a new IDS/IPS technology, PHPIDS (http://www.php-ids.org). This is an interesting and simple concept that can add an additional layer of security to your web application(s). That being said, I am not sure that I would run it on its own, but I will be testing it over the week and posting the results subsequently.

Friday, May 16, 2008

How are your "Debian" SSL certs doing

Last night, while interviewing with Paul and Larry on the pauldotcom.com podcast, I had an interesting thought whilst bashing Debian and the latest OpenSSL party that they have created.

How many root Certificate Authorities run Debian and generate signed SSL keys?

Obviously the implications of this are substantial.. I get in the middle of an affected e-com server/application and grab credit card numbers and identity info for a day or so.. then meander on my way. Alarming, because of course it does not produce any real auditable trail for analysts to follow... I mean, there was no real break-in as with TJX or Advance Auto....

So, the moral of this story is that you need to check with your CA and see if they issued you any certs/keys from any affected systems. If that is the case then they of course need to re-issue a known good cert/key to you.
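
On your own side of the fence, the openssl-blacklist tooling that appeared in the wake of this can at least tell you whether a given private key came out of the broken Debian PRNG. Assuming you have openssl-vulnkey installed (the key path below is just an example):

openssl-vulnkey /etc/ssl/private/server.key

It won't tell you anything about your CA's systems, but it will tell you whether the key you submitted to them was weak to begin with.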

I *hope*, though I doubt it will happen, that any affected CA would notify their customer base if they had issued anything from an affected system.

Cheers,
JJC

Thursday, August 30, 2007

Trac on FreeBSD6.2 w/ Subversion.

Recently I investigated using Trac (http://trac.edgewall.org/) integrated with Subversion, built on FreeBSD and fronted by Apache22. The reason behind this is simple: several of the projects that I am involved with need svn and also need to house web / wiki / forum capabilities. I have written this with the intent of helping FreeBSD users get a basic functional install using the aforementioned technologies.

First things first, let's build Apache and Subversion with the appropriate options:
secure# cd /usr/ports/www/apache22
secure# make WITH_AUTH_MODULES=yes WITH_DAV_MODULES=yes \
WITH_SSL_MODULES=yes WITH_BERKELEYDB=db42 install clean
secure# cd /usr/ports/devel/subversion
secure# make -DWITH_SVNSERVE_WRAPPER -DWITH_MOD_DAV_SVN \
-DWITH_APACHE2_APR install clean
Now, let's prepare and build our repository
secure# mkdir -p /svn/repos
secure# svnadmin create /svn/repos
secure# chown -R www:www /svn/repos
After we build our repo and set permissions so that www can access it, we need to set up our apache to use dav_svn_module and authz_svn_module. You will need to edit /usr/local/etc/apache22/httpd.conf and modify it as noted in the excerpt from mine below. Note the commented-out dav_module (don't forget to do this or it's gonna break stuff later on)
.....
LoadModule usertrack_module libexec/apache22/mod_usertrack.so
LoadModule unique_id_module libexec/apache22/mod_unique_id.so
LoadModule setenvif_module libexec/apache22/mod_setenvif.so
LoadModule version_module libexec/apache22/mod_version.so
LoadModule ssl_module libexec/apache22/mod_ssl.so
LoadModule mime_module libexec/apache22/mod_mime.so
LoadModule dav_module libexec/apache22/mod_dav.so
LoadModule status_module libexec/apache22/mod_status.so
LoadModule autoindex_module libexec/apache22/mod_autoindex.so
LoadModule asis_module libexec/apache22/mod_asis.so
LoadModule info_module libexec/apache22/mod_info.so
.......
LoadModule alias_module libexec/apache22/mod_alias.so
LoadModule rewrite_module libexec/apache22/mod_rewrite.so
#LoadModule dav_module libexec/apache22/mod_dav.so
LoadModule dav_svn_module libexec/apache22/mod_dav_svn.so
LoadModule authz_svn_module libexec/apache22/mod_authz_svn.so
Next we will be creating our /usr/local/etc/apache22/Includes/svn.conf
secure# vi /usr/local/etc/apache22/Includes/svn.conf

<Location /svn>
   DAV svn
   SVNPath /svn/repos
   AuthType Basic
   AuthName "Feloo Subversion Repository"
   AuthUserFile /etc/svn-auth-file
   Require valid-user
</Location>
Create our auth file using htpasswd
secure# htpasswd -cm /etc/svn-auth-file JJC
Build Trac from the ports tree
secure# cd /usr/ports/www/trac && make install clean
Create and initialize our environment
secure# mkdir -p /trac/projects/
secure# trac-admin /trac/projects initenv
secure# chown -R www:www /trac/projects/
Build mod_python3
secure# cd /usr/ports/www/mod_python3 && make install clean
Add one last module to our /usr/local/etc/apache22/httpd.conf
secure# vi /usr/local/etc/apache22/httpd.conf
LoadModule python_module libexec/apache22/mod_python.so
Define our trac location in /usr/local/etc/apache22/Includes/trac.conf (you'll have to create it)
secure# vi /usr/local/etc/apache22/Includes/trac.conf


<Location /trac>
   SetHandler mod_python
   PythonHandler trac.web.modpython_frontend
   PythonOption TracEnv /trac/projects
   PythonOption TracUriRoot /trac
</Location>

<Location "/trac/login">
   AuthType Basic
   AuthName "JJC Trac Projects"
   AuthUserFile /etc/svn-auth-file
   Require valid-user
</Location>
Now, start (or restart) your apache daemon
apachectl start
You should now be able to access Trac at http://theinstallediporhostname/trac
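
Assuming the /svn location above matches your layout, it's also worth a quick sanity check that the repository is reachable over DAV with the user you created via htpasswd.. something like:

secure# svn checkout http://theinstallediporhostname/svn/ wc --username JJC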

Cheers,
JJC