Wednesday, August 12, 2009

Reverse DNS and Whitelists

OK, these posts seem to be more about technical issues than contemplations.

Since I started watching the traffic on the dictionary, I have been kept busy with bots, scraping, and overall dictionary usage. As a result, I have been reading what other people suggest.

Both Google and Yahoo suggest using a reverse DNS lookup to validate their bots. I have seen several techniques for doing this in PHP. One of them looks like this:

$first_ip = $_SERVER['REMOTE_ADDR'];      // IP address of the visitor
$hostname = gethostbyaddr($first_ip);     // reverse lookup: IP -> host name
$second_ip = gethostbyname($hostname);    // forward lookup: host name -> IP
if ($second_ip == $first_ip) {
    // IPs matched.
    // Add to whitelist.
}

At first, this seems like it should just work: if $second_ip == $first_ip, all is good, right? Well, not really. If you monitor the output of gethostbyaddr(), you will see that it often returns the original IP address unchanged when it cannot resolve it into a host name. Below is a small sample of values I got from gethostbyaddr() and gethostbyname().

IP ($first_ip)    Host ($hostname)                          Reverse IP ($second_ip)
66.249.65.232     crawl-66-249-65-232.googlebot.com         66.249.65.232
72.30.79.95       llf531274.crawl.yahoo.net                 72.30.79.95
92.70.112.242     static.kpn.net                            static.kpn.net
80.27.102.88      80.27.102.88                              80.27.102.88
32.154.39.98      mobile-032-154-039-098.mycingular.net     mobile-032-154-039-098.mycingular.net


After looking at the table above, you can see that simply checking whether $first_ip == $second_ip is not a good test. In the case of 80.27.102.88, the reverse lookup failed, so gethostbyaddr() returned the IP address itself; the forward lookup then returned that same string, and the check passes. If that host were a spam bot, we would have just whitelisted it. By the way, getting back only an IP address instead of a hostname is very common. It does not mean the user is a bot; it only means the hosting company didn't register a more human-readable hostname. In that case we shouldn't blacklist them.
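A stricter check would require the reverse lookup to actually succeed, require the resulting hostname to end in a domain we trust, and then confirm that the forward lookup maps back to the original IP. Here is a minimal sketch of that idea; the is_verified_crawler() function and the short list of crawler domains are my own illustrative assumptions, not code from Google's or Yahoo's documentation:

// Minimal sketch: only treat a visitor as a verified crawler if the reverse
// lookup resolves to a hostname in a trusted crawler domain AND the forward
// lookup maps back to the same IP. The domain list below is illustrative.
function is_verified_crawler($ip) {
    $crawler_domains = array('.googlebot.com', '.google.com', '.crawl.yahoo.net');

    $hostname = gethostbyaddr($ip);
    // gethostbyaddr() returns the IP unchanged (or false) when the reverse
    // lookup fails, so a hostname equal to the IP means no real hostname.
    if ($hostname === false || $hostname === $ip) {
        return false;
    }

    // The hostname must end in one of the trusted crawler domains.
    $trusted = false;
    foreach ($crawler_domains as $domain) {
        if (substr($hostname, -strlen($domain)) === $domain) {
            $trusted = true;
            break;
        }
    }
    if (!$trusted) {
        return false;
    }

    // Forward-confirm: the hostname must resolve back to the original IP.
    return gethostbyname($hostname) === $ip;
}

// Usage:
if (is_verified_crawler($_SERVER['REMOTE_ADDR'])) {
    // Add to whitelist.
}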

What is a blacklist vs. a whitelist? A blacklist is a list of hosts we always want to block because they have been determined to be hostile. A whitelist is a list of hosts we always want to let in. For example, we would always want to let the Google, Yahoo, or Bing bots scan our site, so they should be on the whitelist. Everyone else is allowed in unless they try something that is not allowed.
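To make the distinction concrete, here is a rough sketch of how the two lists might be consulted on each request. The sample entries, the 403 response, and the overall structure are hypothetical, not the actual code running on the dictionary:

// Hypothetical lists; in practice these would live in a database or config.
$blacklist = array('203.0.113.9');                        // hosts we always block
$whitelist = array('crawl-66-249-65-232.googlebot.com');  // hosts we always allow

$ip       = $_SERVER['REMOTE_ADDR'];
$hostname = gethostbyaddr($ip);

if (in_array($ip, $blacklist)) {
    header('HTTP/1.1 403 Forbidden');   // blacklisted: always refused
    exit;
} elseif (in_array($hostname, $whitelist)) {
    // Whitelisted: always allowed, skip further bot checks.
} else {
    // Everyone else: allowed in, but watched for disallowed behavior.
}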

1 comment:

  1. Er, sounds good, but how do you know that 80.27.102.88 is a spam bot if the test you're performing is intended to weed out the spam bots? What you're then saying is that you first need to know which are the spam bots in order to work out how to detect spam bots. That's a bit circular, isn't it?
