Wednesday, August 12, 2009

Reverse DNS and Whitelists

OK, these posts seem to be turning out more technical than contemplative.

Since I started watching the traffic on the dictionary, I have been busy dealing with bots, scraping, and overall dictionary usage. As a result, I have been reading up on what other people suggest.

Both Google and Yahoo suggest using a reverse DNS lookup to validate their bots. I have seen several PHP techniques for this. One of them looks like this:

$first_ip = $_SERVER['REMOTE_ADDR'];      // the visitor's IP address
$hostname = gethostbyaddr($first_ip);     // reverse lookup: IP -> hostname
$second_ip = gethostbyname($hostname);    // forward lookup: hostname -> IP
if ($second_ip == $first_ip) {
    // IPs matched.
    // Add to whitelist.
}

At first, this seems like it should just work. If $second_ip == $first_ip, all is good, right? Well, not really. If you monitor the output of gethostbyaddr(), you will see that it often returns the original string when it cannot resolve the address into a hostname. Below is a small sample of values I got from using gethostbyaddr() and gethostbyname().

IP ($first_ip)    Host ($hostname)                         Reverse IP ($second_ip)
66.249.65.232     crawl-66-249-65-232.googlebot.com        66.249.65.232
72.30.79.95       llf531274.crawl.yahoo.net                72.30.79.95
92.70.112.242     static.kpn.net                           static.kpn.net
80.27.102.88      80.27.102.88                             80.27.102.88
32.154.39.98      mobile-032-154-039-098.mycingular.net    mobile-032-154-039-098.mycingular.net


Looking at the table above, you can see that simply checking whether $first_ip == $second_ip is not a good test. Take 80.27.102.88: the reverse lookup failed and returned the IP address unchanged, so the forward lookup returned it unchanged too, and the check passes. If that visitor had been a spam bot, we would have just whitelisted it. By the way, hostnames that are just the IP address are very common. It does not mean the visitor is a bot; it only means the hosting company didn't register a more human-readable hostname. So that alone isn't a reason to blacklist them either.
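Below is a minimal sketch of a stricter check, following the verification procedure Google and Yahoo describe: do the reverse lookup, make sure it produced a real hostname in a domain you trust, and only then confirm with the forward lookup. The is_trusted_bot() function and the domain list are my own illustration, not code from their documentation.

// A stricter validation sketch (illustrative, not production code).
// Only whitelist when the reverse lookup yields a real hostname in
// a trusted domain AND the forward lookup maps back to the same IP.
function is_trusted_bot($ip) {
    // Domains the big search engines crawl from; adjust as needed.
    $trusted_domains = array('googlebot.com', 'crawl.yahoo.net',
                             'search.msn.com');

    $hostname = gethostbyaddr($ip);
    // gethostbyaddr() returns the unmodified input on failure, so a
    // "hostname" equal to the IP means there was no reverse record.
    if ($hostname === false || $hostname === $ip) {
        return false;
    }

    // Require the hostname to end in one of the trusted domains.
    $trusted = false;
    foreach ($trusted_domains as $domain) {
        $suffix = '.' . $domain;
        if (substr($hostname, -strlen($suffix)) === $suffix) {
            $trusted = true;
            break;
        }
    }
    if (!$trusted) {
        return false;
    }

    // Forward-confirm: the hostname must resolve back to the IP.
    return gethostbyname($hostname) === $ip;
}

if (is_trusted_bot($_SERVER['REMOTE_ADDR'])) {
    // Add to whitelist.
}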

What is a blacklist vs. a whitelist? A blacklist is a list of hosts we always want to block; they have been determined to be hostile. A whitelist is a list of hosts we always want to let in. For example, we would always want the Google, Yahoo, or Bing bots to scan our site, so they belong on the whitelist. Everyone else is allowed in, unless they try something that isn't allowed.
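For illustration, here is a minimal sketch of how the two lists might be consulted on each request, assuming they are kept as simple arrays of IP addresses (a real site would load them from a database or config file; the entries below are examples only):

$blacklist = array('10.0.0.1');          // hosts we always block (example entry)
$whitelist = array('66.249.65.232');     // hosts we always allow (example entry)

$ip = $_SERVER['REMOTE_ADDR'];
if (in_array($ip, $blacklist)) {
    header('HTTP/1.1 403 Forbidden');    // hostile host: refuse the request
    exit;
} elseif (in_array($ip, $whitelist)) {
    // Trusted bot (e.g. Google, Yahoo, Bing): skip the other checks.
} else {
    // Everyone else: allowed in, subject to the normal rules.
}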

Tuesday, August 11, 2009

Google Gadgets

This is going to be a very short post. I have been playing with displaying the Mobile Glot dictionary as a Google Gadget. It seems to work. It could use some customization to display a little more cleanly, but even so, it isn't too bad for a first pass.

If you have iGoogle setup, you can try it out for yourself by following this link:
http://www.google.com/ig/adde?hl=en&moduleurl=http://m.glot.com/igGlot.xml&source=imag

Please let me know what you think.

Friday, August 7, 2009

Bad Bots

Today I am going to write about my experience with bad bots.

What is a bot? A bot is an automated program used to scan websites. Not all bots are bad. For example, the Google indexer is a bot, and I want Google, Yahoo, MSN, and the like to scan my site so that other people can find it.

What makes a bot bad? A bad bot is one that doesn't follow the rules or play nice. For example, if a bot ignores the rules in a site's robots.txt file, it is considered bad. A bot is considered malicious if it probes for security holes or injects SQL statements to gain access to the database. A bot doesn't play nice if it requests too many pages at a time. Google, for example, fetches about one page every 5 seconds; a bad bot tries to fetch 10 or even 100 pages per second. This can have a couple of negative effects. One, it can slow the site down to the point of being unusable or unresponsive. Two, it can burn through my monthly data transfer quota, costing me more to maintain the site.

So, how do you prevent bots from causing too much damage without hurting regular visitors? I took a few steps. I decided to limit pages served to a single device to one page per second. If a device asks for multiple pages within a second, the extra requests are put on hold so that it still gets only one page per second. The hard part is figuring out when it is the same device.

All normal browsers accept cookies unless the user has turned them off. A cookie is a small amount of information that a website gives to a browser to be sent back when the user returns. In general, cookies are used to keep session information and personal preferences. For example, I use cookies to remember the user's preferred language settings, so they don't have to set the language every time they want to look a word up in the dictionary. Cookies are also useful for tracking a user's path through the website: you can see which pages they viewed. I use this information to check the flow of my website's design. If users get stuck on a page and leave, something is wrong and I need to investigate. As you can see, cookies can be very useful.
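As a small example, the language-preference cookie could be handled like this; the cookie name, the lang parameter, and the 30-day lifetime are illustrative, not the dictionary's actual values:

// Must run before any output, since cookies travel in the HTTP headers.
if (isset($_GET['lang'])) {
    // The user picked a language: remember it for future visits.
    setcookie('lang', $_GET['lang'], time() + 30 * 24 * 3600, '/');
    $lang = $_GET['lang'];
} elseif (isset($_COOKIE['lang'])) {
    // Returning visitor: restore the preferred language.
    $lang = $_COOKIE['lang'];
} else {
    $lang = 'en';    // first visit: fall back to a default
}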

Cookies are a great way to identify a device and recognize return visitors. There is one hitch: as a rule, bots don't accept cookies. Now what? The very devices we want to track won't let us track them by giving them a cookie. Hmmm, maybe we can use that property to our advantage. I split every visitor into two groups: those with cookies and those without. Those without are put on "probation" and have fewer privileges than those with.

Sounds good: we group all those without cookies and make them wait. If a bot doesn't have a cookie, it can sit in the waiting room until its number gets called. But wait, new users also don't have cookies, and they will end up waiting with all the bots. A 20-second wait for your first page will not make for a good first-time user experience. We need some way to separate the bots from the new users. Ideally, we could even separate the bots from each other. Maybe there is some other piece of information we could use.

Each browser sends some useful information to the server when it asks for a page. Things like the default language, which version of HTML it can understand, the IP address of the computer, and the agent string. Huh? The IP address tells the server where to send the web page. The agent string tells the server a little bit about the browser requesting the page. Here is an example of an agent string for an iPhone:

Mozilla/5.0 (iPhone; U; CPU iPhone OS 2_2_1 like Mac OS X; en-us) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5H11 Safari/525.20

Here is an agent string for a Google bot:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Now Google clearly marks their bots in the agent string. With others it is not so clear.
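For illustration, here is how those request details look from PHP, along with a quick test for agent strings that claim to be a known bot. The marker list is my own example, and since agent strings are easy to fake, a claim like this should be confirmed with the reverse DNS check from the post above before trusting it:

$ip    = $_SERVER['REMOTE_ADDR'];             // where the page gets sent
$agent = $_SERVER['HTTP_USER_AGENT'];         // the agent string
$lang  = $_SERVER['HTTP_ACCEPT_LANGUAGE'];    // the default language

// Does the agent string claim to be one of the big crawlers?
$bot_markers = array('Googlebot', 'Yahoo! Slurp', 'msnbot');
$claims_to_be_bot = false;
foreach ($bot_markers as $marker) {
    if (stripos($agent, $marker) !== false) {
        $claims_to_be_bot = true;
        break;
    }
}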

So, how are the agent string and the IP address going to help? I use them to group together the visitors that are on "probation". Everyone with the same IP address and agent string goes into the same waiting room, where they are served one page per second. This effectively gives each new user their own private waiting room. One drawback to this approach: if a group of students, all sitting in the same classroom with the same type of computer running the same version of the same browser, are told by the teacher to look up a word in our dictionary, they may see a long delay before the first page appears. Subsequent pages will be served as fast as our server can handle them, so it is only a problem the first time they visit our website. A sketch of how this could work follows.
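Here is a minimal sketch of such a waiting room, assuming a single server where a serialized file holds each group's next free time slot; the file path, cookie name, and function names are my own illustration, and a real site would keep the slots in shared memory or a database:

define('THROTTLE_FILE', '/tmp/throttle.dat');    // hypothetical path

function throttle_key() {
    if (!empty($_COOKIE['visitor_id'])) {
        // Known device: it gets its own key.
        return 'c:' . $_COOKIE['visitor_id'];
    }
    // On "probation": share a waiting room with every visitor that
    // has the same IP address and agent string.
    return 'p:' . md5($_SERVER['REMOTE_ADDR'] . '|' . $_SERVER['HTTP_USER_AGENT']);
}

function throttle() {
    $key = throttle_key();

    // Read and update the group's next free slot under a lock.
    $fp = fopen(THROTTLE_FILE, 'c+');
    flock($fp, LOCK_EX);
    $raw   = stream_get_contents($fp);
    $slots = $raw ? unserialize($raw) : array();

    // Serve one second after the group's previous request, or
    // immediately if the group has been idle.
    $now  = microtime(true);
    $slot = isset($slots[$key]) ? max($now, $slots[$key] + 1.0) : $now;
    $slots[$key] = $slot;

    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, serialize($slots));
    flock($fp, LOCK_UN);
    fclose($fp);

    // Hold the request until its slot comes up.
    if ($slot > $now) {
        usleep((int) (($slot - $now) * 1000000));
    }
}

throttle();    // call before rendering the page

Because each request books the next free slot before sleeping, a burst of requests from one group is spread out at one page per second while other groups are unaffected.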

So, that is it. I don't prevent bots from visiting our site; I just limit how fast they are served pages, effectively limiting any damage they may cause. If you have any questions or suggestions, please let me know. I am curious whether you found this useful.

You can test it for yourself against our site: m.glot.com. But please, don't try and bring down our site.

Wednesday, August 5, 2009

Why bother with a translation dictionary

It may seem a bit strange to be developing a translation dictionary after 10 years of game programming. Well, that sort of thing happens after traveling around the world.

I am an American living in the Netherlands. I fall under the classic programmer stereotypes: cannot spell, dyslexic, poor at languages. These traits do not help with speaking a foreign language. I am grateful English is my mother tongue; it would be hell for me to learn it. Nevertheless, I found myself trying to learn Dutch.

Learning Dutch isn't the easiest thing in the world. It sits somewhere between English and German on the language spectrum, with a bit of French thrown in just for fun. To most English speakers, it sounds like German. To most Germans, it sounds like a strange, unintelligible singsong dialect, with some growling and throat clearing added for emphasis.

After making the decision to stay in the Netherlands, I signed up for the beginner Dutch course offered by the local community college. Although I did well in the class, I struggled with remembering the words. So, I started writing all of them down. I wrote down every word in every chapter. I started keeping the list in Excel.

As I was doing this, I found that the Dutch-English dictionaries sucked. 90% of the time, I couldn't find the Dutch word I was looking for. The dictionaries were designed for Dutch speakers trying to learn English, not the other way around, so only the root form of each verb was included. I found this very frustrating.

It was because of this frustration that I started making my own dictionary. I initially made it for Windows Mobile CE, since that was the platform I was most familiar with. Yet, once I saw the iPhone for the first time, I knew it was time to switch. My goal was to make a dictionary that you always have with you, that is easy to use, and that covers the everyday words you find in the newspaper. Thus Street Dutch was born, my first iPhone web application.

I believe web applications are the way of the future. For rapid development and deployment, they are the way to go. As mobile phone data plans become cheaper and data speeds increase, the web will become the platform of choice. But things are not there yet: I could not get the performance level I wanted, so I started investigating the development of a native iPhone application.

During my investigation, I came across another online translation dictionary, Interglot, which covers six languages. As it turns out, the owner, Arnout van der Kamp, lives just a short distance from me. We met and made a deal: I develop the mobile versions, he supplies the dictionary.

As a trial run, we made a mobile version of the Interglot website: m.glot.com. It is a lightweight website designed for all modern mobile devices. It is not as smooth as a webapp, but that is the price you pay when you want it to run on a wide range of mobile devices. The next step is to create a native iPhone version.

What about Street Dutch? The plan is to release the native iPhone version of Glot first, then create a version designed more for language learning than for translation.

In the next post I will go into a bit more about publishing the m.glot.com website.

Introduction

I am starting this blog to relate my experience in creating a language translation dictionary for the iPhone. It gives me a place to talk about some of the challenges, frustrations, and successes.

I plan to discuss:
  1. Why bother with a translation dictionary
  2. Publishing a mobile website
  3. Siterank 0 -- how did that happen?