Tuesday, September 1, 2009

Captcha: got to love it, got to hate it

I have to be honest, I don't really like filling out Captcha tests. I find many of them unreadable. So I have to guess at what they want, getting it wrong about half of the time.

Yet, this isn't a post about what is wrong with Captcha tests. It is a post about why I ended up putting one on my site, Mobile Glot. Now, if you go to my site, you most likely will never see the Captcha test. I reserve it for visitors that act more like bots but come with a user agent string claiming to be an iPhone or another common browser.

For example, there is someone behind 78-33-42-157.static.enta.net sending out a bot pretending to be an iPhone running version 1.0. Yet it clearly acts like a bot. After searching for his IP address on the net, I can see he has been hitting many other sites, so I have no qualms about letting other people know.

What do I mean by "acts like a bot"? In this particular case, his bot acts like a search bot in the way it follows links. Mobile Glot is a translation dictionary site, and it sees a handful of usage patterns, most of them very predictable.
  1. First Timer -- first timers tend to show up, change the translation languages, look up a word or two, follow a couple of links, then leave. (hopefully bookmarking it for later)
  2. First Timer (teenager) -- the basic difference here is which words they look up. Mostly profanity and 4-letter words. (hopefully bookmarking it so they can broaden their vocabulary at a later date)
  3. Look-up and leave -- these visitors come a few times a week. They tend to look up one or two words and leave.
  4. The browser -- these visitors look through the top 10 lists and related words. They spend a lot of time browsing the site.
  5. The Vocabulary list -- these visitors go through their vocabulary list.
  6. The Reader -- these visitors are busy reading a book or article. They look up words to get a better understanding of what they are reading. They will follow related words and other links.
  7. The translator -- They tend to follow the Look-up and Leave pattern. They are busy and just want to confirm that they already knew the word.
How are bots different from people?
  1. They follow internal links 99.9% of the time.
  2. They request pages much faster and more regularly than a person can.
  3. The pages they ask for are not related to the previous page.
I don't "know" why these bots are on my site, but they act in two distinct ways.
  1. The scanner -- this bot scans the site looking for email addresses or other useful bits of information.
  2. The scraper/skimmer -- this bot is trying to scrape the information off my site for its own use, maybe even to make its own translation dictionary.
Scanners are mostly annoying. They don't cause too much harm, because there is nothing for them to get. The only problem is that they screw with my stats. Google Analytics can be fooled by a scanner pretending to be a normal browser. I have to give Google some credit here; they seem to be able to detect scanners with fake browser agents. I had a scanner that visited several days in a row, reading thousands of pages. On the first day, Google Analytics treated it as thousands of new visitors. Yet, on the second day, it didn't get fooled. Good going, Google!

The scrapers are a bigger problem. First, they are stealing the dictionary; it isn't theirs to take. Second, they tend to look up words by directly manipulating the URL. The most common method seems to be using a word list: they go through every word in the list and read what they get back. In our dictionary, if a word isn't found, we offer suggestions. Generating the list of suggestions is rather expensive, so we would like to keep it to a minimum. We keep a list of common mistakes that takes care of most people, but scrapers tend to look up words that are not so common, possibly causing performance problems.
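
To give a rough idea of how that works, here is a sketch. The names below are made up for illustration (generate_suggestions() stands in for the expensive part); the point is that the cheap common-mistakes table is checked first and the generator only runs on a miss.

// Hypothetical sketch: answer from a precomputed table of common misspellings
// when possible; only run the expensive suggestion generator on a miss.
$common_mistakes = array(
    'recieve' => array('receive'),
    'wierd'   => array('weird'),
);

function suggestions_for($word, $common_mistakes) {
    if (isset($common_mistakes[$word])) {
        return $common_mistakes[$word];      // cheap: a simple array lookup
    }
    return generate_suggestions($word);      // expensive: assumed edit-distance search
}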

Whew. So, now you see the issue. What do I do about it?

The basic idea is to watch the usage pattern and present a Captcha test when I suspect it is a disguised bot. If it is a person, then they can answer the question and keep looking up words. If it is a bot, then it will get stuck.
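
Here is a rough sketch of that decision. The $visitor fields below are made up for illustration; the real checks boil down to the bot signals listed above (no cookie, too fast, too regular, unrelated pages).

// Hypothetical sketch: show a Captcha to cookieless visitors whose request
// pattern looks bot-like, instead of the page they asked for.
function should_show_captcha($visitor) {
    if ($visitor->has_cookie) {
        return false;                        // already behaving like a normal browser
    }
    if ($visitor->pages_per_minute > 30) {
        return true;                         // asking for pages faster than a person can read
    }
    if ($visitor->internal_link_ratio > 0.999) {
        return true;                         // follows internal links like a crawler
    }
    if (!$visitor->follows_related_pages) {
        return true;                         // jumps between unrelated pages
    }
    return false;
}

// if (should_show_captcha($visitor)) { show_captcha_page(); exit; }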

Wednesday, August 12, 2009

Reverse DNS and Whitelists

Ok, these blogs seem to be more about technical issues than contemplations.

Since I started watching the traffic on the dictionary, I have been very busy with bots, scraping, and overall dictionary usage. As a result, I have been reading what other people suggest.

Both Google and Yahoo suggest using a reverse DNS lookup to validate their bots. I have seen several PHP techniques for this. One of them looks like this:

$first_ip = $_SERVER['REMOTE_ADDR'];     // IP address of the visitor
$hostname = gethostbyaddr($first_ip);    // reverse lookup: IP -> host name
$second_ip = gethostbyname($hostname);   // forward lookup: host name -> IP
if ($second_ip == $first_ip) {
    // IPs matched
    // Add to white list.
}

At first, this seems like it should just work. If $second_ip == $first_ip, all is good, right? Well, not really. If you monitor the output of gethostbyaddr() you will see that it often returns the original string if it cannot resolve it into a host name. Below, I have included a small sample of values I got from using gethostbyaddr() and gethostbyname().

IP ($first_ip)     Host ($hostname)                           Reverse IP ($second_ip)
66.249.65.232      crawl-66-249-65-232.googlebot.com          66.249.65.232
72.30.79.95        llf531274.crawl.yahoo.net                  72.30.79.95
92.70.112.242      static.kpn.net                             static.kpn.net
80.27.102.88       80.27.102.88                               80.27.102.88
32.154.39.98       mobile-032-154-039-098.mycingular.net      mobile-032-154-039-098.mycingular.net


Looking at the table above, you can see that simply checking whether $first_ip == $second_ip is not a good check. In the case of 80.27.102.88, gethostbyaddr() could not resolve the address, so it returned the IP string itself, and the test passes anyway. For a spam bot, the hostname will most likely be the IP address, so in effect we just whitelisted the spam bot. By the way, getting a hostname that is just the IP address is very common. It does not mean that the visitor is a bot; it only means the hosting company didn't register a more human-readable hostname. In that case we shouldn't blacklist them.
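
A stricter version of the check only whitelists a visitor when the reverse lookup actually produced a hostname, that hostname ends in a domain we trust, and the forward lookup maps back to the original IP. This is just a sketch; the list of trusted domains is my own assumption and should match the bots you actually want to whitelist.

// Sketch of a stricter reverse-DNS check.
$trusted_domains = array('.googlebot.com', '.crawl.yahoo.net', '.search.msn.com');

function is_trusted_bot($ip, $trusted_domains) {
    $hostname = gethostbyaddr($ip);
    if ($hostname === false || $hostname == $ip) {
        return false;                            // reverse lookup failed or returned the raw IP
    }
    $in_trusted_domain = false;
    foreach ($trusted_domains as $domain) {
        if (substr($hostname, -strlen($domain)) == $domain) {
            $in_trusted_domain = true;           // hostname ends in a domain we trust
            break;
        }
    }
    if (!$in_trusted_domain) {
        return false;
    }
    return gethostbyname($hostname) == $ip;      // forward lookup must match the original IP
}

// if (is_trusted_bot($_SERVER['REMOTE_ADDR'], $trusted_domains)) { /* add to white list */ }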

What is a blacklist vs. a whitelist? A blacklist is a list of hosts we always want to block; they have been determined to be hostile. A whitelist is a list of hosts we always want to let in. For example, we would always want to let the Google, Yahoo, or Bing bots scan our site, so they should be on the whitelist. Everyone else is allowed in unless they try something that isn't allowed.

Tuesday, August 11, 2009

Google Gadgets

This is going to be a very short post. I have been playing with displaying the Mobile Glot dictionary as a Google Gadget. It seems to work. It could use some customization to display a little more cleanly. Even so, it isn't too bad for a first pass.

If you have iGoogle set up, you can try it out for yourself by following this link:
http://www.google.com/ig/adde?hl=en&moduleurl=http://m.glot.com/igGlot.xml&source=imag

Here it is (it works, try it):

[embedded Glot dictionary gadget]
Please let me know what you think.

Friday, August 7, 2009

Bad Bots

Today I am going to write about my experience with bad bots.

What is a bot? A bot is an automated program used to scan websites. Not all bots are bad. For example, the Google indexer is a bot. I want Google, Yahoo, MSN, and the like to scan my site so that other people can find it.

What makes a bot bad? A bad bot is a bot that doesn't follow the rules or play nice. For example, if a bot doesn't follow the rules in the robots.txt file on a site, it is considered bad. A bot is considered malicious if it looks for security holes or tries to inject SQL statements to gain access to the database. A bot doesn't play nice if it scans too many pages at a time. For example, Google scans about 1 page every 5 seconds, while a bad bot tries to scan 10 or 100 pages per second. This can have a couple of negative effects. One, it could slow the site down to the point of making it unusable or non-responsive. Two, it could use up my monthly data transfer quota, costing me more to maintain the site.

So, how do you prevent bots from causing too much damage without hurting regular visitors? I took a few steps to do this. I decided to limit pages served to a single device to 1 page per second. If a device asks for multiple pages within a second, it is put on hold so that it only gets 1 page per second. The hard part is figuring out when it is the same device.

All normal browsers accept cookies unless the user has turned them off. A cookie is a small amount of information that a website gives to a browser to be sent back when the user returns. In general, cookies are used to keep session information and some personal preferences. For example, I use cookies to remember the user's preferred language settings. That way they don't have to set the language every time they want to look a word up in the dictionary. Cookies are also useful for tracking a user's path through the website. You can track which pages they viewed. I use this information to check the flow of my website's design. If users get stuck on a page and leave, then something is wrong and I need to investigate. As you can see, cookies can be very useful.
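
As a small illustration (the cookie name is made up, not the site's actual one), remembering a language preference in PHP only takes a couple of lines:

// Hypothetical example: keep the preferred language pair in a cookie for a year,
// and fall back to a default when the visitor has no cookie yet.
setcookie('glot_lang', 'nl-en', time() + 60 * 60 * 24 * 365);
$lang = isset($_COOKIE['glot_lang']) ? $_COOKIE['glot_lang'] : 'nl-en';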

Cookies are a great way to identify a device and recognize return visitors. There is one hitch: as a rule, bots don't accept cookies. Now what? The very device we want to track doesn't allow us to track it by giving it a cookie. Hmmm, maybe we can use that property to our advantage. I have split every visitor into two groups: those with cookies and those without. Those without are put on "probation" and have fewer privileges than those with.

Sounds good: we group all those without cookies and make them wait. If a bot doesn't have a cookie, it can sit in the waiting room until its number gets called. But wait, new users also don't have cookies, and they will end up waiting with all the bots. A 20-second wait to get your first page will not make for a good first-time user experience. We need some way to separate the bots from the new users. Ideally we could even separate the bots from each other. Maybe there is some other piece of information we could use.

Each browser sends some useful information to the server when it asks for a page. Things like the default language, which version of HTML it can understand, the IP address of the computer, and the agent string. Huh? The IP address tells the server where to send the web page. The agent string tells the server a little bit about the browser requesting the page. Here is an example of an agent string for an iPhone:

Mozilla/5.0 (iPhone; U; CPU iPhone OS 2_2_1 like Mac OS X; en-us) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5H11 Safari/525.20

Here is an agent string for a Google bot:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Now Google clearly marks their bots in the agent string. With others it is not so clear.
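
As a quick sketch (not the site's actual code), the bots that do identify themselves can be caught by scanning the agent string for common markers; a disguised bot, of course, won't match any of them.

// Sketch: spot bots that declare themselves in the agent string.
function looks_like_declared_bot($agent) {
    $markers = array('Googlebot', 'Slurp', 'msnbot', 'bot', 'spider', 'crawler');
    foreach ($markers as $marker) {
        if (stripos($agent, $marker) !== false) {
            return true;                 // the bot admits what it is
        }
    }
    return false;                        // claims to be (or really is) a normal browser
}

// $declared = looks_like_declared_bot($_SERVER['HTTP_USER_AGENT']);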

So, how are the agent string and the IP address going to help? I use them to group the visitors that are on "probation" together. Everyone with the same IP address and agent string goes into the same waiting room, where they are served 1 page per second. This effectively gives each new user their own private waiting room. One drawback to this approach: if a group of students sitting in the same classroom, with the same type of computer running the same version of the browser, are told by the teacher to look up a word in our dictionary, they may see a long delay before they get the first page. Subsequent pages will be served as fast as our server can handle them, so it is only a problem the first time they visit our website.
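
Here is a minimal sketch of that waiting-room idea. The device_id cookie name and the APC calls are my own assumptions (any shared store, such as memcached or a database table, would work); it illustrates the grouping, not the production code.

// Sketch: visitors without a cookie are grouped by IP + agent string and
// each group is served roughly one page per second.
function throttle_key() {
    if (isset($_COOKIE['device_id'])) {
        return 'dev_' . $_COOKIE['device_id'];   // a known device gets its own bucket
    }
    return 'probation_' . md5($_SERVER['REMOTE_ADDR'] . $_SERVER['HTTP_USER_AGENT']);
}

function wait_for_turn($key) {
    $last = apc_fetch('last_' . $key);           // when this group was last served a page
    $now  = microtime(true);
    if ($last !== false && ($now - $last) < 1.0) {
        usleep((int) ((1.0 - ($now - $last)) * 1000000));  // hold until a full second has passed
    }
    apc_store('last_' . $key, microtime(true));
}

wait_for_turn(throttle_key());
// ... then serve the requested page as normal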

So, that is it. I don't prevent bots from visiting our site; I just limit how fast they are served pages, effectively limiting any damage they may cause. If you have any questions or suggestions, please let me know. I am curious whether you found this useful.

You can test it for yourself against our site: m.glot.com. But please, don't try and bring down our site.

Wednesday, August 5, 2009

Why bother with a translation dictionary

It may seem a bit strange to be developing a translation dictionary after 10 years of game programming. Well, that sort of thing happens after traveling around the world.

I am an American living in the Netherlands. I fall under the classic programmer stereotypes: cannot spell, dyslexic, poor at languages. These traits do not help with speaking a foreign language. I am grateful English is my mother tongue; it would be hell for me to learn it. Nevertheless, I found myself trying to learn Dutch.

Learning Dutch isn't the easiest thing in the world. It sits somewhere between English and German on the language spectrum with a bit of French thrown in just for fun. To most English speakers, it sounds like German. To most Germans, it sounds like a strange non-understandable singsong dialect, with some growling and throat clearing added for emphasis.

After making the decision to stay in the Netherlands, I signed up for the beginner Dutch course offered by the local community college. Although I did well in the class, I struggled with remembering the words. So, I started writing all of them down. I wrote down every word in every chapter. I started keeping the list in Excel.

As I was doing this, I found that the Dutch-English dictionaries sucked. 90% of the time, I couldn't find the Dutch word I was looking for. The dictionaries were designed for Dutch speakers trying to learn English, not the other way around. So, only the root verb form was included in the dictionary. I found this very frustrating.

It was because of this frustration that I started making my own dictionary. I initially made it for Windows Mobile CE, since that was the platform I was most familiar with. Yet, once I saw the iPhone for the first time, I knew it was time to switch. My goal was to make a dictionary that you always have with you, that is easy to use, and that covers the everyday words you find in the newspaper. Thus Street Dutch was born: my first iPhone web application.

I believe web applications are the thing of the future. For rapid deployment and development, they are the way to go. As mobile phone data plans become cheaper and data speeds increase, the web will become the platform of choice. Yet, things are not there yet. I could not get the performance levels I wanted, so I started investigating developing a native iPhone application.

During my investigation, I came across another online translation dictionary, Interglot, which covers six languages. As it turns out, the owner, Arnout van der Kamp, lives just a short distance from me. We met and made a deal: I develop the mobile versions, he supplies the dictionary.

As a trial run, we made a mobile version of the Interglot website: m.glot.com. It is a lightweight website designed for all modern mobile devices. It is not as smooth as a web app, but that is the price you pay when you want it to run on a wide range of mobile devices. The next step is to create a native iPhone version.

What about Street Dutch? The plan is to release the native iPhone version of Glot first, then create a version designed more for language learning than for translation.

In the next post I will go into a bit more about publishing the m.glot.com website.

Introduction

I am starting this blog to relate my experience in creating a language translation dictionary for the iPhone. It gives me a place to talk about some of the challenges, frustrations, and successes.

I plan to discuss:
  1. Why bother with a translation dictionary
  2. Publishing a mobile website
  3. Siterank 0 -- how did that happen?