Tuesday, September 1, 2009

Captcha: got to love it, got to hate it

I have to be honest, I don't really like filling out Captcha tests. I find many of them unreadable, so I have to guess at what they want, and I get it wrong about half of the time.

Yet this isn't a post about what is wrong with Captcha tests. It is a post about why I ended up putting one on my site, Mobile Glot. If you go to my site, you will most likely never see the Captcha test. I reserve the test for visitors that act like bots but arrive with a browser agent claiming to be an iPhone or another common browser.

For example, there is someone behind 78-33-42-157.static.enta.net who is sending out a bot pretending to be an iPhone version 1.0. Yet it clearly acts like a bot. After searching for his IP address on the net, I can see he has been hitting many other sites, so I have no qualms about letting other people know.

What do I mean by "acts like a bot"? In this particular case, his bot acts like a search bot in the way it follows the links. Mobile Glot is a translation dictionary site, and its visitors fall into a series of usage patterns, most of them very predictable:
  1. First Timer -- first timers tend to show up, change the translation languages, look up a word or two, follow a couple of links, then leave (hopefully bookmarking the site for later).
  2. First Timer (teenager) -- the basic difference here is which words they look up: mostly profanity and 4-letter words (hopefully bookmarking the site so they can broaden their vocabulary at a later date).
  3. Look-up and leave -- these visitors come a few times a week. They tend to look up one or two words and leave.
  4. The browser -- these visitors look through the top 10 lists and related words. They spend a lot of time browsing the site.
  5. The Vocabulary list -- these visitors go through their vocabulary list.
  6. The Reader -- these visitors are busy reading a book or article. They look up words to get a better understanding of what they are reading. They will follow related words and other links.
  7. The translator -- these visitors tend to follow the Look-up and leave pattern. They are busy and just want to confirm a word they already know.
How are bots different from people?
  1. They follow internal links 99.9% of the time.
  2. They request pages much faster and more regularly than a person can.
  3. The pages they ask for are not related to the previous page.
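
The three differences above can be turned into simple checks on one visitor's request log. What follows is only a rough sketch, not the code running on Mobile Glot: the `Hit` record, the `bot_signals` function, and every threshold in it are assumptions made up for illustration, and how a page is judged "related" to the previous one is left to the caller.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Hit:
    """One page request by a visitor (hypothetical record layout)."""
    timestamp: float           # seconds since the visitor's first request
    via_internal_link: bool    # the request followed a link on the site
    related_to_previous: bool  # the page is related to the one before it


def bot_signals(hits, min_mean_gap=2.0, max_gap_stdev=0.5,
                min_internal_ratio=0.99, max_related_ratio=0.1):
    """Return the set of bot-like signals raised by one visitor's hits.

    All thresholds are illustrative guesses, not measured values.
    """
    signals = set()
    if len(hits) < 5:  # too few requests to judge (arbitrary cutoff)
        return signals

    # 1. Nearly every request follows an internal link.
    internal_ratio = sum(h.via_internal_link for h in hits) / len(hits)
    if internal_ratio >= min_internal_ratio:
        signals.add("always_internal")

    # 2. Requests arrive faster and more evenly spaced than a person manages.
    gaps = [b.timestamp - a.timestamp for a, b in zip(hits, hits[1:])]
    if mean(gaps) < min_mean_gap:
        signals.add("too_fast")
    if pstdev(gaps) < max_gap_stdev:
        signals.add("too_regular")

    # 3. Pages are rarely related to the page requested just before.
    related = sum(h.related_to_previous for h in hits[1:]) / (len(hits) - 1)
    if related <= max_related_ratio:
        signals.add("unrelated_pages")

    return signals
```

No single signal proves anything. A fast reader on a good connection may trip one of them, so it seems safer to act only when several fire together.
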
I don't "know" why these bots are on my site, but they act in two distinct ways.
  1. The scanner -- this bot scans the site looking for email addresses or other useful bits of information.
  2. The scraper/skimmer -- this bot is trying to scrape the information off my site for its own use. Maybe even making its own translation dictionary.
Scanners are mostly annoying. They don't cause too much harm, because there is nothing for them to get. The only problem is that they screw with my stats. Google Analytics can be fooled by a scanner pretending to be a normal browser. I have to give Google some credit here, though. They seem to be able to detect scanners with fake browser agents. I had a scanner that visited several days in a row, reading thousands of pages. On the first day, Google Analytics treated it as thousands of new visitors. Yet, on the second day, it didn't get fooled. Good going, Google!

The scrapers are a bigger problem. First off, they are stealing the dictionary. It isn't theirs to take. Second off, they tend to look up words by directly manipulating the URL. The most common method seems to be using a word list: they go through every word in the list and read what they get back. In our dictionary, if a word isn't found, we offer suggestions. Generating the list of suggestions is rather expensive, so we would like to keep it to a minimum. We keep a list of common mistakes that takes care of most people. But scrapers tend to look up words that are not so common, so they miss that list and trigger the expensive step, possibly causing performance problems.
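
To make the cost issue concrete, here is a small sketch of that lookup order: dictionary first, then the list of common mistakes, and only then the expensive suggestion step. The names `look_up`, `common_mistakes`, and `expensive_suggestions` are hypothetical, and the fuzzy matching from Python's `difflib` is only a stand-in for whatever the real suggestion generator does.

```python
import difflib


def expensive_suggestions(word, vocabulary):
    """Stand-in for the costly suggestion step (assumed: fuzzy matching)."""
    return difflib.get_close_matches(word, vocabulary, n=3)


def look_up(word, dictionary, common_mistakes, suggest=expensive_suggestions):
    """Return (status, results), trying the cheapest path first."""
    if word in dictionary:
        # Normal case: the word exists, no suggestions needed.
        return "found", [dictionary[word]]
    if word in common_mistakes:
        # Cheap case: a known misspelling maps straight to its correction.
        return "suggested", [common_mistakes[word]]
    # Costly case: nothing known about this word, so search for near matches.
    return "suggested", suggest(word, list(dictionary))
```

A person who mistypes a word usually lands in the second branch. A scraper walking an arbitrary word list lands in the third branch again and again, and that is where the load comes from.
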

Whew. So, now you see the issue. What do I do about it?

The basic idea is to watch the usage pattern and present a Captcha test when I suspect it is a disguised bot. If it is a person, then they can answer the question and keep looking up words. If it is a bot, then it will get stuck.
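
In code, that idea might look something like the sketch below. The `CaptchaGate` class, its method names, and the threshold of two signals are all invented for illustration. It assumes some other part of the site counts how many bot-like signals a visitor has raised and checks the Captcha answer itself.

```python
class CaptchaGate:
    """Decide per request whether to serve the page or show a Captcha."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # bot-like signals needed to challenge
        self.cleared = set()        # visitors who already passed the test

    def decide(self, visitor_id, signal_count):
        """Return "serve" or "challenge" for this visitor's next request."""
        if visitor_id in self.cleared:
            return "serve"          # a person answered, so leave them alone
        if signal_count >= self.threshold:
            return "challenge"      # suspicious and not yet cleared
        return "serve"

    def record_answer(self, visitor_id, correct):
        """Remember visitors who solved the Captcha."""
        if correct:
            self.cleared.add(visitor_id)
```

A visitor who answers correctly is remembered and goes back to looking up words. A bot that cannot answer keeps getting "challenge" and never reaches the dictionary pages.
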
