Friday, August 7, 2009

Bad Bots

Today I am going to write about my experience with bad bots.

What is a bot? A bot is an automated program used to scan websites. Not all bots are bad. For example, the Google indexer is a bot. I want Google, Yahoo, MSN, and the like to scan my site. That way other people can find it.

What makes a bot bad? A bad bot is a bot that doesn't follow the rules or play nice. For example, if a bot doesn't follow the rules in the robots.txt file on a site, it is considered bad. A bot is considered malicious if it looks for security holes or tries to inject SQL statements to gain access to the database. A bot doesn't play nice if it scans too many pages at a time. For example, Google scans about 1 page every 5 seconds. A bad bot tries to scan 10 or even 100 pages per second. This can have a couple of negative effects. One, it could slow down the performance of the site to the point of making it unusable or non-responsive. Two, it could use up my monthly data transfer quota, costing me more to maintain the site.

So, how do you prevent bots from causing too much damage without hurting regular visitors? I took a few steps to do this. I decided to limit the pages served to a single device to 1 page per second. If a device asks for multiple pages within a second, it is put on hold so that it only gets 1 page per second. The hard part is figuring out when it is the same device.
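Before getting to that, here is a rough sketch in Python of the throttle itself, assuming for the moment that we already have some key identifying the device. The function and variable names are mine and just for illustration, not the actual code running on our server.

import threading
import time

_lock = threading.Lock()
_next_slot = {}  # device key -> earliest time its next page may be served

def throttle(device_key, min_interval=1.0):
    # Reserve the next one-second slot for this device and sleep until it
    # arrives, so each device gets at most 1 page per second.
    with _lock:
        now = time.time()
        slot = max(_next_slot.get(device_key, now), now)
        _next_slot[device_key] = slot + min_interval
    wait = slot - now
    if wait > 0:
        time.sleep(wait)

A real server would also need to clean out old keys now and then so the table doesn't grow forever.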

All normal browsers allow cookies unless the user has turned them off. A cookie is a small piece of information that a website gives to a browser to be sent back when the user returns. In general, cookies are used to keep session information and some personal preferences. For example, I use cookies to remember the user's preferred language settings. That way they don't have to set the language every time they want to look a word up in the dictionary. Cookies are also useful for tracking a user's path through the website. You can track which pages they viewed. I use this information to check the flow of the design of my website. If users get stuck on a page and leave, then something is wrong and I need to investigate. As you can see, cookies can be very useful.
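As a small illustration, here is roughly what a language cookie could look like in Python. The cookie name "lang" and the one-year lifetime are just choices for the example, not necessarily what our server does.

from http.cookies import SimpleCookie

def read_language(cookie_header, default="en"):
    # Pull the preferred language back out of the Cookie header, if present.
    cookie = SimpleCookie(cookie_header or "")
    return cookie["lang"].value if "lang" in cookie else default

def language_set_cookie(lang):
    # Build a Set-Cookie header that remembers the language for a year.
    cookie = SimpleCookie()
    cookie["lang"] = lang
    cookie["lang"]["max-age"] = 60 * 60 * 24 * 365
    return cookie.output(header="Set-Cookie:")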

Cookies are a great way to identify a device and recognize return visitors. There is one hitch: as a rule, bots don't allow cookies. Now what? The very device we want to track doesn't allow us to track it by giving it a cookie. Hmmm, maybe we can use that property to our advantage. I have grouped every visitor into two groups: those with cookies and those without. Those without are put on "probation" and have fewer privileges than those with.
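In code, that split is just a check for whether our cookie came back. The cookie name "visitor_id" below is hypothetical, picked only for this sketch.

from http.cookies import SimpleCookie

def is_on_probation(cookie_header):
    # No cookie from us yet (or cookies refused) -> treat as probation.
    cookie = SimpleCookie(cookie_header or "")
    return "visitor_id" not in cookie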

Sounds good, we group all those without cookies and make them wait. If a bot doesn't have a cookie, it can sit in the waiting room until its number gets called. But wait, new users also don't have cookies, and they will end up waiting with all the bots. A 20 second wait to get your first page will not make for a good first time user experience. We need some way to separate the bots from the new users. Ideally we could even separate the bots from each other. Maybe there is some other piece of information we could use.

Each browser sends some useful information to the server when it asks for a page: things like the default language, which kinds of content it can accept, the IP address of the computer, and the agent string. Huh? The IP address tells the server where to send the web page. The agent string tells the server a little bit about the browser requesting the page. Here is an example of an agent string for an iPhone:

Mozilla/5.0 (iPhone; U; CPU iPhone OS 2_2_1 like Mac OS X; en-us) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5H11 Safari/525.20

Here is an agent string for a Google bot:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Now Google clearly marks their bots in the agent string. With others it is not so clear.
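For the curious, here is roughly how a Python (WSGI-style) server sees those two pieces of information, plus a check for bots that are honest enough to label themselves. The list of markers is just an example and only catches the well-behaved ones.

def request_info(environ):
    # environ is the WSGI request environment.
    ip = environ.get("REMOTE_ADDR", "")
    agent = environ.get("HTTP_USER_AGENT", "")
    # Only catches bots that identify themselves, like Googlebot above.
    markers = ("googlebot", "slurp", "msnbot", "bot", "spider")
    labeled_bot = any(mark in agent.lower() for mark in markers)
    return ip, agent, labeled_bot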

So, how are the agent string and the IP address going to help? I use them to group the visitors that are on "probation" together. Everyone with the same IP address and agent string goes into the same waiting room, where they are served 1 page per second. This effectively gives each new user their own private waiting room. One drawback to this approach: if a group of students, all sitting in the same classroom with the same type of computer running the same version of the browser, are told by the teacher to look up a word in our dictionary, they may see a long delay before they see the first page. Subsequent pages will be served as fast as our server can handle them. So it is only a problem the first time they visit our website.
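Putting it together, the throttle key might be chosen like this. Again, this is a sketch with WSGI-style names and the hypothetical "visitor_id" cookie: visitors with our cookie get a private key, and everyone else shares a waiting room keyed by IP address plus agent string.

from http.cookies import SimpleCookie

def device_key(environ):
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    if "visitor_id" in cookie:
        # A returning, cookied visitor gets a private throttle key.
        return "id:" + cookie["visitor_id"].value
    # Probation: everyone with the same IP address and agent string
    # shares one waiting room, served 1 page per second.
    ip = environ.get("REMOTE_ADDR", "?")
    agent = environ.get("HTTP_USER_AGENT", "?")
    return "probation:" + ip + "|" + agent

Each request would then call throttle(device_key(environ)) from the earlier sketch before the page is rendered.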

So, that is it. I don't prevent bots from visiting our site. I just limit how fast they are served pages, effectively limiting any damage they may cause. If you have any questions or suggestions, please let me know. I am curious whether you found this useful.

You can test it for yourself against our site: m.glot.com. But please, don't try to bring down our site.
