Today I took a careful look at the W3C log entries of my messy web server. A lot of the HTTP requests come from web crawlers gathering and collecting my (ugly) pages. Before looking at the logs, I expected to see a common behavior among the different web crawlers. In theory that could be true, but real life is a different story. For example, VoilaBot (an indexer owned by France Telecom) behaves very strangely. That bot reads robots.txt far too often: for every two page requests, it makes one request for robots.txt. Out of 12282 requests for robots.txt in 15 days on foo.be, 8532 came from VoilaBot:
grep robots.txt unix-logs-full | wc -l
12282
grep robots.txt unix-logs-full | grep VoilaBot | wc -l
8532
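The same count can be done in a few lines of Python, which also gives the share directly. This is only a sketch: it matches plain substrings exactly like the grep pipeline above and assumes nothing about the log format. The helper name `robots_share` is mine, not part of any tool.

```python
def robots_share(lines, bot="VoilaBot", path="robots.txt"):
    """Return (requests for path, requests for path by bot, bot's share)."""
    total = by_bot = 0
    for line in lines:
        # Plain substring tests, equivalent to the two chained greps.
        if path in line:
            total += 1
            if bot in line:
                by_bot += 1
    # Avoid a division by zero when nothing matched.
    share = by_bot / total if total else 0.0
    return total, by_bot, share
```

Calling `robots_share(open("unix-logs-full", errors="replace"))` should reproduce the two numbers above; 8532 out of 12282 is roughly 69% of all robots.txt requests.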
Mmmmmm…. and it looks like I'm not the only one seeing this behavior. VoilaBot seems to be short on memory; caching robots.txt would do the job. VoilaBot is not alone and has a friend, the VoilaBot link checker. The link checker made around 2307 requests in 15 days. That looks OK, but the funny part is that the checker always checks the same link…
I understand that indexing the wild Internet (and, by extension, developing a bot) is not an easy task, but some other web crawlers look more clever. GoogleBot is somewhat more clever, and Google is using the sitemap protocol more and more. Google provides an interface to see the current indexing status of your web pages: a very nice tool and an efficient way to check the "indexing" visibility of your website. Does the behavior of a web crawler say something about the quality of the indexing service behind it? Maybe. If so, Google's competitors are still far from having a good indexing service.
At the very least, the competing bots should be able to fetch the sitemaps of a website. Generating a sitemap and pinging a search engine is an easy task, and a lot of free software includes that by default (e.g. the sitemap module in Oddmuse). For my static content, I made a quick script to export the directory structure of my htdocs in the sitemap format. All web crawlers should be able to understand the sitemap and use that information to crawl more efficiently. That way, the different crawlers could focus on the information that is really difficult to index (like large libraries, revisions in wikis, mbox files,…).
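To give an idea of what such an export involves, here is a minimal sketch in Python. It is not my original script: the function name `build_sitemap`, the base URL you pass in, and the choice of the sitemaps.org 0.9 namespace are all assumptions made for the example. It walks a directory and emits one `<url>` entry per file, with the file modification date as `<lastmod>`.

```python
import os
from datetime import datetime, timezone
from urllib.parse import quote
from xml.etree import ElementTree as ET

# Namespace of the sitemaps.org 0.9 schema (assumed for this sketch).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(htdocs, base_url):
    """Return sitemap XML (as a str) listing every file found under htdocs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for dirpath, dirnames, filenames in os.walk(htdocs):
        dirnames.sort()  # make the walk order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            # Path relative to the document root, with URL-style separators.
            rel = os.path.relpath(full, htdocs).replace(os.sep, "/")
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = base_url.rstrip("/") + "/" + quote(rel)
            mtime = datetime.fromtimestamp(os.path.getmtime(full), tz=timezone.utc)
            ET.SubElement(url, "lastmod").text = mtime.strftime("%Y-%m-%d")
    return ET.tostring(urlset, encoding="unicode")
```

Something like `build_sitemap("/var/www/htdocs", "http://www.example.org/")` (both arguments are placeholders) returns the XML as a string, ready to be written to a sitemap file. A real script would probably also skip files that should not be indexed.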
Update 2006-12-02 : It looks like Google, Yahoo! and Microsoft agreed in mid-November on the sitemaps protocol proposed by Google. For more information, see www.sitemaps.org. The only missing part (IMHO) is a simple META entry to specify the location of the sitemaps in the root document of a web site. That would ease the job of a crawler fetching the root page and spare the webmaster from pinging each search engine.
Update 2007-03-27 : I contacted Voila (France Telecom) about the strange behavior of their bot (always downloading robots.txt, downloading documents that never change, looping on URLs). The funny part is that they don't understand what the Voila bot is, and their general answer is "Your computer is infected by a virus"… arf arf. Here is their reply, followed by my original message (both translated from French):
Hello,
We have received your email and we thank you for contacting us. We invite you to run a search on our Voila search engine, in the hope that you will find there the solution to this problem. You can also run a search using the Webchercheurs service available at: http://webchercheurs.servicesalacarte.voila.fr
We inform you that this is a virus affecting your computer.
Happy surfing and see you soon!
The Voila User Service
Our commitment: to listen to you 100%.
----------------------------------------------------------------------
Make a wish and then Voila ! http://www.voila.fr/
----------------------------------------------------------------------
--- Original Message ---
From: adulau@XXXXXXXX
Received: 03/26/2007 03:23pm Romance Standard Time (GMT + 2:00 )
To: firstname.lastname@example.org
Subject: Contact Voila - Search engine
Subject : VoilaBot BETA 1.2
Sir,
Could you inform the technical department of voila.com (France Telecom) that the Voila bot is not working properly? It regularly downloads (at every download) the robots.txt of web sites, and on top of that it continuously downloads pages that do not change... It does not seem very efficient.
Thank you for informing the department responsible for the Voila bot,
Best regards,
Service : Search engine
Origin : http://search.ke.voila.fr
Update 2007-04-11 : The sitemap protocol has been extended so that robots.txt can tell bots where our sitemaps are available. Great, you no longer need to ping the search engines. You just need to add a Sitemap: line with the full URL of the sitemap. For more information, just check out the sitemap protocol.
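As an illustration, here is what such a robots.txt could look like and how a crawler might pick up the sitemap location from it. This is a sketch in Python under a few assumptions: the host and paths are placeholders, the helper name `sitemap_urls` is mine, and the field name is matched case-insensitively, which is a choice of this example.

```python
def sitemap_urls(robots_txt):
    """Return the URLs given on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop comments (this would also cut a URL containing '#').
        line = line.split("#", 1)[0]
        # Split on the first colon only, since the URL itself contains colons.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt; the host and paths are placeholders.
EXAMPLE = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.org/sitemap.xml
"""

print(sitemap_urls(EXAMPLE))  # ['http://www.example.org/sitemap.xml']
```

The Sitemap: line stands on its own in the file, so it applies to every bot that reads robots.txt, whichever User-agent section it appears next to.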
Update 2007-04-14 : Dave Winer wrote an interesting entry on Scripting News about sitemaps and described a similar approach he took in 1997. The idea was a simple, readable file listing the changes to a specific web site in reverse order. I hope that version 2.0 of the sitemap protocol will include a similar approach, so that crawlers can efficiently grab the content that has changed.