2006-11-05 Web Indexing and Bot Behavior

Web Indexing, an Easy Task?

Today I looked carefully at the W3C log entries of my messy web server. Many of the HTTP requests come from web crawlers gathering my (ugly) pages. Before looking at the logs, I expected the different crawlers to behave in much the same way. In theory that could be true, but real life is a different story. For example, VoilaBot (an indexer owned by France Telecom) behaves very strangely: it reads robots.txt far too often, roughly one robots.txt request for every two page requests. Of the 12282 requests for robots.txt on foo.be over 15 days, 8532 came from VoilaBot:

grep robots.txt unix-logs-full | wc -l   
12282
grep robots.txt unix-logs-full | grep VoilaBot | wc -l    
8532
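
The same count can be broken down per crawler in one pass. This is only a minimal sketch: it matches bot names as plain substrings, exactly like the grep above, and the sample lines and the function name robots_share are made up for illustration (real entries depend on the server's log format).

```python
from collections import Counter


def robots_share(lines, bots):
    """Count robots.txt requests in total and per bot name (substring match)."""
    total = 0
    per_bot = Counter()
    for line in lines:
        if "robots.txt" not in line:
            continue  # not a robots.txt request
        total += 1
        for bot in bots:
            if bot in line:
                per_bot[bot] += 1
    return total, per_bot


# Hypothetical log lines, only here to show the shape of the input.
sample = [
    '192.0.2.1 - - [05/Nov/2006:10:00:00 +0100] "GET /robots.txt HTTP/1.0" 200 120 "-" "VoilaBot BETA 1.2"',
    '192.0.2.1 - - [05/Nov/2006:10:00:05 +0100] "GET /index.html HTTP/1.0" 200 900 "-" "VoilaBot BETA 1.2"',
    '192.0.2.1 - - [05/Nov/2006:10:00:09 +0100] "GET /robots.txt HTTP/1.0" 200 120 "-" "VoilaBot BETA 1.2"',
    '192.0.2.7 - - [05/Nov/2006:10:02:00 +0100] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1"',
]

total, per_bot = robots_share(sample, ["VoilaBot", "Googlebot"])
for bot, count in per_bot.most_common():
    print(bot, count, "of", total)
```

To run it on a real log, pass an open file instead of the sample list, for example `robots_share(open("unix-logs-full"), ["VoilaBot"])`.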

Mmmmmm… and it looks like I'm not the only one seeing this behavior. VoilaBot seems to have a short memory; caching robots.txt would do the job. VoilaBot is not alone and has a friend, the VoilaBot link checker, which made around 2307 requests in 15 days. That looks OK, but the funny part is that the checker always checks the same link…

I understand that indexing the wild Internet (and, by extension, developing a bot) is not an easy task, but some other web crawlers look more clever. GoogleBot is somewhat smarter, and Google relies more and more on the sitemap protocol. Google also provides an interface showing the current indexing status of your web pages: a very nice tool and an efficient way to check the "indexing" visibility of your website.

Does a crawler's behavior tell us something about the quality of the indexing service behind it? Maybe. If so, Google's competitors are still far from having a good indexing service. At the very least, competitor bots should be able to fetch a website's sitemaps. Generating a sitemap and pinging a search engine is an easy task, and a lot of free software does it by default (e.g. the sitemap module in Oddmuse). For my static content, I wrote a quick script to export the directory structure of my htdocs in the sitemap format. Every web crawler should be able to understand the sitemap and use that information to crawl more efficiently. That way, the crawlers could focus on the information that is really difficult to index (large libraries, revisions in wikis, mbox files, …).
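
Exporting a directory tree as a sitemap can be sketched in a few lines. This is not the quick script mentioned above, only an illustration under assumptions: the function name build_sitemap is made up, every file under the root is listed (no filtering of private files), and the namespace is the one from sitemaps.org.

```python
import os
from datetime import datetime, timezone
from urllib.parse import quote
from xml.sax.saxutils import escape


def build_sitemap(root_dir, base_url):
    """Return sitemap XML listing every file found under root_dir."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # URL path relative to the document root, always with "/".
            rel = os.path.relpath(path, root_dir).replace(os.sep, "/")
            loc = base_url.rstrip("/") + "/" + quote(rel)
            # Use the file modification time as the lastmod date.
            mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
            lines.append(
                "  <url><loc>%s</loc><lastmod>%s</lastmod></url>"
                % (escape(loc), mtime.strftime("%Y-%m-%d"))
            )
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"
```

A call such as `build_sitemap("/var/www/htdocs", "http://www.example.org/")` (both values are placeholders) returns the XML as a string, ready to be written to sitemap.xml.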

Update 2006-12-02: It looks like Google, Yahoo! and Microsoft agreed in mid-November on the sitemaps protocol proposed by Google. For more information, see www.sitemaps.org. The only missing part (IMHO) is a simple META entry to specify the sitemap location in the root document of a website. That would ease the crawler's job when it fetches the root page, and spare the webmaster from pinging each search engine.

Update 2007-03-27: I contacted Voila (France Telecom) about the strange behavior of their bot (always downloading robots.txt, downloading documents that never change, looping on URLs). The funny part is that they don't understand what the Voila bot is, and their generic answer is "Your computer is infected by a virus"… arf arf.

Hello,

We have received your email and thank you for contacting us.

We invite you to run a search on our Voila search engine, in the hope that you will find the solution to this problem there.
You can also run a search using the Webchercheurs service, available at:
http://webchercheurs.servicesalacarte.voila.fr

Please note that this is a virus affecting your computer.

Happy surfing and see you soon!

Voila User Service
Our commitment: to listen to you 100%.
----------------------------------------------------------------------
Make a wish and Voila! http://www.voila.fr/
----------------------------------------------------------------------

--- Original Message ---
From: adulau@XXXXXXXX
Received: 03/26/2007 03:23pm Romance Standard Time (GMT + 2:00 )
To: contact@voila.net
Subject: Contact Voila - Search engine

Topic: VoilaBot BETA 1.2

Sir,

Could you inform the technical department of voila.com (France Telecom) that the Voila bot is not working properly? It regularly downloads (on every download) the robots.txt of websites, and on top of that it continuously downloads pages that do not change... It does not seem very efficient.

Thank you for informing the department responsible for the Voila bot,

Best regards,


Service: Search engine
Origin: http://search.ke.voila.fr

Update 2007-04-11: The sitemap protocol has been extended so that robots.txt can tell bots where the sitemaps are located. Great, you no longer need to ping the search engines. You just need to add a line starting with Sitemap: followed by the full URL of the sitemap. For more information, just check out the sitemap protocol.
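
On the crawler side, finding those lines is cheap. A minimal sketch, assuming the robots.txt content is already in memory as text; the function name sitemap_urls and the example.org URLs are made up for illustration.

```python
def sitemap_urls(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt text."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        key, sep, value = line.partition(":")
        # The field name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


example = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.org/sitemap.xml
"""
print(sitemap_urls(example))
```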

Update 2007-04-14: Dave Winer wrote an interesting entry in Scripting News about sitemaps and described a similar approach he took in 1997. The idea was a simple, readable, reverse-chronological file listing the changes of a specific website. I hope that version 2.0 of the sitemap protocol will include a similar approach, in order to fetch changed content efficiently.
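
In the meantime, a crawler could approximate such a change list from the optional lastmod field that the sitemap protocol already defines. A rough sketch, assuming lastmod values start with a YYYY-MM-DD date; the function name recent_changes is made up, and this is not what Dave Winer's format looked like.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recent_changes(sitemap_xml, since):
    """Return (lastmod, loc) pairs changed on or after `since`, newest first.

    `since` is a 'YYYY-MM-DD' string; entries without lastmod are skipped.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        # ISO dates compare correctly as plain strings.
        if loc and lastmod[:10] >= since:
            changed.append((lastmod, loc))
    changed.sort(reverse=True)  # reverse-chronological order
    return changed
```

A bot that remembers the date of its last visit would then only fetch the URLs returned for that date.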

Tags: searchengine blog bot sitemap crawler

2006-11-12 Djabberd The Versatile Jabber Server

Upgrading a Jabber server... and discovering a nice XMPP framework

Besides a lot of things to do for my daily and nightly work, I made a small and interesting discovery in the area of XMPP servers. I planned to upgrade an old (and worse… unstable and insecure) Jabber server. Finding the right XMPP server is not easy, but I found one written from scratch by Brad Fitzpatrick, called djabberd. It's a very flexible XMPP framework written in (clean) Perl, where everything is a plugin, and it supports the XMPP standard quite well. If you are interested in my minimal configuration, just have a look (I'm using standard HTTP Digest authentication, but you are free to use whatever authentication approach fits your needs). Digging into how it works, I found that the modularity is real. I wrote a very small plugin in 2 minutes to query Wikipedia:

package DJabberd::Bot::Wikipedia;
use strict;
use warnings;
use base 'DJabberd::Bot';
use WWW::Wikipedia;
use DJabberd::Util qw(exml);

our $logger = DJabberd::Log->get_logger();

# Set the default node name and create the Wikipedia client.
sub finalize {
    my ($self) = @_;
    $self->{nodename} ||= "wikipedia";
    $self->{bot} = WWW::Wikipedia->new;
    $self->SUPER::finalize();
}

# Search Wikipedia for the incoming message text and reply with the entry.
sub process_text {
    my ($self, $text, $from, $cb) = @_;
    my $entry = $self->{bot}->search($text);
    if ($entry) {
        $cb->reply($entry->text());
    } else {
        $cb->reply("Entry not existing in Wikipedia.");
    }
}

1;

If you want to test it, just send a Jabber/XMPP message containing a word to wikipedia@a.6f2.net. Nifty and easy…

Infobot Back in Jabber?

Playing with that, I remembered a discussion with Vincent about the Infobot running on IRC, which you can ask for a "factoid" and get an answer. Wikipedia is full of "factoids" (in the good sense), I mean sentences structured like "X is Y", and full of karma (X++ when a sentence recurs). So why not build a database of factoids from Wikipedia? That could be useful for building a pseudo-AI bot for instant messaging. Imagine a Jabber chat room where the bot plays its role when a large discussion is taking place and a term needs clarification. It's really funny that we are going back to a more textual society with such new tools… (IRC is not dead?). So, quite good news.
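
To make the idea concrete, here is a naive sketch of such a factoid database. Everything in it is an assumption for illustration: the regular expression only catches simple "X is Y" / "X are Y" sentences, the names extract_factoid and FactoidStore are made up, and karma is read here as "how many times a factoid about X was seen".

```python
import re
from collections import Counter

# Subject, then "is" or "are", then the fact, up to the end of the sentence.
FACTOID = re.compile(
    r"^\s*(?P<subject>[^.?!]+?)\s+(?P<verb>is|are)\s+(?P<fact>[^.?!]+)"
)


def extract_factoid(sentence):
    """Return (subject, verb, fact) for an 'X is Y' sentence, else None."""
    match = FACTOID.match(sentence)
    if match is None:
        return None
    return (match.group("subject").strip(), match.group("verb"),
            match.group("fact").strip())


class FactoidStore:
    """In-memory factoid database with a per-subject karma counter."""

    def __init__(self):
        self.facts = {}         # lowercased subject -> (subject, verb, fact)
        self.karma = Counter()  # lowercased subject -> times seen

    def learn(self, sentence):
        parsed = extract_factoid(sentence)
        if parsed is None:
            return False
        key = parsed[0].lower()
        self.facts[key] = parsed  # the latest factoid wins
        self.karma[key] += 1      # X++ each time a factoid about X recurs
        return True

    def ask(self, subject):
        entry = self.facts.get(subject.strip().lower())
        return None if entry is None else "%s %s %s" % entry


store = FactoidStore()
store.learn("Jabber is an open instant messaging protocol.")
print(store.ask("jabber"))
```

Feeding it the first sentence of Wikipedia entries would be one cheap way to fill the database; anything more ambitious needs real sentence parsing.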

2006-11-26 Transforming Internet In An Useless Medium

Society of "Authors"/Editors Against Digital Archiving?

Thinking back on the nonsensical Copiepresse action and the Belgian court ruling in their favor, I'm still wondering what this kind of editors' society really wants. We can read everywhere that Copiepresse wants a share of Google's revenue. That may be one of the reasons behind the legal action, but in my humble opinion it's not the main one. On the Copiepresse website there are plenty of leaflets stating that you can't make a copy without first asking us for authorization, and that more and more works (read: controlled by us ;-) are in electronic format. Reading the leaflets, I think the underlying idea is to carry the old approach for physical objects over to the digital world while adding more restrictions (maybe it's my naive perception).

More restrictions? Is that possible? Yes. Today, the physical archiving of the press is done by publicly funded libraries or archiving institutions. There are exceptions in copyright/author's-right law that allow them to preserve press articles [1]. The public can access the preserved articles, even without a fee. But in the digital world, who takes care of archiving? archive.org is doing part of the job but lacks the funding (and help) to scale up. Google does it as a service for Internet users and adds a little bit of advertising. Other digital archiving initiatives are growing, but I agree with Jean-Etienne about the difficulty of digital archiving. Besides the technical difficulties, there are numerous legal ones, including the latest action from Copiepresse. The editors' societies don't want to share; they want to capture the new archiving market. They see an opportunity to change the current physical archiving system by removing it in the electronic world. That's why I think their real target is not Google's money: they just want to kill the ability to do digital archiving.

The ability to do digital archiving is critical for preserving information over the long term. More and more information is only available in digital form (quite often in a proprietary format, with a lot of restrictions, or in limited quality)… Digital archiving can only work collaboratively. No single provider can scale to archive all the information generated today. The only way is to free the archiving of digital information and to clearly extend the archiving exception in the law, not only to institutions but to any foundation, citizen or association willing to help archive the digital society.

By the way, here is a small archive of lesoir.be, built without crawling the website, using only the RDF file the website provides (and the Python script that does it; feel free to use it to archive your favorite website).
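
For readers who want to try the same approach, the idea can be sketched as follows. This is not the script linked above, only an illustration under assumptions: the feed is taken to be RSS 1.0 (RDF), the names feed_items and archive are made up, and downloading is left to a fetch callable supplied by the caller.

```python
import hashlib
import os
import xml.etree.ElementTree as ET

RSS = "{http://purl.org/rss/1.0/}"  # RSS 1.0 namespace used inside RDF feeds


def feed_items(rdf_xml):
    """Return (title, link) pairs for every item of an RSS 1.0 / RDF feed."""
    root = ET.fromstring(rdf_xml)
    items = []
    for item in root.iter(RSS + "item"):
        title = (item.findtext(RSS + "title") or "").strip()
        link = (item.findtext(RSS + "link") or "").strip()
        if link:
            items.append((title, link))
    return items


def archive(items, dest_dir, fetch):
    """Store each linked page once; fetch(url) must return the page as bytes."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for title, link in items:
        # A hash of the URL gives a stable, filesystem-safe file name.
        name = hashlib.sha1(link.encode("utf-8")).hexdigest() + ".html"
        path = os.path.join(dest_dir, name)
        if not os.path.exists(path):  # already archived: do not fetch again
            with open(path, "wb") as fh:
                fh.write(fetch(link))
        saved.append(path)
    return saved
```

With the standard library, fetch can be as simple as `lambda url: urllib.request.urlopen(url).read()`. Run from cron against the feed, this only downloads articles that have not been stored yet.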


Tags: copyright copiepresse copiepress google archiving

Footnotes:

1. I took the press as an example because Copiepresse represents some of the mainstream press in Belgium.
