Using Usenet from Perl - The Perl Journal, Winter 1996

Graham Barr

The Internet is an ideal medium for disseminating information to masses of people. E-mail can be used to distribute information to the masses via mailing lists, but there's a drawback: every piece of mail is sent to every subscriber. If there are a thousand people on your mailing list, there will be a thousand separate copies of each message zipping around the Internet. Mailing list maintenance can be a hassle as well, as we all know from the occasional spasms of "subscribe" and "unsubscribe" messages on our favorite lists.

The most popular alternative to mailing lists is Usenet. By keeping articles in centralized repositories, Usenet avoids the traffic problems posed by large mailing lists. These repositories then exchange articles among themselves.

Users can read these articles, and post new ones, by connecting to a Usenet server using the Network News Transfer Protocol (NNTP).

The articles are categorized into newsgroups; each has a particular theme or subject. Articles can be associated with one or more newsgroups. There are hundreds of different newsgroups available. Four of the most important are devoted to Perl:

   comp.lang.perl.misc		           
   comp.lang.perl.announce 	
   comp.lang.perl.modules	 
   comp.lang.perl.tk

These groups are the ideal forum to ask questions and make Perl-related announcements. (comp.lang.perl.announce is moderated, which means that all articles must be approved by the group's moderator.)

So what can we do with these newsgroups, or others, using Perl?

Finding Newsgroups

To start with, we can find a list of all the newsgroups on the nearest news server. The code below shows how to initiate a connection to the news server and retrieve the list of newsgroups.

#!/usr/bin/perl -w 	

use Net::NNTP; 

# most systems provide the name 'news' as an
# alias for the news server. If yours 
# doesn't, you'll need to change the following
# line to the name of your server.

$NNTPhost = 'news'; 	 	

# Create the connection

$nntp = Net::NNTP->new($NNTPhost) 
   or die "Cannot contact $NNTPhost: $!"; 

# The 'list' method returns a reference to a
# hash. The keys are the group names; each value
# is a reference to an array holding the last
# article number, the first article number, and
# the group's flags

$groups = $nntp->list() 
   or die "Cannot get group list";

print join("\n", keys %$groups), "\n";

# Always remember to quit the connection!

$nntp->quit;
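
The full list can be long, and you will often want only part of it. Here is a minimal sketch of a helper that picks out the group names matching a pattern. The name matching_groups() and the sample data are made up for illustration and are not part of Net::NNTP; only the keys of the hash are used.

```perl
use strict;
use warnings;

# Hypothetical helper: return the sorted names of all groups whose
# name matches $pattern. $groups is a hash reference keyed by group
# name, such as the one returned by list().
sub matching_groups {
    my ($groups, $pattern) = @_;
    my @names = sort grep { $_ =~ $pattern } keys %$groups;
    return @names;
}

# Made-up sample data standing in for a real server's answer
my %sample = map { $_ => 1 } qw(
    comp.lang.perl.misc
    comp.lang.c
    comp.lang.perl.announce
    rec.humor
);

print "$_\n" for matching_groups(\%sample, qr/\.perl\./);
```

In a real script you would pass the result of $nntp->list() instead of the sample hash.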

If you're new to Usenet, you might be wondering which newsgroups to read. Help is at hand; most servers support a command which suggests newsgroups for new readers.

# Get a list of recommended subscriptions 	
# This may fail, since not all servers support 
# this feature

$subs = $nntp->subscriptions()
   or die "Cannot get subscription list";

# The 'subscriptions' method returns a reference
# to an array. Each element is the name of a
# recommended newsgroup

print join("\n", @$subs), "\n";

Now we know what groups are available, and which are recommended for new readers. What else can we do? Besides writing yet another newsreader, we can do something far more useful: filter out articles we don't want to see. If, like me, you don't have the time to read, or even browse, all the articles in your favorite newsgroups, you can use my Net::NNTP module to write scripts that automatically extract articles matching criteria of your own design.

Retrieving Articles

Every article is assigned an article number by the news server. Your newsreader uses these numbers to keep track of which articles have been read. For example, if you use a newsreader on a Unix machine, you probably have a .newsrc file in your home directory, with lots of lines like

   comp.lang.perl.announce: 1-435 	
   comp.lang.perl.misc: 1-42997 
   comp.lang.perl.modules: 1-1342 	
   comp.lang.perl.tk! 1-2263,2512

This group information is available via a method aptly named group(). When passed a group name, this method sets the current group pointer (CGP) and returns information about the group. If no group name is given, information about the current group is returned.
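
If you want your own scripts to read what a newsreader has recorded, the .newsrc lines shown above are easy to pick apart. The sketch below assumes the common format: a group name, then ':' for a subscribed group or '!' for an unsubscribed one, then a comma-separated list of article numbers and ranges. The name parse_newsrc_line() is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: split one .newsrc line into its parts.
# Returns a hash reference, or nothing if the line does not parse.
sub parse_newsrc_line {
    my ($line) = @_;
    my ($group, $mark, $list) =
        $line =~ /^\s*([^\s:!]+)([:!])\s*(.*?)\s*$/
        or return;

    my @ranges;
    foreach my $item (split /,/, $list) {
        if    ($item =~ /^(\d+)-(\d+)$/) { push @ranges, [$1, $2] }
        elsif ($item =~ /^(\d+)$/)       { push @ranges, [$1, $1] }
    }

    return {
        group      => $group,
        subscribed => ($mark eq ':' ? 1 : 0),
        ranges     => \@ranges,     # each entry is [first, last]
    };
}

foreach my $line ('comp.lang.perl.announce: 1-435',
                  'comp.lang.perl.tk! 1-2263,2512') {
    my $entry = parse_newsrc_line($line) or next;
    printf "%s (%s): %s\n",
        $entry->{group},
        $entry->{subscribed} ? 'subscribed' : 'unsubscribed',
        join(' ', map { "$_->[0]..$_->[1]" } @{ $entry->{ranges} });
}
```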

The CGP is one of two pieces of information that the NNTP server keeps, the other being the current article pointer (CAP). The CAP can be moved via three methods: last() and next(), which move the pointer backwards and forwards, and nntpstat(), which takes a single argument and sets the CAP.

The content of an article can be retrieved with three methods as well: head(), which retrieves the header of the article; body(), which retrieves the body; and article(), which retrieves both.

Luckily, you don't have to keep moving the CAP to retrieve each article. If you know the article number, you can pass it as an argument to head(), article(), or body() and the required article will be returned. This also sets the CAP as a side effect.

The example below shows how you can set the current group and retrieve parts of articles.

# Set the current group 	
($count,$first,$last,$group) = 		
  $nntp->group("comp.lang.perl.misc"); 	 

print join("\t", $count,$first, $last, $group),
           "\n"; 
print "-" x 60, "\n"; 	 

# Get the header of the last article 	

$arr = $nntp->head($last); 	
print @$arr if $arr; 	

print "-" x 60, "\n"; 	 	

# Now get the previous article 	

$nntp->last;
$arr = $nntp->body;
print @$arr if $arr;

print "-" x 60, "\n";

# And finally the oldest article still available

$arr = $nntp->article($first);
print @$arr if $arr;

Besides setting the current group pointer and using article numbers, you can also retrieve articles via Message-ID strings. Just as with e-mail messages, each Usenet message is assigned a unique Message-ID, and this string can also be provided to the head(), body() or article() methods to retrieve articles. However, calling these methods with a Message-ID doesn't change the CAP.

If you don't care about article numbers and just want to find the articles that have been posted since, say, yesterday, use the newnews() method, which returns the Message-IDs of all articles posted to a group (or groups) since a specified date.

The example below shows how to retrieve all articles posted in the last day to comp.lang.perl.misc. It retrieves each article and places it into a file. But this could be extended further; for example, you could have the script mail these articles to you (see my column in TPJ Vol. 1, Issue 1). Here we'll assume that you want to write each article to a separate file.

# Find all articles in comp.lang.perl.misc
# posted in the last 24 hours

$news = $nntp->newnews( time - 86400,
               'comp.lang.perl.misc') 
    or die "Cannot get newnews: $!";

foreach $msgid (@$news) { 	 
    # Get the text of the article 	 
    $article = $nntp->article($msgid) 
       or die "Cannot get '$msgid': $!";

    # Save the text in a file 	 
    ($file = $msgid) =~ s/[<>\/\$]/_/g;

    open(ARTICLE, ">$file") 
       or die "Cannot open $file: $!";

    print ARTICLE @$article; 	 
    close(ARTICLE); 	
}
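
Message-IDs can contain characters that are awkward in file names, which is why the loop above rewrites them before opening the file. As an alternative sketch, the helper below takes the opposite approach: instead of listing the characters to replace, it keeps only letters, digits, dots, underscores, and hyphens, and turns everything else into an underscore. The name msgid_to_filename() and the sample Message-ID are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: derive a conservative file name from a
# Message-ID by replacing every character outside a small safe set.
sub msgid_to_filename {
    my ($msgid) = @_;
    (my $file = $msgid) =~ s/[^A-Za-z0-9._-]/_/g;
    return $file;
}

# A made-up Message-ID, single-quoted so nothing is interpolated
print msgid_to_filename('<4abc$1@news.example.com>'), "\n";
```

Two different Message-IDs could in principle map to the same file name with this scheme, so it suits quick scripts better than long-term archives.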

Now it's getting a little more useful. But we can take it a step further, scanning the headers of the articles and retrieving only those we might be interested in. I do this myself with the comp.lang.perl.misc newsgroup; personally, I find that there's too much traffic for me to browse every article. For that purpose I run a script every hour which extracts articles satisfying particular criteria.

# Find all articles in 'comp.lang.perl.misc' 	
# that were posted in the last hour 	

$news = $nntp->newnews( time - 3600,
                           'comp.lang.perl.misc') 
       or die "Cannot get newnews: $!";

foreach $msgid (@$news) {

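    # Extract the subject line from the message.
    # xhdr() returns a reference to a hash of
    # header values, so match against its contents.
    $subj = $nntp->xhdr( 'Subject', $msgid )
       or die "Cannot get subject: $!";

    next unless join(' ', values %$subj) =~ /CPAN/i;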

    # Get the text of the article 	 
    $article = $nntp->article($msgid)
       or die "Cannot get '$msgid': $!";

    # Save the text in a file 	 
    ($file = $msgid) =~ s/[<>\/\$]/_/g;

    open(ARTICLE, ">$file") 
       or die "Cannot open $file: $!";

    print ARTICLE @$article; 	 
    close(ARTICLE); 	
}

This code wastes a little too much network bandwidth, because it first requests subject lines, and only later requests the articles. Instead of retrieving the subject lines with xhdr(), we could use the xpat() method, which makes our news server perform the pattern matching. The only disadvantage is that, as you might expect, xpat()'s pattern matching is much simpler than Perl's.

The pattern matching scheme used by xpat() is called wildmat, which you can think of as a stripped-down version of regular expressions. Here's a short description:

All wildmat patterns are automatically anchored at beginning and end.
An asterisk '*' matches any sequence of zero or more characters.
A question mark '?' matches any single character.
Square brackets delimit a range, just as with Perl. A leading caret '^' negates the range.
A backslash may be used to quote special characters.
All patterns are case sensitive.
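
To make these rules concrete, here is a rough sketch that translates a wildmat pattern into an equivalent Perl regular expression. It covers only the rules listed above; real wildmat implementations may differ in corner cases, and wildmat_to_regex() is a name made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a wildmat pattern into an anchored,
# case-sensitive Perl regex, following the rules listed above.
sub wildmat_to_regex {
    my ($pat) = @_;
    my $re = '';

    # Walk the pattern one token at a time: a backslash-quoted
    # character, a bracketed range, or a single character.
    while ($pat =~ /\G(\\.|\[\^?[^\]]+\]|.)/gs) {
        my $tok = $1;
        if    ($tok eq '*')        { $re .= '.*' }           # any sequence
        elsif ($tok eq '?')        { $re .= '.' }            # any one character
        elsif ($tok =~ /^\\(.)$/s) { $re .= quotemeta($1) }  # quoted character
        elsif (length($tok) > 1)   { $re .= $tok }           # [...] passed through
        else                       { $re .= quotemeta($tok) }
    }

    return qr/\A$re\z/s;
}

my $cpan = wildmat_to_regex('*[Cc][Pp][Aa][Nn]*');
foreach my $subject ('New module on CPAN', 'Perl on Windows') {
    print "$subject: ", ($subject =~ $cpan ? "match" : "no match"), "\n";
}
```

A translation like this is handy for testing a pattern locally before asking the server to apply it.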

The code below is similar to the previous example, but uses xpat() to search for CPAN articles instead of a regular expression. Also, instead of using Message-IDs to reference the articles, we use article numbers just for kicks.

($count,$first,$last) =
   $nntp->group('comp.lang.perl.misc');

$subj = $nntp->xpat('Subject',
                    '*[Cc][Pp][Aa][Nn]*',
                    [$last - 20, $last])
   or die "Cannot get subject lines: $!";

foreach $msgnum (keys %$subj) {

    # Get the text of the article
    $article = $nntp->article($msgnum)
       or die "Cannot get '$msgnum': $!";

    open(ARTICLE, ">$msgnum")
       or die "Cannot open $msgnum: $!";

    print ARTICLE @$article; 	 
    close(ARTICLE); 	
}

Posting Articles

If your news server and newsgroup permit, you can post articles as well as read them. To do this with Net::NNTP, you'll need to create a series of lines similar to an e-mail message, with a blank line separating the header from the body. In particular, you'll want these four fields:

Subject - This line should always be present, and should contain a concise description of your article. Subject lines like "Help" and "Can any gurus answer this?" aren't very explanatory; see Dean Roehrich's periodic comp.lang.perl.misc posting about good Perl Usenet etiquette.

From - This line should contain an e-mail address for people who want to contact you directly instead of posting a followup for everyone to see.

Newsgroups - This line must be present, containing a comma separated list of groups to which this article is being posted. Do not put spaces after the commas!

References - This line is an ordered list of Message-ID strings from previous articles in the thread. It is normally generated by the newsreader. If your article isn't a followup, you don't need it.
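
As a rough sketch of how these fields fit together, the helper below assembles an article as a list of lines: the header fields first, then a blank line, then the body. The name build_article(), its argument names, and the sample address and text are all invented for this example; nothing here talks to a server.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble the lines of an article from its
# header fields and body text.
sub build_article {
    my (%args) = @_;
    my @lines;

    push @lines, "From: $args{from}\n";
    # Comma-separated, with no spaces after the commas
    push @lines, "Newsgroups: " . join(',', @{ $args{newsgroups} }) . "\n";
    push @lines, "Subject: $args{subject}\n";
    # Only followups carry a References line
    push @lines, "References: " . join(' ', @{ $args{references} }) . "\n"
        if $args{references} && @{ $args{references} };

    push @lines, "\n";      # blank line separates header from body
    push @lines, map { "$_\n" } split /\n/, $args{body};
    return @lines;
}

print build_article(
    from       => 'you@example.com',
    newsgroups => ['comp.lang.perl.misc'],
    subject    => 'How do I sort a hash by value?',
    body       => "I have a hash of word counts.\nHow can I sort it?",
);
```

The resulting list of lines is in the same shape as the contents of a prepared article file.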

The following example reads an article from a file and posts it to the server.

	
# Open the file containing the new article 
open(ART, "post.art")
   or die "Cannot read 'post.art': $!";

# Post the article 
$nntp->post(<ART>) 
   or die "Could not post article: $!"; 	 

# Close the file 	
close(ART);

The post() method is great if your article is already formatted and ready to see the world. But if you're constructing an article on the fly, you need a method that transmits your article line by line. That's what the datasend() method does. Here's an example that uses datasend() to post an article. It's functionally equivalent to the above program, but has a little more flexibility: if you wanted, you could have the loop perform some transformation on certain article lines as they're read from the filehandle.

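open(ART, "post.art")
   or die "Cannot read 'post.art': $!";

$nntp->post()
   or die "Could not post article: $!";

while(<ART>) {
    $nntp->datasend($_);
}

# Tell the server that the article is complete
$nntp->dataend()
   or die "Could not post article: $!";

close(ART);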

Those of you about to write scripts with Net::NNTP should know that it supports a debug mode. The debug() method, when called with a value greater than zero, echoes all communication between your program and the news server to STDERR. If you ever have problems with your scripts, try it before you panic.

All examples in this article are written using my Net::NNTP module. Net::NNTP is distributed as part of the libnet distribution, and is available from any CPAN site.

One final point - remember that Internet bandwidth is a finite resource. Please don't abuse it.

__END__
