An IP Telephone in 74 Lines of Perl

Lincoln D. Stein

Perl 5.005 ................................ CPAN
IO::Socket ................................ CPAN
IO::File .................................. CPAN
Audio::DSP ................................ CPAN
ALSA
4Front Technologies
mpg123
lame

Convergence! The latest buzzword of the dot com era. Convergence is the magical integration of the desktop computer, the Internet, television, radio, and the telephone. In the words of the industry pundits, convergence will change everything, and the technology hailed as the forerunner of convergence is IP telephony, which allows you to make long distance calls with nothing more than an Internet connection and a high quality sound card.

I hate being left behind on the technology curve, so I decided to write my own IP telephone application in Perl. It isn't elegant, and it lacks most of the features of real IP telephony applications, but it works. In this article, I'll show you two versions of the application: a simple one, which requires ISDN-speed connections to work well, and a somewhat more sophisticated version that uses the MP3 format to reduce the bitrate for slower connections.

Soundcards and /dev/dsp

IP telephony requires three things:

1. Reading sound data from the microphone

2. Writing sound data to the system speaker

3. Moving sound data across the network

We know how to do item three with Berkeley sockets, but what about items one and two?

On Unix systems, the answer is simple. Audio-capable Unix systems have a special device file for communicating with the digital signal processor (DSP) driver. Common names include /dev/dsp, /dev/audio, and /dev/sound. Just open the device like an ordinary file, read from it to capture sound data from the microphone, and write to it to send sound data to the speaker.

The programs in this article run on Linux systems, and assume the sound device is named /dev/dsp. Linux sound drivers are available from a number of sources, including the kernel itself and an open source project called ALSA (the Advanced Linux Sound Architecture). A commercial vendor called 4Front Technologies sells an inexpensive package of sound drivers that works with a large number of sound cards on a variety of Unix operating systems, including FreeBSD, Solaris, HP-UX, and AIX. Although I haven't tested it, the program should work fine on any system equipped with the 4Front drivers.

Only duplex-capable sound cards are suitable for telephony, because a telephone isn't much good if it can't send and receive at the same time. The sound driver must also support duplex operation, something that gave me a great deal of trouble on my Linux laptop until I discovered that the ALSA driver for my sound card lacked duplex support.

When you read from /dev/dsp (or equivalent), the data is usually returned in PCM (pulse code modulation) format. Incoming sound is sampled at some number of times per second, and the amplitude of the sound is reported as a positive integer. Typical sampling rates range from 4000 Hz (samples per second) to 44,100 Hz. Each sample may be 8 or 16 bits and may be mono (one channel) or stereo (two channels). To send sound data from one sound card to another, the drivers for both cards have to be configured to accept data at the same sampling rate, sample size, and number of channels.
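The arithmetic matters later when we worry about bandwidth, so here's a quick sketch (the helper function is my own, not part of the listings) that computes the raw PCM data rate from those three parameters:

```perl
use strict;

# Raw PCM bandwidth is the product of sampling rate, sample
# size, and channel count.
sub pcm_bytes_per_second {
    my ($rate, $bytes_per_sample, $channels) = @_;
    return $rate * $bytes_per_sample * $channels;
}

# 8 kHz, 8-bit, mono: the telephone-quality default used here.
my $phone = pcm_bytes_per_second(8000, 1, 1);    # 8000 bytes/sec = 64 kbps

# 44.1 kHz, 16-bit, stereo: CD quality.
my $cd = pcm_bytes_per_second(44_100, 2, 2);     # 176,400 bytes/sec

print "telephone: $phone bytes/sec, CD: $cd bytes/sec\n";
```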

Unix provides a set of ioctl calls that let you adjust the sound card properties. More portably, you can use Seth Johnson's Audio::DSP module (available on CPAN) to change the sound settings for OSS and ALSA drivers. Unfortunately, Seth's module doesn't provide the direct filehandle access that we require for the more sophisticated telephony application, as we shall see later.

I don't know anything about working with audio on Win32 or Macintosh platforms. Please let me know if there's a way to do this.

The Simple Version

The simple version of the application is shown in Listings 1 and 2. There are two programs: a client, which you run to place a call, and a server, which accepts incoming connections. The server has to be running on the destination machine in order for the client to work.

Here's what the client looks like when running on my laptop, which is named pesto:

  Connected, go ahead...

And here's what the server looks like, running on my desktop machine, prego:

  waiting for a connection...
  accepting connection from

As soon as a connection is established, the two programs activate their audio systems, and you can conduct a telephone conversation across the network. It's particularly cool across my home wireless network -- like a high-tech walkie-talkie!

Halting the conversation is very crude in my current version. One or the other of the parties must kill the application with the interrupt key.

Conceptually, both programs are simple. They first establish a network connection using the IO::Socket interface. They then open up the DSP. Audio data read from the DSP is sent to the remote host via the socket, and data received from the socket is written to the DSP. The primary difference between the two programs is that the client actively establishes the connection, while the server passively waits for incoming connections. The conversation itself is two-way: both programs send and receive audio data. The client is the simpler of the two, so let's walk through it first. It's shown in Listing 1, which we'll go through line by line.

Listing 1.

Lines 1-3: Load modules

We turn on strict type checking and load the IO::File and IO::Socket modules, providing an object-oriented interface to filehandles.

Lines 4-5: Define constants

We pick an arbitrary buffer size for buffering data passing between the network and the sound device. We also define a constant for the path to the DSP device.

Line 6: SIGCHLD handler

Later you will see that when we establish a connection, we fork so that one process can read from the sound card while the other one writes to it. The CHLD handler helps ensure that when one process dies, the other dies with it.

Lines 7-10: Set up socket

We get the name and port of the destination machine from the command line, and call IO::Socket::INET's new() method to establish a connection. If successful, this returns a socket object for communication with the remote host.

Line 11: Open DSP

We use IO::File to open the digital signal processor driver in read/write mode, using a file access mode of r+ for this purpose.

Lines 12-13: Fork

We call fork to separate our work among two processes. The parent process will read from the socket and write to the sound card, while the child performs the opposite task. In the parent process, the value returned by fork will be the process number of the child. In the child process, the value returned by fork is zero.
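The fork convention is easy to demonstrate in isolation. This little sketch (mine, not part of Listing 1) forks, has the child exit with a recognizable status, and lets the parent collect it:

```perl
use strict;

defined(my $pid = fork) or die "fork failed: $!";

if ($pid == 0) {
    # Child: fork returned zero here.
    exit 42;
}

# Parent: fork returned the child's (nonzero) process ID.
waitpid($pid, 0);
my $status = $? >> 8;
print "child $pid exited with status $status\n";
```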

Lines 14-17: Parent process

The parent process is a tight loop, which reads BUFSIZE bytes of data from the socket and immediately sends it to the DSP using print. If the read from the socket returns 0, indicating that the remote end has hung up, we kill the child by sending it a TERM signal, and exit.
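The hangup test depends on sysread returning 0 at end-of-file. Here's a small demonstration of that behavior (my own sketch), using a pipe in place of the network socket:

```perl
use strict;

pipe(my $reader, my $writer) or die "pipe: $!";
syswrite $writer, "hello";
close $writer;            # simulates the remote end hanging up

my ($buffer, $total) = ('', '');
while (sysread($reader, $buffer, 1024)) {
    $total .= $buffer;    # in the phone, this would go to the DSP
}
# sysread returned 0: the writer is gone, so the loop ends.
print "received: $total\n";
```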

Lines 18-20: Child process

The child has the opposite task. It reads from the sound card and writes to the socket until either an error occurs while reading, or the parent kills it.

The receiver application has a slightly harder job, because it has to wait for an incoming connection, dispatch it, and then wait for another. Only one connection is handled at a time, because there's only one microphone and speaker system.

Listing 2.

Lines 1-3: Load modules

We turn on strict type checking and load the IO::Socket and IO::File modules as before.

Lines 4-6: Define constants and CHLD handler

We define the DSP and BUFSIZE constants as before. We will also be forking when incoming connections come in, but in this case we don't want to terminate the parent process. We simply wait on the child process in order to reap its status code (which we ignore). Otherwise we will accumulate zombie processes; see the perlipc documentation bundled with Perl for more details.
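A minimal reaping handler of the kind described might look like the sketch below (my own illustration, not the listing itself); the WNOHANG loop matters because a single SIGCHLD delivery can stand in for several exited children:

```perl
use strict;
use POSIX ":sys_wait_h";

# Reap every exited child without blocking, so no zombies
# accumulate while the server keeps accepting connections.
$SIG{CHLD} = sub {
    1 while waitpid(-1, WNOHANG) > 0;
};

defined(my $pid = fork) or die "fork: $!";
exit 0 if $pid == 0;      # child exits immediately

sleep 1;                  # give the handler a chance to run

# The handler already reaped the child, so there is nothing
# left for waitpid to find.
my $result = waitpid($pid, WNOHANG);
print "waitpid after reaping: $result\n";
```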

Lines 7-10: Open listening socket

We get the port number from the command line, or assume a default. We call IO::Socket::INET->new() to create a listening socket, passing it values for the local port, the size of the listen queue, and the Reuse flag, which prevents problems reopening the socket when the server is killed and immediately relaunched.
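As an illustration of the same IO::Socket::INET calls, here is a self-contained sketch (mine, not Listing 2) that listens on an ephemeral loopback port so it can talk to itself:

```perl
use strict;
use IO::Socket::INET;

# LocalPort 0 asks the kernel for any free port; Reuse avoids
# "address already in use" when the server is restarted quickly.
my $listen = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Listen    => 5,
    Reuse     => 1,
) or die "can't listen: $!";

# Connect to ourselves so accept() has something to return.
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $listen->sockport,
) or die "can't connect: $!";

my $connected = $listen->accept or die "accept: $!";
print $client "hello\n";
close $client;
my $line = <$connected>;
print "server got: $line";
```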

Lines 11-20: Main accept loop

We enter an infinite loop. Each time through the loop we print a message and call the socket's accept method, which blocks until an incoming connection is received. When this happens, accept returns a connected socket. We print out an informational message containing the dotted IP address of the remote host, and open a filehandle on the DSP in the same manner as before. If the open is unsuccessful, we print a warning message and hang up. Otherwise we call handle_connect() to do the data transfer.

Lines 22-26: handle_connection()

The handle_connection() subroutine works a lot like the main section of the client. After forking, our parent process copies data from the socket to the DSP, while the child process handles the reverse operation.

Lines 27-33: Parent process

The part of the subroutine that handles the parent process differs only slightly from the corresponding section of the client. Instead of terminating when the user hits the interrupt key, we want to intercept the signal, gracefully close the connection, and go back to listening for a new connection.
To achieve this, we use an eval{} block as an exception handler. The eval{} establishes a local INT signal handler that simply calls die. This is followed by the tight read/write loop that we saw before. When the user hits the interrupt key, the eval{} block terminates with die, and execution resumes at the next statement following the block. We close the socket, send a TERM signal to the child, and return from handle_connection() to resume the main accept loop.
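The eval{}-plus-local-signal-handler idiom can be tried out on its own. In this sketch (mine, not part of the listing), we send ourselves the INT signal to stand in for the user's Ctrl-C:

```perl
use strict;

my $interrupted = 0;
eval {
    # The local handler turns SIGINT into an exception that
    # unwinds only this block.
    local $SIG{INT} = sub { die "interrupt\n" };
    kill INT => $$;       # stand-in for the user hitting Ctrl-C
    sleep 1;              # stand-in for the tight read/write loop
};
$interrupted = 1 if $@ eq "interrupt\n";
print "caught interrupt, time to clean up\n" if $interrupted;
```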

Lines 34-38: Child process

The child process is identical to the one in the client, except that we close our copy of the listening socket, since we won't be needing it.

Adding an MP3 Encoder

This pair of programs works great across a LAN or a fast Internet connection. On slower connections, however, the conversation breaks up periodically. The default sample rate for /dev/dsp is 8 kHz at one byte per sample, monaural, which means that the connection must support at least 8000 * 8 = 64,000 bits per second for one-way communication and 128,000 bits per second for duplex communication. This can only be achieved with a really good dual-channel ISDN connection, or a DSL, cable modem, or leased line. Making matters worse, the connection must sustain this speed even when nobody is talking, because a second of silence generates just as much data as a second of conversation. Slower connections simply cannot keep up.

There are several ways to reduce the bandwidth requirements. We could reduce the sampling rate and sacrifice the audio quality, but 8 kHz is already pretty low. We could apply a general compression algorithm, such as gzip, to the data stream. However, audio data is relatively uncompressible with text-oriented algorithms like gzip due to the noisy, rapidly-varying nature of the data. Or we could redesign the application entirely, using discontinuous UDP packets rather than a continuous TCP stream and adjusting the UDP transmission rate dynamically to meet bandwidth availability. This is how it's done in the voice-over-IP (VoIP) protocol, which is popular in commercial IP telephony applications.
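It's easy to verify the claim about general-purpose compressors. This sketch (my own) uses the core IO::Compress::Gzip module as a stand-in for command-line gzip, and random bytes as a stand-in for noisy audio:

```perl
use strict;
use IO::Compress::Gzip qw(gzip $GzipError);

# Repetitive text compresses extremely well...
my $text = "the quick brown fox " x 500;
gzip \$text => \my $gztext or die "gzip: $GzipError";

# ...but noisy, audio-like data barely shrinks at all.
my $noise = join '', map { chr int rand 256 } 1 .. 10_000;
gzip \$noise => \my $gznoise or die "gzip: $GzipError";

printf "text:  %d -> %d bytes\n", length $text,  length $gztext;
printf "noise: %d -> %d bytes\n", length $noise, length $gznoise;
```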

For fun, I tried a different approach. The popular MP3 audio compression format can achieve 10:1 or greater compression of audio streams, and there are a number of Unix command-line tools for compressing and decompressing MP3s. My favorites are mpg123 for decompressing and playing MP3s, and lame for creating them. In principle, one can simply interpose lame between the DSP and the socket to compress the data, and uncompress it at the other end with mpg123. Although this will not work for reasons described below, conceptually we would want to open two pipes:

  open COMPRESS,   "lame - - </dev/dsp |";
  open UNCOMPRESS, "| mpg123 -s >/dev/dsp";

The first pipe reads /dev/dsp and passes it to lame for MP3 encoding. The encoded data is written to standard output, where it is piped to our program. We would read from it like this:

  print $socket $data while sysread(COMPRESS, $data, BUFSIZE);

The second pipe tells mpg123 to read MP3-encoded data from our program, uncompress it into PCM data, and write it to /dev/dsp. We'd use it like this:

  print UNCOMPRESS $data while sysread($sock, $data, BUFSIZE);

However, when I tried this, I found a couple of hitches. One is that lame will not accept 8-bit PCM data, requiring 16-bit data sampled at 16 kHz or higher. This was solved by calling the proper ioctls to put the DSP into the proper mode. The other hitch is that some sound drivers do not allow you to open the /dev/dsp driver twice, so the straightforward strategy of opening two pipes would not work. I solved that one by opening /dev/dsp once read/write, and then reopening STDIN and STDOUT on it, so that lame and mpg123 wouldn't try to open /dev/dsp a second time.
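The save-and-restore dance around STDIN can be demonstrated without a sound card at all. In this sketch (mine; a temporary file stands in for /dev/dsp), any child forked while STDIN is redirected would inherit the redirection, which is exactly what the pipe to lame relies on:

```perl
use strict;
use File::Temp qw(tempfile);

# A scratch file stands in for /dev/dsp.
my ($fh, $filename) = tempfile(UNLINK => 1);
print $fh "pretend PCM data\n";
close $fh;

open my $saved_stdin, '<&', \*STDIN or die "dup: $!";     # save STDIN
open STDIN, '<', $filename          or die "reopen: $!";  # redirect it

my $line = <STDIN>;   # a child forked here (like lame) sees the redirection

open STDIN, '<&', $saved_stdin or die "restore: $!";      # put STDIN back
print "read through STDIN: $line";
```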

To simplify matters, I put all the DSP-manipulating code into a self-contained module named DSP, shown in Listing 3. The module is object-oriented. You create a new DSP handle by calling DSP->new; the object that is returned is a read/write IO handle attached to /dev/dsp. The compress() method returns a pipe that you can read from to retrieve MP3-compressed audio data. The uncompress() method returns a second pipe that you can write MP3 data to; the data will be uncompressed and sent to the speakers. The following example reads MP3-compressed data from the microphone and immediately writes it back to the speaker, creating an awful racket:

  my $dsp = DSP->new;
  my $compress = $dsp->compress;
  my $uncompress = $dsp->uncompress;
  print $uncompress $data while sysread($compress, $data, 1024);

Listing 3. The DSP module

Lines 1-5: Module setup stuff.

The module begins by bringing in the modules we need. We inherit from IO::Handle to get all the available object-oriented filehandle methods.

Lines 6-10: Load ioctl definitions

We need to call a set of ioctls to configure the sound card driver. The constants for these ioctls are defined in a .ph header file under sys/ in your Perl library directory. If you don't already have this file, you'll need to run h2ph on your system's include files, as described in the documentation for h2ph bundled with Perl.

Many ioctl constants are calculated using a hash called %sizeof, which contains the sizes of various system-specific data types, such as integers. Although this is not well documented, you have to create %sizeof before loading any .ph file that depends on it. A quick examination of the .ph file showed that it needs to know the size of an int. Rather than hard-code this in a nonportable way, we get the size of an int from Perl's Config module and use it to initialize a global hash named %sizeof before loading the .ph file.
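Here's the skeleton of that setup (the require line is commented out, since the exact .ph filename varies by system):

```perl
use strict;
use Config;

# Files generated by h2ph compute ioctl numbers from the global
# %sizeof hash, so it must exist before the .ph file is loaded.
our %sizeof;
$sizeof{int} = $Config{intsize};

printf "size of int on this platform: %d bytes\n", $sizeof{int};

# require "sys/<header>.ph";   # the generated header is loaded here
```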

Lines 11-15: Constants

We declare a constant for the path to the DSP device special file, along with constants containing the command-line invocations of the lame MP3 encoder and the mpg123 MP3 decoder. The options passed to lame specify PCM-format input, monaural sound, a maximum bitrate of 32 kbps, a sample rate of 16 kHz, and standard input and standard output as the input and output files. The only complication is the -x flag, which must be set on little-endian architectures so that lame will byteswap the incoming PCM data. We determine the endianness of our architecture on the fly by looking at $Config{byteorder}.
The flags to mpg123 specify monaural sound and tell the program to send uncompressed PCM data to standard output.
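The endianness test can be cross-checked against pack itself, which has both native and explicitly little-endian integer formats. A little sketch of the idea (my own, not part of the module):

```perl
use strict;
use Config;

# $Config{byteorder} is "1234" (or "12345678") on little-endian
# machines, "4321" (or "87654321") on big-endian ones.
my $little_endian = substr($Config{byteorder}, 0, 2) eq '12';

# Cross-check with pack: 'v' is always little-endian, 'S' is native,
# so they agree only on a little-endian machine.
my $native_is_little = unpack('S', pack('v', 1)) == 1;

print $little_endian
    ? "little-endian: lame needs the -x byteswap flag\n"
    : "big-endian: no byteswap needed\n";
```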

Lines 16-23: new() method

The new() method opens the DSP for reading and writing. After successfully opening the device special file, we call ioctl three times. The first call sets the sample size to 16 bits (2 bytes), the second sets the sampling rate to 16 kHz, and the third puts the DSP into monaural mode by setting the STEREO flag to false. Notice that the second argument to each ioctl call is a constant defined in the .ph header we loaded earlier, while the third is a packed integer.
We return the handle, blessed into the DSP class.
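The packed-integer convention is worth a closer look: ioctl wants its third argument to be a scalar holding the raw bytes of a C int, and the driver writes its answer back into the same scalar. A sketch of the idea (the ioctl call itself is commented out, since it needs a real DSP handle, and $SET_RATE_CONSTANT is a stand-in name for the constant from the .ph header):

```perl
use strict;

# ioctl expects its third argument to hold the raw bytes of a
# C int; pack 'i' builds exactly that.
my $rate = pack 'i', 16_000;

# With a real DSP handle the call would look like:
#   ioctl($dsp, $SET_RATE_CONSTANT, $rate) or die "ioctl: $!";
# The driver then writes the rate it actually granted back into
# $rate, which unpack recovers:
my $granted = unpack 'i', $rate;
print "rate in the buffer: $granted Hz\n";
```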

Lines 25-32: compress() method

The compress() method will return a pipe suitable for reading MP3-compressed data. The trick here is to replace STDIN with the DSP handle, so that lame takes its input from /dev/dsp, reading directly from the microphone. We save the current value of STDIN in a temporary filehandle, and then reopen STDIN onto the DSP filehandle, using the object-oriented fdopen() method. We now call IO::File->new() to open the lame pipe. We restore STDIN, and return the pipe to the caller.

Lines 35-44: uncompress() method

The uncompress() method will return a pipe suitable for writing MP3-compressed data. It uses the same trick as compress(), except that now we operate on STDOUT rather than STDIN. After saving the current value of STDOUT in a temporary filehandle, we reopen it onto the DSP, and call IO::File->new() to open the mpg123 pipe. We don't want data getting held up in standard IO buffering, so we activate the pipe's autoflush property. After restoring STDOUT, we return the pipe to the caller.

Listing 4.

Listing 5.

Listings 4 and 5 show versions of the earlier simple scripts, modified to transmit MP3-encoded audio streams. Because they're nearly identical to the originals, we don't need to walk through them again. There are two important changes:

1. Instead of opening /dev/dsp directly, the scripts load the DSP module and create a new DSP object.

2. Instead of writing or reading to /dev/dsp directly, the scripts invoke the DSP object's compress() and uncompress() methods to retrieve filehandles for reading and writing.

Otherwise, the scripts are identical to the simple versions.

Unfortunately, when I tested these scripts on my LAN, I discovered an annoying one-to-two second delay between speaking into the microphone at one end and hearing the sound at the other. Although the MP3 encoder and decoder can keep up with the data, there seems to be some latency. Perhaps the encoder needs to accumulate a certain amount of buffered sound data before it will encode it, or perhaps I just need faster computers. The main testing platform was a pair of 300 MHz Pentium II desktops. In any case, I couldn't eliminate this delay by playing with the script's transmission buffer, and the delay is long enough to make a normal conversation very difficult.

Nevertheless, the program does allow voice transmissions in real time across slow network connections. Although the nominal MP3 bitstream rate is 32 kilobits per second, in practice less bandwidth is required because of the long pauses in speech. I expect it to work between two moderately well-connected sites on the Internet, and perhaps even between hosts connected by 56K modems. Although it doesn't make much of an IP telephone, you could use the system as an intercom, a baby monitor, or just to surprise your work associates by suddenly speaking to them from an idle computer!


With a little ingenuity and a dab of Perl, you can turn your desktop into an IP telephony platform. Getting IP telephony to work across fast network connections is easy, but working with limited bandwidth is more of a challenge.

These scripts are the mere beginnings of a real application. They need all the bells and whistles, such as a built-in telephone directory, an auto-dial function, a way to screen calls, and cute little sound effects for the dialtone, busy signal and ringing. Feel free to take the source and start doing a little bit of your own convergence!

__END__

Lincoln D. Stein wrote