Editorial: Data Hiding - The Perl Journal, Autumn 1996

Jon Orwant

Publishers haven't yet decided whether the Internet is friend or foe. Some of them see it merely as a new medium for hawking their wares. They get web sites. Others predict that it will radically transform - or doom - their industry as content inevitably migrates online. They get indigestion.

Sometimes people ask me why TPJ is printed on paper; after all, most Perl programmers have ready access to the net, and are comfortable navigating the web, so why not have an online (and cheaper, if not free) TPJ? One reason is that, alas, ink on crushed trees is much better suited to reading than any computer screen. The resolution is higher, it's more portable, and the interface is intuitive. And no one wants to lug a laptop into the bathroom.

The MIT Media Laboratory, where I work in the Electronic Publishing Group, is building an electronic book that mimics the look and feel of traditional books. You download new texts (hey, it's only software) into RAM embedded in the book's spine. The pages are paper, but have no ink. Instead, they have arrays of tiny pixels: spheres of indium tin oxide, each 40 microns in diameter, whose opacity changes when they rotate in response to an electric impulse. The book doesn't exist yet, but a printer which uses this paper does, and I had the pleasure of being the first to adapt a book to it: a Choose Your Own Adventure interactive novel, in which the reader makes a decision at the bottom of each page and feeds the page back into the printer to see the outcome.

But I digress. The other fear of publishers is copying. Uh oh, they think, here comes the big bad Internet, where ne'er-do-wells will liberate that Top Ten List you were about to sell to Bantam and let it loose on the Internet. It's not surprising that one of the questions we hear most is: How can publishers and authors be compensated for their work when it's so easy to copy information? Slapping a copyright symbol on documents is easy. But copyright law provides for a corrective, not preventative, remedy: you must first detect that a violation has occurred before you can recover damages in court.

Publishers could try to prevent copyright violations from occurring in the first place - when distributing an online book or magazine, they could package it "inside" its own software, so that only those who have licensed the text can read it, and even then they can see only one page at a time. That makes copying a book harder, but not impossible; it will never be completely impossible as long as people still use their eyes to read - because if you can see it, a camera/computer can too.

We'll probably start to see such digital books soon. Such a technique is appropriate for some uses, but not most. It's awkward, it's expensive, and it won't thwart the truly dedicated. Some less intrusive techniques can be used to help computers identify when a document has been copied from another source by embedding a statement of ownership in the text. The study of hiding clandestine messages in plain sight has a suitably mysterious name: steganography.

Steganography research at the Media Lab focuses on images, video, and audio more than text. Images let you hide messages in high frequency areas, like fields of grass or heads of hair. The redundancy in video can be exploited to bury fleeting messages in imperceptible noise. In speech you can add a barely perceptible echo to certain segments, or modulate the signal and embed your message in the least significant bits. The techniques have a common premise: some bits are less important than others, so we can jiggle them in ways that encode our message. The more attributes we can vary, the more bits we can encode, and the longer our message can be. That's the problem with text: While you can mutate a few pixels in an image without harm, you can't do the same to a book, because the individual elements (letters instead of pixels) can't be changed at whim.
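The least-significant-bit idea can be made concrete with a few lines of code. What follows is only a minimal sketch, assuming the carrier is a list of 8-bit sample values (audio samples or pixel intensities); the function names `embed_lsb` and `extract_lsb` are hypothetical and are not taken from any Media Lab software.

```python
def embed_lsb(samples, message):
    """Hide the bytes of `message` in the lowest bit of successive samples."""
    # Flatten the message into bits, most significant bit of each byte first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(samples):
        raise ValueError("carrier too short for message")
    stego = list(samples)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        stego[i] = (stego[i] & ~1) | bit
    return stego


def extract_lsb(samples, n_bytes):
    """Read `n_bytes` back out of the lowest bit of each sample."""
    out = bytearray()
    for i in range(n_bytes):
        byte = 0
        for sample in samples[i * 8:(i + 1) * 8]:
            byte = (byte << 1) | (sample & 1)
        out.append(byte)
    return bytes(out)


# Hypothetical carrier: 64 made-up 8-bit sample values.
carrier = [(i * 37) % 256 for i in range(64)]
stego = embed_lsb(carrier, b"TPJ")
recovered = extract_lsb(stego, 3)
```

No sample changes by more than one unit, which is why the alteration is hard to perceive. The cost is capacity: one hidden bit per sample.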

One of the most effective text safeguards relies on economics instead of technology. Consider paperbacks: No one's stopping you from photocopying your own. It's just not worth your time. Inconvenience makes a great license manager.

What makes paperbacks hard to copy is that the text is effectively an image, not ASCII, and copying an image is expensive and tedious compared to copying eight-bit bytes. Can the flexibility of text be merged with the message-hiding properties of images?

A crude method is demonstrated in this paragraph. You'd have a hard time photocopying this paragraph without retaining the copyright message, but the text itself is rather unattractive because the message isn't hidden too well.

A similar but subtler scheme: ink the copyright message faintly into the background as a makeshift watermark.

Here's another bad scheme, but an intellectually interesting one. The copyright message is encoded as the pattern of italicized words: an italicized word is a 1, and all other words are 0. This entire paragraph merely spells out PERLJOURNAL, in ASCII, which isn't a particularly long message - and it impairs the readability somewhat, as you can tell.
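A rough sketch of the italics scheme, under a few assumptions that are mine rather than the article's: italics are marked with surrounding underscores, each character is encoded as seven bits (most significant first), and words past the end of the message are left plain. The names `to_bits`, `encode_italics`, and `decode_italics` are hypothetical.

```python
def to_bits(message, width=7):
    """Turn each character into `width` bits, most significant bit first."""
    return [(ord(ch) >> shift) & 1
            for ch in message
            for shift in range(width - 1, -1, -1)]


def encode_italics(words, message):
    """Mark a word as italic (here: _word_) wherever the message bit is 1."""
    bits = to_bits(message)
    if len(bits) > len(words):
        raise ValueError("cover text has too few words for the message")
    marked = []
    for i, word in enumerate(words):
        bit = bits[i] if i < len(bits) else 0
        marked.append(f"_{word}_" if bit else word)
    return marked


def decode_italics(marked, n_chars, width=7):
    """Recover `n_chars` characters from the italic/plain pattern."""
    bits = [1 if len(w) > 2 and w.startswith("_") and w.endswith("_") else 0
            for w in marked]
    chars = []
    for i in range(n_chars):
        value = 0
        for bit in bits[i * width:(i + 1) * width]:
            value = (value << 1) | bit
        chars.append(chr(value))
    return "".join(chars)


# Hypothetical cover text: 80 placeholder words, enough for 11 x 7 = 77 bits.
cover = ("word " * 80).split()
marked = encode_italics(cover, "PERLJOURNAL")
decoded = decode_italics(marked, 11)
```

At one bit per word, an eleven-character message already consumes 77 words, which is why the scheme carries so little.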

Here's a message that you can't discern - but if it were ASCII, your computer (or your Web browser) could. Each line in this paragraph has some spaces at the end. The precise number of spaces varies from line to line, and those choices are used to encode the message.
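The trailing-space trick might look something like this. It is only a sketch: the choice of zero to three spaces per line (two bits), four lines per ASCII character, and the function names are all assumptions made for illustration, not a description of how this paragraph was actually prepared.

```python
def hide_in_trailing_spaces(lines, message):
    """Append 0-3 spaces to each line; each line carries two bits."""
    digits = []
    for byte in message.encode("ascii"):
        # Split each byte into four 2-bit digits, high bits first.
        for shift in (6, 4, 2, 0):
            digits.append((byte >> shift) & 0b11)
    if len(digits) > len(lines):
        raise ValueError("not enough lines to carry the message")
    hidden = []
    for i, line in enumerate(lines):
        pad = digits[i] if i < len(digits) else 0
        hidden.append(line.rstrip(" ") + " " * pad)
    return hidden


def recover_from_trailing_spaces(lines, n_chars):
    """Count trailing spaces on each line and reassemble the bytes."""
    digits = [len(line) - len(line.rstrip(" ")) for line in lines]
    out = bytearray()
    for i in range(n_chars):
        byte = 0
        for digit in digits[i * 4:(i + 1) * 4]:
            byte = (byte << 2) | digit
        out.append(byte)
    return out.decode("ascii")


# Hypothetical cover text: twelve lines carry three characters.
lines = ["line %d of the cover text" % i for i in range(12)]
hidden = hide_in_trailing_spaces(lines, "TPJ")
message = recover_from_trailing_spaces(hidden, 3)
```

The message survives copy-and-paste only as long as nothing along the way strips trailing whitespace, which many editors and mail programs do.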

The most common English letter is e. One could have some fun with the e's in a document, varying their widths so that choices between elongation and diminution encode the bits that conceal a secret message. The shorter the encoded message, and the larger the text to hide it in, the less you need to resort to extreme sizes. It doesn't have to be an e, of course: any symbol will do, so long as it appears frequently.

(Assuming many such symbols in your paragraphs, you won't usually vary widths much - but cf. this paragraph, and a book, "Gadsby." Both lack that most common symbol.)

One could, in theory, encode the entire Library of Congress in a single period, by choosing its diameter very precisely. If you could choose between 128 different sizes, you could store a single seven-bit character. With 10^{15,000,000} choices, you could store a book. And 10^{5,000,000,000,000} choices would be needed for the ten terabytes contained in the Library of Congress. Of course, such resolution is beyond the capability of any physical device - for comparison, there are about 10^{70} atoms in the universe and 10^{120} possible chess games.
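The arithmetic behind these figures is just a logarithm: a symbol that can take N distinguishable sizes carries log2(N) bits. A small sketch, with hypothetical helper names, shows the conversion in both directions.

```python
import math


def bits_from_choices(choices):
    """Bits carried by one symbol that can take `choices` distinct sizes."""
    return math.log2(choices)


def bits_from_power_of_ten(exponent):
    """Bits carried when the number of choices is 10 to the power `exponent`."""
    return exponent * math.log2(10)


def power_of_ten_for_bits(bits):
    """Exponent e such that 10 to the power e choices carry `bits` bits."""
    return bits * math.log10(2)


one_char_bits = bits_from_choices(128)           # one seven-bit character
book_bits = bits_from_power_of_ten(15_000_000)   # the book figure above
book_chars = int(book_bits // 7)                 # seven-bit characters it holds
```

By this reckoning the book figure works out to roughly fifty million bits, or about seven million seven-bit characters. Every additional bit doubles the number of sizes the period must distinguish, which is why the exponents grow so quickly.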

Now, spaces are about as common as e's and more common than periods, and allow for more variation in size. I've written a simple program that processes PostScript documents and hides messages in the text by changing the width of the spaces between the words. It varies the spacing so that it's no longer uniform within each line, yet retains the justification so that the text remains aligned with both margins.

The program, called CopySight, was used to process that last paragraph (did you notice?) as well as this one, and the next, hiding a message in each. In the previous paragraph, spaces were adjusted only a little; in this one a bit more, and in the next sentence most of all.

See what I'm talking about?
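The article does not say how CopySight chooses its space widths, so the following is not its algorithm, merely one way the idea could work. The sketch assumes a justified line whose gaps all start at the same width, pairs up neighboring gaps, and moves a small amount `delta` from one gap of the pair to the other according to the message bit. Because each pair's total is unchanged, the line still reaches both margins. The function names and the unit (points) are assumptions.

```python
def encode_gaps(gap_widths, bits, delta=0.5):
    """Shift width between neighboring gaps; the line's total stays the same."""
    gaps = list(gap_widths)
    if len(bits) > len(gaps) // 2:
        raise ValueError("line has too few gaps for these bits")
    for i, bit in enumerate(bits):
        shift = delta if bit else -delta
        gaps[2 * i] += shift       # bit 1 widens the first gap of the pair
        gaps[2 * i + 1] -= shift   # and narrows the second by the same amount
    return gaps


def decode_gaps(gaps, n_bits):
    """A pair whose first gap is wider than its second encodes a 1."""
    return [1 if gaps[2 * i] > gaps[2 * i + 1] else 0 for i in range(n_bits)]


# Hypothetical line: eight gaps, each 4 points wide before encoding.
uniform = [4.0] * 8
bits = [1, 0, 1, 1]
encoded = encode_gaps(uniform, bits)
decoded = decode_gaps(encoded, len(bits))
```

A smaller `delta` is harder for the eye to notice but easier for a photocopier or scanner to smear away.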
