PREVIOUS  TABLE OF CONTENTS  NEXT 

Integrating Perl into Microsoft Office Innards

Tim Meadowcroft

Commercial Product
Perl Development Kit ($95)........................................www.activestate.com

Introduction

Working in the Microsoft world of Windows, Office, Visual Basic, and loosely structured user documents, there are a few problems that regularly appear where you'd like to use Perl to tweak a piece of data or perform some task, but can't, because you're deep in the Microsoft way of doing things. Normally this level of integration is hard to achieve, but this article describes how you can grab just that little bit of Perl when you want it using tools from ActiveState's Perl Development Kit.

Microsoft tools tend to integrate well with other Microsoft tools, and not at all with anything else. And when it comes to writing code, you're often forced to use VB or a derivative, which is not an ideal tool for manipulating text. But invoking your favorite bit of Perl, simply and transparently, deep within the internals of Excel, Word, VB and the like is now much easier, thanks to ActiveState's PerlCOM and PerlCtrl, parts of their Perl Development Kit.

This article is describes a scenario in which a complete Perl solution (using an external Perl script to automate Office) wasn't appropriate, but a hybrid Perl/non-Perl approach was, and how PerlCOM made that hybrid solution almost trivially simple.

Warning: There is VB code in the following example. It's the subject of the article, so I can't really skip it. You shouldn't need to know VB in order to read the examples; they're fairly simple, and as long as you get the gist of what the VB code does, you'll be fine.

Background

More frequently than we'd prefer, UK telephone numbers are subject to wide-reaching changes combined with promises that "this will be the last time." The most recent of these changes is the overnight change to London and five other city codes, covering over a quarter of the UK population, and all mobile and pager numbers. (See http://www.numberchange.org/ for the gritty details.) For the sake of this article, I'll limit the discussion to just the London numbers, for which the seven-digit local codes became eight-digit codes overnight:

  0171 xxx xxxx (Central London) becomes 020 7xxx xxxx

  0181 xxx xxxx (Outer London) becomes 020 8xxx xxxx

In any large company, there are plenty of data sources that need updating, and it's the semi-structured and unstructured data sources that are hardest to change.

Databases with proper telephone number fields, where area codes and numbers are properly separated, are easy to change with one-time scripts. The trouble arises in the messy areas: text files, word processing documents, and all those ubiquitous spreadsheets (I work for investment banks; I've seen people write essays in Microsoft Excel because "it's the only program I understand"). If you just blindly modify every cell, you risk converting data that shouldn't really be touched.

The Problem

We have quite a few spreadsheets (such as telephone books) with contact information scattered throughout. Many of these had old-style (seven-digit) Central London telephone numbers, or were loosely formatted:

Tim Meadowcroft220 8537(0171 implied)
Reception8000(0171 220 implied)
Tim's Fax220-8375/6/7(0171 implied, multiple numbers)
Tim's Department800 0001(not a phone number, but a department code)

This is easy to convert with a cute application of a Perl regular expression or two. The usual approach is to automate Excel with Win32::OLE,(OLE Automation is a system where Microsoft exposes all the innards of an application so that programs able to talk to OLE objects can operate the application. This is very powerful for the right type of problem, and the Win32::OLE module makes this possible. Effectively, your program drives the application, rather than the application invoking your program.)but then my standalone script isn't interacting with the user, and I don't want to just blindly modify all cells, which would cause a problem with the last example above. I could automate Excel and make it ask the user before each modification, but that would be tedious. What I'd really like to do is let the user control the process from within Excel, let them select the cells they'd like to modify, and then invoke my conversion script on all those cells, whether it's one or a thousand.

Microsoft provides a built-in programming environment for this, using VBA (Visual Basic for Applications) as an embedded scripting language based on VB (it used to be a subset, but now it has extra features and is best considered a different, but related, product). VBA, being embedded in Office tools, is great for putting a "fix the selected cells" button on the Excel Toolbar, but not so great for actually coding the change, as anyone who's tried to write text manipulation code in VB will tell you.(The pedantic might note that the later versions of VBScript have a RegExp object, which should be callable from VBA, but after a couple of hours downloading and registering objects I still couldn't call it from Excel. And there are plenty of other problems where Perl is going to be "just right" anyway.)

The basic VBA routine is as simple as shown below. This is written within Excel, and so can be easily attached to a new button on an Excel toolbar. It loops over all the cells that are currently selected, and calls a FixPhone() function with the contents of the cell. FixPhone() returns a empty string if it can't convert the string, or the new value if it can. We then either replace the cell contents, or highlight it to signal an error to the user. Now we just need to write FixPhone().


  Sub FixPhoneOnSelection()
    On Error Resume Next
    If TypeOf Selection Is Range Then
        'we don't try to fix a chart, ... just cells
        Dim c As Range
        For Each c In Selection.Cells
            If Not IsEmpty(c.Value) Then
                dim t as string
                t = FixPhone(c.Text)
                If t <> "" Then
                    ' replace cell with new value - remove coloring
                    c.Value = t
                    c.Interior.ColorIndex = xlNone
                Else
                    ' leave cell alone, but color it to show error
                    c.Interior.Color = RGB(255, 255, 0)
                End If
            End If
        Next
    End If
  End Sub

The Solution

Perl programmers are probably thinking that the ideal solution would be to spawn a Perl script and capture the output, using backticks or a similar mechanism, but unfortunately VB isn't very good at things like spawning a process and capturing and parsing the STDOUT stream. It can call DLLs, but stumbles if the call is complicated, and mapping parameter and return types can be complex and error-prone. But one thing it can do very well is make and use COM or OLE objects.

OLE was the original name for a compound document technology Microsoft wrote. As OLE evolved, COM was written as a Component Object Model technology that OLE could be based upon. At this stage, the naming got a bit complex as Microsoft started referring to "OLE objects", "COM objects", "OLE technologies", and then also introduced the "ActiveX" name to mean the whole OLE/COM arena as it affected component developers. Nowadays it is easiest, and approximately correct, to consider all the names as interchangeable, with COM being the core name.

PerlCOM, part of the ActiveState Perl Development Kit, exposes a generic Perl interpreter as a COM object -- that is, an "object" as defined by most OO definitions (a combined chunk of data and functionality with a callable interface) -- that can be used by any environment that "talks COM". Functions defined in a Perl script, when given to the PerlCOM object, appear as methods on that PerlCOM object. They can be simply called from VB with parameters properly mapped and return values properly returned.

Given a FixPhone() function, the VB to make and invoke the relevant Perl is shown here:
  sub FixPhone( byval text as string ) as string
    'Declare an object variable and make an instance of PerlCOM
    Dim perlObj as Object
    Set perlObj = CreateObject("PerlCOM.Script")

    'OK then, check it worked
    If perlObj Is Nothing Then
        MsgBox ("Sorry, No PerlCOM on this machine")
        exit
    End If

    'Create Perl routine
    Call perlObj.EvalScript("sub FixPhone { return '020 7'.$_[0]; }")

    'Invoke my perl subroutine
    Dim t as String
    t = perlObj.FixPhone(text)
    FixPhone = t
  end sub

Notice that for the sake of simplicity, we make a PerlCOM object, give it a small inline script to parse, and then we can invoke the Perl function we want as a new method on the PerlCOM object.

From here, writing a bit of Perl to fix up the numbers is fairly simple. The neat thing is the tidy way that I can invoke a Perl subroutine as if it were VB, by creating new methods on a COM object.

Of course, in the real code I don't want to make a new Perl interpreter and then destroy it for every cell, so I hide the "create the Perl object and initialize the code" in a separate routine that creates the object on first invocation and caches it thereafter:


  function GetMyPerlObj as Object
    'Declare an object variable and make it on first call only
    static perlObj as Object
    if perlObj is nothing then
      Set perlObj = CreateObject("PerlCOM.Script")

      'OK then, check it worked
      If perlObj Is Nothing Then
        MsgBox ("Sorry, no PerlCOM on this machine")
        exit
      End If

      'Make a nice Perl routine 
      Call perlObj.EvalScript("sub FixPhone { return '020 7'.$_[0]; }")
    end if

    ' this is how you return a value in VB...
    set GetMyPerlObj = perlObj
  end function
 
and then, in the original routine, where I had
  t = FixPhone(c.Text)
I can now simply call
  t = GetMyPerlObj.FixPhone( c.text )

Those with a purist OO background may raise complaints here that a COM object is effectively changing its interface by having new methods made suddenly available. Well, that's how it's done in the COM world. It looks a little unusual, but anyone who's used to AUTOLOAD methods in Perl shouldn't have a problem reading the code.

To do the same from Microsoft Word (or Powerpoint, or Publisher, or Visio, or any other VBA enabled product) you simply write a new outer routine that grabs and modifies the appropriate selected item.

An example for Word is shown below (this assumes the same GetMyPerlObj function has been included).


  Sub FixPhoneOnWordSelection()
    If Selection.Text <> "" Then
      Dim t As String
      t = GetMyPerlObj.FixPhone(Selection.Text)
      If t <> "" Then
          ' replace selection with new value
          Selection.Text = t
      Else
          ' couldn't convert - pop up a message
          MsgBox ("Couldn't find a suitable phone number to convert")
      End If
    End If
  End Sub

This is a much simpler routine than the Excel example, as there's only one selection to fix, and no cells to enumerate. Accordingly, I've decided to report errors directly to the user, whereas in the Excel code I changed the coloring of cells to indicate errors. (This is important: if the user selects the wrong column for changes, you don't want him or her to have to dismiss a thousand dialog boxes, one for each cell.)

And for those of you who are really interested, the Perl code I actually used to fix up our phone numbers (as opposed to the trivial version above) looked like this. It's not very complex and it doesn't solve everything, but it saved us a lot of time converting a lot of numbers.

 # Tests:
 #  020 xxxx xxxx : fine as is
 #  xxx xxxx      : assume 020 7xxx xxxx
 #  2xxx          : Building 1 extension, assume 020 7457 2xxx
 #  8xxx          : Building 2 extension, assume 020 7220 8xxx
 #  0171 xxx xxxx : convert to 020 7xxx xxxx
 #  0181 xxx xxxx : convert to 020 8xxx xxxx
 # Anything else is an error and should be hand converted....
 #
 sub FixPhone {
     local $_ = shift;
     return $_ if /^020 \d{4} \d{4}$/;
     return $_ if s/^\s*(\d{3})[-\s]+(\d{4})\s*$/020 7$1 $2/;
     return $_ if s/^\s*(\d{3})[-\s]+(\d{4})[-\s]+(\d{4})\s*$/$1 $2 $3/;
     return $_ if s/^\s*(2\d{3})\s*$/020 7457 $1/;
     return $_ if s/^\s*(8\d{3})\s*$/020 7220 $1/;
     return $_ if s/^\s*0171[-\s]+(\d{3})[-\s]+(\d{4})\s*$/020 7$1 $2/;
     return $_ if s/^\s*0181[-\s]+(\d{3})[-\s]+(\d{4})\s*$/020 8$1 $2/;
     return '';
 }

But Wait, There's More

PerlCOM exposes much more than what I've shown here. It provides simple mechanisms for loading modules and calling functions in a scalar or array context. It can make and return instances of a Perl object, and even makes the raw Hash datatype available to poor VB programmers who've suffered with the atrocious Collection class for so long:

  dim hash as Object
  set hash = perlObj.CreateHash   'imagine this as $hash = { };
  hash.name = "Tim"              '$hash{name} = "Tim"
  hash.surname = "Meadowcroft"

As wordy as this is, it's nothing compared to a VB Collection object!

For the examples I've shown so far, I needed a licensed PDK on the machine where I'm running the scripts. But if I wanted to make a VB/Perl hybrid that I could distribute to others, what would I do?

The PDK also includes PerlCtrl, which makes ActiveX components from a script. Whereas PerlCOM exposes a generic Perl interpreter, PerlCtrl wraps up a selected script into an ActiveX DLL(There's that jargon again. An ActiveX DLL is a Dynamically Linked Library containing one or more COM objects. It registers itself under Windows so that when someone asks for an instance of the appropriate object, Windows knows how to load the DLL and create the object automatically without the caller needing to know where the object came from.) that can be distributed freely and invoked in a manner similar to the above example (but without the EvalScript() call).

I won't go into much detail about PerlCtrl. When you've written some useful functions (say, invoking WWW::Search on the selected words in a document) that you want to use from inside some visually pretty (but feature poor) VB GUI, PerlCtrl comes in handy. Much like PerlApp compiles a Perl script into a stand-alone EXE file, PerlCtrl compiles Perl code into DLL components.

And PerlApp is thrown in with the PDK too. Unix hackers may well wonder what all the fuss is about, but for those of us locked behind Windows it's a breath of fresh air.

Conclusion

If you start working with the innards of Microsoft products, you're going to need some help. In the "integrate with anything as long as it's Microsoft" world of Win32 programming, which a lot of us unfortunately inhabit, the COM integration that the PDK offers is well worth the $95.

_ _END_ _


PREVIOUS  TABLE OF CONTENTS  NEXT