Forced to use proprietary file formats? Let open source ease the burden.
Let's begin with a story. Here's what happened: my second book, coauthored with Dr Michael Moorhouse, finally was finished. I had spent an extra six months on it, which meant it now was at least six months late. I had spent every spare minute typesetting, proofreading, writing, manually converting Michael's Microsoft Word files to LaTeX, reading and then re-reading. Then, I'd proofread it all again. When it was done and dusted, I was jaded. Soon after, I received the final proof of the cover. And there it was—printed right on the back cover—a promise to provide Microsoft PowerPoint slides on the Web site for use with the text. It was too late to change the cover, which meant I was committed to providing the slides one way or another. I had forgotten that we had decided to do this at the start of the project, more than 18 months prior.
Eighteen months ago, PowerPoint was the de facto standard slide production technology within the academic community. Today, PDF is popular too. As with many in the Linux community, I already had made the move to OpenOffice.org, leaving PowerPoint behind. With 20 chapters in the book, I estimated it would take at least 20 days' effort to produce the slides manually. The thought of doing this work with PowerPoint was not something I relished. I could work within OpenOffice.org Impress, of course, and then export to PowerPoint when finished, but this idea didn't sit well with me, either. The basic problem was I knew all the content already was in the LaTeX files and having to reproduce it using a slide production application left me feeling even more drained than I already was. If only I could find a way to extract the content programmatically from my LaTeX files and populate PowerPoint slides with it—that would improve things considerably.
Searching Google resulted in frustration. Perhaps not surprisingly, details of the PowerPoint file format were hard to come by. I did find a file in Microsoft Windows Help format that described the XML standard for Microsoft Office documents, to which PowerPoint documents can be exported. Unfortunately, it was a large, complicated piece of writing. Having decided I wasn't going to get anywhere on Google, I surfed over to Comprehensive Perl Archive Network (CPAN). Perl, my programming language of choice, has been hooked up to all types of file formats and other computing forms. If anyone had played with Perl and PowerPoint, details of the work would be available on CPAN. Unfortunately, this search also drew a blank.
Then it occurred to me: if I could work with the open and widely published OpenOffice.org Impress document format, I then could export my Impress slides to PowerPoint as a last step. A quick perusal of the OpenOffice.org Web site uncovered the official XML description of the OpenOffice.org file formats. Weighing in at more than 600 pages, the standard is bigger than my book!
The XML document is well written, but it's pretty heavy going. I surfed back to CPAN to see if any other programmers had taken the time to work with OpenOffice.org formats and were gracious enough to upload their work to CPAN. This time I wasn't disappointed. Jean-Marie Gouarne of Genicorp recently had released the OpenOffice::OODoc module, a Perl interface to the OpenOffice.org formats. Given an existing document, OpenOffice::OODoc can manipulate the content, adding to, deleting from and updating the disk file as need be.
I started with a simple filter, written in Perl, that takes a LaTeX file as input and produces the slide content as output in a customized textual form. By producing a text file, I ensured that any text editor could be used to edit the output from the filter, fine-tuning the textual content as necessary. Once happy with the textual content, another filter, also written in Perl, uses the textual content to create an Impress presentation. The Impress presentation then can be opened in Impress and exported to PowerPoint and/or PDF format.
I made a conscious effort to keep my presentations as simple as possible and decided to have only three slide types. The title_slide would contain the title of the chapter at the start of the presentation file. Within the presentation, the title_slide would do double duty as a placeholder for any graphic images associated with the chapter, with one title_slide created per graphic image. The bullet_slide would contain section titles as its slide heading and subsection titles as bullet items. Finally, the sourcecode_slide would provide a mono-spaced, verbatim slide used for program listings.
I used Impress to create a three-slide presentation manually, which I called blank.sxi. Each of the created slides corresponded to each of the three slide types described in the last paragraph. I planned to clone this presentation every time I programmatically created a presentation for each of my chapters. By cloning, I'd ensure that all of the presentations conformed to a standardized look and feel.
The getcontent script is the type of script that Perl programmers typically create, use and then throw away. (See the on-line Resources for downloading the files referred to in this article.) It loops on standard input, reading one line at a time, and attempts to pattern-match on content of interest. If a match occurs, appropriate output is produced. As an example of what getcontent does, here's the code for dealing with the chapter title from the LaTeX file:
if ( /\\chapter\{(.*)\}/ ) { print "CHAPTERTITLE: $1\n"; next; }
A simple regular expression attempts to match on the LaTeX chapter macro; if a match is found, the chapter title is extracted and output is generated. The call to next short-circuits the loop, allowing the next line to be read in from standard input when a match is found. In this way, the following LaTeX snippet:
\chapter{Working with Regular Expressions}
That is, the LaTeX markup is removed and replaced with a much simpler markup. The section and subsection LaTeX macros were treated in a similar way. Here's the code:
if ( /\\section\{(.*)\}/ ) { print "BULLETTITLE: $1\n"; next; } if ( /\\subsection\{(.*)\}/ ) { print "BULLETCONTENT: $1\n"; next; }
Working with source code listings is only slightly more complex, due to the requirement to spot when a chunk of verbatim text has been entered and exited. Here's the code that handles entry into a LaTeX verbatim block:
if ( /\\begin\{verbatim\}/ ) { print "STARTCODE\n"; $in_verbatim = TRUE; next; }
And, here's the code used to handle the exit from a verbatim block:
if ( $in_verbatim ) { if ( /\\end\{verbatim\}/ ) { print "STOPCODE\n"; $in_verbatim = FALSE; } else { print; } next; }
A simple boolean, the $in_verbatim scalar, helps to determine whether the script currently is working within a verbatim block. Similar code extracts the maxims that appear throughout the book's chapters, and a few if blocks handle the graphics, their captions and other content of interest. For example, consider the following chunk of LaTeX markup:
\chapter{The Basics} \textit{Getting started with Perl.} \section{Let's Get Started!} There is no substitute for practical experience when first learning how to program. So, here is the first Perl program \index{welcome@\texttt{welcome}, and the first program, called \texttt{welcome}: \begin{verbatim} print "Welcome to the World of Perl!\n"; \end{verbatim} \noindent When executed by \texttt{perl} \footnote{We will learn how to do this is in just a moment.}, this small program displays the following, perhaps rather not unexpected, message on screen: \begin{verbatim} Welcome to the World of Perl! \end{verbatim}
The getcontent script transforms the above LaTeX into this textual content:
CHAPTERTITLE: The Basics CHAPTERCONTENT: Getting started with Perl. BULLETTITLE: Let's Get Started! STARTCODE print "Welcome to the World of Perl!\n"; STOPCODE STARTCODE Welcome to the World of Perl! STOPCODE
Notice how all of the LaTeX markup is gone, replaced by a simpler markup language that will be used to produce slides programmatically. Assuming the LaTeX chunk was in a file called chapter3.tex, the getcontent script is executed as follows, piping the result of the transformations into an appropriately named file:
perl getcontent chapter3.tex > chapter3.input
The chapter3.input file now contains the textual content, and it can be fine-tuned with any text editor prior to producing the slides.
Producing the slides within an Impress document was complicated by a number of factors. For starters, the OpenOffice::OODoc module cannot be used to create a new OpenOffice.org file; it can manipulate existing files only. Additionally, the module was created with a view to working primarily with OpenOffice.org Writer files—word processor documents—not Impress presentations. By way of example, here's a short program, called appendpara, that adds some text to an already existing Writer document:
#! /usr/bin/perl -w use strict; use OpenOffice::OODoc; my $document = ooDocument( file => 'blank.sxw' ); $document->appendParagraph ( text => 'Some new text', style => 'Text body' ); $document->save;
This small program uses the OpenOffice::OODoc module and creates a document object from the existing Writer file. The program then invokes the appendParagraph method to add some text before invoking the save method to commit the changed document to disk.
In addition to the appendParagraph method, the OpenOffice::OODoc module provides the insertElement method, which allows a new page of a specified type to be added to a document. The page can be a clone of an existing page or it can be actual, raw XML.
After reading as far as page 6 of the 600+ page OpenOffice.org XML file format document, I discovered that Impress used the //draw:page XML type to represent a slide within a presentation. Unfortunately, the OpenOffice::OODoc module could not work directly with objects of this type, so I had to come up with some other mechanism to manipulate the data. Specifically, I wanted to take the blank template slides contained in the blank.sxi document and clone each slide as I needed it, populating the slide's content with the textual content produced by the getcontent script. To do so, I needed to learn more about the Impress XML format.
I had two choices: continue to read the 600+ page standard document or take a look at an actual file to see if I could learn enough to get the job done. I chose the latter. Recalling from a previous Linux Journal article that OpenOffice.org compacts its multipart file using the popular ZIP algorithm, I created a temporary directory and unzipped the blank.sxi file:
mkdir unzipped cd unzipped unzip ../blank.sxi
This produced a bunch of files and directories:
content.xml META-INF meta.xml mimetype settings.xml styles.xml
Of most interest is the content.xml file, which contains the actual content that makes up the document. Viewing this onscreen or within an editor produced a mass of hard-to-decipher XML. In order to keep the parts as small as possible, no attention had been paid to formatting the XML, in any of the parts of the zipped container, in any meaningful way. Typically, the XML is dumped/stored as a non-indented, non-whitespace text stream. To try to make sense of it, I needed to be able to print the XML in a legible manner. In what I can describe only as a moment of temporary inspiration, I dropped into a command-line and typed xml followed by two tabs. A listing of pre-installed tools that start with the letters xml appeared on screen:
xml2-config xml-config xmllint xmlto xml2man xml-i18n-toolize xmlproc_parse xmlwf xml2pot xmlif xmlproc_val xmlcatalog xmlizer xmltex
The xmllint tool immediately caught my eye. Reading its man page uncovered the --format option, which—yes, you guessed it—pretty-prints XML provided to the tool. Therefore, typing xmllint --format content.xml resulted in output I could pipe to less and actually read without losing my sanity. Here's an abridged snippet of the pretty-printed content.xml showing the XML for the title_slide from the blank.sxi Impress document:
<draw:page draw:name="page1" draw:style- ... <draw:text-box presentation:style-name= ... <text:p text:style-name="P1"> <text:span text:style-name="T1"> ChapterTitleSlide </text:span> </text:p> </draw:text-box> <draw:text-box presentation:style-name= ... <text:p text:style-name="P3"> <text:span text:style-name="T2"> ChapterTitleSlideText </text:span> </text:p> </draw:text-box> <presentation:notes> <draw:page-thumbnail draw:style-name= ... <draw:text-box presentation:style-name ... </presentation:notes> </draw:page>
Notice the ChapterTitleSlide and ChapterTitleSlideText content, which I had typed into blank.sxi when creating it with Impress. If I could use the insertElement method to add raw XML based on this extract, with the empty content replaced with my textual content, I'd be home free.
By way of example, consider what happens once the title of the presentation and its subtitle are processed by produce_slides. The insertElement method is invoked as follows, creating a new slide:
$presentation->insertElement( '//draw:page', $last_slide++, title_slide( $title_title, $title_content ), position => 'after' );
The title_slide subroutine returns raw XML, which is inserted into the document.
Given an input file conforming to the textual content produced by getcontent, the produce_slides script clones the blank.sxi Impress file and populates any number of slides, programmatically producing a presentation. The script is not unlike getcontent in structure, its only warts being the verbatim inclusion of the required XML for each of the three slide types contained within blank.sxi. To create a presentation, invoke produce_slides as follows:
perl produce_slides 3 chapter3.input
This results in a new Impress document called chapter3.sxi appearing on disk.
With the Impress files created, I needed to replace my graphic image placeholders with the actual image. The getcontent script extracted the image filename, however, not the actual image. Importing the images into Impress should have been straightforward, except that the originals I had were of pretty poor quality compared to those that made it into the book. The final images had been improved greatly during the publisher's final typesetting phase. And, of course, I didn't have the final image files.
Then I remembered that the publisher had sent final proof PDFs with all the high-quality graphic images in place. I used xpdf to view the proofs at 200% and then fired up The GIMP to screen-capture the xpdf display window. I then cut out the graphic image and saved it as a JPEG. It took a little while, but when finished I had a beautiful set of book-quality images to import into my Impress presentations. With this task complete, I exported the Impress document to PowerPoint format and the job was done. My initial estimate of 20 days of effort was reduced to about 20 hours of real work.
And now, of course, if I need to produce some slides quickly, I can create my textual content manually in vi, run it through the produce_slides script and I'm done.
What started off as a seemingly impossible task—programmatically producing PowerPoint presentations—turned out to be quite possible, thanks to open source. All the tools I needed shipped out of the box with my stock Red Hat 9 distribution: vi, unzip, Perl, xmllint, xpdf, The GIMP and the OpenOffice.org suite.
Resources for this article: /article/8055.