LJ Archive


Greg Roelofs

Issue #33, January 1997

Primer on zip, unzip, pkzip.

As much as we all love Linux, it is nevertheless true that occasionally we must force ourselves to deal with the DOS/MS-Windows world, however indirectly. For some of us that involves having a dual-boot system (perhaps via LILO—the LInux LOader—or OS/2's Boot Manager), but even those of us who manage to avoid that fate will sooner or later come across files that originated on some flavor of DOS or Windows system. More than likely, a few of those files will end in .zip—and that's where the unzip command comes in.

unzip is a free utility to process zipfiles, as these things are generally called. Zipfiles are actually archives of one or more other files, almost always compressed to save disk space and/or transmission time. In this regard they are similar to compressed tar archives, which are those files usually ending in .tar.Z, .tar.gz or .tgz that one finds on most Linux ftp sites and many CD-ROM distributions. One major difference between zip files and tar archives: compressed tar archives bundle all of the files together and then compress the result as a single entity; zipfiles compress individual files, then store them in the archive. This zip file method isn't quite as efficient in achieving the maximal overall compression, but it does allow you to list the archive's contents and to extract individual files without decompressing the whole mess.


How does one actually use unzip to list an archive's contents? The simplest way is with the -l option (for “list”):

$ unzip -l quake92p.zip
Archive:  quake92p.zip
 Length    Date    Time    Name
 ------    ----    ----    ----
  36064  06-25-96  13:18   DEICE.EXE
 369135  06-27-96  03:51   QUAKE92P.1
   2618  06-27-96  03:34   README.TXT
    177  06-25-96  20:07   INSTALL.BAT
    206  06-27-96  03:54   QUAKE92P.DAT
 ------                    -------
 408200                    5 files

You have each file's name (on the right), its uncompressed size, and the date and time of its last modification. For many of us, however, especially those long steeped in the terse intricacies of ls, this is a little too short and sweet. For fans of ls, or for anyone wishing to know more about the details of the archive, unzip has an entire mode devoted to listing both useful and obscure zipfile information: zipinfo mode, triggered via the -Z option. (On some systems the zipinfo command exists as a link to unzip and is synonymous with unzip -Z, but this is not true of Slackware distributions as of this writing.) We'll limit ourselves to a description of the default zipinfo listing format:

$ unzip -Z quake92p.zip
Archive:  quake92p.zip   406075 bytes   5 files
-rwxa--     2.0 fat   36064 b- defN 25-Jun-96 13:18 DEICE.EXE
-rw-a--     2.0 fat  369135 b- stor 27-Jun-96 03:51 QUAKE92P.1
-rw-a--     2.0 fat    2618 t- defN 27-Jun-96 03:34 README.TXT
-rwxa--     2.0 fat     177 t- defN 25-Jun-96 20:07 INSTALL.BAT
-rw-a--     2.0 fat     206 t- defN 27-Jun-96 03:54 QUAKE92P.DAT
5 files, 408200 bytes uncompressed, 405569 bytes compressed:  0.6%

You will immediately recognize a certain resemblance to the output of ls -l. The header line gives the archive name, its total size, and the total number of files in it; the trailer gives the number of files listed (in this case all of them), the total uncompressed and compressed data size of the listed files (not counting internal zipfile headers), and the compression ratio. Here the ratio is quite poor, mostly due to the fact that the largest file (QUAKE92P.1) is stored without any compression. In the leftmost column are the file permissions. The next column indicates the version of the archiver, and the one after that is what tells us the files came from the FAT (DOS) file system. Next are the uncompressed file size and a column indicating which files are most likely to be binary and which are probably text. The next three columns note the compression method used on each file; the time stamps; and the full file names.


Now that we know what files we have, how do we actually get the files out? File extraction is as simple as typing unzip and the file name:

$unzip quake92p
Archive:  quake92p.zip
  inflating: DEICE.EXE
  extracting: QUAKE92P.1
  inflating: README.TXT
  inflating: INSTALL.BAT
  inflating: QUAKE92P.DAT

Here we've omitted the .zip suffix; unzip first looks for the file quake92p and, not finding it, checks for quake92p.zip instead. What if we wanted only the README.TXT file? No problem. Anything (well, almost anything) after the zipfile name is taken to be the name of one of the enclosed files:

$unzip quake92p README.TXT
Archive:  quake92p.zip
 inflating: README.TXT

Here you may notice a little snag. If you now edit this file in Linux with an editor like vi, you'll see what looks like ^M at the end of each and every line. Or, if you view the file with a pager like more, you'll discover that any line uncovered by the --More-- prompt gets erased immediately. These problems are due to the fact that DOS and its successors store text files with two end-of-line characters, CR and LF (a.k.a. carriage return and linefeed, respectively, or ^M and ^J, or CTRL-M and CTRL-J), rather than the more efficient single character (LF) used on all Unix systems. So when a Unix utility—like an editor or a pager or a compiler—looks at a DOS text file, it may behave a little oddly or die altogether.

Fortunately there's a simple solution: unzip's -a option. Originally a mnemonic for ASCII conversion, the option these days is used for all sorts of text-file conversions. As a single-letter option it does its best to automatically convert files that are supposedly text, while leaving alone those that are marked binary. Be careful! zip and PKZIP don't always guess correctly when creating the archive, particularly for certain classes of MS-Windows files, and unzip's “text” conversions are almost always irreversible. In other words, don't extract with auto-conversion and then delete the original zipfile without first making sure everything is Okay. unzip does indicate which files it thinks are text when auto-converting, however:

$ unzip -a quake92p
Archive:    quake92p.zip
inflating:  DEICE.EXE               [binary]
extracting: QUAKE92P.1              [binary]
inflating:  README.TXT              [text]
inflating:  INSTALL.BAT             [text]
inflating:  QUAKE92P.DAT            [text]

In this case everything worked as intended. If, for some reason, zip marked a text file as binary and you want to force text conversion, simply double the option: -aa.

But wait, there's more! The discriminating Linux user, happily accustomed to a file system that not only preserves the case of file names but also distinguishes between names differing only in case, is not going to settle for a bunch of all uppercase DOS file names in his or her directories. Enter the -L option. If (and only if) the file came from a single case file system like DOS FAT or VMS, unzip -L will convert it to lowercase upon extraction, thusly:

$ unzip -aL quake92p
Archive:  quake92p.zip
  inflating: deice.exe               [binary]
  extracting: quake92p.1             [binary]
  inflating: readme.txt              [text]
  inflating: install.bat             [text]
  inflating: quake92p.dat            [text]

Isn't that nice?


So now you've just downloaded a whole bunch of zipfiles but don't want to unpack them just to make sure they're Okay. What's the solution? Use the -t option to test them:

$ unzip -t quake92p
Archive:  quake92p.zip
    testing: DEICE.EXE                OK
    testing: QUAKE92P.1               OK
    testing: README.TXT               OK
    testing: INSTALL.BAT              OK
    testing: QUAKE92P.DAT             OK
No errors detected in compressed data of quake92p.zip.

Here we tested only one, and the output is a little too verbose—we really want only the one-line summary for each archive. unzip supports both a -q option for various levels of quietness (the more q's, the quieter) and the concept of wildcards, both for the internal files and for the zipfiles themselves:

$ unzip -tq \*.zip
No errors detected in compressed data of arena2b-grr.zip.
No errors detected in compressed data of PngSuite.zip.
No errors detected in compressed data of libgr2-elf-install.zip.
No errors detected in compressed data of ppmz-7.3.zip.
arithc.c                bad CRC e220fe9c  (should be 1c24998c)
At least one error was detected in macm.zip.
No errors detected in compressed data of xfer-zip151.zip.
No errors detected in compressed data of quake091.zip.
No errors detected in compressed data of quake92p.zip.
No errors detected in compressed data of p93b2200.zip.
8 archives were successfully processed.
1 archive had fatal errors.

Note that the wildcard character (“*”) is escaped with a backslash (“\”). Most shells expand wildcards themselves, and if we allowed that, unzip would see the command line as a list of archives; it would treat the first one as the zipfile name and the rest as files to be tested within the first one. By escaping the wildcard, we allow unzip to do its own directory search and wildcard-matching—which, incidentally, has the advantage that Unix-style regular expressions (very powerful wildcards) can be used not only under Linux but under all of the operating systems for which unzip ports exist, even plain old DOS.

The other thing to notice is that one of the archives has an error in it. Perhaps there was a transmission error, or maybe the original was damaged when it was created; either way, the file arithc.c in macm.zip is probably not going to be usable. It's always good to know these sorts of things sooner rather than later.

There are quite a few other options and modifiers not covered here; a full tutorial would occupy most of this magazine. Fortunately, the unzip and zipinfo man pages (man unzip and man zipinfo) contain a complete listing of all of the options and examples for many of them. Unfortunately, Slackware 3.0 and earlier don't include the zipinfo man page. An abbreviated summary of zipinfo's options is available by typing unzip -Z . Similarly, a summary of most of unzip's options can be had simply by typing unzip with no parameters.

History, Acknowledgments, Pointers and the Future

unzip, zipinfo, zip and their kin were written by the Info-ZIP group, an Internet-based collection of strange beings from another universe who are currently scattered all over the planet. Yours truly (that would be me) is the principal author of unzip and zipinfo, but literally hundreds of people have contributed to them. Originally based on code by Samuel H. Smith, unzip has since been completely rewritten, with the exception of one routine which is no longer included by default. Nevertheless, we certainly owe him a debt of gratitude for getting us into this pickle. It would probably also be nice to mention the folks at PKWARE, whose PKZIP and PKUNZIP programs are the source of most of the DOS-originated zipfiles in the world. Note that Info-ZIP's programs are intended to be compatible with PKWARE's zipfiles, but they are not clones of PKWARE's programs. (For example, unzip recreates stored zipfile directory trees by default, whereas PKUNZIP requires a special option to do it.

Note also that while zip and gzip (sometimes called “GNU zip”) have similar names, a similar heritage—Jean-loup Gailly and Mark Adler are the co-authors of the latter and are also long-standing members of the Info-ZIP group—and the same compression engine, the two programs are basically incompatible. The same goes for unzip and gunzip. Jean-loup never foresaw the confusion that would arise from the similarity, and I was too late in suggesting the obvious, sick alternative (feather*) to get the name changed.

On a more serious note, the current version of unzip is 5.2, and 5.21 will be out by the time you read this. While everything discussed above works equally well with the previous version (5.12), there are various new features and other improvements that make 5.2 worth getting. You can find the latest public releases of source code and executables at UUNET's anonymous ftp site:

ftp://ftp.uu.net/pub/archiving/zip/ ftp://ftp.uu.net/pub/archiving/zip/UNIX/LINUX/

You can also find news, history, descriptions of certain weirdos, and pointers to other ftp sites around the world at the following web site:


Greg Roelofs escaped from the University of Chicago with a degree in astrophysics and fled screaming to Silicon Valley, where he now does really cool graphics and compression stuff for Philips Research. He joined Info-ZIP in the spring of 1990, shortly after the group formed, and under his dark influence the group has nearly achieved its goal of Total Universal Reconstructive Disintegration, lacking only a better acronym in order to complete their plans. He's also the father of the Cutest Baby in the Known Universe. He be reached by e-mail at newt@uchicago.edu or on the Web at quest.jpl.nasa.gov/Info-ZIP/people/greg/.

[ ---- this is a footnote ---- ] * So all of the archives on Sunsite would be...yes, you guessed it: tar'd and feather'd. Bwah ha ha ha ha ha ha! If only.

LJ Archive