This month's column takes a brief look at the GNU coding standards, a document that describes how to write and package GNU software.
What is it that makes a GNU program a GNU program? What makes GNU software “better” than other (free or non-free) software? The most obvious difference is the GNU General Public License (GPL), which describes the distribution terms for GNU software. But this is usually not the reason you hear people saying “Get the GNU version of xyz, it's much better.” GNU software is usually more robust, and performs better, than standard Unix versions. We're going to look at some of the reasons why, and at the document that describes the principles of GNU software design.
The GNU Coding Standards describes how to write software for the GNU project, and it covers a range of topics. As of this writing, related chapters are not grouped together, so we'll look at the chapters by topic, not in the order in which they appear.
You can find the GNU Coding Standards in the Autoconf distribution, currently autoconf-2.3.tar.gz, at your nearest GNU mirror site. An ASCII copy (standards.txt) should also be available there as a standalone file.
The first issue discussed has to do with intellectual ownership. If you're going to write a GNU program that re-implements a Unix utility, don't look at the Unix source code! (Source code licenses are harder and harder to get these days, so this is less of a problem than it was 10 years ago.) The other issue has to do with copyright assignment. If you're going to write or work on a GNU program, you have to either declare your work to be in the public domain, or assign the copyright in it to the FSF. (Small changes are exempt, so don't be scared off if you want to submit a bug fix. On the other hand, if you enhance GNU find so that it can read cpio archive tapes, you probably would have to do paperwork. Even this is usually painless.)
You can, of course, write a program from scratch, release it under the GPL, and keep the copyright. You may also generate your own changes to a program for which the FSF owns the copyright, and distribute your version separately from the FSF's version, under the GPL. Assigning copyright to the FSF is only a necessity when you want your changes to be folded back into the main distribution of a GNU program.
A number of chapters provide general advice about program design. The four main issues are compatibility (with standards and Unix), what language to write in, whether to rely on non-standard features of other programs (in a word, “don't”), and what “portability” means.
Compatibility with ANSI, POSIX, and Berkeley Unix is an important goal. But it's not an overriding one. The general idea is to provide all necessary functionality, with command line switches to provide a strict ANSI or POSIX mode.
C is the preferred language for writing GNU software, since it is the most commonly available language. In the Unix world, ANSI C is only now becoming common (sad but true), so K&R C is still the most widely portable dialect. This is changing rapidly though, with C++ becoming more commonplace. One widely used GNU package written in C++ is groff (GNU troff). With GCC supporting C++, it has been my experience that installing groff is not difficult.
The standards state that portability is a bit of a red herring. GNU utilities are ultimately intended to run on the GNU kernel with the GNU C library. But since the kernel isn't finished yet, and users are using GNU tools on non-GNU systems, portability is desirable, just not paramount. The standards recommend using Autoconf (about which I one day hope to write a column) for achieving portability among different Unix systems.
The next group of chapters provides general advice about program behavior. We will return to look at one of these chapters in detail, below. These chapters focus on how to design your program, how error messages should be formatted, how to write libraries (make them reentrant), and standards for the command line interface.
Error message formatting is important, since several tools, notably Emacs, use the error messages to help you go straight to the point in the source file or data file where an error occurred.
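To make the point about tool-friendly error messages concrete, here is a minimal sketch of the "file:line: message" layout that lets a tool such as Emacs jump to the offending line. The helper name format_diagnostic and its parameters are hypothetical, invented for this illustration; they do not come from the standards or from any GNU program.

```c
#include <stdio.h>

/* Hypothetical helper (not from any GNU program): format a diagnostic
   as "FILE:LINE: MESSAGE" so that tools can parse the location.
   Returns what snprintf returns: the length the full message needs,
   not counting the terminating NUL. */
int
format_diagnostic(char *buf, size_t size,
                  const char *file, unsigned long line, const char *msg)
{
    return snprintf(buf, size, "%s:%lu: %s", file, line, msg);
}
```

A caller would typically print the resulting string to stderr, followed by a newline.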
GNU utilities should use a function named getopt_long for processing the command line. This function provides command line option parsing for both traditional Unix style options (gawk -F: ...) and GNU style long options (gawk --field-separator=: ...). All programs should provide --help and --version options, and when a long name is used in one program, it should be used the same way in other GNU programs. To this end, there is a rather exhaustive list of long options used by current GNU programs.
The most substantive part of the manual describes how to write C code, covering things like formatting the code, comments, how to use C cleanly, how to name your functions and variables, and how to declare, or not declare, standard system functions that you wish to use.
Code formatting is a religious issue; many people have different styles that they prefer. I personally don't like the FSF's style, and if you look at gawk, which I maintain, you'll see it's formatted in standard K&R style. But this is the only variation in gawk from this part of the coding standards (other variations will go away in gawk 3.0, coming this year).
Nevertheless, while I don't like the FSF's style, I consider it of the utmost importance, when modifying some other program, to stick to the coding style already used. Having a consistent coding style is more important than which coding style you pick.
What I find important about the chapters on C coding is that the advice is good for any C coding, not just if you happen to be working on a GNU project. So, if you're just learning C, or even if you've been working in C (or C++) for a while, I would recommend these chapters to you, since they encapsulate many years of experience.
Two chapters cover writing documentation for your program. The preferred way is to write a manual using Texinfo, which was discussed in an earlier column (Issue #6, October 1994). There is some nice advice in here about writing manuals. And, as described earlier, Texinfo is an enjoyable language in which to write documentation.
Finally, there are three chapters devoted to the mechanics of making a release. These chapters discuss the conventions to use for Makefiles, how configuration should work, and other generalities about how a release should work.
These chapters, together with the Autoconf manual, provide the needed information for packaging up a program and making the final released tar file.
We'll take a look now at the chapter entitled Program Behavior for All Programs. This chapter provides the principles of software design that make GNU programs better than their Unix counterparts. We will quote selected parts of the chapter, with some examples of where these principles have paid off.
Avoid arbitrary limits on the length or number of any data structure, including file names, lines, files, and symbols, by allocating all data structures dynamically. In most Unix utilities, “long lines are silently truncated”. This is not acceptable in a GNU utility.
This is perhaps the single most important rule in GNU software design, “no arbitrary limits.” All GNU utilities should be able to manage arbitrary amounts of data.
While this makes it harder for the programmer, it makes things much better for the user. I have one gawk user who runs an awk program on over 650,000 files (no, that's not a typo) to gather statistics. gawk grows to over 192 Megabytes of data space, and the program runs for around seven CPU hours. He would simply not be able to run his program using another awk implementation.
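The "no arbitrary limits" rule can be illustrated with a line reader that grows its buffer as needed instead of truncating at a fixed size. This is only a sketch under simple assumptions: the name read_line and the starting capacity of 64 bytes are made up for the example, it is not how gawk or any particular GNU utility does it, and it does not guard against the capacity overflowing size_t.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read one line of any length from FP.
   Returns a malloc'd, NUL-terminated buffer (including the newline,
   if one was read) and stores its length in *LENP.  Returns NULL at
   end of file with no data, or if memory runs out. */
char *
read_line(FILE *fp, size_t *lenp)
{
    size_t cap = 64, len = 0;   /* 64 is an arbitrary starting size */
    char *buf = malloc(cap);
    int c;

    if (buf == NULL)
        return NULL;

    while ((c = getc(fp)) != EOF) {
        if (len + 1 >= cap) {   /* keep room for the final NUL */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        buf[len++] = (char) c;
        if (c == '\n')
            break;
    }

    if (len == 0) {             /* nothing read: end of file or error */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    *lenp = len;
    return buf;
}
```

The caller owns the returned buffer and must free it. Because the length is returned separately, the caller can also cope with lines that contain NUL characters.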
Utilities reading files should not drop NUL characters, or any other nonprinting characters (including those with codes above 0177). The only sensible exceptions would be utilities specifically intended for interface to certain types of printers that can't handle those characters.
It is also well known that Emacs can edit any arbitrary file, including files containing binary data!
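One way to honor the rule about NULs is to treat input as counted bytes rather than as C strings, since functions such as fgets and fputs cannot tell an embedded NUL from the end of the data. The following is a sketch only; copy_stream is a hypothetical name and the 4096-byte buffer is an arbitrary choice.

```c
#include <stdio.h>

/* Hypothetical helper: copy IN to OUT byte for byte, NULs and
   characters above 0177 included.  Stores the number of bytes copied
   in *TOTAL.  Returns 0 on success, -1 on a read or write error. */
int
copy_stream(FILE *in, FILE *out, unsigned long *total)
{
    char buf[4096];             /* I/O chunk size, not a data limit */
    size_t n;

    *total = 0;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
        *total += n;
    }
    return ferror(in) ? -1 : 0;
}
```

The fixed-size buffer here does not violate the "no arbitrary limits" rule: it bounds how much is moved per call, not how much data the program can handle.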
Check every system call for an error return, unless you know you wish to ignore errors. Include the system error text (from perror or equivalent) in every error message resulting from a failing system call, as well as the name of the file if any and the name of the utility. Just “cannot open foo.c” or “stat failed” is not sufficient.
Checking every system call provides robustness. This is another case where life is harder for the programmer, but better for the user. An error message detailing what exactly went wrong makes finding and solving any problems much easier.
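As a sketch of what such a message might look like in practice, the function below checks the result of open and builds a message containing the utility name, the file name, and the system error text. It assumes a POSIX system; the name open_or_explain and the exact wording "cannot open" are choices made for this example, not something the standards prescribe.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open PATH read-only.  On failure, write
   "PROG: cannot open PATH: REASON" into MSG (at most SIZE bytes) and
   return -1.  On success, return the descriptor and leave MSG alone. */
int
open_or_explain(const char *prog, const char *path, char *msg, size_t size)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        int saved = errno;      /* save before any other library call */
        snprintf(msg, size, "%s: cannot open %s: %s",
                 prog, path, strerror(saved));
    }
    return fd;
}
```

For a missing file, the message would read something like "mytool: cannot open foo.c: No such file or directory", although the exact error text depends on the system.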
Check every call to malloc or realloc to see if it returned zero. Check realloc even if you are making the block smaller; in a system that rounds block sizes to a power of 2, realloc may get a different block if you ask for less space.
In Unix, realloc can destroy the storage block if it returns zero. GNU realloc does not have this bug: if it fails, the original block is unchanged. Feel free to assume the bug is fixed. If you wish to run your program on Unix, and wish to avoid lossage in this case, you can use the GNU malloc.
You must expect free to alter the contents of the block that was freed. Anything you want to fetch from the block, you must fetch before calling free.
In three short paragraphs, Richard Stallman has distilled the important principles for doing dynamic memory management using malloc. It is the use of dynamic memory and the “no arbitrary limits” principle that make GNU programs so robust and more capable than their Unix counterparts.
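Two of these rules are easy to get wrong, so here is a small sketch of each. The names resize_block, struct node, and free_list are hypothetical. The first function assumes the behavior described in the quoted text, where a failed realloc leaves the original block unchanged, and it expects a nonzero size.

```c
#include <stdlib.h>

/* Hypothetical helper: resize *BLOCKP to SIZE bytes (SIZE > 0).
   On success, update *BLOCKP and return 0.  On failure, return -1 and
   leave *BLOCKP pointing at the original, still valid block.  The
   result is checked even when shrinking, since the block may move. */
int
resize_block(void **blockp, size_t size)
{
    void *tmp = realloc(*blockp, size);

    if (tmp == NULL)
        return -1;
    *blockp = tmp;
    return 0;
}

/* Hypothetical list type, used only to illustrate the rule about free. */
struct node {
    struct node *next;
    int value;
};

/* Free a whole list.  The next pointer is fetched BEFORE the node is
   freed, because free may alter the contents of the block. */
void
free_list(struct node *head)
{
    while (head != NULL) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}
```

Assigning the result of realloc straight back to the only pointer to the block, as in p = realloc(p, n), would lose the original block whenever the call fails; going through a temporary avoids that.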
Use getopt_long to decode arguments, unless the argument syntax makes this unreasonable.
Long options were mentioned earlier. Their use is intended to make GNU programs easier to use and more consistent than the Unix versions. The getopt_long function is a nice one; it provides all the flexibility and capabilities you may need for argument parsing. As a simple yet obvious example, --verbose is spelled exactly the same way in all GNU programs. Contrast this with -v, -V, -d, etc.
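For readers who have not used getopt_long, the following sketch shows the general shape of a call. It assumes a system whose C library provides getopt_long in getopt.h, as the GNU C library does. The struct options type, the parse_options function, and the choice of short letters are hypothetical; only the long names --field-separator, --verbose, --help, and --version come from the discussion above.

```c
#include <getopt.h>
#include <stddef.h>

/* Hypothetical container for the settings this example understands. */
struct options {
    int verbose;
    int help;
    int version;
    const char *field_sep;
};

/* Parse ARGV with getopt_long.  Returns the index of the first
   non-option argument, or -1 if an unknown option was seen. */
int
parse_options(int argc, char **argv, struct options *opts)
{
    static const struct option longopts[] = {
        { "field-separator", required_argument, NULL, 'F' },
        { "verbose",         no_argument,       NULL, 'v' },
        { "help",            no_argument,       NULL, 'h' },
        { "version",         no_argument,       NULL, 'V' },
        { NULL, 0, NULL, 0 }
    };
    int c;

    while ((c = getopt_long(argc, argv, "F:vhV", longopts, NULL)) != -1) {
        switch (c) {
        case 'F': opts->field_sep = optarg; break;
        case 'v': opts->verbose = 1;        break;
        case 'h': opts->help = 1;           break;
        case 'V': opts->version = 1;        break;
        default:  return -1;    /* getopt_long already printed a message */
        }
    }
    return optind;
}
```

Because each long option maps to the same value as its short form, -F: and --field-separator=: end up in the same case of the switch, which is what keeps the two styles consistent.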
Finally, we'll quote from an earlier chapter that discusses how to write your program differently than the way a Unix program may have been written.
For example, Unix utilities were generally optimized to minimize memory use; if you go for speed instead, your program will be very different. You could keep the entire input file in core and scan it there instead of using stdio. Use a smarter algorithm discovered more recently than the Unix program. Eliminate use of temporary files. Do it in one pass instead of two (we did this in the assembler).
Or, on the contrary, emphasize simplicity instead of speed. For some applications, the speed of today's computers makes simpler algorithms adequate. Or go for generality. For example, Unix programs often have static tables or fixed-size strings, which make for arbitrary limits; use dynamic allocation instead. Make sure your program handles NULs and other funny characters in the input files. Add a programming language for extensibility and write part of the program in that language.
An excellent example of the difference an algorithm can make is GNU diff. My computer's previous incarnation was an AT&T 3B1, a system with an MC68010 processor, a whopping two megabytes of memory, and 80 megabytes of MFM disk.
I did (and do) lots of editing on the manual for gawk, a file that is currently over 17,000 lines long (although at the time, it was only in the 10,000 lines range). I used to use diff -c quite frequently to look at my changes. On this slow system, switching to GNU diff made an extremely noticeable difference in the amount of time it took for the context diff to appear. The difference is almost entirely due to the better algorithm that GNU diff uses.
The GNU Coding Standards is a worthwhile document to read if you wish to develop new GNU software, enhance existing GNU software, or just wish to learn how to be a better programmer. The principles and techniques it espouses are what make GNU software the preferred choice of the Unix community.
As mentioned, the released version of the standards covers its topics in a rather haphazard order. As a result of working on this column, I volunteered to re-organize them into several related chapters. This new version may be available by the time you read this article; keep an eye on your nearest GNU mirror site.