A look at the questions and answers for a contest to find Linux solutions to common problems.
On April 24, 1999, the Slovenian National Linux User Group organized the first national competition in UNIX/Linux. The competition was among the activities connected with the 23rd Annual National Competition in Computer Science for high school students and the Fifth Festival of Computing, both of which were held in late April at the Department of Computer Science at the University of Ljubljana. As opposed to the already long-established tradition of national competitions in computer science, where a great emphasis is given to problems easily solved with procedural programming languages, this competition seeks its niche in favoring solutions implemented with “alternative” tools from the UNIX programming environment, where the strengths of these tools can be shown in the best way.
When preparing the exercises for this competition, those of us who served as the organizing committee searched the Web extensively for similar competitions elsewhere. Unfortunately, we found none. We would therefore like to make our own experiences available to the broader audience and hopefully initiate some collaboration in this area. In this article, we present the exercises that were part of the competition together with their solutions, explained in detail and commented. We conclude with the results of the competition accompanied by a few remarks and afterthoughts.
2:1. Frequency Analysis of Text
Perform a simple frequency analysis of text. Read a text file and write to the standard output the complete list of all words in the text along with their frequencies. The list should be sorted in ascending order, with the least frequent words on the top and the most frequent ones on the bottom. A “frequency” is a number which tells us how many times a certain word occurs in the given text. You can assume the text file contains no other characters but letters and blanks.
2:2. vi Editor
On a multi-user system, you want to prevent multiple users from editing the same file simultaneously using the vi editor. Suggest a solution! Implement your solution as a script. Comment the good and the bad aspects of your solution. You can assume that every user calls the editor using the command vi filename. You cannot rely on the vi editor issuing a warning when editing an already-opened file.
2:3. Shuffle
The lines in a given file are sorted by certain criterion. You don't want this, and would like to shuffle the lines pseudo-randomly. Propose your shuffling algorithm and implement it in the code. Pay attention to the efficiency of your code. “Pseudo-randomly” means you are allowed to use the random number generator available in the tool you will choose.
2:4. IP Addresses
A text file contains multiple references to IP addresses. You want to replace them all with the fully qualified domain names (FQDN). The IP addresses and the corresponding FQDNs are listed in the /etc/hosts file:
193.2.1.72 nanos.arnes.si
You can assume this file contains nothing but records like the one shown.
Numerical IP value is of the form 0.0.0.0-255.255.255.255. Without sacrificing much generality, you can assume every pattern 0.0.0.0-999.999.999.999 in the given text file is a valid IP address and its corresponding FQDN can be found in the /etc/hosts file. You can also assume each IP address in the text file is delimited by at least one blank character on both sides.
You are allowed to use the script languages of any command shell (e.g., sh, csh, ksh, bash) or any script language such as sed, awk and perl. You are also allowed to use all the user commands provided by a UNIX system as specified by POSIX. You are not allowed to use compiled languages such as C, C++, Pascal or FORTRAN. If you are in doubt as to whether the tool you plan to use is allowed, you can ask a member of the supervising committee. The decision of the supervising committee is final.
2:1. Frequency Analysis of Text
It is easy to solve this exercise with the sort and uniq commands and some command pipelines. The solution can be written in a single line:
tr " " "\n" < filename | sort | uniq -c | sort -n
With tr we replace each blank in the text file with a newline, thus chopping it into a form where a single word is written per line. Since tr expects the data stream from the standard input, we have to redirect the data from the file to the standard input with the < sign. The author of the exercise has made life a bit easier for us, since we don't have to worry about the periods, commas and other non-letters in the text. The output from tr is pipelined (the | sign) to the input of the sort command, which outputs an alphabetically sorted list of all words in the text file. We pipeline it again to the input of the uniq command. With no options, the uniq command eliminates multiple occurrences of adjacent lines in a text. With the option -c (count), it also reports the number of occurrences for each line read on the input. Since we have already chopped the text beforehand, uniq thus outputs the list of frequencies. The only remaining task is sorting it by frequency rather than alphabetical order, which we do with the second sort command. The -n option is used to perform the sort by the numerical value of the first field.
Was the explanation too quick? If you are not familiar with pipelines, we advise you to try and assemble the complete operation one step at a time. To start, use the tr command alone:
tr " " "\n" < filename
Now try to pipeline its output into sort, and watch what happens! Continue with the rest of the pipelined commands.
Let us conclude this discussion with an illustration showing frequency analysis on an actual text. The ten most common words in Martin Krpan, a text by Fran Levstik, are:
61 v 65 Krpan 74 bi 74 na 107 da 121 in 127 ne 142 se 161 pa 207 je
2:2. vi Editor
This exercise requires familiarity with the concept of file locking. The given program (in our case, the vi editor) is renamed or moved to a directory not included in the users' PATH environment variable. In its place, we put an enveloping script named vi which, when called, creates a lock file, starts the editor and removes the lock file on exit. The lock file is created in the same directory as the file we are editing, and not in the /tmp directory, for instance. This is necessary to avoid any confusion which could arise when two files with the same name exist in two different directories.
A possible problem may occur if two users simultaneously start editing a certain file. User A starts the script, which discovers a lock file does not exist and proceeds to create one. Before it actually creates it—remember, UNIX is a multitasking system—the process is interrupted by user B starting the same script. Since the lock file still doesn't exist, the script also decides that user B is allowed to set the lock.
The problem can be avoided if the procedure is reversed: first we create the lock file, and afterwards we check if we were successful. Does that sound strange? The solution presented here does exactly that. Using the touch command, the script always tries to create a lock file, then reacts depending on the exit status reported by touch. On a successful exit, touch exits with a zero (true) exit code, while on an unsuccessful exit, it reports a non-zero (false) exit code. There are other possible situations in which touch exits with a non-zero exit code, but here we are focusing on only one. If we try to touch a file belonging to another user for which we don't have write permission, touch reports an error and exits with a non-zero exit code. Therefore, we have to make sure that when the lock file is created, no other user has write permission. This is accomplished using the umask command as shown below: user (u) is the only one who is granted both read and write (rw) permissions, while users from the same group (g) and all the other (o) users are granted none. Since we don't want the umask command to affect our environment, we have limited its influence by enclosing it in parentheses. The gibberish following the touch command is used to send the error messages produced by touch into oblivion. We are interested only in the exit code and are happy with touch running absolutely quietly.
#!/bin/ksh if ( umask u=rw,g=,o=; touch $1.lock >\ /dev/null 2>&1 ) then vi $1 rm -f $1.lock else echo "$1 is currently locked." fi
Since it depends on the touch exit code, the solution presented here doesn't prevent a user from opening his own file more than once. This is not explicitly required in the exercise, so it is not considered a deficiency in this context. There is also nothing preventing root from opening anybody's files. Moreover, the solution employing an enveloping script does not prevent any user from reading an already-opened file from within the vi editor. Unlike the traditional AT&T vi, many of its newer incarnations have the file-locking facility built in. vim (Vi IMproved), the most common vi replacement on Linux, is one of them.
2:3. Shuffle
Shuffling is obviously a procedure that can easily be accomplished by brute force. First, each line is prepended by a random number, then the lines in the file are sorted according to these random numbers, and finally all the prepended numbers are cut away, retaining only the original lines in their altered order. Again, the solution can be written in a single line:
awk '{ print rand(), $0 }' sorted.lst | sort -n
|
cut -d " " -f 2-
The first command performs the following statement in the awk scripting language on the file sorted.lst:
{ print rand(), $0 }For those not familiar with awk, commands in this scripting language have the form
condition { statement }For each record on input (if not specified otherwise, a record defaults to a line, as in the present case), the condition is checked, and if met, the corresponding statement is executed.
In our case, no condition was specified, which means the statement is executed for each line on input. The statement itself says to print a random number followed by the line read on the input. The output from awk is then piped to the input of sort, which sorts the lines by the numerical value of their first field. The first field is the random number which was prepended in the previous step.
Finally, we remove the prepended random numbers with cut. The two options given to cut mean, treat blank as field delimiter (-d " ") and extract the fields from the second field onward (-f 2-). We have thus dropped the first field, which was the prepended random number.
Sorting makes the above algorithm an N log N one. A more efficient shuffling was invented by Durstenfeld (Comm. ACM 7, 1964, 420), having a linear dependency. Here we present an implementation in Perl for our purpose:
#!/usr/bin/perl @line =<>; # the complete input file #is read into a vector for ($i = 0; $i <= $#line; $i++) { # for i-th line $rnd = $i + int(rand($#line - $i + 1)); # we select a random line # between i and the end print $line [$rnd]; # print it $line [$rnd] = $line[$i]; # replace with i-th }
2:4. IP addresses
This exercise requires recognizing the following pattern in the text: a blank, followed by one, two or at most three digits, a dot, again one, two or three digits, another dot, again one, two or three digits, yet another dot, once more one, two or three digits, and the final blank. This pattern can be efficiently described as a regular expression in Perl:
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
Here \b denotes a word boundary, \. is an explicit dot (otherwise a dot means “any character” in the regular expression syntax) and \d{1,3} means a sequence from one to three digits.
The solution consists of two separate tasks. First, we have to read the /etc/hosts file and construct a mapping table. Second, we scan the text file for IP addresses and replace them with the corresponding FQDN from the mapping table. Again we present a solution in Perl:
#!/usr/bin/perl open (HOSTS_FILE, "/etc/hosts"); while ($line = <HOSTS_FILE>) { chomp $line; ($ip, $fqdn) = split(/ +/, $line); $hostbyip {$ip} = $fqdn; } close(HOSTS_FILE); while (<>) { s/\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) \b/$hostbyip{$1}/g; print; }
When constructing the mapping table $hostbyip, we used a nice feature of Perl called associative arrays or hashes, which allows us to refer to an element in the table by a symbolic name. Here, we used the IP address written as a string (variable $ip). The address replacement in the second task is implemented using the s (substitution) operator: s/pattern/replacement/g. The final g (global) modifier makes the substitution operator more eager; the first pattern found in the line doesn't satisfy it, but it continues to scan the rest of the line for another occurrence(s) of the given pattern.
In reality, the Domain Name System has replaced the /etc/hosts files long ago. For a more realistic example, we would have to replace the reading of /etc/hosts table by a gethostbyaddr call. The modified program is actually even shorter:
#!/usr/bin/perl -p BEGIN { sub gethostbyip { my ( $ip ) = @_ ; $packaddr = pack ("C4", split (/\./, $ip) ); ($name) = gethostbyaddr( $packaddr, 2 ); return $name; } } s/\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) \b/gethostbyip($1)/eg;
Thirteen competitors entered this year. They had 90 minutes to solve all exercises using pencil and paper. They were allowed to use all available literature, but did not have access to a computer to test their solutions. Therefore, the committee decided to judge favorably all solutions exhibiting the correct ideas, even if they were not written syntactically error-free. After reviewing all submitted solutions, the committee decided to award prizes to the following three contestants:
Andraz Tori
Mitja Bezget
Gasper Fele-Zorz
The average quality of the submitted solutions suggests that despite all the publicity Linux received during the last year, high school students are not particularly familiar with the tools from the UNIX/Linux programming environment. This is a pity, since we believe that the many strong scripting languages and modular tools are one of Linux's advantages over its competitors.
The analysis of the anonymous questionnaire which the competitors were asked to fill in also yielded some interesting results. The competitors were asked questions about their own computing environment, to classify the exercises on a scale of 1 to 5 from “easy” to “difficult” and to give suggestions. Some found the limit to scripting languages too restrictive. An interesting response came from a competitor who considered the exercises would be simple “provided that C or C++ could be used”.
This calls for a comment. Every exercise is easiest to solve in a language one is familiar with. Anybody familiar with both C and some scripting language would probably agree that these exercises can be solved in the latter with much less effort. This was intentional. What wasn't intentional is that we discovered the fact that high school students hardly touch on any scripting language at all and are thus unaware of their benefits. Rapid prototyping, for example, is quickly and easily accomplished from readily available tools. This is a complement rather than a replacement for the compiled languages. If speed is truly important, a successful prototype is normally followed by a compiled version, usually distinguished by much faster execution. In practice, both approaches coexist, while in our schools, one seems to have a complete dominance.
Last but not least, the title of this competition also probably deserves a comment. UNIX, or Linux in its narrower sense, denotes the operating system kernel. Kernel-level exercises were not part of this competition and, given the limited time and resources available to competitors, would not be feasible at this moment. It would be honest to admit that in the trade-off between short and catchy names and long and precise ones, we have leaned towards the former. Who knows—perhaps the extra room we have created for ourselves will even prove useful at some later time.