Converting in the shell

Conversion on Command


Different newline conventions or character sets can make it difficult for Linux users to exchange data. A couple of conversion tools in the shell help remove the headaches.

By Heike Jurzik

Sebastian Duda, Fotolia

If you frequently exchange text documents between Windows and Linux systems, you only need a few commands in the shell to convert the different newline characters - dos2unix and unix2dos convert with a single command. If a character set causes problems beyond that, iconv and recode will convert the jumble of letters into something more palatable.

What's Going On?

File suffixes are just hollow words. Linux differentiates file types by their content and not by the way the file name ends. Linux recognizes, for example, an MP3 file as such, even if its name suggests a completely different file type.

Figure 1 shows the Gnome file manager, Nautilus, which identifies both the camouflaged MP3 file (chicken.txt in the example) and the LaTeX file pretending to be a PNG graphic file (letter.png).

Figure 1: No cheating - Linux recognizes file types by their content and not just by their file suffixes.

The file command-line tool helps to identify files correctly (Listing 1). This practical tool takes a closer look at the files and tries various tests to identify the file type. Usually, the first few bytes are all it takes to make a correct decision. If file can't make its mind up, it will let you know as follows:

$ file weirdfile
weirdfile:     data

The program gives you a fair amount of information on text files. The output in Listing 2 goes to show that text is not just text. Three text files generate three completely different messages from file: one message is UTF-8, one uses Windows line terminators (CRLF, see "The End of the Line ..."), and the other is ISO-8859. In the following sections, I will look at programs that modify these different kinds of text to facilitate the exchange of data with other systems.

Listing 1: File
$ file *
letter.png:     LaTeX 2e document text
chicken.txt:    MP3 file with ID3 version 2.3.0 tag
test.txt:       ISO-8859 text

Listing 2: Text is not just text
$ file *
utf8.txt:       UTF-8 Unicode text
win.txt:        ISO-8859 text, with CRLF line terminators
iso8859.txt:    ISO-8859 text
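
To reproduce output like Listing 2 on your own machine, the following sketch creates three small sample files and runs file on them. The file names and the sample text are made up for illustration, and the exact wording of the output varies between versions of file.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample files: the same word in three flavors.
printf 'Gr\303\274\303\237e\n' > utf8_demo.txt   # UTF-8: two bytes each for the umlaut and sharp s
printf 'Gr\374\337e\r\n'       > win_demo.txt    # ISO-8859-1 with CRLF line ends
printf 'Gr\374\337e\n'         > iso_demo.txt    # ISO-8859-1 with LF line ends

# Ask file for its verdict on each sample.
file utf8_demo.txt win_demo.txt iso_demo.txt
```

The only difference between the second and third sample is the carriage return at the end of the line, which is the subject of the next section.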

The End of the Line ...

... is often where the trouble starts because different operating systems use different approaches to marking ends of lines. The convention, which is basically taken from the way manual typewriters handle the end of a line, differs between Linux and Windows: whereas Linux systems just add a line feed (\n = "new line"), Windows uses a carriage return (\r = "return") followed by a line feed.

This small - but critical - difference is very noticeable when you open a text file in an editor; a text file created on Windows will be full of ^M characters if you open it in Vim.
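
If you would rather see the difference byte by byte than in an editor, od can print it. The following is a minimal sketch with made-up file names; the carriage returns it reveals are the characters that show up as ^M.

```shell
cd "$(mktemp -d)" || exit 1

# The same two lines, once with Linux (LF) and once with Windows (CRLF) endings.
printf 'one\ntwo\n'     > lf_demo.txt
printf 'one\r\ntwo\r\n' > crlf_demo.txt

# od -c prints every byte; the Windows file shows \r in front of each \n.
od -c lf_demo.txt
od -c crlf_demo.txt

# Count the carriage returns: none in the first file, one per line in the second.
tr -cd '\r' < lf_demo.txt   | wc -c
tr -cd '\r' < crlf_demo.txt | wc -c
```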

The shell tools dos2unix ("DOS to Unix") and unix2dos ("Unix to DOS") solve this problem. As the names suggest, they convert files from one format to the other.

Some Linux distributions bundle the tools in packages of their own; current Debian versions provide the tofrodos package with the programs fromdos and todos. Two symbolic links from /usr/bin/dos2unix and /usr/bin/unix2dos point to the tools, allowing users to work with dos2unix and unix2dos, or with fromdos and todos, as needed. The programs basically do the same things, although with a couple of differences.

In this article, I will refer to these tools as dos2unix/unix2dos (because you can still use this syntax on Debian systems) and point out where the options differ.

All's Well That Ends Well

To repair a text file created on Windows for use by Linux users, just type:

$ dos2unix win.txt

The tool is communicative and tells users what is going on behind the scenes in the form of command-line output:

dos2unix: converting file win.txt to UNIX format ...

The Debian version does not give you any information by default, although you can set the -v (verbose) option to make Debian talk:

dos2unix: Converting win.txt

Because the command processes the original file on whatever distribution you use and does not create an automatic backup, you might want to use the Debian option of setting the -b flag to tell dos2unix to create a backup (a file with a .bak suffix):

$ dos2unix -b win.txt
$ ls
win.bak  win.txt

Users with other distributions can set the -n flag to specify a different name for the output file. The syntax is dos2unix -n input output, for example:

$ dos2unix -n win.txt linux.txt

The -k (-p on Debian) flag is another useful option; it tells the tool to keep the original timestamp of the file. All of these command-line parameters work in the same way for unix2dos, which converts in the other direction.
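
On a system where neither dos2unix nor fromdos is installed, standard tools can approximate both directions. This is a rough sketch rather than a replacement: tr removes every carriage return, not only those at line ends, and none of the options described above (backups, timestamps) are available. The file names are made up for illustration.

```shell
cd "$(mktemp -d)" || exit 1
printf 'one\r\ntwo\r\n' > win_demo.txt    # sample file with CRLF line ends

# Windows -> Linux: delete all carriage returns.
tr -d '\r' < win_demo.txt > linux_demo.txt

# Linux -> Windows: let awk end every line with CR+LF.
awk '{ printf "%s\r\n", $0 }' linux_demo.txt > back_demo.txt

# The round trip should reproduce the original byte for byte.
cmp win_demo.txt back_demo.txt && echo "round trip OK"
```

Both commands write to a new file, so the original stays untouched, much like the -n option of dos2unix.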

Character Set Confusion

Some distributions insist on the use of the legacy character set, whereas others rely on the multi-lingual new kid on the block. The ISO-8859-1(5) versus UTF-8 debate has achieved cult status on forums and mailing lists, similar to the perennial discussions of the best Linux text editor. All modern distributions now support UTF-8 by default.

If you are not prepared - or simply unable - to change for some reason, you can convert back to ISO-8859-1(5) with a couple of simple steps; this said, you might be faced with strange renderings of special characters when you need to exchange HTML or text files. Empty boxes in menus or dialogs, broken accents or umlauts, and many other bugs make the change unfriendly either way.

The character set and format converters recode and iconv help users avoid headaches. Although iconv is included with most distributions, you might need to install recode, for which you can use your preferred package manager.

The -l parameter tells you which formats the tools support; to avoid the long list scrolling off your screen, you might want to pipe the output to a pager, as in recode -l | less or iconv -l | less.

Reapply Daily

Both tools accept files at the command line or will work as filters. When working as filters, the tools accept data from standard input and send the results to standard output (Figure 2).

Figure 2: Fighting code confusion - if you are not prepared to move to UTF-8, you can run a command in the shell to convert.

If you enter the following at the prompt,

$ iconv -f UTF-8

the tool knows that the incoming text is encoded in UTF-8.

If you then type text at the command line or copy it from an application to the shell and then press Ctrl+D to stop reading input, iconv will convert your input and display the results on your standard output.

By default, the program will use the encoding of your current locale (echo $LANG [1]). If this is set to de_DE@euro, for example, iconv will give you the results in ISO-8859-15 (Latin-9, which contains the euro character besides the characters provided by ISO-8859-1).
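
To check which encoding iconv would pick as its default target, you can ask the locale system directly. The sketch below assumes an iconv that knows the names UTF-8 and ISO-8859-15 (as the glibc version does); it also shows the euro sign shrinking from three bytes in UTF-8 to a single byte in Latin-9.

```shell
# Character set of the current locale; iconv falls back to it when -t is omitted.
locale charmap

# The euro sign is three bytes in UTF-8 (e2 82 ac) ...
printf '\342\202\254' | od -An -tx1

# ... and one byte (a4) in ISO-8859-15.
printf '\342\202\254' | iconv -f UTF-8 -t ISO-8859-15 | od -An -tx1
```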

Recode works in a similar way - if you enter recode utf8, the program will wait for UTF-8-encoded text and output ISO-8859-1 (Latin 1) by default.

If you need a different kind of output, you can pass in the required character set to iconv using the -t option. For example:

$ iconv -f UTF-8 -t MAC

Recode expects slightly different syntax - the input format and the output format, separated by two dots (older versions of the program expect a colon):

$ recode utf8..latin9

The programs will also accept the file to convert as a command-line parameter:

$ iconv -f UTF-8 utf8.txt
$ recode utf8..latin9 utf8.txt

Whereas iconv outputs the results in the shell, recode overwrites the original.

The redirection operators are useful here [2]:

$ recode utf8..latin9 < utf8.txt > iso8859.txt

Redirection works the same way with iconv; if you prefer not to use operators, you can set the -o option followed by the output file:

$ iconv -f UTF-8 utf8.txt -o iso8859.txt
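
A quick way to confirm that such a conversion lost nothing is to convert the result back and compare it with the original. The sketch below assumes an iconv that supports -o (as the glibc version does) and names the target encoding explicitly with -t instead of relying on the locale; the file names are placeholders.

```shell
cd "$(mktemp -d)" || exit 1
printf 'Gr\303\274\303\237e\n' > utf8_demo.txt    # sample UTF-8 text

# Convert to Latin-9, writing the result with -o instead of a redirect.
iconv -f UTF-8 -t ISO-8859-15 -o iso_demo.txt utf8_demo.txt

# Convert back and compare; cmp is silent when the files are identical.
iconv -f ISO-8859-15 -t UTF-8 iso_demo.txt | cmp - utf8_demo.txt && echo "no loss"
```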

Conclusions

The benefits of these conversion artists in the shell are obvious.

If you need to convert a large number of files for your own system, you can use the normal Bash tricks:

$ for i in ~/download/*.txt; do recode utf8..latin9 "$i"; done

A for loop tells recode to convert multiple files in a single pass. Of course, this will also work with iconv, provided you send the output of each run to a file.
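
Because iconv does not overwrite its input, a loop around it has to name an output file for each input. The following sketch writes the converted copies next to the originals with a .latin9 suffix; the scratch directory, the file names, and the suffix are all made up for illustration.

```shell
dir=$(mktemp -d)                                # stand-in for ~/download
printf 'Gr\303\274\303\237e\n' > "$dir/a.txt"   # two sample UTF-8 files
printf '\303\204rger\n'        > "$dir/b.txt"

for i in "$dir"/*.txt; do
    # Quote "$i" so that file names containing spaces survive.
    if iconv -f UTF-8 -t ISO-8859-15 "$i" > "$i.latin9"; then
        echo "converted: $i"
    else
        echo "failed:    $i" >&2
        rm -f "$i.latin9"                       # do not keep a partial result
    fi
done
```

Keeping the originals makes it easy to rerun the loop if one of the files turns out not to be UTF-8 after all.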

INFO
[1] "Command Line: Environmental Variables" by Heike Jurzik, Linux Magazine, July 2007, pg. 89
[2] "Command Line: Data Flow" by Heike Jurzik, Linux Magazine, September 2007, pg. 88