LJ Archive

Work the Shell

Analyzing Comma-Separated Values (CSV) Files

Dave Taylor

Issue #260, December 2015

Introducing FIX-CSV, a script to analyze and fix errors in comma-separated values (CSV) files, so Dave finally can do his taxes. No, really. Read on!

Ugh. I've been working on my taxes. I know, it's a bit weird to be doing my taxes in the autumn, but if you defer and file an extension with the IRS every year, well, then you're used to tax time being September/October, not April 15th.

I have a very old-school, geeky way of preparing for my own taxes, and it involves using an Excel spreadsheet to enter all my line-item expenses, then normalizing and cleaning up the data. When that's all done, which typically involves a lot of sorting and re-sorting of the data, I export it all as a comma-separated values data file and pull out a Linux shell script to analyze and summarize expenses by category.

I suppose I could do that in the spreadsheet program itself, but it either would involve me having to learn the spreadsheet's programming language (for example, Visual Basic in Microsoft Excel 2016) or manually click-dragging series of cells to summarize their values. Both are tedious, and however peculiar my solution, the idea of actually learning Visual Basic just boggles my mind, so that's just not an option.

But, there's a lurking problem in the CSV format I use, and to understand it, I need to dig in to exactly what's being formatted.

A typical expense entry has four fields: date, category, amount and any detailed notes or comments. For example:

4/10/14   subscriptions   19.99    Linux Journal

All of it's neatly organized in columns and data cells, as befits a spreadsheet, of course.

Random aside: did you know that it was a spreadsheet that proved the viability of the personal computer back in the day? VisiCalc was the groundbreaking app with its sophisticated interface (for the day, at least) and ability for accountants and business folk to create sophisticated mathematical tables and regular people to...balance their checkbooks. Yes, one of the killer apps for the very first generation of PC was checkbook balancing. We've come a long way!

With a spreadsheet populated with these four fields, the easiest way to create a dataset for further work is to export it as comma-separated values, the “CSV” format. Here's how that particular line will be exported:

4/10/14,subscriptions,19.99,Linux Journal

Not too bad, and it's easily managed by changing the field separator to a comma. For example, to extract just the numeric value: cut -d, -f3.
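For instance, here's a quick sanity check at the command line, pulling the amount out of the sample entry:

```shell
# extract the third comma-separated field (the amount) from the sample entry
echo '4/10/14,subscriptions,19.99,Linux Journal' | cut -d, -f3
# prints: 19.99
```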

In fact, once the output is sorted by category, it's a simple awk script to read the CSV file line by line, testing each category against the previous and summing up values as it goes:

BEGIN { sum=0; category=""; FS="," }
{ if ( $2 != category ) {
    if ( sum > 0 ) { print category," == ",sum; }
    sum = $3; category = $2
  } else {
    sum += $3
  }
}
END { print category, " == ", sum }

Awk scripts are blocks of code that match specified regular expressions, although all three of the blocks above are somewhat special cases. The first, BEGIN, is executed before the first line of input is read, so it just initializes variables. The last, END, is run after the last line is processed.

And the middle section? It has no regular expression at all, so it matches every line. Since the field separator is set to a comma, within the main awk block, $1 is the date, $2 is the category, $3 is the amount and $4 is the comment.

For the sample input line, it'd be:

$1 = 4/10/14
$2 = subscriptions
$3 = 19.99
$4 = Linux Journal

That's easy enough, and easy to understand, I expect. The code's also quite readable, so you can see what's going on.
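To see the summing in action, here's a quick test run, a sketch that assumes the input is already sorted by category and free of embedded commas:

```shell
# feed two pre-sorted, comma-free expense lines to the summing awk script
printf '%s\n' \
  '4/7/14,subscriptions,199.99,Ask Dave Taylor Monthly' \
  '4/10/14,subscriptions,1300.99,Linux Journal' |
awk 'BEGIN { sum=0; category=""; FS="," }
  { if ( $2 != category ) {
      if ( sum > 0 ) { print category," == ",sum; }
      sum = $3; category = $2
    } else {
      sum += $3
    }
  }
  END { print category, " == ", sum }'
# prints a line like: subscriptions == 1500.98
```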

The problem? The problem arose when I encountered lines where one of the fields had a comma. For example, if I had the comment field on this line be “Linux Journal, annual”, the CSV output would be:

4/10/14,subscriptions,19.99,"Linux Journal, annual"

If you're thinking about the field separator, it's immediately obvious what's going to cause trouble. Instead of actually escaping the comma, Excel has just quoted the field that has the comma in the output.

In this particular instance, it's not that big of a problem. All that happens is that instead of having “Linux Journal, annual” as field 4, you'd end up with “Linux Journal” in field 4 and “annual” in field 5.
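You can see the shift directly with cut:

```shell
# a quoted comma in the comment shifts the comma-separated fields by one
line='4/10/14,subscriptions,19.99,"Linux Journal, annual"'
echo "$line" | cut -d, -f4    # prints: "Linux Journal
echo "$line" | cut -d, -f5    # prints:  annual"
```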

Where this does turn out to be a problem is with the expense itself. In particular, Excel displays four-digit values with a comma if they're a currency: 1,300.00

With. A. Comma.

And, that comma survives the export to CSV format, which is a bit mind-boggling. Suffice it to say, it turns out to be tricky, as you can see here:

4/10/14,subscriptions,"1,300.99",Linux Journal

The easy way to solve the problem is to choose a different cell format, one that stops the spreadsheet from exporting values with embedded commas. But hey, you read my column, so you're probably used to taking the long, circuitous route. So, let's do it again!

A bit of analysis reveals that if you simply split out the lines that contain quotes from those that don't, you quickly can identify those that need fixing or tweaking. Let's start with the raw file that contains two lines: one with the embedded comma problem, one without:

4/7/14,subscriptions,199.99,Ask Dave Taylor Monthly
4/10/14,subscriptions,"1,300.99",Linux Journal

There are lots of ways to identify the line with the problem, including picking lines with more than the expected four fields, but let's do something easier:

$ grep \" expenses.csv
4/10/14,subscriptions,"1,300.99",Linux Journal

The cut command now can be used to extract just the quoted field—cut -d\" -f2—and then any comma removed with sed.
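Chained together, the two commands clean up the quoted amount; here's a quick test on the problem line:

```shell
# pull the quoted field with cut, then delete any commas within it with sed
echo '4/10/14,subscriptions,"1,300.99",Linux Journal' |
  cut -d\" -f2 | sed 's/,//g'
# prints: 1300.99
```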

In other words, use a script block like this, if the line in question is stored in the variable inline:

f1=$(echo $inline | cut -d\" -f1)
f2=$(echo $inline | cut -d\" -f2)
f3=$(echo $inline | cut -d\" -f3)

Let's examine what these three cut statements do: f1 is everything prior to the first quote mark; f2 is everything that's been quoted, and f3 is everything after the quoted passage. In the case of the Linux Journal subscription, it'd look like this:

f1=4/10/14,subscriptions,
f2=1,300.99
f3=,Linux Journal

That's just about all of the hard work done because now you safely can strip the commas from f2 without affecting the rest of the line, safely stored in f1 and f3.

Then it all can be reassembled in a single line:

echo $f1`echo $f2|sed 's/,//g'`$f3

Remember here that the backticks denote a sequence that's going to be passed to a subshell and its output substituted. With the Linux Journal line, the output is exactly as desired:

4/10/14,subscriptions,1300.99,Linux Journal
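As an aside, the same substitution can be written with the POSIX $( ) syntax, which nests more cleanly than backticks. Here's an equivalent sketch using the field values from the example above:

```shell
# equivalent reassembly with $( ) command substitution instead of backticks
f1='4/10/14,subscriptions,'
f2='1,300.99'
f3=',Linux Journal'
echo "$f1$(echo $f2 | sed 's/,//g')$f3"
# prints: 4/10/14,subscriptions,1300.99,Linux Journal
```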

It turns out that's the solution, and you now have all the basic pieces of the script itself. Actually, there's no need to separate out files with quoted lines versus those that don't have quotes because that can be done within the script itself.

And so, here's the succinct script that can fix the CSV file quickly and easily:

# fix CSV files with embedded commas
while read inline
do
  if [ ! -z "$(echo $inline | grep \")" ] ; then
    f1=$(echo $inline | cut -d\" -f1)
    f2=$(echo $inline | cut -d\" -f2)
    f3=$(echo $inline | cut -d\" -f3)
    echo $f1`echo $f2|sed 's/,//g'`$f3
  else
    echo $inline
  fi
done
exit 0

Does it work? Let's give it a whirl:

$ sh fix-csv-commas.sh < expenses.csv 
4/7/14,subscriptions,199.99,Ask Dave Taylor Monthly
4/10/14,subscriptions,1300.99,Linux Journal

And there you go. As for me, well, it's back to finishing up my taxes now that I've managed to burn a few hours creating this useful “CSV-Fixer” script.

Dave Taylor has been hacking shell scripts since the dawn of the computer era. Well, not really, but still, 30 years is a long time! He's the author of the popular Wicked Cool Shell Scripts (10th anniversary update coming very soon from O'Reilly and NoStarch Press) and can be found on Twitter as @DaveTaylor and more generally at his tech site www.AskDaveTaylor.com.
