Network time service has been in trouble. Now it's getting a makeover.
Network time synchronization—aligning your computer's clock to the same Coordinated Universal Time (UTC) that everyone else is using—is both necessary and a hard problem. Many internet protocols rely on being able to exchange UTC timestamps accurate to small tolerances, but the clock crystal in your computer drifts (its frequency varies with temperature), so it needs occasional adjustments.
That's where life gets complicated. Sure, you can get another computer to tell you what time it thinks it is, but if you don't know how long that packet took to get to you, the report isn't very useful. On top of that, its clock might be broken—or lying.
To get anywhere, you need to exchange packets with several computers that allow you to compare your notion of UTC with theirs, estimate network delays, apply statistical cluster analysis to the resulting inputs to get a plausible approximation of real UTC, and then adjust your local clock to it. Generally speaking, you can sustain accuracy on the close order of 10 milliseconds this way, although asymmetrical routing delays can make it much worse if you're in a bad neighborhood of the internet.
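The core arithmetic of a single client/server exchange is simple, even if the surrounding statistics are not. Here's a minimal sketch in C of the offset and delay estimates defined by RFC 5905; the timestamps are made-up illustrative values, and a real implementation works in fixed-point NTP timestamp format rather than doubles:

    /* Minimal sketch of the per-exchange clock math from RFC 5905.
     * t1 = client transmit, t2 = server receive,
     * t3 = server transmit, t4 = client receive.
     * Illustrative values only. */
    #include <stdio.h>

    int main(void)
    {
        double t1 = 1000.000;   /* client clock */
        double t2 = 1000.062;   /* server clock */
        double t3 = 1000.063;   /* server clock */
        double t4 = 1000.105;   /* client clock */

        /* Offset of the client clock from the server's, assuming the
         * network delay is the same in both directions. */
        double offset = ((t2 - t1) + (t3 - t4)) / 2.0;

        /* Round-trip network delay, excluding server processing time. */
        double delay = (t4 - t1) - (t3 - t2);

        printf("offset %+.3f s, delay %.3f s\n", offset, delay);
        return 0;
    }

The symmetric-delay assumption behind the offset estimate is exactly what asymmetrical routing violates, which is why a bad network neighborhood degrades the result, and why the cluster analysis across several servers is needed to discount peers that are broken or lying.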
The protocol for doing this is called NTP (Network Time Protocol), and the original implementation was written near the dawn of internet time by an eccentric genius named Dave Mills. Legend has it that Dr Mills was the person who got a kid named Vint Cerf interested in this ARPANET thing. Whether that's true or not, for decades Mills was the go-to guy for computers and high-precision time measurement.
Eventually, though, Dave Mills semi-retired, then retired completely. His implementation (which we now call NTP Classic) was left in the hands of the Network Time Foundation and Harlan Stenn, the man InformationWeek feted as “Father Time” in 2015 (www.informationweek.com/it-life/ntps-fate-hinges-on-father-time/d/d-id/1319432). Unfortunately, on NTF's watch, some serious problems accumulated. By that year, the codebase was already more than a quarter-century old, and techniques that had been state of the art when it was first built were showing their age. The code had become rigid and difficult to modify, a problem exacerbated by the fact that very few people actually understood the Byzantine time-synchronization algorithms at its core.
Among the real-world symptoms of these problems were serious security issues. That same year of 2015, InfoSec researchers began to realize that NTP Classic installations were being routinely used as DDoS amplifiers—ways for crackers to packet-lash target sites by remote control. NTF, which had complained for years of being under-budgeted and understaffed, seemed unable to fix these bugs.
This is intended to be a technical article, so I'm going to pass lightly over the political and fundraising complications that ensued. There was, alas, a certain amount of drama. When the dust finally settled, a very reluctant fork of the Mills implementation had been performed in early June 2015 and named NTPsec (https://www.ntpsec.org). I had been funded on an effectively full-time basis by the Linux Foundation to be NTPsec's architect and tech lead, and we had both the nucleus of a capable development team and some serious challenges.
This much about the drama I will say because it is technically relevant: one of NTF's major problems was that although NTP Classic was nominally under an open-source license, NTF retained pre-open-source habits of mind. Development was closed and secretive, technically and socially isolated by NTF's determination to keep using the BitKeeper version-control system. One of our mandates from the Linux Foundation was to fix this, and one of our first serious challenges was simply moving the code history to git.
This is never trivial for a codebase as large and old as NTP Classic, and it's especially problematic when the old version-control system is proprietary with code you can't touch. I ended up having to revise Andrew Tridgell's SourcePuller utility heavily—yes, the same code that triggered Linus Torvalds' famous public break with BitKeeper back in 2005—to do part of the work. The rest was tedious and difficult hand-patching with reposurgeon (www.catb.org/esr/reposurgeon). A year later in May 2016—far too late to be helpful—BitKeeper went open source.
Getting a clean history conversion to git took ten weeks, and grueling as that was, it was only the beginning. I had a problem: I was expected to harden and secure the NTP code, but I came in knowing very little about time service and even less about security engineering. I'd picked up a few clues about the former from my work leading GPSD (catb.org/gpsd), which is widely used for time service. Regarding the latter, I had some basics about how to harden code—because when you get right down to it, that kind of security engineering is a special case of reliability engineering, which I do understand. But I had no experience with the “adversarial mindset”, the kind of active defense that good InfoSec people practice, nor any instinct for it.
A way forward came to me when I remembered a famous quote by C. A. R. Hoare: “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies.” A slightly different angle on this was the perhaps better-known aphorism by Saint-Exupéry that I was to adopt as NTPsec's motto: “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
In the language of modern InfoSec, Hoare was talking about reducing attack surface, global complexity and the scope for unintended interactions leading to exploitable holes. This was bracing, because it suggested that maybe I didn't actually need to learn to think like an InfoSec specialist or a time service expert. If I could refactor, cut and simplify the NTP Classic codebase enough, maybe all those domain-specific problems would come out in the wash. And if not, then at least taking the pure software-engineering approach I was comfortable with might buy me enough time to learn the domain-specific things I needed to know.
I went all-in on this strategy. It drove my argument for one of the very first decisions we made, which was to code to a fully modern API—pure POSIX and C99. This was only partly a move for ensuring portability; mainly I wanted a principled reason (one we could give potential users and allies) for ditching all the cruft in the codebase from the big-iron UNIX era.
And there was a lot of that. The code was snarled with portability #ifdefs and shims for a dozen ancient UNIX systems: SunOS, AT&T System V, HP-UX, UNICOS, DEC OSF/1, Dynix, AIX and others more obscure—all relics from the days before API standardization really took hold. The NTP Classic people were too terrified of offending their legacy customers to remove any of this stuff, but I knew something they apparently didn't. Back around 2006, I had done a cruft-removal pass over GPSD, pulling it up to pretty strict POSIX conformance—and nobody from GPSD's highly varied userbase ever said boo about it or told me they missed the ancient portability shims at all. Thus, what I had in my pocket was nine years of subsequent GPSD field experience telling me that the standards people had won their game without most UNIX systems programmers actually capturing all the implications of that victory.
So I decrufted the NTP code ruthlessly. Sometimes I had to fight my own reflexes in order to do it. I too have long been part of the culture that says “Oh, leave in that old portability shim, you never know, there just might still be a VAX running ISC/5 out there, and it's not doing any harm.”
But when your principal concern is reducing complexity and attack surface, that thinking is wrong. No individual piece of obsolete code costs very much, but in a codebase as aged as NTP Classic, the cumulative burden on readability and maintainability becomes massive and paralyzing. You have to be hard about this; it all has to go, or exceptions will pile up on you, and you'll never achieve the mission objective.
I'm emphasizing this point because I think much of what landed NTP Classic in trouble was not want of skill but a continuing failure of what one might call surgical courage—the kind of confidence and determination it takes to make that first incision, knowing that you're likely to have to make a bloody mess on the way to fixing what's actually wrong. Software systems architects working on legacy infrastructure code need this quality almost as much as surgeons do.
The same applies to superannuated features. The NTP Classic codebase was full of dead ends, false starts, failed experiments, drivers for obsolete clock hardware, and other code that might have been a good idea once but had long outlived the assumptions behind it—Mode 7 control messages, Interleave mode, Autokey, an SNMP dæmon that was never conformant to the published standard and never finished, and a half-dozen other smaller warts. Some of these (Mode 7 handling and Autokey especially) were major attractors for security defects.
As with the port shims, these lingered in the NTP Classic codebase not because they couldn't have been removed, but because NTF cherished compatibility back to the year zero and had an allergic reaction to the thought of removing any features at all.
Then there were the incidental problems, the largest of which was Classic's build system. It was a huge, crumbling, buggy, poorly documented pile of autoconf macrology. One of the things that jumped out at me when I studied NTF's part of the code history was that in recent years they seemed to spend as much or more effort fighting defects in their build system as they did modifying code.
But there was one amazingly good thing about the NTP Classic code: despite all these problems, it still worked. It wheezed and clanked and was rife with incidental security holes, but it did the job it was supposed to do. When all was said and done, and all the problems admitted, Dave Mills had been a brilliant systems architect, and even groaning under the weight of decades of unfortunate accretions, NTP Classic still functioned.
Thus, the big bet on Hoare's advice at the heart of our technical strategy unpacked to two assumptions: 1) that beneath the cruft and barnacles the NTP Classic codebase was fundamentally sound, and 2) that it would be practically possible to clean it up without breaking that soundness.
Neither assumption was trivial. This could have been the a priori right bet on the odds and still have failed because the Dread God Finagle and his mad prophet Murphy micturated in our soup. Or, the code left after we scraped off the barnacles could have turned out to be unsound, fundamentally flawed.
Nevertheless, the success of the team and the project at its declared objectives was riding on these premises. Through 2015 and early 2016, that was a constant worry in the back of my mind. What if I was wrong? What if I was like the drunk in that old joke, looking for his keys under the streetlamp when he'd dropped them two darkened streets over, because “Offisher, this is where I can see.”
The final verdict is not quite in on that question; as I write, NTPsec is still in beta. But, as we shall see, there are now (in August 2016) solid indications that the project is on the right track.
One of our team's earliest victories after getting the code history moved to git was throwing out the autoconf build recipe and replacing it with one written in a new-school build engine called waf (also used by Samba and RTEMS). Builds became much faster and more reliable. Just as important, this made the build recipe an order of magnitude smaller so it could be comprehended as a whole and maintained.
Another early focus was cleaning up and updating the NTP documentation. We did this before most of the code modifications because the research required to get it done was an excellent way to build knowledge about what was actually going on in the codebase.
These moves began a virtuous cycle. With the build recipe no longer a buggy and opaque mess, the code could be modified more rapidly and with more confidence. Each bit of cruft removal lowered the total complexity of the codebase, making the next one slightly easier.
Testing was pretty ad hoc at first. Around May 2016, for reasons not originally related to NTPsec, I became interested in Raspberry Pis. Then it occurred to me that they would make an excellent way to run long-term stability tests on NTPsec builds. Thus, it came to be that the windowsill above my home-office desk is now home to six headless Raspberry Pis, all equipped with on-board GPSes, all running stability and correctness tests on NTPsec 24/7—just as good as a conventional rack full of servers, but far less bulky and expensive!
We got a lot done during our first 14 months. The headline number that shows just how much is the change in the codebase's total size: we went from 227KLOC to 75KLOC, cutting the total line count by a full factor of three.
Dramatic as that sounds, it actually understates the attack-surface reduction we achieved, because complexity was not evenly distributed in the codebase. The worst technical debt, and the security holes, tended to lurk in the obsolete and semi-obsolete code that hadn't gotten any developer attention in a long time. NTP Classic was not exceptional in this; I've seen the same pattern in other large, old codebases I've worked on.
Another important measure was systematically hunting down and replacing all unsafe C function calls with equivalents that can provably not cause buffer overruns. I'll quote from NTPsec's hacking guide:
strcpy, strncpy, strcat: use strlcpy and strlcat instead.
sprintf, vsprintf: use snprintf and vsnprintf instead.
In scanf and friends, the %s format without length limit is banned.
strtok: use strtok_r() or unroll this into the obvious loop.
gets: use fgets instead.
gmtime(), localtime(), asctime(), ctime(): use the reentrant *_r variants.
tmpnam(): use mkstemp() or tmpfile() instead.
dirname(): the Linux version is re-entrant but this property is not portable.
This formalized an approach I'd used successfully on GPSD—instead of fixing defects and security holes after the fact, constrain your code so that it cannot have entire classes of defects.
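To make that concrete, here's a minimal sketch of the substitution pattern; the helper and buffer here are hypothetical, not lifted from the NTPsec sources. The point is that a bounded call such as C99's snprintf can truncate but can never write past the end of its destination (strlcpy and strlcat are BSD extensions rather than ISO C, so they need a portability arrangement I'm not showing here):

    #include <stdio.h>

    /* Hypothetical helper, not NTPsec code: copy an untrusted string
     * into a fixed-size buffer.  The classic idiom strcpy(dst, src)
     * overruns dst whenever src is too long; the bounded form cannot,
     * and it always NUL-terminates the result, truncating if needed. */
    static void copy_name(char *dst, size_t dstsize, const char *src)
    {
        snprintf(dst, dstsize, "%s", src);
    }

    int main(void)
    {
        char name[16];
        copy_name(name, sizeof(name), "a-string-much-longer-than-the-buffer");
        printf("%s\n", name);   /* prints a safely truncated copy */
        return 0;
    }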
The experienced C programmers out there are thinking "What about wild-pointer and wild-index problems?" And it's true that the achtung verboten above will not prevent those kinds of overruns. That's why another prong of the strategy was systematic use of static code analyzers like Coverity, which actually is pretty good at picking up the defects that cause that sort of thing. It's not 100% perfect (C will always allow you to shoot yourself in the foot), but I knew from prior success with GPSD that the combination of careful coding with automatic defect scanning can reduce your bug load a very great deal.
To help defect scanners do a better job, we enriched the type information in the code. The largest single change of this kind was changing int variables to C99 bools everywhere they were being used as booleans.
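A tiny before-and-after illustration, with hypothetical flag names rather than anything from the real code:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical flags, not lifted from the NTPsec sources.  Formerly
     * these would have been declared "int" and used as booleans; C99
     * bool tells both human readers and static analyzers that only
     * true/false values occur. */
    struct peer_status {
        bool authenticated;
        bool reachable;
    };

    int main(void)
    {
        struct peer_status p = { .authenticated = true, .reachable = false };
        printf("usable: %d\n", p.authenticated && p.reachable);
        return 0;
    }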
Little things also mattered, like fixing all compiler warnings. I thought it was shockingly sloppy that the NTP Classic maintainers hadn't done this. The pattern detectors behind those warnings are there because they often point at real defects. Also, voluminous warnings make it too easy to miss actual errors that break your build. And you never want to break your build, because later on, that will make bisection testing more difficult.
An early sign that this systematic defect-prevention approach was working was the extremely low rate of bugs we detected by testing as having been introduced during our cleanup. In the first 14 months, we averaged less than one iatrogenic C bug every 90 days.
I would have had a lot of trouble believing that if GPSD hadn't posted a defect frequency nearly as low during the previous five years. A major lesson from both projects is that applying best practices in coding and testing really works. I pushed this point back in 2012 in my essay on GPSD for The Architecture of Open Source Applications, Volume 2 (www.aosabook.org/en/gpsd.html); what NTPsec shows is that GPSD is not a fluke.
I think this is one of the most important takeaways from both projects. We really don't have to settle for what have historically been considered “normal” defect rates in C code. Modern tools and practices can go a very long way toward driving those defect rates toward zero. It's no longer even very difficult to do the right thing; what's too often missing is a grasp of the possibility and the determination to pursue it.
And here's the real payoff. Early in 2016, CVEs (security alerts) began to be issued against NTP Classic that NTPsec dodged, because we had already cut out their attack surface before we knew there was a bug! This actually became a regular thing, with the percentage of dodged bullets increasing over time. Somewhere, Hoare and Saint-Exupéry might be smiling.
The cleanup isn't done yet. We're testing a major refactoring and simplification of the central protocol machine for processing NTP packets. We believe this already has revealed a significant number of potential security defects nobody ever had a clue about before. Every one of these will be another dodged bullet attributable to getting our practice and strategic direction right.
I have yet to mention new features, because NTPsec doesn't have many; that's not where our energy has been going. But, here's one that came directly out of the cleanup work.
When NTP was originally written, computer clocks delivered only microsecond precision. Now they deliver nanosecond precision (though not all of that precision is accurate). By changing some internal representations, we have made NTPsec able to use the full precision of modern clocks when stepping them, which can yield an accuracy improvement of a factor of 10 or more with real hardware, such as GPSDOs and dedicated time radios.
Fixing this was about a four-line patch. It might have been noticed sooner if the code hadn't been using an uneasy mixture of microsecond and nanosecond precision for historical reasons. As it is, anything short of the kind of systematic API-usage update we were doing would have been quite unlikely to spot the problem.
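Here's a minimal sketch of the precision difference, illustrative only and not the actual patch: the pre-POSIX interfaces traffic in struct timeval, whose finest-grained field is microseconds, while the POSIX clock_gettime()/clock_settime() calls use struct timespec, whose finest-grained field is nanoseconds, so a sub-microsecond correction survives only through the latter.

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec ts;

        if (clock_gettime(CLOCK_REALTIME, &ts) != 0) {
            perror("clock_gettime");
            return 1;
        }

        /* Suppose the clock discipline wants to step the clock forward
         * by 1534 nanoseconds (an illustrative number). */
        long step_ns = 1534;
        ts.tv_nsec += step_ns;
        if (ts.tv_nsec >= 1000000000L) {    /* carry into whole seconds */
            ts.tv_sec += 1;
            ts.tv_nsec -= 1000000000L;
        }

        /* Through the old settimeofday()/struct timeval path this step
         * would be truncated to whole microseconds; clock_settime(
         * CLOCK_REALTIME, &ts) can apply it exactly.  It is not called
         * here because it requires root privilege. */
        printf("would step to %lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
        return 0;
    }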
A longstanding pain point we've begun to address is the nigh-impenetrable syntax of the ntp.conf file. We've already implemented a new syntax for declaring reference clocks that is far easier to understand than the old. We have more work planned toward making composing NTP configurations less of a black art.
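As a rough illustration of the difference (the exact NTPsec option names here are an assumption from memory, so treat them as a sketch rather than a reference): NTP Classic addressed a reference clock through a magic pseudo-IP address whose last two octets encode driver type and unit, with a separate fudge line for its options, while the new syntax names the driver directly.

    # NTP Classic: reference clock driver type 28 (shared memory, typically
    # fed by gpsd), unit 0, addressed through a magic pseudo-IP, with its
    # options split onto a separate "fudge" line.
    server 127.127.28.0 minpoll 4
    fudge  127.127.28.0 refid GPS

    # NTPsec: the same clock declared by driver name (illustrative; check
    # the current ntp.conf documentation for the exact option spellings).
    refclock shm unit 0 refid GPS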
The diagnostic tools shipped with NTP Classic were messy, undocumented and archaic. We have a new tool, ntpviz, which gives time-server operators a graphical and much more informative view of what's been going on in the server logfiles. This will assist in understanding and mitigating various sources of inaccuracy.
We don't think our 1.0 release is far in the future—in fact, given normal publication delays, it might well have shipped by the time you read this. Our early-adopter contingent includes a high-frequency-trading company for which accurate time is business-critical. The company hasn't actually put NTPsec in production yet, though its techie in charge of time actively contributes to our project and expects to adopt it for production in the not-distant future.
There remains much work to be done after 1.0. We're cooperating closely with IETF to develop a replacement for Autokey public-key authentication that actually works. We want to move as much of the C code as possible outside ntpd itself to Python in order to reduce long-term maintenance load. There's a possibility that the core dæmon itself might be split in two to separate the TCP/IP parts from the handling of local reference clocks, drastically reducing global complexity.
Beyond that, we're gaining insight into the core time-synchronization algorithms and suspect there are real possibilities for improvement in those. Better statistical filtering that's sensitive to measurements of network weather and topology looks possible.
It's an adventure, and we welcome anyone who'd like to join in. NTP is vital infrastructure, and keeping it healthy over a time frame of decades will need a large, flourishing community. You can learn more about how to take part at our project website: https://www.ntpsec.org.