LJ Archive

diff -u

What's New in Kernel Development by Zack Brown

New NOVA Filesystem

Andiry Xu (working with Lu Zhang, Steven Swanson and others) posted patches for a new filesystem called NOVA (NOn-Volatile memory Accelerated). Normal RAM chips are wiped every time you turn off your computer. Non-volatile RAM retains its data across reboots. Their project targeted byte-addressable non-volatile memory chips, such as Intel's 3DXpoint DIMMs. Andiry said that the current incarnation of their code was able to do a lot already, but they still had a big to-do list, and they wanted feedback from the kernel people.

Theodore Y. Ts'o gave the patches a try, but he found that they wouldn't even compile without some fixes, which he posted in reply. Andiry said they'd adapt those fixes into their patches.

The last time NOVA made an appearance on the kernel mailing list was August 2017, when Steven made a similar announcement. This time around, they posted a lot more patches, including support for SysFS controls, Kconfig compilation options and a significant amount of documentation.

One of NOVA's main claims to fame, aside from supporting non-volatile RAM, is that it is a log-based filesystem. Other filesystems generally map out their data structures on disk and update those structures in place. This is good for saving seek-time on optical and magnetic disks. Log-based filesystems write everything sequentially, trailing old data behind them. The old data then can be treated as a snapshot of earlier states of the filesystem, or it can be reclaimed when space gets tight.

Log-based filesystems are not necessarily preferred for optical and magnetic drives, because sequential writes will tend to fragment data and slow things down. Non-volatile RAM is based on different technology that has faster seek-times, making a log-based approach a natural choice.

NOVA goes further than most log-based filesystems, which tend to have a single log for the whole filesystem, and instead maintains a separate log for each inode. Using the log data, NOVA can perform writes either in place like traditional filesystems or as copy-on-write (COW) operations, which keep the old version of a file until the new version has been written. This has the benefit of being able to survive catastrophic events like sudden power failures in the middle of doing a write, without corrupting the filesystem.

There were lots of responses to the patches from Andiry and the rest of his team. Most were bug reports and criticism, but no controversy. Everyone seemed to be interested in helping them get their code right so the patches could get into the main tree quickly.

Removing All Syscall Invocations from Kernel Space

There's an effort underway to reduce and ultimately remove all system call invocations from within kernel space. Dominik Brodowski was leading this effort, and he posted some patches to remove a lot of instances from the kernel. Among other things, he said, these patches would make it easier to clean up and optimize the syscall entry points, and also easier to clean up the parts of the kernel that still needed to pretend to be in userspace, just so they could keep using syscalls.

The rationale behind these patches, as expressed by Andy Lutomirski, ultimately was to prevent user code from ever gaining access to kernel memory. Sharing syscalls between kernel space and user space made that impossible at the moment. Andy hoped the patches would go into the kernel quickly, without needing to wait for further cleanup.

Linus Torvalds had absolutely no criticism of these patches, and he indicated that this was a well desired change. He offered to do a little extra housekeeping himself with the kernel release schedule to make Dominik's tasks easier. Linus also agreed with Andy that any cleanup effort could wait—he didn't mind accepting ugly patches to update the syscall calling conventions first, and then accept the cleanup patches later.

Ingo Molnar predicted that with Dominik's changes, the size of the compiled kernel would decrease—always a good thing. But Dominik said no, and in fact, he ran some quick numbers for Ingo and found that with his patches, the compiled kernel was actually a few bytes larger. Ingo was surprised but not mortified, saying the slight size increase would not be a showstopper.

This project is similar—although maybe smaller in scope—to the effort to get rid of the big kernel lock (BKL). In the case of the BKL, no one could figure out for years even how to begin to replace it, until finally folks decided to convert all BKL instances into identical local implementations that could be replaced piecemeal with more specialized and less heavyweight locks. After that, it was just a question of slogging through each one until finally even the most finicky instances were replaced with more specialized locking code.

Dominik seems to be using a similar technique now, in which areas of the kernel that still need syscalls can masquerade as userspace, while areas of the kernel that are easier to fix get cleaned up first.

Loading Arbitrary Executables as Kernel Modules

Alexei Starovoitov posted some patches to allow the kernel to load regular ELF binaries (aka plain executables) as kernel modules. These modules would be able to run user-mode helper routines instead of being absolutely confined to kernel space.

Alexei listed a variety of benefits for this. For one thing, as a user process, an ELF-based module could crash without bringing down the rest of the kernel. And although the ELF modules would run with root privileges, he said that a security breach would not lead directly into accessing the kernel's inner workings, but at least initially would be confined to userspace. The ELF module also could be terminated by the out-of-memory (OOM) killer, in case of need, or ended directly by a human administrator. It additionally would be feasible to subject ELF-based modules to regular userspace debugging and profiling, using the vast array of tools available for that.

Initially there were various technical questions and criticisms, but no one spoke out immediately against it. Linus Torvalds said he liked the feature, but he wanted one change: to make the type of module visible in the system logs. He said:

When we load a regular module, at least it shows in lsmod afterwards, although I have a few times wanted to really see module load as an event in the logs too. When we load a module that just executes a user program, and there is no sign of it in the module list, I think we *really* need to make that event show to the admin some way.

And he said specifically, "I do *not* want this to be a magical way to hide things."

Andy Lutomirski raised a pertinent question: why not just retool the modprobe program to handle ELF binaries as desired, rather than doing anything with kernel code at all? In other words, why couldn't this feature be implemented entirely outside the kernel?

But Linus replied:

The less we have to mess with user-mode tooling, the better.

We've been *so* much better off moving most of the module loading logic to the kernel, we should not go back in the old broken direction.

I do *not* want the kmod project that is then taken over by systemd, and breaks it the same way they broke firmware loading.

Keep modprobe doing one thing, and one thing only: track dependencies and mindlessly just load the modules. Do *not* ask for it to do anything else.

Right now kmod is a nice simple project. Lots of testsuite stuff, and a very clear goal. Let's keep kmod doing one thing, and not even have to care about internal kernel decisions like "oh, this module might not be a module, but an executable".

If anything, I think we want to keep our options open, in the case we need or want to ever consider short-circuiting things and allowing direct loading of the simple cases and bypassing modprobe entirely.

At one point, Kees Cook did offer some more serious criticisms of the patch's basic goal. Primarily, he noticed that Alexei's patch could be used to—and was intentionally designed to—execute arbitrary code in userspace automatically. This was different from the kernel's normal approach to modules, in which they could be loaded but not executed automatically.

Kees said, "This just extends all the problems we've had with defining security boundaries with modules out to umh [user-mode helper code] too. I would need some major convincing that this can be made safe."

He pointed out that a certain class of kernel bugs—apparently prevalent in the recent past—could redirect module loading outside of a virtual machine (that is, a container) and into the main kernel itself. And since containers could trigger loading an arbitrary module, this meant that a hostile user potentially could load an ELF module, redirect it back to the main kernel, and execute its attacking code immediately with full privileges.

Kees refused to let the patch go into the kernel as written. He said:

At the very least, you need to solve the execution environment problems here: the ELF should run with no greater privileges than what loaded the module, and very importantly, must not be allowed to bypass these checks through autoloading. What triggered the autoload must be the environment, not the "modprobe", since that's running with full privileges.

On the flip side, however, Kees acknowledged that Alexei's patch was an "interesting idea. I think it can work, it just needs much much more careful security boundaries and to solve our autoloading exposures too."

However, Alexei characterized Kees' response as "security paranoia without single concrete example of a security issue."

And Andy also disagreed with Kees' assessment. He pointed out that Kees' issue depended on an attacker finding and exploiting an additional vulnerability that would allow containers to redirect a module outside of itself—something that was not a kernel feature and that would be treated as a bug if it were ever discovered.

Kees agreed with Andy that the problem was not with Alexei's code but instead with potential vulnerabilities elsewhere in the kernel. He said, "I just don't want to extend that problem further." And he added that he wasn't opposed to Alexei's patch, but that his concerns were not paranoia, and "there are very real security boundary violations in this model."

At one point, in defense of Alexei's approach, Andy said, "I don't see how this is any more exploitable than any other init_module()." And Linus replied:

Absolutely. If Kees doesn't trust the files to be loaded, an executable—even if it's running with root privileges and in the initns—is still fundamentally weaker than a kernel module.

So I don't understand the security argument AT ALL. It's nonsensical. The executable loading does all the same security checks that the module loading does, including the signing check.

Kees acknowledged that his concern was not with Alexei's code itself, or even with the design of the feature. But he felt that if certain other bugs did appear in the kernel—as they had before—then someone would be able to exploit the feature to run arbitrary code at the root level.

However, Linus had spoken, and Kees' concern over potential future bugs were apparently not a showstopper. And just to hammer it home, David S. Miller reiterated Linus' point that kernel modules were far more dangerous than executable code, because they could access any container and namespace they pleased.

But this was not the end of the story!

Close by to this part of the conversation, Linus said to Kees:

My own personal worry is actually different—we do check the signature of the file we're loading, but we're then passing it off to execve() not as the image we loaded, but as the file pointer. So the execve() will end up not using the actual buffer we checked the signature on, but instead just re-reading the file.

Among other things, Linus said, "somebody could maybe try to time it and modify the file after-the-fact of the signature check, and then we execute something else."

He went on to say:

Initially, I thought it was a non-issue, because anybody who controls the module subdirectory enough to rewrite files would be in a position to just execute the file itself directly instead. But it turns out that isn't needed. Some bad actor could just do finit_module() with a file that they just *copied* from the module directory.

Linus said this issue had to be addressed before the patch could go into the kernel.

Andy also noticed something else that might be a deal-killer. He said:

This patch is a potentially severe ABI break. Right now, loading a module *copies* it into memory and does not hold a reference to the underlying fs. With the patch applied, all kinds of use cases can break in gnarly ways. Initramfs is maybe okay, but initrd may be screwed. If you load an ET_EXEC module from initrd, then umount it, then clear the ramdisk, something will go horribly wrong. Exactly what goes wrong depends on whether userspace notices that umount() failed. Similarly, if you load one of these modules over a network and then lose your connection, you have a problem.

He explained further, "Without your patch, init_module doesn't keep using the file, so it's common practice to load a module and then delete or unmount it. With your patch, the unmount case breaks. This is likely to break existing userspace, so, in Linux speak, it's an ABI break."

At this point—regarding Linus' security exploit—Andy felt that Kees' thumbs-up would be more important than he had at first. Kees' responsibility was module security, which Andy had thought was not an issue earlier in the discussion. Now that it was, it had become more important to get Kees' blessing on this patch. Andy pointed out to Alexei, "Kees is very reasonable, and he'll change his mind and ack a patch that he's nacked when presented with a valid technical argument."

However, he also said:

My ABI break observation is also a major problem, and Linus is going to be pissed if this thing lands in his tree and breaks systems due to an issue that was raised during review. So I think you need to either rework the patch or do a serious survey of how all the distros deal with modules (dracut, initramfs-tools, all the older stuff, and probably more) and make sure they can all handle your patch.

Alexei replied that neither of these problems were real issues. The ABI (application binary interface) break didn't really break the kernel ABI, and the security issue was not a real concern. He said, "I think you need to stop overreacting on a non-issue."

There was a bit of back and forth between them. It turned out that Alexei didn't believe there was an ABI breakage because in his intended use case, everything would be done identically to the way it was now, and so nothing would be broken. But Greg Kroah-Hartman replied, "For your use case, yes. For mine and Andy's and someone else's in the future, it might be." He added, "You are creating a very generic, new, user/kernel api that a whole bunch of people are going to want to use. Let's not hamper the ability for us all to use this right from the beginning please."

And in terms of specific use cases, Greg said:

We have userspace drivers for USB today, being able to drag that out-of-tree codebase into the kernel is a HUGE bonus, and something that I would love to do for a lot of reasons. I also can see moving some of our existing in-kernel drivers out of the kernel in a way that provides "it just works" functionality by using this type of feature.

A bunch of folks, including Linus, started debating ways to address the problems that had been identified so far. Alexei got on board after a while and started implementing changes as they were identified by the group.

It seems clear that this feature will go into the kernel. It provides cool functionality that is hotly desired by Linus and others. But the timing of getting the code into the kernel will depend on how well Alexei fixes the various problems, and whether new security or ABI issues arise.

This whole discussion was interesting on a number of levels. I particularly like the speed with which the critics and defenders of Alexei's patch would change positions, without regard to ego or fear of being seen as "wrong" or anything like that. And what had started as a wholehearted acceptance of a new feature, became concern over its possible problems and a quest to resolve them in a useful way.

I also like that Alexei's initial minor defensiveness was not treated as cause for bullying, and everyone simply tried to keep the discussion productive. It doesn't always go that way in kernel development.

And of course, I also find it exciting to be able to look in on discussions of potential kernel weaknesses, how they might be exploited by hostile actors, and what might be done to stop them. In the early days, that was exactly how the kernel folks used to talk about Microsoft—right out in the open! What might Microsoft do to destroy open source? How might it destroy the GPL? How might it destroy Linux? And all the while, the kernel people knew that the Microsoft people were reading their every post, just as they know now that hostile attackers are eagerly poking and prodding the mailing list discussions and every patch submission, looking for usable exploits.

Removing Support for Dead Hardware

Arnd Bergmann submitted a patch to remove the Linux ports for a variety of architectures, including blackfin, cris, frv, m32r, metag, mn10300, score and tile. To do this, he worked directly with the former maintainers of each port to make sure the code removal was done right and didn't break anything in the mainline kernel or anywhere else.

The bottom line was that no one used those architectures anymore. He offered his analysis of why this had happened, saying:

It seems that while the eight architectures are extremely different, they all suffered the same fate: There was one company in charge of an SoC line, a CPU microarchitecture and a software ecosystem, which was more costly than licensing newer off-the-shelf CPU cores from a third party (typically ARM, MIPS, or RISC-V). It seems that all the SoC product lines are still around, but have not used the custom CPU architectures for several years at this point. In contrast, CPU instruction sets that remain popular and have actively maintained kernel ports tend to all be used across multiple licensees.

Linus Torvalds had no objection to ripping those architectures out of the kernel, but he did say, "I'd like to see that each architecture removal is independent of the others, so that if somebody wants to resurrect any particular architecture, he/she can do so with a revert."

Linus pulled the patch into the main kernel tree and noted with glee that it took a half-million lines of code out of the kernel.

Linus was not the only one who wanted to ensure the possibility of easily resurrecting those architectures. Geert Uytterhoeven wanted to know exactly what would be required, since he had an interest in the formerly removed and later resurrected arch/h8300 architecture, currently still in the kernel and going strong. And he pointed out, "In reality, a resurrection may not be implemented as a pure revert, but as the addition of a new architecture, implemented using modern features."

To which Pavel Machek complained, "By insisting on new features instead of pure revert + incremental updates, you pretty much make sure resurrection will not be possible."

But Arnd pointed out, "now that the other architectures are gone, a lot of changes can be done more easily that will be incompatible with a pure revert, so the more time passes, the harder it will get to do that."

And he added, "Some of the architectures (e.g. tile or cris) have been kept up to date, but others had already bitrotted to the point where they were unlikely to work on any real hardware for many releases, but a revert could still be used as a starting point in theory."

In any case, Arnd's patches are sailing into the kernel, and those architectures are, for the moment, dead. But even so, it's odd to see them removed. There will certainly come a day in the far distant future when a hardware aficionado decides to write an emulator for them; and then they'll have to root around in the ancient logs of the kernel git repository to find versions of Linux that supported them, and those versions of Linux will run only on ancient hardware that itself will exist only in emulated form. So they'd have ancient hardware running on an ancient kernel on top of an emulated ancient CPU being simulated by the latest Quantum Sine-Qua-Non Ultra Linux (QSQNUL) , powered by black-hole turbines and puppy-dog hearts.

It'll happen.

Note: if you're mentioned in this article and want to send a response, please send a message with your response text to ljeditor@linuxjournal.com and we'll run it in the next Letters section and post it on the website as an addendum to the original article.

About the Author

Zack Brown is a tech journalist at Linux Journal and Linux Magazine, and is a former author of the "Kernel Traffic" weekly newsletter and the "Learn Plover" stenographic typing tutorials. He first installed Slackware Linux in 1993 on his 386 with 8 megs of RAM and had his mind permanently blown by the Open Source community. He is the inventor of the Crumble pure strategy board game, which you can make yourself with a few pieces of cardboard. He also enjoys writing fiction, attempting animation, reforming Labanotation, designing and sewing his own clothes, learning French and spending time with friends'n'family.

Zack Brown
LJ Archive