Modernizing swapping: virtual swap spaces
52 points by voxadam 3 days ago | 48 comments

hackernudes 8 hours ago
I see some comments about soft lockups during memory pressure. I have struggled with this immensely over the years. I wrote a userspace memory reclaimer daemon and have not had a lockup since: https://gist.github.com/EBADBEEF/f168458028f684a91148f4d3e79... .

The hangs usually happened when I was stressing VFS (the computer was a samba server) along with other workloads. To trigger a hang manually I would read in large files (bigger than available ram) in parallel while running a game. I could get it to hang even with 128GB ram. I tweaked all the vm settings (swappiness, vfs_cache_pressure, etc.) to no avail. I tried with and without swap.

In the end it looked like memory was not getting reclaimed fast enough, like linux would wait too long to start reclaiming memory and some critical process would get stuck waiting for some memory. The system would hang for minutes or hours at a time only making the tiniest of progress between reclaims.

If I caught the problem early enough (just as everything started stuttering) I could trigger a reclaim manually by writing to '/sys/fs/cgroup/memory.reclaim' and the system would recover. I wonder if it was specific to btrfs or some specific workload pattern but I was never able to figure it out.
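The manual trigger is just a one-liner, roughly like this (cgroup v2, needs root; the 1G amount is an illustrative guess, not what my daemon uses):

```shell
# Ask the kernel to proactively reclaim up to 1 GiB from the root cgroup.
# Writing to memory.reclaim blocks until the kernel has tried to reclaim
# that much, which is exactly the "nudge" that got my system moving again.
echo "1G" > /sys/fs/cgroup/memory.reclaim
```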

reply
NooneAtAll3 4 hours ago
I want to express my gratitude for not falling for greedy marketers' lies and using correct base-2 for KB/MB/GB
reply
Numerlor 2 days ago
The swap/memory situation in linux has surprised me quite a bit coming from Windows.

Windows remains mostly fully responsive even when memory is being pushed to the limits and swapping gigabytes per second, while on linux, when I ran a stress test that ate all the memory, I had trouble even terminating the script

reply
dlcarrier 16 hours ago
There are two things that cause this. First, Windows has a variable swap file size whereas Linux's is fixed, so Windows can keep filling up your drive instead of running out of swap space. Second, the out-of-memory killer in Linux isn't very aggressive by default: it prefers over-committing memory to killing processes.

As far as I know, Linux still doesn't support a variable-sized swap file, but it is possible to change how aggressively it over-commits memory or kills processes to free memory.

As to why these differences exist, the reasons are more historical than technical. My best guess is that Windows figured it out sooner because it has always existed in an environment where multiple programs are memory hogs, whereas that wasn't common on Linux until the proliferation of web-based everything, requiring hundreds of megabytes to gigabytes of memory for each process running in a Chrome tab or Electron instance, even for something as simple as a news article or chat client.

Check out this series of blog posts for more information on Linux memory management: https://dev.to/fritshooglandyugabyte/series/16577

reply
Joker_vD 13 hours ago
Windows "figured it out sooner" because it never really had to seriously deal with overcommitting memory: there is no fork(), so processes' memory usage figures are accurate. On Linux, however, the non-negotiable existence of fork() really leaves one with no truly good solution (and this has been debated for decades).
reply
p_ing 8 hours ago
NT has been able to overcommit since its inception.
reply
goodpoint 11 hours ago
fork is a massive feature, not a bug.
reply
tliltocatl 8 hours ago
fork() is a misfeature, as is SIGCHLD/wait and most of Unix process management. It worked fine on the PDP-11 and that's it.

But Linux also overcommits mmap-anonymous/sbrk, while Windows leaves the decision to the user space, which is significantly slower.

reply
rwmj 6 hours ago
Not really. It elegantly solves the "create a process, letting it inherit these settings and reset these other settings" problem, where "settings" is an ever-changing and expanding list of things that you wouldn't want to bake into the API. Thus (omitting error checks and simplifying many details):

  int fd[2];
  pipe (fd);          // create a pipe to share with the child
  if (fork () == 0) { // child
    close (fd[0]);    // close some stuff (here, the unused read end)
    setrlimit (...);  // add a ulimit to the child
    sigaction (...);  // change signal masks
    // also: clean the environment, set cgroups
    execvp (...);     // run the child
  }
It's also enormously flexible. I don't know of any other API that, as well as all of the above, also lets you change the relationship of parent and child, and create duplicate worker processes.

Comparing it to Windows is hilarious because Linux can create processes vastly more efficiently and quickly than Windows.

reply
tliltocatl 4 hours ago
Yes, but at what cost? 99% of fork() calls are immediately followed by exec(), yet every kernel object needs to handle being forked, and a great deal of memory-management housekeeping is done only to be discarded afterward. And it doesn't work at all for AMP systems (which we will have to deal with, sooner or later).

In 1970 it might have been the only way to provide a flexible API, but nowadays we have a great variety of extensible serialization formats better than "struct".

reply
rwmj 3 hours ago
At no cost apparently, since Linux still manages to be much faster and more efficient than Windows.
reply
ChocolateGod 13 hours ago
Windows will also prioritise keeping the desktop and currently focused application running smoothly; the Linux kernel has no idea what's currently focused or what not to kill, so your desktop shell is right up there on the menu in OOM situations.
reply
p_ing 11 hours ago
The same behavior exists as far back as NT4 Server, which does not provide a foreground priority boost by default.
reply
kalaksi 13 hours ago
There are daemons (not installed by default) that monitor memory usage and can increase swap size or kill processes accordingly (you can ofc also configure OOM killer).
reply
LargoLasskhyfv 15 hours ago
> As far as I know, Linux still doesn't support a variable-sized swap file...

You can add (and remove) additional swapfiles at runtime, or rather on demand. I'm just unaware of any mechanism doing that automagically, though.

Could probably be done with eBPF and some shell scripts, I guess?
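The manual version is only a few commands (a sketch, needs root; the 2 GiB size and path are illustrative, and dd is safer than fallocate for swap files on some filesystems):

```shell
# Grow swap on demand: create, secure, format, and enable an extra swap file
dd if=/dev/zero of=/swapfile2 bs=1M count=2048  # 2 GiB, fully allocated (no holes)
chmod 600 /swapfile2
mkswap /swapfile2
swapon /swapfile2

# ...and tear it down again once memory pressure subsides
swapoff /swapfile2
rm /swapfile2
```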

reply
dsr_ 13 hours ago
swapspace (https://github.com/Tookmund/Swapspace) does this. Available in Debian stable.
reply
LargoLasskhyfv 9 hours ago
Wow. It's been around for 20 years, and I'm rambling about eBPF...
reply
dlcarrier 6 hours ago
Linux's eBPF has its issues, too.

I once was trying to set up a VPN that needed to adjust the TTL to keep its presence transparent, only to discover that I'd have to recompile the kernel to do so. How did packet filtering end up privileged, let alone running inside the kernel?

I recently started using SSHFS, which I can run as an unprivileged user, and suspending with a drive mounted reliably crashes the entire system. Back on the topic of swap space, any user that's in a cgroup, which is rarely the case, can also crash the system by allocating a bunch of RAM.

Linux is one of the most advanced operating systems in existence, with new capabilities regularly being added in, but it feels like it's skipped over several basics.

reply
quotemstr 13 hours ago
Huh? What does swap area size have to do with responsiveness under load? Linux has a long history of being unusable under memory pressure. systemd-oomd helps a little bit (killing processes before direct reclaim makes everything seize up), but there's still no general solution. The relevance of history is that Windows got it basically right early on and Linux never did.

Nothing to do with overcommit either. Why would that make a difference? We're talking about interactivity under load; how we got to the loaded state doesn't matter.

reply
nasretdinov 14 hours ago
In Linux the default swap behaviour is to also swap out the memory mapped from the executable file, not just memory allocated by the process. This is a relatively sane approach on servers, but not so much on desktops. I believe both Windows and macOS don't swap out code pages, so applications remain responsive, at the cost of (potentially) lower swap efficiency
reply
superjan 12 hours ago
Don't know about Macs, but on Windows executable code is treated like a read-only memory-mapped file that can be dropped and reloaded as the kernel sees fit. It could also be shared between processes, though that doesn't happen much anymore due to ASLR.
reply
demosito666 12 hours ago
Oh yeah. Bug 12309 was reported, what, almost 20 years ago? It's fair to say that at this point GNU Hurd will arrive sooner than Linux will be able to work properly under memory pressure.
reply
01HNNWZ0MV43FF 18 hours ago
I've had that same experience. On new systems I install earlyoom. I'd rather have one app die than the whole system.

You'd think after 30 years of GUIs and multi-tasking, we'd have this figured out, but then again we don't even have a good GUI framework.

reply
LtWorf 15 hours ago
I used to use it but it's too aggressive. It kills stuff too quickly.
reply
jauntywundrkind 2 days ago
Like Linux / open source often, it depends on what you do with it!

The kernel is very slow to kill stuff. Very very very very slow. It will try and try and try to avoid killing anything. It will make absolutely certain it can reclaim nothing more, crawling along, scraping together every little kilobyte it can free, swapping like mad to exhaust every option first.

But there are a number of daemons you can use if you want to be more proactive! Systemd now has systemd-oomd. It's pretty good! There's others, with other strategies for what to kill first, based on other indicators!

The flexibility is a feature, not a bug. What distro are you on? I'm kind of surprised it didn't ship with something turned on.

reply
Onavo 15 hours ago
> Windows remains mostly fully responsive even when memory is being pushed to the limits and swapping gigabytes per second

In my experience this is only on later versions of the NT kernel and only on NVMe (mostly the latter, I think).

reply
robinsonb5 15 hours ago
Yeah, I think SSD / NVMe makes all the difference here - I certainly remember XP / Vista / Win 7 boxes that became unusable and more-or-less unrecoverable (just like Linux) once a swap storm started.
reply
p_ing 12 hours ago
NT4 exhibits the same behavior under extreme load.
reply
rwmj 16 hours ago
The annoying thing I've found with Linux under memory stress (and still haven't found a nice way to solve) is I want it to always always always kill firefox first. Instead it tends to either kill nothing (causing the system to hang) or kill some vital service.
reply
0xbadcafebee 14 hours ago
Linux being... Linux, it's not easy to use, but it can do what you want.

1. Use `choom` to give your Firefox PIDs a score of +1000, so they always get reaped first

2. Use systemd to create a Control Group to limit firefox and reap it first (https://dev.to/msugakov/taking-firefox-memory-usage-under-co...)

3. Enable vm.oom_kill_allocating_task to kill the task that asked for too much memory

4. Nuclear option: change how all overcommitting works (https://www.kernel.org/doc/html/v5.1/vm/overcommit-accountin...)
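Options 1 and 3 are one-liners, roughly like this (a sketch; the sysctl needs root, and the PIDs have to be re-tagged whenever Firefox restarts):

```shell
# 1. Tag every running Firefox process as the OOM killer's first choice
#    (choom ships with util-linux; -n 1000 sets oom_score_adj to the max)
pgrep firefox | xargs -r -n1 choom -n 1000 -p

# 3. Globally prefer killing whichever task triggered the failing allocation
sysctl vm.oom_kill_allocating_task=1
```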

reply
delamon 14 hours ago
You can bump /proc/$firefox_pid/oom_score_adj to make it a likely target. The easiest way is a wrapper script that bumps the score and then starts Firefox; all children will inherit the score.
reply
pmontra 15 hours ago
I'm not sure that I'd want the OS to kill my browser while I'm working within it.

Of course the browser is the largest process on my system, so when I notice that memory is running low I restart it and gain back some 15 GB.

Basically I am the memory manager of my system, and I've been able to run my 32 GB Linux laptop with no swap since 2014. I read that a system with no swap is suboptimal, but the only tradeoff I notice is manual OOM handling vs. fewer writes on my SSD. I'm happy with it.

reply
robinsonb5 14 hours ago
There are two pillars to managing RAM with virtual memory: the obvious one is writing one program's working set to disk, so that another program can use that memory. The other one - which isn't prevented by disabling swap - is flushing parts of a program which were loaded from disk, and reloading them from disk when next needed.

That second pillar is actually worse for interactivity than swapping the working set, which is why disabling swap entirely isn't considered optimal.

By far the best approach is just to have an absurd amount of RAM - which of course is a much less accessible option now than it was a year ago.

reply
jauntywundrkind 15 hours ago
If using systemd-oomd, you can launch Firefox into its own cgroup / systemd scope with memory pressure control settings set to not kill it: ManagedOOMPreference=avoid.

https://www.freedesktop.org/software/systemd/man/latest/syst...
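With systemd-run that's a one-liner (a sketch; assumes a cgroup v2 user session with oomd active):

```shell
# Launch Firefox in its own transient scope that systemd-oomd should avoid killing
systemd-run --user --scope -p ManagedOOMPreference=avoid firefox
```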

There's a variety of oom daemons. bustd is very lightweight & new. earlyoom has been around a long time, and has an --avoid flag. https://github.com/rfjakob/earlyoom?tab=readme-ov-file#prefe...

Your concerns are very addressable.

reply
ajb 12 hours ago
Yeah, systemd-oomd seems tuned for server workloads; I couldn't get it to stop killing my session instead of whichever app had eaten the memory.

Honestly, on the desktop I'd rather have a popup that let me select which app to kill. But the kernel doesn't even seem to be able to prioritize the display server's memory.

reply
TacticalCoder 6 hours ago
> Windows remains mostly fully responsive even when memory is being pushed to the limits and swapping gigabytes per second ...

The problem though is all the times when Windows is totally unusable even though it's doing exactly jack shit. An example would be when it's doing all its pointless updates/upgrades.

I don't know what people are smoking in this world when they somehow believe that Windows 11 is an acceptable user experience and a better OS than Linux.

Somehow though it's Linux that's powering billions if not tens of billions of devices worldwide and only about 12% of all new devices sold worldwide are running that piece of turd that Windows is.

reply
robinsonb5 14 hours ago
OOM killers serve a purpose but - for a desktop OS - they're missing the point.

In a sane world the user would decide which application to shut down, not the OS; the user would click the appropriate application's close gadget, and the user interface would remain responsive enough for that to happen in a matter of seconds rather than minutes.

I understand the many reasons why that's not possible, but it's a huge failing of Linux as a desktop OS, and OOM killers are picking around the edges of the problem, not addressing it head-on.

(Which isn't to say, of course, that OOM killers aren't the right approach in a server context.)

reply
nasretdinov 9 hours ago
I don't even think it's impossible — surely there already exist solutions, similar to the BFS scheduler, that would improve interactive performance
reply
p_ing 8 hours ago
This is effectively what macOS does: it presents a popup listing end-user GUI programs (or Terminal.app for CLI programs) that you can elect to kill, along with the amount of memory each application is consuming and whether or not it is responsive.

Not a bad system, but macOS has had a fundamental memory leak for the past few versions which causes even simple apps like Preview.app, your favorite browser (doesn't matter which one), etc. to 'leak' and bring up the choose-your-own-OOM-kill-target dialog.

reply
olejorgenb 9 hours ago
> I understand the many reasons why that's not possible

Isn't the only reason that the UI effectively freezes under high memory pressure? If the processing involved in the core parts of the UI could be prioritized, there's no reason for it not to work.

I don't understand why we can't have a way of saying "the pages of these few processes are holy". (Yes, it would still be slow allocating new pages, but it should be possible to write a small core part such that it doesn't need to constantly allocate memory.)
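cgroup v2 actually has a knob close to this: memory.min marks a group's pages as off-limits to reclaim. A sketch (the path and amount are illustrative, needs root and the memory controller enabled):

```shell
# Guarantee the user session 512 MiB that reclaim will never take away.
# memory.min is the hard protection; memory.low is the softer best-effort variant.
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/user.slice/memory.min
```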

reply
glyco 5 hours ago
The new design is still wrong though - the swap system is a cache, there's no reason why a page should only exist on one level.
reply
ChocolateGod 13 hours ago
Is there a reason as to why Linux has not adopted straight up page compression like Windows and macOS?
reply
creatonez 12 hours ago
It has? Many distros have it enabled by default, too. The article discusses it quite a bit.
reply
GrayShade 12 hours ago
It did; there are two incompatible approaches, zram and zswap.
reply
anthk 14 hours ago
OpenBSD and the rest have a limits file where you can set RAM limits per user and sometimes per process, so it's not a big issue.

On GNU/Linux and others not supporting dynamic swap files, you can swap into anything resembling a file, even virtual disk images.

Also set up ZRAM as soon as possible. 1/3 of physical RAM for ZRAM is perfect; it will almost double your effective RAM size with ease.
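Setting that up is only a few commands (a sketch, needs root; zramctl ships with util-linux, and zstd assumes your kernel supports it):

```shell
# Create a zram swap device sized at ~1/3 of physical RAM
modprobe zram
size_kb=$(awk '/MemTotal/ {print int($2 / 3)}' /proc/meminfo)
zramctl --algorithm zstd --size "${size_kb}K" /dev/zram0
mkswap /dev/zram0
swapon -p 100 /dev/zram0   # high priority: fill zram before any disk swap
```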

reply
JamesTRexx 13 hours ago
I think zswap is the better option because it's not a fixed RAM reservation: it merely compresses pages in RAM up to a configurable limit and then writes to swap space when needed, which is more efficient.

It worked very well with my preceding laptop limited to 4GB of RAM.
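For reference, zswap can be toggled at runtime through sysfs (a sketch, needs root; zstd and the 20% cap are illustrative choices):

```shell
# Enable zswap as a compressed cache in front of the existing swap device
echo 1    > /sys/module/zswap/parameters/enabled
echo zstd > /sys/module/zswap/parameters/compressor
echo 20   > /sys/module/zswap/parameters/max_pool_percent   # cap pool at 20% of RAM
```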

reply
gzread 12 hours ago
You can do this with cgroups but you aren't allowed to use cgroups if you use systemd, because it messes up systemd.
reply