Modernizing swapping: virtual swap spaces
52 points by voxadam 3 days ago | 48 comments

hackernudes 8 hours ago
I see some comments about soft lockups during memory pressure. I have struggled with this immensely over the years. I wrote a userspace memory reclaimer daemon and have not had a lockup since: https://gist.github.com/EBADBEEF/f168458028f684a91148f4d3e79... .

The hangs usually happened when I was stressing VFS (the computer was a samba server) along with other workloads. To trigger a hang manually I would read in large files (bigger than available ram) in parallel while running a game. I could get it to hang even with 128GB ram. I tweaked all the vm settings (swappiness, vfs_cache_pressure, etc.) to no avail. I tried with and without swap.

In the end it looked like memory was not getting reclaimed fast enough, like linux would wait too long to start reclaiming memory and some critical process would get stuck waiting for some memory. The system would hang for minutes or hours at a time only making the tiniest of progress between reclaims.

If I caught the problem early enough (just as everything started stuttering) I could trigger a reclaim manually by writing to '/sys/fs/cgroup/memory.reclaim' and the system would recover. I wonder if it was specific to btrfs or some specific workload pattern but I was never able to figure it out.
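The manual trigger is just a one-liner, roughly like this (cgroup v2, needs root; the 1G amount is an illustrative guess, not what my daemon uses):

```shell
# Ask the kernel to proactively reclaim up to 1 GiB from the root cgroup.
# Writing to memory.reclaim blocks until the kernel has tried to reclaim
# that much, which is exactly the "nudge" that got my system moving again.
echo "1G" > /sys/fs/cgroup/memory.reclaim
```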

reply
NooneAtAll3 4 hours ago
I want to express my gratitude for not falling for greedy marketers' lies and using correct base-2 for KB/MB/GB
reply
Numerlor 2 days ago
The swap/memory situation in linux has surprised me quite a bit coming from Windows.

Windows remains mostly fully responsive even when memory is being pushed to the limits and swapping gigabytes per second, while on linux, when I ran a stress test that ate all the memory, I had trouble even terminating the script

reply
dlcarrier 16 hours ago
There are two things that cause this. First, Windows has a variable swap file size whereas Linux's is fixed, so Windows can keep filling up your drive instead of running out of swap space. Second, the out-of-memory killer in Linux isn't very aggressive by default: it prefers over-committing memory to killing processes.

As far as I know, Linux still doesn't support a variable-sized swap file, but it is possible to change how aggressively it over-commits memory or kills processes to free memory.

As to why these differences exist, the reasons are more historical than technical. My best guess is that Windows figured it out sooner because it has always existed in an environment where multiple programs are memory hogs, whereas that wasn't common on Linux until the proliferation of web-based everything, requiring hundreds of megabytes to gigabytes of memory for each process running in a Chrome tab or Electron instance, even for something as simple as a news article or chat client.

Check out this series of blog posts for more information on Linux memory management: https://dev.to/fritshooglandyugabyte/series/16577

reply
Joker_vD 13 hours ago
Windows "figured it out sooner" because it never really had to seriously deal with overcommitting memory: there is no fork(), so processes' memory usage figures are accurate. On Linux, however, the non-negotiable existence of fork() really leaves one with no truly good solution (and this has been debated for decades).
reply
p_ing 8 hours ago
NT has been able to overcommit since its inception.
reply
goodpoint 11 hours ago
fork is a massive feature, not a bug.
reply
tliltocatl 8 hours ago
fork() is a misfeature, as is SIGCHLD/wait and most of Unix process management. It worked fine on the PDP-11 and that's it.

But Linux also overcommits mmap-anonymous/sbrk, while Windows leaves the decision to the user space, which is significantly slower.

reply
rwmj 6 hours ago
Not really. It elegantly solves the "create a process, letting it inherit these settings and reset these other settings" problem, where "settings" is an ever-changing and expanding list of things that you wouldn't want to bake into the API. Thus (omitting error checks and simplifying many details):

  int fd[2];
  pipe (fd);          // create a pipe to share with the child
  if (fork () == 0) { // child
    close (fd[0]);    // close some stuff (here, the unused read end)
    setrlimit (...);  // add a ulimit to the child
    sigaction (...);  // change signal masks
    // also: clean the environment, set cgroups
    execvp (...);     // run the child
  }
It's also enormously flexible. I don't know of any other API that, as well as all of the above, also lets you change the relationship of parent and child, and create duplicate worker processes.

Comparing it to Windows is hilarious because Linux can create processes vastly more efficiently and quickly than Windows.

reply
tliltocatl 4 hours ago
Yes, but at what cost? 99% of fork() calls are immediately followed by exec(), yet every kernel object needs to handle being forked, and a great deal of memory-management housekeeping is done only to be discarded afterward. And it doesn't work at all for AMP systems (which we will have to deal with, sooner or later).

In 1970 it might have been the only way to provide a flexible API, but nowadays we have a great variety of extensible serialization formats better than "struct".

reply
rwmj 3 hours ago
At no cost apparently, since Linux still manages to be much faster and more efficient than Windows.
reply
ChocolateGod 13 hours ago
Windows will also prioritise keeping the desktop and currently focused application running smoothly; the Linux kernel has no idea what's currently focused or what not to kill, so your desktop shell is right up there on the menu in OOM situations.
reply
p_ing 11 hours ago
The same behavior exists as far back as NT4 Server, which does not provide a foreground priority boost by default.
reply
kalaksi 13 hours ago
There are daemons (not installed by default) that monitor memory usage and can increase swap size or kill processes accordingly (you can ofc also configure OOM killer).
reply
LargoLasskhyfv 15 hours ago
> As far as I know, Linux still doesn't support a variable-sized swap file...

You can add (and remove) additional swapfiles at runtime, or rather on demand. I'm just unaware of any mechanism doing that automagically, though.

Could probably be done with eBPF and some shell scripts, I guess?
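The manual version is only a few commands (a sketch, needs root; the 2 GiB size and path are illustrative, and dd is safer than fallocate for swap files on some filesystems):

```shell
# Grow swap on demand: create, secure, format, and enable an extra swap file
dd if=/dev/zero of=/swapfile2 bs=1M count=2048  # 2 GiB, fully allocated (no holes)
chmod 600 /swapfile2
mkswap /swapfile2
swapon /swapfile2

# ...and tear it down again once memory pressure subsides
swapoff /swapfile2
rm /swapfile2
```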

reply
dsr_ 13 hours ago
swapspace (https://github.com/Tookmund/Swapspace) does this. Available in Debian stable.
reply
LargoLasskhyfv 9 hours ago
Wow. It's been around for 20 years, and I'm rambling about eBPF...
reply
dlcarrier 6 hours ago
Linux's eBPF has its issues, too.

I once was trying to set up a VPN that needed to adjust the TTL to keep its presence transparent, only to discover that I'd have to recompile the kernel to do so. How did packet filtering end up privileged, let alone running inside the kernel?

I recently started using SSHFS, which I can run as an unprivileged user, and suspending with a drive mounted reliably crashes the entire system. Back on the topic of swap space, any user that's in a cgroup, which is rarely the case, can also crash the system by allocating a bunch of RAM.

Linux is one of the most advanced operating systems in existence, with new capabilities regularly being added in, but it feels like it's skipped over several basics.

reply
quotemstr 13 hours ago
Huh? What does swap area size have to do with responsiveness under load? Linux has a long history of being unusable under memory pressure. systemd-oomd helps a little bit (killing processes before direct reclaim makes everything seize up), but there's still no general solution. The relevance of history is that Windows got it basically right early on and Linux never did.

Nothing to do with overcommit either. Why would that make a difference? We're talking about interactivity under load; how we got to the loaded state doesn't matter.

reply
nasretdinov 14 hours ago
In Linux the default swap behaviour is to also swap out the memory mapped from the executable file, not just memory allocated by the process. This is a relatively sane approach on servers, but not so much on desktops. I believe both Windows and macOS don't swap out code pages, so applications remain responsive, at the cost of (potentially) lower swap efficiency
reply
superjan 12 hours ago
Don't know about Macs, but on Windows executable code is treated like a read-only memory-mapped file that can be dropped and reloaded as the kernel sees fit. It could also be shared between processes, though that doesn't happen much anymore due to ASLR.
reply
demosito666 12 hours ago
Oh yeah. Bug 12309 was reported, what, almost 20 years ago? It's fair to say that at this point GNU Hurd will arrive sooner than Linux will be able to work properly under memory pressure.
reply
01HNNWZ0MV43FF 18 hours ago
I've had that same experience. On new systems I install earlyoom. I'd rather have one app die than the whole system.

You'd think after 30 years of GUIs and multi-tasking, we'd have this figured out, but then again we don't even have a good GUI framework.

reply
LtWorf 15 hours ago
I used to use it but it's too aggressive. It kills stuff too quickly.
reply
jauntywundrkind 2 days ago
Like Linux / open source often, it depends on what you do with it!

The kernel is very slow to kill stuff. Very very very very slow. It will try and try and try to avoid killing anything. It will make absolutely certain it can reclaim nothing more, crawling along, scraping together every little kilobyte it can free, swapping like mad to exhaust every option first.

But there are a number of daemons you can use if you want to be more proactive! Systemd now has systemd-oomd. It's pretty good! There's others, with other strategies for what to kill first, based on other indicators!

The flexibility is a feature, not a bug. What distro are you on? I'm kind of surprised it didn't ship with something turned on.

reply
Onavo 15 hours ago
> Windows remains mostly fully responsive even when memory is being pushed to the limits and swapping gigabytes per second

In my experience this is only on later versions of the NT kernel and only on NVMe (mostly the latter, I think).

reply
robinsonb5 15 hours ago
Yeah, I think SSD / NVMe makes all the difference here - I certainly remember XP / Vista / Win 7 boxes that became unusable and more-or-less unrecoverable (just like Linux) once a swap storm started.
reply
p_ing 12 hours ago
NT4 exhibits the same behavior under extreme load.
reply
rwmj 16 hours ago
The annoying thing I've found with Linux under memory stress (and still haven't found a nice way to solve) is I want it to always always always kill firefox first. Instead it tends to either kill nothing (causing the system to hang) or kill some vital service.
reply
0xbadcafebee 14 hours ago
Linux being... Linux, it's not easy to use, but it can do what you want.

1. Use `choom` to give your Firefox PIDs a score of +1000, so they always get reaped first

2. Use systemd to create a Control Group to limit firefox and reap it first (https://dev.to/msugakov/taking-firefox-memory-usage-under-co...)

3. Enable vm.oom_kill_allocating_task to kill the task that asked for too much memory

4. Nuclear option: change how all overcommitting works (https://www.kernel.org/doc/html/v5.1/vm/overcommit-accountin...)
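Options 1 and 3 are one-liners, roughly like this (a sketch; the sysctl needs root, and the PIDs have to be re-tagged whenever Firefox restarts):

```shell
# 1. Tag every running Firefox process as the OOM killer's first choice
#    (choom ships with util-linux; -n 1000 sets oom_score_adj to the max)
pgrep firefox | xargs -r -n1 choom -n 1000 -p

# 3. Globally prefer killing whichever task triggered the failing allocation
sysctl vm.oom_kill_allocating_task=1
```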

reply
delamon 14 hours ago
You can bump /proc/$firefox_pid/oom_score_adj to make it a likely target. The easiest way is a wrapper script that bumps the score and then starts Firefox; all children will inherit the score.
reply
pmontra 15 hours ago
I'm not sure that I'd want the OS to kill my browser while I'm working within it.

Of course the browser is the largest process on my system, so when I notice that memory is running low I restart it and gain back some 15 GB.

Basically I am the memory manager of my system, and I've been able to run my 32 GB Linux laptop with no swap since 2014. I read that a system with no swap is suboptimal, but the only tradeoff I notice is manual OOM handling vs. fewer writes on my SSD. I'm happy with it.

reply
robinsonb5 14 hours ago
There are two pillars to managing RAM with virtual memory: the obvious one is writing one program's working set to disk, so that another program can use that memory. The other one - which isn't prevented by disabling swap - is flushing parts of a program which were loaded from disk, and reloading them from disk when next needed.

That second pillar is actually worse for interactivity than swapping the working set, which is why disabling swap entirely isn't considered optimal.

By far the best approach is just to have an absurd amount of RAM - which of course is a much less accessible option now than it was a year ago.

reply
jauntywundrkind 15 hours ago
If using systemd-oomd, you can launch Firefox into its own cgroup / systemd scope with memory pressure control settings set to not kill it: ManagedOOMPreference=avoid.

https://www.freedesktop.org/software/systemd/man/latest/syst...
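With systemd-run that's a one-liner (a sketch; assumes a cgroup v2 user session with oomd active):

```shell
# Launch Firefox in its own transient scope that systemd-oomd should avoid killing
systemd-run --user --scope -p ManagedOOMPreference=avoid firefox
```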

There's a variety of oom daemons. bustd is very lightweight & new. earlyoom has been around a long time, and has an --avoid flag. https://github.com/rfjakob/earlyoom?tab=readme-ov-file#prefe...

Your concerns are very addressable.

reply
ajb 12 hours ago
Yeah, systemd-oomd seems tuned for server workloads; I couldn't get it to stop killing my session instead of whichever app had eaten the memory.

Honestly, on the desktop I'd rather have a popup that let me select which app to kill. But the kernel doesn't even seem to be able to prioritize the display server's memory.

reply
TacticalCoder 6 hours ago
> Windows remains mostly fully responsive even when memory is being pushed to the limits and swapping gigabytes per second ...

The problem though is all the times when Windows is totally unusable even though it's doing exactly jack shit. An example would be when it's doing all its pointless updates/upgrades.

I don't know what people are smoking in this world when they somehow believe that Windows 11 is an acceptable user experience and a better OS than Linux.

Somehow though it's Linux that's powering billions if not tens of billions of devices worldwide and only about 12% of all new devices sold worldwide are running that piece of turd that Windows is.

reply
robinsonb5 14 hours ago
OOM killers serve a purpose but - for a desktop OS - they're missing the point.

In a sane world the user would decide which application to shut down, not the OS; the user would click the appropriate application's close gadget, and the user interface would remain responsive enough for that to happen in a matter of seconds rather than minutes.

I understand the many reasons why that's not possible, but it's a huge failing of Linux as a desktop OS, and OOM killers are picking around the edges of the problem, not addressing it head-on.

(Which isn't to say, of course, that OOM killers aren't the right approach in a server context.)

reply
nasretdinov 9 hours ago
I don't even think it's impossible — surely there already exist solutions, similar to the BFS scheduler, that would improve interactive performance
reply
p_ing 8 hours ago
This is effectively what macOS does: it presents a popup listing end-user GUI programs (or Terminal.app for CLI programs) that you can elect to kill, along with the amount of memory each application is consuming and whether or not it is responsive.

Not a bad system, but macOS has had a fundamental memory leak for the past few versions which causes even simple apps like Preview.app, your favorite browser (doesn't matter which one), etc. to 'leak' and bring up the choose-your-own-OOM-kill-target dialog.

reply
olejorgenb 9 hours ago
> I understand the many reasons why that's not possible

Isn't the only reason that the UI effectively freezes under high memory pressure? If the processing involved in the core parts of the UI could be prioritized, there's no reason for it not to work.

I don't understand why we can't have a way of saying "the pages of these few processes are holy". (Yes, it would still be slow allocating new pages, but it should be possible to write a small core part such that it doesn't need to constantly allocate memory.)
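cgroup v2 actually has a knob close to this: memory.min marks a group's pages as off-limits to reclaim. A sketch (the path and amount are illustrative, needs root and the memory controller enabled):

```shell
# Guarantee the user session 512 MiB that reclaim will never take away.
# memory.min is the hard protection; memory.low is the softer best-effort variant.
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/user.slice/memory.min
```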

reply
glyco 5 hours ago
The new design is still wrong though - the swap system is a cache, there's no reason why a page should only exist on one level.
reply
ChocolateGod 13 hours ago
Is there a reason as to why Linux has not adopted straight up page compression like Windows and macOS?
reply
creatonez 12 hours ago
It has? Many distros have it enabled by default, too. The article discusses it quite a bit.
reply
GrayShade 12 hours ago
It did; there are two incompatible approaches, zram and zswap.
reply
anthk 14 hours ago
OpenBSD and the rest have a limits file where you can set RAM limits per user and sometimes per process, so it's not a big issue.

On GNU/Linux and others not supporting dynamic swap files, you can swap into anything resembling a file, even virtual disk images.

Also set up ZRAM as soon as possible. 1/3 of physical RAM for ZRAM is perfect; it will almost double your effective RAM size with ease.
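Setting that up is only a few commands (a sketch, needs root; zramctl ships with util-linux, and zstd assumes your kernel supports it):

```shell
# Create a zram swap device sized at ~1/3 of physical RAM
modprobe zram
size_kb=$(awk '/MemTotal/ {print int($2 / 3)}' /proc/meminfo)
zramctl --algorithm zstd --size "${size_kb}K" /dev/zram0
mkswap /dev/zram0
swapon -p 100 /dev/zram0   # high priority: fill zram before any disk swap
```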

reply
JamesTRexx 13 hours ago
I think zswap is the better option because it's not a fixed RAM reservation: it merely compresses pages in RAM up to a configurable limit and then writes to swap space when needed, which is more efficient.

It worked very well with my preceding laptop limited to 4GB of RAM.
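For reference, zswap can be toggled at runtime through sysfs (a sketch, needs root; zstd and the 20% cap are illustrative choices):

```shell
# Enable zswap as a compressed cache in front of the existing swap device
echo 1    > /sys/module/zswap/parameters/enabled
echo zstd > /sys/module/zswap/parameters/compressor
echo 20   > /sys/module/zswap/parameters/max_pool_percent   # cap pool at 20% of RAM
```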

reply
gzread 12 hours ago
You can do this with cgroups but you aren't allowed to use cgroups if you use systemd, because it messes up systemd.
reply