It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.
You mean like compilers and test suites? Very few professional workloads don't parallelize well these days.
(Though games these days scale better than they used to, but only up to a point.)
I find that most tools I write for my own use can be made to scale with cores, or run so fast that the overhead of starting threads is longer than the program runtime. But I write that in Rust which makes parallelism easy. If I wrote that code in C++ I would probably not bother with trying to parallelize.
I do hope the nuclear powerplant next door uses more fault tolerant hardware, though.
> I much prefer cheaper hardware.
The cost savings are modest; order of magnitude 12% for the DIMMs, and less elsewhere. Computers are already extremely cheap commodities.
Assuming that's more due to intentional market segmentation than actual cost, yeah I would pay 12% more for ECC. But I'm with the other guy on not valuing it a ton. I have backups which are needed regardless of bitrot, and even if those don't help, losing a photo isn't a huge deal for me.
That was me. It isn't "officially" supported by AMD, but it should work. You can enable EDAC monitoring in Linux and observe detected correction events happening.
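For reference, on Linux those EDAC counters live in sysfs; a minimal sketch of reading them (the sysfs layout is the standard EDAC one, but the `root` parameter and output format here are my own, added so the function is testable):

```python
from pathlib import Path

def edac_counts(root="/sys/devices/system/edac/mc"):
    """Corrected (ce) and uncorrected (ue) error counts per memory controller."""
    root = Path(root)
    counts = {}
    if not root.is_dir():  # no EDAC driver loaded, or not Linux
        return counts
    for mc in sorted(root.glob("mc*")):
        ce, ue = mc / "ce_count", mc / "ue_count"
        if ce.is_file() and ue.is_file():
            counts[mc.name] = (int(ce.read_text()), int(ue.read_text()))
    return counts

if __name__ == "__main__":
    for name, (ce, ue) in edac_counts().items():
        print(f"{name}: {ce} corrected, {ue} uncorrected errors")
```

A nonzero ce_count that keeps climbing is exactly the "detected correction events" the comment above describes.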
> Assuming that's more due to intentional market segmentation than actual cost
That's the argument, yeah.
Ironically, that's around the time Intel started making it difficult to get ECC on desktop machines using their CPUs. The Pentium 3 and 440BX chipset, maxing out at 1GB, were probably the last combo where it pretty commonly worked with a normal desktop board and normal desktop processor.
I'm not really sure if this makes it overall more or less reliable than DDR2/3/4 without ECC though.
Because you can't track on-die ECC errors, you have no way of knowing how "faulty" a particular DRAM chip is. And if there's an uncorrected error, you can't detect it.
It's "ECC", but not the ECC you want; marketing garbage.
I think this sort of reporting is a pretty basic feature that should come standard on all hardware. No idea why it's an "enterprise" feature. This market segmentation is extremely annoying and shouldn't exist.
I would definitely like to have a laptop with ECC, because obviously I don't want things to crash and I don't want corrupted data or anything like that, but I don't really use desktop computers anymore.
The main overhead is simply the extra RAM required to store the extra bits of ECC.
The main reason ECC RAM is slower is because it's not (by default) overclocked to the point of stability - the JEDEC standard speeds are used.
The other much smaller factors are:
* The tREFi parameter (refresh interval): refreshes usually run at double the frequency on ECC RAM, so that it handles high-temperature operation.
* The register chip buffers the command/address/control/clock signals, adding a clock of latency to every command (<1ns, much smaller than the typical memory latency you'd measure from the memory controller).
* ECC calculation (AMD states 2 UMC cycles, <1ns).
And there's non-random bit errors that can hit you at any speed, so it's not like going slow guarantees safety.
However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.
On the hardware level, there's an optional spec to checksum the link between the CPU and the memory. Since it's optional, many consumer machines do not implement it, so then they flip bits not in RAM, but on the lines between the RAM and the CPU.
It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.
That said, DIMM capacities keep increasing, so even a small per-bit chance of flips means lots of people will still be affected.
Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.
However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.
In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.
In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).
As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.
I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.
That's the thing. Bit flips impact everything memory-resident - that includes program code. You have no way of telling what instruction was actually read when executing the line your instrumentation may say corresponds to the MOV; or it may have been a legit memory operation, but instrumentation is reporting the wrong offset. There are some ways around it, but - generically - if a system runs a program bigger than the processor cache and may have bit flips, the output is useless, including whatever telemetry you use (because the telemetry is code executed from RAM and will touch RAM).
I have at least one motherboard that just re-auto-overclocks itself into a flaky configuration if boot fails a few times in a row (which can happen due to loose power cords, or whatever).
How do you know the number/proportion of users who run without telemetry enabled, since by definition you're not collecting their data?
(Not imputing any malice, genuinely curious.)
I never dug too deeply but the app is still running on some out of support iPads so maybe it's random bit flips.
Actually "dereferencing a pointer that had just passed a nil check" could be from a flow control fault where the branch fails to be taken correctly.
If that was caused by bad memory, I would expect other software to be similarly affected and hence crash with about comparable frequency. However, it looks like I'm falling more into the other 90% of cases (unsurprisingly) because I do not observe other software crashing as much as firefox does.
Also, this whole crashing business is a fairly recent effect - I've been running firefox for forever and I cannot remember when it last was as much of an issue as it has become recently for me.
Two years ago, I had Factorio crash once on a null pointer exception. I reported the crash to the devs and, likely because the crash site had a null check, they told me my memory was bad. Same as you, I said "wait, no other software ever crashed weirdly on this machine!", but they were adamant.
Lo and behold, one of my four RAM sticks indeed had a few bad addresses. Not many, something like 10-15 addresses tops. You need bad luck to hit one of those addresses when the total memory is 64GB. It's likely the null pointer check got flipped.
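A rough back-of-envelope for how plausible that is, under the simplifying assumption that a working set lands uniformly at random across the 64 GB (the address count comes from the anecdote above; real allocators are nothing like uniform):

```python
# ~15 bad byte addresses out of 64 GB of RAM.
TOTAL_BYTES = 64 * 2**30
BAD_ADDRESSES = 15

p_bad = BAD_ADDRESSES / TOTAL_BYTES  # chance any one byte is bad

def p_touched(working_set_bytes):
    """Chance a uniformly placed working set covers at least one bad byte."""
    return 1 - (1 - p_bad) ** working_set_bytes

print(f"per-byte probability: {p_bad:.2e}")
print(f"4 GB working set touches a bad byte: {p_touched(4 * 2**30):.1%}")
```

Even 15 bad addresses give a game with a multi-gigabyte working set a decent chance of eventually touching one, while most smaller programs never notice.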
Browsers are good candidates for finding bad memory: they eat a lot of RAM, they scatter data around, they keep large chunks of it live, and they have JITs where a lot of machine code gets loaded left and right.
Any memory allocation failing within the browser forces an instant crash unless the callsite explicitly opts in to handling the allocation failure.
"Check malloc failure" is an opt-out feature in browsers, not opt-in. It's the same in Chromium. Failing to check would cause too many security issues. (One more reason new stuff tends to prefer Rust, etc)
I once had a bitflip pattern causing lowercase ASCII to turn into uppercase ASCII in a case-insensitive system. Everything was fine until it tried to uppercase numbers, and things went wrong.
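For anyone wondering why numbers specifically break: ASCII upper- and lowercase letters differ only in bit 5 (0x20), so naive case conversion just clears that bit, which silently mangles digits. A small illustration:

```python
CASE_BIT = 0x20  # 'a' (0x61) and 'A' (0x41) differ only in this bit

def naive_upper(ch):
    """Buggy 'uppercasing' that just clears bit 5: fine for letters, not digits."""
    return chr(ord(ch) & ~CASE_BIT)

assert naive_upper('a') == 'A'
assert naive_upper('z') == 'Z'
assert naive_upper('1') == '\x11'  # digits become control characters
```

The same bit is why a stuck or flipped bit 5 in a data path looks exactly like a case change for letters, and like garbage for everything else.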
The first time I had to deal with faulty RAM (more than 20 years ago), the bug would never trigger unless I used pretty much the whole DIMM and put meaningful stuff in it; in my case, linking large executables or untarring large source archives.
Flipping a pixel had no impact though
My suspicion has always been some kind of a memory leak, but memory corruption also makes sense.
Unfortunately, Chrome (which I use for work - Firefox is for private stuff) has NEVER crashed on me yet. Certainly not in the past 5 years. Which is odd. I'm on Linux btw.
Without knowing more about your configuration, it's hard to give advice, but definitely worth trying with a clean profile first.
If you don't report this problem upstream it will never get fixed, as obviously no-one else is seeing this. Firefox has a built-in profiler that you can use to report performance problems like this.
How long are the inputs that you get problems with?
I would add Thunderbird to that list.
It used to be memory usage, now it's crashing.
Hint: No-one is claiming memory is to blame for 100% of the Firefox crashes. No-one is claiming it's 99% either.
Sorry, but I experienced first hand Firefox's memory leaks not being taken seriously. This "bitflips" news is just released, but I fully expect anybody complaining about Firefox crashes to be met with low effort "It's your RAM," responses for the next few years now.
Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].
There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.
edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.
[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...
[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568
[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...
[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...
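A sketch of the kind of heuristic [1] and [2] hint at: if a field that should hold a known magic constant differs from it by exactly one bit, a hardware flip is a far more likely explanation than a stray overwrite. (The magic value below is a placeholder, not Mozilla's actual poison constant, and this is my approximation of the idea, not their code.)

```python
def bit_distance(a, b):
    """Hamming distance between two integers."""
    return bin(a ^ b).count("1")

MAGIC = 0xDEADBEEF  # placeholder poison value, not Firefox's actual constant

def classify(observed, expected=MAGIC):
    d = bit_distance(observed, expected)
    if d == 0:
        return "intact"
    if d == 1:
        return "likely hardware bit-flip"
    return "likely software overwrite"

assert classify(0xDEADBEEF) == "intact"
assert classify(0xDEADBEEF ^ (1 << 17)) == "likely hardware bit-flip"
assert classify(0x00000000) == "likely software overwrite"
```

This works because a magic value with lots of bits set high and low is very unlikely to end up one bit away from itself via a random wild write.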
(see: https://github.com/mozilla-firefox/firefox/blob/main/toolkit..., which points to a specific commit in that repo - turns out to be tip of main)
He doesn't explain anything indeed but presumably that code is available somewhere.
I've seen a lot of confirmed bitflips with ECC systems. The vast majority of machines that are impacted are impacted by single event upsets (not reproducible).
(I worded that precisely but strangely because if one machine has a reproducible problem, it might hit it a billion times a second. That means you can't count by "number of corruptions".)
My take is that their 10% estimate is a lower bound.
Yes, that's a confounding factor, and in fact the starting assumption when looking at a crash. Sometimes you can be pretty sure it's hardware. For example, if it's a crash on an illegal instruction in non-JITted code, the crash reporter can compare that page of data with the on-disk image that it's supposed to be a read-only copy of. Any mismatches there, especially if they're single bit flips, are much more likely to be hardware.
But I've also seen it several times when the person experiencing the crashes engages on the bug tracker. Often, they'll get weird sporadic but fairly frequent crashes when doing a particular activity, and so they'll initially be absolutely convinced that we have a bug there. But other people aren't reporting the same thing. They'll post a bunch of their crash reports, and when we look at them, they're kind of all over the place (though as they say, almost always while doing some particular thing). Often it'll be something like a crash in the garbage collector while watching a youtube video, and the crashes are mostly the same but scattered in their exact location in the code. That's a good signal to start suspecting bad memory: the GC scans lots of memory and does stuff that is conditional on possibly faulty data. We'll start asking them to run a memory test, at least to rule out hardware problems. When people do it in this situation, it almost always finds a problem. (Many people won't do it, because it's a pain and they're understandably skeptical that we might be sandbagging them and ducking responsibility for a bug. So we don't start proposing it until things start feeling fishy.)
But anyway, that's just anecdata from individual investigations. gsvelto's post is about what he can see at scale.
ptr ^= (1ULL << rand_between(0, 63));
that got inserted in the code by accident. That's just not the way that we write software. If it's only hit once by a random person, memory starts being more likely.
(Unless that LOC is scanning memory or smth)
Also, in an unsafe language all bets are off. A memory clobber, UAF or race condition can generate quite strange and ephemeral crashes. Even if the majority of time it generates the “same” failure mode, it can still sporadically generate a rare execution trace. It’s best to stop thinking of these as deterministic processes and more as a distribution of possible outcomes.
This is a bit vague to really reply to very specifically, but yes, this is hard. Which is why quite some people work in this area. It's rather valuable to do so at Firefox-scale.
Even if the majority of time it generates the “same” failure mode, it can still sporadically generate a rare execution trace.
This doesn't matter that much because the "same" failure mode already allows you to see the bug and fix it.
Ask them to publish raw MCE and ECC dumps with timestamps correlated to crashes, or reproduce the failure with controlled fault injection or persistent checksums, because without that this reads like a hypothesis dressed up as a verdict.
Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.
I also find that firefox crashes much more than chrome-based browsers, but it is likely that chrome's apparent superior stability comes from better handling of the other 90% of crashes.
If 50% of chrome crashes were due to bit flips, and bit flips affect the two browsers at basically the same rate, that would indicate that chrome experiences 1/5th the total crashes of firefox... even though the bit flip crashes happen at the same rate on both browsers.
It would have been better news for firefox if the number of crashes due to faulty hardware were actually much higher! These numbers indicate the vast majority of firefox crashes are actually from buggy software : (
I can’t say the same for Chromium. Despite barely using it, I had at least one tab or iframe crash last year, and there’s a moderate chance (I’ll suggest 15%) on any given day of leaving it open that it will just spontaneously die while I’m not paying attention to it (my wild guess, based on observations about Inkscape if it’s executing something CPU-bound for too long: it’s not responding in a timely fashion to the compositor, and is either getting killed or killing itself, not sure which that would be).
Frankly, from a crashing perspective, both are very reliable these days. Chromium is still far more prone to misrendering and other misbehaviour—they prefer to ship half-baked implementations and fix them later; Firefox, on the other hand, moves slower but has fewer issues in what they do ship.
Edit: more context, I power cycle at least once a week on desktop and the version is typically a bit behind new. I also don't have more tabs open than will fit in the row. All these habits seem likely to decrease crashes.
I've tried all kinds of things software-wise but keep getting random crashes.
I wonder if I should do a longer memory test, maybe some CPU stress testing at the same time...
Or you can view several of them and see if there's a common pattern in the "Signature" field. Firefox really should only be regularly crashing if: (1) there's a real bug and you keep hitting whatever triggers it, (2) you're running out of memory, or (3) you have faulty hardware.
I don't know what the odds of faulty hardware are for a randomly chosen user, but they're much higher for a randomly chosen user who is seeing regular crashes.
this may have something to do with the fact that my laptop is from 2017, however.
I agree. Good thing he doesn't back up his claim with any sort of evidence or reasoned argument, or you'd look like a huge moron!
> And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.
The actual measurement is 5%. The 10% figure is entirely made up, with zero evidence or reasoned argument except a hand-wavy "conservative".
Edit: actually, the claim is even less supported:
> out of these ~25000 crashes have been detected as having a potential bit-flip. That's one crash every twenty potentially caused by bad/flaky memory
"Potential" is a weasel word here. We don't see any of the actual methodology. For all we know, the real value could be 0.1% or 0.01%.
The hardware bugs are there. They're just handled.
Live objects get swapped between Discardable buffers quite frequently. They're not expected to stay at the same position in memory.
RAM flips are common. This kind of thing is old and has likely gotten worse.
IBM had data on this. DEC had data on this. Amazon/Google/Microsoft almost certainly had data on this. Anybody who runs a fleet of computers gets data on this, and it is always eye opening how common it is.
ZFS is really good at spotting RAM flips.
Perhaps you're part of the group driving hardware crashes up to 10% and need to fix your machine.
> Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.
That's a misinterpretation. The finding refers to the composition of crashes, not the overall crash rate (which is not reported by the post). Brought to the extreme, there may have been 10 (reported) crashes in history of Firefox, and 1 due to faulty hardware, and the statement would still be correct.
Hardware problems are just as good a potential explanation for those as anything else.
Turned out at their altitude cosmic rays were flipping bits in the top-most machines in the racks, sometimes then penetrating lower and flipping bits in more machines too.
I also use a bunch of other extensions though, dark reader, vimium, sideberry... I'd expect me to be a bit more exposed than the average user. Yet it's just rock stable for me. Maybe it just works better on linux?
1: I know this because I installed https://addons.mozilla.org/en-US/firefox/addon/tab-counter-p... to check :)
2: However after finding Karakeep I don't actually have 1000 tabs anymore!
But non-ECC is fine for most of us mortals gaming and streaming.
I would expect pro gamers to opt for ECC though.
[1] https://www.corsair.com/us/en/explorer/diy-builder/memory/is...
>> DDR5 technology comes with an exclusive data-checking feature that serves to improve memory cell reliability and increase memory yield for memory manufacturers. This inclusion doesn't make it full ECC memory though.
"Proper" ECC has a wider memory bus, so the CPU emits checksum bits that are saved alongside every word of memory, and checked again by the CPU when memory is read. E.g. a 64 bit machine would actually have 72 bit memory.
DDR5 "ECC" uses error correction only within the memory stick. It's there to reduce the error rate, so otherwise unacceptable memory is usable - individual cells have become so small that they are no longer acceptably reliable by themselves!
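The 72-bit arrangement is classic SECDED: the check-bit count follows from the Hamming bound plus one overall parity bit for double-error detection. A quick sketch of the arithmetic (illustrative only, not how a memory controller is actually wired):

```python
def secded_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction),
    plus one extra overall parity bit for double-error detection."""
    r = 1
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1

for k in (8, 16, 32, 64, 128):
    r = secded_check_bits(k)
    print(f"{k:4d} data bits -> {r} check bits ({r / k:.1%} overhead)")
```

For 64 data bits this gives 8 check bits, i.e. the 12.5% overhead and the 72-bit bus mentioned above; note the relative overhead shrinks as the protected word gets wider.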
DDR4 is not fully reliable memory either.
This is common for many high speed electrical engineering challenges: Running a slightly higher error rate option with ECC on top can have an overall lower error rate at higher throughput than the alternative of running it slow enough to push the error rate down below some threshold.
It makes some people nervous because they don’t like the idea of errors being corrected, but the system designers are looking at overall error rates. The ECC is included in the system’s operation so it isn’t something that is worthwhile to separate out.
A bit error rate of one per billion with a parity bit on each packet is much more reliable than an undetectable bit error rate of one per trillion.
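To put rough numbers on that claim (the 512-bit packet size is an assumption; parity misses even numbers of flips, so the dominant undetected case is exactly two flips in one packet):

```python
from math import comb

def p_undetected_with_parity(p, n):
    # Parity catches any odd number of flips; dominant silent case is 2 flips.
    return comb(n, 2) * p**2 * (1 - p) ** (n - 2)

n = 512                 # bits per packet, an assumed size
p_fast = 1e-9           # raw bit error rate of the fast, parity-protected link
p_slow = 1e-12          # raw bit error rate of the slow link with no detection

undetected_fast = p_undetected_with_parity(p_fast, n)
undetected_slow = 1 - (1 - p_slow) ** n  # without parity, every flip is silent

print(f"fast link + parity, silent errors per packet: {undetected_fast:.2e}")
print(f"slow link, no parity, silent errors per packet: {undetected_slow:.2e}")
```

Under these assumptions the noisier-but-checked link ends up with a silent error rate thousands of times lower than the "cleaner" unchecked one, which is the whole point.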
"ECC" does not give you fully reliable RAM. Uncorrectable errors (UEs) are still observed.
What's the chance of failure? If you have one device that achieves equal performance with less reliable cells plus redundancy, and another that uses more reliable cells without redundancy, it's not really any different.
NAND is horribly flaky, cell errors are a matter of course. You could buy boutique NOR or SLC NAND or something if you want really good cells. You wouldn't though, because it would be ruinously expensive, but also it would not really give you a result that an SSD with ECC can't achieve.
But I don’t really know what the Firefox team does with crash reports and in making Firefox almost crash proof.
I have been using it at work on Windows and for the last several years it always crashes on exit. I have religiously submitted every crash report. I even visit the “about:crashes” page to see if there are any unsubmitted ones and submit them. Occasionally I’ll click on the bugzilla link for a crash, only to see hardly any action or updates on those for months (or longer).
Granted that I have a small bunch of extensions (all WebExtensions), but this crash-on-exit happens due to many different causes, as seen in the crash reports. I'm loath to troubleshoot by disabling all extensions and then trying them one by one. Why should an extension even cause a crash, especially when it's a WebExtension (unlike the older XUL extensions that had deeper integration into the browser)? It seems like there are fundamental issues within Firefox that make it crash prone.
I can make Firefox not crash if I have a single window with a few tabs. That use case is anyway served by Edge and Chrome. The main reasons I use Firefox, apart from some ideological ones, are that it’s always been much better at handling multiple windows and tons of tabs and its extensibility (Manifest V2 FTW).
I would sincerely appreciate Firefox not crashing as often for me.
> this crash-on-exit happens due to many different causes, as seen in the crash reports
It points in the same direction: all these different causes are just symptoms, the root cause is hiding deeper, and it is triggered by Firefox shutting down.
All of this is no guarantee that the root cause is bitflips, but you can rule that out by testing your memory.
I think our education system should include a unit on "marketing bullshit" sometime early in elementary school. Maybe as part of math class, after they learn inequalities. "Ok kids, remind me, what does 'up to' mean?" "less than or equal to!"
I can certainly imagine that a very small fraction of Firefox users are generating these results, so that bit flips are not a problem generally.
Has to be normalized, and outliers eliminated in some consistent manner.
Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-committed.
Having the number of unique machines would be great to see how skewed this estimate is.
Certain data is more sensitive as well and requires extra protection. Pointers and indexes obviously, which might send the whole application on a wild goose chase around memory. But also machine code, especially JIT-generated traces, is worth checksumming and verifying before executing it.
It is not that simple, it does not only depend on the hardware but also the code. It is like a race, what happens first - you hit a bug in the code or your hardware glitches? If the code is bug free, then all crashes will be due to hardware issues, whether faulty hardware or stray particles from the sun. When the code is one giant bug and crashes immediately every time, then you will need really faulty hardware or have to place a uranium rod on top of your RAM and point a heat gun at your CPU to crash before you hit the first bug, i.e. almost all crashes will be due to bugs.
So what you observe will depend on the prevalence of faulty hardware and how long it takes to hit a hardware issue vs how buggy the code is and how long it takes to hit a bug.
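That race can be modeled as two competing memoryless failure processes; a small simulation (the rates are made up purely for illustration):

```python
import random

def crash_mix(bug_rate, hw_rate, trials=100_000, seed=0):
    """Fraction of crashes attributable to hardware when software bugs and
    hardware glitches are both memoryless processes racing each other."""
    rng = random.Random(seed)
    hw_wins = sum(
        rng.expovariate(hw_rate) < rng.expovariate(bug_rate)
        for _ in range(trials)
    )
    return hw_wins / trials

# Buggy code: a software crash every ~2 hours, a hardware glitch every ~2000 hours.
print(crash_mix(bug_rate=1 / 2, hw_rate=1 / 2000))      # ~0.001: hardware is noise
# Nearly bug-free code, same hardware:
print(crash_mix(bug_rate=1 / 2000, hw_rate=1 / 2000))   # ~0.5: half are hardware
```

Analytically the hardware share is hw_rate / (hw_rate + bug_rate), which is exactly the point above: the buggier the software, the smaller the fraction of observed crashes that hardware gets blamed for, even though the hardware glitch rate never changed.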
https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-...
We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...
They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.
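Scaled naively to a single 16 GB machine, those fleet-average figures are startling (keep in mind the paper found errors heavily concentrated in a minority of bad DIMMs, so the fleet average is not the typical machine):

```python
def correctable_errors_per_year(rate_per_1e9_device_hours_per_mbit, gigabytes):
    """Naive scaling of the fleet-average rate to one always-on machine."""
    mbit = gigabytes * 8 * 1024
    hours = 24 * 365
    return rate_per_1e9_device_hours_per_mbit * mbit * hours / 1e9

for rate in (25_000, 70_000):
    n = correctable_errors_per_year(rate, gigabytes=16)
    print(f"rate {rate:>6}: ~{n:,.0f} correctable errors/year on a 16 GB machine")
```

Even the low end of the range works out to tens of thousands of correctable events per machine-year at the fleet average, which is why the "errors are rare" intuition doesn't survive contact with large-scale data.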
A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.
There are power supplies that are mildly defective but got past QC.
There are server designs where the memory is exposed to EMI and voltage differences that push it ever so slightly out of spec, past what QC caught.
Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.
There are a ton of causes for bitflips other than cosmic rays.
For instance, that specific google paper you cited found a 3x increase in bitflips as datacenter temperature increased! How confident are you the average Firefox user's computer is as temperature-controlled as a google DC?
It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.
The sentiment was always that ECC is a waste and a scam. My goodness, the unhinged posts from people who thought it was a trick and couldn't fathom that you don't know you're having bits flipped without it. "It's a rip off", without even looking and seeing that the price was often just that of the extra chip.
I've discussed it for 20 years since the first Mac Pro and people just did not want to hear that it had any use. Even after the Google study.
Consumers giving professionals advice. Was same with workstation graphics cards.
I wonder sometimes if we shouldn't be doing like NASA does and triple-storing values and comparing the calculations to see if they get the same results.
That's different from what you're suggesting, because you're right that the crash reports are analyzed with heuristics to guess at memory corruption. Aside from the privacy implications, though, I think that would have too many false alarms. A single bit flip is usually going to be an out of bounds write, not bad RAM.
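For what it's worth, the NASA-style scheme from the comment above (triple modular redundancy) is simple to sketch, though it triples memory cost and still can't tell a software overwrite from a hardware flip:

```python
def tmr_read(copies):
    """Majority vote over three stored copies of a value; flags any disagreement."""
    a, b, c = copies
    if a == b == c:
        return a, False
    # Two matching copies outvote the third (assumes at most one copy is corrupt).
    voted = a if a in (b, c) else b
    return voted, True

value, disagreed = tmr_read([42, 42, 42])
assert (value, disagreed) == (42, False)
value, disagreed = tmr_read([42, 10, 42])  # one corrupted copy gets outvoted
assert (value, disagreed) == (42, True)
```

The catch for a crash reporter is exactly the false-alarm problem mentioned above: a wild out-of-bounds write corrupts one copy just as convincingly as bad RAM does.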
This isn't really feasible: have you looked at memory prices lately? The users can't afford to replace bad memory now.
I had memory issues with my PC build which I fixed by reducing the speed to 2800 MHz, which is much lower than its advertised speed of 5600 MHz. Actually, looking back at this, it might've configured its speed incorrectly in the first place; reducing it to 2800 just happened to hit a multiple of 2 of its base clock speed.
Pentium G4560 supports ECC, Core i7 10700 doesn't.
Is it a difference between server hardware managed by knowledgeable people and random hardware thrown together by home PC builders?
10+% is huge
> That's one crash every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.
So the data actually only supports 5% being caused by bitflips, then there's a magic multiple of 2? Come on. Let alone this conservative heuristic that is never explained - what is it doing that makes him so certain that it can never be wrong, and yet also detects these at this rate?
CPU caches and registers - how exactly are they different from a RAM on a SoC in this regard?
CPUs tend to be built to tolerate upsets, like having ECC and parity in arrays and structures, whereas the DRAM on a Macbook probably does not. But there is no objective standard for these things, and redundancy is not foolproof; it is just another lever to move the reliability equation with.
I run four Firefox instances simultaneously, most of the time. No issues to report.
It seems more likely to happen when the profile has been running for a long time (a couple weeks?) and/or using a large amount of RAM.
There's a 60-secish timeout before it gives up and pops that crash report window. I don't think it's a crash per se, just an unresolved file lock or similar. I haven't noticed whether there's any relationship to running multiple profiles. I am almost always running several at a time, and the issue only occurs sometimes. It has no (other) negative side effects, as far as I can tell, but it was unsettling at first.
I'm on macOS also, and I launch from the command line (effectively, I actually have separate launchers for each profile, but they just run a shell script with different arguments).
Honestly, I've been blaming MacOS for it since other apps also crashed at the same time (the first time it was Microsoft Intune, the second time it was Slack - I doubt either uses Firefox internally). I don't recall seeing a Firefox crash on my personal laptop running Linux at any point in the past few years.
My guess is that it's trying to obtain or release a filesystem lock, possibly one that it's lost track of in some trivial way.
I've never seen any damage or inconsistencies in the resulting environment. So I don't think it's a dramatic event, just a safety timer that isn't resolved correctly.
Probably a simple, dumb, but harmless bug.
Contrast that with the dreadful corporate-supplied Edge AI browser I have to use for one client, which seems to randomly close windows without being asked, and never seems to be able to restore them.
I was hinting in my original comment if these cases are contributing to crash reports in any capacity there is a small chance they could be misattributed towards the claims in the post, especially if memory is not freed correctly on shutdown. Even more so if any memory allocation is shared between processes / helpers.
If I quit normally, don't wait for the "timeout" and force quit I still get the crash report UI immediately which suggests to me something funky going on.
10% is a crazy high percentage to claim for bitflips.
bitflippin...
https://stackoverflow.com/questions/2580933/cosmic-rays-what...
This will bloat the code a bit.
Also, at the machine code level, a Boolean controlling a branch or a while loop often doesn't ever make it out of the flags register, where it'll only be a single bit anyway because that's how the hardware works. Not really changeable in software.
The only explanation I can see is if Firefox is installed on a user base of incredibly low quality hardware.
If Firefox itself has so few bugs that it crashes very infrequently, it is not contradictory to what you are saying.
I wouldn't be surprised if 99% of crashes in my "hello world" script are caused by bit flips.
https://www.memtest86.com/blacklist-ram-badram-badmemorylist...
The most expensive memory failure I had was of this sort, and frustratingly came from accidentally unplugging the wrong computer.
After this I did buy some used memory from a recycling center that had the sorts of problems you described and was able to employ them by masking off the bad regions.
Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.
https://github.com/prsyahmi/BadMemory
I've used it for many years. It only works around physical hardware faults, not timing errors: it helps if a RAM cell is damaged by radiation, but not if you're overclocking your RAM.
From what he's saying they run an actual memory test after a crash, too.
These are potential bitflips.
I found an issue only yesterday in firefox that does not happen in other browsers on specific hardware.
My guess is that the software is riddled with edge-case bugs.
As we know from Google and other papers, most of these 10% of flips will be caused by broken or marginal hardware, a good proportion of which could be weeded out by running a memory tester for a while. So if you do that, you're probably looking at a couple out of every hundred crashes being caused by bitflips in RAM. A couple more might be due to other marginal hardware. The vast majority are software.
How often does your computer or browser crash? How many times per year? About 2-3 for me that I can remember. So in 50 years I might save myself one or two crashes if I had ECC.
ECC itself takes about 12.5% overhead/cost. I have also had a couple of occasions where things have been OOM-killed or ground to a halt (probably because of memory shortage). Could be my money would be better spent on 10% more memory than on ECC.
People like to rave and rant at the greedy fatcats in the memory-industrial complex screwing consumers out of ECC, but the reality is it's not free and it's not a magical fix. Not when software causes the crashes.
Software developers like Linus get incredibly annoyed about bug reports caused by bit flips. Which is understandable. I have been involved in more than one crazy Linux kernel bug that pulled in hardware teams bringing up a new CPU that irritated the bug. And my experience would be far from unique. So there's a bit of throwing stones in glass houses there too. Software might be in a better position to demand improvement if it weren't responsible for most crashes by an order of magnitude...
"I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate."
You can't claim any percentage if you don't know what you are measuring. Based on his hot take, I can run an overclocked machine, have Firefox crash a few hundred thousand times a day, and he'll use my data to support his position. Further, see below:
First: A pre-text: I use Firefox, even now, despite what I post below. I use it because it is generally reliable, outside of specific pain points I mention, free, open source, compatible with most sites, and for now, is more privacy oriented than chrome.
Second: On both corporate and home devices, Firefox has consistently crashed more often than Chrome/Chromium/Electron-powered stuff. Only Safari on Windows beats it in terms of crashes, and Safari on Windows is hot garbage. If bit flips were causing issues, why are Chromium-based browsers such as Edge and Chrome so much more reliable?
Third: Admittedly, I do not pay close enough attention to know when Firefox sends crash reports, however, what I do know is that it thinks it crashes far more often than it does. A `sudo reboot` on linux, for example, will often make firefox think it crashed on my machine. (it didn't, Linux just kills everything quickly, flushes IO buffers, and reboots...and Firefox often can't even recover the session after...)
Fourth: some crashes ARE repeatable (see above), which means bit flips aren't the issue.
Just my thoughts.
Have we considered that maybe Firefox is the cause of bad memory?
/s
67k crashes / day
claim: "Given # of installs is X, every install must be crashing several times a day"
We'll translate that to: "every install crashes 5 times a day"
67k crashes/day ÷ 5 crashes/install/day
≈ 13k installs
Your claim is there's 13k Firefox users? Lol
Granted, they're probably just as accurate as netcraft. /shrug
470k crashes in a single week, and this is under-reported! I bet the number of crashes is far higher. My snap Firefox on Ubuntu would lock up, forcing me to kill it from the system monitor, and this was never reported as a crash.
Once upon a time I wrote software for safety critical systems in C/C++, where the code was deployed and expected to work for 10 years (or more) and interact with systems not built yet. Our system could lose power at any time (no battery) and we would have at best 1ms warning.
Even if Firefox moves to Rust, it will not resolve these issues. 5% of their crashes could be coming from resource exhaustion, likely mostly RAM. Those crashes could be resolved tomorrow if Firefox simply checked how much RAM was available before trying to allocate it. That accounts for ~23k crashes a week. Madness.
With the RAM shortages and 8GB looking like it will remain the entry laptop norm, we need to start thinking more carefully about how software is developed.
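A graceful-degradation sketch of the check the parent comment suggests, in Python (purely illustrative; Firefox's real allocator paths are far more involved):

```python
def allocate_buffer(nbytes):
    """Try to grab a large buffer; signal failure instead of crashing.

    A caller that gets None can drop caches, unload background tabs,
    or retry with a smaller size -- anything beats taking the whole
    process down. (Hypothetical sketch, not Firefox's actual code.)
    """
    try:
        return bytearray(nbytes)
    except (MemoryError, OverflowError):
        return None

# A sane request succeeds; an absurd one fails cleanly instead of crashing.
assert allocate_buffer(1024) is not None
assert allocate_buffer(1 << 62) is None
```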
I find this impossible to believe.
If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.
>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.
Might be the case, but 10% is still huge.
There imo has to be something else going on. Either their userbase/tracking is biased or something else...
Browsers, videogames, and Microsoft Excel push computers really hard compared to regular applications, so I expect they're more likely to cause these types of errors.
The original Diablo 2 game servers for battle.net, which were Compaq 1U servers, failed at astonishing rates due to their extremely high utilization and consequent heat-generation. Compaq had never seen anything like it; most of their customers were, I guess, banking apps doing 3 TPS.
The more RAM you have, the higher the probability that there will be some bad bits. And the more RAM a program uses, the more likely it will be using some that is bad.
Same phenomenon with huge hard drives.
Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!
We'd save the test result to the registry and include the result in automated bug reports.
The common causes we discovered for the problem were:
- overclocked CPU
- bad memory wait-state configuration
- underpowered power supply
- overheating due to under-specced cooling fans or dusty intakes
These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.
Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.
And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.
Sometimes I'm amazed that computers even work at all!
Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.
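The self-test scheme described above is easy to recreate; here's a minimal Python sketch (the constants and the particular loop are my own invention, not ArenaNet's actual test):

```python
import hashlib
import struct

def math_selftest(seed=1.5, n=10000):
    """Run a deterministic float-heavy loop and hash the results.

    On healthy hardware the digest never changes; a flipped bit in the
    CPU, cache, or the RAM holding `out` changes it.
    """
    x = seed
    out = []
    for i in range(n):
        x = (x * 1.0000001 + i) % 1e9
        out.append(x)
    return hashlib.sha256(struct.pack(f"<{n}d", *out)).hexdigest()

# Recorded once on a known-good machine, then compared on every run.
KNOWN_GOOD = math_selftest()
assert math_selftest() == KNOWN_GOOD  # a mismatch means: suspect hardware
```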
Case in point: I was getting memory errors on my gaming machine that persisted even after replacing the sticks. It caused a Windows bluescreen maybe once a month, so I kinda lived with it since I couldn't afford to replace the whole setup (I theorized something on the motherboard was wrong).
Then my power supply finally died (it was cheap-ish, not the cheapest, but it already had a few years on it). I replaced it, and lo and behold, the memory errors were gone.
I had a PC come to me that would boot fine, but if you opened the CD drive it'd shut off instantly.
- Firefox may be more prevalent on those using Linux, since FF is less “corporate” than Chrome or Edge.
- People using Linux are probably putting Linux on old machines that had versions of Windows that are no longer supported.
However, what I can't say next is "PSUs would get old and stop putting out as much power," because that doesn't tend to happen. They just die.
Those running Linux on some old tower may hook up too many devices to an underpowered PSU which could cause problems, but I doubt this is the norm.
If it’s not PSUs, what is it? It’s not electromagnetic radiation doing the bitflipping because that’s too rare.
Maybe bitflips could be caused by low-quality peripherals.
People also don’t vacuum out laptops like they used to vacuum out towers and desktops, so maybe it’s dust.
Or maybe it’s all a ruse and FF is buggy, but they don’t have time to figure it out.
Maybe for Linux noobs. But I would suggest that most Linux users are not noobs booting a disused Pentium from a live CD. They are running Linux on the same hardware as Windows users. I would further suggest that, since anyone installing a not-Windows OS is more tech-savvy than average, Linux users actually take better care of their machines. Linux users take pride in their machines, whereas the average Windows user barely knows that computers have fans.
Ask any Linux user for their specifications and they will quote system reports and memory figures like Marisa Tomei discussing engine timings. Ask a random Windows user and they will probably start with the name of the store that sold it.
So much for taking pride in my machine :)
I did basically the same thing recently when I built an AI rig. I tried to put it in a server rack case but the fan noise was too much. So I ditched the rack and put in an open mining frame.
Could also fake spike to force the other team’s healer to waste their good heal on the wrong player while you downed the real target. Good times.
https://wiki.guildwars.com/wiki/Guild_Wars_Reforged
It did rekindle my love for the game, but most outposts are empty, even in the international districts, so I think it's hard to get hooked on it for new joiners.
But when you take a bird's eye view, it's interesting and great to see how over the years, games where you can build your own games remain popular and a common entryway into software development.
But also how Epic went from ZZT via Unreal to Fortnite, with the latter now being another platform (or what Zucc wanted to call a metaverse) for creativity.
Other notable mentions off the top of my head where people can build or invent their own games (in-game, via an external editor or through community support) or go crazy in besides Roblox are Second Life (...I think), LittleBigPlanet, Warcraft/Starcraft (which led to the genre of MOBAs), Geometry Dash, Mario Maker, TES, Source engine games, Minecraft, etc etc.
I also was introduced to programming through Roblox.
As an aside, Apple and Google's phone home crash reports is a really good system and it's one factor that makes mobile app development fun / interesting.
Unfortunately I've never looked at crashes this way when I worked at VKontakte because there were just too many crashes overall. That app had tens of millions of users so it crashed a lot in absolute numbers no matter what I did.
In an app with >billion users you get all kinds of wild stuff.
GPS location and movement data is what gives Google maps its near-real-time view of traffic on all roads, and busy-ness of all shops.
I think they collect location data from people riding public transport so they can tell you how long people wait on average at bus stops before getting on a bus.
Does Google collect atmospheric pressure readings from phone altimeters and use it for weather models? Could they?
Kindle collects details on books people read, how far they read, where they stop, which sections they highlight and quote, which words they look up in dictionaries.
I wonder if anyone’s curated a list of things like this which do happen or have been tried, excluding the “gathers user data for advertising” category which would become the biggest one, drowning out everything else.
I think current phones use accelerometer data to detect possible car crashes and call emergency services. Google could use that in aggregate to identify accident blackspots but I don’t know if they do. But that would be less useful because the police already know everywhere a big accident happens because people call the police. So that’s data easily found a different way.
I don't know whether you mean it's a shame that people consider it spyware, or if you meant that it's a shame that it manifests as spyware typically. I agree with the latter, not the former. It usually is spyware. If companies went for simple opt-in popups with a brief description of the reasoning, I'd be all for that. I sometimes opt-in to these requests myself, despite being a fairly privacy-conscious person, because I understand the benefit they have to the people collecting the data for good purposes. But when surveillance is opt-out (or no choice given), it's just spyware.
I asked to put the spyware aside for one sub-thread and focus on the astonishing worldwide sensor array, and you talked about the spyware and nothing else.
I've had plenty of servers with faulty ECC DIMMs that didn't trigger any alerts, and would only show faults under actual memory testing. I had a hard time convincing some of our admins the first time ("no ECC faults, you can't be right") but I won the bet.
Edit: very old paper by google on these topics. My issues were 6-7 years ago probably.
https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
There's a pretty good set of diagrams and descriptions of the faults in this paper: https://dl.acm.org/doi/10.1145/3725843.3756089
Also to the parent: there's an updated public paper on DDR4 era fault observations https://ieeexplore.ieee.org/document/10071066
I suppose what winds up where is up to the memory controller, but (for DDR5) in each BL16 transaction beat you're usually getting 32 bits of data and 8 bits of ECC (per sub-channel). Those ECC bits are usually called check bits CB[7:0], and they accompany the data bits DQ[31:0].
If you're talking about transactions for LPDDR, things are a bit different there, though, as the ECC has to be transmitted in-band with your data.
We scanned all our machines following this ( a few thousand servers ) and found out that ram issues were actually quite common, as said in the paper.
EDIT: took a look at the paper you linked and it basically says the same thing I did. The probability of these cases becomes increasingly small, and while ECC would indeed not reduce it to _zero_, it would greatly reduce it.
We also had a few thousand physical servers with about a terabyte of RAM each.
You are right: we did see repaired errors, but we also saw (indirectly, and after testing) unrepaired ones.
But the initial discussion was whether ECC RAM makes it go away, and your point is that it doesn't. And the vast, vast majority of the errors, according to my understanding and to the paper you pointed to, are repairable. About 1 out of 400-ish errors is non-repairable. That's a huge improvement! If you had ECC RAM, the failures Firefox sees here would drop from 10% to 0.025%! That is highly significant!
Even better: you would now be informed of 2-bit errors! You would _know_ what is wrong.
You could have 3(!) bit errors and this you might not see, but they'd be several orders of magnitude even rarer.
So yes, it would not 100% go away, but 99.9 % go away. That's... Making it go away in my book.
And last but not least, this paper mentions uncorrectable errors. It says nothing of undetectable ECC errors! You said _undetectable_ errors. I'm sure they happen, but I would be surprised if you see any meaningful incidence of this, even at terabytes of data. It's probably on the order of 0.000625 of the errors you can get (but if you want I can do more solid math).
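Spelling out the back-of-envelope math (the 1-in-400 figure is my rough read of the paper, not a precise number):

```python
# Fraction of Firefox crashes attributed to bitflips, per the post.
bitflip_share = 0.10
# Rough share of memory errors ECC cannot repair (my 1-in-400-ish estimate).
uncorrectable = 1 / 400

# Share of crashes that would remain if every machine had ECC.
residual = bitflip_share * uncorrectable
assert abs(residual - 0.00025) < 1e-12  # i.e. 0.025% of crashes
```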
I think we diverge on ‘making it go away in my book’.
When you're the one having to debug all these bizarre things (there were real money numbers involved, so these things mattered), over millions of jobs every day, rare events with low probability don't disappear - they just happen and take time to diagnose and fix.
So in my book ecc improves the situation, but I still had to deal with bad dimms, and ecc wasn’t enough. We used not to see these issues because we already had too many software bugs, but as we got increasingly reliable, hardware issues slowly became a problem, just like compiler bugs or other elements of the chain usually considered reliable.
I fully agree that there are lots of other cases where this doesn’t matter and ecc is good enough.
Thanks for taking the time to reply !
But this is sort of the march of nines.
My knee-jerk reaction to blaming bit flips is "naaah". Mostly because it's such a convenient scapegoat. It happens, I'm sure, but it would not be the first explanation I reach for. I once heard someone blame a bug that happened multiple times on "cosmic rays". You can imagine how irked I was at the dang cosmic rays hitting the same data with such consistency!
Anyways, I'm sorry if my tone sounded abrasive, I, too, have appreciated the discussion.
No you were not abrasive at all - I’ve learned to assume good faith in forum conversations.
In retrospect I should have started by giving the context ( march of 9s is a good description) actually, which would have made everything a lot clearer for everyone.
E.g. EU enforced mandatory USB-C charging from 2025, and pushes for ending production of combustion engine cars by 2035. Why not just make ECC RAM mandatory in new computers starting e.g. from 2030?
AMD is already one step away from being compliant. So, it's not an outlandish requirement. And regulating will also force Intel to cut their BS, or risk losing the market.
With no regulations in place, companies would rather innovate in profit extraction rather improving technology. And if they have enough market capture, they may actually prefer to not innovate, if that would hurt profits.
Computers also aren't used much these days, and phones and tablets don't have ECC.
Also, computers may not be used enough for cosmic rays to be a risk factor, but they're still susceptible to rowhammer-style attacks, which ECC memory makes much harder.
Finally, if you account for the current performance loss due to rowhammer counter-measures, the extra cost of ECC memory is partially offset.
It's possible DDR6 will help. If it gets the ability to do ECC over an entire memory access like LPDDR, that could be implemented with as little as 3% extra chip space.
DDR5 ECC RDIMMs (R=registered) have 16 extra data lines. From the specifications for Kingston's KSM64R52BS8-16HA [1]:
> x80 ECC (x40, 2 independent I/O sub channels)
On the other hand ECC UDIMMs (U=unbuffered) have only 8. From the specifications for Kingston's KSM56E46BS8KM-16HA [2]:
> x72 ECC (x36, 2 independent I/O sub channels)
Though if I remember correctly, the specifications for the older DDR4 ECC RDIMMs mention only 72 bits.
[1]: https://www.kingston.com/datasheets/KSM64R52BS8-16HA.pdf
[2]: https://www.kingston.com/datasheets/KSM56E46BS8KM-16HA.pdf
AMD has been better on it but BIOS/mobo vendors not so much
Most motherboards supported both, and the choice of which to use came down to the cost differential at the time of building a particular machine. The wild swings in DRAM prices meant that this could go from being negligible to significant within the course of a year or two!
When 72 pin SIMMs were introduced, they could in theory also come in a parity version but in reality that was fairly rare (full ECC was much better, and only a little more expensive). I don't think I ever saw an EDO 72 pin SIMM with parity, and it simply wasn't an option for DIMMs and later.
Also, in a game, there is a tremendously large chance that any particular bit flip will have exactly 0 effect on anything. Sure you can detect them, but one pixel being wrong for 1/60th of a second isn't exactly ... concerning.
The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully. There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.
Nobody does
> There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.
And again, nobody does except stuff that goes to space and a few critical machines. The closest a normal user will get to code written like that is probably car ECUs; there are even automotive-targeted MCUs that not only run ECC but also run two cores in lockstep and crash if they disagree.
It boils down to exception handling, you don't expect all of your bugs or security vulnerabilities to be known and write your code to be able to react to unplanned states without crashing. Bugs or security vulnerabilities can look a lot like a cosmic ray... a buffer overflow putting garbage in unexpected memory locations vs a cosmic ray putting garbage in unexpected memory locations... a lot of the mitigations are quite the same.
I’m aware of code that detects bit flips via unreasonable value detection (“this counter cannot be this high so quickly”). What else is there?
If they're hundreds of bytes or more, then two copies plus two hashes will do a better job.
If you're doing something that's more centralized then one hash might be simpler, but if you're centralized then you should probably use your own error correction codes instead of having multiple copies.
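A sketch of the two-copies-plus-hashes idea in Python (illustrative only; real radiation-hardened code would also protect the hashes and the code paths themselves):

```python
import hashlib

def store(data: bytes):
    """Keep two independent copies of critical state, each with its hash."""
    return [[bytes(data), hashlib.sha256(data).digest()],
            [bytes(data), hashlib.sha256(data).digest()]]

def load(replicas):
    """Return the first copy whose hash still verifies, else None."""
    for copy, digest in replicas:
        if hashlib.sha256(copy).digest() == digest:
            return copy
    return None  # both copies corrupted -- but at least we *know*

reps = store(b"critical save state")
corrupted = bytearray(reps[0][0])
corrupted[0] ^= 0x01                  # simulate a single bit flip
reps[0][0] = bytes(corrupted)
assert load(reps) == b"critical save state"  # second copy rescues us
```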
That was in the 2000s though, and for embedded memory above 65nm. I would expect smaller sizes to be more error-prone.
But there's volatile and nonvolatile memory all over in a computer and anywhere data is in flight be it inside the CPU or in any wires, traces, or other chips along the data path can be subject to interference, cosmic rays, heat or voltage related errors, etc.
The number of bits in registers, busses, cache layers is very small compared to the number in RAM. Obviously they might be hotter or more likely to flip.
I read this a decade ago... https://www.codeofhonor.com/blog/whose-bug-is-this-anyway
I eventually discovered with some timings I could pass all the usual tests for days, but would still end up seeing a few corrected errors a month, meaning I had to back off if I wanted true stability. Without ECC, I might never have known, attributing rare crashes to software.
From then on I considered people who think you shouldn't overclock ECC memory to be a bit confused. It's the only memory you should be overclocking, because it's the only memory where you can prove you don't have errors.
I found that DDR3 and DDR4 memory (on AMD systems at least) had quite a bit of extra “performance” available over the standard JEDEC timings. (Performance being a relative thing, in practice the performance gained is more a curiosity than a significant real life benefit for most things. It should also be noted that higher stated timings can result in worse performance when things are on the edge of stability.)
What I’ve noticed with DDR5, is that it’s much harder to achieve true stability. Often even cpu mounting pressure being too high or low can result in intermittent issues and errors. I would never overclock non-ECC DDR5, I could never trust it, and the headroom available is way less than previous generations. It’s also much more sensitive to heat, it can start having trouble between 50-60 degrees C and basically needs dedicated airflow when overclocking. Note, I am not talking about the on chip ECC, that’s important but different in practice from full fat classic ECC with an extra chip.
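On Linux you can watch those corrected errors accumulate via EDAC's sysfs counters; a small sketch (the sysfs path is where EDAC usually lives on my machines - it may be absent without ECC or the right driver loaded):

```python
import glob

def parse_count(text: str) -> int:
    """EDAC counter files hold a single decimal number plus a newline."""
    return int(text.strip())

def corrected_error_counts(base="/sys/devices/system/edac/mc"):
    """Map each memory controller's ce_count file to its current value."""
    counts = {}
    for path in glob.glob(f"{base}/mc*/ce_count"):
        with open(path) as f:
            counts[path] = parse_count(f.read())
    return counts

# On a machine without EDAC this just returns {} rather than failing.
print(corrected_error_counts())
```

A value that keeps climbing at a given overclock is the "few corrected errors a month" signal described above.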
I hate to think of how much effort will be spent debugging software in vain because of memory errors.
Some of the (legitimately) extreme overclockers have been testing what amounts to massive hunks of metal in place of the original mounting plates because of the boards bending from mounting pressure, with good enough results.
On top of all of this, it really does not help that we are also at the mercy of IMC and motherboard quality too. To hit the world records they do and also build 'bulletproof', highest performance, cost is no object rigs, they are ordering 20, 50 motherboards, processors, GPUs, etc and sitting there trying them all, then returning the shit ones. We shouldn't have to do this.
I had a lot of fun doing all of this myself and hold a couple very specific #1/top 10/100 results, but it's IMHO no longer worth the time or effort and I have resigned to simply buying as much ram as the platform will hold and leaving it at JEDEC.
If we had a time series graph of this data, it might be revealing.
Every single time I've had someone pay me to figure out why their build isn't stable, it's always some combination of cheap power supply with no noise filtering, cheap motherboard, and poor cooling. Can't cut corners like that if you want to go fast. That is to say, I've never encountered "almost ok" memory. They're quite good at validation.
I now just run at the standard 5600MHz timing, I really don't find the potential stability trade off worth it. We already have enough bugs.
This attitude is entirely corporate-serving cope from Intel to serve market segmentation. They wanted to trifurcate the market between consumers, business, and enthusiast segments. Critically, lots of business tasks demand ECC for reliability, and business has huge pockets, so that became a business feature. And while Intel was willing to sell product to overclockers[0], they absolutely needed to keep that feature quarantined from consumer and business product lines lest it destroy all their other segmentation.
I suspect they figured a "pro overclocker" SKU with ECC and unlocked multipliers would be about as marketable as Windows Vista Ultimate, i.e. not at all, so like all good marketing drones they played the "Nobody Wants What We Aren't Selling" card and decided to make people think that ECC and overclocking were diametrically opposed.
[0] In practice, if they didn't, they'd all just flock to AMD.
only when AMD had better price/performance, not because of ECC. At best you have a handful of homelabbers who went with AMD for their NAS, but approximately nobody who cares about performance switched to AMD for ECC RAM, because ECC RAM also tends to be clocked lower. Back in the Zen 2/3 days the choice was basically DDR4-3600 without ECC, or DDR4-2400 with ECC.
Any workstation where you are getting serious work done should use ECC
P.S. GW1 remains one of my favorite games and the source of many good memories from both PvP and PvE. From fun stories of holding the Hall of Heroes to some unforgettable GvG matches, y'all made a great game.
I remember in the earlier builds we only had a “heal area” spell, which would also heal monsters, and no “resurrect” spell, so it was always a challenge to take down a boss and not accidentally heal it when trying to prevent a player from dying.
No, seriously, did you actually verify the code for correctness before relying on its results?
For that one I'd guess no, because under normal circumstances hot locations like that will stay in cache.
Funny you say this, because for a good while I was running OC'd RAM
I didn't see any instability, but Event Viewer was a bloodbath - reducing the speed a few notches stopped the entries (iirc 3800MHz down to 3600)
Oh god yes… Dell OptiPlexes and bad caps went together in those days. I’m half convinced Valve put the gray towers in Counter-Strike so IT employees wasting time could shoot them up for therapy.
The vast majority of crashes came from two buckets:
1. PCs running below our minimum specs
2. Bugs in MSI Afterburner.
Do you mean the OSD?
I dialed the machine back to the rated speed but it failed completely within 6 months.
We need GW3 already but my fear is mmo as a genre is dying.
Price itself doesn't cause problems; the cause is bad design, or false or incomplete data on datasheets, or both. Please STOP spreading this narrative. The right thing is to make ads, datasheets, marketing materials, etc. tell you the truth that you need to make a proper decision as a client/consumer.
General memory-testing programs like memtest86 or memtester write patterns (including random bits) into memory and verify them.
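The core idea fits in a few lines; a toy Python version (real testers like memtest86 use many patterns, address-dependent data, and raw physical memory, which this obviously doesn't):

```python
def pattern_test(buf: bytearray, pattern: int = 0xA5) -> list:
    """Fill a buffer with a known byte pattern, read it back,
    and return the offsets of any mismatching bytes."""
    for i in range(len(buf)):
        buf[i] = pattern
    return [i for i, b in enumerate(buf) if b != pattern]

# An in-process buffer should (almost) always verify clean.
assert pattern_test(bytearray(4096)) == []
```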
Yikes. Dude, you're getting a Packard Bell.
The Turion64 was the worst CPU I've ever bought. Even 10-year-old games had rendering artefacts all over the place, triangle strips being "disconnected" and leading to big triangles appearing everywhere. It was such weird behavior, because it always happened around 10 minutes after I started playing. It didn't matter _what_ I was playing. Every game had rendering artefacts, one way or another.
The most obvious ones were 3d games like CS1.6, Guild Wars, NFSU(2), and CC Generals (though CCG running better/longer for whatever reason).
The funny part behind the VRAM(?) bitflips was that the triangles then connected to the next triangle strip, so you had e.g. large surfaces in between houses or other things, and the connections were always in the same z distance from the camera because game engines presorted it before uploading/executing the functional GL calls.
After that laptop I never bought these types of low budget business laptops again because the experience with the Turion64 was just so ridiculously bad.
God I love C/C++. It’s like job security for engineers who fix bugs.
I imagine the largest volume of game memory consumption is media assets, which wouldn't really matter if corrupted, and the storage requirement for protecting the genuinely important content would be reasonably negligible?
I don't think engineering effort should ever be put into handling literal bad hardware. But, the user would probably love you for letting them know how to fix all the crashing they have while they use their broken computer!
To counter that, we're LONG overdue for ECC in all consumer systems.
It significantly overlaps the engineering to gracefully handle non-hardware things like null pointers and forgetting to update one side of a communication interface.
80/20 rule, really. If you're thoughtful about how you build, you can get most of the benefits without doing the expensive stuff.
Better the user sees some lag due to state rebuild versus a crash.
Most consumers have what they have, and use what they have. Upgrading everything is now rare. If they got screwed, they'll remain screwed for a few years.