It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.
You mean like compilers and test suites? Very few professional workloads don't parallelize well these days.
(Though games these days scale better than they used to, but only up to a point.)
I find that most tools I write for my own use can be made to scale with cores, or run so fast that the overhead of starting threads is longer than the program runtime. But I write that in Rust which makes parallelism easy. If I wrote that code in C++ I would probably not bother with trying to parallelize.
I do hope the nuclear powerplant next door uses more fault tolerant hardware, though.
> I much prefer cheaper hardware.
The cost savings are modest; order of magnitude 12% for the DIMMs, and less elsewhere. Computers are already extremely cheap commodities.
Assuming that's more due to intentional market segmentation than actual cost, yeah I would pay 12% more for ECC. But I'm with the other guy on not valuing it a ton. I have backups which are needed regardless of bitrot, and even if those don't help, losing a photo isn't a huge deal for me.
That was me. It isn't "officially" supported by AMD, but it should work. You can enable EDAC monitoring in Linux and observe detected correction events happening.
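For reference, on Linux those EDAC counters live in sysfs; a minimal sketch of reading them (the sysfs layout is the standard EDAC one, but the `root` parameter and output format here are my own, added so the function is testable):

```python
from pathlib import Path

def edac_counts(root="/sys/devices/system/edac/mc"):
    """Corrected (ce) and uncorrected (ue) error counts per memory controller."""
    root = Path(root)
    counts = {}
    if not root.is_dir():  # no EDAC driver loaded, or not Linux
        return counts
    for mc in sorted(root.glob("mc*")):
        ce, ue = mc / "ce_count", mc / "ue_count"
        if ce.is_file() and ue.is_file():
            counts[mc.name] = (int(ce.read_text()), int(ue.read_text()))
    return counts

if __name__ == "__main__":
    for name, (ce, ue) in edac_counts().items():
        print(f"{name}: {ce} corrected, {ue} uncorrected errors")
```

A nonzero ce_count that keeps climbing is exactly the "detected correction events" the comment above describes.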
> Assuming that's more due to intentional market segmentation than actual cost
That's the argument, yeah.
Ironically, that's around the time Intel started making it difficult to get ECC on desktop machines using their CPUs. The Pentium 3 and 440BX chipset, maxing out at 1GB, were probably the last combo where it pretty commonly worked with a normal desktop board and normal desktop processor.
I'm not really sure if this makes it overall more or less reliable than DDR2/3/4 without ECC though.
Because you can't track on-die ECC errors, you have no way of knowing how "faulty" a particular DRAM chip is. And if there's an uncorrected error, you can't detect it.
It's "ECC", but not the ECC you want; marketing garbage.
I think this sort of reporting is a pretty basic feature that should come standard on all hardware. No idea why it's an "enterprise" feature. This market segmentation is extremely annoying and shouldn't exist.
I would definitely like to have a laptop with ECC, because obviously I don't want things to crash and I don't want corrupted data or anything like that, but I don't really use desktop computers anymore.
The main overhead is simply the extra RAM required to store the extra bits of ECC.
The main reason ECC RAM is slower is because it's not (by default) overclocked to the point of stability - the JEDEC standard speeds are used.
The other much smaller factors are:
* The tREFi parameter (refresh interval): refreshes usually run at double the frequency on ECC RAM, so that it handles high-temperature operation.
* The register chip buffers the command/address/control/clock signals, adding a clock of latency to every command (<1ns, much smaller than the typical memory latency you'd measure from the memory controller).
* ECC calculation (AMD states 2 UMC cycles, <1ns).
And there's non-random bit errors that can hit you at any speed, so it's not like going slow guarantees safety.
However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.
On the hardware level, there's an optional spec to checksum the link between the CPU and the memory. Since it's optional, many consumer machines do not implement it, so then they flip bits not in RAM, but on the lines between the RAM and the CPU.
It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.
That said, DIMM capacities keep increasing, so even a small per-bit chance of flips means lots of people will still be affected.
Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.
However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.
In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.
In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).
As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.
I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.
That's the thing. Bit flips impact everything memory-resident - that includes program code. You have no way of telling what instruction was actually read when executing the line your instrumentation may say corresponds to the MOV; or it may have been a legit memory operation, but instrumentation is reporting the wrong offset. There are some ways around it, but - generically - if a system runs a program bigger than the processor cache and may have bit flips, the output is useless, including whatever telemetry you use (because the telemetry is code executed from RAM and will touch RAM).
I have at least one motherboard that just re-auto-overclocks itself into a flaky configuration if boot fails a few times in a row (which can happen due to loose power cords, or whatever).
How do you know the number/proportion of users who run without telemetry enabled, since by definition you're not collecting their data?
(Not imputing any malice, genuinely curious.)
I never dug too deeply but the app is still running on some out of support iPads so maybe it's random bit flips.
Actually "dereferencing a pointer that had just passed a nil check" could be from a flow control fault where the branch fails to be taken correctly.
If that was caused by bad memory, I would expect other software to be similarly affected and hence crash with about comparable frequency. However, it looks like I'm falling more into the other 90% of cases (unsurprisingly) because I do not observe other software crashing as much as firefox does.
Also, this whole crashing business is a fairly recent effect - I've been running firefox for forever and I cannot remember when it last was as much of an issue as it has become recently for me.
Two years ago, I had Factorio crash once on a null pointer exception. I reported the crash to the devs and, likely because the crash site had a null check, they told me my memory was bad. Same as you, I said "wait, no other software ever crashed weirdly on this machine!", but they were adamant.
Lo and behold, one of my four RAM sticks indeed had a few bad addresses. Not many, something like 10-15 addresses tops. You need bad luck to hit one of those addresses when the total memory is 64GB. It's likely the null pointer check got flipped.
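A rough back-of-envelope for how plausible that is, under the simplifying assumption that a working set lands uniformly at random across the 64 GB (the address count comes from the anecdote above; real allocators are nothing like uniform):

```python
# ~15 bad byte addresses out of 64 GB of RAM.
TOTAL_BYTES = 64 * 2**30
BAD_ADDRESSES = 15

p_bad = BAD_ADDRESSES / TOTAL_BYTES  # chance any one byte is bad

def p_touched(working_set_bytes):
    """Chance a uniformly placed working set covers at least one bad byte."""
    return 1 - (1 - p_bad) ** working_set_bytes

print(f"per-byte probability: {p_bad:.2e}")
print(f"4 GB working set touches a bad byte: {p_touched(4 * 2**30):.1%}")
```

Even 15 bad addresses give a game with a multi-gigabyte working set a decent chance of eventually touching one, while most smaller programs never notice.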
Browsers are good candidates for finding bad memory: they eat a lot of RAM, they scatter data around, they keep large chunks of it live, and they have JITs where a lot of machine code gets loaded left and right.
Any memory allocation failing within the browser forces an instant crash unless the callsite explicitly opts in to handling the allocation failure.
"Check malloc failure" is an opt-out feature in browsers, not opt-in. It's the same in Chromium. Failing to check would cause too many security issues. (One more reason new stuff tends to prefer Rust, etc)
I once had a bitflip pattern causing lowercase ASCII to turn into uppercase ASCII in a case-insensitive system. Everything was fine until it tried to uppercase numbers, and things went wrong.
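For anyone wondering why numbers specifically break: ASCII upper- and lowercase letters differ only in bit 5 (0x20), so naive case conversion just clears that bit, which silently mangles digits. A small illustration:

```python
CASE_BIT = 0x20  # 'a' (0x61) and 'A' (0x41) differ only in this bit

def naive_upper(ch):
    """Buggy 'uppercasing' that just clears bit 5: fine for letters, not digits."""
    return chr(ord(ch) & ~CASE_BIT)

assert naive_upper('a') == 'A'
assert naive_upper('z') == 'Z'
assert naive_upper('1') == '\x11'  # digits become control characters
```

The same bit is why a stuck or flipped bit 5 in a data path looks exactly like a case change for letters, and like garbage for everything else.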
The first time I had to deal with faulty RAM (more than 20 years ago), the bug would never trigger unless I used pretty much the whole DIMM and put meaningful stuff in it; in my case, linking large executables or untarring large source archives.
Flipping a pixel had no impact though
My suspicion has always been some kind of a memory leak, but memory corruption also makes sense.
Unfortunately, Chrome (which I use for work - Firefox is for private stuff) has NEVER crashed on me yet. Certainly not in the past 5 years. Which is odd. I'm on Linux btw.
Without knowing more about your configuration, it's hard to give advice, but definitely worth trying with a clean profile first.
If you don't report this problem upstream it will never get fixed, as obviously no-one else is seeing this. Firefox has a built-in profiler that you can use to report performance problems like this.
How long are the inputs that you get problems with?
I would add Thunderbird to that list.
It used to be memory usage, now it's crashing.
Hint: No-one is claiming memory is to blame for 100% of the Firefox crashes. No-one is claiming it's 99% either.
Sorry, but I experienced first hand Firefox's memory leaks not being taken seriously. This "bitflips" news is just released, but I fully expect anybody complaining about Firefox crashes to be met with low effort "It's your RAM," responses for the next few years now.
Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].
There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.
edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.
[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...
[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568
[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...
[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...
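A sketch of the kind of heuristic [1] and [2] hint at: if a field that should hold a known magic constant differs from it by exactly one bit, a hardware flip is a far more likely explanation than a stray overwrite. (The magic value below is a placeholder, not Mozilla's actual poison constant, and this is my approximation of the idea, not their code.)

```python
def bit_distance(a, b):
    """Hamming distance between two integers."""
    return bin(a ^ b).count("1")

MAGIC = 0xDEADBEEF  # placeholder poison value, not Firefox's actual constant

def classify(observed, expected=MAGIC):
    d = bit_distance(observed, expected)
    if d == 0:
        return "intact"
    if d == 1:
        return "likely hardware bit-flip"
    return "likely software overwrite"

assert classify(0xDEADBEEF) == "intact"
assert classify(0xDEADBEEF ^ (1 << 17)) == "likely hardware bit-flip"
assert classify(0x00000000) == "likely software overwrite"
```

This works because a magic value with lots of bits set high and low is very unlikely to end up one bit away from itself via a random wild write.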
(see: https://github.com/mozilla-firefox/firefox/blob/main/toolkit..., which points to a specific commit in that repo - turns out to be tip of main)
He doesn't explain anything indeed but presumably that code is available somewhere.
I've seen a lot of confirmed bitflips with ECC systems. The vast majority of machines that are impacted are impacted by single event upsets (not reproducible).
(I worded that precisely but strangely because if one machine has a reproducible problem, it might hit it a billion times a second. That means you can't count by "number of corruptions".)
My take is that their 10% estimate is a lower bound.
Yes, that's a confounding factor, and in fact the starting assumption when looking at a crash. Sometimes you can be pretty sure it's hardware. For example, if it's a crash on an illegal instruction in non-JITted code, the crash reporter can compare that page of data with the on-disk image that it's supposed to be a read-only copy of. Any mismatches there, especially if they're single bit flips, are much more likely to be hardware.
But I've also seen it several times when the person experiencing the crashes engages on the bug tracker. Often, they'll get weird sporadic but fairly frequent crashes when doing a particular activity, and so they'll initially be absolutely convinced that we have a bug there. But other people aren't reporting the same thing. They'll post a bunch of their crash reports, and when we look at them, they're kind of all over the place (though as they say, almost always while doing some particular thing). Often it'll be something like a crash in the garbage collector while watching a youtube video, and the crashes are mostly the same but scattered in their exact location in the code. That's a good signal to start suspecting bad memory: the GC scans lots of memory and does stuff that is conditional on possibly faulty data. We'll start asking them to run a memory test, at least to rule out hardware problems. When people do it in this situation, it almost always finds a problem. (Many people won't do it, because it's a pain and they're understandably skeptical that we might be sandbagging them and ducking responsibility for a bug. So we don't start proposing it until things start feeling fishy.)
But anyway, that's just anecdata from individual investigations. gsvelto's post is about what he can see at scale.
ptr ^= (1ULL << rand_between(0, 63));
that got inserted in the code by accident. That's just not the way that we write software. If it's only hit once by a random person, memory starts being more likely.
(Unless that LOC is scanning memory or smth)
Also, in an unsafe language all bets are off. A memory clobber, UAF or race condition can generate quite strange and ephemeral crashes. Even if the majority of time it generates the “same” failure mode, it can still sporadically generate a rare execution trace. It’s best to stop thinking of these as deterministic processes and more as a distribution of possible outcomes.
This is a bit vague to really reply to very specifically, but yes, this is hard. Which is why quite some people work in this area. It's rather valuable to do so at Firefox-scale.
Even if the majority of time it generates the “same” failure mode, it can still sporadically generate a rare execution trace.
This doesn't matter that much because the "same" failure mode already allows you to see the bug and fix it.
Ask them to publish raw MCE and ECC dumps with timestamps correlated to crashes, or reproduce the failure with controlled fault injection or persistent checksums, because without that this reads like a hypothesis dressed up as a verdict.
Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.
I also find that firefox crashes much more than chrome-based browsers, but it is likely that chrome's apparent superior stability comes from better handling of the other 90% of crashes.
If 50% of chrome crashes were due to bit flips, and bit flips affect the two browsers at basically the same rate, that would indicate that chrome experiences 1/5th the total crashes of firefox... even though the bit flip crashes happen at the same rate on both browsers.
It would have been better news for firefox if the number of crashes due to faulty hardware were actually much higher! These numbers indicate the vast majority of firefox crashes are actually from buggy software : (
I can’t say the same for Chromium. Despite barely using it, I had at least one tab or iframe crash last year, and there’s a moderate chance (I’ll suggest 15%) on any given day of leaving it open that it will just spontaneously die while I’m not paying attention to it (my wild guess, based on observations about Inkscape if it’s executing something CPU-bound for too long: it’s not responding in a timely fashion to the compositor, and is either getting killed or killing itself, not sure which that would be).
Frankly, from a crashing perspective, both are very reliable these days. Chromium is still far more prone to misrendering and other misbehaviour—they prefer to ship half-baked implementations and fix them later; Firefox, on the other hand, moves slower but has fewer issues in what they do ship.
Edit: more context, I power cycle at least once a week on desktop and the version is typically a bit behind new. I also don't have more tabs open than will fit in the row. All these habits seem likely to decrease crashes.
I've tried all kinds of things software-wise but keep getting random crashes.
I wonder if I should do a longer memory test, maybe some CPU stress testing at the same time...
Or you can view several of them and see if there's a common pattern in the "Signature" field. Firefox really should only be regularly crashing if: (1) there's a real bug and you keep hitting whatever triggers it, (2) you're running out of memory, or (3) you have faulty hardware.
I don't know what the odds of faulty hardware are for a randomly chosen user, but they're much higher for a randomly chosen user who is seeing regular crashes.
this may have something to do with the fact that my laptop is from 2017, however.
I agree. Good thing he doesn't back up his claim with any sort of evidence or reasoned argument, or you'd look like a huge moron!
> And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.
The actual measurement is 5%. The 10% figure is entirely made up, with zero evidence or reasoned argument except a hand-wavy "conservative".
Edit: actually, the claim is even less supported:
> out of these ~25000 crashes have been detected as having a potential bit-flip. That's one crash every twenty potentially caused by bad/flaky memory
"Potential" is a weasel word here. We don't see any of the actual methodology. For all we know, the real value could be 0.1% or 0.01%.
The hardware bugs are there. They're just handled.
Live objects get swapped between Discardable buffers quite frequently. They're not expected to stay at the same position in memory.
RAM flips are common. This kind of thing is old and has likely gotten worse.
IBM had data on this. DEC had data on this. Amazon/Google/Microsoft almost certainly had data on this. Anybody who runs a fleet of computers gets data on this, and it is always eye opening how common it is.
ZFS is really good at spotting RAM flips.
Perhaps you're part of the group driving hardware crashes up to 10% and need to fix your machine.
> Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.
That's a misinterpretation. The finding refers to the composition of crashes, not the overall crash rate (which is not reported by the post). Brought to the extreme, there may have been 10 (reported) crashes in history of Firefox, and 1 due to faulty hardware, and the statement would still be correct.
Hardware problems are just as good a potential explanation for those as anything else.
Turned out at their altitude cosmic rays were flipping bits in the top-most machines in the racks, sometimes then penetrating lower and flipping bits in more machines too.
I also use a bunch of other extensions though, dark reader, vimium, sideberry... I'd expect me to be a bit more exposed than the average user. Yet it's just rock stable for me. Maybe it just works better on linux?
1: I know this because I installed https://addons.mozilla.org/en-US/firefox/addon/tab-counter-p... to check :)
2: However after finding Karakeep I don't actually have 1000 tabs anymore!
But non-ECC is fine for most of us mortals gaming and streaming.
I would expect pro gamers to opt for ECC though.
[1] https://www.corsair.com/us/en/explorer/diy-builder/memory/is...
>> DDR5 technology comes with an exclusive data-checking feature that serves to improve memory cell reliability and increase memory yield for memory manufacturers. This inclusion doesn't make it full ECC memory though.
"Proper" ECC has a wider memory bus, so the CPU emits checksum bits that are saved alongside every word of memory, and checked again by the CPU when memory is read. E.g. a 64 bit machine would actually have 72 bit memory.
DDR5 "ECC" uses error correction only within the memory stick. It's there to reduce the error rate, so otherwise unacceptable memory is usable - individual cells have become so small that they are no longer acceptably reliable by themselves!
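The 72-bit arrangement is classic SECDED: the check-bit count follows from the Hamming bound plus one overall parity bit for double-error detection. A quick sketch of the arithmetic (illustrative only, not how a memory controller is actually wired):

```python
def secded_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction),
    plus one extra overall parity bit for double-error detection."""
    r = 1
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1

for k in (8, 16, 32, 64, 128):
    r = secded_check_bits(k)
    print(f"{k:4d} data bits -> {r} check bits ({r / k:.1%} overhead)")
```

For 64 data bits this gives 8 check bits, i.e. the 12.5% overhead and the 72-bit bus mentioned above; note the relative overhead shrinks as the protected word gets wider.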
DDR4 is not fully reliable memory either.
This is common for many high speed electrical engineering challenges: Running a slightly higher error rate option with ECC on top can have an overall lower error rate at higher throughput than the alternative of running it slow enough to push the error rate down below some threshold.
It makes some people nervous because they don’t like the idea of errors being corrected, but the system designers are looking at overall error rates. The ECC is included in the system’s operation so it isn’t something that is worthwhile to separate out.
A bit error rate of one per billion with a parity bit on each packet is much more reliable than an undetectable bit error rate of one per trillion.
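To put rough numbers on that claim (the 512-bit packet size is an assumption; parity misses even numbers of flips, so the dominant undetected case is exactly two flips in one packet):

```python
from math import comb

def p_undetected_with_parity(p, n):
    # Parity catches any odd number of flips; dominant silent case is 2 flips.
    return comb(n, 2) * p**2 * (1 - p) ** (n - 2)

n = 512                 # bits per packet, an assumed size
p_fast = 1e-9           # raw bit error rate of the fast, parity-protected link
p_slow = 1e-12          # raw bit error rate of the slow link with no detection

undetected_fast = p_undetected_with_parity(p_fast, n)
undetected_slow = 1 - (1 - p_slow) ** n  # without parity, every flip is silent

print(f"fast link + parity, silent errors per packet: {undetected_fast:.2e}")
print(f"slow link, no parity, silent errors per packet: {undetected_slow:.2e}")
```

Under these assumptions the noisier-but-checked link ends up with a silent error rate thousands of times lower than the "cleaner" unchecked one, which is the whole point.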
"ECC" does not give you fully reliable RAM. Uncorrectable errors (UEs) are still observed.
What's the chance of failure? If you have one device that achieves equal performance with less reliable cells plus redundancy, and another that uses more reliable cells without redundancy, it's not really any different.
NAND is horribly flaky, cell errors are a matter of course. You could buy boutique NOR or SLC NAND or something if you want really good cells. You wouldn't though, because it would be ruinously expensive, but also it would not really give you a result that an SSD with ECC can't achieve.
But I don’t really know what the Firefox team does with crash reports and in making Firefox almost crash proof.
I have been using it at work on Windows and for the last several years it always crashes on exit. I have religiously submitted every crash report. I even visit the “about:crashes” page to see if there are any unsubmitted ones and submit them. Occasionally I’ll click on the bugzilla link for a crash, only to see hardly any action or updates on those for months (or longer).
Granted that I have a small bunch of extensions (all WebExtensions), but this crash-on-exit happens due to many different causes, as seen in the crash reports. I'm loath to troubleshoot by disabling all extensions and then trying them one by one. Why should an extension even cause a crash, especially when it's a WebExtension (unlike the older XUL extensions that had deeper integration into the browser)? It seems like there are fundamental issues within Firefox that make it crash prone.
I can make Firefox not crash if I have a single window with a few tabs. That use case is anyway served by Edge and Chrome. The main reasons I use Firefox, apart from some ideological ones, are that it’s always been much better at handling multiple windows and tons of tabs and its extensibility (Manifest V2 FTW).
I would sincerely appreciate Firefox not crashing as often for me.
> this crash-on-exit happens due to many different causes, as seen in the crash reports
It points in the same direction: all these different causes are just symptoms, the root cause is hiding deeper, and it is triggered by Firefox shutting down.
All of this is no guarantee that the root cause is bitflips, but you can rule that out by testing your memory.
I think our education system should include a unit on "marketing bullshit" sometime early in elementary school. Maybe as part of math class, after they learn inequalities. "Ok kids, remind me, what does 'up to' mean?" "less than or equal to!"
I can certainly imagine that a very small fraction of Firefox users are generating these results, so that bit flips are not a problem generally.
Has to be normalized, and outliers eliminated in some consistent manner.
Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-committed.
Having the number of unique machines would be great to see how skewed this estimate is.
Certain data is more sensitive as well and requires extra protection. Pointers and indexes obviously, which might send the whole application on a wild goose chase around memory. But also machine code, especially JIT-generated traces, is worth checksumming and verifying before executing it.
It is not that simple, it does not only depend on the hardware but also the code. It is like a race, what happens first - you hit a bug in the code or your hardware glitches? If the code is bug free, then all crashes will be due to hardware issues, whether faulty hardware or stray particles from the sun. When the code is one giant bug and crashes immediately every time, then you will need really faulty hardware or have to place a uranium rod on top of your RAM and point a heat gun at your CPU to crash before you hit the first bug, i.e. almost all crashes will be due to bugs.
So what you observe will depend on the prevalence of faulty hardware and how long it takes to hit a hardware issue vs how buggy the code is and how long it takes to hit a bug.
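That race can be modeled as two competing memoryless failure processes; a small simulation (the rates are made up purely for illustration):

```python
import random

def crash_mix(bug_rate, hw_rate, trials=100_000, seed=0):
    """Fraction of crashes attributable to hardware when software bugs and
    hardware glitches are both memoryless processes racing each other."""
    rng = random.Random(seed)
    hw_wins = sum(
        rng.expovariate(hw_rate) < rng.expovariate(bug_rate)
        for _ in range(trials)
    )
    return hw_wins / trials

# Buggy code: a software crash every ~2 hours, a hardware glitch every ~2000 hours.
print(crash_mix(bug_rate=1 / 2, hw_rate=1 / 2000))      # ~0.001: hardware is noise
# Nearly bug-free code, same hardware:
print(crash_mix(bug_rate=1 / 2000, hw_rate=1 / 2000))   # ~0.5: half are hardware
```

Analytically the hardware share is hw_rate / (hw_rate + bug_rate), which is exactly the point above: the buggier the software, the smaller the fraction of observed crashes that hardware gets blamed for, even though the hardware glitch rate never changed.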
https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-...
We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...
They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.
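Scaled naively to a single 16 GB machine, those fleet-average figures are startling (keep in mind the paper found errors heavily concentrated in a minority of bad DIMMs, so the fleet average is not the typical machine):

```python
def correctable_errors_per_year(rate_per_1e9_device_hours_per_mbit, gigabytes):
    """Naive scaling of the fleet-average rate to one always-on machine."""
    mbit = gigabytes * 8 * 1024
    hours = 24 * 365
    return rate_per_1e9_device_hours_per_mbit * mbit * hours / 1e9

for rate in (25_000, 70_000):
    n = correctable_errors_per_year(rate, gigabytes=16)
    print(f"rate {rate:>6}: ~{n:,.0f} correctable errors/year on a 16 GB machine")
```

Even the low end of the range works out to tens of thousands of correctable events per machine-year at the fleet average, which is why the "errors are rare" intuition doesn't survive contact with large-scale data.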
A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.
There are power supplies that are mildly defective but got past QC.
There are server designs where the memory is exposed to EMI and voltage differences that push it ever so slightly out of spec, past what QC caught.
Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.
There are a ton of causes for bitflips other than cosmic rays.
For instance, that specific google paper you cited found a 3x increase in bitflips as datacenter temperature increased! How confident are you the average Firefox user's computer is as temperature-controlled as a google DC?
It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.
The sentiment was always that ECC is a waste and a scam. My goodness, the unhinged posts from people who thought it was a trick and couldn't fathom that you don't know you're having bits flipped without it. "It's a rip off", without even looking and seeing that the price was often just that of the extra chip.
I've discussed it for 20 years since the first Mac Pro and people just did not want to hear that it had any use. Even after the Google study.
Consumers giving professionals advice. Was same with workstation graphics cards.
I wonder sometimes if we shouldn't be doing like NASA does and triple-storing values and comparing the calculations to see if they get the same results.
That's different from what you're suggesting, because you're right that the crash reports are analyzed with heuristics to guess at memory corruption. Aside from the privacy implications, though, I think that would have too many false alarms. A single bit flip is usually going to be an out of bounds write, not bad RAM.
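For what it's worth, the NASA-style scheme from the comment above (triple modular redundancy) is simple to sketch, though it triples memory cost and still can't tell a software overwrite from a hardware flip:

```python
def tmr_read(copies):
    """Majority vote over three stored copies of a value; flags any disagreement."""
    a, b, c = copies
    if a == b == c:
        return a, False
    # Two matching copies outvote the third (assumes at most one copy is corrupt).
    voted = a if a in (b, c) else b
    return voted, True

value, disagreed = tmr_read([42, 42, 42])
assert (value, disagreed) == (42, False)
value, disagreed = tmr_read([42, 10, 42])  # one corrupted copy gets outvoted
assert (value, disagreed) == (42, True)
```

The catch for a crash reporter is exactly the false-alarm problem mentioned above: a wild out-of-bounds write corrupts one copy just as convincingly as bad RAM does.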
This isn't really feasible: have you looked at memory prices lately? The users can't afford to replace bad memory now.
I had memory issues with my PC build which I fixed by reducing the speed to 2800 MHz, which is much lower than its advertised speed of 5600 MHz. Actually, looking back at this, it might've configured its speed incorrectly in the first place; reducing it to 2800 just happened to hit a multiple of 2 of its base clock speed.
Pentium G4560 supports ECC, Core i7 10700 doesn't.
Is it a difference between server hardware managed by knowledgeable people and random hardware thrown together by home PC builders?
10+% is huge
> That's one crash every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.
So the data actually only supports 5% being caused by bitflips, then there's a magic multiple of 2? Come on. Let alone this conservative heuristic that is never explained - what is it doing that makes him so certain that it can never be wrong, and yet also detects these at this rate?
CPU caches and registers - how exactly are they different from a RAM on a SoC in this regard?
CPUs tend to be built to tolerate upsets, like having ECC and parity in arrays and structures, whereas the DRAM on a Macbook probably does not. But there is no objective standard for these things, and redundancy is not foolproof; it is just another lever to move the reliability equation with.
I run four Firefox instances simultaneously, most of the time. No issues to report.
It seems more likely to happen when the profile has been running for a long time (a couple weeks?) and/or using a large amount of RAM.
There's a 60-secish timeout before it gives up and pops that crash report window. I don't think it's a crash per se, just an unresolved file lock or similar. I haven't noticed whether there's any relationship to running multiple profiles. I am almost always running several at a time, and the issue only occurs sometimes. It has no (other) negative side effects, as far as I can tell, but it was unsettling at first.
I'm on macOS also, and I launch from the command line (effectively, I actually have separate launchers for each profile, but they just run a shell script with different arguments).
Honestly, I've been blaming MacOS for it since other apps also crashed at the same time (the first time it was Microsoft Intune, the second time it was Slack - I doubt either uses Firefox internally). I don't recall seeing a Firefox crash on my personal laptop running Linux at any point in the past few years.
My guess is that it's trying to obtain or release a filesystem lock, possibly one that it's lost track of in some trivial way.
I've never seen any damage or inconsistencies in the resulting environment. So I don't think it's a dramatic event, just a safety timer that isn't resolved correctly.
Probably a simple, dumb, but harmless bug.
Contrast that with the dreadful corporate-supplied Edge AI browser I have to use for one client, which seems to randomly close windows without being asked, and never seems to be able to restore them.
I was hinting in my original comment if these cases are contributing to crash reports in any capacity there is a small chance they could be misattributed towards the claims in the post, especially if memory is not freed correctly on shutdown. Even more so if any memory allocation is shared between processes / helpers.
If I quit normally, don't wait for the "timeout" and force quit I still get the crash report UI immediately which suggests to me something funky going on.
10% is a crazy high percentage to claim for bitflips.
bitflippin...
https://stackoverflow.com/questions/2580933/cosmic-rays-what...
This will bloat the code a bit.
Also, at the machine code level, a Boolean controlling a branch or a while loop often doesn't ever make it out of the flags register, where it'll only be a single bit anyway because that's how the hardware works. Not really changeable in software.
The only explanation I can see is if Firefox is installed on a user base of incredibly low quality hardware.
If Firefox itself has so few bugs that it crashes very infrequently, it is not contradictory to what you are saying.
I wouldn't be surprised if 99% of crashes in my "hello world" script are caused by bit flips.
https://www.memtest86.com/blacklist-ram-badram-badmemorylist...
The most expensive memory failure I had was of this sort, and frustratingly came from accidentally unplugging the wrong computer.
After this I did buy some used memory from a recycling center that had the sorts of problems you described and was able to employ them by masking off the bad regions.
Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.
https://github.com/prsyahmi/BadMemory
I've used it for many years. It only works around physical hardware faults, not timing errors: it helps if a RAM cell is damaged by radiation, but not if you're overclocking your RAM.
From what he's saying they run an actual memory test after a crash, too.
These are potential bitflips.
I found an issue only yesterday in firefox that does not happen in other browsers on specific hardware.
My guess is that the software is riddled with edge-case bugs.
As we know from Google and other papers, most of these 10% of flips will be caused by broken or marginal hardware, a good proportion of which could be weeded out by running a memory tester for a while. So if you do that, you're probably looking at a couple out of every hundred crashes being caused by bitflips in RAM. A couple more might be due to other marginal hardware. The vast majority are software.
How often does your computer or browser crash? How many times per year? About 2-3 for me that I can remember. So in 50 years I might save myself one or two crashes if I had ECC.
ECC itself takes about 12.5% overhead/cost. I have also had a couple of occasions where things have been OOM-killed or ground to a halt (probably because of memory shortage). Could be my money would be better spent on 10% more memory than on ECC.
People like to rave and rant at the greedy fatcats in the memory-industrial complex screwing consumers out of ECC, but the reality is it's not free and it's not a magical fix. Not when software causes the crashes.
Software developers like Linus get incredibly annoyed about bug reports caused by bit flips. Which is understandable. I have been involved in more than one crazy Linux kernel bug that pulled in hardware teams bringing up a new CPU that irritated the bug. And my experience would be far from unique. So there's a bit of throwing stones in glass houses there too. Software might be in a better position to demand improvement if it weren't responsible for most crashes by an order of magnitude...
"I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate."
You can't claim any percentage if you don't know what you are measuring. Based on his hot take, I can run an overclocked machine, have Firefox crash a few hundred thousand times a day, and he'll use my data to support his position. Further, see below:
First: A pre-text: I use Firefox, even now, despite what I post below. I use it because it is generally reliable, outside of specific pain points I mention, free, open source, compatible with most sites, and for now, is more privacy oriented than chrome.
Second: On both corporate and home devices, Firefox has consistently crashed more often than Chrome/Chromium/Electron-powered stuff. Only Safari on Windows beats it in terms of crashes, and Safari on Windows is hot garbage. If bit flips were causing issues, why are Chromium-based browsers such as Edge and Chrome so much more reliable?
Third: Admittedly, I do not pay close enough attention to know when Firefox sends crash reports, however, what I do know is that it thinks it crashes far more often than it does. A `sudo reboot` on linux, for example, will often make firefox think it crashed on my machine. (it didn't, Linux just kills everything quickly, flushes IO buffers, and reboots...and Firefox often can't even recover the session after...)
Fourth: some crashes ARE repeatable (see above), which means bit flips aren't the issue.
Just my thoughts.
Have we considered that maybe Firefox is the cause of bad memory?
/s
67k crashes / day
claim: "Given # of installs is X, every install must be crashing several times a day"
We'll translate that to: "every install crashes 5 times a day"
67k crashes/day ÷ 5 crashes/install/day
≈ 13k installs
Your claim is there's 13k Firefox users? Lol
Granted, they're probably just as accurate as netcraft. /shrug
470k crashes in a single week, and this is under-reported! I bet the number of crashes is far higher. My snap Firefox on Ubuntu would lock up, forcing me to kill it from the system monitor, and this was never reported as a crash.
Once upon a time I wrote software for safety critical systems in C/C++, where the code was deployed and expected to work for 10 years (or more) and interact with systems not built yet. Our system could lose power at any time (no battery) and we would have at best 1ms warning.
Even if Firefox moves to Rust, it will not resolve these issues. 5% of their crashes could be coming from resource exhaustion, likely mostly RAM. Those crashes could be resolved tomorrow if Firefox simply checked how much RAM was available before trying to allocate it. That accounts for ~23k crashes a week. Madness.
With the RAM shortages and 8GB looking like it will remain the entry laptop norm, we need to start thinking more carefully about how software is developed.
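A graceful-degradation sketch of the check the parent comment suggests, in Python (purely illustrative; Firefox's real allocator paths are far more involved):

```python
def allocate_buffer(nbytes):
    """Try to grab a large buffer; signal failure instead of crashing.

    A caller that gets None can drop caches, unload background tabs,
    or retry with a smaller size -- anything beats taking the whole
    process down. (Hypothetical sketch, not Firefox's actual code.)
    """
    try:
        return bytearray(nbytes)
    except (MemoryError, OverflowError):
        return None

# A sane request succeeds; an absurd one fails cleanly instead of crashing.
assert allocate_buffer(1024) is not None
assert allocate_buffer(1 << 62) is None
```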
I find this impossible to believe.
If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.
>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.
Might be the case, but 10% is still huge.
There imo has to be something else going on. Either their userbase/tracking is biased or something else...
Browsers, videogames, and Microsoft Excel push computers really hard compared to regular applications, so I expect they're more likely to cause these types of errors.
The original Diablo 2 game servers for battle.net, which were Compaq 1U servers, failed at astonishing rates due to their extremely high utilization and consequent heat-generation. Compaq had never seen anything like it; most of their customers were, I guess, banking apps doing 3 TPS.
The more RAM you have, the higher the probability that there will be some bad bits. And the more RAM a program uses, the more likely it will be using some that is bad.
Same phenomenon with huge hard drives.
Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!
We'd save the test result to the registry and include the result in automated bug reports.
The common causes we discovered for the problem were:
- overclocked CPU
- bad memory wait-state configuration
- underpowered power supply
- overheating due to under-specced cooling fans or dusty intakes
These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.
Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.
And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.
Sometimes I'm amazed that computers even work at all!
Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.
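The self-test scheme described above is easy to recreate; here's a minimal Python sketch (the constants and the particular loop are my own invention, not ArenaNet's actual test):

```python
import hashlib
import struct

def math_selftest(seed=1.5, n=10000):
    """Run a deterministic float-heavy loop and hash the results.

    On healthy hardware the digest never changes; a flipped bit in the
    CPU, cache, or the RAM holding `out` changes it.
    """
    x = seed
    out = []
    for i in range(n):
        x = (x * 1.0000001 + i) % 1e9
        out.append(x)
    return hashlib.sha256(struct.pack(f"<{n}d", *out)).hexdigest()

# Recorded once on a known-good machine, then compared on every run.
KNOWN_GOOD = math_selftest()
assert math_selftest() == KNOWN_GOOD  # a mismatch means: suspect hardware
```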
Case in point: I was getting memory errors on my gaming machine that persisted even after replacing the sticks. It caused a Windows bluescreen maybe once a month, so I kinda lived with it since I couldn't afford to replace the whole setup (I theorized something on the motherboard was wrong).
Then my power supply finally died (it was cheap-ish, not the cheapest, but it already had a few years on it). I replaced it, and lo and behold, the memory errors were gone.
I had a PC come to me that would boot fine, but if you opened the CD drive it'd shut off instantly.
- Firefox may be more prevalent on those using Linux, since FF is less “corporate” than Chrome or Edge.
- People using Linux are probably putting Linux on old machines that had versions of Windows that are no longer supported.
However, what I can't say next is "PSUs would get old and stop putting out as much power," because that doesn't tend to happen. They just die.
Those running Linux on some old tower may hook up too many devices to an underpowered PSU which could cause problems, but I doubt this is the norm.
If it’s not PSUs, what is it? It’s not electromagnetic radiation doing the bitflipping because that’s too rare.
Maybe bitflips could be caused by low-quality peripherals.
People also don’t vacuum out laptops like they used to vacuum out towers and desktops, so maybe it’s dust.
Or maybe it’s all a ruse and FF is buggy, but they don’t have time to figure it out.
Maybe for Linux noobs. But I would suggest that most Linux users are not noobs booting a disused Pentium from a live CD. They are running Linux on the same hardware as Windows users. I would further suggest that, since anyone installing a not-Windows OS is more tech-savvy than average, Linux users actually take better care of their machines. Linux users take pride in their machines, whereas the average Windows user barely knows that computers have fans.
Ask any Linux user for their specifications and they will quote system reports and memory figures like Marisa Tomei discussing engine timings. Ask a random Windows user and they will probably start with the name of the store that sold it.
So much for taking pride in my machine :)
I did basically the same thing recently when I built an AI rig. I tried to put it in a server rack case but the fan noise was too much. So I ditched the rack and put in an open mining frame.
Could also fake spike to force the other team’s healer to waste their good heal on the wrong player while you downed the real target. Good times.
https://wiki.guildwars.com/wiki/Guild_Wars_Reforged
It did rekindle my love for the game, but most outposts are empty, even in the international districts, so I think it's hard to get hooked on it for new joiners.
But when you take a bird's eye view, it's interesting and great to see how over the years, games where you can build your own games remain popular and a common entryway into software development.
But also how Epic went from ZZT via Unreal to Fortnite, with the latter now being another platform (or what Zucc wanted to call a metaverse) for creativity.
Other notable mentions off the top of my head where people can build or invent their own games (in-game, via an external editor or through community support) or go crazy in besides Roblox are Second Life (...I think), LittleBigPlanet, Warcraft/Starcraft (which led to the genre of MOBAs), Geometry Dash, Mario Maker, TES, Source engine games, Minecraft, etc etc.
I also was introduced to programming through Roblox.
As an aside, Apple and Google's phone home crash reports is a really good system and it's one factor that makes mobile app development fun / interesting.
Unfortunately I've never looked at crashes this way when I worked at VKontakte because there were just too many crashes overall. That app had tens of millions of users so it crashed a lot in absolute numbers no matter what I did.
In an app with >billion users you get all kinds of wild stuff.
GPS location and movement data is what gives Google maps its near-real-time view of traffic on all roads, and busy-ness of all shops.
I think they collect location data from people riding public transport so they can tell you how long people wait on average at bus stops before getting on a bus.
Does Google collect atmospheric pressure readings from phone altimeters and use it for weather models? Could they?
Kindle collects details on books people read, how far they read, where they stop, which sections they highlight and quote, which words they look up in dictionaries.
I wonder if anyone’s curated a list of things like this which do happen or have been tried, excluding the “gathers user data for advertising” category which would become the biggest one, drowning out everything else.
I think current phones use accelerometer data to detect possible car crashes and call emergency services. Google could use that in aggregate to identify accident blackspots but I don’t know if they do. But that would be less useful because the police already know everywhere a big accident happens because people call the police. So that’s data easily found a different way.
I don't know whether you mean it's a shame that people consider it spyware, or if you meant that it's a shame that it manifests as spyware typically. I agree with the latter, not the former. It usually is spyware. If companies went for simple opt-in popups with a brief description of the reasoning, I'd be all for that. I sometimes opt-in to these requests myself, despite being a fairly privacy-conscious person, because I understand the benefit they have to the people collecting the data for good purposes. But when surveillance is opt-out (or no choice given), it's just spyware.
I asked to put the spyware aside for one sub-thread and focus on the astonishing worldwide sensor array, and you talked about the spyware and nothing else.
I've had plenty of servers with faulty ECC DIMMs that didn't trigger any alerts, and would only show faults under actual memory testing. I had a hard time convincing some of our admins the first time ("no ECC faults, you can't be right") but I won the bet.
Edit: very old paper by google on these topics. My issues were 6-7 years ago probably.
https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
There's a pretty good set of diagrams and descriptions of the faults in this paper: https://dl.acm.org/doi/10.1145/3725843.3756089
Also to the parent: there's an updated public paper on DDR4 era fault observations https://ieeexplore.ieee.org/document/10071066
I suppose what winds up where is up to the memory controller, but (for DDR5) in each BL16 transaction beat you're usually getting 32 bits of data and 8 bits of ECC (per sub-channel). Those ECC bits are usually called check bits CB[7:0], and they accompany the data bits DQ[31:0].
If you're talking about transactions for LPDDR, things are a bit different there, though, as the ECC has to be transmitted in-band with your data.
We scanned all our machines following this ( a few thousand servers ) and found out that ram issues were actually quite common, as said in the paper.
EDIT: took a look at the paper you linked and it basically says the same thing I did. The probability of these cases becomes increasingly small, and while ECC would indeed not reduce it to _zero_, it would greatly reduce it.
We also had a few thousand physical servers with about a terabyte of RAM each.
You are right: we did see repaired errors, but we also saw (indirectly, and after testing) unrepaired ones.
But the initial discussion was whether ECC RAM makes it go away, and your point is that it doesn't. And the vast, vast majority of the errors, according to my understanding and to the paper you pointed to, are repairable. About 1 out of 400-ish errors is non-repairable. That's a huge improvement! If you had ECC RAM, the failures Firefox sees here would drop from 10% to 0.025%! That is highly significant!
Even better: you would now be informed of 2-bit errors! You would _know_ what is wrong.
You could have 3(!) bit errors and this you might not see, but they'd be several orders of magnitude even rarer.
So yes, it would not 100% go away, but 99.9 % go away. That's... Making it go away in my book.
And last but not least, this paper mentions uncorrectable errors. It says nothing of undetectable ECC errors! You said _undetectable_ errors. I'm sure they happen, but I would be surprised if you see any meaningful incidence of this, even at terabytes of data. It's probably on the order of 0.000625 of the errors you can get (but if you want I can do more solid math).
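Spelling out the back-of-envelope math (the 1-in-400 figure is my rough read of the paper, not a precise number):

```python
# Fraction of Firefox crashes attributed to bitflips, per the post.
bitflip_share = 0.10
# Rough share of memory errors ECC cannot repair (my 1-in-400-ish estimate).
uncorrectable = 1 / 400

# Share of crashes that would remain if every machine had ECC.
residual = bitflip_share * uncorrectable
assert abs(residual - 0.00025) < 1e-12  # i.e. 0.025% of crashes
```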
I think we diverge on ‘making it go away in my book’.
When you're the one having to debug all these bizarre things (there were real money numbers involved, so these things mattered), over millions of jobs every day, rare events with low probability don't disappear - they just happen and take time to diagnose and fix.
So in my book ecc improves the situation, but I still had to deal with bad dimms, and ecc wasn’t enough. We used not to see these issues because we already had too many software bugs, but as we got increasingly reliable, hardware issues slowly became a problem, just like compiler bugs or other elements of the chain usually considered reliable.
I fully agree that there are lots of other cases where this doesn’t matter and ecc is good enough.
Thanks for taking the time to reply !
But this is sort of the march of nines.
My knee-jerk reaction to blaming bit flips is "naaah". Mostly because it's such a convenient scapegoat. It happens, I'm sure, but it would not be the first explanation I reach for. I once heard someone blame a bug that happened multiple times on "cosmic rays". You can imagine how irked I was at the dang cosmic rays hitting the same data with such consistency!
Anyways, I'm sorry if my tone sounded abrasive, I, too, have appreciated the discussion.
No you were not abrasive at all - I’ve learned to assume good faith in forum conversations.
In retrospect I should have started by giving the context ( march of 9s is a good description) actually, which would have made everything a lot clearer for everyone.
E.g. EU enforced mandatory USB-C charging from 2025, and pushes for ending production of combustion engine cars by 2035. Why not just make ECC RAM mandatory in new computers starting e.g. from 2030?
AMD is already one step away from being compliant. So, it's not an outlandish requirement. And regulating will also force Intel to cut their BS, or risk losing the market.
With no regulations in place, companies would rather innovate in profit extraction rather improving technology. And if they have enough market capture, they may actually prefer to not innovate, if that would hurt profits.
Computers also aren't used much these days, and phones and tablets don't have ECC.
Also, computers may not be used enough for cosmic rays to be a risk factor, but they're still susceptible to rowhammer-style attacks, which ECC memory makes much harder.
Finally, if you account for the current performance loss due to rowhammer counter-measures, the extra cost of ECC memory is partially offset.
It's possible DDR6 will help. If it gets the ability to do ECC over an entire memory access like LPDDR, that could be implemented with as little as 3% extra chip space.
DDR5 ECC RDIMMs (R=registered) have 16 extra data lines. From the specifications for Kingston's KSM64R52BS8-16HA [1]:
> x80 ECC (x40, 2 independent I/O sub channels)
On the other hand ECC UDIMMs (U=unbuffered) have only 8. From the specifications for Kingston's KSM56E46BS8KM-16HA [2]:
> x72 ECC (x36, 2 independent I/O sub channels)
Though if I remember correctly, the specifications for the older DDR4 ECC RDIMMs mention only 72 bits.
[1]: https://www.kingston.com/datasheets/KSM64R52BS8-16HA.pdf
[2]: https://www.kingston.com/datasheets/KSM56E46BS8KM-16HA.pdf
AMD has been better on it but BIOS/mobo vendors not so much
Most motherboards supported both, and the choice of which to use came down to the cost differential at the time of building a particular machine. The wild swings in DRAM prices meant that this could go from being negligible to significant within the course of a year or two!
When 72 pin SIMMs were introduced, they could in theory also come in a parity version but in reality that was fairly rare (full ECC was much better, and only a little more expensive). I don't think I ever saw an EDO 72 pin SIMM with parity, and it simply wasn't an option for DIMMs and later.
Also, in a game, there is a tremendously large chance that any particular bit flip will have exactly 0 effect on anything. Sure you can detect them, but one pixel being wrong for 1/60th of a second isn't exactly ... concerning.
The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully. There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.
Nobody does
> There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.
And again, nobody does except stuff that goes to space and a few critical machines. The closest a normal user will get to code written like that is probably car ECUs; there are even automotive-targeted MCUs that not only run ECC but also run two cores in lockstep and crash if they disagree.
It boils down to exception handling, you don't expect all of your bugs or security vulnerabilities to be known and write your code to be able to react to unplanned states without crashing. Bugs or security vulnerabilities can look a lot like a cosmic ray... a buffer overflow putting garbage in unexpected memory locations vs a cosmic ray putting garbage in unexpected memory locations... a lot of the mitigations are quite the same.
I’m aware of code that detects bit flips via unreasonable value detection (“this counter cannot be this high so quickly”). What else is there?
If they're hundreds of bytes or more, then two copies plus two hashes will do a better job.
If you're doing something that's more centralized then one hash might be simpler, but if you're centralized then you should probably use your own error correction codes instead of having multiple copies.
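A sketch of the two-copies-plus-hashes idea in Python (illustrative only; real radiation-hardened code would also protect the hashes and the code paths themselves):

```python
import hashlib

def store(data: bytes):
    """Keep two independent copies of critical state, each with its hash."""
    return [[bytes(data), hashlib.sha256(data).digest()],
            [bytes(data), hashlib.sha256(data).digest()]]

def load(replicas):
    """Return the first copy whose hash still verifies, else None."""
    for copy, digest in replicas:
        if hashlib.sha256(copy).digest() == digest:
            return copy
    return None  # both copies corrupted -- but at least we *know*

reps = store(b"critical save state")
corrupted = bytearray(reps[0][0])
corrupted[0] ^= 0x01                  # simulate a single bit flip
reps[0][0] = bytes(corrupted)
assert load(reps) == b"critical save state"  # second copy rescues us
```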
That was in the 2000s though, and for embedded memory above 65nm. I would expect smaller sizes to be more error-prone.
But there's volatile and nonvolatile memory all over in a computer and anywhere data is in flight be it inside the CPU or in any wires, traces, or other chips along the data path can be subject to interference, cosmic rays, heat or voltage related errors, etc.
The number of bits in registers, busses, cache layers is very small compared to the number in RAM. Obviously they might be hotter or more likely to flip.
I read this a decade ago... https://www.codeofhonor.com/blog/whose-bug-is-this-anyway
I eventually discovered with some timings I could pass all the usual tests for days, but would still end up seeing a few corrected errors a month, meaning I had to back off if I wanted true stability. Without ECC, I might never have known, attributing rare crashes to software.
From then on I considered people who think you shouldn't overclock ECC memory to be a bit confused. It's the only memory you should be overclocking, because it's the only memory where you can prove you don't have errors.
I found that DDR3 and DDR4 memory (on AMD systems at least) had quite a bit of extra “performance” available over the standard JEDEC timings. (Performance being a relative thing, in practice the performance gained is more a curiosity than a significant real life benefit for most things. It should also be noted that higher stated timings can result in worse performance when things are on the edge of stability.)
What I’ve noticed with DDR5, is that it’s much harder to achieve true stability. Often even cpu mounting pressure being too high or low can result in intermittent issues and errors. I would never overclock non-ECC DDR5, I could never trust it, and the headroom available is way less than previous generations. It’s also much more sensitive to heat, it can start having trouble between 50-60 degrees C and basically needs dedicated airflow when overclocking. Note, I am not talking about the on chip ECC, that’s important but different in practice from full fat classic ECC with an extra chip.
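On Linux you can watch those corrected errors accumulate via EDAC's sysfs counters; a small sketch (the sysfs path is where EDAC usually lives on my machines - it may be absent without ECC or the right driver loaded):

```python
import glob

def parse_count(text: str) -> int:
    """EDAC counter files hold a single decimal number plus a newline."""
    return int(text.strip())

def corrected_error_counts(base="/sys/devices/system/edac/mc"):
    """Map each memory controller's ce_count file to its current value."""
    counts = {}
    for path in glob.glob(f"{base}/mc*/ce_count"):
        with open(path) as f:
            counts[path] = parse_count(f.read())
    return counts

# On a machine without EDAC this just returns {} rather than failing.
print(corrected_error_counts())
```

A value that keeps climbing at a given overclock is the "few corrected errors a month" signal described above.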
I hate to think of how much effort will be spent debugging software in vain because of memory errors.
Some of the (legitimately) extreme overclockers have been testing what amounts to massive hunks of metal in place of the original mounting plates because of the boards bending from mounting pressure, with good enough results.
On top of all of this, it really does not help that we are also at the mercy of IMC and motherboard quality too. To hit the world records they do and also build 'bulletproof', highest performance, cost is no object rigs, they are ordering 20, 50 motherboards, processors, GPUs, etc and sitting there trying them all, then returning the shit ones. We shouldn't have to do this.
I had a lot of fun doing all of this myself and hold a couple very specific #1/top 10/100 results, but it's IMHO no longer worth the time or effort and I have resigned to simply buying as much ram as the platform will hold and leaving it at JEDEC.
If we had a time series graph of this data, it might be revealing.
Every single time I've had someone pay me to figure out why their build isn't stable, it's always some combination of cheap power supply with no noise filtering, cheap motherboard, and poor cooling. Can't cut corners like that if you want to go fast. That is to say, I've never encountered "almost ok" memory. They're quite good at validation.
I now just run at the standard 5600MHz timing, I really don't find the potential stability trade off worth it. We already have enough bugs.
This attitude is entirely corporate-serving cope from Intel to serve market segmentation. They wanted to trifurcate the market between consumers, business, and enthusiast segments. Critically, lots of business tasks demand ECC for reliability, and business has huge pockets, so that became a business feature. And while Intel was willing to sell product to overclockers[0], they absolutely needed to keep that feature quarantined from consumer and business product lines lest it destroy all their other segmentation.
I suspect they figured a "pro overclocker" SKU with ECC and unlocked multipliers would be about as marketable as Windows Vista Ultimate, i.e. not at all, so like all good marketing drones they played the "Nobody Wants What We Aren't Selling" card and decided to make people think that ECC and overclocking were diametrically opposed.
[0] In practice, if they didn't, they'd all just flock to AMD.
only when AMD had better price/performance, not because of ECC. At best you have a handful of homelabbers who went with AMD for their NAS, but approximately nobody who cares about performance switched to AMD for ECC RAM, because ECC RAM also tends to be clocked lower. Back in the Zen 2/3 days the choice was basically DDR4-3600 without ECC, or DDR4-2400 with ECC.
Any workstation where you are getting serious work done should use ECC
P.S. GW1 remains one of my favorite games and the source of many good memories from both PvP and PvE. From fun stories of holding the Hall of Heroes to some unforgettable GvG matches, y'all made a great game.
I remember in the earlier builds we only had a “heal area” spell, which would also heal monsters, and no “resurrect” spell, so it was always a challenge to take down a boss and not accidentally heal it when trying to prevent a player from dying.
No, seriously, did you actually verify the code for correctness before relying on its results?
For that one I'd guess no, because under normal circumstances hot locations like that will stay in cache.
Funny you say this, because for a good while I was running OC'd RAM
I didn't see any instability, but Event Viewer was a bloodbath - reducing the speed a few notches stopped the entries (iirc 3800MHz down to 3600)
Oh god yes… Dell OptiPlexes and bad caps went together in those days. I’m half convinced Valve put the gray towers in Counter-Strike so IT employees wasting time could shoot them up for therapy.
The vast majority of crashes came from two buckets:
1. PCs running below our minimum specs
2. Bugs in MSI Afterburner.
Do you mean the OSD?
I dialed the machine back to the rated speed but it failed completely within 6 months.
We need GW3 already but my fear is mmo as a genre is dying.
Price itself doesn't cause problems; the cause is bad design, or false or incomplete data on datasheets, or both. Please STOP spreading this narrative. The right thing is to make ads, datasheets, marketing materials, etc. tell you the truth that you need to make a proper decision as a client/consumer.
General memory-testing programs like memtest86 or memtester write patterns (including random bits) into memory and verify them.
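The core idea fits in a few lines; a toy Python version (real testers like memtest86 use many patterns, address-dependent data, and raw physical memory, which this obviously doesn't):

```python
def pattern_test(buf: bytearray, pattern: int = 0xA5) -> list:
    """Fill a buffer with a known byte pattern, read it back,
    and return the offsets of any mismatching bytes."""
    for i in range(len(buf)):
        buf[i] = pattern
    return [i for i, b in enumerate(buf) if b != pattern]

# An in-process buffer should (almost) always verify clean.
assert pattern_test(bytearray(4096)) == []
```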
Yikes. Dude, you're getting a Packard Bell.
The Turion64 was the worst CPU I've ever bought. Even 10-year-old games had rendering artefacts all over the place, triangle strips being "disconnected" and leading to big triangles appearing everywhere. It was such weird behavior, because it always happened around 10 minutes after I started playing. It didn't matter _what_ I was playing. Every game had rendering artefacts, one way or another.
The most obvious ones were 3d games like CS1.6, Guild Wars, NFSU(2), and CC Generals (though CCG running better/longer for whatever reason).
The funny part behind the VRAM(?) bitflips was that the triangles then connected to the next triangle strip, so you had e.g. large surfaces in between houses or other things, and the connections were always in the same z distance from the camera because game engines presorted it before uploading/executing the functional GL calls.
After that laptop I never bought these types of low budget business laptops again because the experience with the Turion64 was just so ridiculously bad.
God I love C/C++. It’s like job security for engineers who fix bugs.
I imagine the largest volume of game memory consumption is media assets, which wouldn't really matter if corrupted, and the storage requirement for protecting the genuinely important content would be reasonably negligible?
I don't think engineering effort should ever be put into handling literal bad hardware. But, the user would probably love you for letting them know how to fix all the crashing they have while they use their broken computer!
To counter that, we're LONG overdue for ECC in all consumer systems.
It significantly overlaps the engineering to gracefully handle non-hardware things like null pointers and forgetting to update one side of a communication interface.
80/20 rule, really. If you're thoughtful about how you build, you can get most of the benefits without doing the expensive stuff.
Better the user sees some lag due to state rebuild versus a crash.
Most consumers have what they have, and use what they have. Upgrading everything is now rare. If they got screwed, they'll remain screwed for a few years.