some chips set them step by step, as shown in the article
others only set them at the very end, together
and then there are chips which follow the read-modify-write op with another read, to check if the RMW succeeded... which promptly causes them to hang hard when the page tables live in read-only memory i.e. ROM... fun fun fun!
as for segmentation fun... think about CS always being writeable in real mode... even though the access rights only have a R but no W bit for it...
Using a non-standard mechanism of loading CS (LOADALL or RSM), it's possible to have a writable CS in protected mode too, at least on these older processors.
There's actually a slight difference in the access rights byte that gets loaded into the hidden part of a segment register (aka "descriptor cache") between real and protected mode. I first noticed this on the 80286, and it looks to be the same on the 386:
- In protected mode, the byte always matches that from the GDT/LDT entry: bit 4 (code/data segment vs. system) must be set, the segment load instruction won't allow otherwise, bit 0 (accessed) is set automatically (and written back to memory).
- In real and V86 mode, both of these bits are clear. So in V86 mode the value is 0xE2 instead of the "correct" 0xF3 for a ring 3 data segment, and similarly in real mode it's 0x82 (ring 0).
The hardware seems to simply ignore these bits, but they still exist in the register, unlike other "useless" bits. For example, LDT only has bit 7 (present), and GDT/IDT/TSS have no access rights byte at all - they're always assumed to be present, and the access rights byte reads as 0xFF. At least on the 286 that was the case, I've read that on the Pentium you can even mark GDT as not-present, and then get a triple fault on any access to it.
Keeping these bits, and having them different between modes might have been an intentional choice, making it possible to determine (by ICE monitor software) in what mode a segment got loaded. Maybe even the two other possible combinations (where bit4 != bit0) have some use to mark a "special" segment type that is never set by hardware?
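To make the bit layout concrete, here is a small sketch that splits an access rights byte into its standard fields and classifies it by bits 4 and 0. The function names are made up for illustration, and guess_load_mode() only encodes the observation described above (both bits set after a protected-mode load, both clear after a real/V86-mode load); it is not documented processor behaviour.

```python
def decode_access_rights(ar: int) -> dict:
    """Split an x86 segment access rights byte into its standard fields."""
    return {
        "present": bool(ar & 0x80),   # bit 7
        "dpl": (ar >> 5) & 0x3,       # bits 6-5: descriptor privilege level
        "s": bool(ar & 0x10),         # bit 4: code/data (1) vs. system (0)
        "type": (ar >> 1) & 0x7,      # bits 3-1: executable, expand-down/conforming, writable/readable
        "accessed": bool(ar & 0x01),  # bit 0
    }


def guess_load_mode(ar: int) -> str:
    """Hypothetical helper: classify a cached byte by bits 4 and 0 only."""
    s = bool(ar & 0x10)
    accessed = bool(ar & 0x01)
    if s and accessed:
        return "protected"           # e.g. 0x93 (ring 0) or 0xF3 (ring 3)
    if not s and not accessed:
        return "real/V86"            # e.g. 0x82 (real) or 0xE2 (V86)
    return "unused combination"      # bit4 != bit0, never seen from hardware loads


if __name__ == "__main__":
    for ar in (0x93, 0x82, 0xF3, 0xE2):
        print(f"{ar:#04x}", guess_load_mode(ar), decode_access_rights(ar))
```

Note that each real/V86 value differs from its protected-mode counterpart by exactly 0x11, i.e. only bits 4 and 0.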
However, Win32s, a subset of the Windows 32-bit API from NT, was introduced in 3.11.
3.11 also introduced 32-bit disk access and 32-bit drivers.
Microsoft did 32-bit in steps -- it was confusing already back then.
They gave us a win3.1 computer and Spyglass Mosaic, which required the Win32s subsystem.
http://www.win3x.org/win3board/viewtopic.php?t=4971&view=min
The full-time guys all had a Sun on their desk next to their PC. We also had to run an IBM 3270 terminal emulator and X server to connect to the Suns. It was all so unstable. I remember a bunch of "Win32s error" popups.
The other intern and I found a room full of decommissioned 486 machines, installed Linux and didn't tell anyone for a month. Everything worked great and then we started an assembly line of installing Linux on those old machines for all the older coworkers to take home.
IIRC a lot of it wasn't turned on by default due to hardware/driver compatibility concerns, and there were articles all over the place about how to turn it on for extra performance. Essentially they used optimising tech-heads the world over as a giant beta-test group for parts of Win95's IO subsystem.
And also--before Linux--SCO Xenix and then SCO Unix. It was finally possible to run a real Unix on a desktop or home PC. A real game changer. I paid big $$$ (for me at the time) to get SCO Xenix for my 386 so I could have my own Unix system.
The PDP-11's MMU option was closer to the 8088's segmentation model I think, but I've never coded either, so dunno really. It does seem like it was possible to port "PDP-11 UNIX" to a lot more platforms than would get "VMUNIX".
https://en.wikipedia.org/wiki/Singularity_(operating_system)
Managed code, the properties of their C#-derived programming language, static analysis, and verification were used rather than hardware exception handling.
I think hardware protection is usually easier to sell but it isn't when it is slower or more expensive than the alternative.
edit: I missed it was linked on the above page
Expanding on what I wrote above about "bits of hardware acceleration", maybe adding a few primitives to the instruction set that make page table walking easier would help.
And with a trusted compiler architecture you don't need to keep the ISA stable between iterations, since it's assumed that all code gets compiled at the last minute for the current ISA.
Lots of fun things to experiment with.
As a thought experiment, imagine an extremely simple ISA and memory interface where you would do address translation or even cache management in software if you needed it... the different cache tiers could just be different NUMA zones that you manage yourself.
You might end up with something that looks more like a GPU or super-ultra-hyper-threading to get throughput masking the latency of software-defined memory addressing and caching?
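For a feel of what "address translation in software" would mean in the thought experiment above, here is a toy model: a two-level table walk with a software-managed translation cache in front of it. Everything here is an assumption for illustration (the SoftMMU class, the 4 KiB pages, the 10/10/12 address split, the dict-based tables); it is not how any particular ISA or OS does it.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT
INDEX_BITS = 10                      # assumed 10-bit index per table level
INDEX_MASK = (1 << INDEX_BITS) - 1


class PageFault(Exception):
    """Raised when a virtual address has no mapping."""


class SoftMMU:
    def __init__(self):
        self.directory = {}          # directory index -> {table index -> frame number}
        self.tlb = {}                # virtual page number -> frame number
        self.walks = 0               # number of full table walks performed

    def map(self, vaddr: int, paddr: int) -> None:
        """Map the page containing vaddr to the frame containing paddr."""
        vpn = vaddr >> PAGE_SHIFT
        table = self.directory.setdefault(vpn >> INDEX_BITS, {})
        table[vpn & INDEX_MASK] = paddr >> PAGE_SHIFT
        self.tlb.pop(vpn, None)      # drop any stale cached translation

    def translate(self, vaddr: int) -> int:
        """Return the physical address for vaddr, walking the tables on a miss."""
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & (PAGE_SIZE - 1)
        pfn = self.tlb.get(vpn)
        if pfn is None:              # software "TLB miss": do the two-level walk
            self.walks += 1
            table = self.directory.get(vpn >> INDEX_BITS)
            if table is None or (vpn & INDEX_MASK) not in table:
                raise PageFault(hex(vaddr))
            pfn = table[vpn & INDEX_MASK]
            self.tlb[vpn] = pfn      # cache the result for later accesses
        return (pfn << PAGE_SHIFT) | offset


if __name__ == "__main__":
    mmu = SoftMMU()
    mmu.map(0x00401000, 0x0009A000)
    print(hex(mmu.translate(0x00401234)), "walks:", mmu.walks)
```

The walk is the slow path that every miss pays for in instructions rather than in dedicated hardware, which is where the latency-masking question comes from.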
Basically, you have to have out of order/speculative execution if you ultimately want the best performance on general/integer workloads. And once you have that, timing information is going to leak from one process into another, and that timing information can be used to infer the contents of memory. As far as I can see, there is no way to block this in software. No substitute for the CPU knowing 'that page should not be accessible to this process, activate timing leak mitigation'.
A far greater problem is that until very recently, practical memory safety required the use of inefficient GC. Even a largely memory-safe language like Rust actually requires runtime memory protection unless stack depth requirements can be fully determined at compile time (which they generally can't, especially if separately-provided program modules are involved).