I looked it up afterwards and xor was also a valid instruction in that architecture to zero out a register, and used even fewer cycles than the subtraction method; but it was not listed in the subset of the assembly language instructions we were allowed to use for that unit. I suspect that it was deemed a bit off-topic, since you would need to explain what the mathematical XOR operation was (if you didn't already learn about it in other units), when the unit was about something else entirely- but everyone knows what subtraction is, and that subtracting a number by itself leads to zero.
[0] Not x86, I do not recall the exact architecture.
xor was the default zeroing idiom.I onkly did sub reg,reg when I actually want its flags result. Otherwise the main rule is: do not touch either form unless flags liveness makes the rewrite obviously safe. Had about 40 such idioms for the passes.
Probably, there are ALU pipeline designs where you don't pay an explicit penalty. But not all, and so XOR is faster.
Surely, someone as awesome as Raymond Chen knows that. The answer is so obvious and basic I must be missing something myself?
A tangent, but what is Obvious depends on what you know.
Often experts don't explain the things they think are Obvious, but those things are only Obvious to them, because they are the expert.
We should all kind, and explain also the Obvious things those who do not know.
E.g. on Z80 and 6502 both have the same cycle count.
It could also be as a result of most people working in assembly being aware of the properties of logic gates, so they carry the understanding that under the hood it might somehow be better.
.b and .w -> clr eor sub are all identical
for .l moveq #0 is the winner
> It encodes to the same number of bytes, executes in the same number of cycles.
The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection in the instruction decoding front-end and rename the destination to an internal zero register, bypassing the execution of the instruction entirely. You can imagine that the instruction, in some sense, “takes zero cycles to execute”.https://en.wikipedia.org/wiki/Carry-lookahead_adder
The only minor difference between the two on x86, really, is SUB sets OF and CF according to the result while XOR always clears them.
(But this does not discount the fact that basically all CPUs treat them both as one cycle)
> the processors started short-circuiting xor <register>,<register> much later.
There is no "short-circuiting", it's a regular ALU instruction like all the others. E.g. the "use xor to zero register" does not have require any special case handling, it's just as fast or slow as other ALU instructions (with the exception of mul/div of course).
PS. What is static vs dynamic count?
Dynamic count - how many times an opcode gets executed.
I. e. an instruction that doesn't appear often in code, but comes up in some hot loops (like encryption) would have low static and high dynamic.
Edit: this is apparently not the case, see @tliltocatl's comment down the thread
- Normally ALU implements all "light" operations (i. e. add/sub/and/or/xor) in a single block, separating them would result in far more interconnect overhead. Often, CPUs have specialized adder-only units for address generation, but never a xor-specialized block.
- All CPUs that implement hyper-threading also optimize a XOR EAX,EAX into MOV EAX,ZERO/SET FLAGS (where ZERO is an invisible zero register just like on Itanium and RISCs). This helps register renaming and eliminates a spurious dependency.
- The XOR trick is about as old as 8086 if not older.
XOR would also be handled by the ALU, the L is for logic.
Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
And this, interestingly, is why life on earth uses left-handed amino acids and right-handed sugars .. and why left handed sugar is perfect for diet sodas.eg: For one, Isaac Asimov in the 1970s wrote at length on this in his role as a non fiction science writer with a Chemistry Phd
> male/female ratios somehow tend to balance at 50/50.
This is different to the case of actual right handed dominance in humans and to L- Vs R- dominance in chirality ...
( Men and women aren't actual mirror images of each other ... )
Its basically free today. Of course, mov RAX, 0 is also free and does the same thing. But CPUs have limited decoder lengths per clock tick, so the more instructions you fit in a given size, the more parallel a modern CPU can potentially execute.
So.... definitely still use XOR trick today. But really, let the compiler handle it. Its pretty good at keeping track of these things in practice.
-----------
I'm not sure if "sub" is hard-coded to be recognized in the decoder as a zero'd out allocation from the register file. There's only certain instructions that have been guaranteed to do this by Intel/AMD.
Will remember for the next time I write asm for Itanium!
Even tiny tiny CPUs can do sub in one cycle, so I doubt that. On super-scalar CPUs xor and sub are normally issued to the same execution units so it wouldn't make a difference there either.
MIPS - $zero
RISC-V - x0
SPARC - %g0
ARM64 - XZR