
Instruction Set Micro-Optimizations

Name: Anonymous 2012-11-28 18:01

I've always been interested in these, even if few people have a chance to experiment with them, so, yeah, let's do it.

Some 6502 derivatives feature dedicated store 0 and load 0 instructions, which saves a byte and (possibly, I don't remember) a cycle for a very common operation. Neat.

A lot of RISC architectures, on the other hand, feature a dedicated 0 register for this purpose -- iirc, in MIPS, register 0 is always 0. It seems at first glance that this would be more general and useful than load/store 0 instructions, but I'm having trouble thinking of a use for the register besides loading and storing 0 values. On second thought, maybe it's a waste of a register.

Name: Anonymous 2012-11-28 18:26

Still waiting, Cudder.

Name: Anonymous 2012-11-28 19:30

It's for implementing other functions with limited instructions. For example, nor x, zero = not x
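
The identity is easy to check in a few lines of Python, modeling a 32-bit register (a sketch, not any particular ISA's encoding):

```python
MASK = 0xFFFFFFFF  # model a 32-bit register

def nor(a, b):
    """Hardware-style NOR: OR the operands, then complement, within 32 bits."""
    return ~(a | b) & MASK

x = 0x0000F0F0
# nor x, zero gives the same result as a dedicated not x
assert nor(x, 0) == (~x & MASK)
```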

Name: Anonymous 2012-11-28 19:42

RISC architectures have cortical asstons of registers. Reserving one of them to hold an extremely common and useful value is not nearly the same indictment that it would be on x86.

Name: Anonymous 2012-11-28 20:50

You can synthesize some instructions using the zero register, eg. to negate a value you subtract from zero. Comparison instructions can be implemented using subtraction with the zero register as destination, discarding the results but keeping the status flags.
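
Both tricks can be sketched in Python with a model 32-bit subtract that produces NZ-style flags (a simplification; real flag sets also include carry and overflow):

```python
MASK = 0xFFFFFFFF

def sub(a, b):
    """32-bit wrapping subtract; returns the result plus NZ-style flags."""
    r = (a - b) & MASK
    return r, {"zero": r == 0, "negative": bool(r >> 31)}

# neg x: subtract x from the zero register
neg_x, _ = sub(0, 5)
assert neg_x == (-5 & MASK)          # two's-complement -5

# cmp a, b: subtract with the zero register as destination --
# the result is discarded, only the flags survive
_, flags = sub(3, 7)
assert flags["negative"] and not flags["zero"]
```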

Zero is such a common operand value that it's a clear advantage to have a dedicated register on register-rich load-store architectures (eg. MIPS, POWER). Architectures with fewer registers usually don't have one (eg. ARM, SH), but for instance the MSP430 has a couple of constant generator registers that can produce the most common constant values.

Name: Anonymous 2012-11-28 21:09

this nerd shit is giving me a headache

Name: Anonymous 2012-11-28 21:11

>>6
Cut it down, you don't need your head anyway. ^_^

Name: Anonymous 2012-11-28 22:35

Two real-world MIPS examples: The move d, s pseudo-op is actually implemented as or d, s, $zero, and the unconditional branch b addr pseudo-op is implemented using a conditional branch instruction as beq $zero, $zero, offset.
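
A toy interpreter makes the expansions concrete (a sketch, with register 0 hardwired to zero the way MIPS does it):

```python
regs = [0] * 32   # register file; regs[0] is $zero

def write(rd, value):
    """Model the hardwired zero: writes to register 0 are silently dropped."""
    if rd != 0:
        regs[rd] = value & 0xFFFFFFFF

def op_or(rd, rs, rt):
    write(rd, regs[rs] | regs[rt])

# move $2, $3  expands to  or $2, $3, $zero
regs[3] = 1234
op_or(2, 3, 0)
assert regs[2] == 1234

# b offset  expands to  beq $zero, $zero, offset -- the comparison
# always holds, so the branch is taken unconditionally
assert regs[0] == regs[0]
```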

Name: Anonymous 2012-11-28 23:21

>>4
I get that, but I was having trouble coming up with actual uses for it. The only other one I could think of was adding a carry in bignum arithmetic with adc 0.
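
The bignum trick, sketched on a model 8-bit accumulator: adding with an immediate 0 does nothing except fold a pending carry into the next limb.

```python
def adc(a, b, carry_in):
    """8-bit add-with-carry: returns (result, carry_out)."""
    s = a + b + carry_in
    return s & 0xFF, s >> 8

# 0x12FF + 0x0001, one byte at a time on an 8-bit machine
lo, c = adc(0xFF, 0x01, 0)   # low bytes overflow: result 0x00, carry 1
hi, c = adc(0x12, 0x00, c)   # adc #0 folds the carry into the high byte
assert (hi << 8) | lo == 0x1300
```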

>>5,8
Thanks for these, I knew I was forgetting some tricks. Surprised that move between registers is implemented with or, though I suppose it's not too difficult to detect that a move is intended there.
Any other instances where a zero reg could be used to implement pseudoinstructions?

In more modern times, though, it seems to me that dedicated instructions would make it slightly easier to detect dependencies (essential for stuffing pipelines and advanced branch prediction).

I suppose the ultimate question is whether "dyadic" architectures (2 operands, one of which is also the destination) with more space in the instruction for opcodes, or "triadic" architectures (2 operands, separate destination) with less space for opcodes are superior for code density etc.

That, and variable width instructions vs. fixed width ones. When I look at the MIPS architecture instruction set, a lot of times it seems to me as if they were at a loss as to what to put into the remaining space in their instructions, like the shift amount in R instructions. Same with condition codes in ARM.

Name: Cudder !MhMRSATORI!fR8duoqGZdD/iE5 2012-11-29 5:05

>>9
The question is whether the code has more instances where an operation produces a value from two existing ones (3-operand has advantage here) that also need preserving, or one of them is never used again (2-operand advantage). The majority of common code is probably the latter, with the former maybe more common in certain scientific computing applications.

Fixed-length RISCs certainly have a sparse (SPARCse?) opcode space, which makes them need more memory bandwidth than CISCs with variable-length instructions; they're quite wasteful of opcode space in that way. They usually compensate with more cache and higher clock frequency, which worked OK until a few years ago.

Name: Anonymous 2012-11-29 7:10

>>10
>Fixed-length RISCs certainly have a sparse (SPARCse?) opcode space, which makes them need more memory bandwidth than CISCs with variable-length instructions; they're quite wasteful of opcode space in that way. They usually compensate with more cache and higher clock frequency, which worked OK until a few years ago.

Kindly explain this good sir. I don't understand.

Name: Anonymous 2012-11-29 7:27

>>11
In RISC processors, each instruction has a fixed length (e.g. one machine word, typically 32 bits). In CISC processors, instructions may be any length, from one byte (e.g. an octet) up to as many bytes as the power N in n_d = 2^N, where n_d is the maximum amount of dicks that penetrated the processor designer's ass simultaneously.

Name: Anonymous 2012-11-29 7:33

>>12
I was referring to how it requires more memory bandwidth, although you did kinda touch on that. 32-bit MIPS has 16-bit instruction words. I don't see that being more than your average i386 (and especially not i686) instruction.

Name: Anonymous 2012-11-29 8:20

>>13
At first I thought you meant Thumb on ARM, then I realized there's MIPS16e...
Anyway, here's some interesting data:
http://www.eece.maine.edu/~vweaver/papers/iccd09/iccd09_density.pdf
tfw i386 master race

>where n_d is the maximum amount of dicks that penetrated the processor designer's ass simultaneously
ENTERPRISE HARDWARE DESIGN TECHNIQUES

Name: Anonymous 2012-11-29 8:31

>>10
Cudder-sama!

>>14
>tfw
Back to /g/, shitstain.

Name: Anonymous 2012-11-29 10:00

>>9
Other sources implement reg-reg move using add instead of or, but I don't know which one, if either, is canonical. The book "See MIPS Run" (2nd ed) has an instruction table that includes pseudo-ops and their expansions; you should be able to find a pdf by Googling.

3-operand instructions are supposedly easier to generate code for. Fixed-size instructions make instruction sequencing simpler and thus require fewer transistors, which was one of the driving goals behind early RISC design (especially MIPS, which is also why the instruction set doesn't have redundancies).

>>10
A bit of an overgeneralization. Eg. SH has fixed 16-bit instructions.

The current trend is towards mixed-length instructions, but driven by code size for embedded microcontrollers rather than bandwidth reasons. ARM created Thumb-2, MIPS copied it as microMIPS.

Name: Anonymous 2012-11-29 10:29

This thread is amateur level.

On a lot of architectures, the DMA controllers on the memory bus can be programmed to flood-fill huge pages of memory with zeros, bypassing the CPU cache and running in parallel with the CPU. The DMA controller can then raise a hardware interrupt once the operation is complete.

Doing this stuff usually requires ring 0 privileges, so it's a technique more for OS, systems or device driver authors.

Name: Anonymous 2012-11-29 10:36

>>17
Also, I forgot to mention that on architectures with virtual memory, when an application allocates a block of pages, the pages are usually zero-filled. The way it's usually done is to map all of the pages to the same zero-filled physical page. Reading is not an issue, but the first write triggers a page fault, which causes the OS to swap in the actual physical page that was reserved in the original allocation.

http://en.wikipedia.org/wiki/Zero-copy
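
The copy-on-write machinery itself is invisible from user space, but the zero-fill guarantee it provides is easy to observe. A minimal Python sketch:

```python
import mmap

# A freshly mapped anonymous page reads back as zeros; on most OSes every
# such page initially refers to one shared zero-filled page, and a private
# copy is only made when the page is first written (copy-on-write).
buf = mmap.mmap(-1, 4096)
assert buf[:4096] == bytes(4096)

buf[0] = 0xFF        # the first write faults and gets a private copy
assert buf[0] == 0xFF
buf.close()
```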

Name: Anonymous 2012-11-29 12:02

I just remembered an interesting idea in the HuC6280 (the 6502 derivative Hudson designed for the NEC PC-Engine): It had 8 bank-switching registers, and special instructions to transfer data between them and the accumulator. Rather than being the expected "load from bank register #0, store in bank register #7" instructions, they treated the second byte of the instruction as a bitflag of registers to use; eg. "store the contents of the accumulator in bank registers #0, #1, and #2" if the byte was 0x07.

It wasn't actually that useful for bank switching, but it seems like a really good idea for stack operations! Instead of picking between "push register #x" and "push all registers," you could push and pop a series of registers with a simple bitflag, allowing you to, say, efficiently push only the registers that a subroutine or syscall will trash.

I'm not aware of any architectures that implement this type of stack operation, though. Strange, as it seems like it would be really useful alongside the simpler push single register instructions.
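
Roughly what the bitflag scheme would look like for stack operations, modeled in Python on a hypothetical 8-register machine:

```python
regs = [10, 11, 12, 13, 14, 15, 16, 17]  # hypothetical 8-register file
stack = []

def push_multiple(mask):
    """Push every register whose bit is set in the mask, lowest first."""
    for i in range(8):
        if mask & (1 << i):
            stack.append(regs[i])

def pop_multiple(mask):
    """Pop in reverse order so each register gets its own value back."""
    for i in reversed(range(8)):
        if mask & (1 << i):
            regs[i] = stack.pop()

push_multiple(0b00000111)        # save only r0-r2 before a clobbering call
regs[0] = regs[1] = regs[2] = 0  # ...the callee trashes them...
pop_multiple(0b00000111)         # restore with one "instruction"
assert regs[:3] == [10, 11, 12]
```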

Name: Anonymous 2012-11-29 12:03

How the fuck do you even learn all this about CPU architectures?

Name: Anonymous 2012-11-29 12:04

>>20
What I mean is please give book suggestions on CPU architectures

Name: Anonymous 2012-11-29 12:05

>>16
>fixed-size instructions make instruction sequencing simpler
Yet, in this day and age, I think increasing complexity in the instruction sequencer would be justified if it lessened the burden on the instruction cache.

I'm also not buying the idea that minimizing the opcode field size so much (to the point of implementing register to register move with add 0, for example) actually makes efficient/powerful implementations any easier, since you have to "detect" all these special cases to avoid tying up the ALU for a stage in the pipeline actually computing add 0.

>>17,18
Yes yes, we're all aware of DMA constant fill operations. There are uses for load/store 0 or a dedicated zero register beyond filling swaths of memory, please read the thread.

Name: Anonymous 2012-11-29 12:52

>>19
ARM has the load multiple/store multiple instructions that do exactly that. I assume they're not so common because they add complexity to exception handling. If an interrupt is triggered while the instruction is executing, do you delay the interrupt or do you add logic to make it interruptible? What if the instruction triggers a fault?

>>22
In the original implementations, the zero register would have been implemented as a physically read-only register which doesn't need special handling. It's still faster than spending an instruction clearing a register.

>>21
Hennessy & Patterson, "Computer Architecture: A Quantitative Approach" (usually just referred to as "Hennessy & Patterson") is the SICP of computer architecture (except it's actually both good and relevant). If you need something more introductory, the same authors have another book, "Computer Organization and Design" (referred to as "Patterson & Hennessy" to avoid confusion between the two). Stallings' "Computer Organization & Architecture" is another common introductory textbook. There are many more, but those are the ones I've read.

Once you've understood the basic theory, just download and read the architecture manuals for different processors. Most manufacturers make them available free of charge. If you can find places where actual chip architects hang out (instead of the usual /prog/rider "experts") it's worth hanging around. The comp.sys.arch newsgroup used to have lots of interesting discussions, not so much anymore. The Real World Tech website seems OK.

Name: Anonymous 2012-11-29 13:38

>>23
>I assume they're not so common because they add complexity to exception handling
Good point. I'm more familiar with microcontrollers that lack things like exceptions and virtual memory support, so I forgot all about that aspect of CPU architecture.

For what it's worth, the HuC6280 has several block-move instructions that must run to completion before interrupts can be serviced. If you had something interrupt-critical going on in the background, you would need to break large transfers up into several small ones. Otherwise, say, a kilobyte block transfer could leave you over 6000 cycles late in servicing an interrupt! Ouch.

By comparison, I don't think waiting for 16 registers to finish being pushed to the stack would be so terrible for interrupt latency, at least not for desktop/server/mobile workloads.
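
The chunking workaround, sketched in Python with a hypothetical poll_interrupts() hook standing in for the window where the hardware could take a pending IRQ between chunks:

```python
serviced = 0

def poll_interrupts():
    """Hypothetical stand-in for a point where a pending IRQ could be taken."""
    global serviced
    serviced += 1

def block_transfer(src, dst, chunk=256):
    """Copy in small chunks so a pending interrupt waits at most one chunk."""
    for off in range(0, len(src), chunk):
        dst[off:off + chunk] = src[off:off + chunk]
        poll_interrupts()

src = bytearray(range(256)) * 4   # a 1 KB transfer
dst = bytearray(len(src))
block_transfer(src, dst)
assert dst == src and serviced == 4   # an interrupt window every 256 bytes
```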

>In the original implementations, the zero register would have been implemented as a physically read-only register which doesn't need special handling.
I'm aware of that; I'm saying that in any processor with a pipeline (pretty much anything non-embedded since the Pentium), it's a waste of a valuable pipeline stage to actually add 0 to the value of some register, so you need a special case in the instruction decoder to treat add d, s, $zero as an actual move d, s instruction. Implementing all the special cases like that in the instruction decoder is probably more expensive in the grand scheme of things than simply implementing a real move opcode.

This probably is negligible for hardware design in 2012 (given that Intel manages to beat everyone's pants off in raw performance with such a crufty ISA), but then again, this thread is about micro-optimizations.

>>20,21
For an approach from the opposite angle, you can try programming for simple 6502 systems and reading about how they work.
Once you've achieved some level of familiarity, Agner Fog's manuals about the microarchitecture of various x86 processors get very interesting.
http://www.agner.org/optimize/
http://www.agner.org/optimize/blog/

Name: Anonymous 2012-11-29 14:11

>>24
For some embedded work, interrupt latency (or at least predictable latency) can be critical. On Cortex-M3 LDM/STM can be interrupted halfway through execution, unless it's a conditionally executed instruction in which case it's cancelled and restarted.

If the add instruction has a 1 cycle throughput (like your archetypal MIPS 5-stage pipeline), there's no performance lost, and special-casing doesn't gain you anything either. When going superscalar/OoO it's probably worth recognizing $zero for dependency tracking and register renaming purposes.

Name: Anonymous 2012-11-29 14:14

>>24
>>25
Forgot to add that interrupt latency was one reason (besides required silicon area) many architectures either didn't include a division instruction, or used single-step division.

Name: Anonymous 2012-11-29 15:00

>>23
>>24
Thank you both very much for the recommendations!

Name: Anonymous 2012-11-29 15:02

>>26-27
Ahmed-tachi, optimize your quotes!

Name: Anonymous 2012-11-29 15:02

Actual programming discussion in /prog/ is UNACCEPTABLE

Name: Anonymous 2012-11-29 15:14

optimize my triple tens

Name: Anonymous 2012-11-29 16:08

>>30
Not bad, not bad, but you ain't seen nothin' yet.

Check these out: Pale nimbus, raised lettering, binary quints.

Name: Anonymous 2012-11-29 17:52

>>16
In classical MIPS it doesn't matter which one it is, because they all execute in 1 cycle.

Name: Anonymous 2012-11-29 18:00

>>32
Indeed, but in later architectures there may be a difference. Like how on x86 there are lots of instructions that perform a no-op, but a few have been selected as official and are special-cased in hardware.

Name: Anonymous 2012-11-29 19:51

>>33
XCHG EAX, EAX my anus

Name: Anonymous 2012-11-29 22:10

>>34
u mena lock cmpxchg8b?

f00f my anus

Name: Cudder !MhMRSATORI!fR8duoqGZdD/iE5 2012-11-30 6:53

>>11
Instructions must be fetched into the CPU from memory to be executed. Even if everything else took 0 time that is a limit on the instruction rate. Smaller instructions = less fetch bandwidth.

>>19
VAX had something like that. I've wondered how much performance you could get from VAX if DEC had the same resources Intel does today. The main difference is VAX was designed from the beginning to be a "large" architecture with little thought put into code density at the binary level, whereas x86 started from 8-bit roots.
