Wednesday, September 10, 2014

Why Macs will get ARM'd, part II

Update:

Be sure to check out Part III, in which I work through how this might look from a software perspective.

Last time I explained why, fundamentally, there is no reason that an Apple-designed ARM chip destined for laptops or desktops needs to have less performance than an x86 chip.  I'm not a writer, and my words were misunderstood by some, so to clarify the point, I was merely explaining why a future ARM chip, designed to be different than all currently existing ARM chips, could fill this role.  I was not suggesting that current ARM chips on the market are already good enough.

And, to be clear, I was merely addressing what is philosophically possible.  There would certainly be some problems to be addressed, and practical matters like patents, poor fab relationships, or other factors could certainly mean that while possible in a theoretical sense, such a chip might be very difficult to achieve.  Personally, I think it could be done, but that's a whole other conversation.

In this part of the analysis, I will discuss why, all else being equal - that is, with equivalent or nearly equivalent fab technology, equivalent design methodologies, similar design goals - an ARM chip would actually have an advantage in performance at any desired power consumption as compared to an x86-based chip.

One last note in response to some comments I received - I'm not biased against x86 or pro ARM. Such an accusation is weird - while I spent 9 years designing x86 chips (and a few years designing various RISC chips including PowerPC), I've never designed an ARM chip. I have no skin in the game, either - I am no longer in that industry.



In Light of Watch

In Part I, I discussed, among other things, that Apple increasingly wants to be in charge of its own destiny.  They don't want to be beholden to Intel's roadmap (hello Haswell delays) or Intel's vision of what computers should be.  Further evidence of the truthiness of this argument arrived in the form of Apple's new S1 chip, the heart of the Watch.  While Apple's competitors all flock to TI's OMAP processors (the Intel x86 of the smartwatch market), Apple again did its own thing.  At the very least Apple produced its own MCM (multi-chip module) and package.  But there's some evidence that the processor on the MCM is also Apple's own design (we won't know for sure until someone takes the watch apart).

Preliminaries

When I refer to the advantages of ARM, I really mean "advantages of a design based on the ARM instruction set architecture."  First, I am not proposing that Apple adopt the stock core designs that ARM licenses to most customers.  In fact, it's pretty clear Apple already does its own design based on the ARM architectural specification for its A-line of chips.  Second, I am not suggesting the design need hew precisely to the ARM instruction set.  Apple designs its own languages and compilers and is uniquely free to deviate from ARM to add instructions or instruction variations if it should choose to do so to improve performance or capability. (I am assuming there is nothing in the ARM architectural license that forbids this, but I've never seen it, so what do I know.)

ARM's Advantages - RISC vs. CISC

It's pretty well understood that ARM is "RISC" and x86 is "CISC," and many people have at least a basic understanding that this in some way means that x86 is more "complicated" than ARM.  But to make sure we're all on the same page, a little basic (and extremely oversimplified and probably misleading) computer architecture discussion is in order.

We start with the idea that we have some sort of memory structure, typically RAM, filled with instructions for the processor to execute, and data for the instructions to use.  The instructions and data probably came from some sort of external storage, like a disk drive, flash memory, ROM, or over  a network of some sort.

The CPU has some sort of computation engine, which we'll call the ALU ("Arithmetic and Logic Unit"), that contains the hardware to perform basic integer math and boolean logic operations - things like adding two integers, ANDing two integers, and the like.  Often there will be dedicated ALU hardware for multiplication (which, with some help, can also divide), adding (which can also subtract), and shifting and rotating.  By jiggling various inputs, these structures can also perform boolean operations like AND, OR, NOT, and XOR.
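As a rough sketch of the idea - plain C standing in for hardware, with made-up opcode names - an ALU is conceptually just a big selector that decides what the adder, shifter, and boolean gates do on any given cycle:

#include <stdint.h>

/* Illustrative opcodes - not any real instruction set. */
enum alu_op { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_XOR, ALU_SHL };

uint32_t alu(enum alu_op op, uint32_t a, uint32_t b) {
    switch (op) {
    case ALU_ADD: return a + b;
    case ALU_SUB: return a + ~b + 1;    /* subtraction reuses the adder: invert b, add 1 */
    case ALU_AND: return a & b;
    case ALU_OR:  return a | b;
    case ALU_XOR: return a ^ b;
    case ALU_SHL: return a << (b & 31); /* shift amount masked to 0-31 */
    }
    return 0;
}

Note how subtraction reuses the adder: invert one input and add one, which is exactly the two's complement trick that comes up again below.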

The ALU performs operations on data, producing more data, which eventually works its way into the memory.

So we start with something like this:


We can delve a little deeper, though.  The ALU receives instructions to execute.  The instructions can be things like "ADD" or "AND."  They also take "operands" which are the things (the data) that are the subject of the operation.  So a complete instruction may be represented as something more like ADD A+B=C.   So we can think of it this way:


Here we've separated the instructions and data into two different streams.  The data stream is bi-directional - the ALU receives data corresponding to "A" and "B" and puts back data corresponding to "C" (the sum of A+B).  But the instruction itself, ADD A+B=C, is a one way stream - the ALU doesn't "create" instructions or modify them.  At least in theory.  x86 does support commingling instructions and data in such a way that one can write code that modifies itself.  Hence, in some sense, in x86 the "instructions" stream must be bidirectional, which adds complication.  Herein, when I say "complication" I mean "power-slurping, performance-sucking transistors dedicated to the task."  In all RISC processors of which I'm aware, however, the instruction stream is a one-way road.

  • x86 Issue: Treating instructions like data is not a great idea.
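To see what "treating instructions like data" means in practice, here's a toy illustration in C - a made-up four-word instruction format, not any real ISA - of a machine whose instructions live in the same writable memory as its data, so one instruction can overwrite another before it ever runs:

#include <stdio.h>

enum { HALT, ADDI, STORE };    /* ADDI  d, s, imm : mem[d] = mem[s] + imm
                                  STORE d, s      : mem[d] = mem[s]        */
int mem[32] = {
    [0]  = ADDI, 20, 20, 1,    /* mem[20] = mem[20] + 1                    */
    [4]  = STORE, 8, 16, 0,    /* mem[8] = mem[16]: overwrites the opcode
                                  of the instruction at address 8          */
    [8]  = ADDI, 20, 20, 1,    /* never runs - it has just become a HALT   */
    [12] = HALT,
    [16] = HALT,               /* a data word that happens to equal HALT   */
    [20] = 100,                /* ordinary data                            */
};

int main(void) {
    for (int pc = 0; mem[pc] != HALT; pc += 4) {
        if (mem[pc] == ADDI)  mem[mem[pc+1]] = mem[mem[pc+2]] + mem[pc+3];
        if (mem[pc] == STORE) mem[mem[pc+1]] = mem[mem[pc+2]];
    }
    printf("mem[20] = %d\n", mem[20]);   /* prints 101, not 102 */
    return 0;
}

Hardware that has to allow for this (as x86 does) can't simply assume that the instructions it fetched, decoded, and queued up are still the instructions sitting in memory.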

Now, main memory is slow. First, it's far away. Electric signals travel at around 6 ps/mm in wires in a CPU (plus or minus, but that's the right order of magnitude), so just crossing a 20 mm die is roughly 120 ps - a good fraction of a clock cycle at 3 GHz, where a cycle is about 333 ps - before the memory has even started to respond.  And the drivers that charge and discharge these long wires to get to and from RAM take some time to do their job.  And the memories themselves aren't all that fast - they take some time for data to read out, and even more time (usually) for data to be written into them.  They're damned slow.  An ALU that requests a particular piece of data may spend hundreds or even thousands of "cycles" waiting.  (A "cycle" is the time it takes for the ALU clock to tick once.  Each instruction takes one or more cycles to execute.  Ideally each instruction would take one cycle, in the ALU at least, and many simple instructions do.)

A long time ago people figured out that if you stick a small amount of memory right in the ALU, things get much faster.  This is because when I calculate "ADD A+B=C," chances are the next instruction is waiting for C.  Say it's "SUB C-D=E."  If the ALU has to spend hundreds of cycles writing C into main memory, and then has to read it back out again to perform the subsequent subtraction, that's hundreds of wasted cycles.  We call this small amount of memory the "registers."

So what we have is something like this:


Let's say we have two new instructions, LOAD (or LDR) and STORE (or STR).  LOAD (or LDR) is used to retrieve something from main memory and stick it into a register.  STORE (or STR) is used to take data from a register and put it into main memory. So, our little program might now look something like:

// ADD A+B=C
mov r0, #A ; // put address A into register 0
mov r1, #B; // put address B into register 1
ldr r2, [r0]; // put contents of address A into register 2
ldr r3, [r1]; // put contents of address B into register 3
add r4, r2, r3; // add A+B and put result in r4
mov r5, #C; // put address of C into register 5
str r4,[r5]; // put result of addition into memory at address C

// SUB C-D=E
mov r6, #D; // put address D into register 6
mov r7, #E; // put address E into register 7
ldr r10, [r6]; // put contents of address D into register 10
sub r8, r4, r10; // subtract C-D (the result C is still sitting in r4) and put result in r8
str r8, [r7]; // put result of subtraction into memory at address E

This is certainly more verbose, and if humans were still generally writing assembly language by hand, this could be a problem. (Hint: they really aren't).

But let's look at the benefits. We can start loading D while we're doing the add of r2+r3 because the load/store hardware won't be busy doing anything else.  And we only have to fetch from memory three times and write twice, whereas without registers we'd be doing four fetches and two writes.  All of this implies some separate hardware to handle loads and stores.  It's responsible for telling main memory when to read (and what address to read) and when to write (and what address to write), and making sure the register file (the name of the "RAM" that holds the registers) is ready to read and write as appropriate.  So we really have something more like this:


What's nice about this is that the main memory only has to talk to the LOAD/STORE unit. This simplifies the design of both the memory interface and the LOAD/STORE unit.  Similarly, the registers get their data from the ALU or LOAD/STORE, but that's it. (Folks who know all this, please forgive me for ignoring implementations where register data always flows through the ALU, etc. Otherwise we'll be here all day).

Note that bits of the instruction stream go to the LOAD/STORE unit.  If the instruction is a load or a store, the LOAD/STORE unit has to know about it, after all.

This is, more or less, conceptually how RISC processors behave. But CISC processors like x86 are more complicated.  Instructions may implicitly load or store.   I may add the contents of register 1 to the contents of memory address A and store the result in memory address B or register 2 or the memory address corresponding to the sum of the value in register 2 plus the value in memory address C.  It's a mess.  This makes the circuitry for handling memory accesses and the register file much more complicated.  (And remember what "complicated" means as used herein).  It also makes it more complicated to determine when instructions can operate in parallel (since the CPU may not have a complete understanding of which instructions actually depend on the results of others).
  • x86 Issue: Complex memory addressing is problematic.
The supposed advantage was that it made assembly language code much more compact to write. There aren't a lot of people hand-writing assembly language code to celebrate this fantastic convenience anymore, however.
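To make "it's a mess" a little more concrete, here's a rough sketch in C - made-up operand encodings, far simpler than real x86 - of the difference between "the operands are always registers" and "each operand might be a register, a memory address, or memory addressed through a register":

#include <stdint.h>

uint32_t regs[16];
uint32_t mem[256];

/* RISC-style add: the instruction bits alone say "three registers, no memory." */
void add_rrr(int rd, int rn, int rm) {
    regs[rd] = regs[rn] + regs[rm];
}

/* CISC-style operand: a register, a memory address, or memory addressed through a register. */
enum kind { IN_REG, IN_MEM, IN_MEM_VIA_REG };
struct operand { enum kind k; int n; };

static uint32_t get(struct operand o) {
    if (o.k == IN_REG) return regs[o.n];
    if (o.k == IN_MEM) return mem[o.n];
    return mem[regs[o.n]];
}

static void put(struct operand o, uint32_t v) {
    if (o.k == IN_REG) regs[o.n] = v;
    else if (o.k == IN_MEM) mem[o.n] = v;
    else mem[regs[o.n]] = v;
}

/* CISC-style add: the hardware has to decode which case each operand is, compute
   effective addresses, and schedule the loads and stores itself. */
void add_cisc(struct operand dst, struct operand a, struct operand b) {
    put(dst, get(a) + get(b));
}

The RISC-style add is a single, predictable operation.  The CISC-style add can't even tell you whether it touches memory until it has picked apart the operand descriptions - and the real thing also has to worry about effective-address arithmetic, alignment, and faults in the middle of an instruction.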

There's another side effect to this complicated memory addressing.  Suppose we have only, I dunno, 32 registers.  One needs five binary digits to represent these. (00000, 00001, 00010, 00011, 00100, 00101, 00110, 00111, 01000, 01001, 01010, 01011, 01100, 01101, 01110, 01111, and then repeat all that with 1 in the first digit).  So if all my instructions are of the form of "THING-TO-DO FIRST-REGISTER SECOND-REGISTER RESULT-REGISTER" (like ADD R1+R2=R3, AND R1,R2=R3, etc.) then I need 15 bits (that is 5 x 3) for the registers and that's it.  If my operands can sometimes be registers, and other times be memory addresses, and other times be the sum of the contents of a memory address and a register, the number of bits I need varies tremendously.  I either need to allocate the worst case and use it for all instructions, or let different instructions use different numbers of bits.

Intel used the latter method.  This is a massive pain in the arse.  
  • x86 Issue: Variable length instructions add massive complications.
We pretended we had 32 registers.  This amount is pretty common in RISC processors.  For example, ARMv8-A supports 31 general purpose registers (plus some other special purpose ones).  x86-64, however, only supports 16, some of which are dual-use and sometimes have special purposes.  The advantage of having more registers is that you have more places to store intermediate results during a calculation before you have to start swapping things out to main memory.  The downside to more registers is that if there is a "context switch" (i.e. if the processor is told to stop working on one calculation and to start working on a totally unrelated calculation, like when one process is swapped for another), then the entire register file must be saved to main memory (with 64-bit registers, that's roughly 256 bytes of architectural state instead of 128), which either takes longer or requires more bandwidth.  In practice, we know that with real life compilers and real life code, the sweet spot is probably closer to 32 registers than 16 registers.

  • x86 Issue: More registers would probably be better for performance and power.
In a RISC processor, the instructions that come from main memory are easily decoded by the Instruction Decoder, which examines the instruction to determine where to send the operands, and informs various units of what they will need to do.  For example, if a subtract instruction is fetched from memory, the instruction decoder informs the ALU that the adder will be used and that one of the inputs should be inverted (the two's complement trick, with a one added via the adder's carry input) so that the adder can perform subtraction, and tells the register file to send the appropriate arguments for the subtraction operation to the ALU.  This is easy to do when the instruction has fixed length, because you always know which fields of the instruction to look at.
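Here's a sketch of why fixed-length decode is cheap, using a made-up 32-bit format (this is not the actual ARM encoding): when every field lives at a known bit position, "decoding" is just a handful of shifts and masks, the same for every instruction.

#include <stdint.h>

/* Hypothetical fixed 32-bit instruction word:
   bits 31-24 opcode, 23-19 rd, 18-14 rn, 13-9 rm (five bits each = 32 registers). */
struct decoded { uint32_t opcode, rd, rn, rm; };

struct decoded decode(uint32_t insn) {
    struct decoded d;
    d.opcode = (insn >> 24) & 0xFF;
    d.rd     = (insn >> 19) & 0x1F;
    d.rn     = (insn >> 14) & 0x1F;
    d.rm     = (insn >>  9) & 0x1F;
    return d;
}

Nothing about one instruction affects where the fields of the next one sit, so the decoder can chop up several instructions per cycle without thinking very hard.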

In x86, the instruction decoder is much more complicated.  The instruction decoder must figure out where the instruction starts and ends, and where each field in the instruction begins and ends.  This may require a "state machine," which is fancy talk for what is essentially itself a tiny little computer, just to figure out how to handle the instruction being fetched from memory.
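Here's a toy version of that problem - a made-up variable-length format, vastly simpler than real x86 - just to show why a little state machine is needed at all: you can't even tell how long an instruction is, or where the next one starts, until you've walked its bytes in order.

#include <stddef.h>
#include <stdint.h>

/* Toy variable-length format: any number of prefix bytes (0xF0-0xFF), then an
   opcode byte whose low two bits say how many operand bytes follow. */
size_t instruction_length(const uint8_t *bytes) {
    size_t i = 0;
    while (bytes[i] >= 0xF0)       /* state 1: skip prefixes                    */
        i++;
    uint8_t opcode = bytes[i++];   /* state 2: only now can we read the opcode  */
    return i + (opcode & 0x3);     /* state 3: operand count depends on opcode  */
}

Real x86 is much worse: prefixes, a one- to three-byte opcode, a ModRM byte, an optional SIB byte, and displacement and immediate fields whose sizes depend on everything that came before.  And until you know how long instruction N is, you can't start decoding instruction N+1.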

Floorplan of the Athlon 64/Opteron x86 chip.  Note the "Fetch Scan Align Micro-code" block.  Much of that block deals with complications caused by the x86 instruction set.
x86 makes it even more complicated by using something called "microcode," which is a hallmark of CISC-based designs.  The concept of microcode again stems from the idea of making it easier to code in assembly language and adding "features" to the CPU.  Imagine a 1980's-era CPU designer notices that a lot of software code seems to perform an ADD followed by a subtraction of 1.  In other words, code frequently has to calculate A=B+C-1.   Maybe 2% of software does this someplace.  The CPU designer also notices that because of some quirk of the way he or she designed the adder, it would be pretty easy to add an extra NAND gate and a couple of wires and be able to support this directly, in a single instruction.  So the instruction set gets a new instruction, let's say ADDM1.

5 years later, the adder design is optimized to be much faster for the common addition cases, but it's no longer easy to support the ADDM1 instruction without a performance hit.  The solution is simple! Intercept ADDM1 instructions and automatically, in the CPU, convert them to a string of instructions, like PUSH C, ADD A+B=C, SUB C-1=D, POP C.  (I threw in a push and pop there just to amuse myself).

What you need is a tiny little ROM that, for each instruction, contains a set of "microcode instructions" that are the instructions that are really seen by the hardware.  The microcode instructions are executed in sequence.   There are a few advantages to this technique - first, it papers over differences in the underlying hardware between different chips.  The same software keeps running even if there's different hardware.  For example, maybe there's no integer multiplier hardware, but instead I do multiplication by using the adder (over and over).  (Not a good idea, by the way).  Second, if there is a bug with the hardware, rather than having to redo the whole chip, one can usually fix it by creating the appropriate microcode routine and sticking it in the ROM.
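Conceptually, the microcode ROM is just a lookup table.  Here's a minimal sketch in C (the micro-op names are invented, and ADDM1 is the made-up instruction from above, not a real x86 instruction):

/* Micro-ops: the simple operations the execution hardware actually understands. */
enum uop { UOP_END, UOP_LOAD_A, UOP_LOAD_B, UOP_ADD, UOP_SUB1, UOP_STORE };

/* The "microcode ROM": each complex architectural instruction indexes a canned
   sequence of micro-ops, terminated by UOP_END. */
static const enum uop microcode_rom[][8] = {
    [0 /* ADDM1, i.e. A+B-1 */] = { UOP_LOAD_A, UOP_LOAD_B, UOP_ADD, UOP_SUB1, UOP_STORE },
};

void execute(int opcode) {
    for (const enum uop *u = microcode_rom[opcode]; *u != UOP_END; u++) {
        /* hand *u to the ALU or the load/store unit here */
    }
}

A bug fix then becomes a new ROM routine rather than a new chip, which is the main attraction - but every ADDM1 marches through this fixed little recipe whether or not it's the best way to do the job on that particular hardware.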

But this adds more complications.  And remember what "complicated" means herein.
  • x86 Issue: Microcode is inefficient and adds complexity.
First, it makes the instruction decoding take even longer, and makes the circuitry more complicated.  This usually means adding more pipeline stages to handle the decode (so that while instruction number 1 is being executed, instruction number 2 is simultaneously being decoded), and more pipeline stages are bad for all sorts of reasons.   It also takes up die area, which means circuits are further apart, and signals take longer to get where they are going, or use more power doing so, or both.

Second, microcode is often inefficient.  A good compiler, which can see more of the surrounding code and has more information, can often do a much better job of breaking up complicated instruction combinations into smaller pieces for execution.

Anyway, this is what we have now:



There are some more complications specific to x86. First, even though x86-64 has simplified some things so that x86-64 instruction streams can run with a little less baggage than x86 streams, all x86 chips are still compatible with all the old gunk that was in the 32-bit, 16-bit, etc. versions of the Intel architecture. There's a bunch of hardware on-chip to cope with these old modes of operating.  This also means, for example, the floating point unit behaves like a separate chip, like in the old 8087 co-processor days, which is hardly the most efficient way of behaving.

  • x86 Issue: Backwards compatibility brings a lot of baggage.

I briefly discussed pipelines, and they aren't the simplest thing to understand, but they are important.  Imagine I have a chip with two adders in the ALU.  And imagine I have two instructions:

ADD A+B=C
ADD D+E=F

Suppose we fetch them both, from main memory,  into the instruction decoder.  The instruction decoder decodes them, and sends the appropriate signals to the registers and ALU.  We talked about the system clock earlier.  Imagine a ticking clock, and each unit has to get its work done between the ticks.  The first instruction would proceed through the CPU something like:


The instruction moves from unit to unit, taking a cycle each time. (In reality, a unit may take more than one cycle, but still a whole number of them.)  Once the instruction has been fetched from memory, the memory has nothing to do.  So pipelining allows the memory to start fetching the second instruction, before the first instruction has finished executing.  Once the decoder has finished decoding the first instruction, it's free to start decoding the second, etc.  This looks something like:
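Roughly, the overlap looks like this (a simplified three-stage sketch - real pipelines have many more stages):

cycle:       1       2       3        4
ADD A+B=C:   fetch   decode  execute
ADD D+E=F:           fetch   decode   execute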


This works because the second instruction is independent of the first. Its inputs do not depend on the outputs of the prior instruction (for those in the know - I'm ignoring register bypass to make a point).

But what if my instructions were:

ADD A+B=C
ADD C+E=F

Then I can't perform the second instruction until I've calculated the answer to the first instruction.  Like this:
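In the same simplified three-stage sketch, one way the stall plays out:

cycle:       1       2       3        4        5
ADD A+B=C:   fetch   decode  execute
ADD C+E=F:           fetch   decode   (wait)   execute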


Now we have a bubble in the pipeline where the ALU has nothing to do.  This reduces performance. And if the ALU can't be powered down during this time, it wastes power as well.

x86 has a particular feature - the "flags register" - that makes these sorts of dependencies more likely.  "Flags" is a term that refers to special events that can occur during a calculation.

  • x86 Issue: The flags register can reduce performance and efficiency.
The flags register is an implicit output of many x86 instructions.  So when I do ADD A+B=C, the flags register is a second output, in addition to C.  The flags register contains bits that are set or cleared depending on whether the main result is zero, is too big to represent, is negative, etc.  Before pipelining, and particularly before superscalar designs (where multiple instructions can execute at the same time), this was somewhat clever.  But now it creates dependencies between instructions that cause bubbles in the pipeline, complexity to avoid bubbles in the pipeline, or both.
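A minimal sketch in C of why this hurts, modeling the flags as one shared variable (a conceptual model only, not how any real core is built):

#include <stdint.h>
#include <stdio.h>

struct flags { int zero, negative, carry; };
struct flags FLAGS;    /* one architectural copy, implicitly written by every ALU op */

uint32_t add_and_set_flags(uint32_t a, uint32_t b) {
    uint64_t wide = (uint64_t)a + b;
    uint32_t result = (uint32_t)wide;
    FLAGS.zero     = (result == 0);      /* the implicit second output */
    FLAGS.negative = (result >> 31) & 1;
    FLAGS.carry    = (wide >> 32) & 1;
    return result;
}

int main(void) {
    uint32_t c = add_and_set_flags(2, 3);    /* writes FLAGS */
    uint32_t f = add_and_set_flags(10, 20);  /* also writes FLAGS */
    /* The two adds are independent as far as their data goes, but both write the
       same flags register, so the hardware must track (or rename away) that extra
       dependency on every single arithmetic instruction. */
    printf("%u %u zero=%d\n", c, f, FLAGS.zero);
    return 0;
}

ARM, by contrast, makes setting the flags opt-in for most instructions, so this particular dependency only exists where the code actually asks for it.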

We haven't talked about caches, branch prediction, TLB's, etc.  The tl;dw (too long, didn't write) version of that stuff is that a lot of structures on the chip either need to get more complicated, bigger, or do a worse job in CISC processors because of many of the things I've written about above.  Generally speaking, variable length and complicated instructions with complex addressing mode support and backwards compatibility prevent the processor circuitry from having as much clear information as it could about the relationships and interdependencies between instructions, which results in bad guesses being made more often, which means wasted power and poorer performance.

A RISC architecture keeps the hardware much simpler, and pushes more of the work onto the compiler, which has better information (at least about static considerations) and more time and resources to optimize things.  Keep in mind that each time Apple releases new compiler technology performance improves, and that Apple's new Swift language appears to generate much faster code than Objective C, its old language.  Imagine if Apple could optimize the entire stack, from OS/SDKs to programming language to compiler to the hardware that runs the compiled code. It can do this already with iOS, which is one reason iOS tends to be more fluid than its competitors on equivalent hardware with equivalent RAM.

One thing to keep in mind as canonical evidence that x86 designs have "gunk" that RISC designs don't need is the fact that around 20 years ago x86 designers realized that designing the hardware to keep all these crazy addressing modes, variable length instructions, and other stuff working was too hard.  As a result, all modern CISC processors can really be thought of as RISC processors with some extra hardware to convert the stream of crazy complex instructions into a set of nice, constant-sized RISC-like internal instructions (micro-ops).  The rest of the hardware then just has to cope with this new pseudo instruction set.

Another datapoint - PowerPC and ARM are really approximately equally complicated (other than IBM's unfortunate predilection for numbering the bits in a word backwards from the rest of right-thinking society).  People generally accept that PowerPC could compete with x86 in the "PC" market, and that its failure to do so long term had more to do with failure to keep up with process technology innovations, and failure to execute on design roadmaps.  Apple is pretty good at executing roadmaps, and there is much more parity in process technology than there used to be, particularly because by bundling its mobile and Mac chip production business Apple would be able to drive huge volumes and could afford the sort of experimentation that used to be the sole province of the Intels of the world.

Non-x86 Advantages - Other

Apple would achieve other advantages by spinning its own chips.  Some of these advantages would apply equally if Apple designed its own x86 chips, but that simply isn't in the cards due to licensing issues and complexity (it's very very hard to design a functioning x86 chip, due to the complexity of the instruction set).

First, look at what Apple's been able to do in the mobile market with the A7.  It was able to integrate an image processing circuit, graphics unit, secure enclave, and caches with sizes chosen by Apple, along with lesser-ticket items like the LCD controller, various serial interfaces, Ethernet, USB, etc.  It didn't have to convince Intel or AMD to build a chip with the functionality it preferred for its product.  It was able to jettison interfaces and logic blocks that it didn't need for its own products.  In short, it didn't have to buy "off the rack."

Combining what would otherwise be multiple chips into a single package reduces overall cost, power consumption, and the size of the overall device.

Conclusion

I have been accused of being an "ARM fanboy" or "biased against x86."  This is nonsense. I have no experience at all with ARM and don't have any particular feelings about it.  And I was an x86 chip designer for nine years.  I simply have an engineering mindset, and believe in the right tool for the right job, and maximizing efficiency.  There is no question that x86 adds complexity vs. ARM (or any other RISC processor); the only question is whether that complexity helps more than it hurts. Any x86 chip is really a RISC processor plus extra stuff, and that extra stuff is never free - it costs die area, increases the price of the device, increases the likelihood of bugs, increases the time needed to design the chip, increases power consumption, and reduces performance.  And there's no better proof that this extra stuff hurts more than it helps than the fact that the x86 chip makers have been working hard to eliminate it (e.g. by leaving a lot of it out of the x86-64 spec in the hopes that someday backwards compatibility, and its resulting extra stuff, could be jettisoned).

There's also no question that Apple is increasingly taking charge of its vertical supply chain.  It wants to own the parts of the supply chain that it can use to differentiate itself from the competition.  Hence the purchase of sapphire capacity, the acquisition of biometric sensor companies, the creation of world-class voice recognition teams, and, most relevantly, the purchases of Intrinsity (which traces back to Exponential Technology, my former employer) and PA Semiconductor.   The rationale for this applies just as much to Macs as it does to iOS devices.

In the next part I will describe how I think the switch to ARM might work, what will be lost in the transition, and how the things most important to most Mac users will continue to work.

