Sunday, September 7, 2014

Why Macs will get ARM'd

This is Part I of a multi-part analysis of Mac on ARM.  Be sure to check out Part II, in which I explain why x86 inherently has a small but real disadvantage in optimizing cost, performance, and power, and why ARM would offer Apple some real advantages in the desktop and laptop market.  Also check out Part III, in which I examine how the transition would work from a software perspective and what the result would look like.  In Part IV I will investigate what the hardware transition would look like.

Part I: Microprocessor Design

Exponential x704 microphotograph with overlaid floorplan.  One of the chips I helped design.

Some folks noticed that I recently predicted that within a couple of years Apple will be selling at least some Macs running on an ARM-based architecture (as opposed to the AMD x86-64 architecture they currently use. Yeah, I call it that. Suck it Intel.)  People questioned how this could be, given what they foresee as Intel's inherent performance advantage and its progress toward reducing power consumption.  Some of the doubters also believe that Intel has inherent advantages that cannot be reproduced on ARM.  I disagree, but understanding why requires knowing where in the design process the three key metrics of microprocessor design - performance, power consumption, and cost - can be affected.

CPU design is all about tradeoffs

I've worked on designing MIPS-like, x86, x86-64, SPARC, and PowerPC processors, as well as some non-commercial architectures that no one ever heard of, and I have been involved, at least to some extent, in most of the tasks it takes to go from a clean sheet of paper to functioning silicon. (And, in my case, to functioning gallium arsenide as well.)  Not every company does it the same way, but based on who Apple has hired and where their design teams come from, I'd bet I have a fairly decent understanding of how they might do things.

In Part I of this blog entry, I will describe the microprocessor design process and explain how the different steps of the process can qualitatively affect the key metrics (performance, power consumption, and cost).  I will focus on the methodology I am most familiar with, both because that's the easiest for me to talk about and because I think Apple's design methodology isn't significantly different.  In subsequent parts I will explain why I think ARM offers Apple advantages over Intel, and how Apple could make it work.

Step 1: Requirements

The original Athlon 64/Opteron die and floorplan.  Hmm, that Micro-code block looks pretty big, huh?
A new processor design always starts with an evaluation of the basic requirements.  These may come from surveying customers, analyzing market direction, analyzing the competition, or identifying new markets to attack.  Typically the requirements, at least at first, are quite general.  There's a certain minimum acceptable performance, a certain power budget, and a certain maximum cost.

For example, processors intended for use in server farms need to consume very little power but don't need too much performance, and certainly don't need much floating point performance.  Processors for workstations need high graphics performance and good multiprocessing performance, but power consumption and cost may be less important.

The requirements also include other things, some of which often go unsaid.  At AMD we knew we had to be compatible with Microsoft operating systems (which meant we either had to be compatible with Intel or we had to convince Microsoft to support our deviation from Intel).  At Sun we knew we were going to design a SPARC (I steadfastly refused to have anything to do with the other team, which was designing "Java processors").  At Exponential we were pretty agnostic about what we were designing, but given Apple's investment we were led to focus on PowerPC (for a while).

Of course, just as Sun was free to migrate its customers from SPARC to the "Java chip" (unsuccessfully) or to x86 (more successfully), Apple is not constrained by such factors from switching to ARM.  Much like the transitions it made from Motorola 68k to PowerPC and then again to Intel, Apple could switch to ARM, since it controls the OS and the compiler, and it has shown it knows how to support fat binaries, emulators, and other technologies for smoothing over the transition period.

Sometimes we knew we had to have certain features - marketing wanted to differentiate by providing encryption functions or support for some new SIMD instructions or the like.  Being able to differentiate in this way is something that OEMs who limit themselves to buying the same Intel chips as their competition can't do, of course.  Apple does compete on processor specs where it can; in the PowerPC days it advertised the advantages provided by the processors it used (heck, Apple even named some models after the chips they used), and it has begun to do the same thing with the ARM chips it uses in iDevices (secure enclave, 64 bits, performance, etc.).  Apple cannot currently do this with Macs because it is using the same chips as everyone else.  If Apple wants to introduce Touch ID with a secure enclave to protect fingerprint and other sensitive data, it can't do so unless it convinces Intel to go along (and then, perhaps after a short exclusivity period, all of Apple's competitors have access to the same technology).

Apple using unique features of its specially-designed A7 chip for iDevices as a point of differentiation in marketing


The Requirements stage is where the basic contours of performance/power/cost are set.  Frequently there will be some back and forth with the engineering team (sometimes the requirements are not possible, or sometimes the engineers come up with alternate proposals that marketing never considered).  But if it's a chip for ultralight laptops, one can be sure it won't run at maximum possible clock frequency and with a 100W TDP.

So:

  1. Apple's reliance on Intel (or, in a broader sense, x86-based chips) prevents differentiation based on features and, more importantly, on capabilities.  Apple has shown that it prefers such points of differentiation. 
  2. Apple's reliance on Intel hinders its ability to offer features that it might wish to offer, like Touch ID, regardless of whether it's a point of differentiation.
  3. Apple has, through two prior instruction set architecture transitions, demonstrated it has the capability to make such transitions smoothly. 
  4. Apple controls the entire stack, from the OS, to the SDKs, to the programming languages (Objective-C and Swift), to the compiler.  This gives Apple unique freedom to change its processor architecture.

Step 2: Architecture

Layout of the RPI F-RISC/G cache controller
Outside the industry, people often use the term "architecture" to refer to the instruction set (and associated specifications) the processor uses.  To distinguish that concept from what I mean here, I will call it the Instruction Set Architecture ("ISA" for short).

When I refer to architecture, I refer to the high-level description of the operation of the processor.  Does it have a cache? How big? One each for data and instructions? How many instructions can it process at once? Does it have multiple cores? How big is the register file (if not determined by the ISA)? Does it have trace caches? How big are the TLBs? Does it support out-of-order issue? Out-of-order retirement?  How many instructions can be in flight at any time? How deep are the pipelines? How many cycles to do a 64-bit addition?  The list of issues goes on and on.

Sometimes this includes whether to support optional portions of instruction sets (and instruction set extensions).

Some of this is often called "microarchitecture" but I'll include it all as "architecture" because the folks who did this work were universally called "architects" at the places where I worked.
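To give a flavor of how concrete the answers to those questions eventually become, here's a hypothetical set of L1 data cache parameters written out as Verilog constants.  Every number below is invented for illustration, but each one is an architect's decision with performance, power, and cost consequences:

    // Hypothetical L1 data cache parameters -- all values invented for
    // illustration; every one of them is an architectural decision.
    module l1_dcache_params;
      parameter  CACHE_BYTES = 32 * 1024;   // total capacity
      parameter  LINE_BYTES  = 64;          // cache line size
      parameter  WAYS        = 4;           // set associativity
      parameter  PADDR_BITS  = 48;          // physical address width
      localparam SETS        = CACHE_BYTES / (LINE_BYTES * WAYS);      // 128 sets
      localparam INDEX_BITS  = $clog2(SETS);                           // 7 index bits
      localparam OFFSET_BITS = $clog2(LINE_BYTES);                     // 6 offset bits
      localparam TAG_BITS    = PADDR_BITS - INDEX_BITS - OFFSET_BITS;  // 35 tag bits
    endmodule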

The architecture makes a big difference in our key metrics.  Performance, power consumption, and cost are all directly affected by architectural decisions.

For example, doubling the size of the L2 cache may increase performance on key benchmarks by 10%.  But doing so may double the die size (and hence the cost). And it may increase power consumption of the chip by 15%.  It gets more complicated.  While power consumption of the chip may increase by 15%, by reducing the frequency of main memory reads and writes the system power may decrease by 2%, which may mean that, overall, the entire system consumes less power.   Of course, that's of little value if the power dissipated by the chip per square centimeter is such that the chip can't be properly cooled because the volume of the phone it's going in does not allow a sufficiently sized heatsink.

Perhaps, instead of increasing the cache size, the architect decides to double the speed of the CPU clock (assuming the engineers down the line can make this work).  Even if doing so could be accomplished without increasing the CPU voltage (unlikely), this doubling of clock frequency will cause lots of wires and transistor gates to charge and discharge twice as often, which will double power consumption (or, at least, double the portion of the power consumption that derives from switching, which can range from 50% to 80% of the overall power consumption, depending on factors to be described later).
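For the back-of-the-envelope version of that claim, the standard first-order model for switching power is:

    P_switching ≈ α · C · V_dd² · f

where α is the fraction of nodes that switch each cycle, C is the total switched capacitance, V_dd is the supply voltage, and f is the clock frequency.  Hold α, C, and V_dd constant and double f, and P_switching doubles; if switching accounts for, say, 70% of total chip power, total power rises by roughly 70%.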

Further, increasing clock speed may require shorter wires (signals propagate through wires at a finite speed) with less capacitance (wires take a while to charge and discharge, and longer wires take longer).  To accomplish this, more mask layers may be needed, which increases the price of the part.

The architect may add more registers, which speeds up some benchmarks but slows down others that involve a lot of task switching.

In short, there are many choices to be made, and each of them has a real effect on key processor metrics.

What's important to note here is that the vast majority of the choices available in the architect's toolbox apply whether the chip is an Intel x86 or an ARM-based chip.  In each case the architect can choose the number of cores, whether to support hardware multithreading, bus sizes, cache line widths, cache organizations, the dimensions of memory structures like translation lookaside buffers and caches, register renaming techniques, branch prediction strategies, etc.

As wafer sizes increase, transistors and wires decrease in size, and the number of transistors per die increases, the new architectural techniques that become available to Intel architects also become available to ARM architects.

So:
  1. Nearly all architectural techniques for increasing performance and decreasing power consumption are equally available regardless of the instruction set of the CPU.
  2. Apple controls its compiler, so it can make sure that code takes full advantage of its architectural decisions.

The architects are typically responsible for developing a "behavioral model" of the CPU.  This is essentially a software program, typically written in Verilog (though C++ and VHDL are reasonably common as well), that simulates the operation of the CPU.   In high-end designs, of the sort I've worked on, this model is not very detailed.  If the CPU has an adder, it's represented by code like result<=register0 + register1.  There's nothing in the code to indicate the structure of the adder or the other lower level blocks.  Rather, the design is divided into high-level blocks (things like "instruction decode" and "integer unit") corresponding to the organization of the design team.  The integer unit design team can use the integer unit behavioral code to make sure that their design behaves in the way the architect imagined.
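To make that concrete, here is a minimal sketch of what such a behavioral fragment might look like; the module and signal names are invented for illustration, not lifted from any real design:

    // Hypothetical behavioral fragment for part of an integer unit.  Note
    // that the "+" says nothing about how the adder is built (ripple-carry,
    // carry-lookahead, carry-select...); that gets decided downstream.
    module int_unit_behavioral (
      input  wire        clk,
      input  wire        add_valid,
      input  wire [63:0] register0,
      input  wire [63:0] register1,
      output reg  [63:0] result
    );
      always @(posedge clk) begin
        if (add_valid)
          result <= register0 + register1;
      end
    endmodule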

In lower-end designs, the behavioral code contains more detail, because rather than letting a designer determine the structure of these blocks, a synthesis tool (i.e. software - typically from Synopsys) does the heavy lifting.  The extra detail in the behavioral model provides guidance to the synthesis tool so it doesn't go too far off the rails.  Interestingly, the choice of whether or not to use a synthesis tool is another opportunity to affect our key metrics.  In my experience (and we tested this extensively over the course of a decade), using a synthesis tool universally resulted in an outcome about 20% worse than allowing trained designers to do the work.  You can pick your 20% - either 20% worse performance, 20% worse power consumption, or 20% worse cost (stemming from 20% more space on the die).  Or various combinations that add up to 20%.

Step 3: Logic Design (and Circuit Design) (and Physical Design)

AMD K6-II microphotograph

The next step, once the overall behavior of the various CPU blocks is determined, is to design the circuitry that produces that behavior.  Here the division of labor varies from company to company, but I'll use the broadest definition.

First, it's important to understand that a block, say the "integer execution unit," is designed from smaller basic building blocks.  These building blocks generally fall into two types: "standard cells" and "macro cells."  Standard cells are generic, reusable circuits that perform basic functions.  These cells have a predefined "layout" (i.e. the set of polygons on different mask layers that form the transistors and wires in the circuit) and logical behavior.  For example, there are standard cells to perform basic Boolean functions such as NAND, NOR, NOT, XOR, and the like.  Moreover, there are different versions of each standard cell depending on the number of inputs.  So there's a NOR2 that performs a logical NOR on 2 inputs, and a NOR3 that does the same for 3 inputs.  Then there are different versions of each of these that have different drive strengths; this enables the designer to choose the cell that's just strong enough to drive its output load at the speed needed to meet the clock frequency goal, but not so strong as to waste power.  So there's a NOR3x1, a NOR3x2, etc.
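As a sketch (with made-up names following the convention above), the logical view of two drive-strength variants of the same cell is identical; what differs is the transistor sizing in the layout and the timing and power data the tools use:

    // Hypothetical standard-cell stubs: two drive strengths of a 3-input NOR.
    // Logically identical; a real library distinguishes them by layout
    // (transistor sizing) and by their timing/power characterization.
    module NOR3x1 (input A, B, C, output Y);
      assign Y = ~(A | B | C);
    endmodule

    module NOR3x2 (input A, B, C, output Y);   // same function, twice the drive
      assign Y = ~(A | B | C);
    endmodule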

Depending on the situation, the standard cell library may be provided, as-is, by the foundry.  So, for example, TSMC may provide its customers with a cell library, and leave the customer with little option to deviate.   I very strongly suspect Apple is not in this boat; it's a huge customer which has hired a lot of folks who would not be interested in using an as-is standard cell library that isn't optimized for its own needs.

By optimizing the standard cell library in various ways, one can affect performance, power, and cost. For example, one can choose the aspect ratio of the cells - are they tall and skinny, short and squat, or in-between?  Are there special cells for certain types of structures?  How do the cells connect to the wires? (i.e. are the pins drawn vertically or horizontally, and in what layer?)  What cells are in the library?  What's the power grid look like?  How about the clock grid?  Do I use flip-flops or latches? For one of our designs, we eliminated the so-called "positive polarity" cells like AND and OR and forced designers to create AND using a NAND followed by a NOT.  This was more efficient because an AND is really just a NAND followed by a NOT anyway, and decoupling them encouraged the designer to move the NOT away from the NAND, where the NOT could perform a power-saving signal-repeating function.   This set of choices, however, is independent of whether one is designing an x86 or ARM part.  So if there's an optimal solution, it's equally available to everyone.
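Here's a toy structural version of that trick (cell names again invented): the AND function is built as a NAND plus an inverter, and the inverter can be dropped far from the NAND, near the receivers, where it also serves as a repeater on a long wire:

    // Hypothetical standard cells, plus an AND2 function assembled from them.
    module NAND2x1 (input A, B, output Y);  assign Y = ~(A & B); endmodule
    module INVx4   (input A,    output Y);  assign Y = ~A;       endmodule

    // In placement, u_nand sits near the signal sources and u_inv sits near
    // the distant receivers, repeating the signal along the way.
    module and_via_nand (input a, b, output y);
      wire n;
      NAND2x1 u_nand (.A(a), .B(b), .Y(n));
      INVx4   u_inv  (.A(n), .Y(y));
    endmodule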

The other type of cell, the macro cell, is a customized cell that performs a more complicated function or a function that can't be implemented in a standard cell.  For example, in the integer execution unit, the register file is likely to be a macro cell; essentially it's a highly optimized, albeit small, SRAM with a lot of read ports.  Circuit designers design this cell on a transistor-by-transistor level and produce a block that can be snapped together with the standard cells.  While different instruction set architectures may require different macro cells (e.g. an x86 has a small register file while RISC architectures tend to have bigger ones), the circuit design tricks used by the designer to increase speed or reduce power are independent of the ISA.
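As a rough behavioral stand-in for what such a macro implements (the real thing would be hand-drawn transistors, and the sizes here are invented), think of something like a 32-entry register file with four read ports and two write ports:

    // Hypothetical 32 x 64-bit register file, 4 read ports, 2 write ports.
    // This is only the behavioral view; as a macro cell it would be designed
    // transistor by transistor.
    module regfile_4r2w (
      input  wire        clk,
      input  wire [4:0]  raddr0, raddr1, raddr2, raddr3,
      output wire [63:0] rdata0, rdata1, rdata2, rdata3,
      input  wire        wen0, wen1,
      input  wire [4:0]  waddr0, waddr1,
      input  wire [63:0] wdata0, wdata1
    );
      reg [63:0] regs [0:31];

      assign rdata0 = regs[raddr0];
      assign rdata1 = regs[raddr1];
      assign rdata2 = regs[raddr2];
      assign rdata3 = regs[raddr3];

      always @(posedge clk) begin
        if (wen0) regs[waddr0] <= wdata0;
        if (wen1) regs[waddr1] <= wdata1;
      end
    endmodule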

So:

  1. Choices of standard cell architecture that improve performance and power can be made independently of instruction set choice.
  2. Circuit design choices are independent of instruction set choice. 

Once there is a library of cells to choose from, the next step is to arrange them so they perform the proper Boolean functions.  This is the "synthesis" I referred to earlier.  We usually did it by hand, though the trend is to do at least some parts of even high-end chips using automated tools (bad idea, but no one listens to me).

The designer also has to physically position the cells on the chip, a process called "placement."  Again, this is often automated, but we typically did it by hand.  Synthesis and placement must be done in coordination - if two cells are far apart, then they may require a repeater between them in order for the signals not to degrade too much.  And the drive strength of cells depends on how far apart the cells are, and which cells are connected to which.  An x1 cell shouldn't drive more than x4, whether it be 2 x2's, 1 x4, or 4 x1's.  But if there's a long wire between the driver and receivers, it can't drive the full x4 because it must also charge and discharge the long wire.  It gets complicated!

Moreover, the wires between the cells (the actual metal) must be designed.  This is called "routing."  This is almost always automated, although we always did some "pre-routing" - i.e. hand routing - of the most critical wires, forcing the less critical wires to work around them.  (Wires can't cross on the same layer, so sometimes wires had to move up and down between layers to get around obstructions like pre-routed wires.)   Synthesis, placement, and routing form an iterative process.  You do it, find out if you meet all the specifications for speed and other electrical properties, and adjust.  Hopefully you converge on a solution that meets your speed and power budget.   But it's important to note that there's nothing in this process that's specific to any particular choice of instruction set.

So:

  1. Logic, circuit and physical design techniques do not provide any particular instruction set choice with any notable advantage.

Step 4: Technology

This isn't really a "step," but it's another important factor in determining performance, power consumption and cost.  I lump the electronic package and the semiconductor fabrication process into this category.

The choices here include all sorts of things - process node (i.e. minimum drawn transistor sizes), metallization (alloy, width, thickness), dielectric choices, substrate (SOI? Bulk?), transistor design (3D gates? Number of pillars?), number of metal layers, etc.

These choices have a huge effect on performance, power and cost.

Now, Intel may have the best fab (there's a good argument for that; at least it seems the most reliable), but there's nothing in the choice of instruction set that inherently prevents the use of any of these choices.  An ARM chip produced on Intel's best fab will benefit just as much as an x86.

So:
  1. Fabs are instruction-set neutral. 

Then Explain This, Mister...

The obvious question, then, is why aren't ARM chips already competing with Intel in the "PC" market? Why have ARM chips always had lower performance than x86 chips?

Design techniques

Remember that 20% you lose by doing synthesis? Well, almost all ARM designs use the so-called "ASIC" (or, more recently, the related "SoC") flow, which involves a tremendous amount of software automation of the design process.   Part of this is the way ARM is licensed - many licensees receive just synthesizable Verilog or "hard blocks" that have already been synthesized.  There have been some notable exceptions (StrongARM at DEC, for sure, and presumably the Apple A7), but as a general rule ARM designs haven't been lovingly hand-crafted the way Intel, AMD, and the like design their processors.  This hasn't been much of an issue, though, since these designs were not intended to compete with high-end microprocessors anyway.  Of course, this problem is easily overcome...

Fabs

Generally speaking, most ARM processors are not produced on the best fab lines.  Apple's A7 is built on Samsung's 28-nm process.  Intel's state-of-the-art processors are fabbed using a 14-nm process with 3-D transistor gates.  GlobalFoundries' Fab 7 purportedly runs at 13-nm (and, I assume, may be using SOI wafers).  TSMC offers a 16-nm process with 3-D gates.  It's hard to compete (in performance, power, or cost) when you are using a technology node that's a generation and a half behind in terms of transistor size and with older transistor structures.  Again, this hasn't been too much of an issue, since these designs were not intended to compete with high-end microprocessors.  And, of course, this too would be easy enough for Apple to overcome.  (Arguably Apple couldn't achieve complete fabrication parity with Intel, but it could certainly come close enough.)

Goals

Until now, it hasn't been anyone's goal to compete on the desktop (or laptop) and produce an x86-class processor using ARM. Why would it have been? The graveyards of Silicon Valley are filled with the discarded remnants of past instruction set architectures.  SPARC never got much further than its Sun and Sun-clone (Fujitsu/HaL) roots.  MIPS managed to spread beyond Silicon Graphics to a few handheld devices and some car engines, but so what.  PowerPC made a decent run of it - powering IBM RS/6000 workstations, Macs for a few years, and last generation's game consoles - but its run is about over.  Even competing with Intel directly by adopting x86 hasn't worked out for anyone (ask Cyrix, Transmeta, Rise, National, NexGen, Exponential).  AMD made a good go of it for a brief time with Opteron, forcing Intel to clone it, but those glory days were short-lived.  For a long time, if it didn't run Windows, forget about it.  If it did run Windows, it had to be cheaper and faster than Intel to even get a nibble from the OEMs, and outrunning Intel while all else is equal is not sustainable in the long run.  The solution, of course, is to make sure that all else is not equal, which hasn't been possible until very recently.

Today is different because, among other things, the markets have changed - who cared about battery life 10 years ago? Who would have believed that laptops like the MacBook Air, marketed not for being the most powerful but for having all-day battery life, being thin and light, and omitting once-common features, would be driving the laptop market?  Who would have thought that Windows compatibility would no longer be a huge requirement?  (If it still is, it won't be for much longer. And it sure isn't an issue Apple cares about - who talks about Boot Camp anymore? And something like Parallels would still be able to run Windows regardless of the underlying ISA.)


Part I Summary

The point of all this was to explain why there's nothing in the chip designer's toolbox of tricks that provides x86 with a particular advantage.  The reason Intel performance is better than ARM performance is not because the choice of instruction set provides Intel with an inherent advantage, but rather because of history and market forces. 

In the next part I will explain why x86 inherently has a small but real disadvantage in optimizing cost, performance and power and why ARM would offer Apple some real advantages in the desktop and laptop market.
