Wednesday, September 17, 2014

Why Macs Will Get ARM'd, Part III

This is Part III of a continuing series of articles explaining why it is likely that Apple will port its Mac line of desktop and laptop computers from x86 to the ARM architecture, and why it would be beneficial, both to customers and to Apple, for it to do so.

Here are quick links to the prior parts.  I suggest reading them before reading this part.

Part I: Why Apple is Motivated
Part II: Why ARM is a Better Solution for Apple

In this Part III, I will discuss how this could work from a software point of view.  Part IV will discuss hardware options.



Instruction Sets

It'll be obvious to many readers, but just so we're all on the same page, let's start by describing the basic problem that arises when one switches from using an Intel x86 chip to using an ARM-based processor in Macs.

Microprocessors work by fetching a stream of "instructions," deciphering them, and executing them.  For example, the processor may understand an instruction which tells it to add two numbers together. The processor sees these instructions as binary strings of digits - strings of 1's and 0's.  Humans aren't very good at working with instructions in this format, so those who need to worry about the processor instructions - compiler writers, operating system authors, people writing special device drivers - work in something called "assembly language."

Assembly language tracks very closely with the instructions that the processor understands.  It's very easy to translate from one to the other.

For example, the assembly language to add the value 23 to a particular register would look something like:

add eax, 23

This adds 23 to the contents of the register eax and stores the result in that same register. This is the "human readable" form.  The version that the processor actually sees would look something like:

00000101 00010111 00000000 00000000 00000000

The "00000101" at left tells the processor that this is an "add" instruction that takes eax as an input and stores the result in eax as well.  The remaining 0's and 1's encode the number 23 as a 32-bit value, lowest byte first.  (Experts: cut me some slack if I messed this up a bit. It's been a long time since I hand-compiled assembly code :-)

In a compiled executable - in other words, in software residing on your hard disk - there will be this series of 0's and 1's.

How would this work on ARM?

First, although the name of the instruction is also "add" (a coincidence stemming from the fact that "add" is a nice short word that describes the operation, so many instruction sets use it), ARM instructions behave differently. With ARM you generally must specify the destination explicitly - there is no implicitly assumed output location.  (This adds flexibility - the inputs and the output can be arbitrary registers.  It also (generally) improves performance: when several instructions want to read the same register as a source but write their results to different places, they can do so directly, without first copying the register's contents elsewhere, which makes it easier for the processor to run those instructions concurrently.)

Anyway, the ARM assembly language would look like:

add r0, r1, #23

This adds 23 to r1 and puts the result in r0.

ARM binary encoding is slightly complicated, but the critical thing to note is that the strings of 1's and 0's will look completely different than the x86 string of 1's and 0's.

The result is that if one takes software designed to run on current Macs and tries to run it on a Mac using an ARM processor, it will not work even a little bit.  Complete fail.
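
To make that concrete, here is a one-line C function together with roughly what an optimizing compiler emits for each architecture (shown as comments; exact output varies by compiler and settings, so treat it as a sketch rather than gospel):

int add23(int x) {
    return x + 23;
    /* x86-64 (roughly):          64-bit ARM (roughly):
     *   leal 23(%rdi), %eax        add w0, w0, #23
     *   retq                       ret
     * Same source, same behavior - but the encoded instruction bytes
     * have nothing in common, so a processor of one kind cannot
     * execute the other's output. */
}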

Nature of Modern Software

Consider modern day software of the sort that runs on your Mac or iOS device.  Here's some source code for one of my apps:

-(void)showActionSheet:(id)sender {
    UIActionSheet *actionSheet=[[UIActionSheet alloc] initWithTitle:nil delegate:self cancelButtonTitle:@"Cancel" destructiveButtonTitle:nil otherButtonTitles:@"Share Table",@"Share Graph", @"Export .csv",@"Add Steps to Healthkit",@"Export SQL",nil];
    [actionSheet showInView:self.view];     
}


(This is temporary code. In production code the stuff in quotes would be localized for different languages, and some of those entries wouldn't exist. So quit picking on me.)  This obviously corresponds to a bunch of instructions -  a bunch of strings of 0's and 1's - but where are they? That is, how many of the 0's and 1's are located in the app on disk, and where are the rest located?

This code is used to bring up an "action sheet" - a user interface element - with specific buttons, as shown to the right.

This code gets "compiled" by Apple's Xcode into binary instructions - the 0's and 1's we referred to earlier.  These 0's and 1's are designed to be executed by an ARM processor in this case, because this is an iPhone app, but Mac apps are compiled in the same way, and their source code looks very similar.  In the case of Macs, however, Xcode will generate binary instructions targeted to execute on an x86 processor, so the 0's and 1's are not compatible with ARM.

But let's look more closely at this source code, which is, for the purposes of this discussion, typical of the source code you'd find with most Mac apps.

Breaking it down one line at a time:

-(void)showActionSheet:(id)sender {

This declares a new method (sometimes called a function, or subroutine, or some other name depending on the language the source code is written in).  It defines a reusable routine, named "showActionSheet."  This routine can be "called" by other source code and thus reused over and over without having to be copied and pasted in multiple places.  The routine takes a single argument - "sender" - which is the id of an object (what that means isn't particularly important for the purpose of this discussion), and it does not produce any result (that's what the "void" means).  Instead of producing a result (like the answer to a mathematical problem, for example), it draws buttons.

There is not a lot of binary code generated from this statement, as it is mostly to inform the compiler of how the programmer wants to organize the code.  Let's look at the next statement:

    UIActionSheet *actionSheet=[[UIActionSheet alloc] initWithTitle:nil delegate:self cancelButtonTitle:@"Cancel" destructiveButtonTitle:nil otherButtonTitles:@"Share Table",@"Share Graph", @"Export .csv",@"Add Steps to Healthkit",@"Export SQL",nil];

This looks more complicated than it is.  This creates a new "UIActionSheet" - that thing with the buttons at the bottom of the screenshot.  UIActionSheet is a type of UI object that Apple provides in the iOS SDK.  The Mac has similar sorts of UI objects made available in the OS X SDK.  When Apple provides these sorts of objects, it has already compiled the code, so the binary 0's and 1's already exist on the device.  Sometimes Apple has provided multiple versions of the binary code.  For example, when Apple simultaneously supported PowerPC and x86, it had to compile its SDKs (and, indeed, the entire operating system) for both types of processors.  Apple does something similar on x86 with respect to 32-bit vs. 64-bit code.  A compiled app would "link" against the proper version based on which type of computer it was running on.

The code statement above sends an "alloc" message to the UIActionSheet class, which creates a new UIActionSheet object.  The statement then sends an "initWithTitle" message to this newly created UIActionSheet object, telling it a delegate, the title of the cancel button, and various other button titles.  When this statement is compiled, the compiler only has to create binary code corresponding to the sending of the message.  The handling of the message, which involves filling a data structure with all of this information, is handled by code that Apple already provides.  In other words, if I am running on a PowerPC, this code would already be PowerPC code, or if I am running on an ARM processor, this code would already be ARM code.

This is important because, as I will discuss later, if I compiled it for x86 and then tried to run it on a hypothetical ARM-based Mac, only the code for sending the message would be the wrong type. If we could somehow handle that code - say by converting it on-the-fly - then the rest of the code would take care of itself.

    [actionSheet showInView:self.view];     

This causes the actionSheet to actually pop into view.  Again, all this code does is send a message to the actionSheet telling it to show itself.  It doesn't handle any of the actual drawing, which is where most of the work is. It also doesn't handle dealing with button presses, making the sheet go away if the user presses Cancel, etc.  In other words, the vast majority of the work done by this code is not done by what I wrote, but is done by what Apple wrote.   And on a hypothetical ARM-based Mac, that means that the vast majority of the work to be done by software is done by code that is already using the right instruction set.  Nothing special needs to be done for that code.
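
To see where the boundary lies, here is a rough C-level sketch of what that one line compiles down to.  This is illustrative rather than literal compiler output, but the shape is right: the app's binary contains little more than a call into the Objective-C runtime, and everything showInView: actually does lives in code Apple ships already compiled.

#include <objc/message.h>   /* objc_msgSend */
#include <objc/runtime.h>   /* sel_registerName, id, SEL */

/* Roughly what "[actionSheet showInView:self.view]" becomes: one call
 * through the runtime's message dispatcher.  Only this call site is in
 * my app's 0's and 1's; the layout, drawing, and animation behind it
 * live in Apple's frameworks, already built for the machine at hand. */
void show_sheet(id actionSheet, id view) {
    SEL sel = sel_registerName("showInView:");
    ((void (*)(id, SEL, id))objc_msgSend)(actionSheet, sel, view);
}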

I've seen studies that say that on modern operating systems, processors spend around 85% of their time running code that's provided by the SDKs.  This is important, because it means we only have to come up with a solution for the other 15%.

Transitions

In a hypothetical transition to ARM, all Mac models won't switch at once.  Some (like the Mac Pros) may never switch before they are replaced by something new and completely different or are simply abandoned.  The switch-over process may take years.  Moreover, for a long time, software developers will want to provide the best possible solution for their customers regardless of whether those customers are using ARM-based or x86-based Macs.

So how to deal with this?

Well, Apple's had some experience with this.

Prior to 1994, all Macs used Motorola 68k microprocessors.  The Motorola 68k family was the leading competitor to Intel up until that point, and was used in many successful computers including the Mac, later TRS-80 models, Sun's Sun-1, DEC's VAXstation 100, and Silicon Graphics' IRIS workstations.  It was also used in the Apple Lisa, Commodore Amiga, and Atari ST.  It was popular in arcades and game consoles such as the Sega Genesis (and as the sound processor in the Sega Saturn), and it showed up in other devices like HP's original LaserJet and Apple's LaserWriter printers.  Like the Intel x86, the Motorola 68k series is "CISC", not RISC.

Starting in 1994, Apple began the transition to the more modern, and higher performing, PowerPC processors.  At that time Apple introduced a feature called "fat binaries": developers could compile their code so that both the PowerPC binary (0's and 1's intended for PowerPC) and the Motorola 68k binary (0's and 1's intended for the 68k) were packaged into the same "file."  While this meant that the apps took more space on disk, it also meant that no matter which variety of Mac you ran an app on, it always had "native" binary code to execute.  This code executed at full speed and with full functionality regardless of whether the Mac was a PowerPC Mac or a Motorola 68k Mac.

Meanwhile, in a parallel but soon-to-converge universe, the fathers and mothers of OS X were busy working on NeXTStep, the operating system that ran on the NeXT computers.   Starting in NeXTStep 3.1, a feature called "multi-architecture binaries" was made available.  These were similar in principle to Apple's "fat binaries."  They combined binary code intended for the Motorola 68k with binary code intended for the Intel x86 (32-bit) architecture.  This combination stores each version as a "sub-file" in a single "Mach-O" file.  (Sort of like storing multiple files in a .zip file.)

In 1996 NeXT was acquired by Apple and NeXTStep became the basis of what we now know as OS X.  And Mach-O files live on!   Apple used the same technology, renamed "Universal Binaries," when it underwent the transition from PowerPC to Intel x86 starting in 2005.  The same technology has been used by Apple to support its transition from 32-bit to 64-bit Intel x86 chips.

In fact, if you're not afraid of launching the terminal app, you can check various binaries to see if they  are fat binaries or not, and, if they are, to see which architectures are supported.  For example, on my Yosemite retina MacBook Pro:

% lipo -info /bin/tcsh 
Non-fat file: /bin/tcsh is architecture: x86_64
% lipo -info /bin/sync
Architectures in the fat file: /bin/sync are: x86_64 i386

This tells us that "tcsh" is an app that only has native binary code for the 64-bit version of x86 (the version invented by AMD).  On the other hand, the "sync" binary contains code both for 32-bit and 64-bit x86 processors.

This also works with "apps" that aren't simple Unix-based utilities. For example:

% lipo -info Growl.app/Contents/MacOS/*
Non-fat file: Growl.app/Contents/MacOS/Growl is architecture: x86_64
% lipo -info Kindle.app/Contents/MacOS/*
Non-fat file: Kindle.app/Contents/MacOS/Kindle is architecture: i386

On the Mac, apps on your disk can be thought of as being folders or directories, rather than files.  In fact, they're something called "packages," which behave much like folders.  If you control-click or right-click an application in your Applications folder, and select "Show Package Contents," you can see what is inside these "folders":


It would be trivial in an upcoming ARM migration for Apple to either include both ARM and x86 instructions in the single file under Contents/MacOS/[application name] or to simply add another file into this directory structure ([application name-ARM] or Contents/MacOS/ARMv8/[application name] or the like).  When you open the application, the operating system figures out which file to load and which portion of the file is relevant to that particular  machine.  There would be no performance penalty for this, and only a minor disk space penalty (that, as in past transitions, is easily solved by using utilities to remove unnecessary portions of applications).
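
For the curious, here is a small C sketch (my own illustration, not Apple's loader) that reads the fat header defined in <mach-o/fat.h> and reports which slices a file contains.  The on-disk header fields are big-endian, which is why each one is byte-swapped before use:

#include <stdio.h>
#include <stdint.h>
#include <mach-o/fat.h>           /* struct fat_header, struct fat_arch, FAT_MAGIC */
#include <libkern/OSByteOrder.h>  /* OSSwapBigToHostInt32 */

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <binary>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    struct fat_header fh;
    if (fread(&fh, sizeof fh, 1, f) != 1) { fclose(f); return 1; }

    if (OSSwapBigToHostInt32(fh.magic) != FAT_MAGIC) {
        /* Not a fat file - it's a thin, single-architecture binary. */
        printf("%s: thin binary\n", argv[1]);
    } else {
        uint32_t n = OSSwapBigToHostInt32(fh.nfat_arch);
        printf("%s: fat binary with %u slices\n", argv[1], n);
        for (uint32_t i = 0; i < n; i++) {
            struct fat_arch fa;
            if (fread(&fa, sizeof fa, 1, f) != 1) break;
            /* Each slice records which CPU it is for, plus where its
               bytes live inside the fat file. */
            printf("  cputype 0x%x: %u bytes at offset %u\n",
                   OSSwapBigToHostInt32((uint32_t)fa.cputype),
                   OSSwapBigToHostInt32(fa.size),
                   OSSwapBigToHostInt32(fa.offset));
        }
    }
    fclose(f);
    return 0;
}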

(Lack of) Complications

In order for fat binaries to be a viable solution, ideally a programmer only has to write source code once and then have it automatically compiled into both types of binary code.  Xcode is designed to do this.  But sometimes there can be differences between different types of processors that cause difficulties for the programmer for certain types of code.  The two primary troublemakers are the number of bits in a word, and the "endianness" of the processor.  Luckily neither of these should cause any problems in a transition to ARM.

Bits in a word

When we refer to a "32-bit" processor or a "64-bit" processor we are usually referring to how many bits are included in each "word."  (Note: some people use the term "word" to refer to 32-bits and use other terms, like "long word" to mean 64-bits.  That's not how I was raised.  "Word" is the native computational unit of the processor.)  So, for example, a 32-bit processor is designed so that it can handle basic computations (like add and subtract) on quantities that are 32 binary digits long.  A 64-bit processor is designed to handle basic computations on quantities that are 64 binary digits long.

Programmers can get into trouble when, in their software, they think they are working with 32-bit quantities but the processor treats them as 64-bit quantities (and vice versa).  This can cause malfunctions, wasted memory space, or reduced performance.   The good news in the coming ARM transition is that any ARM processor used in Macs will be 64-bit (after all, even the A7 and A8 used in iPhones and iPads are 64-bit), just like the x86-64 chips currently used in Macs.  This will reduce any effort needed by developers to support both types of chips.
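
The usual way to stay out of that trouble is to say exactly what you mean: use explicitly sized types when the number of bits matters, and pointer-sized types when you are holding a pointer.  A minimal C illustration (not from any particular app):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* "long" is 32 bits on some platforms and 64 bits on others, so code
       that silently assumes one or the other can break when the word
       size changes.  Fixed-width types don't have that problem. */
    int32_t  exactly32 = 0;                         /* always 32 bits */
    int64_t  exactly64 = 0;                         /* always 64 bits */
    intptr_t pointer_sized = (intptr_t)&exactly32;  /* always big enough for a pointer */

    printf("int32_t: %zu bytes, int64_t: %zu bytes, intptr_t: %zu bytes\n",
           sizeof exactly32, sizeof exactly64, sizeof pointer_sized);
    return 0;
}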


Endianness

Processors are characterized as being either "big endian" or "little endian."  "Endian" refers to how bytes are stored in memory.  The following figure illustrates the difference.


Here we start with the text string "BIRD."  It consists of four 8-bit bytes.  "B" is the first byte, "I" the second, etc.  Assume we have a 32-bit processor word.  Such a word can hold four 8-bit bytes.  So a 32-bit word can hold the entire string.  But the question is where, within the word, each of the four bytes is located.  In a big endian system, the first byte is located at the lowest memory address (you start putting bytes into the word starting at the "big end" - the left end - of the original string).   In a little endian system, this is reversed.  The "little end" (right end) of the string is stored at the lowest memory address.

Intel chips are "little endian."

If someone writes code that pokes into a word and tries to manipulate bytes directly (uncommon, but certainly quite possible for certain types of code), things can go haywire if the processor endianness changes.  The good news is that ARM chips can operate in either endianness and are little endian by default.  This means that they can be operated in the same little endian configuration as Intel chips, avoiding this problem.
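
If you want to see this on your own machine, here is a tiny C program that stores "BIRD" as a 32-bit number and then looks at the individual bytes.  On any little-endian processor (x86, or ARM in its default configuration) it prints the letters back-to-front; on a big-endian machine it would print them in order:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* "BIRD" as a single 32-bit value: 'B'=0x42, 'I'=0x49, 'R'=0x52, 'D'=0x44. */
    uint32_t word = 0x42495244;
    unsigned char bytes[4];
    memcpy(bytes, &word, sizeof word);   /* copy out the bytes as stored in memory */

    /* Little endian prints: D R I B     Big endian prints: B I R D */
    printf("%c %c %c %c\n", bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}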

The result of Intel and ARM being operated with the same size words and the same endianness is that, at least for code that is written for the CPU (we're not talking about GPU's yet), the transition should be easy for developers, and almost always will require no coding work.

Translation

So far I've explained why it would be easy for developers, with minor efforts by Apple leveraging mechanisms that have long existed, to simultaneously support both Intel and ARM architectures during the probably long transition period between the architectures.   I've also explained why old code, intended only for Intel x86 and not containing a "fat binary" for use on ARM, presents a problem only for something like 15% of the instructions the processor actually spends its time executing.

The next question is what is to be done about that 15%.  The most likely solution is "translation."

Binary translation refers to the process of taking the binary code intended for one microprocessor and converting it into the binary language spoken by a different microprocessor.

There are two ways this can be done - either statically or dynamically.  "Statically" means the translation is done only once and produces a new file on the disk.  Future attempts to run the code will use this new, translated code.  The advantage of static translation is that the overhead of the translation process is borne only once, when the translation is done.  After that, you just are running native code forevermore.  The downside is that static translation is hard.  It may also be copyright infringement.  But more importantly it is hard.  There may be things that the translator just can't know.  For example, Intel x86 code can modify itself in memory.  The next instruction to be executed may have its address calculated on the fly based on data that isn't present unless the software is actually running.  It's a mess.  Let's just leave static translation out of it then.
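
(For the curious, here is a toy C example of the kind of indirect control flow that gives static translators headaches.  The call target is computed from run-time data, so it cannot be resolved just by scanning the file - and real x86 programs do far messier things than this, including generating new code in memory as they run.)

#include <stdio.h>

static void handler_a(void) { puts("handler A"); }
static void handler_b(void) { puts("handler B"); }

int main(int argc, char **argv) {
    (void)argv;
    /* The call target depends on argc, which is only known when the
       program runs - a static translator can't resolve it in advance. */
    void (*handlers[2])(void) = { handler_a, handler_b };
    handlers[argc % 2]();
    return 0;
}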

The more likely approach is dynamic translation.  This refers to the process of translating the code while running it.  This is usually accomplished by translating one block of code at a time, as the code runs.  Code that may be present in the software but which is never run (like, for example, that Microsoft Word mail merge feature you never run) never has to be translated.  But if you were to try to use that feature at some point in the future, it would be translated at that time.   To reduce the penalty associated with having to translate code as it runs, translated blocks of code are cached and reused.  If the same code is executed over and over (as is very common in most real world code), it does not have to be translated repeatedly.  Some dynamic translators actually optimize the translations over time and write translated portions of code to disk so they don't need to be re-translated in the future.
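
In pseudocode-ish C, the heart of a dynamic translator looks something like the sketch below.  The helper functions are hypothetical stand-ins for the translator's own machinery, not any real API; the point is just the lookup-translate-cache-run loop:

#include <stdint.h>
#include <stddef.h>

/* One translated block: host (e.g. ARM) code generated from a run of
   guest (e.g. x86) instructions starting at guest_pc. */
typedef struct {
    uint64_t guest_pc;
    void   (*host_code)(void);
} TranslatedBlock;

/* Hypothetical helpers - stand-ins for the translator's internals. */
TranslatedBlock *cache_lookup(uint64_t guest_pc);
TranslatedBlock *translate_block(uint64_t guest_pc);
void             cache_insert(TranslatedBlock *block);
uint64_t         run_block(TranslatedBlock *block);   /* returns next guest PC */

void emulate(uint64_t entry_point) {
    uint64_t pc = entry_point;
    for (;;) {
        TranslatedBlock *block = cache_lookup(pc);
        if (block == NULL) {
            /* First time through this code: translate it now, and keep
               the result so repeated execution pays the cost only once. */
            block = translate_block(pc);
            cache_insert(block);
        }
        pc = run_block(block);   /* execute, then continue wherever the guest went */
    }
}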

There is a rich history of dynamic translation in the computer industry.  The first such translator I was aware of was called FX!32 and was used by DEC to dynamically translate Intel x86 code to work on the Alpha architecture (Alpha was DEC's very nice RISC architecture).

Sun also had similar software for translating x86 to SPARC (another RISC architecture).

And Apple has lots of experience in this area as well.

When the Mac underwent the Motorola 68K to PowerPC transition, Apple provided a dynamic translating emulator.  The OS X "Classic" runtime environment continued to support this sort of emulation as late as OS X 10.4.

When Apple switched again, this time to Intel x86, it provided Rosetta, a dynamic translator developed for Apple by Transitive Corporation.  It was based on Transitive's QuickTransit technology, which also supported many other types of translation (e.g. x86 to PowerPC, MIPS to Intel Itanium, etc.)  Rosetta was supported by Apple through OS X 10.6 and dropped in 10.7.  Rosetta provided less compatibility than Apple's earlier effort in the Motorola 68K→PowerPC transition, mostly because it ran at a less privileged layer of the operating system.  In a future ARM transition, Apple could choose to let a new translator run with more privileges if it felt that maximizing compatibility was important.  But even Rosetta provided good compatibility for the vast majority of PowerPC software running on x86.

But what about performance?

Initially, when Rosetta first became an option, its performance was "not good" but varied widely from scenario to scenario.  Over time, Rosetta became better.  More importantly, processors became faster and processor caches became bigger.   I first started using Macs as my daily home machine in 2007.  At that time there was still a lot of PowerPC software floating around, including a version of Microsoft Office that was still heavily used because the initial x86 Office version had some showstopper bugs and compatibility issues.  I found that all of the PowerPC software I used was quite usable on my 2007 MacBook Pro.  I dare say that most PowerPC software ran faster on my 2007 MacBook Pro under Rosetta than it would have run on a 2005 PowerPC Mac.

Highly imprecise capture of multi-core benchmark scores on mid-range Macs through the years.  Always rising, but sometimes more than others.


In the ARM transition as I imagine it, I believe the first ARM Mac will be a new MacBook Air-like model.   Thinner, lighter, with better battery life.  Performance will likely be approximately equal to that of bulkier Intel-compatible options at the time, with ARM winning some benchmarks and Intel winning others (particularly things like 3D graphics).  By generation two of the ARM era, I believe you will see the full benefits of the switch, with ARM Macs achieving 120% performance on integer code with maybe 80% of the power consumption.  But in any event, due to the year-after-year improvement in CPU performance, it is highly likely that the portions of your old software that require dynamic translation will run at least as fast as that software ran on average machines at the time the software was first published.  After all, we are all still using Microsoft Word 2011.  If Apple released an ARM MacBook today, it would quite likely have no trouble running Word 2011 in translation mode at least as fast as an average 2011-vintage x86 MacBook.

In other words, if you look at the graph above, performance in translation is "good enough" if it's somewhere in the range of various machines available in the last couple of years.  Right now most Mac users are running mid-range MacBooks, iMacs, or MacBook Airs that are probably at least 18 months old.  Some machines, like Mac Minis, skew much older.  An iMac running Word as well as last year's MacBook Air is probably "good enough."  Those who need more performance will have other options (until the entire line is switched to ARM), and eventually software will be recompiled and optimized for ARM (major software within a couple of years).  This conclusion will undoubtedly upset those who need some very specific piece of software to run as fast as it can and want to have the freedom to buy any Mac in the lineup, but Apple has time and again shown that it will do what's best for the entire customer base, not for the small minority with very peculiar needs.

Virtualization

There is more than one way to run software intended for one type of microprocessor on a computer that uses a different type of microprocessor.  There are various virtualization tools that enable one to do so.  For example, QEMU and ExaGear are tools that allow execution of x86 code on other platforms, including ARM.  The line between dynamic translation and virtualization is blurry, but it is probably easiest to think of virtualization as a special form of dynamic translation that improves compatibility by focusing on simulating the behavior of the processor itself.  That is, each of the major CPU structures is simulated by a different software routine, so that the behavior of the processor can be more closely modeled.  Sometimes virtualization also produces faster performance, because the tool has more information about the overall context of software routines and can analyze larger blocks of code at once, or keep a longer history about the behavior of software routines.  In the ARM-based Mac future, there is little doubt that Parallels and VMware will offer tools for virtualizing x86-based Mac software and allowing it to run on these new Macs.

Remember that Apple controls its own compilers (indeed, it controls its own programming languages) and if it goes ARM it will doubtless design its own chip, meaning it could also deviate from the pure ARM instruction set and add its own instructions specifically designed to improve the speed of virtualization or translation.  There have been other chip manufacturers who have added such instructions, so there's a precedent.

GPU

The GPU is perhaps the biggest question in the ARM-based future.  Apple has consistently used PowerVR-based GPUs in its A-series chips.  PowerVR is a product of Imagination Technologies, which licenses the technology in much the same way that ARM licenses its core designs and instruction set.

There are a few possibilities here, and I'll hold off on the hardware options and the direction I think Apple will take until Part IV of this series of articles.  But software-wise, I think Apple is in good shape.  First, most Mac software does not directly access the underlying graphics chip.  So the vast majority of software uses the GPU through SDK abstractions that would not suffer inherently from a switch in architecture (assuming the graphics architectures run at comparable speed, which is a reasonably even bet if you stick to comparing integrated GPUs to integrated GPUs).

OpenCL and OpenGL (and QuartzCore, and CoreGraphics...) will abstract away any differences for most software, which will continue to run fine.  And Apple's new "Metal" GPU programming kit will provide lower-level access to those who need it, while still abstracting enough to smooth over differences between chips in various Apple products.  In fact, Metal is, to me, a sort of giveaway that this transition is coming.


What's Lost

The most common objection I hear to the ARM transition is that Bootcamp won't work. Yep. Sorry.  And as angry as it makes you, and as vehemently as you insist this means you will switch to Windows or Linux or to a TI-99/4A, Apple doesn't care.  When Bootcamp arrived it was simultaneously an equal expression of confidence and desperation.  Why will no one try out Macs? If only they would try them, they would see how awesome they are and buy them!

Well, Apple doesn't need this anymore.  More and more, even in enterprises, software is platform agnostic.  People are allowed to bring their own machines to work, or choose what type of machine they'd like.  Word documents can be edited on any platform or the web.  Same with Excel spreadsheets.  And the number of people who buy Macs specifically to use them as Windows PCs was always small, and is now very small. (Yeah, I know you, the reader, fall into that category. You're all alone, trust me).

And even if there were hordes clamoring for it, do you really think that would stop Apple from killing Bootcamp if it suited their wider interest?  Apple is ruthless as far as killing things that it thinks don't belong in its future.

Existing apps will run slower than they could run, but not any slower than they run on the existing hardware that these machines would replace (mostly).  Updated versions will quickly solve any performance problems for the vast majority of these apps.

Some apps may run horribly for quite a while.  People will cope. They will keep buying other Mac models until the transition is complete, at which point the software that still isn't updated will mostly be limited to Quicken.


What's Next

In Part IV I will examine what the hardware might look like going forward, specifically with respect to the CPU and GPU.



5 comments:

  1. What about all current existing software that was created using Xcode? Wouldn't this just be a one-click recompile and you're done? Also, could the fat binary or Mach-O file not be handled in the vast majority of cases by having the software distributed through the Mac App Store? The store would download and install only the appropriate binary, and the process would be seamless for the user, whether on an ARM Mac or an x86 Mac?

  2. Yes to both. I forgot to mention the Mac App Store, which is a good mechanism to avoid the need to distribute all binaries. However, for code signing reasons Apple might choose to sign the entire package rather than split up the binaries. Either way is fine - storage is cheap nowadays.

  3. Thanks for the reply! Agree with your thoughts there. When is Part IV coming out??

  4. Actively working on it - taking longer than I expected. May have to split it into two, one for the GPU, and one for the CPU and other logic.

  5. Hi Cliff, do you have any thoughts on AMD Zen?
