
SMP Primer for Android


Android 3.0 and later platform versions are optimized to support multiprocessor architectures. This document introduces issues that can arise when writing code for symmetric multiprocessor systems in C, C++, and the Java programming language (hereafter referred to simply as “Java” for the sake of brevity). It's intended as a primer for Android app developers, not as a complete discussion on the subject. The focus is on the ARM CPU architecture.

If you’re in a hurry, you can skip the Theory section and go directly to Practice for best practices, but this is not recommended.

Introduction


SMP is an acronym for “Symmetric Multi-Processor”. It describes a design in which two or more identical CPU cores share access to main memory. Until a few years ago, all Android devices were UP (Uni-Processor).

Most, if not all, Android devices do have multiple CPUs, but generally one of them is used to run applications while the others manage various bits of device hardware (for example, the radio). The CPUs may have different architectures, and the programs running on them can’t use main memory to communicate with each other.

Most Android devices sold today are built around SMP designs, making things a bit more complicated for software developers. The sorts of race conditions you might encounter in a multi-threaded program are much worse on SMP when two or more of your threads are running simultaneously on different cores. What’s more, SMP on ARM is more challenging to work with than SMP on x86. Code that has been thoroughly tested on x86 may break badly on ARM.

The rest of this document will explain why, and tell you what you need to do to ensure that your code behaves correctly.

Theory


This is a high-speed, glossy overview of a complex subject. Some areas will be incomplete, but none of it should be misleading or wrong.

See Further reading at the end of the document for pointers to more thorough treatments of the subject.

Memory consistency models

Memory consistency models, or often just “memory models”, describe the guarantees the hardware architecture makes about memory accesses. For example, if you write a value to address A, and then write a value to address B, the model might guarantee that every CPU core sees those writes happen in that order.

The model most programmers are accustomed to is sequential consistency, which is described like this (Adve & Gharachorloo):

- All memory operations appear to execute one at a time.
- All operations on a single processor appear to execute in the order described by that processor's program.

If you look at a bit of code and see that it does some reads and writes from memory, on a sequentially-consistent CPU architecture you know that the code will do those reads and writes in the expected order. It’s possible that the CPU is actually reordering instructions and delaying reads and writes, but there is no way for code running on the device to tell that the CPU is doing anything other than execute instructions in a straightforward manner. (We’re ignoring memory-mapped device driver I/O for the moment.)

To illustrate these points it’s useful to consider small snippets of code, commonly referred to as litmus tests. These are assumed to execute in program order, that is, the order in which the instructions appear here is the order in which the CPU will execute them. We don’t want to consider instruction reordering performed by compilers just yet.

Here’s a simple example, with code running on two threads:

Thread 1                Thread 2
A = 3                   reg0 = B
B = 5                   reg1 = A

In this and all future litmus examples, memory locations are represented by capital letters (A, B, C) and CPU registers start with “reg”. All memory is initially zero. Instructions are executed from top to bottom. Here, thread 1 stores the value 3 at location A, and then the value 5 at location B. Thread 2 loads the value from location B into reg0, and then loads the value from location A into reg1. (Note that we’re writing in one order and reading in another.)

Thread 1 and thread 2 are assumed to execute on different CPU cores. You should always make this assumption when thinking about multi-threaded code.

Sequential consistency guarantees that, after both threads have finished executing, the registers will be in one of the following states:

Registers           States
reg0=5, reg1=3      possible (thread 1 ran first)
reg0=0, reg1=0      possible (thread 2 ran first)
reg0=0, reg1=3      possible (concurrent execution)
reg0=5, reg1=0      never

To get into a situation where we see B=5 before we see the store to A, either the reads or the writes would have to happen out of order. On a sequentially-consistent machine, that can’t happen.

Most uni-processors, including x86 and ARM, are sequentially consistent. Most SMP systems, including x86 and ARM, are not.
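To make this concrete, here is a minimal C++11 sketch of the same litmus test (an illustration added here, not part of the original text; the names are invented). With the default sequentially consistent ordering the “never” outcome is excluded; if every access used std::memory_order_relaxed instead, that outcome would be permitted on a weakly-ordered machine.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Shared memory locations, both initially zero, mirroring the litmus test above.
    std::atomic<int> A{0}, B{0};

    void thread1() {
        A.store(3);   // default ordering is std::memory_order_seq_cst
        B.store(5);
    }

    void thread2() {
        int reg0 = B.load();
        int reg1 = A.load();
        // With seq_cst ordering, reg0==5 && reg1==0 can never be observed.
        std::printf("reg0=%d reg1=%d\n", reg0, reg1);
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join();
        t2.join();
        return 0;
    }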

Processor consistency

x86 SMP provides processor consistency, which is slightly weaker than sequential consistency. While the architecture guarantees that loads are not reordered with respect to other loads, and stores are not reordered with respect to other stores, it does not guarantee that a store followed by a load will be observed in the expected order.

Consider the following example, which is a piece of Dekker’s Algorithm for mutual exclusion:

Thread 1                  Thread 2
A = true                  B = true
reg1 = B                  reg2 = A
if (reg1 == false)        if (reg2 == false)
    critical-stuff            critical-stuff

The idea is that thread 1 uses A to indicate that it’s busy, and thread 2 uses B. Thread 1 sets A and then checks to see if B is set; if not, it can safely assume that it has exclusive access to the critical section. Thread 2 does something similar. (If a thread discovers that both A and B are set, a turn-taking algorithm is used to ensure fairness.)

On a sequentially-consistent machine, this works correctly. On x86 and ARM SMP, the store to A and the load from B in thread 1 can be “observed” in a different order by thread 2. If that happened, we could actually appear to execute this sequence (where blank entries have been left to highlight the apparent order of operations, reading both columns top to bottom):

Thread 1                  Thread 2
reg1 = B
                          B = true
                          reg2 = A
A = true
if (reg1 == false)        if (reg2 == false)
    critical-stuff            critical-stuff

This results in both reg1 and reg2 set to “false”, allowing the threads to execute code in the critical section simultaneously. To understand how this can happen, it’s useful to know a little about CPU caches.
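As an aside (an illustration added here, not from the original text), the fragment above corresponds roughly to the following C++ sketch with invented names. Relaxed atomics permit exactly the reordering just described; a corrected version using a full store/load barrier appears later, after the store/load discussion.

    #include <atomic>

    // Illustrative flags, both initially false ("all memory is initially zero").
    std::atomic<bool> flagA{false}, flagB{false};

    void thread1() {
        flagA.store(true, std::memory_order_relaxed);       // A = true
        bool reg1 = flagB.load(std::memory_order_relaxed);  // reg1 = B
        if (!reg1) {
            // critical-stuff: with relaxed ordering (or plain accesses on ARM SMP),
            // both threads can reach their critical sections at the same time,
            // because each thread's store may be observed after its later load.
        }
    }

    void thread2() {
        flagB.store(true, std::memory_order_relaxed);       // B = true
        bool reg2 = flagA.load(std::memory_order_relaxed);  // reg2 = A
        if (!reg2) {
            // critical-stuff
        }
    }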

CPU cache behavior

This is a substantial topic in and of itself. An extremely brief overview follows. (The motivation for this material is to provide some basis for understanding why SMP systems behave as they do.)

Modern CPUs have one or more caches between the processor and main memory. These are labeled L1, L2, and so on, with the higher numbers being successively “farther” from the CPU. Cache memory adds size and cost to the hardware, and increases power consumption, so the ARM CPUs used in Android devices typically have small L1 caches and little or no L2/L3.

Loading or storing a value into the L1 cache is very fast. Doing the same to main memory can be 10-100x slower. The CPU will therefore try to operate out of the cache as much as possible. The write policy of a cache determines when data written to it is forwarded to main memory. A write-through cache will initiate a write to memory immediately, while a write-back cache will wait until it runs out of space and has to evict some entries. In either case, the CPU will continue executing instructions past the one that did the store, possibly executing dozens of them before the write is visible in main memory. (While the write-through cache has a policy of immediately forwarding the data to main memory, it only initiates the write. It does not have to wait for it to finish.)

The cache behavior becomes relevant to this discussion when each CPU core has its own private cache. In a simple model, the caches have no way to interact with each other directly. The values held by core #1’s cache are not shared with or visible to core #2’s cache except as loads or stores from main memory. The long latencies on memory accesses would make inter-thread interactions sluggish, so it’s useful to define a way for the caches to share data. This sharing is called cache coherency, and the coherency rules are defined by the CPU architecture’s cache consistency model.

With that in mind, let’s return to the Dekker example. When core 1 executes “A = true”, the value gets stored in core 1’s cache. When core 2 executes “reg2 = A”, it might read from main memory or it might read from core 2’s cache; either way it won’t see the store performed by core 1. (A could be in core 2’s cache because of a previous load from A.)

For the memory consistency model to be sequentially consistent, core 1 would have to wait for all other cores to be aware of “A = true” before it could execute “reg1 = B” (either through strict cache coherency rules, or by disabling the caches entirely so everything operates out of main memory). This would impose a performance penalty on every store operation. Relaxing the rules for the ordering of stores followed by loads improves performance but imposes a burden on software developers.

The other guarantees made by the processor consistency model are less expensive to make. For example, to ensure that memory writes are not observed out of order, it just needs to ensure that the stores are published to other cores in the same order that they were issued. It doesn’t need to wait for store #1 to finish being published before it can start on store #2; it just needs to ensure that it doesn’t finish publishing #2 before it finishes publishing #1. This avoids a performance bubble.

Relaxing the guarantees even further can provide additional opportunities for CPU optimization, but creates more opportunities for code to behave in ways the programmer didn’t expect.

One additional note: CPU caches don’t operate on individual bytes. Data is read or written as cache lines; for many ARM CPUs these are 32 bytes. If you read data from a location in main memory, you will also be reading some adjacent values. Writing data will cause the cache line to be read from memory and updated. As a result, you can cause a value to be loaded into cache as a side-effect of reading or writing something nearby, adding to the general aura of mystery.

Observability

Before going further, it’s useful to define in a more rigorous fashion what is meant by “observing” a load or store. Suppose core 1 executes “A = 1”. The store is initiated when the CPU executes the instruction. At some point later, possibly through cache coherence activity, the store is observed by core 2. In a write-through cache it doesn’t really complete until the store arrives in main memory, but the memory consistency model doesn’t dictate when something completes, just when it can be observed.

(In a kernel device driver that accesses memory-mapped I/O locations, it may be very important to know when things actually complete. We’re not going to go into that here.)

Observability may be defined as follows:

- “A write to a location in memory is said to be observed by an observer Pn when a subsequent read of the location by Pn would return the value written by the write.”
- “A read of a location in memory is said to be observed by an observer Pm when a subsequent write to the location by Pm would have no effect on the value returned by the read.” (Reasoning about the ARM weakly consistent memory model)

A less formal way to describe it (where “you” and “I” are CPU cores) would be:

- I have observed your write when I can read what you wrote.
- I have observed your read when I can no longer affect the value you read.

The notion of observing a write is intuitive; observing a read is a bit less so (don’t worry, it grows on you).

With this in mind, we’re ready to talk about ARM.

ARM's weak ordering

ARM SMP provides weak memory consistency guarantees. It does not guarantee that loads or stores are ordered with respect to each other.

Thread 1                  Thread 2
A = 41                    loop_until (B == 1)
B = 1 // “A is ready”     reg = A

Recall that all addresses are initially zero. The “loop_until” instruction reads B repeatedly, looping until we read 1 from B. The idea here is that thread 2 is waiting for thread 1 to update A. Thread 1 sets A, and then sets B to 1 to indicate data availability.

On x86 SMP, this is guaranteed to work. Thread 2 will observe the stores made by thread 1 in program order, and thread 1 will observe thread 2’s loads in program order.

On ARM SMP, the loads and stores can be observed in any order. It is possible, after all the code has executed, for reg to hold 0. It’s also possible for it to hold 41. Unless you explicitly define the ordering, you don’t know how this will come out.

(For those with experience on other systems, ARM’s memory model is equivalent to PowerPC in most respects.)
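Expressed as a C++ sketch (again an added illustration with invented names, not from the original text): if the flag and the data are accessed with relaxed atomics, the language makes the same non-guarantee that ARM SMP does, and reg may legitimately end up holding 0.

    #include <atomic>

    std::atomic<int> A{0};  // the data
    std::atomic<int> B{0};  // the "ready" flag

    void thread1() {
        A.store(41, std::memory_order_relaxed);
        B.store(1, std::memory_order_relaxed);   // "A is ready"
    }

    int thread2() {
        while (B.load(std::memory_order_relaxed) != 1) {
            // loop_until (B == 1)
        }
        // Nothing orders this load after the load of B, so on a weakly
        // ordered machine it may return 0 instead of 41.
        return A.load(std::memory_order_relaxed);
    }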

Data memory barriers

Memory barriers provide a way for your code to tell the CPU that memory access ordering matters. ARM/x86 uniprocessors offer sequential consistency, and thus have no need for them. (The barrier instructions can be executed but aren’t useful; in at least one case they’re hideously expensive, motivating separate builds for SMP targets.)

There are four basic situations to consider:

1. store followed by another store
2. load followed by another load
3. load followed by store
4. store followed by load

Store/store and load/load

Recall our earlier example:

Thread 1                  Thread 2
A = 41                    loop_until (B == 1)
B = 1 // “A is ready”     reg = A

Thread 1 needs to ensure that the store to A happens before the store to B. This is a “store/store” situation. Similarly, thread 2 needs to ensure that the load of B happens before the load of A; this is a load/load situation. As mentioned earlier, the loads and stores can be observed in any order.

Going back to the cache discussion, assume A and B are on separate cache lines, with minimal cache coherency. If the store to A stays local but the store to B is published, core 2 will see B=1 but won’t see the update to A. On the other side, assume we read A earlier, or it lives on the same cache line as something else we recently read. Core 2 spins until it sees the update to B, then loads A from its local cache, where the value is still zero.

We can fix it like this:

Thread 1                  Thread 2
A = 41                    loop_until (B == 1)
store/store barrier       load/load barrier
B = 1 // “A is ready”     reg = A

The store/store barrier guarantees that all observers will observe the write to A before they observe the write to B. It makes no guarantees about the ordering of loads in thread 1, but we don’t have any of those, so that’s okay. The load/load barrier in thread 2 makes a similar guarantee for the loads there.

Since the store/store barrier guarantees that thread 2 observes the stores in program order, why do we need the load/load barrier in thread 2? Because we also need to guarantee that thread 1 observes the loads in program order.

The store/store barrier could work by flushing all dirty entries out of the local cache, ensuring that other cores see them before they see any future stores. The load/load barrier could purge the local cache completely and wait for any “in-flight” loads to finish, ensuring that future loads are observed after previous loads. What the CPU actually does doesn’t matter, so long as the appropriate guarantees are kept. If we use a barrier in core 1 but not in core 2, core 2 could still be reading A from its local cache.

Because the architectures have different memory models, these barriers are required on ARM SMP but not x86 SMP.
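In C or C++ you would normally not emit these barriers by hand; the same ordering can be expressed with atomic fences (or, more simply, with a release store and an acquire load on B). A hedged sketch of the fixed litmus test, with invented names (not from the original text):

    #include <atomic>

    std::atomic<int> A{0};  // the data
    std::atomic<int> B{0};  // the "ready" flag

    void thread1() {
        A.store(41, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);  // plays the role of the store/store barrier
        B.store(1, std::memory_order_relaxed);                // "A is ready"
    }

    int thread2() {
        while (B.load(std::memory_order_relaxed) != 1) {
            // loop_until (B == 1)
        }
        std::atomic_thread_fence(std::memory_order_acquire);  // plays the role of the load/load barrier
        return A.load(std::memory_order_relaxed);             // now guaranteed to return 41
    }

(A release fence is actually a little stronger than a pure store/store barrier, and an acquire fence a little stronger than a pure load/load barrier, but they provide the orderings this example needs.)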

Load/store and store/load

The Dekker’s Algorithm fragment shown earlier illustrated the need for a store/load barrier. Here’s an example where a load/store barrier is required:

Thread 1                        Thread 2
reg = A                         loop_until (B == 1)
B = 1 // “I have latched A”     A = 41 // update A

Thread 2 could observe thread 1’s store of B=1 before it observes thread 1’s load from A, and as a result store A=41 before thread 1 has a chance to read A. Inserting a load/store barrier in each thread solves the problem:

Thread 1                        Thread 2
reg = A                         loop_until (B == 1)
load/store barrier              load/store barrier
B = 1 // “I have latched A”     A = 41 // update A

A store to local cache may be observed before a load from main memory, because accesses to main memory are so much slower. In this case, assume core 1’s cache has the cache line for B but not A. The load from A is initiated, and while that’s in progress execution continues. The store to B happens in local cache, and by some means becomes available to core 2 while the load from A is still in progress. Thread 2 is able to exit the loop before it has observed thread 1’s load from A.

A thornier question is: do we need a barrier in thread 2? If the CPU doesn’t perform speculative writes, and doesn’t execute instructions out of order, can thread 2 store to A before thread 1’s read if thread 1 guarantees the load/store ordering? (Answer: no.) What if there’s a third core watching A and B? (Answer: now you need one, or you could observe B==0 / A==41 on the third core.) It’s safest to insert barriers in both places and not worry about the details.

As mentioned earlier, store/load barriers are the only kind required on x86 SMP.
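As promised above, here is one way the Dekker fragment could be repaired in C++ (an added sketch with invented names, not from the original text). A sequentially consistent fence between each thread's store and its subsequent load acts as the full store/load barrier:

    #include <atomic>

    std::atomic<bool> flagA{false}, flagB{false};

    void thread1() {
        flagA.store(true, std::memory_order_relaxed);         // A = true
        std::atomic_thread_fence(std::memory_order_seq_cst);  // store/load (full) barrier
        if (!flagB.load(std::memory_order_relaxed)) {         // reg1 = B
            // critical-stuff: the fences guarantee that at least one thread
            // sees the other's flag, so both can no longer get here at once.
        }
    }

    void thread2() {
        flagB.store(true, std::memory_order_relaxed);          // B = true
        std::atomic_thread_fence(std::memory_order_seq_cst);
        if (!flagA.load(std::memory_order_relaxed)) {          // reg2 = A
            // critical-stuff
        }
    }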

Barrier instructions

Different CPUs provide different flavors of barrier instruction. For example:

- Sparc V8 has a “membar” instruction that takes a 4-element bit vector. The four categories of barrier can be specified individually.
- Alpha provides “rmb” (load/load), “wmb” (store/store), and “mb” (full). (Trivia: the Linux kernel provides three memory barrier functions with these names and behaviors.)
- x86 has a variety of options; “mfence” (introduced with SSE2) provides a full barrier.
- ARMv7 has “dmb st” (store/store) and “dmb sy” (full).

“Full barrier” means all four categories are included.

It is important to recognize that the only thing guaranteed by barrier instructions is ordering. Do not treat them as cache coherency “sync points” or synchronous “flush” instructions. The ARM “dmb” instruction has no direct effect on other cores. This is important to understand when trying to figure out where barrier instructions need to be issued.
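Application code normally reaches these instructions through compiler or library primitives rather than by writing them directly. As a rough, hedged sketch of how the mapping can look with GCC or Clang (illustrative only; portable code should prefer the standard primitives):

    #include <atomic>

    void full_barrier_portable() {
        // Portable C++11 full fence; ARMv7 compilers typically emit a "dmb" for this.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }

    void full_barrier_legacy_builtin() {
        // Older GCC/Clang builtin; also a full barrier.
        __sync_synchronize();
    }

    #if defined(__arm__)
    void store_store_barrier_arm() {
        // Raw ARMv7 store/store barrier. The "memory" clobber also stops the
        // compiler itself from reordering memory accesses across this point.
        __asm__ __volatile__("dmb st" ::: "memory");
    }
    #endif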

Address dependencies and causal consistency

(This is a slightly more advanced topic and can be skipped.)

The ARM CPU provides one special case where a load/load barrier can be avoided. Consider the following example from earlier, modified slightly:

Thread 1                  Thread 2
[A+8] = 41                loop:
store/store barrier       reg0 = B
B = 1 // “A is ready”     if (reg0 == 0) goto loop
                          reg1 = 8
                          reg2 = [A + reg1]

This introduces a new notation. If “A” refers to a memory address, “A+n” refers to a memory address offset by n bytes from A. If A is the base address of an object or array, [A+8] could be a field in the object or an element in the array.

The “loop_until” seen in previous examples has been expanded to show the load of B into reg0. reg1 is assigned the numeric value 8, and reg2 is loaded from the address [A+reg1] (the same location that thread 1 is accessing).

This will not behave correctly because the load from B could be observed after the load from [A+reg1]. We can fix this with a load/load barrier after the loop, but on ARM we can also just do this:

Thread 1                  Thread 2
[A+8] = 41                loop:
store/store barrier       reg0 = B
B = 1 // “A is ready”     if (reg0 == 0) goto loop
                          reg1 = 8 + (reg0 & 0)
                          reg2 = [A + reg1]

What we’ve done here is change the assignment of reg1 from a constant (8) to a value that depends on what we loaded from B. In this case, we do a bitwise AND of the value with 0, which yields zero, which means reg1 still has the value 8. However, the ARM CPU believes that the load from [A+reg1] depends upon the load from B, and will ensure that the two are observed in program order.

This is called an address dependency. Address dependencies exist when the value returned by a load is used to compute the address of a subsequent load or store. It can let you avoid the need for an explicit barrier in certain situations.
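In C++11 this idea surfaces as std::memory_order_consume, which was designed to exploit address and data dependencies like the one above. A hedged sketch with invented names (in practice most compilers simply promote consume to the stronger acquire ordering, so treat this as an illustration of the concept rather than a guaranteed optimization):

    #include <atomic>

    struct Payload { int value; };

    std::atomic<Payload*> ready_ptr{nullptr};

    void producer(Payload* p) {
        p->value = 41;                                  // "[A+8] = 41"
        ready_ptr.store(p, std::memory_order_release);  // store/store barrier + "B = 1"
    }

    int consumer() {
        Payload* p;
        while ((p = ready_ptr.load(std::memory_order_consume)) == nullptr) {
            // loop_until (B == 1)
        }
        // The read of p->value carries an address dependency on the load of
        // ready_ptr, so no separate load/load barrier is required here.
        return p->value;
    }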

ARM does not provide control dependency guarantees. To illustrate this it’s necessary to dip into ARM code for a moment (Barrier Litmus Tests and Cookbook):

    LDR r1, [r0]
    CMP r1, #55
    LDRNE r2, [r3]

The loads from r0 and r3 may be observed out of order, even though the load from r3 will not execute at all if [r0] doesn’t hold 55. Inserting AND r1, r1, #0 and replacing the last instruction with LDRNE r2, [r3, r1] would ensure proper ordering without an explicit barrier. (This is a prime example of why you can’t think about consistency issues in terms of instruction execution. Always think in terms of memory accesses.)

While we’re hip-deep, it’s worth noting that ARM does not provide causal consistency:

Thread 1    Thread 2                Thread 3
A = 1       loop_until (A == 1)     loop:
            B = 1                   reg0 = B
                                    if (reg0 == 0) goto loop
                                    reg1 = reg0 & 0
                                    reg2 = [A+reg1]

Here, thread 1 sets A, signaling thread 2. Thread 2 sees that and sets B to signal thread 3. Thread 3 sees it and loads from A, using an address dependency to ensure that the load of B and the load of A are observed in program order.

It’s possible for reg2 to hold zero at the end of this. The fact that a store in thread 1 causes something to happen in thread 2 which causes something to happen in thread 3 does not mean that thread 3 will observe the stores in that order. (Inserting a load/store barrier in thread 2 fixes this.)

Memory barrier summary

Barriers come in different flavors for different situations. While there can be performance advantages to using exactly the right barrier type, there are code maintenance risks in doing so: unless the person updating the code fully understands it, they might introduce the wrong type of operation and cause a mysterious breakage. Because of this, and because ARM doesn’t provide a wide variety of barrier choices, many atomic primitives use full barrier instructions when a barrier is required.

The key thing to remember about barriers is that they define ordering. Don’t think of them as a “flush” call that causes a series of actions to happen. Instead, think of them as a dividing line in time for operations on the current CPU core.

Atomic operations

Atomic operations guarantee that an operation that requires a series of steps always behaves as if it were a single operation. For example, consider a non-atomic increment (“++A”) executed on the same variable by two threads simultaneously:

Thread 1          Thread 2
reg = A           reg = A
reg = reg + 1     reg = reg + 1
A = reg           A = reg

If the threads execute concurrently from top to bottom, both threads will load 0 from A, increment it to 1, and store it back, leaving a final result of 1. If we used an atomic increment operation, you would be guaranteed that the final result will be 2.
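In C++ the atomic version might look like the following sketch (added here for illustration; Java offers the same thing through java.util.concurrent.atomic.AtomicInteger):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> A{0};

    void increment() {
        // fetch_add performs the load, the increment, and the store as one
        // indivisible operation, so concurrent increments are never lost.
        A.fetch_add(1);   // ++A on a std::atomic<int> does the same thing
    }

    int main() {
        std::thread t1(increment), t2(increment);
        t1.join();
        t2.join();
        std::printf("A = %d\n", A.load());  // always prints A = 2
        return 0;
    }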
