
Fran Allen: Compilers and Parallel Computing Systems


Note: compiled from the web; this article summarizes a talk by Fran Allen on compilers and parallel computing systems.


The grand goal of high-performance computing today is a machine capable of a petaflop, which of course requires a million gigaflops of processing power. She shows a semilog plot of peak speed against year of introduction; the trend is a straight line (Moore's law is still working).


Much of Allen's work in the '80s and early '90s was around the PTRAN system of analysis for parallelism. The techniques are used, for example, in the optimization stage of IBM's XL family of compilers.

Because more and more transistors are being placed on chips, they're using more and more energy and getting hotter. Part of the solution, which we're seeing play out, is multi-core chips. This requires parallelism to achieve the performance users expect. But making use of multiple cores requires that tasks be organized, by either users or software, to run in parallel.


By 2021, there will be chips with 1024 cores on them. Is parallelism the tool that will make all these cores useful? John Hennessy has called it the biggest challenge Computer Science has ever faced, and he has credentials that might make you believe him. Allen says that it's also the best opportunity Computer Science has to improve user productivity, application performance, and system integrity.


For parallel (superscalar, etc.) architectures, compilers (software) have been used to automatically manage the scheduling of tasks so that they can operate in parallel. Which of those techniques will be useful in this new world of multi-cores?


Allen says we need to get rid of C, and soon. C, as a language, doesn't provide enough information to the compiler for it to figure out interdependencies, making it hard to parallelize. Another way to look at it is that pointers allow programmers to build programs that can't easily be analyzed to find out which parts of the program can be executed at the same time.
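
To make the aliasing problem concrete, here is a minimal C sketch (my illustration, not from the talk). In the first function the compiler must assume dst and src might overlap, so it cannot safely reorder or parallelize the loop; C99's restrict qualifier is one way to hand it the missing independence information.

    #include <stddef.h>

    /* The compiler must assume dst and src may alias, so iterations of
     * this loop cannot safely be reordered or run in parallel: if
     * dst == src + 1, each iteration depends on the previous one. */
    void scale(float *dst, float *src, size_t n, float k) {
        for (size_t i = 0; i < n; i++)
            dst[i] = k * src[i];
    }

    /* With restrict, the programmer promises the arrays do not overlap,
     * so the iterations are independent and the loop is parallelizable. */
    void scale_restrict(float *restrict dst, const float *restrict src,
                        size_t n, float k) {
        for (size_t i = 0; i < n; i++)
            dst[i] = k * src[i];
    }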


Another factor that makes parallelization hard is data movement: the latency of moving data inhibits high performance. Here Allen offers no silver bullet.


The key is the right high-level language, one that can effectively take advantage of the many good scheduling and pipelining algorithms that already exist. If we don't start with the right high-level language, those techniques will have limited impact.

She presents some research from Vivek Sarkar on compiling for parallelism. Only a small fraction of application developers are experts in parallelism, and expecting the rest to become experts is unreasonable. The software is too complex, and it is the primary bottleneck in the usage of parallel systems. X10 is an example of an (object-oriented) language that tries to maximize the amount of automatic parallel optimization that can be done.
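
X10's signature constructs are async (spawn a task) and finish (wait for all tasks spawned inside it). As a rough analogue, here is a sketch in C with OpenMP tasks (my sketch, not actual X10 code) of the same fork-join style, where the runtime rather than the programmer maps tasks onto cores:

    /* Compile with: gcc -fopenmp fib.c (without -fopenmp it still runs,
     * just sequentially). */
    #include <stdio.h>

    long fib(long n) {
        if (n < 2) return n;
        long a, b;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait   /* like X10's finish: wait for child tasks */
        return a + b;
    }

    int main(void) {
        long r;
        #pragma omp parallel
        #pragma omp single     /* one thread seeds the task tree */
        r = fib(20);
        printf("%ld\n", r);    /* 6765 */
        return 0;
    }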

Major themes include cross-procedure parallelization, data-dependency analysis, and control-dependency analysis, and then using those analyses to satisfy the dependencies while maximizing parallelism.
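
For a concrete sense of what data-dependency analysis decides, compare these two loops (my example, not from the talk): the first has a loop-carried dependence and must run in order; the second has none and can be parallelized.

    /* Loop-carried dependence: a[i] reads a[i-1], which was written by
     * the previous iteration, so iterations cannot be reordered. */
    void prefix_sum(double *a, int n) {
        for (int i = 1; i < n; i++)
            a[i] = a[i] + a[i - 1];
    }

    /* No loop-carried dependence: each iteration touches only index i,
     * so all iterations can run in parallel. */
    void pointwise(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * b[i];
    }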

Useful parallelism depends on the run-time behavior of the program (i.e., loop frequency, branch prediction, and node run times) and the parameters of the target multiprocessor. Finding the maximal parallelism isn't enough, because it probably can't be efficiently mapped onto the multiple cores or processors. There is a trade-off between the partition cost and the run time; finding the intersection gives the right level of parallelism, the level that makes the most efficient use of available resources. Interprocedural analysis is the key to whole-program parallelism.
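
One way to make that trade-off concrete is a toy cost model (my construction, not Allen's): if work W is split across p tasks and partitioning adds overhead roughly proportional to p, the total time T(p) = W/p + c*p is minimized near p* = sqrt(W/c), and pushing parallelism past that point makes the program slower.

    #include <math.h>
    #include <stdio.h>

    /* Toy cost model: W/p is the parallelized work, c*p the growing
     * overhead of partitioning, scheduling, and communication. */
    double total_time(double W, double c, double p) {
        return W / p + c * p;
    }

    int main(void) {
        double W = 1e6, c = 10.0;
        /* dT/dp = -W/p^2 + c = 0  =>  p* = sqrt(W/c) */
        double p_star = sqrt(W / c);
        printf("best partition count ~= %.0f tasks\n", p_star);      /* ~316 */
        printf("T(p*) = %.0f vs T(10000) = %.0f\n",
               total_time(W, c, p_star), total_time(W, c, 10000.0)); /* 6325 vs 100100 */
        return 0;
    }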


One of the PTRAN analysis techniques was to transform the program into a functional equivalent that used static single assignment (SSA). This, of course, is what functional-programming enthusiasts have been saying for years: one of functional programming's biggest advantages is that functional programs, those without mutation, are much more easily parallelized than imperative programs (including imperative-based object-oriented languages).
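
A hand-sketched example (mine) of the SSA idea: each variable is assigned exactly once, so every use points at exactly one definition and the data flow becomes explicit, much as in a mutation-free functional program.

    /* Before: x is mutated, so the compiler must track which
     * assignment each use of x refers to. */
    int before(int a, int b) {
        int x = a + 1;
        x = x * b;          /* overwrites x */
        return x + a;
    }

    /* After an SSA-style rewrite (done by hand here; compilers do this
     * internally): each name is assigned exactly once, every use has a
     * single definition, and independent values can be computed in
     * parallel. */
    int after(int a, int b) {
        const int x1 = a + 1;
        const int x2 = x1 * b;
        return x2 + a;
    }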


There's a long list of transformations that can be done, everything from array padding (to get easily handled dimensions) to loop unrolling and interleaving. Doing most of these transformations well requires detailed knowledge of the machine, making them a better job for compilers than for humans. Even then, the speedup is less than the number of additional processors applied to the job: applying 4 processors doesn't get you a speedup of 4, more like 2.2. The speedup, at present, is asymptotic.
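
As one illustration, here is loop unrolling sketched by hand in C (the unroll factor of 4 is arbitrary; a compiler would choose it from the machine's pipeline depth and register budget):

    /* Original loop: one long dependence chain through s. */
    double dot(const double *a, const double *b, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    /* Unrolled by 4 with separate accumulators: the four partial sums
     * do not depend on each other, so they can issue in parallel in
     * the pipeline. */
    double dot_unrolled(const double *a, const double *b, int n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)      /* remainder iterations */
            s0 += a[i] * b[i];
        return (s0 + s1) + (s2 + s3);
    }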


(The End)
