《编程珠玑（续）》学习笔记——第一章性能监视工具（1）

现在的位置: 首页 > 综合 > 正文

《编程珠玑（续）》学习笔记——第一章性能监视工具（1）

2018年03月30日 ⁄ 综合 ⁄ 共 13464字 ⁄ 字号小中大 ⁄ 评论关闭

一、计算素数（prime number）

定义：质数，又称素数，指在大于1的自然数中，除了1和此整数自身外，无法被其他自然数整除的数（也可定义为只有1和本身两个因数的数）。（摘自维基百科：质数）

素数判定，或素性测试，是检验一个给定的整数是否为素数的测试。（摘自维基百科：素性测试）

思考：可以查阅这方面的资料，了解各种实现算法的原理和意义，利用本文所说的性能监视工具，对这几种算法进行对比和调优，并进行调优的记录（应当包括算法调优和数据结构调优）。要注意的是：工程上使用时分两种：一种是算法产生质数，一种是算法检验一个数是否为质数，两者的重点不同，所以算法特点也不一样。

二、开发性能监视工具

目标：调优单个子过程或函数的性能时，性能监视工具能告诉你运行时间都花在了哪里。

思想：对于不处理I/O密集型的大多数程序，大部分的运行时间花在了很小一部分代码上。

文中主要介绍了两种性能开发工具：

1.行计数性能监视：用于检查每行在程序运行过程中执行的次数，可以查看异常计数寻找代码BUG。文中提到重要的一点：行计数在估计测试覆盖面时极有价值。

2.过程时间性能监视：解释各个过程的CPU运行时间。

三、C语言开发性能监视工具

平台：UBUNTU + GCC

工具一：gprof （《C专家编程》“有用的C语言工具”一节中也提及到，其中prof为UNIX系统下的程序）

英文简介：This l describes the GNU profiler, gprof, and how you can use it to determine which parts of a program are taking most of the execution time. （详细教程参见：http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html）

重点：配合GCC监视程序各函数运行时间和效率。

第一步: Compiling a Program for Profiling

GCC使用编译选项-pg (备注: cc 和 gcc 命令在LINUX下都指向/usr/bin/gcc（使UNIX下C程序能够编译通过），UNIX下cc特指UNIX下的C语言编辑器。)

例子: cc -o myprog myprog.c utils.c -g -pg

有的选项:

-lc_p If you run the linker ld directly instead of through a compiler such as cc, you must specify the profiling startup file `/lib/gcrt0.o' as the first input file instead of the usual startup file `/lib/crt0.o'.
In addition, you would probably want to specify the profiling C library, `/usr/lib/libc_p.a', by writing `-lc_p' instead of the usual `-lc'.

（说明:1. 使用ld链接时需要用/lib/gcrt0.o代替crt0.o作为第一个input文件；2.如果要调试libc库需要使用-lc_p代替-lc参数）

(UBUNTU 下没有这个libc_p.a文件，需要安装：sudo apt-get install libc6-prof 但仍会出现以下错误，暂无解决方案：

/usr/bin/ld: dynamic STT_GNU_IFUNC symbol `strcmp' with pointer equality in `/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/libc_p.a(strcmp.op)' can not be used when making an executable; recompile with -fPIE and relink with -pie

)

第二步: Executing the Program to Generate Profile Data

运行二进制可执行文件，程序会在当前目录自动下生成文件：gmon.out

第三步: gprof Command Summary

The result of the analysis is a file containing two tables, the flat profile （过程时间性能监视）and thecall graph（行计数性能监视）

例子: gprof options [executable-file [profile-data-files...]] [outfile]

有用的选项:

-z If you give the `-z' option, gprof will mention all functions in the flat profile, even those that were never called, and that had no time spent in them.

特点：实际使用时，该工具只能定位到函数级，函数内部不再进行分析。

（LINUX平台下类似工具：oprofile:适用于系统级软件， gprof适用于用户级软件，也有别的程序通过gprof分析形成图形化的界面而不是文档形式，可利用）

工具二：gcov （《C专家编程》“有用的C语言工具”一节中提及到tcov，但属于SUNOS系统，这里的gcov是gcc提供的，与上面工具gprof配套，lcov是gcov图形化前端工具，可进一步学习）

英文简介：gcov is a test coverage program. Use it in concert with GCC to analyze your programs to help create more efficient, faster running code and to discover untested parts of
your program. You can use gcov as a profiling tool to help discover where your optimization efforts will best affect your code. You can also use gcov along with the other profiling tool, gprof, to assess which parts of your code use the greatest amount of
computing time.（详细教程参见：https://gcc.gnu.org/onlinedocs/gcc/Gcov.html#Gcov）

重点：配合GCC实现代码覆盖率测试，获得:每一行代码的执行频率,实际上哪些代码确实被执行了,每一段代码(section code)的耗时(执行时间)。使用该工具可以查看测试的覆盖率，进而分析测试用例设计的缺陷。使用该工具可以查看程序在某分支处的执行频率，进而分析程序的性能。

第一步:When using gcov, you must first compile your program with two special GCC options: ‘-fprofile-arcs -ftest-coverage’.

GCC 加入这两个编译选项，并运行程序后，会产生两个文件：

gcov主要使用.gcno和.gcda两个文件。gcno是由-ftest-coverage产生的，它包含了重建基本块图和相应的块的源码的行号的信息。gcda是由加了-fprofile-arcs编译参数的编译后的文件运行所产生的，它包含了弧跳变的次数和其他的概要信息（而gcda只能在程序运行完毕后才能产生的）。

第二步:Running the program will cause profile output to be generated.

第三步:Running gcov with your program's source file names as arguments will now produce a listing of the code along with frequency of execution for each line.

可使用-b选项来查看程序的分支数据。

通过使用-f选项来查看每一个函数的执行情况。

第四步:打开后缀位.gcov的文件查看信息：

被标记为#####的代码行就是没有被执行过的

工具三：time（time命令用于打印出一条命令或一个程序的执行时间。）

使用： time -p ./a.out

(time命令结果有三行组成：real、user和sys。我们这里用的都是real值，它表示从程序开始到程序执行结束时所消耗的时间，包括CPU的用时。CPU用时被划分为user和sys两块。user值表示程序本身，以及它所调用的库中的子例程使用的时间。sys是由程序直接或间接调用的系统调用执行的时间。)

四、时间复杂度

时间复杂度最初的学习来自于《大话数据结构》，应用级别学习此书上相关内容即可，浅显易懂，但比较难进一步认知时间复杂度这一概念及思想，回忆起来时不能很好的阐释清除其意义，故重新学习之。（也为《算法导论》学习做功课）

参考资料：http://www.cnblogs.com/gaochundong/p/complexity_of_algorithms.html

http://blog.csdn.net/zolalad/article/details/11848739

http://univasity.iteye.com/blog/1164707

思想：

1.为什么要进行算法分析？

预测算法所需的资源：计算时间（CPU 消耗）内存空间（RAM 消耗）通信时间（带宽消耗）
预测算法的运行时间：在给定输入规模时，所执行的基本操作数量。或者称为算法复杂度（Algorithm Complexity）

算法在实际使用时会有一定的约束条件，通过对算法进行分析，可以更清楚算法在这种约束条件的下的执行情况并对算法进行调整。

       2.如何衡量算法复杂度？
           内存（Memory）
           时间（Time）
           指令的数量（Number of Steps）
           特定操作的数量
           磁盘访问数量
           网络包数量
           渐进复杂度（Asymptotic Complexity）
       3.算法的运行时间与什么相关？
           取决于输入的数据。（例如：如果数据已经是排好序的，时间消耗可能会减少。）
           取决于输入数据的规模。（例如：6 和 6 * 109）
           取决于运行时间的上限。（因为运行时间的上限是对使用者的承诺。）
       4.算法分析的种类：
           最坏情况（Worst Case）：任意输入规模的最大运行时间。（Usually）
           平均情况（Average Case）：任意输入规模的期待运行时间。（Sometimes）
           最佳情况（Best Case）：通常最佳情况不会出现。（Bogus）

5.时间复杂度定义：

一般情况下，算法中基本操作重复执行的次数是问题规模n的某个函数，用T(n)表示，若有某个辅助函数f(n)，使得当n趋近于无穷大时，T(n)/f(n)的极限值为不等于零的常数，则称f(n)是T(n)的同数量级函数。记作T(n)=O(f(n))，称O(f(n))为算法的渐进时间复杂度(O是数量级的符号 )，简称时间复杂度。

原则：

       算法分析要保持大局观（Big Idea），其基本思路：
                            忽略掉那些依赖于机器的常量。
                            关注运行时间的增长趋势。

方法：

渐近记号（Asymptotic Notation）通常有 O、 Θ 和 Ω 记号法。Θ 记号渐进地给出了一个函数的上界和下界，当只有渐近上界时使用 O 记号，当只有渐近下界时使用 Ω 记号。尽管技术上 Θ 记号较为准确，但通常仍然使用 O 记号表示。

使用 O 记号法（Big O Notation）表示最坏运行情况的上界。例如，

线性复杂度 O(n) 表示每个元素都要被处理一次。
平方复杂度 O(n²) 表示每个元素都要被处理 n 次。

Notation	Intuition	Informal Definition
	f is bounded above by g asymptotically
	Two definitions : Number theory: f is not dominated by g asymptotically Complexity theory: f is bounded below by g asymptotically
	f is bounded both above and below by g asymptotically

常见的时间复杂度：

复杂度	标记符号	描述
常量（Constant）	O(1)	操作的数量为常数，与输入的数据的规模无关。 n = 1,000,000 -> 1-2 operations
对数（Logarithmic）	O(log₂ n)	操作的数量与输入数据的规模 n 的比例是 log₂ (n)。 n = 1,000,000 -> 30 operations
线性（Linear）	O(n)	操作的数量与输入数据的规模 n 成正比。 n = 10,000 -> 5000 operations
平方（Quadratic）	O(n²)	操作的数量与输入数据的规模 n 的比例为二次平方。 n = 500 -> 250,000 operations
立方（Cubic）	O(n³)	操作的数量与输入数据的规模 n 的比例为三次方。 n = 200 -> 8,000,000 operations
指数（Exponential）	O(2ⁿ) O(kⁿ) O(n!)	指数级的操作，快速的增长。 n = 20 -> 1048576 operations

计算步骤：

1. 计算出基本操作的执行次数T(n)

注意循环体，算法考虑最坏情况。

2.计算出T(n)的数量级

忽略常量、低次幂和最高次幂的系数，重点在增长率。

3.用大Ο记号表示算法的时间性能

技巧：

加法规则:T(n,m) = T1(n) + T2(n) = O ( max (f(n), g(m) )

乘法规则:T(n,m) = T1(n) * T2(m) = O (f(n) * g(m))

常见的算法时间复杂度由小到大依次为：Ο(1)＜Ο(log2n)＜Ο(n)＜Ο(nlog2n)＜Ο(n2)＜Ο(n3)＜…＜Ο(2n)＜Ο(n!)

五、素数判定算法实战：

版本一（试除法）：特点：采用乘法代替开方，只考虑sqrt(n)以内的整数因子。

判断质数子函数时间复杂度：O(sqrt(n)) 总时间复杂度：O(n*sqrt(n))

  8 #include <stdio.h>
  9 #include <stdlib.h>
 10 
 11 int is_prime_number(unsigned int count)
 12 {
 13     int i;
 14     for(i = 2; i * i <= count; i++) {
 15         if(0 == count % i)
 16             return 0;
 17     }
 18     return 1;
 19 }
 20 
 21 int main(void)
 22 {
 23     int i;
 24     unsigned int count = 1000000000u;
 25     for(i = 2; i <= count; i++) {
 26         if(is_prime_number(i))
 27             printf("%d\n", i);
 28     }
 29     return EXIT_SUCCESS;
 30 }

gprof测试情况:

n =10^5: 时间：0.01

Flat profile:

Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls ns/call ns/call name
101.27      0.01     0.01    99999   101.27   101.27 is_prime_number

Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 98.75% of 0.01 seconds

index % time    self children    called     name
                0.01    0.00   99999/99999       main [2]
[1]    100.0    0.01    0.00   99999         is_prime_number [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.01                 main [2]
                0.01    0.00   99999/99999       is_prime_number [1]
-----------------------------------------------

n =
10^6: 时间：0.31

Flat profile:

Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls ns/call ns/call name
101.27      0.31     0.31   999999   313.92   313.92 is_prime_number

Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 3.19% of 0.31 seconds

index % time    self children    called     name
                0.31    0.00 999999/999999      main [2]
[1]    100.0    0.31    0.00 999999         is_prime_number [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.31                 main [2]
                0.31    0.00 999999/999999      is_prime_number [1]
-----------------------------------------------

n =10^7: 时间：7.54
Flat profile:

Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls ns/call ns/call name
100.06      7.54     7.54 9999999   754.43   754.43 is_prime_number
0.81      7.61     0.06                             main
0.40      7.64     0.03                             frame_dummy

Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 0.13% of 7.64 seconds

index % time    self children    called     name
                                                 <spontaneous>
[1]     99.6    0.06    7.54                 main [1]
                7.54    0.00 9999999/9999999     is_prime_number [2]
-----------------------------------------------
                7.54    0.00 9999999/9999999     main [1]
[2]     98.8    7.54    0.00 9999999         is_prime_number [2]
-----------------------------------------------
                                                 <spontaneous>
[3]      0.4    0.03    0.00                 frame_dummy [3]
-----------------------------------------------

n = 10^8: 时间：201.75

Flat profile:

Each sample counts as 0.01 seconds.
%   cumulative   self              self     total
time   seconds   seconds    calls us/call us/call name
101.03    201.75   201.75 99999999     2.02     2.02 is_prime_number
0.17    202.10     0.34                             main
0.07    202.23     0.13                             frame_dummy

Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 0.00% of 202.23 seconds

index % time    self children    called     name
                                                 <spontaneous>
[1]     99.9    0.34 201.75                 main [1]
              201.75    0.00 99999999/99999999     is_prime_number [2]
-----------------------------------------------
              201.75    0.00 99999999/99999999     main [1]
[2]     99.8 201.75    0.00 99999999         is_prime_number [2]
-----------------------------------------------
                                                 <spontaneous>
[3]      0.1    0.13    0.00                 frame_dummy [3]
-----------------------------------------------

版本二（试除法）：特点：采用开方，只考虑sqrt(n)以内的整数因子。（gcc -lm)

判断质数子函数时间复杂度：O(sqrt(n)) 总时间复杂度：O(n*sqrt(n))

  8 #include <stdio.h>
  9 #include <stdlib.h>
 10 #include <math.h>
 11 
 12 int is_prime_number(int count)
 13 {
 14     int i;
 15     int root = sqrt(count);
 16     for(i = 2; i <= root; i++) {
 17         if(0 == count % i)
 18             return 0;
 19     }
 20     return 1;
 21 }
 22 
 23 int main(void)
 24 {
 25     int i;
 26     int count = 100000;
 27     for(i = 2; i <= count; i++) {
 28         if(is_prime_number(i))
 29             printf("%d\n", i);
 30     }
 31     return EXIT_SUCCESS;
 32 }

gprof测试情况:

n = 10^5: 时间：0.01
n = 10^6:
时间：0.29
n = 10^7:
时间：7.41

n = 10^8:
时间：199.69

版本三（试除法）：特点：采用开方，对被2，3，5，7整除的数进行特殊检验。（gcc -lm)

判断质数子函数时间复杂度：O(sqrt(n)) 总时间复杂度：O(n*sqrt(n))

  8 #include <stdio.h>
  9 #include <stdlib.h>
 10 #include <math.h>
 11 
 12 int is_prime_number(int count)
 13 {
 14     if(0 == n % 2)
 15         return (2 == n);
 16     if(0 == n % 3)
 17         return (3 == n);
 18     if(0 == n % 5)
 19         return (5 == n);
 20     if(0 == n % 7)
 21         return (7 == n);
 22     int i;
 23     int root = sqrt(count);
 24     for(i = 11; i <= root; i += 2) {
 25         if(0 == count % i)
 26             return 0;
 27     }   
 28     return 1;
 29 }   
 30

 31 int main(void)
 32 {   
 33     int i;
 34     int count = 100000000;
 35     for(i = 2; i <= count; i++) {
 36         if(is_prime_number(i))
 37             printf("%d\n", i);
 38     }
 39     return EXIT_SUCCESS;
 40 }

n = 10^5:
时间：0.00

n = 10^6:
时间：0.12

n = 10^7:
时间：3.58

n = 10^8:
时间：98.36

总结：前三个版本逐步优化后，速度的确变快了，与《编程珠玑（续）》原文数据结论有出入的一处地方是：这里采用开方算法要比乘法算法要快。并且，这些优化措施，效率提升不大，它们的时间复杂度是相近的。（在计算总时间复杂度时，O(n*sqrt(n))这一个值待商榷，实际上它是一个n个i*sqrt(i)（1<=i<=n）求和）

版本四（埃拉托斯特尼篩法，簡稱埃氏篩法，Sieve of Eratosthenes，比传统试除法更佳）：

实现：給出要篩數值的範圍n，找出\sqrt{n}以內的質數p_{1},p_{2},\dots,p_{k}。先用2去篩，即把2留下，把2的倍數剔除掉；再用下一個質數，也就是3篩，把3留下，把3的倍數剔除掉；接下去用下一個質數5篩，把5留下，把5的倍數剔除掉；不斷重複下去......。
（维基百科）

思想：由于一个合数总是可以分解成若干个质数的乘积，那么如果把质数（最初只知道2是质数）的倍数都去掉，那么剩下的就是质数了。同时，在算法中，第一个循环内对于每一个未剔除的数p，由于小于大的数的倍数都已被剔除，因此这个数p必然是个质数。（这一点在维基百科中有提到：The main idea here is that every value for p is prime, because we have already
marked all the multiples of the numbers less than p.）

  8 #include <stdio.h>
  9 #include <stdlib.h>
 10 #include <math.h>
 11 
 12 #define TEST_COUNT      100000000
 13 #define TEST_COUNT_HALF 100000000u >> 1
 14 
 15 int prime_list[TEST_COUNT_HALF] = {0};
 16 ///< 0 means prime number, begin from 2, 3, 5, 7..., only include odd number
 17 
 18 int main(void)
 19 {
 20     int i, j;
 21     int sum = 0;
 22     int count = TEST_COUNT;
 23     int root = sqrt(count);
 24     for(i = 3; i <= root; i += 2) {
 25         if(!prime_list[(unsigned)(i - 1) >> 1]) {
 26             for(j = i * i; j <= count; j += i << 1) {
 27                 prime_list[(unsigned)(j - 1) >> 1] = 1;
 28             }
 29         }
 30     }
 31     if(count >= 2) {
 32         printf("%d\n", i);

 33         sum++;  
 34     }       
 35     for(i = 3; i <= count; i += 2) {
 36         if(!prime_list[(unsigned)(i - 1) >> 1]) {
 37             printf("%d\n", i);
 38             sum++;
 39         } 
 40     }   
 41     printf("sum of the prime number to %d is %d.\n", count, sum);
 42     return EXIT_SUCCESS;
 43 }

n = 10^5: 时间：0.00

n = 10^6: 时间：0.01

n = 10^7: 时间：0.22

n = 10^8: 时间：2.79

总结：认真思考，可以发现这个算法，结合质数定义，利用了质数和合数内部和之间的关系，同时开辟更多的空间（内存），给算法运算速度带来了很大的提高。算法必原来的相比效率提高了两个数量级

细节方面的优化措施：

1.剔除质数p时不必开始于2的倍数，最小可以设置为p*p，因为对于更小的倍数，这个相应的数已经更被小的质数所剔除。同样的思路，对于所有的遍历止于n的开方即可。（参考维基百科一段话：As a refinement, it is sufficient to mark the numbers in step 3 starting from p²,
as all the smaller multiples of p will have already been marked at that point.This means that the algorithm is allowed to terminate in step 4 when p2 is greater than n.）

2.设立剔除表时实际上只用考虑奇数（参考维基百科一段话：Another refinement is to initially list odd numbers only, (3, 5, ..., n), and count in increments of 2p in step 3, thus marking only odd multiples of p.）

（措施1和2优化使算法在原有基础上效率再次提升了一倍）

3.空间复杂度优化，标志位可以设计成一个bit大小节约空间。

【上篇】《编程之美》学习笔记——2.4 1的数目
【下篇】从GCC编译器来学习各类C标准及编译器使用

作者: interlock

该日志由 interlock 于6年前发表在综合分类下，最后更新于 2018年03月30日.
转载请注明: 《编程珠玑（续）》学习笔记——第一章性能监视工具（1） | 学步园 +复制链接

抱歉!评论已关闭.

学步园