Second Pair Programming Assignment

GitHub link: https://github.com/eastOffice/WordFrequency

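How We Collaborated

This time we combined pair coding with separate coding. We first discussed the overall design of the project and how to divide up the features, then created the GitHub repository, and each of us developed our own features. When we ran into problems, or needed functionality that both of us would use, we switched to pair coding to work it out.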

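About Our Discussions

Design guideline: The design of this project was fairly simple, since splitting it by feature was enough. We set up the overall code framework in very little time and did not change it afterwards.
Coding convention: We settled on the coding conventions in our first discussion and wrote the interfaces of a few template functions together. All later functions followed those templates.
Reach agreement: This was also simple. We went with whoever had the better and faster solution.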

Dealing with Time Constraint

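We estimated the project time up front and gave ourselves a very tight budget: start coding on October 28 and try to have it mostly done within one day. In the end the result matched our expectations. As the commit timestamps in the GitHub repository show, we had essentially finished everything in the framework on October 28 and spent the following two days testing, debugging, and optimizing.

Time really was a significant constraint. We only had free time for the project on the weekend; on weekdays we squeezed out about two hours in the evening. Still, the project itself was not hard, and we quickly implemented everything and then tested and optimized it successfully.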

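Teammate

Strengths:

Communicates actively
Always strives for improvement
Codes quickly

Weaknesses:

Could follow coding conventions more closely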

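Unit Tests and Regression Tests

Unit tests: In our code structure, each feature is handled by one mode function in modes.py, and each mode function consists of logic code plus helper functions from utils.py. This made unit testing straightforward: we first tested the helper functions in utils.py, then the mode functions.

Regression tests: Every time we added a new function, we added test code for it and reran all the test code to make sure every combination of parameters still ran correctly (this was easy too, since we only had to control the arguments passed to the mode functions).

As a result, the test code from every stage lived in a single test file. To make it easier to read and to measure coverage, we condensed and tidied the final test code, dropped the tests for the most basic helper functions in utils.py, and consolidated the rest into coverage_test.py.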
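The testing order described above (helpers first, then mode functions across parameter combinations) could be sketched roughly as follows. This is only an illustration: split_words and mode_count are hypothetical stand-ins and are not the real functions in utils.py or modes.py, and the use of unittest is an assumption, since the post does not say which test framework was used.

```python
import unittest


def split_words(text):
    """Hypothetical stand-in for a utils.py helper: lowercase and split on whitespace."""
    return text.lower().split()


def mode_count(text, stop_words=None, top=None):
    """Hypothetical stand-in for a modes.py mode function: ranked word frequencies."""
    stop = set(stop_words or [])
    counts = {}
    for word in split_words(text):
        if word not in stop:
            counts[word] = counts.get(word, 0) + 1
    # Sort by descending count, then alphabetically, so the order is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if top is None else ranked[:top]


class TestUtils(unittest.TestCase):
    """Step 1: test the helper on its own."""

    def test_split_words(self):
        self.assertEqual(split_words("The cat  THE"), ["the", "cat", "the"])


class TestModes(unittest.TestCase):
    """Step 2: test the mode function by controlling its entry parameters."""

    def test_parameter_combinations(self):
        text = "the cat and the dog"
        cases = [
            ({}, [("the", 2), ("and", 1), ("cat", 1), ("dog", 1)]),
            ({"top": 1}, [("the", 2)]),
            ({"stop_words": ["the"]}, [("and", 1), ("cat", 1), ("dog", 1)]),
            ({"stop_words": ["the"], "top": 2}, [("and", 1), ("cat", 1)]),
        ]
        for kwargs, expected in cases:
            with self.subTest(kwargs=kwargs):
                self.assertEqual(mode_count(text, **kwargs), expected)


def run_all_tests():
    """Rerun every test case; doing this after each new function acts as a regression run."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(TestUtils))
    suite.addTests(loader.loadTestsFromTestCase(TestModes))
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_all_tests() after each change reruns the whole suite, which is the regression step in miniature.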

Coverage Test

After finishing all the features, we assembled coverage_test.py from the test file and ran it with Python's coverage package:

coverage run coverage_test.py
coverage report

Result:

Name               Stmts   Miss  Cover
--------------------------------------
coverage_test.py      36      0   100%
modes.py              94      0   100%
utils.py              68      0   100%
--------------------------------------
TOTAL                198      0   100%

The result is recorded in coverage_test.txt.

Profiling and Optimization

We used Python's cProfile for profiling. Each report below lists the ten functions with the highest internal time.
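For readers who want to produce a similar report, here is a minimal sketch using the standard cProfile and pstats modules. The workload function is a hypothetical stand-in for the code being profiled, and the helper name profile_top and the temporary output path are assumptions; the post does not show the exact commands that were used.

```python
import cProfile
import os
import pstats
import tempfile


def workload():
    """Hypothetical stand-in for the profiled code (for example, one mode function)."""
    words = ["alpha", "beta", "gamma"] * 10000
    return sum(len(word) for word in words)


def profile_top(func, stats_path, limit=10):
    """Profile func, dump the raw stats to stats_path, and print the top entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func()
    profiler.disable()
    profiler.dump_stats(stats_path)

    # "tottime" is the sort key that pstats reports as "internal time".
    stats = pstats.Stats(stats_path)
    stats.sort_stats("tottime").print_stats(limit)
    return result


stats_file = os.path.join(tempfile.gettempdir(), "profile.stats")
total = profile_top(workload, stats_file)
```

Sorting by tottime ranks functions by the time spent in their own bodies, excluding callees, which makes a single hot line easier to spot than sorting by cumulative time.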

Before optimization:

Tue Oct 30 20:14:19 2018    profile.stats

         697390 function calls (690360 primitive calls) in 0.650 seconds

   Ordered by: internal time
   List reduced from 2079 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    22391    0.141    0.000    0.141    0.000 C:\Users\v-yizzha\Desktop\WordFrequency\modes.py:102(<listcomp>)
     1375    0.061    0.000    0.061    0.000 {built-in method nt.stat}
    22391    0.060    0.000    0.074    0.000 C:\Users\v-yizzha\Desktop\WordFrequency\utils.py:14(get_phrases)
        1    0.045    0.045    0.382    0.382 C:\Users\v-yizzha\Desktop\WordFrequency\modes.py:83(mode_p)
    27395    0.039    0.000    0.039    0.000 {method 'split' of 're.Pattern' objects}
      306    0.023    0.000    0.023    0.000 {built-in method marshal.loads}
    12/11    0.020    0.002    0.023    0.002 {built-in method _imp.create_dynamic}
      306    0.017    0.000    0.027    0.000 <frozen importlib._bootstrap_external>:914(get_data)
    27798    0.011    0.000    0.062    0.000 C:\Users\v-yizzha\AppData\Local\Continuum\anaconda3\envs\nltk\lib\re.py:271(_compile)
1067/1064    0.010    0.000    0.039    0.000 {built-in method builtins.__build_class__}

The most time-consuming entry is the list membership check on line 102 of modes.py:

pre_list = [word for word in pre_list if word not in stop_words]

So I changed stop_words from a list to a set. After the optimization:

Tue Oct 30 20:23:31 2018    profile.stats

         697516 function calls (690485 primitive calls) in 0.510 seconds

   Ordered by: internal time
   List reduced from 2094 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1379    0.060    0.000    0.060    0.000 {built-in method nt.stat}
    22391    0.058    0.000    0.072    0.000 C:\Users\v-yizzha\Desktop\WordFrequency\utils.py:14(get_phrases)
        1    0.040    0.040    0.234    0.234 C:\Users\v-yizzha\Desktop\WordFrequency\modes.py:83(mode_p)
    27395    0.037    0.000    0.037    0.000 {method 'split' of 're.Pattern' objects}
      304    0.023    0.000    0.023    0.000 {built-in method marshal.loads}
    12/11    0.018    0.002    0.020    0.002 {built-in method _imp.create_dynamic}
      308    0.018    0.000    0.028    0.000 <frozen importlib._bootstrap_external>:914(get_data)
    22391    0.011    0.000    0.011    0.000 C:\Users\v-yizzha\Desktop\WordFrequency\modes.py:102(<listcomp>)
1067/1064    0.010    0.000    0.039    0.000 {built-in method builtins.__build_class__}
    27798    0.010    0.000    0.058    0.000 C:\Users\v-yizzha\AppData\Local\Continuum\anaconda3\envs\nltk\lib\re.py:271(_compile)

The list comprehension's time dropped from 0.141 seconds to 0.011 seconds.
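The reason for the gain is that "word not in stop_words" scans a list element by element, while a set answers the same question with a hash lookup. A small sketch of the comparison follows. The data is made up, because the real stop-word list and input text are not shown in the post, so the timings it prints will not match the profile above.

```python
import timeit

# Hypothetical data standing in for the real stop words and input tokens.
stop_list = ["word%d" % i for i in range(200)]
stop_set = set(stop_list)
pre_list = ["word%d" % (i % 1000) for i in range(5000)]


def filter_with(stop_words):
    """Same comprehension as line 102 of modes.py, parameterised on the container."""
    return [word for word in pre_list if word not in stop_words]


# Both containers produce the same filtered list...
same_result = filter_with(stop_list) == filter_with(stop_set)

# ...but the list version compares against stop words one by one,
# while the set version does one hash lookup per token.
list_seconds = timeit.timeit(lambda: filter_with(stop_list), number=5)
set_seconds = timeit.timeit(lambda: filter_with(stop_set), number=5)
print("list: %.4fs  set: %.4fs" % (list_seconds, set_seconds))
```

The change is a one-line conversion at the point where the stop words are loaded, so the comprehension itself does not need to change.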

Below is one of my teammate's optimizations. Before optimization:

Thu Nov  1 18:20:35 2018    proflie.status

         1714748 function calls (1701302 primitive calls) in 1.118 seconds

   Ordered by: internal time
   List reduced from 3945 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    22391    0.179    0.000    0.238    0.000 C:\Users\v-qiyao\Documents\WordFrequency\utils.py:14(get_phrases)
     3163    0.111    0.000    0.111    0.000 {built-in method nt.stat}
   100/78    0.059    0.001    0.085    0.001 {built-in method _imp.create_dynamic}
      741    0.052    0.000    0.052    0.000 {built-in method marshal.loads}
        1    0.041    0.041    0.455    0.455 C:\Users\v-qiyao\Documents\WordFrequency\modes.py:83(mode_p)
    27395    0.040    0.000    0.040    0.000 {method 'split' of '_sre.SRE_Pattern' objects}
   105354    0.035    0.000    0.035    0.000 C:\Users\v-qiyao\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\probability.py:127(__setitem__)
      743    0.035    0.000    0.054    0.000 <frozen importlib._bootstrap_external>:830(get_data)
    992/1    0.032    0.000    1.119    1.119 {built-in method builtins.exec}
        1    0.030    0.030    0.065    0.065 {built-in method _collections._count_eleme

The most time-consuming function is get_phrases, at 0.179 seconds. The original get_phrases looked like this:

    # Consume pre_list from the front until fewer than n tokens remain.
    while len(pre_list) >= n:
        target_phrase = []
        for i in range(n):
            if not_word(pre_list[i]):
                # Drop everything up to and including the non-word token.
                for j in range(i+1):
                    pre_list.pop(0)
                break
            else:
                target_phrase.append(pre_list[i])
        if len(target_phrase) == n:
            target_str = target_phrase[0]
            for i in range(n-1):
                target_str += " "+target_phrase[i+1]
            result.append(target_str)
            pre_list.pop(0)
    return result

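None of these pop operations is necessary, and each pop(0) shifts every remaining element of the list. After optimization: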

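    # Slide a window of n tokens over pre_list without modifying the list.
    for j in range(len(pre_list)+1-n):
        target_phrase = ""
        for i in range(n):
            if not_word(pre_list[i+j]):
                # Note: this assignment does not skip ahead; the for loop rebinds j on its next iteration.
                j += i
                break
            elif target_phrase == "":
                target_phrase += pre_list[i+j]
            else:
                target_phrase += (' ' + pre_list[i+j])
            if i == n-1:
                result.append(target_phrase)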
Profile after optimization:

Thu Nov  1 18:22:38 2018    proflie.status

         1187845 function calls (1174399 primitive calls) in 0.972 seconds

   Ordered by: internal time
   List reduced from 3945 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3163    0.109    0.000    0.109    0.000 {built-in method nt.stat}
    22391    0.095    0.000    0.118    0.000 C:\Users\v-qiyao\Documents\WordFrequency\utils.py:14(get_phrases)
   100/78    0.055    0.001    0.081    0.001 {built-in method _imp.create_dynamic}
      741    0.052    0.000    0.052    0.000 {built-in method marshal.loads}
        1    0.040    0.040    0.336    0.336 C:\Users\v-qiyao\Documents\WordFrequency\modes.py:83(mode_p)
    27395    0.039    0.000    0.039    0.000 {method 'split' of '_sre.SRE_Pattern' objects}
   105544    0.036    0.000    0.036    0.000 C:\Users\v-qiyao\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\probability.py:127(__setitem__)
      743    0.034    0.000    0.053    0.000 <frozen importlib._bootstrap_external>:830(get_data)
        1    0.033    0.033    0.068    0.068 {built-in method _collections._count_elements}
    992/1    0.030    0.000    0.973    0.973 {built-in method builtins.exec}

The time for get_phrases dropped to 0.095 seconds.
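To check that the pop-based and index-based approaches really return the same phrases, the two can be put side by side in a self-contained sketch. Several things here are assumptions: not_word is not shown in the post, so a simple alphabetic check stands in for it; the function names and signatures are invented for illustration; and the window version uses slicing and " ".join rather than the string concatenation in our code.

```python
def not_word(token):
    """Hypothetical predicate: treat any token that is not purely alphabetic as a separator."""
    return not token.isalpha()


def get_phrases_pop(tokens, n):
    """Pop-based approach, as in the original version (works on a copy of the input)."""
    pre_list = list(tokens)
    result = []
    while len(pre_list) >= n:
        target_phrase = []
        for i in range(n):
            if not_word(pre_list[i]):
                # Drop everything up to and including the separator.
                del pre_list[:i + 1]
                break
            target_phrase.append(pre_list[i])
        if len(target_phrase) == n:
            result.append(" ".join(target_phrase))
            pre_list.pop(0)
    return result


def get_phrases_window(tokens, n):
    """Index-based approach: slide a window of n tokens without mutating the list."""
    result = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # A window counts as a phrase only if it contains no separator.
        if not any(not_word(token) for token in window):
            result.append(" ".join(window))
    return result


tokens = ["the", "quick", "brown", ",", "fox", "jumps", "over", "1", "dog"]
print(get_phrases_pop(tokens, 2))
print(get_phrases_window(tokens, 2))
```

Both functions keep exactly the windows of n consecutive tokens that contain no separator, so they agree on this input for every phrase length.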

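These are the two optimizations we made with the help of cProfile, and both paid off. At this point the most time-consuming functions are all built-ins. We paid attention to clean, efficient code from the first draft: by improving the algorithm and removing unnecessary file reads and writes, we once brought a command that took 20 seconds down to under 1 second.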

posted @ 2018-11-03 13:12 yizhuoz
