cs152 lab1

3.4

Note how the mix of different types of instructions vary between benchmarks. Record the mix
for each benchmark. (Remember: Do not provide raw dumps. A good way to visualize this kind
of data would be a bar graph.) Which benchmark has the highest arithmetic intensity? Which
benchmark seems most likely to be memory bound? Which benchmark seems most likely to be
dependent on branch predictor performance?

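multiply has the highest arithmetic intensity.
In the roofline model, the lower a benchmark's arithmetic intensity, the more likely it sits to the left of the ridge point, i.e. is memory bound. Computing the ratio (arithmetic instructions / load-store instructions) gives 31.845/32.147 = 0.9906 for median and 41.702/42.197 = 0.9883 for towers, so the conclusion is towers.
median is the most likely to depend on branch predictor performance: it has the highest share of branch/jump instructions, so it is the most exposed to the branch predictor.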
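The ratio comparison above can be sketched in a few lines of Python. This is only a rough proxy for arithmetic intensity (instructions per memory instruction, not operations per byte), and it uses just the two benchmarks whose percentages are quoted in the answer; the dictionary layout and function name are my own.

```python
# Instruction-mix percentages quoted in the answer above (median and towers only).
mix = {
    "median": {"arithmetic": 31.845, "ldst": 32.147},
    "towers": {"arithmetic": 41.702, "ldst": 42.197},
}

def arith_per_mem(entry):
    """Arithmetic instructions per load/store instruction (a crude intensity proxy)."""
    return entry["arithmetic"] / entry["ldst"]

ratios = {name: arith_per_mem(entry) for name, entry in mix.items()}

# The lowest ratio is the candidate for "most likely memory bound".
most_memory_bound = min(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
print("lowest ratio:", most_memory_bound)
```

The two ratios differ only in the third decimal place, so this proxy separates the two benchmarks very weakly.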

image

image

3.5

Consider the results gathered from the RV32 1-stage processor. Suppose you were to design a
new machine such that the average CPI of loads and stores is 2 cycles, integer arithmetic
instructions take 1 cycle, and other instructions take 1.5 cycles on average. What is the overall
CPI of the machine for each benchmark?

Results for the dhrystone benchmark on the RV32 1-stage processor:

Stats:

CPI          : 1.000
IPC          : 1.000
Cycles       : 245738
Instructions : 245739
Bubbles      : 0

Instruction Breakdown:
% Arithmetic  : 40.379 %
% Ld/St       : 35.324 %
% Branch/Jump : 23.757 %
% Misc.       : 0.541 %

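Here the instruction count is 245739. Arithmetic instructions make up 40.379%, which is about 99227 instructions; likewise, ld/st is about 86804 instructions, branch/jump about 58380, and misc about 1329.

Under the new CPI assumptions, \(cycles = 99227 \times 1 + 86804 \times 2 + 58380 \times 1.5 + 1329 \times 1.5 \approx 362398\), so CPI = 362398/245739 = 1.475.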
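The same CPI can be obtained directly from the percentages, without converting to instruction counts first. A minimal sketch, assuming the four categories printed by the simulator; the key names and the `weighted_cpi` helper are illustrative, not part of the lab tooling.

```python
def weighted_cpi(mix_percent, cpi_by_class):
    """Average CPI as the mix-weighted sum of per-class CPIs (mix given in percent)."""
    return sum(mix_percent[k] * cpi_by_class[k] for k in mix_percent) / 100.0

# dhrystone instruction breakdown from the stats above.
dhrystone_mix = {
    "arithmetic": 40.379,
    "ldst": 35.324,
    "branch_jump": 23.757,
    "misc": 0.541,
}

# Per-class CPIs stated in the question; "other" covers branch/jump and misc.
new_machine = {
    "arithmetic": 1.0,
    "ldst": 2.0,
    "branch_jump": 1.5,
    "misc": 1.5,
}

cpi = weighted_cpi(dhrystone_mix, new_machine)
print(round(cpi, 3))
```

Swapping in another benchmark's breakdown gives its CPI on the same hypothetical machine.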

What is the relative performance for each benchmark if loads/stores are sped up to have an
average CPI of 1? Is this still a worthwhile modification if it means that the cycle time increases
30%? Is it worthwhile for all benchmarks or only a subset? Explain.

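The cycle count then becomes:
\(cycles = 99227 \times 1 + 86804 \times 1 + 58380 \times 1.5 + 1329 \times 1.5 \approx 275594\), so CPI = 275594/245739 = 1.121. At equal cycle time this is 362398/275594 ≈ 1.315 times the performance of the machine with 2-cycle loads and stores.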

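If the cycle time increases by 30%, compare the execution times in units of the original cycle time: 275594 × 1.3 − 362398 = −4125.8. The cycle count goes down and the cycle time goes up, and the net execution time is still slightly lower (362398/358272.2 ≈ 1.012, about 1.2% faster). So for dhrystone, speeding ld/st up to a CPI of 1 is worthwhile even with a 30% longer cycle time, though only marginally.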

The same calculation can be applied to the other benchmarks to complete 3.5.
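To repeat the comparison for other benchmarks, the calculation can be wrapped in a small function. This is a sketch under the question's assumptions (arithmetic = 1 cycle, other = 1.5 cycles, ld/st = 2 or 1 cycles, cycle time 30% longer for the faster ld/st); the function and key names are my own, and only the dhrystone mix is taken from the results above.

```python
def cpi(mix_percent, ldst_cpi):
    """Average CPI for a mix in percent: arithmetic = 1, ld/st = ldst_cpi, others = 1.5."""
    per_class = {"arithmetic": 1.0, "ldst": ldst_cpi, "branch_jump": 1.5, "misc": 1.5}
    return sum(mix_percent[k] * per_class[k] for k in mix_percent) / 100.0

def speedup_with_slower_clock(mix_percent, cycle_time_factor=1.3):
    """Old time / new time per instruction, in units of the original cycle time."""
    old_time = cpi(mix_percent, ldst_cpi=2.0)
    new_time = cpi(mix_percent, ldst_cpi=1.0) * cycle_time_factor
    return old_time / new_time

dhrystone_mix = {
    "arithmetic": 40.379,
    "ldst": 35.324,
    "branch_jump": 23.757,
    "misc": 0.541,
}

s = speedup_with_slower_clock(dhrystone_mix)
print(f"dhrystone speedup with 30% slower clock: {s:.4f}")
print("worthwhile" if s > 1.0 else "not worthwhile")
```

A result above 1 means the change pays off for that benchmark. Algebraically, with mix fractions that sum to 1, the break-even condition works out to a ld/st fraction greater than 0.3 + 0.15 × (fraction of "other" instructions), so benchmarks with a small share of loads and stores would be expected to lose.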

3.6

SETTING1

  • full bypass
  • 5stage

image
image
image

SETTING2

  • interlock
  • 5stage
    image

How does full bypassing perform
compared to full interlocking? If adding full bypassing would hurt the cycle time of the processor
by 25%, would it be worth it? Argue your case quantitatively.

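Full interlock and full bypass were both tested on the same processor, so the cycle time is identical and the cycle counts can be compared directly.
Cycle counts with full interlocking: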

Dhrystone: 481129
Median: 30490
Multiply: 94867
Qsort: 456100
Rsort: 869612
Towers: 30930
Vvadd: 21906
Cycle counts with full bypassing:

Dhrystone: 321955
Median: 24274
Multiply: 78291
Qsort: 335258
Rsort: 405558
Towers: 23604
Vvadd: 16578
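Comparing the two sets of numbers to see which has the lower cycle count:

Dhrystone: full bypass is lower.
Median: full bypass is lower.
Multiply: full bypass is lower.
Qsort: full bypass is lower.
Rsort: full bypass is lower.
Towers: full bypass is lower.
Vvadd: full bypass is lower.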

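For the other metrics:
CPI (clock cycles per instruction): with full bypassing the CPI is lower across the benchmarks, meaning each instruction needs fewer clock cycles on average. This is because full bypassing reduces the stalls caused by data hazards.

IPC (instructions per cycle): with full bypassing the IPC is higher across the benchmarks, meaning more instructions complete per clock cycle. This is a direct result of the pipeline running more efficiently, with less waiting on data dependencies.

Cycles (total clock cycles): the instruction counts are similar, yet full bypassing needs fewer total cycles in all seven benchmarks, so execution is faster overall.

Bubbles: with full bypassing the bubble count is generally lower, which shows that fewer stalls are caused by data hazards.
Conclusion: full bypass performs better.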

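If full bypassing increases the cycle time by 25%, the equivalent execution time can be recomputed (bypass cycles × 1.25 − interlock cycles, in units of the original cycle time; a negative value means full bypass is still faster):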

  1. dhrystone:
    321955 * 1.25 - 481129 = -78,685.25
  2. median:
    24274 * 1.25 - 30490 = -147.5
  3. multiply:
    78291 * 1.25 - 94867 = 2,996.75
  4. qsort:
    335258 * 1.25 - 456100 = -37,027.5
  5. rsort:
    405558 *1.25 - 869612 = -362,664.5
  6. towers:
    23604 * 1.25 - 30930 = -1,425
  7. vvadd:
    16578 * 1.25 - 21906 = -1,183.5

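Conclusion: it is worth it overall. Six of the seven benchmarks still run faster with full bypassing (negative difference), with rsort gaining the most. multiply is the exception: its difference is positive, so with a 25% longer cycle time full bypassing is about 3.2% slower than interlocking there, and median only barely breaks even.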
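The per-benchmark comparison can also be expressed as a speedup, which makes the break-even point explicit: full bypassing wins whenever interlock cycles / bypass cycles exceeds 1.25. A minimal sketch using the cycle counts listed above; the variable and function names are my own.

```python
# Cycle counts from the two 5-stage runs listed above.
interlock_cycles = {
    "Dhrystone": 481129, "Median": 30490, "Multiply": 94867, "Qsort": 456100,
    "Rsort": 869612, "Towers": 30930, "Vvadd": 21906,
}
bypass_cycles = {
    "Dhrystone": 321955, "Median": 24274, "Multiply": 78291, "Qsort": 335258,
    "Rsort": 405558, "Towers": 23604, "Vvadd": 16578,
}

CYCLE_TIME_FACTOR = 1.25  # bypassing is assumed to lengthen the cycle time by 25%

def speedup(name):
    """Interlock execution time divided by bypass execution time (>1 favors bypass)."""
    return interlock_cycles[name] / (bypass_cycles[name] * CYCLE_TIME_FACTOR)

speedups = {name: speedup(name) for name in interlock_cycles}
slower_with_bypass = [name for name, s in speedups.items() if s < 1.0]

for name, s in speedups.items():
    print(f"{name:10s} {s:.3f}")
print("slower with bypass:", slower_with_bypass)
```

Reporting a ratio rather than a cycle difference keeps benchmarks of very different lengths comparable.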

3.7

image
Comparison of the increase in instruction count:

What percentages of the instruction mix do the various types of load and store instructions make
up? Evaluate the new design in terms of the percentage increase in the number of instructions
that will have to be executed. Which design would you advise your employer to adopt? Justify
your position quantitatively.

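In the figure, the yellow Ld/St portion shows the share of the various load and store instructions.
Under the proposed design, the LD/ST instructions with a non-zero offset are each split into two instructions.
The original design has 5 stages and the new design has 4. The instruction count goes up, while the cycle time of each stage does not go down (it still has to match the cycle time of the slowest stage, mem).
Because the processor is pipelined, it completes one instruction per cycle: the latency of the first instruction drops from 5 cycles to 4, but every subsequent instruction still completes at a rate of one per cycle.
The extra instructions clearly increase the total cycle count, so the new design does not look better than the old 5-stage design.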

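However, if a second decoder could be added so that two instructions issue in one clock cycle, one being an ALU instruction that does not touch memory (for example, computing the non-zero offset) and the other a memory-only instruction that does not use the ALU (the address is already computed, so it can access memory directly), then in the ideal case this would cancel out the increase in total cycles caused by the extra instructions from non-zero offsets.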
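The single-issue argument above can be illustrated with a toy cycle model. This is a sketch under strong assumptions: an ideal pipeline with no stalls in either design, and a HYPOTHETICAL share of non-zero-offset accesses, since that share is only available from the figure and is not given as a number in the text. The instruction count and Ld/St share are the dhrystone values from 3.5; the function names are my own.

```python
def instruction_increase(ldst_fraction, nonzero_offset_share):
    """Fractional growth in instruction count when each non-zero-offset
    load/store becomes an address add plus a zero-offset access."""
    return ldst_fraction * nonzero_offset_share

def ideal_cycles(n_instructions, pipeline_depth):
    """Ideal pipeline, no stalls: fill latency for the first instruction,
    then one instruction completes per cycle."""
    return pipeline_depth + (n_instructions - 1)

n = 245739                  # dhrystone instruction count (section 3.5)
ldst_fraction = 0.35324     # dhrystone Ld/St share (section 3.5)
nonzero_offset_share = 0.5  # HYPOTHETICAL placeholder, not a measured value

extra = instruction_increase(ldst_fraction, nonzero_offset_share)
old_cycles = ideal_cycles(n, pipeline_depth=5)
new_cycles = ideal_cycles(round(n * (1 + extra)), pipeline_depth=4)

print(f"instruction increase: {extra:.1%}")
print(f"5-stage cycles: {old_cycles}, 4-stage cycles: {new_cycles}")
```

With any non-trivial share of non-zero offsets, the one cycle saved on pipeline fill is negligible next to the added instructions. The model ignores stalls, so it says nothing about hazards that the shorter pipeline might remove.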
image

posted @ 2023-12-10 23:37  ijpq