Pipeline Structure Modeling with M-Sim

In this lab we will experiment with some of the pipeline structure models M-Sim makes available for superscalar microprocessors.
Recall that one of M-Sim’s primary extensions to SimpleScalar is explicit, cycle-accurate modeling of key pipeline structures. In particular, M-Sim models the Reorder Buffer (ROB), the Issue Queue (IQ), the Load/Store Queue (LSQ), and separate integer and floating-point register files.
The ROB ensures in-order commitment of program instructions despite out-of-order execution. It is modeled as a FIFO buffer: instructions are allocated at the tail and freed at the head when they ultimately commit. The IQ holds instructions waiting to issue and thus controls entry into the execution units. It is modeled as an array of entries, each in one of two states (free or allocated). The LSQ handles memory disambiguation, protecting against RAW, WAW, and WAR hazards among memory operations. The integer and floating-point register files contain physical register arrays with stateful entries (free, allocated, allocated and written back, and architectural).
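The ROB's FIFO behavior can be sketched in a few lines of Python. This is a toy illustration of allocate-at-tail / commit-at-head, not M-Sim's actual implementation:

```python
from collections import deque

class ROB:
    """Toy reorder buffer: allocate at the tail, commit in order from
    the head. Execution may finish out of order, but only a completed
    head entry may commit."""
    def __init__(self, size):
        self.size = size
        self.entries = deque()           # head = oldest in-flight instruction

    def allocate(self, inst):
        if len(self.entries) == self.size:
            return False                 # ROB full: dispatch stalls
        self.entries.append({"inst": inst, "done": False})
        return True

    def complete(self, inst):
        for e in self.entries:           # completion can occur out of order
            if e["inst"] == inst:
                e["done"] = True

    def commit(self):
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["inst"])
        return committed                 # only the head may commit
```

Note that an instruction completed out of order (e.g. the second-oldest) stays buffered until everything older than it has also completed, which is exactly why a full ROB stalls dispatch.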
With regard to these structures, an instruction's path through the SMT pipeline consists of issue-queue selection, followed by register file access, the start of execution, entry into a per-thread load/store queue (for loads and stores), write-back to the register files after execution, and finally commitment from a per-thread ROB.
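The issue-queue selection step above can be illustrated with a toy policy: round-robin across threads, capped by the issue bandwidth. This is only a sketch for intuition; M-Sim's actual selection heuristic may differ:

```python
def select_for_issue(ready_by_thread, issue_width):
    """Pick up to issue_width ready instructions, cycling over threads
    in a fixed order (round-robin). Returns (thread, inst) pairs."""
    selected = []
    queues = {t: list(insts) for t, insts in ready_by_thread.items()}
    while len(selected) < issue_width and any(queues.values()):
        for t in sorted(queues):         # visit threads in a fixed order
            if queues[t] and len(selected) < issue_width:
                selected.append((t, queues[t].pop(0)))
    return selected
```

With two threads and issue width 4, a thread with only one ready instruction contributes that one, and the remaining slots go to the other thread; this is the kind of per-cycle bandwidth sharing Part 1 measures.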
Part 1: Pipeline Structure Modeling: Issue BW
In this part, we will see how M-Sim handles competition among threads for bandwidth at the issue stage. By varying the issue bandwidth (BW), we will see how Throughput IPC is affected in the multi-program SMT experiment from the previous lab (perl and go). The default issue BW is 4 instructions per cycle.
~/msim/msim_v2.0/$ ./sim-outorder -issue:width 4 perl-your_name.arg go-your_name.arg
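The runs for the tables below can be scripted. The sketch uses Python's subprocess module to build one sim-outorder command per parameter value; the -issue:width flag is taken from the command above, and the .arg file names are placeholders for your own:

```python
import subprocess

def sweep(flag, values, arg_files, run=False):
    """Build (and optionally run) one sim-outorder command per value of a
    single knob, e.g. sweep("-issue:width", [1, 2, 4, 8, 16], ...).
    Record the Throughput IPC reported by each run in the table."""
    cmds = []
    for v in values:
        cmd = ["./sim-outorder", flag, str(v)] + list(arg_files)
        cmds.append(cmd)
        if run:
            subprocess.run(cmd, check=True)  # requires a built M-Sim binary
    return cmds

# Example (build only; set run=True from inside the M-Sim directory):
# sweep("-issue:width", [1, 2, 4, 8, 16],
#       ["perl-your_name.arg", "go-your_name.arg"])
```

The same helper can drive Parts 2 and 3; check `./sim-outorder -h` for the exact option names for ROB and LSQ size in your build, since those flag names are not shown above.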
Now fill out the following table:

Issue BW (inst/cycle) | 2-threaded SMT Throughput IPC
1  |
2  |
4  |
8  |
16 |

Table 1: Multi-program SMT
Now answer the following questions:

1) Why do you think the Throughput IPC flattens out above an issue BW of 4 inst/cycle? What might be an alternate bottleneck for IPC beyond this point?

2) Does this indicate a lack of contention for bandwidth above 4 inst/cycle?
Part 2: Pipeline Structure Modeling: ROB Size

In this part we’ll consider the effect of ROB size on Throughput IPC for each of our 5 benchmarks. The default value of 128 will be varied.
Benchmark        | ROB Size: 4 | ROB Size: 16 | ROB Size: 64 | ROB Size: 256 | ROB Size: 1024
anagram.alpha    |             |              |              |               |
go.alpha         |             |              |              |               |
compress95.alpha |             |              |              |               |
cc1.alpha        |             |              |              |               |
perl.alpha       |             |              |              |               |

Table 2: Throughput IPC
Now answer the following questions:

1) Why do you think ROB size has a dramatic influence on Throughput IPC as it grows toward the default value, but little effect beyond it? Does this indicate a steady-state buffer content size confined to a narrow range?

2) For which benchmark is the steady-state ROB content size probably the smallest? For which is it the largest?
Part 3: Pipeline Structure Modeling: LSQ Size

In this part we’ll consider the effect of LSQ size on Throughput IPC for each of our 5 benchmarks. The default value of 48 will be varied.
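The LSQ being resized here is the structure that performs memory disambiguation. A minimal sketch of one piece of that job, checking a load against older buffered stores (a toy illustration, not M-Sim's implementation):

```python
class LSQ:
    """Toy load/store queue: a load searches older stores for a matching
    address and forwards the value if found (a RAW dependence through
    memory); otherwise it reads memory. Entries carry a program-order
    sequence number so younger stores are ignored."""
    def __init__(self, size):
        self.size = size
        self.stores = []                 # (seq, addr, value), program order

    def store(self, seq, addr, value):
        if len(self.stores) >= self.size:
            return False                 # LSQ full: stall
        self.stores.append((seq, addr, value))
        return True

    def load(self, seq, addr, memory):
        # scan youngest-to-oldest for the nearest prior store to this address
        for s_seq, s_addr, s_val in reversed(self.stores):
            if s_seq < seq and s_addr == addr:
                return s_val             # store-to-load forwarding
        return memory.get(addr, 0)       # no older conflicting store
```

When the queue is full, new memory operations stall just as with a full ROB, which is why a too-small LSQ caps Throughput IPC for load/store-heavy codes.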
Benchmark        | LSQ Size: 3 | LSQ Size: 12 | LSQ Size: 48 | LSQ Size: 192 | LSQ Size: 768
anagram.alpha    |             |              |              |               |
go.alpha         |             |              |              |               |
compress95.alpha |             |              |              |               |
cc1.alpha        |             |              |              |               |
perl.alpha       |             |              |              |               |

Table 3: Throughput IPC
Now answer the following questions:

1) Why do you think LSQ size has a dramatic influence on Throughput IPC as it grows toward the default value, but little effect beyond it? Does this indicate a steady-state buffer content size confined to a narrow range?
2) The go benchmark contains about twice the percentage of loads/stores that cc1 does. Can you come up with a plausible explanation for why cc1’s steady-state LSQ content size could still be slightly greater?