Pipeline Structure Modeling with M-Sim

In this lab we will experiment with some of the pipeline structure models M-Sim makes available for superscalar microprocessors.
Recall that one of M-Sim’s primary extensions to SimpleScalar is explicit, cycle-accurate modeling of key pipeline structures. In particular, M-Sim models the Reorder Buffer (ROB), the Issue Queue (IQ), the Load/Store Queue (LSQ), and separate integer and floating-point register files.
The ROB ensures in-order commitment of program instructions despite out-of-order execution. It is modeled as a FIFO buffer: instructions are allocated at the tail and freed at the head when they ultimately commit. The IQ holds instructions waiting to issue and thus controls entry into the execution units. It is modeled as an array of entries, each in one of two states (free or allocated). The LSQ handles memory disambiguation, protecting against RAW, WAW, and WAR hazards among memory operations. The integer and floating-point register files contain physical register arrays with stateful entries (free, allocated, allocated and written back, and architectural).
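The ROB's FIFO behavior can be sketched in a few lines of Python. This is a toy illustration of allocate-at-tail / commit-at-head, not M-Sim's actual implementation:

```python
from collections import deque

class ROB:
    """Toy reorder buffer: allocate at the tail, commit in order from
    the head. Execution may finish out of order, but only a completed
    head entry may commit."""
    def __init__(self, size):
        self.size = size
        self.entries = deque()           # head = oldest in-flight instruction

    def allocate(self, inst):
        if len(self.entries) == self.size:
            return False                 # ROB full: dispatch stalls
        self.entries.append({"inst": inst, "done": False})
        return True

    def complete(self, inst):
        for e in self.entries:           # completion can occur out of order
            if e["inst"] == inst:
                e["done"] = True

    def commit(self):
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["inst"])
        return committed                 # only the head may commit
```

Note that an instruction completed out of order (e.g. the second-oldest) stays buffered until everything older than it has also completed, which is exactly why a full ROB stalls dispatch.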
With regard to these structures, an instruction's path through the SMT pipeline consists of issue-queue selection, followed by register file access, the start of execution, entry into a per-thread load/store queue (for loads and stores), write-back to the register files after execution, and finally commitment from a per-thread ROB.
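The issue-queue selection step above can be illustrated with a toy policy: round-robin across threads, capped by the issue bandwidth. This is only a sketch for intuition; M-Sim's actual selection heuristic may differ:

```python
def select_for_issue(ready_by_thread, issue_width):
    """Pick up to issue_width ready instructions, cycling over threads
    in a fixed order (round-robin). Returns (thread, inst) pairs."""
    selected = []
    queues = {t: list(insts) for t, insts in ready_by_thread.items()}
    while len(selected) < issue_width and any(queues.values()):
        for t in sorted(queues):         # visit threads in a fixed order
            if queues[t] and len(selected) < issue_width:
                selected.append((t, queues[t].pop(0)))
    return selected
```

With two threads and issue width 4, a thread with only one ready instruction contributes that one, and the remaining slots go to the other thread; this is the kind of per-cycle bandwidth sharing Part 1 measures.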
Part 1: Pipeline Structure Modeling: Issue BW
In this part, we will see how M-Sim handles competition among threads for bandwidth at the issue stage. By varying the issue bandwidth (BW), we will see how Throughput IPC is affected in the multi-program SMT experiment from the previous lab (perl and go). The default issue BW is 4 instructions per cycle.
~/msim/msim_v2.0/$ ./sim-outorder -issue:width 4 perl-your_name.arg go-your_name.arg
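The runs for the tables below can be scripted. The sketch uses Python's subprocess module to build one sim-outorder command per parameter value; the -issue:width flag is taken from the command above, and the .arg file names are placeholders for your own:

```python
import subprocess

def sweep(flag, values, arg_files, run=False):
    """Build (and optionally run) one sim-outorder command per value of a
    single knob, e.g. sweep("-issue:width", [1, 2, 4, 8, 16], ...).
    Record the Throughput IPC reported by each run in the table."""
    cmds = []
    for v in values:
        cmd = ["./sim-outorder", flag, str(v)] + list(arg_files)
        cmds.append(cmd)
        if run:
            subprocess.run(cmd, check=True)  # requires a built M-Sim binary
    return cmds

# Example (build only; set run=True from inside the M-Sim directory):
# sweep("-issue:width", [1, 2, 4, 8, 16],
#       ["perl-your_name.arg", "go-your_name.arg"])
```

The same helper can drive Parts 2 and 3; check `./sim-outorder -h` for the exact option names for ROB and LSQ size in your build, since those flag names are not shown above.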
Now fill out the following table:

Issue BW (inst/cycle) | 2-threaded SMT Throughput IPC
1  |
2  |
4  |
8  |
16 |

Table 1: Multi-program SMT
Now answer the following questions:

1) Why do you think the Throughput IPC flattens out above an issue BW of 4 inst/cycle? What might be an alternate bottleneck for IPC beyond this point?

2) Does this indicate a lack of contention for bandwidth above 4 inst/cycle?
Part 2: Pipeline Structure Modeling: ROB Size

In this part we’ll consider the effect of ROB size on Throughput IPC for each of our 5 benchmarks. The default value of 128 will be varied.
Benchmark        | ROB Size: 4 | ROB Size: 16 | ROB Size: 64 | ROB Size: 256 | ROB Size: 1024
anagram.alpha    |             |              |              |               |
go.alpha         |             |              |              |               |
compress95.alpha |             |              |              |               |
cc1.alpha        |             |              |              |               |
perl.alpha       |             |              |              |               |

Table 2: Throughput IPC
Now answer the following questions:

1) Why do you think ROB size has a dramatic influence on Throughput IPC as it grows toward the default value, but little effect beyond it? Does this indicate a steady-state buffer content size confined to a narrow range?

2) For which benchmark is the steady-state ROB content size probably the smallest? For which is it the largest?
Part 3: Pipeline Structure Modeling: LSQ Size

In this part we’ll consider the effect of LSQ size on Throughput IPC for each of our 5 benchmarks. The default value of 48 will be varied.
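The LSQ being resized here is the structure that performs memory disambiguation. A minimal sketch of one piece of that job, checking a load against older buffered stores (a toy illustration, not M-Sim's implementation):

```python
class LSQ:
    """Toy load/store queue: a load searches older stores for a matching
    address and forwards the value if found (a RAW dependence through
    memory); otherwise it reads memory. Entries carry a program-order
    sequence number so younger stores are ignored."""
    def __init__(self, size):
        self.size = size
        self.stores = []                 # (seq, addr, value), program order

    def store(self, seq, addr, value):
        if len(self.stores) >= self.size:
            return False                 # LSQ full: stall
        self.stores.append((seq, addr, value))
        return True

    def load(self, seq, addr, memory):
        # scan youngest-to-oldest for the nearest prior store to this address
        for s_seq, s_addr, s_val in reversed(self.stores):
            if s_seq < seq and s_addr == addr:
                return s_val             # store-to-load forwarding
        return memory.get(addr, 0)       # no older conflicting store
```

When the queue is full, new memory operations stall just as with a full ROB, which is why a too-small LSQ caps Throughput IPC for load/store-heavy codes.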
Benchmark        | LSQ Size: 3 | LSQ Size: 12 | LSQ Size: 48 | LSQ Size: 192 | LSQ Size: 768
anagram.alpha    |             |              |              |               |
go.alpha         |             |              |              |               |
compress95.alpha |             |              |              |               |
cc1.alpha        |             |              |              |               |
perl.alpha       |             |              |              |               |

Table 3: Throughput IPC
Now answer the following questions:

1) Why do you think LSQ size has a dramatic influence on Throughput IPC as it grows toward the default value, but little effect beyond it? Does this indicate a steady-state buffer content size confined to a narrow range?
2) The go benchmark contains about twice the percentage of loads/stores that cc1 does. Can you come up with a plausible explanation for why cc1’s steady-state LSQ content size could still be slightly greater?