Multi-threaded Simulation with M-Sim
In this lab we shall experiment with M-Sim's multithreading capability. Recall that M-Sim can execute multiple threads concurrently, according to the SMT model. Simultaneous multithreading (SMT) is an increasingly popular technique for raising the throughput of superscalar processors. A superscalar processor already executes multiple instructions concurrently; adding multithreading lets it execute multiple instructions from multiple threads in the same cycle. This combination is what defines SMT.
SMT allows several of the key datapath components to be shared between threads. For example, a shared issue queue creates the possibility of more efficient execution, since a new instruction can enter the queue if at least one thread is capable of issuing an instruction, rather than relying on a single thread for issue. This can be useful in chat applications or other instances where multiple asynchronous communications occur simultaneously.
Many modern microprocessors implement SMT, including Intel's Pentium 4, the latest MIPS architecture, and IBM's POWER5. See [1] for more information.
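The benefit of sharing the issue stage can be illustrated with a toy model. The sketch below is not how M-Sim works internally; it assumes a hypothetical per-cycle count of ready instructions for each thread and a fixed issue width, and it ignores instructions left over from earlier cycles. The function name throughput_ipc and the ready counts are made up for illustration only.

```python
def throughput_ipc(ready_by_thread, issue_width):
    """Average instructions issued per cycle when all threads share one issue stage.

    ready_by_thread[t][c] is the number of instructions thread t has ready in
    cycle c. Each cycle issues every ready instruction up to issue_width;
    leftovers are NOT carried into the next cycle (a deliberate simplification).
    """
    num_cycles = len(ready_by_thread[0])
    issued = 0
    for cycle in range(num_cycles):
        ready_now = sum(thread[cycle] for thread in ready_by_thread)
        issued += min(issue_width, ready_now)
    return issued / num_cycles


# Made-up ready counts for illustration only; these are not M-Sim measurements.
thread_a = [4, 0, 0, 4]  # stalls in cycles 1 and 2
thread_b = [0, 3, 3, 0]  # stalls in cycles 0 and 3

print(throughput_ipc([thread_a], issue_width=4))            # thread A alone
print(throughput_ipc([thread_b], issue_width=4))            # thread B alone
print(throughput_ipc([thread_a, thread_b], issue_width=4))  # both threads sharing the issue stage
```

In this toy trace each thread fills the cycles in which the other is stalled, so the shared issue stage is busy more often than with either thread alone. If both threads had a full set of ready instructions in every cycle, the issue width would cap the total and the threads would compete for slots instead.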
Part 1: Single Threaded Execution vs. Multi-threaded Execution: Single Instance vs. Multiple Instances
In this section, we will run single-threaded, 2-threaded SMT, and 4-threaded SMT simulations for each of the 5 benchmarks from the previous exercise. That is, we will run one instance of the benchmark, then two threads of the same benchmark, then 4 threads of the same benchmark. As an example sequence, run the following commands for the anagram benchmark, recording the Throughput IPC (the very last statistic in the output; we won't bother saving to a 'results' file) in Table 1.
~/msim/msim_v2.0/$ ./sim-outorder anagram-your_name.arg
~/msim/msim_v2.0/$ ./sim-outorder anagram-your_name.arg anagram-your_name.arg
IMPORTANT NOTE: You may run out of physical registers when running multithreaded simulations. M-Sim sets the total number of physical registers with the "-rf:size num_phy_regs" option. The num_phy_regs value covers both the architectural registers of every thread (num_arch_regs per thread) and the renaming registers (num_rename_regs). Register renaming requires at least one renaming register, so
num_rename_regs = num_phy_regs - num_arch_regs * thread_num
must be greater than 0. With 4 threads and the default values, no renaming registers remain (num_rename_regs = 128 - 32*4 = 0), because the default num_phy_regs is only 128.
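The register budget can be checked with a few lines of Python. This is a minimal sketch of the formula in the note, assuming 32 architectural registers per thread as in the example there; the function names rename_regs and min_rf_size are hypothetical helpers, not part of M-Sim.

```python
ARCH_REGS_PER_THREAD = 32  # value used in the note's example (assumption)


def rename_regs(num_phy_regs, thread_num, num_arch_regs=ARCH_REGS_PER_THREAD):
    """Renaming registers left after every thread gets its architectural registers."""
    return num_phy_regs - num_arch_regs * thread_num


def min_rf_size(thread_num, num_arch_regs=ARCH_REGS_PER_THREAD):
    """Smallest -rf:size value that leaves at least one renaming register."""
    return num_arch_regs * thread_num + 1


for threads in (1, 2, 4):
    print(threads, "thread(s):",
          "default 128 leaves", rename_regs(128, threads),
          "| 512 leaves", rename_regs(512, threads),
          "| minimum -rf:size is", min_rf_size(threads))
```

A result of 0 or less from rename_regs means the simulation cannot rename registers with that register file size, so a larger "-rf:size" value is needed.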
For the benchmarks we've been using, a register file size of 512 will be enough in every case, though you should perform the calculation above to verify this yourself. As an example, for the 4-threaded anagram run, execute:
~/msim/msim_v2.0/$ ./sim-outorder -rf:size 512 anagram-your_name.arg anagram-your_name.arg anagram-your_name.arg anagram-your_name.arg
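Typing the same argument file several times is error-prone, so the command lines for the 1-, 2-, and 4-threaded runs can be generated instead. The sketch below only builds and prints the commands shown in this section; build_command and multi_instance_command are hypothetical helper names, and the only simulator option used is the "-rf:size" option described above.

```python
def build_command(arg_files, rf_size=None, simulator="./sim-outorder"):
    """Return the sim-outorder command as a list of arguments, one .arg file per thread."""
    cmd = [simulator]
    if rf_size is not None:
        # Enlarge the physical register file, e.g. for 4-threaded runs.
        cmd += ["-rf:size", str(rf_size)]
    cmd += list(arg_files)
    return cmd


def multi_instance_command(arg_file, thread_num, rf_size=None):
    """Command that runs thread_num instances of the same benchmark."""
    return build_command([arg_file] * thread_num, rf_size)


for threads in (1, 2, 4):
    # Use the larger register file only for the 4-threaded run, as in the example above.
    rf = 512 if threads == 4 else None
    print(" ".join(multi_instance_command("anagram-your_name.arg", threads, rf)))
```

Run the printed commands from the ~/msim/msim_v2.0/ directory. The same list could also be passed to Python's subprocess.run if you prefer to launch the simulator from a script.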
Now fill out the following table with the Throughput IPC of each run:
Benchmark | Single Thread | 2-threaded SMT | 4-threaded SMT |
anagram.alpha | | | |
go.alpha | | | |
compress95.alpha | | | |
cc1.alpha | | | |
perl.alpha | | | |
Table 1: Multi-instance SMT
Now answer the following questions:
1) From the information in Table 1, what is the general trend in Throughput IPC as multi-instance SMT is performed? Why do you think this might be?
2) Can you imagine a program for which 4-threaded SMT might outperform 2-threaded SMT? Do you think these types of programs are the majority?
3) Given your observations, why do you think the Intel Pentium 4 and IBM POWER5 processors use 2-threaded SMT?
Part 2: Single Threaded Execution vs. Multi-threaded Execution: Concurrent Programs of Different Types
In this section, we'll perform an SMT simulation using two different programs. Compatibility issues can make some combinations of programs crash. Among the benchmarks we've been working with, perl and go (in that order) have been demonstrated to run together.
Execute the following command:
~/msim/msim_v2.0/$ ./sim-outorder perl-your_name.arg go-your_name.arg
Now fill out Table 2:
Benchmark | 2-threaded SMT |
perl.alpha & go.alpha | |
go.alpha & go.alpha | |
perl.alpha & perl.alpha | |
Table 2: Multi-program SMT
Now answer the following question:
1) What does the information in Table 2 demonstrate about resource contention among threads?
References:
[1] http://en.wikipedia.org/wiki/Simultaneous_multithreading