Cache Performance Analysis Tool


A final project for: ECE 551, Computer Systems Manufacturing Lab

Dept. of ECE, UMASS Amherst

Background

Determining the most efficient cache for a given CPU and workload can be very difficult. Two techniques are in common use: the analytical cache model and trace simulation. Trace simulation involves recording gigabyte-sized traces, stripping them, and then running long simulations on the stripped traces. The analytical cache model is fast and convenient but in some situations grossly inaccurate.

The objective of this project is to design a programmable cache analyzer (PCA) capable of measuring cache performance in real time.

Project Description

The PCA is programmed with a cache directory of a particular organization. The PCA monitors the address lines of the CPU and uses the cache directory to compute hits and misses. By programming the PCA with different cache directory organizations we are able to determine the performance of caches with varying line size, associativity, and cache size.

Design Process

To simulate a particular cache organization we only need to store the cache's directory (its tags), not the cached data. Recall that

K = associativity = number of lines per set

L = line size = number of words in a line

S = number of sets in the cache

The address is broken into three parts: the tag bits, the set bits, and the word-within-line (w/l) bits. The sizes of these fields are given by

Size_w/l = log2(L)

Size_set = log2(S)

Size_tag = address size - Size_set - Size_w/l
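The field sizes above can be illustrated with a small sketch. It assumes that L and S are powers of two and that addresses are word addresses; the function names and the 24-bit address width in the usage note are hypothetical and are not taken from the project.

```python
def field_sizes(address_bits, line_words, num_sets):
    """Return (tag_bits, set_bits, word_bits) for one cache organization.

    Assumes line_words (L) and num_sets (S) are powers of two, so that
    log2 of each is a whole number of bits.
    """
    word_bits = line_words.bit_length() - 1   # log2(L)
    set_bits = num_sets.bit_length() - 1      # log2(S)
    tag_bits = address_bits - set_bits - word_bits
    return tag_bits, set_bits, word_bits


def split_address(address, line_words, num_sets):
    """Split a word address into (tag, set index, word within line)."""
    word_bits = line_words.bit_length() - 1
    set_bits = num_sets.bit_length() - 1
    word = address & (line_words - 1)               # lowest log2(L) bits
    set_index = (address >> word_bits) & (num_sets - 1)
    tag = address >> (word_bits + set_bits)         # all remaining high bits
    return tag, set_index, word
```

For example, with a hypothetical 24-bit address, L = 4, and S = 64, `field_sizes(24, 4, 64)` gives a 16-bit tag, 6 set bits, and 2 word-within-line bits.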

The PCA stores tags in its directory. The directory consists of S sets, each with K tags. Each tag has Size_tag bits plus a valid bit. When a memory access occurs we must take the following actions:

- test for a miss or a hit

- update the appropriate counter

- in the case of a miss, place the tag at a random location in the active set

To identify a hit or a miss we search through the set specified by the address, looking for a tag that matches the address tag. To implement this functionality we first multiplex the sets into a single set selected by the set bits of the address. The tag stored in each line of this set is compared with the tag of the address. The result of each comparison is ANDed with the valid bit of its line. These per-line results are then ORed together to determine whether a hit or a miss occurred.
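The hit test described above can be sketched in software as follows. This is only a behavioral illustration, not the AHDL used in the project; the `directory` layout (a list of sets, each a list of `(valid, tag)` pairs) and the name `is_hit` are assumptions made for the example.

```python
def is_hit(directory, set_index, addr_tag):
    """Return True if addr_tag is present and valid in the selected set.

    directory[s][k] is assumed to be a (valid, tag) pair for line k of set s.
    """
    active_set = directory[set_index]            # multiplex: select one set
    per_line = [valid and tag == addr_tag        # compare, then AND with valid
                for valid, tag in active_set]
    return any(per_line)                         # OR the per-line results
```

Note that a matching tag in a line whose valid bit is clear does not count as a hit, which is why the valid bit is ANDed in before the final OR.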

Updating the appropriate counter is easy once a miss or hit is computed. When an access occurs, the miss counter is enabled on a miss and the hit counter is enabled on a hit.

On a miss we must place the address tag at some random line within the set. To accomplish this, the inputs of the line registers are connected to the address tag lines. The load (write) enable of a register is asserted if a miss occurred, the set is the active set, and the least significant bits of the miss count are equal to the line number within the set. Because the miss count and the accesses to a particular set are independent, using the lower bits of the miss count to choose the line to replace is effectively random.
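Putting the three actions together, a behavioral model of the directory might look like the sketch below. The class and method names are hypothetical. It assumes K, L, and S are powers of two, and that the victim line is selected from the miss count as it stands before the counter is incremented for the current miss; the source does not state which value the hardware sees.

```python
class CacheDirectory:
    """Behavioral sketch of the PCA directory (tags and valid bits only)."""

    def __init__(self, assoc, line_words, num_sets):
        # assoc (K), line_words (L), num_sets (S) are assumed powers of two
        self.assoc = assoc
        self.num_sets = num_sets
        self.word_bits = line_words.bit_length() - 1
        self.set_bits = num_sets.bit_length() - 1
        self.valid = [[False] * assoc for _ in range(num_sets)]
        self.tags = [[0] * assoc for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Process one word address; return True on a hit, False on a miss."""
        set_index = (address >> self.word_bits) & (self.num_sets - 1)
        tag = address >> (self.word_bits + self.set_bits)

        # Hit test: any valid line in the active set with a matching tag
        for line in range(self.assoc):
            if self.valid[set_index][line] and self.tags[set_index][line] == tag:
                self.hits += 1
                return True

        # Miss: the low bits of the miss count pick the line to replace
        # (assumption: the pre-increment count is used)
        victim = self.misses & (self.assoc - 1)
        self.tags[set_index][victim] = tag
        self.valid[set_index][victim] = True
        self.misses += 1
        return False
```

Constructing several `CacheDirectory` objects with different `assoc`, `line_words`, and `num_sets` values and feeding them the same address stream mirrors the way the PCA is reprogrammed to compare cache organizations.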

After determining the miss and hit counts, the CPU needs to access these values so that they can be displayed on the screen. To do this we memory-mapped the PCA to the same addresses as the PIT, which was specified in project 1 but never used. This allowed us to access the PCA without modifying our original PLD. Similarly, DTACK from the PCA uses the DTACK originally intended for the PIT. The miss and hit registers are mapped to four consecutive words starting at the base address of the PIT. When the CPU accesses these locations the PCA recognizes the access, bypasses the hit and miss detection, and asserts the appropriate counter values on the data bus. Once the logic for handling miss counter and hit counter reads was complete, modifying the project 4 code to access and display these values was trivial.
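Once the four words have been read, turning them into counts and a hit ratio is simple arithmetic. The sketch below makes assumptions that the text does not confirm: that each counter is 32 bits wide and occupies two 16-bit words, that the high word comes first, and that the miss counter precedes the hit counter. The function names are hypothetical.

```python
def decode_counters(words):
    """Combine four 16-bit words into (miss_count, hit_count).

    Assumed layout (not confirmed by the source):
    words[0:2] = miss counter, words[2:4] = hit counter, high word first.
    """
    if len(words) != 4:
        raise ValueError("expected exactly four 16-bit words")
    miss_count = ((words[0] & 0xFFFF) << 16) | (words[1] & 0xFFFF)
    hit_count = ((words[2] & 0xFFFF) << 16) | (words[3] & 0xFFFF)
    return miss_count, hit_count


def hit_ratio(miss_count, hit_count):
    """Fraction of accesses that hit; 0.0 if no accesses were counted."""
    total = miss_count + hit_count
    return hit_count / total if total else 0.0
```

If the actual register order or word order differs, only the indexing in `decode_counters` would need to change.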

We used Altera's FLEX 8282 PLD to implement this project.

Design entry was accomplished using Max+Plus II.

Manufacturability

The PCA tool described above would most likely not be a consumer-oriented product; thus, the designers of such a system would not face many of the manufacturing issues, such as time to market and testability, that concern most producers of stand-alone commercial products.

There are a number of reasons the system described above would not be manufactured for commercial use. First, there is little market for cache performance analyzers; few people or organizations have a use for such a device. Second, for the system to work, it must be properly interfaced to the bus protocol of the system under test. Thus, it would be difficult to create a modular design that would work well for testing any system.

The people most likely to design, implement, and eventually use the cache analysis system are memory system designers working on optimizing their own design. This group would need to know the cache configuration best suited to their overall memory system (which could consist of multiple layers of cache and a main memory unit).

Costs for such a system would be minimal. All that is required to implement the design is a programmable device large enough to contain the logic and memory necessary to implement a cache directory for the maximum cache size to be tested. Development time would be minimal as well. The design entry (in our case, AHDL) would be done such that only minor changes would be necessary to test different cache models, which would certainly speed up testing.

Acknowledgements

We would like to thank Keith Shimeld and Ed O'Donnell for helping us gather the equipment and documentation necessary to implement this project. They have been a big help to us and to the rest of the class all semester.


jstevens@bnlux1.bnl.gov

umanoff@kira.ecs.umass.edu

burleson@ecs.umass.edu

(Last Update: 12/22/95)