PacketBench v 1.0.0 README
==========================

PacketBench is a programming framework for workload characterization of
network processing. More details can be found in:

  Ramaswamy, R., Wolf, T.: "PacketBench: A Tool for Workload
  Characterization of Network Processing", in Proc. of 6th IEEE Annual
  Workshop on Workload Characterization (WWC-6), pp. 42-50, Austin, TX,
  October 2003
  http://www.ecs.umass.edu/ece/wolf/papers/wwc2003.pdf

It is not a tool in the traditional sense, but provides a programming
environment which can be used to measure detailed statistics on the
processing characteristics of networking applications run on a typical
router/network processor.

PacketBench is run on a processor simulator to obtain measurements. We use
the SimpleScalar tools as our processor simulator; more specifically, the
ARM version of SimpleScalar (simplesim-arm-0.2).

Using PacketBench, you can get measurements such as:

1) Instructions executed for processing a packet for a given application.
2) Number of accesses to packet memory and program state while processing
   a packet for an application.
3) Computation of instruction core and memory core sizes for an
   application.
4) Generation of instruction and memory access patterns.
5) Generation of a dynamic instruction trace of the code executed while
   processing a packet.

Authors/Contact:
================

Ramaswamy Ramaswamy (rramaswa [at] ecs [dot] umass [dot] edu)
Tilman Wolf (wolf [at] ecs [dot] umass [dot] edu)

Requirements:
=============

The PacketBench/SimpleScalar combination has been tested on the following
system:

1) x86 processor running RedHat Linux 9.0
   (gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5))

Additionally, you will need the following in order to use PacketBench:

1. SimpleScalar ARM (simplesim-arm-0.2) from
   http://www.simplescalar.com/v4test.html
2. A cross compiler for the ARM architecture. This can be found at the
   same web page above. (gcc-2.95.2, binutils-2.10, and glibc-2.1.3)
3.
   libpcap libraries *IF* you require the ability to process TCPDUMP trace
   files.
4. libpcap compiled for the ARM architecture (using the tools in (2)
   above) *IF* you require the ability to read from/write to TCPDUMP trace
   files.

Note: PacketBench by default includes support for reading from/writing to
TCPDUMP trace files. As such, the code will require libpcap (in (3) and
(4) above) to be present for PacketBench to compile. The source code will
need to be modified if TCPDUMP support is not required.

Directory structure:
====================

flow_class    --> Flow classification application
    arm
    x86
ipsec         --> IPSec application
    arm
    x86
ipv4-lctrie   --> IPv4 forwarding application with LC-trie table lookup
    arm
    x86
ipv4-radix    --> IPv4 forwarding application with radix tree lookup
    arm
    x86
simple        --> Simple "barebones" application
    arm
    x86
sscalar       --> SimpleScalar specific files
traces        --> Sample trace file in TSH format

Four sample applications are provided. Each application directory has two
subdirectories: "x86", which contains the version of the application used
for development and debugging on the x86 system, and "arm", which contains
the version of the application that has been cross compiled for the ARM
architecture. The "arm" version is the one run on SimpleScalar to obtain
statistics. Note that the code for both versions of an application is
identical. Separate copies are made for the sake of convenience.

In addition to the four sample applications, a fifth "barebones"
application is provided in the directory called "simple" to provide a
better understanding of how PacketBench works. This application extracts
the IP header from the packet and prints out some header information.
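
As a rough illustration of the kind of work the "simple" application does,
the C sketch below pulls a few fields out of an IPv4 header. It is not taken
from the PacketBench sources: the names ipv4_summary, parse_ipv4_header, and
print_ipv4_summary are hypothetical, the pointer is assumed to reference the
first byte of the IP header, and the processing function signature that
bench.c actually expects may differ. Printing is wrapped in "#ifdef VERBOSE"
so that a measurement build prints nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical summary of IPv4 header fields; not a PacketBench type. */
struct ipv4_summary {
    uint8_t  version;     /* IP version, 4 for IPv4 */
    uint8_t  header_len;  /* header length in bytes */
    uint16_t total_len;   /* total datagram length in bytes */
    uint8_t  ttl;         /* time to live */
    uint8_t  protocol;    /* payload protocol number, e.g. 6 for TCP */
    uint32_t src;         /* source address, host byte order */
    uint32_t dst;         /* destination address, host byte order */
};

/* Read a 32-bit big-endian value without relying on host byte order. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Assumes pkt points at the first byte of the IP header.
 * Returns 0 on success, -1 if the buffer is too short or not IPv4. */
int parse_ipv4_header(const uint8_t *pkt, size_t len, struct ipv4_summary *out)
{
    assert(out != NULL);
    if (pkt == NULL || len < 20)
        return -1;
    out->version = (uint8_t)(pkt[0] >> 4);
    out->header_len = (uint8_t)((pkt[0] & 0x0f) * 4);
    if (out->version != 4 || out->header_len < 20 || len < out->header_len)
        return -1;
    out->total_len = (uint16_t)((pkt[2] << 8) | pkt[3]);
    out->ttl = pkt[8];
    out->protocol = pkt[9];
    out->src = read_be32(pkt + 12);
    out->dst = read_be32(pkt + 16);
    return 0;
}

/* Print the parsed fields only when VERBOSE is defined. */
void print_ipv4_summary(const struct ipv4_summary *h)
{
#ifdef VERBOSE
    printf("IPv%u hlen=%u len=%u ttl=%u proto=%u src=%08lx dst=%08lx\n",
           (unsigned)h->version, (unsigned)h->header_len,
           (unsigned)h->total_len, (unsigned)h->ttl, (unsigned)h->protocol,
           (unsigned long)h->src, (unsigned long)h->dst);
#else
    (void)h; /* nothing is printed in measurement builds */
#endif
}
```

Parsing byte by byte, rather than casting the buffer to a struct, keeps the
sketch independent of host byte order and structure padding, so the same
code behaves identically in the "x86" and "arm" builds.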

Installation:
=============

1) Untar the archive to the directory of your choice.
2) To incorporate the statistics collection features of PacketBench into
   SimpleScalar:
   a) Make a backup copy of the original sim-profile.c in your
      SimpleScalar installation (usually simplesim-arm/sim-profile.c).
   b) Copy the file sim-profile.c from the "sscalar" subdirectory into
      the SimpleScalar installation directory, overwriting the original
      sim-profile.c.
   c) Rebuild sim-profile (make sim-profile).

Known Issues/Limitations:
=========================

1) PacketBench by default includes support to read/write TCPDUMP files.
   To compile/run out of the box you will need libpcap installed on your
   system and cross-compiled for the ARM. The source code will need to be
   modified if TCPDUMP support is not required.
2) Writing back processed packets to TSH trace files has not been
   implemented at this time.
3) The top (head) of the input packet queue is always used to store the
   next packet to be processed. PacketBench was designed so that the
   packet processing application can store back (or "re-enqueue") a
   processed packet into the input packet queue for future processing.
   This code path has not yet been extensively tested.
4) In the "ipv4-radix" application, the x86 version will cause a
   segmentation fault if compiled with an optimization level higher than
   0 (-O0). This is not an issue of great concern since the x86 versions
   are mainly used for code development and not measurement.

PacketBench basics:
===================

PacketBench consists of two main parts: the PacketBench framework and the
PacketBench application. The PacketBench framework is responsible for
packet input/output from/to trace files and is contained in the bench.c
and bench.h files. The framework provides a pointer to the packet which
the PacketBench application can use. The PacketBench application (which
the user is supposed to provide) contains the code which will be used to
process a packet.
The application processing function is hooked into the PacketBench
framework by means of a function pointer.

PacketBench reads in packet data from an input tracefile and writes
processed packets back to an output tracefile. Currently, two trace
formats are supported: the TCPDUMP format (libpcap is required) and the
TSH format from NLANR (http://nlanr.net). At present, writing packets back
to TSH files has not been implemented.

The data structure used to store the packets is a queue of length 64
(#define'd in bench.h), with each entry of size PACKET_SIZE bytes
(#define'd in bench.h). The data structure (called "packet") is defined in
bench.h. The current packet to be processed is always stored at the top of
the queue and a pointer to this location is passed to the application.
This location is overwritten with the next packet to be processed when the
application requires that the current packet exit the system.
Alternatively, the application can store the packet back in the queue by
returning a different value.

Constants and Defines:
======================

In bench.h
----------

1. GLOBAL_PACKET_MEMORY
   Controls whether memory for the packet queue described above is
   allocated locally or globally. This MUST be defined if any type of
   memory analysis needs to be performed (explained under "Compiling the
   application for measurement").

In Makefile
-----------

1. VERBOSE
   If defined, this switch prints diagnostic messages which may aid in
   debugging. This switch must be DISABLED during measurement so that
   instructions/memory accesses due to printf statements are not counted.
2. BENCHMARK_FUNCTION
   This is the hook through which the name of the packet processing
   function is passed, as a function pointer, to the application.

Compiling the application for measurement:
==========================================

When compiling the application for measurement (i.e. the "arm" version),
certain important points have to be followed.

1.
   In order to mimic the hardware operation as closely as possible, the
   final application should not print anything to the screen during
   measurement. If it does, SimpleScalar will also count the instructions
   required for displaying messages, and the reported instruction counts
   will be grossly exaggerated. It is recommended to enclose all print
   statements in a "#ifdef VERBOSE" and "#endif" clause to allow for easy
   debugging/development. Do not pass the -DVERBOSE switch to gcc at
   compile time for the "arm" version.
2. An optimization level of "-O2" must be used with gcc. This will cause
   the compiler to do all possible optimizations except architecture
   specific ones.
3. The debug flag (-g) must be turned off.
4. If any type of memory statistics is required, the
   "GLOBAL_PACKET_MEMORY" flag must be defined in bench.h. PacketBench
   distinguishes between two types of memory regions: packet memory
   (the memory required to store the packet headers and payload) and
   program state (everything else, such as lookup tables for IP
   forwarding or IVs for encryption). In order for SimpleScalar to decide
   whether a particular memory access is to packet memory or not, it
   needs to know the memory addresses of the area which is to be
   considered packet memory (in PacketBench, this is the input packet
   queue). These addresses can be obtained at compile time by defining
   this data structure as a global variable. This is exactly what
   "GLOBAL_PACKET_MEMORY" does.

SimpleScalar - PacketBench specific flags:
==========================================

1) -start_addr
   Starting address of the packet processing function, at which
   SimpleScalar starts maintaining statistics. Obtained from "objdump".
   Default value: 0
2) -end_addr
   Ending address of the packet processing function, at which
   SimpleScalar stops maintaining statistics. Obtained from "objdump".
   Default value: 0
3) -imem
   Flag to toggle (0 = off, 1 = on) computation of instruction memory
   size.
   If enabled, it will create a file called "imem.dat" which needs to be
   post processed in order to compute instruction memory size.
   Default value: 0
4) -dmem
   Flag to toggle (0 = off, 1 = on) computation of data memory size. If
   enabled, it will create a file called "dmem.dat" which needs to be
   post processed in order to compute data memory size.
   Default value: 0
5) -mem_pattern
   Print out the memory access pattern for the packet whose number is
   given as an unsigned integer. Output is written to a file called
   "mem_pattern.dat".
   Default value: 0
6) -instr_pattern
   Print out the instruction pattern for the packet whose number is given
   as an unsigned integer. Output is written to a file called
   "instr_pattern.dat".
   Default value: 0
7) -mem_start
   This value defines the starting address of packet memory. Obtained
   from "nm".
   Default value: 0
8) -mem_end
   This value defines the ending address of packet memory. Obtained from
   "nm".
   Default value: 0
9) -pb_verbose
   Toggles (0 = off, 1 = on) verbose display of the dynamic instruction
   trace, which contains all the instructions executed in order to
   process packets. Output is written to "pb_intr.trace".
   Default value: 0
10) -func_calls
   Specifies the number of packets for which SimpleScalar is required to
   run. Provides an alternate way of stopping the simulator.
   Default value: 100

PacketBench flow:
=================

1. Start by developing the packet processing application with the "x86"
   version.
2. When you are satisfied that the application works as intended, use the
   template Makefile provided in the "arm" directory and cross-compile
   the application for the ARM, ensuring that the points raised in the
   section "Compiling the application for measurement" are dealt with.
3. We need to provide valid values for "start_addr", "end_addr",
   "mem_start", and "mem_end". These are obtained by examining the binary
   file created in step (2) above.
   a. In order to get "start_addr" and "end_addr", we need to disassemble
      the binary obtained from step (2) above (e.g.
      bench.arm):

        objdump -d --prefix-addresses bench.arm > bench.arm.dis

      Make sure that you use the ARM version of objdump (it should be
      installed when you install binutils-2.10). Examine bench.arm.dis
      and note down the starting address of the packet processing
      function. This is the value for "start_addr". For "end_addr", the
      value depends on the type of trace file being used. It is either
      the address of "write_packet_to_tsh_file()" or the address of
      "write_packet_to_tcpdump_file()", depending on whether you use TSH
      or TCPDUMP files respectively. The values for "start_addr" and
      "end_addr" mark the entry and exit points of the packet processing
      function.
   b. In order to get "mem_start" and "mem_end", we need to examine the
      symbol table of the binary obtained from step (2) above (e.g.
      bench.arm):

        nm --numeric-sort bench.arm > bench.arm.st

      Make sure that you use the ARM version of nm (it should be
      installed when you install binutils-2.10). Examine bench.arm.st and
      note down the starting address of the packet queue. The variable is
      called "top_packet" by default. This is the value for "mem_start".
      The address of the variable that follows it in the listing is the
      value for "mem_end". These two values define the location of packet
      memory.
4. Run SimpleScalar. Depending on what you want to measure, a typical
   command line would look like the following:

     sim-profile -start_addr 0x02000a88 -end_addr 0x02000194 \
       -mem_start 0x02098600 -mem_end 0x020b8600 \
       -pb_verbose 1 -func_calls 500 \
       ./bench.arm -N ./maewest.table trace.tsh dump.tsh drop.tsh

   The command line above specifies valid values for the starting and
   ending addresses of both the function of interest and the packet
   memory region. Additionally, a dynamic instruction trace for the first
   500 packets processed will be created in a file called
   "pb_intr.trace".
5. Post-processing
   To compute the instruction and data memory sizes, the files "imem.dat"
   and "dmem.dat" need to be post processed.
   These two files contain an address trace of every instruction/memory
   access which occurred. To compute the size:

     sort -u imem.dat > imem.dat.sort
     wc -l imem.dat.sort

   This gives the number of unique instruction/memory addresses that were
   accessed while the application was executed. Since the architecture is
   32 bit, the output of the "wc" command times 4 is the size of the
   instruction store/program memory in bytes.

   To obtain memory access patterns, the contents of "mem_pattern.dat"
   need to be graphed. This is a simple two column file where the first
   column represents the instruction number. The second column is an
   integer from 0 to 4, where:

     0 -> no memory access
     1 -> write to packet memory
     2 -> read from packet memory
     3 -> write to non-packet memory (program state)
     4 -> read from non-packet memory (program state)

   To obtain instruction access patterns, the contents of
   "instr_pattern.dat" need to be graphed. This is also a two column file
   where the first column represents the instruction number. The second
   column represents the instruction address of that particular
   instruction. The perl script "unique-instr.pl" in the "perl"
   subdirectory will process this file and return a third column of
   values, which is a unique index for each new instruction that is
   executed. Graphing column 3 vs. column 1 will provide an instruction
   pattern plot.
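
The two post-processing computations described above (counting unique
addresses, and assigning a unique index to each newly seen instruction
address) can be sketched in C as follows. This is only an illustration of
the described behavior, not a port of "unique-instr.pl": the function names
are hypothetical, reading the .dat files is left out, and numbering the
indices from 0 is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assign each address an index in order of first appearance.
 * index_out (optional, n entries) receives the index for every trace
 * entry; indices start at 0, which is an assumption of this sketch.
 * Returns the number of distinct addresses, or (size_t)-1 if memory
 * allocation fails. The linear search keeps the sketch short but is
 * quadratic in the worst case. */
size_t assign_unique_indices(const uint32_t *addrs, size_t n,
                             size_t *index_out)
{
    assert(addrs != NULL || n == 0);
    uint32_t *seen = malloc((n ? n : 1) * sizeof *seen);
    size_t unique = 0;

    if (seen == NULL)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        size_t j = 0;
        while (j < unique && seen[j] != addrs[i])
            j++;                       /* look for an earlier occurrence */
        if (j == unique)
            seen[unique++] = addrs[i]; /* first time this address is seen */
        if (index_out != NULL)
            index_out[i] = j;
    }
    free(seen);
    return unique;
}

/* Size in bytes for a 32-bit architecture: unique word addresses times 4,
 * mirroring the "wc -l" result multiplied by 4. */
size_t core_size_bytes(size_t unique_words)
{
    return unique_words * 4;
}
```

For a trace with four distinct instruction addresses, the first function
returns 4 and the second reports 16 bytes, matching what "sort -u" followed
by "wc -l" and a multiplication by 4 would give for the same input.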