PacketBench v 1.0.0 README
==========================

PacketBench is a programming framework for workload characterization of network processing.
More details can be found in:

Ramaswamy, R., Wolf, T.: "PacketBench: A Tool for Workload Characterization of Network Processing", 
in Proc. of 6th IEEE Annual Workshop on Workload Characterization (WWC-6), pp. 42-50, Austin, TX, 
October 2003
http://www.ecs.umass.edu/ece/wolf/papers/wwc2003.pdf

PacketBench is not a tool in the traditional sense. It provides a programming environment which can be
used to measure detailed statistics on the processing characteristics of networking applications
run on a typical router/network processor. PacketBench is run on a processor simulator to obtain
measurements. We use the SimpleScalar tools as our processor simulator, specifically the
ARM version of SimpleScalar (simplesim-arm-0.2). Using PacketBench, you can get measurements such as:

1) Instructions executed for processing a packet for a given application.
2) Number of accesses to packet memory and program state while processing a packet for an application.
3) Computation of instruction core and memory core sizes for an application.
4) Generation of instruction and memory access patterns.
5) Generation of a dynamic instruction trace of the code executed while processing a packet.


Authors/Contact:
================

Ramaswamy Ramaswamy (rramaswa [at] ecs [dot] umass [dot] edu)
Tilman Wolf         (wolf [at] ecs [dot] umass [dot] edu)


Requirements:
=============
The Packetbench/SimpleScalar combination has been tested on the following system:

1) x86 processor running RedHat Linux 9.0 
   (gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5))

Additionally, you will need the following in order to use PacketBench:

1. SimpleScalar ARM (simplesim-arm-0.2) from http://www.simplescalar.com/v4test.html
2. A cross compiler for the ARM architecture. This can be found at the same web page above.
   (gcc-2.95.2, binutils-2.10, and glibc-2.1.3)
3. libpcap libraries *IF* you require the ability to process TCPDUMP trace files.
4. libpcap compiled for the ARM architecture (using the tools in (2) above) *IF* you require
   the ability to read from/write to TCPDUMP trace files.

Note: PacketBench by default includes support for reading from/writing to TCPDUMP trace files. As such, 
the code will require libpcap (in (3) and (4) above) to be present for PacketBench to compile. The 
source code will need to be modified if TCPDUMP support is not required.


Directory structure:
====================

flow_class --> Flow classification application
	arm
	x86
ipsec      --> IPSec application
	arm
	x86
ipv4-lctrie --> IPv4 forwarding application with LC-trie table lookup
	arm
	x86
ipv4-radix  --> IPv4 forwarding application with radix tree lookup
	arm
	x86
simple      --> Simple "barebones" application
	arm
	x86
sscalar     --> SimpleScalar specific files
traces      --> Sample trace file in TSH format

Four sample applications are provided. Each application directory has two subdirectories: "x86", which
contains the version of the application used for development and debugging on the x86 system, and
"arm", which contains the version of the application which has been cross-compiled for the ARM
architecture. The "arm" version of the code is run on SimpleScalar to obtain statistics.

Note that the code for both versions of an application is identical. Separate copies are kept for
the sake of convenience.

In addition to the 4 sample applications, a fifth "barebones" application is provided in the directory 
called "simple" to provide a better understanding of how PacketBench works. This application extracts 
the IP header from the packet and prints out some header information. 
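
To give a rough idea of what the "simple" application does, the following Python sketch extracts a
few fields from a 20-byte IPv4 header. It is not the code shipped in the "simple" directory (which is
C). The function name parse_ipv4_header and the sample bytes are made up for illustration.

```python
import socket
import struct

def parse_ipv4_header(packet):
    """Return selected IPv4 header fields from the first 20 bytes of packet."""
    if len(packet) < 20:
        raise ValueError("need at least 20 bytes for an IPv4 header")
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return {
        "version": ver_ihl >> 4,             # high nibble of the first byte
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL is counted in 32-bit words
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hypothetical sample header: 10.0.0.1 -> 10.0.0.2, TCP, TTL 64, total length 40.
sample = bytes([0x45, 0x00, 0x00, 0x28, 0x00, 0x01, 0x00, 0x00,
                0x40, 0x06, 0x00, 0x00, 10, 0, 0, 1, 10, 0, 0, 2])
header = parse_ipv4_header(sample)
print(header)
```

A real PacketBench application would do the equivalent work in C on the packet pointer handed over
by the framework.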


Installation:
=============

1) Untar archive to the directory of your choice
2) To incorporate the statistics collection features of PacketBench into SimpleScalar
	a) Make a backup copy of your original sim-profile.c in your SimpleScalar installation
	   (usually, simplesim-arm/sim-profile.c)
	b) Copy the file sim-profile.c from the "sscalar" subdirectory into the SimpleScalar
	   installation directory overwriting the original sim-profile.c
	c) Rebuild sim-profile (make sim-profile)


Known Issues/Limitations:
=========================

1) PacketBench by default includes support to read/write TCPDUMP files. To compile/run out of the box
   you will need libpcap installed on your system and cross-compiled for the ARM. The source code will
   need to be modified if TCPDUMP support is not required.

2) Writing back processed packets to TSH trace files has not been implemented at this time.

3) The top (head) of the input packet queue is always used to store the next packet to be processed. 
   PacketBench was designed with the ability for the packet processing application to store back 
   (or "re-enqueue") a processed packet into the input packet queue for future processing. This
   code path has not yet been extensively tested.

4) In the "ipv4-radix" application, the x86 version will cause a segmentation fault if compiled with
   an optimization level higher than 0 (-O0). This is not an issue of great concern since the x86 
   versions are mainly used for code development and not measurement.

PacketBench basics:
===================
PacketBench consists of two main parts: the PacketBench framework and the PacketBench application.
The PacketBench framework is responsible for packet input/output from/to trace files and is contained
in the bench.c and bench.h files. The framework provides a pointer to the packet, which the PacketBench
application can use. The PacketBench application (which the user provides) contains the code used to
process a packet. The application's processing function is hooked into the PacketBench framework by
means of a function pointer.

PacketBench reads in packet data from an input tracefile and writes processed packets back to an output
tracefile. Currently, two trace formats are supported: the TCPDUMP format (libpcap is required) and the
TSH format from NLANR (http://nlanr.net). At present, writing packets back to TSH files has not been
implemented.
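
As an illustration of reading a TSH trace, here is a small Python sketch. It assumes the commonly
documented TSH layout of fixed 44-byte records (8 bytes of timestamp and interface number, 20 bytes
of IP header, and the first 16 bytes of the TCP header). That layout is an assumption made here and
is not taken from this README, so check it against bench.c before relying on it. The names
read_tsh_records and make_record are hypothetical.

```python
import io
import struct

TSH_RECORD_SIZE = 44  # assumed: 8 (time/interface) + 20 (IP) + 16 (TCP)

def read_tsh_records(stream):
    """Yield (seconds, interface, microseconds, ip_header, tcp_header) tuples."""
    while True:
        record = stream.read(TSH_RECORD_SIZE)
        if len(record) < TSH_RECORD_SIZE:
            return  # end of file or truncated trailing record
        seconds, iface_usec = struct.unpack("!II", record[:8])
        interface = iface_usec >> 24          # top byte of the second word
        microseconds = iface_usec & 0xFFFFFF  # low three bytes of the second word
        yield seconds, interface, microseconds, record[8:28], record[28:44]

def make_record(seconds, interface, usec):
    """Build a synthetic record with zeroed IP/TCP headers (illustration only)."""
    return struct.pack("!II", seconds, (interface << 24) | usec) + bytes(36)

data = make_record(100, 1, 5) + make_record(101, 2, 7)
records = list(read_tsh_records(io.BytesIO(data)))
print(len(records), records[0][:3])
```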

The data structure used to store the packets is a queue of 64 entries (#define'd in bench.h), each of
size PACKET_SIZE bytes (also #define'd in bench.h). The data structure (called "packet") is defined in
bench.h. The current packet to be processed is always stored at the top of the queue, and a pointer to
this location is passed to the application. When the application indicates through its return value
that the current packet should exit the system, this location is overwritten with the next packet to
be processed. Alternatively, the application can store the packet back in the queue by returning a
different value.
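
The interaction between framework and application can be modelled in a few lines of Python. This is
only a conceptual sketch: the real framework is C code in bench.c, the return values PACKET_DONE and
PACKET_REENQUEUE are hypothetical names, and the sketch does not model when a re-enqueued packet is
processed again.

```python
PACKET_DONE = 0       # hypothetical return value: the packet exits the system
PACKET_REENQUEUE = 1  # hypothetical return value: store the packet back in the queue

def run_framework(trace, process_packet):
    """Call process_packet on each packet; return (exited, requeued) lists."""
    exited, requeued = [], []
    for packet in trace:
        head = packet  # the head of the queue always holds the current packet
        if process_packet(head) == PACKET_DONE:
            exited.append(head)    # head slot may now be overwritten by the next packet
        else:
            requeued.append(head)  # kept in the queue for future processing
    return exited, requeued

def sample_application(packet):
    """Hypothetical application: hold back packets whose TTL is below 2."""
    return PACKET_REENQUEUE if packet["ttl"] < 2 else PACKET_DONE

trace = [{"id": 1, "ttl": 64}, {"id": 2, "ttl": 1}, {"id": 3, "ttl": 32}]
exited, requeued = run_framework(trace, sample_application)
print([p["id"] for p in exited], [p["id"] for p in requeued])
```

Passing sample_application as an argument plays the role of the function pointer hook: the framework
never needs to know what the application does with the packet.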


Constants and Defines:
======================

In bench.h
----------
1. GLOBAL_PACKET_MEMORY
Controls whether memory allocated for the packet queue described above is allocated locally or globally. 
This MUST be defined if any type of memory analysis needs to be performed. (explained later)

In Makefile
-----------
1. VERBOSE
If defined, this switch prints diagnostic messages which may aid in debugging. This switch must be 
DISABLED during measurement so that instructions/memory accesses due to printf statements are not counted.

2. BENCHMARK_FUNCTION
This is the hook through which the name of the packet processing function is supplied, so that a
function pointer to it can be hooked into the PacketBench framework.


Compiling the application for measurement:
==========================================

When compiling the application for measurement (i.e., the "arm" version), the following points must
be observed.

1. In order to mimic the hardware operation as closely as possible, the final application should not
print anything to the screen during measurement. If it does, SimpleScalar will also count the
instructions required for displaying messages, and the reported instruction counts will be grossly
exaggerated. It is recommended to enclose all print statements in an "#ifdef VERBOSE" / "#endif" pair
to allow for easy debugging/development. Do not pass the -DVERBOSE switch to gcc when compiling the
"arm" version.

2. An optimization level of "-O2" must be used with gcc. This will cause the compiler to do all
possible optimizations except architecture-specific ones.

3. The debug flag (-g) must be turned off.

4. If any type of memory statistics are required, the "GLOBAL_PACKET_MEMORY" flag must be defined in
bench.h. PacketBench distinguishes between two types of memory regions: packet memory (the memory
required to store the packet headers and payload) and program state (everything else, e.g. lookup
tables for IP forwarding or IVs for encryption). In order for SimpleScalar to decide whether a
particular memory access is to packet memory or not, it needs to know the memory addresses of the area
which is to be considered as packet memory (in PacketBench, this is the input packet queue). These
addresses can be obtained at compile time by defining this data structure as a global variable. This
is exactly what "GLOBAL_PACKET_MEMORY" does.

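The packet memory versus program state decision boils down to an address range check. The Python
sketch below illustrates the idea; it is not the code in sim-profile.c. The function name access_code
is hypothetical, the bounds are the ones used in the sample command line in this README, the returned
codes follow the legend given for "mem_pattern.dat", and treating the end address as exclusive is an
assumption (it is described as the address of the next variable).

```python
MEM_START = 0x02098600  # start of packet memory (from the sample command line)
MEM_END = 0x020b8600    # address of the next variable, treated as exclusive here

def access_code(addr, is_write, mem_start=MEM_START, mem_end=MEM_END):
    """Map one memory access to the codes 1-4 used in mem_pattern.dat."""
    in_packet_memory = mem_start <= addr < mem_end
    if in_packet_memory:
        return 1 if is_write else 2  # write to / read from packet memory
    return 3 if is_write else 4      # write to / read from program state

for addr, is_write in [(0x02098600, True), (0x020b85ff, False),
                       (0x020b8600, True), (0x02000a88, False)]:
    print(hex(addr), "write" if is_write else "read", access_code(addr, is_write))
```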

SimpleScalar - PacketBench specific flags:
==========================================

1) -start_addr <uint>

Starting address of packet processing function at which SimpleScalar starts maintaining statistics.
Obtained from "objdump"
Default value: 0

2) -end_addr   <uint>     

Ending address of packet processing function at which SimpleScalar stops maintaining statistics.
Obtained from "objdump"
Default value: 0

3) -imem       <uint>

Flag to toggle (0 = off, 1 = on) computation of instruction memory size. If enabled, it will create a
file called "imem.dat" which needs to be post-processed in order to compute instruction memory size.
Default value: 0

4) -dmem       <uint>

Flag to toggle (0 = off, 1 = on) computation of data memory size. If enabled, it will create a
file called "dmem.dat" which needs to be post-processed in order to compute data memory size.
Default value: 0

5) -mem_pattern  <uint>           

Print out the memory access patterns for the packet specified by the unsigned integer. Output is written
to a file called "mem_pattern.dat".
Default value: 0

6) -instr_pattern   <uint> 

Print out the instruction pattern for the packet specified by the unsigned integer. Output is written
to a file called "instr_pattern.dat".
Default value: 0

7) -mem_start       <uint>

This value defines the starting address of packet memory. Obtained from "nm".
Default value: 0

8) -mem_end         <uint>

This value defines the ending address of packet memory. Obtained from "nm".
Default value: 0

9) -pb_verbose     <uint>

Toggles (0 = off, 1 = on) verbose display of the dynamic instruction trace, which contains all the
instructions executed in order to process packets. Output is written to "pb_intr.trace".
Default value: 0

10) -func_calls      <uint>

Specifies the number of packets for which SimpleScalar is required to run. Provides an alternate way
of stopping the simulator.
Default value: 100
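
The address flags above take values read from "objdump" and "nm" output. The following Python sketch
shows one way to pull them out automatically. It assumes "nm" lines of the form "address type name"
and "objdump -d --prefix-addresses" lines of the form "address <symbol> instruction". The helper
names, the sample lines, and the symbols process_packet and lookup_table are hypothetical;
"top_packet" is the default name of the packet queue variable.

```python
import re

def packet_memory_bounds(nm_lines, symbol="top_packet"):
    """Return (mem_start, mem_end) from `nm --numeric-sort` style lines."""
    symbols = []
    for line in nm_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # e.g. undefined symbols, which carry no address
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    for index, (addr, name) in enumerate(symbols):
        if name == symbol:
            # mem_end is the address of the next variable after the queue
            for next_addr, _ in symbols[index + 1:]:
                if next_addr > addr:
                    return addr, next_addr
            raise ValueError("no symbol found after " + symbol)
    raise ValueError(symbol + " not found")

def function_start(disassembly_lines, function):
    """Return the address of the first instruction of function."""
    entry = re.compile(r"^([0-9a-fA-F]+) <" + re.escape(function) + r">")
    for line in disassembly_lines:
        match = entry.match(line)
        if match:
            return int(match.group(1), 16)
    raise ValueError(function + " not found")

# Hypothetical excerpts of nm and objdump output.
nm_sample = ["         U printf",
             "02000a88 T process_packet",
             "02098600 B top_packet",
             "020b8600 B lookup_table"]
dis_sample = ["02000a88 <process_packet> mov\tip, sp",
              "02000a8c <process_packet+0x4> stmdb\tsp!, {fp, ip, lr, pc}"]

mem_start, mem_end = packet_memory_bounds(nm_sample)
start_addr = function_start(dis_sample, "process_packet")
print("-start_addr 0x%08x -mem_start 0x%08x -mem_end 0x%08x"
      % (start_addr, mem_start, mem_end))
```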


PacketBench flow:
=================

1. Start by developing the packet processing application with the "x86" version.
2. When you are satisfied that the application works as intended, use the template Makefile provided in
   the "arm" directory and cross-compile the application for the ARM ensuring the issues raised in the
   section above are dealt with.
3. We need to provide valid values for "start_addr", "end_addr", "mem_start", and "mem_end". These are
   obtained by examining the binary file created in step (2) above.
   a. In order to get "start_addr" and "end_addr", we need to disassemble the binary obtained from
      step (2) above (e.g. bench.arm)

	objdump -d --prefix-addresses bench.arm > bench.arm.dis

      Make sure that you use the ARM version of objdump (should be installed when you install
      binutils-2.10). Examine bench.arm.dis and note down the starting address of the packet processing
      function. This is the value for "start_addr". For "end_addr", the value given depends on the type
      of trace file being used. It is either the address of "write_packet_to_tsh_file()" or the address
      of "write_packet_to_tcpdump_file()" depending on whether you use TSH or TCPDUMP files respectively.
      The values for "start_addr" and "end_addr" mark the entry and exit points into the packet
      processing function.

   b. In order to get "mem_start" and "mem_end", we need to examine the symbol table of the binary
      obtained from step (2) above (e.g. bench.arm)

	nm --numeric-sort bench.arm > bench.arm.st

      Make sure that you use the ARM version of nm (should be installed when you install
      binutils-2.10). Examine bench.arm.st and note down the starting address of the packet queue.
      The variable by default is called "top_packet". This is the value for "mem_start". The address
      of the next variable is the value for "mem_end". These two values define the location of packet
      memory.
4. Run SimpleScalar. Depending on what you want to measure, a typical command line would look like the
   following:

   sim-profile -start_addr 0x02000a88 -end_addr 0x02000194 -mem_start 0x02098600 -mem_end 0x020b8600 
   -pb_verbose 1 -func_calls 500 ./bench.arm -N ./maewest.table trace.tsh dump.tsh drop.tsh

   The command line above specifies valid values for the starting and ending addresses of both the
   function of interest and the packet memory region. Additionally, a dynamic instruction trace for the
   first 500 packets processed will be created in a file called "pb_intr.trace".

5. Post-processing

   To compute the instruction and data memory sizes, the files "imem.dat" and "dmem.dat" need to be
   post-processed. These two files contain an address trace of every instruction/memory access which
   occurred. To compute the size (shown here for "imem.dat"):
	sort -u imem.dat > imem.dat.sort
	wc -l imem.dat.sort
   This gives the number of unique instructions/memory accesses that occurred while the application
   was executed. Since the architecture is 32-bit, the output of the "wc" command times 4 is the size
   of the instruction store/program memory in bytes.

   To obtain memory access patterns, the output of "mem_pattern.dat" needs to be graphed. This is a simple
   two-column file where the first column represents the instruction number. The second column is an
   integer from 0 to 4 where:

	0 -> no memory access
	1 -> write to packet memory
	2 -> read from packet memory
	3 -> write to non-packet memory (program state)
	4 -> read from non-packet memory (program state)

   To obtain instruction access patterns, the output of "instr_pattern.dat" needs to be graphed. This is
   also a two-column file where the first column represents the instruction number. The second column
   represents the instruction address of that particular instruction. The perl script "unique-instr.pl"
   in the "perl" subdirectory will process this file and add a third column of values which
   is a unique index for each new instruction that is executed. Graphing column 3 vs. column 1 will
   provide an instruction pattern plot.
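
The post-processing steps in item 5 can also be sketched in Python. The first function mirrors the
"sort -u" / "wc -l" pipeline and multiplies by 4 bytes per word. The second assigns an index to each
address in order of first appearance, which is the idea described for "unique-instr.pl". It is not a
port of that script, and starting the index at 1 and using whitespace-separated columns are
assumptions. All function names and sample lines are made up for illustration.

```python
WORD_BYTES = 4  # 32-bit architecture, as stated in the text

def memory_size_bytes(trace_lines):
    """Count unique addresses (like sort -u | wc -l) and convert to bytes."""
    unique = {line.strip() for line in trace_lines if line.strip()}
    return len(unique) * WORD_BYTES

def add_unique_index(pattern_lines):
    """Return (number, address, index) rows; index is assigned on first sight."""
    seen = {}
    rows = []
    for line in pattern_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        number, address = fields[0], fields[1]
        index = seen.setdefault(address, len(seen) + 1)  # assumed to start at 1
        rows.append((number, address, index))
    return rows

# Hypothetical excerpts of imem.dat and instr_pattern.dat.
imem = ["0x02000a88", "0x02000a8c", "0x02000a88", ""]
pattern = ["1 0x02000a88", "2 0x02000a8c", "3 0x02000a88", "4 0x02000a90"]

size = memory_size_bytes(imem)
rows = add_unique_index(pattern)
print(size)
for row in rows:
    print(*row)
```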



