Lab 5

Lab Instructions

Part I – Compiling to the PIC

In this part of the lab you are to write a bubble sort routine in C as well as in assembler. You will use the PIC C compiler to compile your C code to assembly and then run the program on the PIC. Your task is to compare the performance of the compiled code against your handwritten code. I expect that you do not get beaten by a brainless compiler…

To implement this lab, do the following things:

Understand how bubble sort works. Go through an example and make sure your counter variables cover the correct range. Make sure you do not compare any number of which you already know that they are correctly sorted (i.e., the ones that have “bubbled” to the end of the array). For this lab, we will use an array with exactly 10 elements.

Implement bubble sort in C. Look at the example C code at /ece/wolf/courses/ECE354/labs/insert.c
This code implements insertions sort, which is different from bubble sort, but it still will be helpful. Note that there are certain restrictions on the C compiler:

#pragma chip PIC16F877 defines the PIC that is targeted.

All operations are carried out in the main routine. While another function could be called to perform an operation no data can be passed to the function, i.e. it would have to defined as void func(void).

Only one indirect reference can occur per line:
k = array[j-1];
array[j] = k;
This cannot be simplified to one line.

Compile your C code. Use the command line prompt to enter (assuming your C file is sort.c): cc5x –a sort.c
This will create sort.asm, which you can load into the PIC IDE and simulate.

Reset the simulation in the IDE and open the File Register Window. If you use an array similar to that in the insertion sort example, you will see that it is located in 0x26 – 0x2F. You need to enter “interesting” data values into this array to make sure that there is something to be sorted. Use the “fill registers” function (right-click in file register window) to enter different data values. You should have 5 different arrays that you are going to sort with the C code and later with your code. One should be the best case, one should be the worst case, and the other three should be somewhat random. Make sure to write down what your values are.

After entering the data into the registers, step through the simulation until the program has completed. Use the stopwatch window to count the number of cycles that the sorting took. Record this value for all five arrays. You should see differences between the worst and best case array.

Now, write the bubble sort program by hand in assembly. You can use the same memory layout as the C compiler or any other that you find more suitable. You may use any clever tricks to write faster code than the compiler. Just make sure that your program still does bubble sort…

Simulate the sorting of the five arrays with your program as you have done in steps 4 and 5. Record the number of cycles that your program takes.

For your lab report, show the performance of the C code and your code (e.g., in a table) for all five arrays. Show the data arrays for each experiment. Explain why there is a performance difference (you may show the relevant assembly code lines). Include the C and both assembly files in the report.

Part II – Compiling to the DSP

In this part of Lab 5, you will write a simple routine to calculate the dot product of two vectors. The major part of the code is given to you and you just need to add a few lines yourself. The code also contains a vector addition routine that illustrates how to operate on the vectors. You will then compile the code for the TI DSP and simulate its performance for different levels of optimization. For your lab report, you should report the performance and explain the differences based on the assembly code generated by the compiler. You should do the following steps:

Download the C code from /ece/wolf/courses/ECE354/labs/lab5.c

Look at the C code and understand what’s going on. Add your code that calculates a dot product in the following function: int dotproduct(short *in1, short *in2, unsigned int N), where in1 and in2 are the vectors and N is their length. The integer returned by this function is the result.

Compile your code using the command line compiler for the DSP:
cl6x –qko1 lab5.c -z -llnk.cmd -lrts6201.lib -o lab5.out
This will create lab5.asm, which is the assembly code for the DSP and lab5.out, which is the binary that you need for the simulation.

You can now simulate the generated code with
load62x -q -k lab5.out
The program is set up to print the number of cycles used by the vector addition (which you can ignore) as well as by your dot product function. Record the number of cycles spent by your code.

Re-compile your code using optimization level 2 (cl6x -qko2 ...) and 3 (cl6x
–qko3 ...) and repeat step 4. Depending on your code you might not see any performance improvements between level 2 and 3.

Look at the assembly code for each optimization level and identify the lines that correspond to your dot product function. To interpret the assembly, you might want to look at the Programmer’s Guide available on the course web site. For your report, discuss the following questions while comparing different optimization levels:

	How many instructions are executed in your dot product loop?
	What is the maximum parallelism that is achieved (i.e., how many instruction does the DSP execute in parallel in one cycle)?
	Which functional units are used (L1, D1, …) and what do they do?

Your lab report should show the dot product function and discuss how you implemented it. Discuss what you have done to use the parallelism that can be used on the DSP. Show the results for the different optimization levels. Discuss why there are performance differences.

Demo

For the demo, you should show the results from Part I and II. You should be able to explain the steps that you have taken and the code that you have written. You should also be able to show that you can use the compilers and the DSP simulator. The demo and quiz will be done individually for this lab. Please sign up with your former team partner for a 30 minute slot (even though it is an individual project).

Lab Report

The lab report should be concise and consist of two parts that include the components mentioned in each section above. Include a brief introduction that discusses the basic concepts of how the code is compiled for the PIC and what the benefits and drawbacks of this approach are. For the DSP part, briefly discuss the DSP architecture and how parallelism can be used to speed up processing.

Challenge

For both parts of the lab, a competition is held. For both competitions, the goal is to write the fastest code:

PIC code: write the fastest bubble sort that can sort the worst case data array.
DSP code: write the fastest dot product function.

The winner of each competition receives an extra 10 points for lab 5 (the total of lab 5 cannot exceed 100 points). If there are multiple competitors who achieve the best result, the 10 points will be divided between them (fractions of points are rounded off). If there is a single winner for the PIC competition, he or she may carry the title “ECE354 Master Assembly Crafter.” If there is a single winner for the DSP competition, he or she may carry the title “ECE354 Master Parallelism Exploiter.” If both competitions are won by the same individual, he or she may carry the title “ECE354 Supreme Embedded Coder.” During lab hours, the current best performance should be announced on the blackboard in the lab. All competitors must have their solution’s performance announced on the blackboard by Wednesday 5/12/04 to give others a chance to counter (updates due to code improvements are allowed and encouraged).