### Chapter 21 #### Data-Flow Paradigm Co-processors Using the Designing Special-purpose Bilha Mendelson Baiju Patel Israel Koren ### 21.1 Introduction processor attached to a host computer for executing a specific application. flow paradigm. An Asic designed using this approach is intended to be a co-This research concentrates on designing high-performance ASICs using the datahas been the prime choice for the design of application specific ICs (Asics) [4]. Design of general purpose computers using the data-flow paradigm has been an area of intensive research for over a decade, whereas control driven architecture lated to a data-flow graph (DFG). The DFG represents all the parallelism inher-The application, given in the data-flow language SISAL [10], is first trans- Supported in part by SRC under contract 89-DJ-125. The authors are with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003. this chapter. high levels of concurrency and pipelining for general applications. The ASICs generated using this paradigm are referred to as data-driven ASICs throughout for pipelined operation. Therefore, this approach to ASIC design can provide rithm without having the user identify it. Moreover, it is naturally suitable execution make it possible to exploit all the inherent parallelism in the algooperation of the VLSI cells follows the data-driven execution model, i.e., each allelism explicitly. The DFG for a given application is then "directly" mapped onto VLSI such that each node in the DFG corresponds to a cell in VLSI. The ent in the algorithm and therefore, the user is not required to identify the parbuffer is free. The "direct" mapping of the DFG onto VLSI and the data-driven cell executes its task as soon as all the operands are available and the output presented in section 21.6. Section 21.5 illustrates the layout generation step, and final conclusions are of implementations can be generated for different performance requirements minimization process is discussed in detail and it is shown that a large number estimation of the application program is described. In section 21.4, the area discussed in detail in sections 21.3 to 21.5. In section 21.3, the performance brief description of the design process. Each step of the design process is then The rest of the chapter is organized as follows. Section 21.2 contains a ### 21.2 Design Process The complete procedure for designing data-driven ASICs is shown in figure 21.1. These steps are outlined in what follows. SISAL to DFG Translation: The application program is specified in SISAL correctness of the application program. simulated using an event driven simulator, PARET [12], to verify the the flow control for conditional and loop constructs. merge, True, False, etc.) nodes. The control nodes are used to implement A typical DFG is composed of arithmetic (e.g., add, subtract, multiply, compare, etc.), logical (e.g., AND, OR, etc.) and control (e.g., switch, data values that flow through the directed arcs, which connect nodes nodes of the DFG represent the computations performed on operands or [10], which is then translated using an optimizing compiler to a DFG. The The DFG is then Performance Estimation: Before the detailed design of the ASIC is initiated, a preliminary estimate of the performance of the application pronew algorithm must be developed for the application. The performance of gram is obtained. If the estimated performance is not satisfactory, then a Figure 21.1: The design process DFG. At this stage, the implementation details are ignored. of arithmetic, logical, conditional, and loop constructs that constitute the the ASIC is estimated hierarchically based on the estimated performance Initial Mapping: The first step in mapping a DFG to VLSI is to assign an certain nodes are then chosen to minimize the overall area within given area wasteful implementation. Therefore, alternate implementations for implementation is chosen for each node. This, however, may result in an in designing very high-performance ASICS, initially, the fastest available implementation to each of the nodes of the DFG. Since we are interested performance constraints. Area Minimization: Different operands at a multi-input node may arrive at of the nodes on the shorter paths can be increased without affecting the that arrives earlier has to wait for the other operands to arrive, the delay different time instances by traversing different paths. Since the operand a heuristic algorithm which is very fast, and the second is a branch and Two algorithms for area minimization have been developed. The first is bound algorithm which yields optimal results. This, in turn, will reduce the area of the Buffer Allocation: If a DFG with non-uniform path lengths is directly for buffers. This problem has been mapped to a quadratic programming synchronous pipelines [8] and static data-flow computers [5] because dataproblem that can be solved using well-known methods. The details of the buffer allocation algorithm are not provided here for the sake of brevity. problem for this architecture is more difficult than the similar problem for mapped onto VLSI, it may not be optimally pipelinable. In such an event, driven ASICs allow variable execution time for nodes and variable delay buffers may be added to shorter paths of the DFG. The buffer allocation Layout Generation: The final layout is generated using OCT tools [18] deing standard cell based generators. entire ASIC is synthesized from this library of behavioral descriptions usdescriptions of the different node implementations is prepared. Then, the veloped at the University of California at Berkeley. A library of behavioral # 21.3 Performance Estimation any implementation overheads. estimates are implementation independent and therefore, do not account for obtained concurrently while creating the DFG in a hierarchical fashion. These priate subgraph. Estimates for the potential performance of the program are basic structures in the program and for each one of them creates the appro-The compiler translating the program written in SISAL to a DFG identifies the successive results (measures the throughput which is its reciprocal). the potential parallelism), and pipeline period, which is the mean time between cycles, from entering the input operands until the output is produced (measures We use two performance measures: latency, which is the time, in clock successive results if the longest operation in the algorithm is always executed) abilities and estimates of iteration counts). and average pipeline period (the average pipeline period based on branch probof pipelining measures: estimates of iteration count in loop structures). Similarly, we define two types latency (the average input to output latency, based on branch probabilities and to output latency if the longest possible execution path is taken) and average We distinguish between two types of latencies: worst case latency (the input worst case pipeline period (the elapsed time between expressions, conditional expressions and loops, and estimate the potential parestimates for the basic structures hierarchically, we analyze the performance of allelism and pipelining for each one of them. By combining the performance the complete application We decompose the DFG into three types of structures: arithmetic/logic derive expressions for their performance measures. In what follows, we present the data-flow graphs of the basic structures and # 21.3.1 Arithmetic/logic expressions by the length of the critical path. expression. Therefore, the latency of the arithmetic/logic expression is given has to pass through all possible paths in the computation tree of the arithmetic best performance is through a balanced computation tree [1, 17]. A common way to implement an arithmetic/logic expression and achieve the The data of two sub-expressions, the estimated latency of the compound expression is calculated by the following recursive formula: The latency of an expression is denoted by L(expression). The execution time of an operation is denoted by EX(op). Given the estimated latencies Given the estimated latencies $$L(expression) = Max\{L(sub\_expression\_1), L(sub\_expression\_2)\} + EX(op)$$ The above formula assumes binary operation but it can be easily extended to n-ary operations. longest operation in the expression. The pipeline period of an expression is given by the execution time of the ## 21.3.2 Conditional expressions passes the data to the outgoing arcs or otherwise consumes it. the Then or Else part, and routing the result of either branch through a Merge Boolean control input. When the control value is true (false) the T (F) node the True (T) and False (F) nodes. These nodes receive a data input and a This structure is composed of three parts: computing the condition, executing A DFG representation of a general conditional expression is shown in figure 21.2. Routing of input data to either the Then or Else part is achieved using the computation can not be determined a priori. Therefore, both the average and worst case performance measures need to be computed. the computation performed depends on the input data, and the path taken by The conditional structure is not a deterministic expression in the sense that WL(exp2) + EX(F), respectively. Because of their similarity, we can assume that EX(T) = EX(F) and we use the notation EX(T). The worst case latency of the conditional structure is The latency of the Then and Else parts is equal to WL(exp1) + EX(T) and The worst case latency of an expression is denoted by WL(expression). $$WL(if \pm hen \pm else) = Max\{WL(exp1), WL(exp2)\} + EX(T) + WL(Cond) + EX(Merge)$$ Figure 21.2: If-then-else structure by p and (1-p), respectively. The average latency of the Then part is equal to AL(exp1) + EX(T) and that of the Else part is equal to AL(exp2) + EX(T). The average latency of the complete if-then-else structure is therefore The probability of passing through the Then and the Else parts is denoted $$AL(if \ \ \, then \ \, else) = p * AL(exp1) + (1-p) * AL(exp2)$$ $$+ EX(T) + AL(Cond) + EX(Merge)$$ by D and is equal to $D = \frac{p_i}{1-p_i}$ . number of times that the "short" path will be selected successively is denoted probability of passing through the "short" path be denoted by $p_s$ ( $p_s$ is either ppath with the largest pipeline period the "long" path while the other is called consequently, $AP(long) \geq AP(Cond)$ and $AP(short) \geq AP(Cond)$ . Let the AP(long) and AP(short). Note that the Cond part belongs to both paths and the "short" path, and we denote the corresponding average pipeline periods by based on the average pipeline periods of the Then and Else paths. We call the The worst case pipeline period is given by the longest operation in the struc-The calculation of the average pipeline period of a conditional structure is Assuming geometrical distribution of the selected paths, the average in the "long" path only during AP(long) - AP(Cond) clock cycles. We denote Consecutive computations are at least AP(Cond) clock cycles apart. Therefore, several computations in the "short" path can overlap a single computation the ratio between the maximum overlap time and the average pipeline period of the "short" path by R, i.e., $R = \frac{AP(long) - AP(Cond)}{AP(short)}$ can start only AP(Cond) time units after the computation in the "long" path the D computations in the "short" path. The computation in the "short" path divided by (D+1). If, however, D is larger than R, then the time to complete other branch. Therefore, the average pipeline period will be equal to AP(long)computations in the "short" path and the single preceding computation in the The average pipeline period of the if-then-else structure is [11]: has started. Hence, we need to add this term to the overall computation time D+1 consecutive computations is determined by the time needed to complete If D is smaller than R, then there is a complete overlap between the D $$AP(if\_then\_else) = \left\{ egin{array}{ll} rac{AP(long)}{D+1} & ext{if } D < R \\ rac{D*AP(short) + AP(Cond)}{D+1} & ext{otherwise} \end{array} ight.$$ always available when needed, i.e., the input rate is not smaller than the internal throughput. Consequently, the calculated estimates tend to be optimistic. The above estimations are based on the assumption that the incoming data is ### 21.3.3 Loop structures the computation that has to be repeated. termines the number of iterations to be executed, and the body which contains A DFG of a loop structure is composed of two parts: the control part, which de- user supplied values1 of the average or worst case number of iterations, which will in turn yield average or worst case estimations, respectively. Therefore, we Therefore, in these cases we estimate the latency and pipeline period based on use the same notation for average and worst case measures. In many cases, the number of iterations needed is not known at compile time next estimate the performance when the loop structure produces a single result and then we analyze the other case. In general, a loop may generate a single result or a stream of results. ### Single result loop structure put streams, (stream\_1,...,stream\_l). These synchronization nodes are Stream outgoing link in a Round Robin fashion. By replicating the loop body, we can The control signal passes to the synchronization nodes, which control the in-The DFG of a typical loop structure with a single result is shown in figure 21.3 Modulo nodes (S.Mod) that route the incoming data stream to the appropriate <sup>&</sup>lt;sup>1</sup>These values may be determined through simulations on typical input data. Figure 21.3: Single result loop structure iterations (e.g., Sum node). In summary, the loop body consists of m replicas of the f block and a summation tree as shown in figure 21.3. tree (e.g., plus (+) nodes) and then accumulate the partial results of the various task is accomplished in two steps: add the partial results using a computation copies of the f block have to be accumulated to produce the loop output. This to maximize performance, appears in [11]. The partial results generated by all analysis, including the derivation of the optimal number of replications needed ing several replicas of the body, denoted by f blocks in figure 21.3. achieve a better performance and therefore we analyze a loop structure contain-Detailed is not a constant we may use its expected value. time t. For simplicity, we assume that t is a constant. If the inter-arrival time The input to the loop body is a stream of data elements with inter-arrival noted by L(f), the number of f block replications, m, and the latency of the The latency of the loop body depends on the latency of an f block, de- summation tree, denoted by L<sub>sum. tree</sub>. We can divide the summation tree into two sub-trees; one is a complete binary tree with [log m] levels and the second is a partial tree. We can divide leaves and the other is a partial tree. We may continue this process until the partial tree will be either a complete tree or will be empty. Therefore, the latency of the summation tree is given by the following recursive formula: the partial tree again into two sub-trees; one is a complete tree with $m-2^{\lfloor \log m \rfloor}$ $$L_{sum. tree}(m) = Max\{LC(m), LP(m)\} + \delta(m)EX(+)$$ where LC(m) is the latency of the complete tree and is given by $$LC(m) = (2^{\lfloor \log m \rfloor} - 1)t + EX(+)\lfloor \log m \rfloor$$ LP(m) is the latency of the partial tree and is given by $$LP(m) = \left\{ egin{array}{ll} 0 & m=0 \ { m or} \ m-2^{\lfloor \log m \rfloor} = \ & \\ 2^{\lfloor \log m \rfloor}t + L_{sum.\ tree}(m-2^{\lfloor \log m \rfloor}) & { m otherwise} \end{array} ight.$$ and $$\delta(m) = \left\{ egin{array}{ll} 0 & m = 2^k & (k ext{ is an integer}) \\ 1 & ext{otherwise} \end{array} ight.$$ The latency of the body can therefore be estimated as $$L(body, m) = (\lceil \frac{n}{m} \rceil - 1)P(body) + CL + L_{sum. tree}(m)$$ where n is the original number of iterations, $$CL = L(f) + EX(S.Mod) + EX(Sum) + EX(F)$$ and P(body) is the pipeline period of the body determined by $Max\{IP(body), mt\}$ . IP(body) is the execution time of the longest operation in the body. The optimal number of f block replications necessary to achieve the minimum possible execution time of a single result loop structure is $$m_{opt} = \left\{ \begin{array}{c} \left\lceil \frac{IP(body)}{t} \right\rceil & \text{if } \left\lceil \frac{IP(body)}{t} \right\rceil \leq \frac{1}{2} + \sqrt{\frac{1}{4} + \frac{n \cdot IP(body)}{t + EX(+) + IP(body)}} \\ \left\lceil \frac{n}{\lceil \frac{IP(body)}{t} \rceil} \right\rceil & \text{if } \frac{1}{2} + \sqrt{\frac{1}{4} + \frac{n \cdot IP(body)}{t + EX(+) + IP(body)}} < \left\lceil \frac{IP(body)}{t} \right\rceil < n \\ & \text{otherwise} \end{array} \right.$$ Figure 21.4: Stream of results loop structure The corresponding minimal latency of the body is $nt+LP+L(sum.\ tree)$ [11]. Finally, the latency of the single result loop structure is $$L(loop) = L(control) + L(f)$$ pipeline period of the single result loop is the same as its latency. where L(control) denotes the latency of the control part of the loop. The ### Stream of results loop structure A stream of results loop structure is shown in figure 21.4. For synchronization purposes we use two nodes: Stream Modulo (S.Mod), which synchronizes the stream of inputs, and a Stream Merge (S.Merge), which guarantees the proper ordering of the results. period of the f block. As is shown in [11], the optimal number of replications that may be smaller than EX(S.Mod) and P(f), where P(f) is the pipeline within the loop. This way, the loop structure can produce new results at a rate By replicating the f block, we allow overlapping of consecutive computations $$m_{opt} = \lceil \frac{Max\{P(f), EX(S.Mod)\}}{Max\{t, P(control), EX(S.Merge)\}} \rceil$$ where $m_{opt} \leq n$ . The minimal pipeline period of the loop structure is $P(stream\ of\ results\ loop\ structure) = Max\{t, P(control), EX(S.Merge)\}$ Figure 21.5: Nested if then-else expression and its DFG representation and the corresponding latency is $$L(first\ result) = L(f) + EX(S.Mod) + EX(S.Merge) + L(control)$$ #### 21.3.4 Examples We demonstrate the performance estimation method through a simple nested if then-else program. Figure 21.5 shows the program and its corresponding DFG generated by our compiler. number marked on each arc represents the accumulated worst case latency at that point. As can be seen from the figure, the latency of the complete DFG in For the analysis of this example, we use the execution times from [7]. The longest operation in the graph, respectively. the worst case is equal to 32 clock cycles and the pipelining period is 11 clock cycles. The above results correspond to the length of the critical path and the complete overlap between the Else and Then branches of the outer conditional the average pipeline period decreases. this case, as the probability of passing through the outer Then path increases, the Then path of the inner conditional structure is equal to 0.8 instead of 0.2. In probability to pass through the outer Then path increases, the average pipeline The Then path is the "short" path. In figure 21.6(a), the probability to pass through the Then path of the inner conditional is 0.2. We can see that as the function of the probability of taking the Then path of the outer conditional simulator, PARET of the example in figure 21.5 to the simulation results obtained using the event period approaches AP(short). In figure $21.6(\mathrm{b})$ , the probability to pass through In figure 21.6, we compare the estimated values of the average pipeline period [12]. Figure 21.6 shows the average pipeline period as This continues as long as there is a the average case pipeline period. worst case pipeline period is 11 clock cycles, which is substantially higher than be seen from the figure, the estimated values are very close to the simulation the overlap and results in an increase in the average pipeline period. As can results and therefore, lengthy simulations may be avoided. In both cases, the Further increase in the probability of taking the outer Then path reduces sequence of operations that cannot be overlapped, which includes the multiply operation). Here, the pipeline period of the loop structure is determined by a which is the accumulated execution time of the operations along the critical this example, the result of the first iteration is produced after 26 clock cycles of the body is: EX(\*) + EX(+) + EX(Merge). Replicating the body will not Because of the dependency between successive iterations, the pipeline period a loop structure where the current iteration depends on the previous iteration add, and Merge nodes. which is the execution time of the longest operation in the graph (the multiply reduce this pipeline period and therefore the latency of the loop structure. In Figure 21.7 shows a first order impulse response filter [6] as an example for The second result, however, is produced 17 clock cycles later and not 11, ## 21.4 Area Minimization smaller implementations to some of the nodes to minimize the overall area for The area minimization procedure starts with the initial mapping and reassigns formance measure is the throughput. Since the pipeline period of the designed the desired performance. For a pipelined ASIC, the single most important per- (a) The probability to pass through the Then path of the inner conditional is 0.2 (b) The probability to pass through the Then path of the inner conditional is 0.8 Figure 21.6: Comparing the estimated pipeline period to simulation results for the example in figure 21.5 Figure 21.7: First order impulse response filter example in figure 21.8. Let the multiply (MULT) node be a sequential multipipeline period. The need for area minimization can be illustrated through the implementation will increase the length of the critical path and thus increase as well. A parallel adder should still be used for ADD2 because any slower ASIC. Moreover, the area of the interconnections for ADDI would be reduced may be used for node ADDI without affecting the overall performance of the ever, an alternate bit-serial implementation (with execution time of 16 cycles) the result of ADDI has to wait for 15 cycles after it has been computed. Howresult of ADD1. Since ADD2 cannot execute until both operands are available, the initial mapping, the result of MULT is available 15 cycles later than the nodes be parallel 16-bit adders that take one cycle to execute. plier that takes 16 cycles for the computation. Let the add (ADD1 and ADD2) tation for any node of the DFG such that its delay is larger than the allowed pipeline period. The area minimization process will not select an implemen-(slowest) implementation that can be used for a node is restricted by the desired ASIC can never be smaller than the delay of the slowest node, the smallest Therefore, for the overall area of the Asic is minimal. If the area of the final design is not the area minimization will select an implementation for each node such that further reduce the overall area. acceptable, the design process will be repeated for a larger value of latency to In general, there may be many different implementations for a node, and a different system area and execution time. For example, if the DFG contains nThe above area minimization, in general, is an NP-complete problem since, for every node, there are many implementations and each of these may lead to Figure 21.8: A simple DFG interconnection area reduces with the smaller (e.g., serial) implementation of a by the area minimization step. tween the nodes. Hence, the final layout may be larger than the area estimated present, our algorithms ignore the area required for the interconnections besolution while a branch and bound algorithm is used for optimal solution. sible designs. Therefore, a greedy algorithm was developed to obtain a "good" adders and there are five implementations of the adder, then there are 5" pos-However, we observe that in most cases, the At ### 21.4.1 A greedy algorithm and on the set of implementations available for each type of node depends primarily upon the order in which nodes are selected for replacement area of the resultant ASIC is unacceptably large, the length of the critical path remains unchanged. This process is repeated until no such node exists. If the by a smaller one such that the total area is reduced, but the overall latency From the initial mapping, a node is selected and its implementation is replaced (latency) is increased and the process is repeated. The quality of the solution #### Preliminaries ilarly, the node must complete its execution before the smallest of the latest instances at which the results from the predecessor nodes are available. therefore, the earliest time it can execute is the maximum of the earliest time define the freedom of a node as the largest amount by which its execution time These two time instances are denoted by asap (as soon as possible) and alap time instances at which its output must be available to the successor nodes A node cannot start its execution before all the required inputs are available: surplus time, have been used for several scheduling problems (e.g., [9]). (as late as possible). The asap and alap times, also referred to as slack and ### Algorithm: Area Minimization - Compute the freedom for each node. - <u>ب</u> ب Let S be the set of candidate nodes for which there are implementations such that when replaced, the increase in the delay is not larger than their freedom. If S is empty then exit. - .Α. ω Compute the area savings for each v in S - Let $S = \{v \mid v \in S \text{ and } v \text{ has maximum area savings}\}$ - Choose node v from S' which smallest freedom. - 700 Replace current implementation by a smaller one - Go to step 1. Figure 21.9: Greedy algorithm for area minimization the freedom of a node on the critical path is zero. (which is given by the difference between the alap and asap times). Obviously, can be increased without increasing the execution time of the overall design #### Node Selection will affect the solution and hence, determine the overall size of the final design. time is not larger than its freedom. The order in which the nodes are replaced can be substituted by a smaller one if and only if the increase in its execution tation of the DFG are replaced by slower ones. The implementation of a node In order to minimize the overall area, some of the nodes in the fastest implemen- in figure 21.9. A more detailed discussion of this algorithm appears in [15]. decrease, but need not become zero. The above greedy algorithm is summarized replacement, then the freedom of the other nodes (with larger freedom) may may be reduced. Therefore, if the node with the smallest freedom is chosen for node that provides maximum area savings to be replaced first. If there is more than one such node, the one with the smallest freedom is selected. Whenever the execution time for a node is increased, the freedom for some other nodes The greedy algorithm attempts to minimize the overall area by selecting the # Branch and bound algorithm crease in the execution time is smaller than its freedom. An implementation of a node can be replaced by a smaller one only if the in-Therefore, the lower | .500 | 0.577 | 0.750 | 0.957 | $area(mm^2)$ | |------|-------|-------|-------|--------------| | | 4 | 8 | 16 | bits | | | 4 | 2 | 1 | time(clocks) | 16-bit adder Table 21.1: The area and execution time for different implementations of a that can be replaced by smaller implementations, the node that leads to the the branching and bounding steps. At the branching step, out of the nodes node with the smallest freedom and largest area savings. Bounding takes place smallest lower bound is selected. A tie in the selection is broken in favor of the when the current smallest solution is smaller than the area lower bound tation within the above constraint. The area lower bound is used during both bound for the area of the DFG is evaluated by choosing the smallest implemen- #### 21.4.3 Examples is shown in figure 21.11. tradeoff curve for this example obtained using the worst case latency measure the total area is reduced from initial area of $15.32 \ mm^2$ to $13.69 \ mm^2$ while the greedy and branch and bound algorithms for area minimization use an 8-bit 16-bit addition is performed by adding 8 bits at a time in two clock cycles. Both present an example taken from [14], as shown in figure 21.10, where "S" and the overall execution time remains unchanged (16 clock cycles). The area-time adder implementation (delay of two clock cycles) for the ADD3, ADD4, ADD5, the number of bits that are operated simultaneously, i.e., 8 bits indicate that implementations of these nodes are shown in table 21.1. The "bits" indicate To demonstrate the practical nature of the area minimization algorithms we were designed using $2\mu$ technology. The area and execution time for different "M" nodes are Switch2 and Merge nodes, respectively. The add/subtract nodes The rest of the nodes are 16-bit adder implementations. Consequently, SUB4, and SUB5, and a 4-bit adder (delay of four clock cycles) for first generating an approximate set of solutions for different area and latency obtained using the greedy algorithm is no more than 5% larger than the optimal area obtained using the branch and bound algorithm. Therefore, we suggest using the greedy algorithm. Once a solution is selected, it may be optimized using the branch and bound algorithm. In all the examples we tried (including the one above), the overall area <sup>&</sup>lt;sup>2</sup>A switch node is equivalent to a pair of True and False nodes. Figure 21.10: An example from MAHA [14] Figure 21.11: Area vs. time for the DFG of figure 21.10 ## 21.5 Layout Generation macrocell based layout generation schemes, as outlined below. is generated using the Oct tools [18]. The cells in the final layout may either be predesigned or generated as needed. The alternatives are standard cell or The DFG for the application is translated into a netlist from which the ASIC The final step in the synthesis of data-driven ASICs is the layout generation. the complete layout for the ASIC. the standard cell based placement and routing program Wolfe [16], generates netlist of the DFG using BDNET. This netlist, when placed and routed using complete ASIC is then obtained by combining the netlists of the nodes and the parallel adder (16-bit) is shown in figure 21.12. The BDS description of a node is translated into a netlist using BDSYN and MisII [2]. The netlist for the the nodes is prepared in the language BDS [18]. The BDS description of a the cells are of uniform height). A library of the behavioral description of all The first approach generates the final layout using standard cells only (all curves is similar. The timing analysis using Crystal [13] shows that all of these area" in this figure indicates the area of the ASIC after placement and routing, is shown in figure 21.15. initial implementations will operate at least at 5MHz clock. One sample layout algorithm (ignoring interconnects). It can be observed that the nature of both and the "estimated area" indicates the area calculated by the area minimization ure 21.13, and their area and latency are shown in figure 21.14. The "final We have generated different layouts for the conditional statement of fig- We have also generated layouts for several implementations of the DFG of ``` ack_i1<0> = ack_o1<0>; ack_i2<0> = ack_o1<0>; send_o1<0> = send_i1<0> AND send_i2<0>; out_1<16:1> = in_1<16:1> + in_2<16:1>; ENDROUTINE; ENDMODEL; H ROUTINE add; MODEL add_16 ack_01<0> send_i1<0>, send_i2<0> in_1<16:1>, in_2<16:1>, send_o1<0>, ack_i1<0>, ack_i2<0>, out_1<16:1>, Isend signal for output tack signal for each input lack signal at output port Isend for each input Itwo inputs Ithe output ``` Figure 21.12: The BDS description of a 16-bit adder Figure 21.13: A conditional statement Figure 21.14: Area vs. time for the DFG of figure 21.13 figure 21.10. For example, the total area required for two designs with latencies of 12 and 16 cycles are 19.4 mm<sup>2</sup> and 17.9 mm<sup>2</sup>, respectively. This clearly demonstrates the applicability of this approach to large DFGs. to place and route the separate cells. may be generated using different target architectures (such as PLA). Finally, a used cells can be predesigned for optimal area/performance ratio and others cell based generator may result in a suboptimal design. macrocell based placement and routing tool (e.g., Mosaico [3]) should be used The requirement that the complete Asic be generated using a standard Instead, frequently ### 21.6 Conclusions presented. step, the area of the ASIC is minimized and then the final layout is generated. Examples illustrating the various steps in the design of the ASIC have been compiler also provides estimates for the performance of the ASIC. for translating the application specified in SISAL to a data-flow graph. grain parallelism and pipelining. The developed CAD tool includes a compiler An innovative approach to the design of ASICs has been presented in this pa-The designed ASICs operate in a data-driven mode that supports fine In the next Our preliminary experiments show that a DFG with about 50 to 100 Figure 21.15: Layout of the DFG of figure 21.13 References 23 at 5MHz to 10MHz clock rate. timing analysis using Crystal and Spice shows that these ASICs can be operated add/subtract operators can easily be implemented on a $1 cm^2$ chip. Also, the #### References - A.V. Aho and S.C. Johnson. Optimal code generation for expression trees. J. ACM, 23:488-501, November 1976. - 2 R.K. Brayton et al. MIS: A Multiple-Level Logic Optimization System. *IEEE Transactions on CAD*, CAD-6(6):1062-1081, November 1987. - ယ C.H Sequin ed., editor, Proc. of VLSI '87, Vancouver, Canada, 1987. J. Burns et al. Mosaico: An integrated macro-cell layout system. In - <u>4</u> D.D. Gajski. Silicon Compilation. Addison-Wesley Publishing Co., 1987. - <u>5</u> G.R. Gao. Algorithmic aspects of balancing techniques for pipelined data 1(6):39-61, February 1989. flow code generation. Journal of Parallel and Distributed Computing, - Publishers, 1986. L.B. Jackson. Digital Filters and Signal Processing. Kluwer Academic - I. Koren, B. Mendelson, I. Peled, and G.M. Silberman. A data-driven VLSI array for arbitrary algorithms. Computer, 21(10):30-43, October 1988 - <u></u> C.E. Leiserson and J.B. Saxe. Optimizing synchronous systems. J. of VLSIand Computer Systems, 1(1):41-67, January 1983 - 9 M.C. McFarland, A.C. Parker, and R. Camposano. Tutorial on high-level synthesis. In Proc. of 25rd Design Automation Conference, pages 330-336. - [10] J. R. McGraw et al. SISAL: Streams and iteration in a single assignment Livermore National Laboratory, Livermore, CA, March 1985. language: Reference manual version 1.2. Manual M-146, Rev. 1, Lawrence - [11] B. Mendelson and I. Koren. versity of Massechusetts, Amherst, 1989. B. Mendelson and I. Koren. Estimating the potential parallelism and pipelining of algorithms. Technical Report TR-90-CSE-5, ECE Dept. Uni- - [12]K.M. Nichols and J.T. Edmark. PARET. Computer, 21(5):39-48, May 1988 Modeling multicomputer systems with - [13] J.K. Ousterhout. A switch-level timing verifier for digital MOS VLSI. IEEE Transactions on CAD, CAD-4(3):336-348, July 1985. - 14 A.C. Parker et al. MAHA: A program for datapath synthesis. In Proc. of 23rd Design Automation Conference, pages 461-466, 1986. - [15] B. Patel, I. Koren, and D.K. Pradhan. Designing highly pipelined ASICs University of Massechusetts, Amherst, 1989. using the data flow paradigm. Technical Report TR-89-CSE-9, ECE Dept. - [16] C. Sechen and A. Sangiovanni-Vincentelli. April 1985. and routing package. IEEE Journal of Solid State Circuits, 20(2):510-522, The TimberWolf placement - [17] R. Sethi and J.D. Ullman. The generation of optimal code for arithmetic expression. J. ACM, 17:715-728, October 1970. - [18] R. Spickelmier, editor. Oct Tools Distribution 3.0. University of California, Berkeley, March 1989.