Overview

- Anti-fuse and EEPROM-based devices
- Contemporary SRAM devices
  - Wiring
  - Embedded
- New trends
  - Single-driver wiring
  - Power optimization
22V10 PAL

- Combinational logic elements (SoP)
- Sequential logic elements (D-FFs)
- Up to 10 outputs
- Up to 10 FFs
- Up to 22 inputs
Antifuse Switch

• Anti-fuses are one-time programmable.
  - A programming pulse breaks down the dielectric, leaving a permanent conductive link
  - Only need to program once.

Figure: Metal-to-metal antifuse cross-section (labels, top to bottom: Metal 3, metal-to-metal antifuse, Metal 2, via, Metal 1, contact, silicon)
Anti-Fuse FPGA

- Negligible programming overhead
- Low capacitance routing (fast)
- Security
- Tolerant of firm errors
- Resistance of about 100 $\Omega$

Figure: ONO antifuse structure (labels: antifuse polysilicon, ONO dielectric, $n^+$ antifuse diffusion; dimension $2\lambda$)
Typical Actel Anti-fuse Interconnect

Interconnection Fabric
Anti-fuse Security

- Very good for design security
  - No bitstream can be intercepted in the field (no bitstream transfer, no external configuration device)
  - An attacker needs a Scanning Electron Microscope (SEM) even to attempt to read the antifuse states (an Actel AX2000 antifuse FPGA contains 53 million antifuses, with only 2-5% programmed in an average design)

Courtesy: Burleson/Gogniat
FLASH-Memory Switch

Figure: Flash-memory switch (labels: PRG/SEN, SWITCH, SEL 1, SEL 2, WORD LINE)
Flash/EEPROM Trends

- Logic elements (LUTs and flip flops)
- Segmented routing
- Low logic to register ratio
- Future?

Altera Max II
SRAM-based FPGA

- SRAM bits can be programmed many times
- Each programming bit takes up five transistors
- The larger switch area reduces speed relative to EPROM and antifuse devices.

![Diagram of SRAM-based FPGA](image)
Field Programmable Gate Array
Design Tradeoffs

- Some logic clusters are large (e.g. an Altera cluster contains 8 LUT-FF pairs)
- Three important issues:
  - Logic elements per cluster
  - Cluster connectivity to the interconnect wires: connection flexibility \( (F_C) \)
  - Switchbox flexibility \( (F_S) \)
Issue 1: The Logic Cluster

Question: How many BLEs should there be per cluster?
• Interestingly, small clusters are more area-efficient (Betz – CICC’99)
• The comparison includes the area needed for routing.
• Small clusters (e.g. one BLE per cluster) are not “CAD friendly”.
• Most commercial devices have 4-8 BLEs per cluster
Number of Inputs per Cluster

- Lots of opportunities for input sharing in large clusters (Betz – CICC’99)
- Reducing inputs reduces the size of the device and makes it faster.
- Most FPGA devices (Xilinx, Lucent) have 4 BLE per cluster with more inputs than actually needed.
• \( F_C \): how many routing tracks does each input pin connect to?
• If the logic cluster is small, \( F_C \) must be large \( (F_C = W) \), where \( W \) is the number of tracks per channel
• If the logic cluster is large, \( F_C \) can be smaller.
  - Approximately 0.2W for Xilinx XC4000EX
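
As a rough illustration of the connection-box numbers above, the sketch below picks the tracks that one input pin can reach for a given connection flexibility. The channel width of 100 tracks, the evenly spaced connection pattern, and the function name `pin_tracks` are assumptions made for this example, not details of any particular device.

```python
def pin_tracks(W, fc_fraction, offset=0):
    """Return the track indices one input pin can connect to.

    W           -- tracks per routing channel (hypothetical value in the demo)
    fc_fraction -- connection flexibility as a fraction of W (1.0 means F_C = W)
    offset      -- starting track, so that different pins can be staggered
    """
    fc = max(1, round(fc_fraction * W))  # number of switches on this pin
    # Spread the connections evenly across the channel (an assumed pattern).
    return [(offset + i * W // fc) % W for i in range(fc)]

full = pin_tracks(100, 1.0)    # small cluster: F_C = W
sparse = pin_tracks(100, 0.2)  # large cluster: F_C of about 0.2W
print(len(full), len(sparse))  # 100 20
```

With the lower flexibility each pin needs one fifth of the switches, which is where the area saving for large clusters comes from.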
Switchbox Flexibility

- Switch box provides optimized interconnection area.
- Switch-box flexibility was found to be less important than $F_C$
- Six transistors are needed per switch point for $F_S = 3$
Switchbox Issues
• Rotate connections inside the switchbox while keeping $F_S = 3$
• Still uses six transistors per switch point in the base switch matrix.
• Eliminates the domain issue (a signal is no longer confined to one track number)
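
The difference between a plain and a rotated switchbox can be sketched in a few lines. The model below is a simplification under stated assumptions: track numbers are preserved along wires between switchboxes, every switchbox is identical, and the rotated pattern simply shifts turning connections by one track. It is not the exact topology of any published or commercial switchbox, and the names `build_switchbox`, `flexibility`, and `count_domains` are hypothetical.

```python
from itertools import combinations

SIDES = ("N", "E", "S", "W")

def build_switchbox(W, rotate):
    """Return the programmable switches as frozensets of two wire ends."""
    switches = set()
    for a, b in combinations(range(4), 2):
        is_turn = (b - a) % 2 == 1           # adjacent sides; N-S and E-W go straight
        shift = 1 if (rotate and is_turn) else 0
        for t in range(W):
            switches.add(frozenset({(SIDES[a], t), (SIDES[b], (t + shift) % W)}))
    return switches

def flexibility(switches, wire_end):
    """F_S for one wire end: how many other wire ends it can be switched to."""
    return sum(1 for sw in switches if wire_end in sw)

def count_domains(W, switches):
    """Count groups of track numbers that a signal can move between."""
    parent = list(range(W))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for sw in switches:
        (_, t1), (_, t2) = sw
        parent[find(t1)] = find(t2)
    return len({find(t) for t in range(W)})

W = 4
for name, rotate in (("plain", False), ("rotated", True)):
    sb = build_switchbox(W, rotate)
    print(name, len(sb) // W, flexibility(sb, ("N", 0)), count_domains(W, sb))
```

In this model both variants use six switches per track and give every wire end a flexibility of 3, but the plain version splits the tracks into W separate domains while the rotated version merges them into one.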
Switchbox Issues

Diagram of Switchbox Issues:

- **L** blocks are connected to **S** blocks, which in turn are connected to **C** blocks.
- **0**, **1**, and **2** labels indicate the connections between blocks.

**a)** S block

**b)** C block
Buffering

- FPGAs need buffers to isolate large RC networks
- Architects must decide where to place buffers.
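
One way to see why buffering matters is the Elmore delay of a chain of identical RC stages, which grows quadratically with the chain length. The sketch below is a first-order estimate only: the unit resistance and capacitance, the buffer delay, and the function names are assumed values for illustration, and buffer input capacitance and output resistance are ignored.

```python
def elmore_chain(n, r, c):
    """Elmore delay of n identical RC stages in series.

    The capacitor of stage k is charged through the k resistors before it,
    so the total is r*c*(1 + 2 + ... + n) = r*c*n*(n+1)/2.
    """
    return r * c * n * (n + 1) / 2

def buffered_delay(n, r, c, every, t_buf):
    """Delay when a buffer (fixed delay t_buf) follows every `every` stages."""
    full, rest = divmod(n, every)
    return full * (elmore_chain(every, r, c) + t_buf) + elmore_chain(rest, r, c)

print(elmore_chain(8, 1, 1))          # 36.0, unbuffered chain of 8 switches
print(buffered_delay(8, 1, 1, 4, 2))  # 24.0, one buffer after every 4 switches
```

Where the buffers go is the architectural decision: too few and the quadratic term dominates, too many and the buffer delay and area dominate.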
Segmentation

- Segmentation distribution: how many of each length?
- Longer length
  - Better performance? 😊
  - Reduced routability? 😞
Modern CLB (V5): Slices

- More hierarchy in current devices
- Slices are complex. Multiple slices communicate with the switch matrix

Source: Brad Hutchings, BYU
Virtex 5 Slice

More complex LUT (6 input)

Source: Brad Hutchings, BYU
Implementing Memory on FPGAs

- A 4-input LUT stores 16 bits of information
- LUTs can be chained together through the programmable network to build larger memories.
- The address decoder and output multiplexer this requires are an issue.
- Flexibility is a key aspect.
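
The chaining idea can be modeled behaviorally: each 4-input LUT acts as a 16x1 RAM, a decoder on the upper address bits picks which LUT is written, and a multiplexer picks which LUT output is read. The classes `Lut16x1` and `ChainedRam` below are hypothetical names for this sketch, which ignores timing and routing entirely.

```python
class Lut16x1:
    """A 4-input LUT used as a 16-word, 1-bit memory."""
    def __init__(self):
        self.bits = [0] * 16

    def read(self, addr4):
        return self.bits[addr4]

    def write(self, addr4, bit):
        self.bits[addr4] = bit & 1

class ChainedRam:
    """A deeper 1-bit memory built from several 16x1 LUTs."""
    def __init__(self, depth):
        assert depth % 16 == 0, "depth must be a multiple of 16"
        self.luts = [Lut16x1() for _ in range(depth // 16)]

    def write(self, addr, bit):
        sel, low = divmod(addr, 16)  # decoder: upper bits enable exactly one LUT
        self.luts[sel].write(low, bit)

    def read(self, addr):
        sel, low = divmod(addr, 16)
        outputs = [lut.read(low) for lut in self.luts]  # every LUT sees the low bits
        return outputs[sel]                             # output multiplexer

ram = ChainedRam(64)   # four LUTs
ram.write(37, 1)
print(ram.read(37), ram.read(21))  # 1 0
```

The decoder and multiplexer in `write` and `read` are the extra logic that has to be built from other LUTs or routing, which is the overhead the slide points to.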
Xilinx XC4000 Series Devices

- Ideal for small data storage
- Register Files
- Coefficient storage
- No wasted space

16x2 (or 16x1) Edge-Triggered Single-Port RAM
Xilinx XC4000 Dual Port Mem

- Access data concurrently.
- Fine-grained access
- Synchronous access
Coarse-grained Memory

- Special large blocks of SRAM found in FPGA array
- Allow for efficient implementation of memory – predictable performance
- Six transistor SRAM cell.
Xilinx Block Memory

- Each memory block is 4 CLBs high
- 4096 bit SRAMs.
- Can be implemented in different aspect ratios.
- Need to address performance.

*Figure 6: Dual-Port Block SelectRAM*
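
To make the aspect-ratio point concrete, the sketch below lists the depth-by-width shapes a 4096-bit block could take. The set of data widths (1, 2, 4, 8, 16) is an assumption chosen for illustration, and `aspect_ratios` is a hypothetical helper, not a vendor API.

```python
BLOCK_BITS = 4096  # block size taken from the slide

def aspect_ratios(total_bits, widths=(1, 2, 4, 8, 16)):
    """List the (depth, width) shapes that use every bit of the block."""
    return [(total_bits // w, w) for w in widths if total_bits % w == 0]

for depth, width in aspect_ratios(BLOCK_BITS):
    addr_bits = depth.bit_length() - 1  # valid because depth is a power of two here
    print(f"{depth}x{width}: {addr_bits} address bits, {width} data bits")
```

Each step that doubles the width drops one address bit, so the same physical block can serve as a deep, narrow memory or a shallow, wide one.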
Figure 2–1. Stratix IV LAB Structure

Courtesy: Brad Hutchings
Stratix-4 ALM (LE)

Figure 2–5. High-Level Block Diagram of the Stratix IV ALM

Figure contents: two combinational/memory ALUTs (ALUT0 and ALUT1), each built around a 6-input LUT; data inputs dataf0, datae0, dataa, datab, datac, datad, datae1, dataf1; chain inputs shared_arith_in, carry_in, reg_chain_in, labclk; chain outputs shared_arith_out, carry_out, reg_chain_out; four outputs to general or local routing.

Courtesy: Brad Hutchings
Inside the EAB - Altera

- Embedded array highly optimized
- Address and data can be latched for fast performance.
- Scalable to even larger sizes.

Figure 4. FLEX 10K Embedded Array Block

Figure contents: EAB local interconnect, dedicated inputs and global signals, chip-wide reset, row interconnect, column interconnect; data bus widths of 2, 4, 8, 16.
Inside the ESB

- Embedded System Blocks can be configured as either memory or PLA.
- Multiple levels of hierarchy.

Figure 20. ESB in Read/Write Clock Mode

Growth Rate of Memory

- Approximately 2400 transistors per CLB
  - (1200 per LUT) for XC4000-like implementation (32x1 SRAM)
- Six transistors per cell for Altera SRAM (2K per EAB)

<table>
<thead>
<tr>
<th>Size</th>
<th>Altera 10K EABs</th>
<th>Xilinx 4000E CLBs</th>
<th>Altera 10K transistors</th>
<th>Xilinx 4000E transistors</th>
</tr>
</thead>
<tbody>
<tr>
<td>32x1</td>
<td>1</td>
<td>1</td>
<td>12288</td>
<td>2400</td>
</tr>
<tr>
<td>32x8</td>
<td>1</td>
<td>8</td>
<td>12288</td>
<td>19200</td>
</tr>
<tr>
<td>128x8</td>
<td>1</td>
<td>32</td>
<td>12288</td>
<td>76800</td>
</tr>
<tr>
<td>512x8</td>
<td>2</td>
<td>128</td>
<td>24576</td>
<td>307200</td>
</tr>
</tbody>
</table>

For a 512x8 memory, the fine-grained implementation requires more than 10X the transistors (307200 vs. 24576, about 12.5X)
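
The table entries follow directly from the per-block figures on this slide, as the sketch below shows. It assumes a purely bit-count model (2048 bits and six transistors per cell for an EAB, 32 bits and 2400 transistors for a CLB) and ignores the legal width configurations of each block; `memory_cost` is a hypothetical helper name.

```python
import math

EAB_BITS = 2048        # bits per Altera 10K EAB
T_PER_SRAM_CELL = 6    # transistors per EAB SRAM cell
CLB_BITS = 32          # XC4000-like CLB used as a 32x1 RAM
T_PER_CLB = 2400       # transistors per CLB

def memory_cost(words, width):
    """Blocks and transistors needed for a words x width memory."""
    bits = words * width
    eabs = math.ceil(bits / EAB_BITS)
    clbs = math.ceil(bits / CLB_BITS)
    return {
        "eabs": eabs,
        "clbs": clbs,
        "eab_transistors": eabs * EAB_BITS * T_PER_SRAM_CELL,
        "clb_transistors": clbs * T_PER_CLB,
    }

for words, width in ((32, 1), (32, 8), (128, 8), (512, 8)):
    cost = memory_cost(words, width)
    ratio = cost["clb_transistors"] / cost["eab_transistors"]
    print(f"{words}x{width}", cost, f"ratio {ratio:.2f}")
```

For small memories the CLB version is cheaper because a whole EAB is consumed regardless of how few bits are used; the crossover comes once the memory fills a meaningful share of an EAB.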
Stratix V – State-of-the-Art Memory (Lewis – FPGA’13)

• Increasing amounts of memory per device
• Results show that a single uniform memory block size (20 kb) is better
• Evaluation over various ratios of logic to memory
Stratix V – New Features (Lewis FPGA’13)

- Consider clock skewing: a technique to balance pipeline stages
- The clock signal is locally delayed to shift the arrival time of the rising clock edge
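
A small numeric sketch shows how skewing a clock can balance two unequal pipeline stages. It assumes a three-register pipeline FF1 -> A -> FF2 -> B -> FF3 in which only FF2's clock is delayed, ignores setup, hold, and clock-to-output times, and uses made-up stage delays and a made-up list of available skew settings; the function names are hypothetical.

```python
def stage_period(d_a, d_b, skew):
    """Minimum clock period when FF2's clock arrives `skew` later than
    the clocks of FF1 and FF3."""
    # Stage A gains `skew` of extra time; stage B loses the same amount.
    return max(d_a - skew, d_b + skew)

def best_skew(d_a, d_b, choices):
    """Pick the available skew setting that gives the shortest period."""
    return min(choices, key=lambda s: stage_period(d_a, d_b, s))

d_a, d_b = 6.0, 4.0                  # assumed stage delays in ns
choices = (0.0, 0.5, 1.0, 1.5)       # assumed programmable delay steps in ns
s = best_skew(d_a, d_b, choices)
print(stage_period(d_a, d_b, 0.0))   # 6.0 with no skew
print(s, stage_period(d_a, d_b, s))  # 1.0 5.0 with the best setting
```

The slow stage borrows time from the fast one, so the period drops from the longest stage delay toward the average of the two.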
Summary

• Three basic types of FPGA devices
  - Antifuse
  - EEPROM
  - SRAM

• Key issues for SRAM FPGA are logic cluster, connection box, and switch box.

• Latest advances examine performance and routability.

• Newer FPGAs require large amounts of RAM.
  - Trends indicate uniform blocks
  - Experimentation over many benchmarks is key