Driving a 32-Bit RISC Processor in an FPGA
By Yanzhe Liu and Greg Kahlert
GDA Technologies (San Jose,
CA) is a design-for-hire engineering services firm that
specializes in ASIC designs. Increasingly, however, our
ASIC customers want to prototype in FPGAs before committing
to silicon. We had a recent contract to help a client achieve
a high-speed CPU design in an FPGA. What we learned from
the engagement may be helpful to others attempting a similar
development.
Our client contracted us to port an optimized 32-bit RISC
processor into a Xilinx (San Jose, CA) Virtex XCV 1000 FPGA,
where it had to run at 75 MHz or faster and occupy less than
40 percent of the device area. The final design ended up
consuming about 80k gates of the 200k gates in the FPGA. We
worked with the customer's specified intellectual property
(IP) core vendor, Lexra Inc. (San Jose, CA). We ported Lexra's
LX4189 processor core to the XCV 1000 in two months' time.
Lexra optimized the IP core to better fit into an FPGA. The
final design beat the 75-MHz goal: we were able to demonstrate
an 80-MHz clock speed.
We used Xilinx's Alliance Version 3.1 tool kit for hierarchical
block-based place and route. During the course of the design
effort, Xilinx offered numerous suggestions as to how we
could use their tools to make the design faster, including
incremental compilation, multi-pass runs, and a hierarchical
methodology. The multi-pass runs, for example, each produce
a different result, and we were able to choose the one that
best met our objective. We also used Amplify, an RTL
floorplanning tool from Synplicity, Inc. (Sunnyvale, CA),
as well as their FPGA synthesis tool, Synplify.
The original RTL of the database was coded for an ASIC
implementation, not for instantiation into an FPGA. In addition,
the configuration for the core was also tailored for an
ASIC, not an FPGA environment. In attacking the problem,
we ran the LX4189 RTL code through the tools to get a data
point from which to begin our work. Using the Amplify floorplanner
and the Synplify synthesis tool, we created a netlist that
we supplied to the Xilinx place-and-route tool to produce
a layout in the Virtex XCV 1000.
[Figure: Core Configured for ASIC. The original RTL of the
database was coded for an ASIC implementation.]
The resulting first layout achieved a 50-MHz clock speed;
the delay through its worst path was 20 nanoseconds. The rule
of thumb for FPGAs is that, ideally, 60 percent of the delay
should be in logic and 40 percent should be in routing. Our
first pass put only 30 percent of the delay in logic and
70 percent in routing. The plan was to squeeze the routing
delay down into the range of the logic delay to achieve a
worst-case total path delay of about 12 nanoseconds, which
translates into a 75 to 80 MHz clock speed.
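The arithmetic behind this timing budget can be checked with
a short script. The Python sketch below is illustrative only:
the function and variable names are assumptions made for this
example and are not part of any tool flow. It splits the 20-ns
worst path by the 30/70 percentages quoted above and assumes,
as the plan states, that routing delay can be brought down to
match the logic delay.

```python
def mhz_from_period_ns(period_ns):
    """Highest clock frequency (MHz) a worst-case path delay (ns) allows."""
    return 1000.0 / period_ns


def period_ns_from_mhz(freq_mhz):
    """Clock period (ns) required to run at a given frequency (MHz)."""
    return 1000.0 / freq_mhz


# Figures taken from the first-pass layout described in the text.
first_pass_ns = 20.0
logic_pct, routing_pct = 30, 70

logic_ns = first_pass_ns * logic_pct / 100      # delay spent in logic
routing_ns = first_pass_ns * routing_pct / 100  # delay spent in routing

# Assumption from the plan: routing delay is squeezed down to the logic delay.
target_ns = logic_ns + logic_ns

print(f"first pass : {first_pass_ns:.1f} ns -> {mhz_from_period_ns(first_pass_ns):.1f} MHz")
print(f"logic      : {logic_ns:.1f} ns, routing: {routing_ns:.1f} ns")
print(f"target     : {target_ns:.1f} ns -> {mhz_from_period_ns(target_ns):.1f} MHz")
print(f"75 MHz needs a period of {period_ns_from_mhz(75.0):.2f} ns or less")
print(f"80 MHz needs a period of {period_ns_from_mhz(80.0):.2f} ns or less")
```

Under these assumptions the logic delay is 6 ns and the routing
delay is 14 ns, so matching routing to logic gives 12 ns. That
is slightly shorter than the 12.5-ns period an 80-MHz clock
requires, which leaves a small margin over the stated goal.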
To begin improving the clock speed toward our 75-MHz target,
we evaluated the paths that were causing the greatest delay.
Our initial analysis of the core showed complex data paths
containing chains of multiplexors. These produced large net
delays when implemented in the FPGA. In fact, 70 to 75 percent
of the net delays could be attributed to the data path.
[Figure: Core with Complex Paths. The complex data paths
appeared in the initial analysis and contained chains of
multiplexors.]
We then looked at some obvious problems that might be
preventing us from achieving a higher clock frequency. One
was multiplexor implementation, and in particular whether
a tri-state MUX is better than some other form of MUX. We
made use of the BUFTs found on Virtex CLBs (configurable
logic blocks). BUFTs are 3-state buffers that drive dedicated,
segmented horizontal-routing resources.
Fanout was another problem area we looked into. By reducing
the fanout of heavily loaded nets, we reduced the delay
through a number of critical paths. However, we reached a
point of diminishing returns where reducing fanout further
increased, rather than decreased, delay. We learned this
when we placed fanout constraints into the synthesis tool:
to meet them, the tool inserted additional gates, and those
extra gates increased the delay.
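One way to see why a tighter fanout limit can backfire is to
count the buffer stages a limit forces into a path. The Python
sketch below is a toy model, not a description of how Synplify
actually works: the function name, the 64-load net, and the
assumption that the tool builds a balanced buffer tree are all
hypothetical and chosen only for illustration.

```python
def buffer_levels(loads, max_fanout):
    """Extra buffer stages needed so no net drives more than max_fanout pins.

    Toy model: the source drives up to max_fanout pins directly; beyond
    that, a balanced tree of buffers is assumed, each driving max_fanout pins.
    """
    if max_fanout < 2:
        raise ValueError("max_fanout must be at least 2")
    levels = 0
    reach = max_fanout  # loads reachable with `levels` buffer stages
    while reach < loads:
        reach *= max_fanout
        levels += 1
    return levels


# Hypothetical net with 64 loads, under progressively tighter limits.
for limit in (64, 16, 8, 4, 2):
    print(f"max fanout {limit:2d}: {buffer_levels(64, limit)} added stage(s)")
```

In this model, every added stage is one more gate in series on
the path. Lowering the limit lightens the load on each net but
lengthens the chain of gates, so past some point the added
gate delay outweighs the savings.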
The Amplify floorplanning tool produced two large blocks,
co-processor 0 (CP0) and RPA, that it then placed within the
FPGA. RPA represents the arithmetic logic unit and instruction
execution pipeline logic of the core. During the design
process, we laid out each block independently of the other,
aiming to put the two together once each block was as close
to 75 MHz as possible.
Of the two blocks, CP0 had the larger number of slow paths,
with timing in the range of 48 MHz to 50 MHz. With the help
of Amplify, we improved the result from 50 MHz to about
66 MHz, but beyond 66 MHz, even with Amplify, it was difficult
to improve the timing any further. Therefore, we focused our
attention on fixing critical paths in both blocks. At Xilinx's
suggestion, we replaced a critical group of multiplexors with
tri-state multiplexors.
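The difference between the two multiplexor styles can be
sketched behaviorally. The Python model below is an
illustration only, not the LX4189 RTL: the function names are
hypothetical, and the model assumes one-hot enables on the
tri-state bus. A cascade of 2:1 multiplexors puts up to N-1
stages in series, while a tri-state bus places a single 3-state
buffer between any input and the shared line.

```python
from typing import Optional, Sequence


def mux_chain(data: Sequence[int], sel: int) -> int:
    """N:1 mux built as a cascade of 2:1 muxes (one mux per loop pass)."""
    out = data[0]
    for i in range(1, len(data)):
        out = data[i] if sel == i else out
    return out


def tristate_bus(data: Sequence[int], enables: Sequence[bool]) -> Optional[int]:
    """Shared line driven by 3-state buffers; at most one enable may be set."""
    drivers = [d for d, en in zip(data, enables) if en]
    if len(drivers) > 1:
        raise ValueError("bus contention: more than one buffer enabled")
    return drivers[0] if drivers else None  # None models a floating bus


def one_hot(sel: int, width: int) -> list:
    """Decode a select index into one-hot buffer enables."""
    return [i == sel for i in range(width)]


inputs = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]

# Both structures select the same value for every select code.
for sel in range(len(inputs)):
    assert mux_chain(inputs, sel) == tristate_bus(inputs, one_hot(sel, len(inputs)))

print("worst-case 2:1 mux stages in the chain:", len(inputs) - 1)
print("3-state buffers between an input and the bus:", 1)
```

The model shows only functional equivalence and stage count.
Actual delay depends on how the tools map and route each
structure, and the one-hot enable decoding adds logic of its
own.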
By identifying a set of paths with timing violations and
selectively replacing multiplexors in those paths with
tri-state multiplexors, we were able to raise the timing of
the entire design to 80 MHz. Achieving an 80-MHz design was
a significant milestone, since it represents about a third
of the clock speed of the processor in our ASIC
implementation. As for the size of the completed design, it
occupied 12 out of 96 BLOCKRAMs (12 percent), 1505 out of
12,288 SLICEs (12 percent), and 448 out of 12,544 TBUs
(3 percent) of the Xilinx XCV 1000.
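The utilization percentages follow directly from the resource
counts. The short Python sketch below recomputes them; the
dictionary and variable names are assumptions made for this
example, while the counts are the ones reported above.

```python
# Resource counts reported for the completed design: (used, available).
resources = {
    "BLOCKRAMs": (12, 96),
    "SLICEs": (1505, 12288),
    "TBUs": (448, 12544),
}

whole_percent = {}
for name, (used, total) in resources.items():
    exact = 100 * used / total
    whole_percent[name] = 100 * used // total  # fraction dropped
    print(f"{name:9s} {used:5d} / {total:5d} = {exact:.1f}% "
          f"(whole percent: {whole_percent[name]})")
```

The whole-number figures quoted in the text match these values
with the fractional part dropped rather than rounded.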