Hierarchical ASIC Design at Three Million Gates and Above
DSM processes demand physical and logical design - in that order.
By Toshio Aizawa and Ravi Thummarukudy
Today, deep-submicron (DSM)
manufacturing processes are able to integrate tens of millions
of gates of logic and other components onto a single chip.
The availability of large silicon real estate along with
reusable intellectual property (IP) blocks such as CPUs,
DSPs, and other peripheral components is fueling the development
of complete systems on a single IC - true systems on a chip.
Nevertheless, the automation tools lag far behind in their
ability to design and implement designs of this complexity.
Many of the popular EDA tools available today break down
around two to three million gates, even with the highest
possible main memory allowed by an operating system. The
inadequacy of current EDA tools calls for improved methodologies
so that today's tool set can be used to handle these bigger
designs.
In addition, due to increasing market demands, the time
that designers have to complete a design is decreasing at
a dramatic rate as well. The integration of large designs
in DSM technologies poses several challenges for the designer.
Some challenges are related to large design sizes - large
netlists, SDF files, prohibitive computer run times, and
hardware requirements - and others are related to reliability
issues such as signal integrity, electromigration, hot electron,
and crosstalk. The design challenges around DSM and system
on a chip (SOC) are intertwined and need to be addressed
concurrently.
Besides reliability and design size challenges, additional
problems occur with the physical implementation of large
ICs in DSM processes including accurate modeling and analysis
of parasitics such as interconnect resistance and capacitance,
as well as timing closure between logic and physical design
phases. The difficulty in estimating the appropriate die
size for a particular design due to unpredictable interconnect
characteristics (parasitic delays and routability) is also
presenting a growing headache.
The hierarchical, or divide-and-conquer, approach is best
suited for tackling huge gate count designs. Hierarchical
methodology has been in practice for almost a decade in
the logic design domain. This approach has allowed designers
to partition the design and to allocate individual modules
and design constraints to team members who can then work
concurrently to complete the design within a fast turn-around
time. The hierarchical approach also allows for the reuse
of legacy designs in future designs - hardware description
languages (HDLs) and the availability of synthesis tools
have helped to make this a very efficient methodology as
well. Today the hierarchical approach is so popular that
no logic designer would think of attempting a complex design
without it.
Today, a multi-million gate ASIC design, though not a rarity,
continues to present a challenging problem to ASIC designers
everywhere. The hierarchical methodology described here
is implemented in NEC's (Santa Clara, CA) OpenCAD environment
for their 0.25-µm, and below, standard cell products.
The flow incorporates NEC's proprietary design tools as
well as commercial EDA tools. The implementation of this
hierarchical, timing-driven ASIC design methodology will
handle designs greater than three million gates.
Hierarchical design methodology
In the VLSI design context, the top-level design is partitioned
into physical hierarchies or blocks before they are designed
concurrently. There are two approaches possible, the bottom-up
approach or the top-down approach. In the bottom-up approach,
the design's physical hierarchies are completely laid out
and then integrated at the top level. The main drawback
of this approach is that the individual blocks are laid
out without complete knowledge of the other blocks or
the top-level requirements. When the individual blocks are
integrated at the top level, they may not meet the chip's
timing, clock, or signal integrity requirements, thus forcing
the designer to redesign one or several of the individual
blocks. Any redesign at this time can cause significant
delays for tape-out. Additionally, while the individual
blocks are laid out, block-level pin optimization for an
optimal layout may not be possible, as the floorplan and
pin locations for other blocks may not be available. These
problems are solved in the top-down approach since in this
approach the top-level layout is completed (including clock-tree
synthesis (CTS), scan, BIST, and JTAG) ahead of the design
of the lower-level blocks. A timing budget for the block
level is then generated such that the individual blocks
work in the context of the top-level design constraints.
The timing budgeting is possible even when the lower level
blocks aren't yet designed. Once the top-level design is
finished and the constraints for the individual blocks are
complete, block-level design can be carried out concurrently,
reducing the turn-around time as well as guaranteeing
that the design will work correctly the first time when
integrated at the top level (see Figure 1).
Figure 1 - Hierarchical design flow
Divide and conquer is the order of the day within hierarchical design.
The timing budget decides the constraints for concurrent design.
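The timing-budget step can be illustrated with a small sketch. The article does not say how the budget is computed, so the proportional split below is an assumption, as are the function name `budget_path`, the segment names, and the numbers. It divides the usable clock period of one inter-block path among its segments in proportion to their early delay estimates.

```python
def budget_path(clock_period_ns, estimates_ns, margin_ns=0.0):
    """Split the usable clock period across path segments.

    Assumption: each segment receives a share proportional to its
    estimated delay. Real budgeting tools may use other policies.
    """
    usable = clock_period_ns - margin_ns
    if usable <= 0:
        raise ValueError("margin leaves no usable time in the clock period")
    total = sum(estimates_ns.values())
    if total <= 0:
        raise ValueError("estimated delays must sum to a positive value")
    return {name: usable * est / total for name, est in estimates_ns.items()}


# Hypothetical path: register in block A -> top-level wire -> register in block B.
budgets = budget_path(
    clock_period_ns=10.0,
    estimates_ns={"block_a_output": 2.0, "top_level_wire": 1.0, "block_b_input": 1.0},
    margin_ns=2.0,
)
```

Each block team can then treat its share as an I/O constraint even though the neighboring blocks are not yet designed.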
NEC's implementation of this
methodology also incorporates the timing-driven feature
of the layout tool. In this approach, the design constraints
are forward annotated to the place-and-route tool so that
it can meet the required timing characteristics without
going through tedious timing fix/ECO loops. The optimization
capability inside the timing-driven place and route performs
the gate resizing, repeater insertion, and buffer/inverter
insertion/deletion needed to meet the required timing specs
for the block level as well as the chip level.
You have to start at the top (level)
The fraction of a design that is actually
new continues to diminish over the years as a good chunk
of designs today consist of pre-characterized blocks. The
foremost task in the top-level design phase is to partition
the design into a physical hierarchy. This partitioning
is done with the connectivity and timing requirements of
the complete chip in mind. At the top level, only the I/O
blocks and hard macros (such as memories and the partitioned
blocks) are present. Flip-flops at the top-level aren't
recommended, while hard macros such as memories can reside
in the top level or block level. The physical partitioning
of the design will make it easier to guarantee better inter-block
- as well as intra-block - timing. For example, if particular
portions of the design need to work at higher frequencies
compared to the rest of the design, grouping them in one
physical block will reduce the wire lengths during final
route. Register banks are normally added to the physical
hierarchy blocks to guarantee the timing (see Figure 2).
Once the top-level partition is completed,
the design is then floorplanned. During the floorplan stage
the pins are assigned, top level I/O buffers and hard macros
such as CPU blocks, memory (RAM/ROM) blocks, and PLLs are
placed. The location, size, and shape of each physical hierarchy
are determined for the most efficient routability. The location
of the pins on the physical macros is identified with routability
in mind, a task the floorplan tools perform automatically.
Now the BIST, JTAG, scan structures, and macro-test bus
structures are placed at the top level to take into account
the overheads due to these features. The idea is to get
as close as possible to the final top-level design.
This physical partitioning also helps in
isolating test structures or clock domains better than in
a flat design. We can keep the scan chains or clock
trees restricted to a particular physical partition for
better efficiency.
Top-level route
The second piece of the top-down,
hierarchical approach is to create timing constraints from
the logic synthesis tool and forward annotate them to the
place-and-route engine. The place-and-route engine can
then complete the placement and routing based on these constraints.
The success of this approach is due to the early estimation
of parasitics, forward annotation of design constraints,
and bi-directional flow of information from physical design
to logic design - each loop progressively reducing the timing
divergence. If implemented well, this approach could substantially
reduce the uncertainty of the top-level design process.
Figure 2 - Top-level design
Partitioning the design into a physical hierarchy is the most important
task at the top level of the design flow.
Once the top-level design is completed, the individual blocks
at this level (which are physical hierarchies) are then
characterized as block library elements. Some of the blocks
at this stage are black boxes, as the register transfer
level (RTL) descriptions for them don't yet exist. A black
box definition for such blocks is developed using the physical
hierarchy it represents. The pin attributes, I/O timings,
set-up/hold timings, and clock requirements are specified
for each block. Input capacitance, input slew rate, output capacitance,
and output slew rate are also defined. Identifying the
critical pins on the block-level design at this point (even
though the block level design doesn't exist yet) will enable
the designers to place appropriate timing constraints for
them.
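As a rough illustration of what such a black-box definition might hold, the sketch below records the per-pin attributes listed above in a plain data structure. The class names, field names, units, and example values are all hypothetical. They do not reflect the format of NEC's OpenCAD libraries or of any commercial tool.

```python
from dataclasses import dataclass, field


@dataclass
class PinSpec:
    """Hypothetical boundary description of one block pin."""
    name: str
    direction: str                 # "input" or "output"
    clock: str                     # reference clock for the timing numbers
    capacitance_pf: float          # input or output capacitance
    slew_ns: float                 # expected input slew or output slew
    setup_ns: float = 0.0          # meaningful for inputs
    hold_ns: float = 0.0           # meaningful for inputs
    clock_to_out_ns: float = 0.0   # meaningful for outputs
    critical: bool = False         # flagged early so constraints can be tightened


@dataclass
class BlackBoxBlock:
    """A block whose RTL does not exist yet, known only by its boundary."""
    name: str
    pins: list = field(default_factory=list)

    def critical_pins(self):
        """Names of the pins marked as timing critical."""
        return [p.name for p in self.pins if p.critical]


dsp = BlackBoxBlock("dsp_core", [
    PinSpec("data_in", "input", "clk_sys", 0.05, 0.3, setup_ns=1.2, hold_ns=0.1),
    PinSpec("data_out", "output", "clk_sys", 0.10, 0.4, clock_to_out_ns=2.5, critical=True),
])
```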
Subsequently, the top-level design layout
is completed, including power routing, placement, clock
balancing, placement optimization, final routing, and max-load
fixing. Repeaters are inserted wherever required. Even though
clock balancing is performed at the top level, it isn't
possible to estimate the insertion delays and skew at this
point, because the clock-tree synthesis for the lower-level
blocks hasn't been completed. Nevertheless, the root clock-buffer
placement and clock-line wiring is finished at this stage.
While doing placement at the top level, caution must be
taken not to place instances overlapping with the physical
partitions. The overlap in placement will make it difficult
to move a particular physical partition within the top-level
layout. On the other hand, this restriction increases
the congestion and decreases the top-level routability. After
top-level routing, congestion analysis needs to be performed,
and if the results aren't promising, alternate floorplans
can be attempted.
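The placement restriction described above amounts to a rectangle-overlap check between top-level instances and the physical partitions. The sketch below is a minimal version of that check, assuming axis-aligned rectangles. The `Rect` type, the instance names, and the coordinates are illustrative only.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle: lower-left (x0, y0) and upper-right (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the interiors intersect; sharing only an edge is not an overlap."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def find_overlaps(instances, partitions):
    """Return sorted (instance, partition) name pairs whose rectangles overlap."""
    return sorted(
        (inst_name, part_name)
        for inst_name, inst in instances.items()
        for part_name, part in partitions.items()
        if overlaps(inst, part)
    )


partitions = {"cpu_block": Rect(0, 0, 100, 100)}
instances = {
    "repeater_1": Rect(90, 90, 110, 110),  # straddles the partition corner
    "repeater_2": Rect(100, 0, 120, 20),   # abuts the partition edge only
}
violations = find_overlaps(instances, partitions)
```

Here only `repeater_1` is reported, since `repeater_2` merely touches the partition boundary.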
Now, design checks for the top level are
performed. Since some of the block-level designs don't exist
at this time, only the block boundaries are checked. Additionally,
clock-tree checks, connectivity checks, shape checks, and
antenna checks are done. Repairs are performed if problems
are identified. We need to wait for the full chip integration
- including the physical partitions - for a complete design
check and analysis, however.
Once the layout is completed, parasitic
extraction and delay calculations are performed. The delays
are calculated for all variations of signal integrity as
well. Layout verification for connectivity and geometry
violations is also performed. Signal integrity checks, as
well as checks for electromigration and hot electron, are
also performed. Signal integrity repair is attempted using
either placement techniques or routing techniques. The delays
are calculated one more time for the top level. Once the
top-level timing is satisfactory, the individual blocks
are characterized for the appropriate I/O timings.
At this point, we have the overall top-level
design completed, which will meet both the timing and the
signal integrity requirements. A timing and area budget
is available for each of the physical partitions. Logic
designers can now design these blocks in RTL according to
the specifications. One recommendation is to keep these
top-level blocks under 500K gates. Thus, for a 10-million
gate design, we may need a minimum of 20 blocks. (Typical
designs of this size have several instances of pre-defined
hard macros though.) A 500K block is still manageable, however,
for a flat layout at the block level.
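The block-count arithmetic above is a ceiling division. The sketch below reproduces it. The `hard_macro_gates` parameter is an assumption added to reflect the remark about pre-defined hard macros, which would not need to be partitioned into new blocks.

```python
def min_blocks(total_gates, max_gates_per_block=500_000, hard_macro_gates=0):
    """Smallest number of blocks that keeps each block at or under the gate limit.

    hard_macro_gates (an assumption, not from the article's arithmetic) is the
    portion of the design already covered by pre-defined hard macros.
    """
    soft_gates = total_gates - hard_macro_gates
    if max_gates_per_block <= 0 or soft_gates < 0:
        raise ValueError("gate counts must be non-negative and the limit positive")
    return -(-soft_gates // max_gates_per_block)  # integer ceiling division


blocks_flat = min_blocks(10_000_000)
blocks_with_macros = min_blocks(10_000_000, hard_macro_gates=2_000_000)
```

With the article's numbers, `blocks_flat` is 20. If 2 million gates were already in hard macros, 16 new blocks would suffice.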
Yet another benefit of the hierarchical
approach is that it follows a very similar methodology for
the top-level design, as well as the block design. Also,
once the design is partitioned into a few manageable blocks,
engineers can work concurrently on them, reducing the elapsed
time for the completion of the design. EDA tools work much
better at these gate counts when compared to gate counts
of two to five million gates.
Block-level design
Block-level design includes the following
steps: floorplanning, RTL design, synthesis, DFT insertion,
and timing-driven layout including clock-tree synthesis.
For better timing assurance, it's recommended
that a register bank be added at the interfaces. During
the clock-tree synthesis process, clock buffers are added
to the block-level netlist. These buffers connect directly
to the top-level clock buffer. Gated clock-tree expansion
can be done at this stage as well. Logic synthesis is
performed based on the block-level constraints extracted
at the top level. Since the gate count isn't known at this
time, an automatic wire-load selection is assumed. Setup
times need to be checked in the pre-layout stage using the MAX
library. Hold checks are performed during timing-driven
layout or after the layout stage using the MIN library.
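The reason setup is checked against the MAX library and hold against the MIN library can be seen from the usual slack expressions for a register-to-register path. The sketch below uses the textbook single-cycle formulas. The parameter names and values are illustrative, and the skew convention (capture clock arrival minus launch clock arrival) is an assumption.

```python
def setup_slack(period_ns, clk_to_q_max_ns, path_max_ns, setup_ns, skew_ns=0.0):
    """Setup slack with slowest-corner (MAX) delays; negative means a violation."""
    return period_ns + skew_ns - (clk_to_q_max_ns + path_max_ns + setup_ns)


def hold_slack(clk_to_q_min_ns, path_min_ns, hold_ns, skew_ns=0.0):
    """Hold slack with fastest-corner (MIN) delays; negative means a violation."""
    return (clk_to_q_min_ns + path_min_ns) - (hold_ns + skew_ns)


# Hypothetical numbers for one path inside a block.
setup = setup_slack(period_ns=10.0, clk_to_q_max_ns=0.5, path_max_ns=8.0, setup_ns=0.25)
hold = hold_slack(clk_to_q_min_ns=0.25, path_min_ns=0.5, hold_ns=0.125)
```

Setup depends on the longest data path, which is why wire-load estimates suffice before layout. Hold depends on the shortest path and on clock skew, so it is meaningful only once the clock tree and routing exist.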
After synthesis, scan insertion is performed.
Once this is finished, delay estimation of the block netlist
is initiated. Since the input slew rate and output loading
aren't known exactly for each block, there can be differences
in the timing characteristics of the block before and after
block design. The extent of the changes required at the
top level depends upon the divergence of the delay results.
Once the SDF file is generated, timing analysis, clock-insertion
delay, and clock-skew values can be estimated. Gate-level
verification is also performed at this stage. Any verification
of pre-scan, post-scan or pre-CTS/post-CTS could be performed
using a formal verification tool.
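A minimal sketch of the clock estimates mentioned above, assuming the clock arrival time at each flip-flop has already been pulled from the delay data. The function name, sink names, and values are hypothetical. Insertion delay is taken here as the latest arrival and skew as the spread between the latest and earliest arrivals.

```python
def clock_metrics(sink_arrivals_ns):
    """Summarize clock arrival times (root to sink) for one clock domain."""
    if not sink_arrivals_ns:
        raise ValueError("at least one clock sink is required")
    latest = max(sink_arrivals_ns.values())
    earliest = min(sink_arrivals_ns.values())
    return {"insertion_delay_ns": latest, "skew_ns": latest - earliest}


metrics = clock_metrics({"ff_a": 1.5, "ff_b": 1.75, "ff_c": 1.25})
```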
The block design is checked for connectivity
errors, clock-tree connections, electromigration, and hot
electron. Scan connection checks are also performed. Now,
the block-level layout is performed including placement,
power routing, placement optimization, routing, and fan-out
fixes. Once these tasks are complete, RC extraction starts
at the block level. Before the layout is verified, delay
calculation is performed, including an accounting for the
added effects of signal integrity variations. Following
the delay calculation step, electromigration and antenna
checks for the block level are run. Any issues
identified at this stage are repaired using placement optimization
or using re-routing.
Now timing analysis is performed at the
macro level to meet the block-timing budget. Once the individual
blocks are completed, the block-level models are regenerated
and a top-level timing analysis can be completed. With the
timing analysis report, the block libraries are recharacterized
(see Figure 3).
Figure 3 - Full-chip integration
After top-level timing analysis and library recharacterization, block-level
design ends and the full chip is ready for integration.
Integrating a full chip
Once all of the block-level functions are
completed, the design is ready for the final integration.
The final netlist is extracted from layout and top-level
delay calculation is initiated accordingly. The SDF file
is modified for signal integrity variations. Layout verification
is done along with EM checks and antenna checks. The designer
then verifies the design functionally
using a formal verification tool. If required, signal integrity
fixes can also be performed. Then, to guarantee
connectivity and full-chip clock topology, DRCs are run.
Static timing analysis is run once again to make sure
that all the timing requirements are met. Once the top-level
design is completed, artwork and the tester interface files
are generated and verified.
Besides its ability to reduce design complexity,
the hierarchical method has other benefits as well. In the
hierarchical method, the inter-block timing goal can be
met much more easily than in a flat design. This approach
is also very valuable when you have multiple instances of
the same block at the top level, especially in networking
applications. Once this reusable block is designed and laid out,
it can be placed in multiple locations on the chip with
little effort. In addition, the individual blocks can be
either designed concurrently to reduce the turn-around time
or spaced out over the design cycle to accommodate various
design groups.
Finally, the hierarchical approach improves
the clock balancing when compared with a design completed
in a flat layout.
Clearly, the huge SOC designs in demand
today will require the move to hierarchical methodologies
in the coming era.
Copyright © 2000 Integrated System Design Magazine