Hierarchical ASIC Design at Three Million Gates and Above
DSM processes demand physical and logical design - in that order.
By Toshio Aizawa and Ravi Thummarukudy

Today, deep-submicron (DSM) manufacturing processes are able to integrate tens of millions of gates of logic and other components onto a single chip. The availability of large silicon real estate, along with reusable intellectual property (IP) blocks such as CPUs, DSPs, and other peripheral components, is fueling the development of complete systems on a single IC - true systems on a chip. Nevertheless, the automation tools lag far behind in their ability to implement designs of this complexity. Many of the popular EDA tools available today break down at around two to three million gates, even with the largest main memory allowed by an operating system. The inadequacy of current EDA tools calls for improved methodologies so that today's tool set can be used to handle these bigger designs.
In addition, due to increasing market demands, the time that designers have to complete a design is decreasing at a dramatic rate as well. The integration of large designs in DSM technologies poses several challenges for the designer. Some challenges are related to large design sizes - large netlists, SDF files, prohibitive computer run times, and hardware requirements - and others are related to reliability issues such as signal integrity, electromigration, hot electron, and crosstalk. The design challenges around DSM and system on a chip (SOC) are intertwined and need to be addressed concurrently.

Besides reliability and design size challenges, additional problems occur with the physical implementation of large ICs in DSM processes including accurate modeling and analysis of parasitics such as interconnect resistance and capacitance, as well as timing closure between logic and physical design phases. The difficulty in estimating the appropriate die size for a particular design due to unpredictable interconnect characteristics (parasitic delays and routability) is also presenting a growing headache.

The hierarchical, or divide-and-conquer, approach is best suited for tackling huge gate count designs. Hierarchical methodology has been in practice for almost a decade in the logic design domain. This approach has allowed designers to partition the design and to allocate individual modules and design constraints to team members, who can then work concurrently to complete the design within a fast turn-around time. The hierarchical approach also allows for the reuse of legacy designs in future designs. Hardware description languages (HDLs) and the availability of synthesis tools have helped to make this a very efficient methodology as well. Today the hierarchical approach is so popular that no logic designer would think of attempting a complex design without it.

Today, a multi-million gate ASIC design, though not a rarity, continues to present a challenging problem to ASIC designers everywhere. The hierarchical methodology described here is implemented in NEC's (Santa Clara, CA) OpenCAD environment for their 0.25-µm, and below, standard cell products. The flow incorporates NEC's proprietary design tools as well as commercial EDA tools. The implementation of this hierarchical, timing-driven ASIC design methodology will handle designs greater than three million gates.

Hierarchical design methodology

In the VLSI design context, the top-level design is partitioned into physical hierarchies, or blocks, which are then designed concurrently. Two approaches are possible: bottom-up and top-down. In the bottom-up approach, the design's physical hierarchies are completely laid out and then integrated at the top level. The main drawback of this approach is that the individual blocks are laid out without complete knowledge of the other blocks or the top-level requirements. When the individual blocks are integrated at the top level, the result may not meet the chip's timing, clock, or signal integrity requirements, forcing the designer to redesign one or several of the individual blocks. Any redesign at this stage can cause significant delays for tape-out. Additionally, while the individual blocks are being laid out, block-level pin optimization for an optimal layout may not be possible, because the floorplan and pin locations for the other blocks may not be available. The top-down approach solves these problems, because the top-level layout is completed (including clock-tree synthesis (CTS), scan, BIST, and JTAG) ahead of the design of the lower-level blocks. A timing budget for the block level is then generated so that the individual blocks work in the context of the top-level design constraints. Timing budgeting is possible even when the lower-level blocks aren't yet designed. Once the top-level design is finished and the constraints for the individual blocks are complete, block-level design can be carried out concurrently, reducing the turn-around time as well as guaranteeing that the design will work correctly the first time it is integrated at the top level (see Figure 1).

Figure 1 - Hierarchical design flow
Divide and conquer is the order of the day within hierarchical design. The timing budget decides the constraints for concurrent design.
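
The article does not spell out how a block-level timing budget is derived. As a rough sketch only, one simple scheme subtracts the known top-level delay from the clock period and divides what remains among the blocks on an inter-block path. The function name `budget_path`, the proportional weights, and the sample numbers are all assumptions made for illustration; they are not taken from NEC's flow.

```python
def budget_path(clock_period_ns, top_level_delay_ns, weights, margin_ns=0.0):
    """Split the time left after top-level routing among the blocks on a path.

    weights maps a (hypothetical) block name to its relative share of the
    remaining time. This proportional split is an illustrative assumption.
    """
    available = clock_period_ns - top_level_delay_ns - margin_ns
    if available <= 0:
        raise ValueError("top-level delay leaves no time for the blocks")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: available * w / total for name, w in weights.items()}


# A 10 ns clock with 2 ns already consumed by top-level wiring and repeaters.
budget = budget_path(10.0, 2.0, {"cpu_if": 1, "dma": 3})
print(budget)  # {'cpu_if': 2.0, 'dma': 6.0}
```

Because the budget depends only on the top-level delay and the chosen shares, it can be produced before any block-level RTL exists, which is the property the top-down flow relies on.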

NEC's implementation of this methodology also incorporates the timing-driven feature of the layout tool. In this approach, the design constraints are forward annotated to the place-and-route tool so that it can meet the required timing characteristics without going through tedious timing fix/ECO loops. The optimization capability inside the timing-driven place-and-route tool performs the gate re-sizing, repeater insertion, and buffer/inverter insertion and deletion needed to meet the required timing specs at the block level as well as the chip level.

You have to start at the top (level)

The fraction of a design that is actually new continues to diminish over the years as a good chunk of designs today consist of pre-characterized blocks. The foremost task in the top-level design phase is to partition the design into a physical hierarchy. This partitioning is done with the connectivity and timing requirements of the complete chip in mind. At the top level, only the I/O blocks and hard macros (such as memories and the partitioned blocks) are present. Flip-flops at the top-level aren't recommended, while hard macros such as memories can reside in the top level or block level. The physical partitioning of the design will make it easier to guarantee better inter-block - as well as intra-block - timing. For example, if particular portions of the design need to work at higher frequencies compared to the rest of the design, grouping them in one physical block will reduce the wire lengths during final route. Register banks are normally added to the physical hierarchy blocks to guarantee the timing (see Figure 2).

Once the top-level partition is completed, the design is floorplanned. During the floorplan stage, the pins are assigned, and the top-level I/O buffers and hard macros such as CPU blocks, memory (RAM/ROM) blocks, and PLLs are placed. The location, size, and shape of each physical hierarchy are determined for the most efficient routability. The location of the pins on the physical macros is chosen with routability in mind, a task the floorplan tools perform automatically. Next, the BIST, JTAG, scan structures, and macro-test bus structures are placed at the top level to take into account the overheads due to these features. The idea is to get as close as possible to the final top-level design.

This physical partitioning also helps in isolating test structures or clock domains better than in the flat design. We could keep the scan chains or clock trees restricted to a particular physical partition for better efficiency.

Top-level route

The second piece of the top-down, hierarchical approach is to create timing constraints from the logic synthesis tool and forward annotate them to the place-and-route engine. The place-and-route engine can then complete the placement and routing based on these constraints. The success of this approach is due to the early estimation of parasitics, forward annotation of design constraints, and bi-directional flow of information between physical design and logic design - each loop progressively reducing the timing divergence. If implemented well, this approach can substantially reduce the uncertainty of the top-level design process.

Figure 2 - Top-level design
Partitioning the design into a physical hierarchy is the most important task at the top level of the design flow.

Once the top-level design is completed, the individual blocks at this level (which are physical hierarchies) are characterized as block library elements. Some of the blocks at this stage are black boxes, as the register transfer level (RTL) descriptions for them don't yet exist. A black box definition for such a block is developed using the physical hierarchy it represents. The pin attributes, I/O timings, set-up/hold timings, and clock requirements are specified for each block. Input capacitance, input slew rate, output capacitance, and output slew rate are also defined. Identifying the critical pins on the block-level design at this point (even though the block-level design doesn't exist yet) enables the designers to place appropriate timing constraints on them.
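
To make the black box definition concrete, the attributes listed above can be pictured as a small record per pin plus a record per block. This is a hypothetical data model for illustration only: the class names, field names, and sample values are assumptions and do not reflect the block library format used in OpenCAD.

```python
from dataclasses import dataclass, field


@dataclass
class PinModel:
    name: str
    direction: str                # "in" or "out"
    capacitance_pf: float         # input or output capacitance
    slew_ns: float                # assumed input or output slew
    setup_ns: float = 0.0         # inputs: set-up time relative to the clock
    hold_ns: float = 0.0          # inputs: hold time relative to the clock
    clock_to_out_ns: float = 0.0  # outputs: delay from clock edge to pin
    critical: bool = False        # flagged early so it can be constrained


@dataclass
class BlackBoxBlock:
    name: str
    clock: str
    pins: list = field(default_factory=list)

    def critical_pins(self):
        """Names of pins that need explicit timing constraints."""
        return [p.name for p in self.pins if p.critical]


# Hypothetical block whose RTL does not exist yet.
dma = BlackBoxBlock(
    name="dma",
    clock="clk_sys",
    pins=[
        PinModel("data_in", "in", 0.05, 0.3, setup_ns=0.8, hold_ns=0.1, critical=True),
        PinModel("ack", "out", 0.10, 0.4, clock_to_out_ns=1.2),
    ],
)
print(dma.critical_pins())  # ['data_in']
```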

Subsequently, the top-level design layout is completed, including power routing, placement, clock balancing, placement optimization, final routing, and max-load fixing. Repeaters are inserted wherever required. Even though clock balancing is performed at the top level, it isn't possible to estimate the insertion delays and skew at this point, because the clock-tree synthesis for the lower-level blocks hasn't been completed. Nevertheless, the root clock-buffer placement and clock-line wiring are finished at this stage. While doing placement at the top level, care must be taken not to place instances so that they overlap the physical partitions. Such overlap makes it difficult to move a particular physical partition within the top-level layout. On the other hand, this restriction increases congestion and decreases top-level routability. After top-level routing, congestion analysis needs to be performed, and if the results aren't promising, several alternate floorplans can be attempted.
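
The placement restriction described above amounts to a rectangle-overlap test between top-level instances and partition outlines. The sketch below is a minimal illustration under the assumption that both are axis-aligned rectangles; the names `Rect` and `find_overlaps` and the coordinates are hypothetical, and a production layout tool would handle rectilinear shapes and far larger instance counts.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the interiors intersect; abutting edges do not count."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def find_overlaps(instances, partitions):
    """List (instance, partition) pairs that violate the restriction."""
    return [
        (inst, part)
        for inst, inst_rect in instances.items()
        for part, part_rect in partitions.items()
        if overlaps(inst_rect, part_rect)
    ]


partitions = {"blk_a": Rect(0, 0, 100, 100)}
instances = {
    "rep_1": Rect(90, 90, 110, 110),  # intrudes into blk_a
    "rep_2": Rect(100, 0, 120, 20),   # only abuts blk_a
}
print(find_overlaps(instances, partitions))  # [('rep_1', 'blk_a')]
```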

Now, design checks for the top level are performed. Since some of the block-level designs don't exist at this time, only the block boundaries are checked. Additionally, clock-tree checks, connectivity checks, shape checks, and antenna checks are done. Repairs are performed if problems are identified. We need to wait for the full chip integration - including the physical partitions - for a complete design check and analysis, however.

Once the layout is completed, parasitic extraction and delay calculations are performed. The delays are calculated for all variations of signal integrity as well. Layout verification for connectivity and geometry violations is also performed. Signal integrity checks, as well as checks for electromigration and hot electron, are also performed. Signal integrity repair is attempted using either placement techniques or routing techniques. The delays are calculated one more time for the top level. Once the top-level timing is satisfactory, the individual blocks are characterized for the appropriate I/O timings.

At this point, we have the overall top-level design completed, which will meet both the timing and the signal integrity requirements. A timing and area budget is available for each of the physical partitions. Logic designers can now design these blocks in RTL according to the specifications. One recommendation is to keep these top-level blocks under 500K gates. Thus, for a 10-million gate design, we may need a minimum of 20 blocks. (Typical designs of this size have several instances of pre-defined hard macros though.) A 500K block is still manageable, however, for a flat layout at the block level.
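
The block-count arithmetic above is a ceiling division. A minimal sketch follows, where the function name `min_block_count` is made up for illustration and the 500K default comes from the recommendation in the text:

```python
def min_block_count(total_gates: int, max_gates_per_block: int = 500_000) -> int:
    """Smallest number of blocks such that no block exceeds the gate ceiling."""
    if total_gates <= 0 or max_gates_per_block <= 0:
        raise ValueError("gate counts must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (total_gates + max_gates_per_block - 1) // max_gates_per_block


print(min_block_count(10_000_000))  # 20
print(min_block_count(3_200_000))   # 7
```

This is a lower bound only. Hard macros and uneven partition sizes will usually push the real block count higher.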

Yet another benefit of the hierarchical approach is that it follows a very similar methodology for the top-level design, as well as the block design. Also, once the design is partitioned into a few manageable blocks, engineers can work concurrently on them, reducing the elapsed time for the completion of the design. EDA tools work much better at these gate counts when compared to gate counts of two to five million gates.

Block-level design

Block-level design includes the following steps: floorplanning, RTL design, synthesis, DFT insertion, and timing-driven layout including clock-tree synthesis.

For better timing assurance, it's recommended that a register bank be added at the interfaces. During the clock-tree synthesis process, clock buffers are added to the block-level netlist. This buffer is connected directly to the top-level clock buffer. Gated clock-tree expansion can be done at this stage as well. Logic synthesis is performed based on the block-level constraints extracted at the top level. Since the gate count isn't known at this time, an automatic wire-load selection is assumed. Setup times need to be checked at the pre-layout stage using the MAX library. Hold checks are performed during timing-driven layout, or after the layout stage, using the MIN library.
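
The reason setup is checked against the MAX library and hold against the MIN library can be seen in the standard slack expressions for a register-to-register path. The sketch below uses the usual textbook formulation rather than anything specific to NEC's tools. The names are hypothetical, path delays are assumed to include the launching flip-flop's clock-to-output delay, and skew is taken as capture-clock arrival minus launch-clock arrival.

```python
from dataclasses import dataclass


@dataclass
class PathDelays:
    max_delay_ns: float  # slowest data-path delay (MAX library corner)
    min_delay_ns: float  # fastest data-path delay (MIN library corner)


def setup_slack(period_ns, path, setup_ns, skew_ns=0.0):
    """Slowest data must arrive setup_ns before the next capturing edge."""
    return period_ns + skew_ns - path.max_delay_ns - setup_ns


def hold_slack(path, hold_ns, skew_ns=0.0):
    """Fastest data must not change until hold_ns after the capturing edge."""
    return path.min_delay_ns - skew_ns - hold_ns


path = PathDelays(max_delay_ns=8.5, min_delay_ns=0.5)
print(setup_slack(10.0, path, setup_ns=0.5, skew_ns=0.25))  # 1.25
print(hold_slack(path, hold_ns=0.25, skew_ns=0.25))         # 0.0
```

A negative value from either function indicates a violation. Positive skew helps setup and hurts hold, which is one reason the hold check is deferred until the clock tree and layout are known.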

After synthesis, scan insertion is performed. Once this is finished, delay estimation of the block netlist is initiated. Since the input slew rate and output loading aren't known exactly for each block, there can be differences in the timing characteristics of the block before and after block design. The extent of the changes required at the top level depends upon the divergence of the delay results. Once the SDF file is generated, timing analysis, clock-insertion delay, and clock-skew values can be estimated. Gate-level verification is also performed at this stage. Any verification of pre-scan, post-scan or pre-CTS/post-CTS could be performed using a formal verification tool.
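
Given the clock arrival time at each flip-flop, as derived from the SDF-annotated netlist, the two clock figures mentioned above reduce to simple statistics. This sketch assumes insertion delay is reported as the latest arrival and skew as the spread between latest and earliest; some tools report a mean insertion delay instead. The sink names and values are invented.

```python
def clock_tree_metrics(sink_arrivals_ns):
    """Summarize a clock tree from per-sink arrival times (root at t = 0)."""
    if not sink_arrivals_ns:
        raise ValueError("at least one clock sink is required")
    latest = max(sink_arrivals_ns.values())
    earliest = min(sink_arrivals_ns.values())
    return {"insertion_delay_ns": latest, "skew_ns": latest - earliest}


arrivals = {"ff_a": 1.5, "ff_b": 1.75, "ff_c": 1.25}
print(clock_tree_metrics(arrivals))
# {'insertion_delay_ns': 1.75, 'skew_ns': 0.5}
```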

The block design is checked for connectivity errors, clock-tree connections, electromigration, and hot electron. Scan-connection checks are also performed. Next, the block-level layout is performed, including placement, power routing, placement optimization, routing, and fan-out fixes. Once these tasks are complete, RC extraction starts at the block level. Before the layout is verified, delay calculation is performed, including an accounting for the added effects of signal integrity variations. After the delay calculation step, electromigration and antenna checks for the block level are run. Any issues identified at this stage are repaired using placement optimization or re-routing.

Now timing analysis is performed at the macro level to meet the block-timing budget. Once the individual blocks are completed, the block-level models are regenerated and a top-level timing analysis can be completed. With the timing analysis report, the block libraries are recharacterized (see Figure 3).

Figure 3 - Full-chip integration
After top-level timing analysis and library recharacterization, block-level design ends and the full chip is ready for integration.

Integrating a full chip

Once all of the block-level functions are completed, the design is ready for the final integration. The final netlist is extracted from layout, and top-level delay calculation is initiated accordingly. The SDF file is modified for signal integrity variations. Layout verification is done along with EM checks and antenna checks. The designer then performs functional verification of the design using a formal verification tool. If required, signal integrity fixes can also be performed. Then, in order to guarantee connectivity and full-chip clock topology, DRCs are run. Static-timing analysis is run once again to make sure that all the timing requirements are met. Once the top-level design is completed, artwork and the tester interface files are generated and verified.

Besides its ability to reduce design complexity, the hierarchical method has other benefits as well. In the hierarchical method, the inter-block timing goal can be met much more easily than in a flat design. This approach is also very valuable when you have multiple instances of the same block at the top level, especially in networking applications. Once this reusable block is laid out, it can be placed in multiple locations on the chip with little effort. In addition, the individual blocks can be either designed concurrently to reduce the turn-around time or spaced out over the design cycle to accommodate various design groups.

Finally, the hierarchical approach improves the clock balancing when compared with a design completed in a flat layout.

Clearly, the huge SOC designs in demand today will require the move to hierarchical methodologies in the coming era.

Copyright © 2000 Integrated System Design Magazine
 