Tuesday, December 24, 2013

Clock Jargon: Important Terms

A clock is to an SoC what blood is to the human body. Just as blood flows to every part of the body and regulates metabolism, the clock reaches every sequential device and controls the digital events inside the SoC. Modern designers use many terms in relation to the clock, and the backend team monitors these carefully while building the Clock Tree. Let's have a look at them.
  • Clock Latency: Clock Latency is the general term for the delay that the clock signal takes between any two points. It can be from source (PLL) to the sink pin (Clock Pin) of registers or between any two intermediate points. Note that it is a general term and you need to know the context before making any guess about what is exactly meant when someone mentions clock latency.
  • Source Insertion Delay: This refers to the clock delay from the clock origin point, which could be the PLL or maybe the IRC (Internal Reference Clock) to the clock definition point.
  • Network Insertion Delay: This refers to the clock delay from the clock definition point to the sink pin of the registers.
Consider a hierarchical design where multiple people work on multiple partitions or sub-modules. The tool working on a block would be oblivious to the "top" or to any logic outside the block. The block owner would define a clock at the port of the block (as shown below) and carry out the physical design activities. The block owner would see only the Network Insertion Delay and can only model the Source Insertion Delay for the block.
Having discussed latency, we now turn our attention to another important clock parameter: the Skew.
We discussed the concept of skew and its implication on timing in the post: Clock Skew: Implication on Timing. It would be prudent to go through that post before proceeding further. We shall now look at the meaning of two terms: Global Skew and Local Skew.
  • Local Skew is the skew between any two related flops. By related we mean that the flops exist in the fan-in or fan-out cone of each other.
  • Global Skew is the skew between any two non-related flops in the design. By non-related we mean that the two flops do not exist in the fan-out or fan-in cone of each other and hence are in a way mutually exclusive.
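The two definitions above can be sketched in a few lines of Python. This is only an illustration: the flop names, the latency numbers (in picoseconds) and the `skew_report` helper are hypothetical, and "related" is simplified to mean that a direct launch-capture path exists between the two flops.

```python
from itertools import combinations

def skew_report(latency_ps, timing_paths):
    """Return (worst local skew, worst global skew) in picoseconds.

    latency_ps   : clock latency at the clock pin of each flop
    timing_paths : (launch, capture) pairs that have a data path between them
    """
    related = {frozenset(pair) for pair in timing_paths}
    local, global_ = 0, 0
    for a, b in combinations(latency_ps, 2):
        skew = abs(latency_ps[a] - latency_ps[b])
        if frozenset((a, b)) in related:
            local = max(local, skew)      # related flops: local skew
        else:
            global_ = max(global_, skew)  # non-related flops: global skew
    return local, global_

# Hypothetical latencies and paths, for illustration only.
latencies = {"FF1": 1000, "FF2": 1200, "FF3": 1800, "FF4": 1100}
paths = {("FF1", "FF2"), ("FF3", "FF4")}
print(skew_report(latencies, paths))  # (700, 800)
```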

Clock Skew: Implication on Timing

Clock Skew is an important parameter that greatly influences the timing checks and you would often find the backend design engineers always keeping a close eye on the clock skew numbers. 
Clock Skew: The difference in arrival times of the clock signal at any two flops which are interacting with one another is referred to as clock skew. Having said that, please note that skew only makes sense for two flops which are interacting with one another, i.e. they make a launch-capture pair. 
If the clock takes more time to reach the capture flop than to reach the launch flop, we refer to it as Positive Clock Skew. And when the clock takes less time to reach the capture flop than to reach the launch flop, we refer to it as Negative Clock Skew.
The figure below describes positive & negative clock skew. Assume the delays of clock tree buffers to be the same.
How does clock skew impact the timing checks, in particular, setup and hold? Consider the above example where FF1 is the launching flop and FF2 is the capturing flop. If the clock skew between FF1 and FF2 was zero, the setup and hold checks would be as follows:
  • Positive Skew: Now imagine the case where clock skew is positive. Here, the clock takes more time to reach FF2 than to reach FF1. Recall that the setup check means that the data launched should reach the capture flop at least the setup time before the next clock edge. As evident in the figure below, the data launched from FF1 gets extra time equal to the skew to reach FF2. Hence setup is relaxed! However, the hold check means that the data launched should not reach the capture flop until at least the hold time after the same clock edge, and that edge now arrives later at FF2. Hence, hold becomes more critical in case of positive skew. Read the definitions again and again till you grasp it!!

  • Negative Skew: Here, the clock takes more time to reach FF1 than to reach FF2. As evident in the figure below, the data launched from FF1 gets less time, by an amount equal to the skew, to reach FF2. Hence setup is more critical! However, hold is relaxed!
    Some Key Points to Note:
  • Setup is the next cycle check, and positive skew relaxes the setup check and negative skew further tightens it.
  • Hold is the same cycle check, and negative skew relaxes the hold check and positive skew further tightens it.
  • Very rarely would one come across a path that is both setup and hold critical. Setup becomes critical when the data path is huge or you have a large negative skew; hold becomes critical when either the data path is minimal or you have a large positive skew. These conditions are mutually exclusive and very rarely do they manifest themselves simultaneously. When they do, it is often the case that the uncommon clock path is significant. We shall discuss it in detail later.
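To see the key points in numbers, here is a minimal sketch of the two checks with skew included. The function names and all delay values (in picoseconds) are made up for illustration; skew is taken as capture clock latency minus launch clock latency, so it is positive when the clock reaches FF2 later.

```python
def setup_slack(t_clk, t_cq, t_data, t_setup, skew=0):
    """Next-cycle check: positive skew gives the data extra time."""
    return (t_clk + skew) - (t_cq + t_data + t_setup)

def hold_slack(t_cq, t_data, t_hold, skew=0):
    """Same-cycle check: positive skew eats into the hold margin."""
    return (t_cq + t_data) - (t_hold + skew)

# Hypothetical numbers: 1000 ps clock, 100 ps clock-to-Q, 600 ps data path.
for skew in (0, 200, -200):
    s = setup_slack(1000, 100, 600, 50, skew)
    h = hold_slack(100, 600, 30, skew)
    print(f"skew={skew:+5d} ps  setup slack={s} ps  hold slack={h} ps")
```

With these numbers, the +200 ps case shows a larger setup slack and a smaller hold slack than the zero-skew case, and the -200 ps case shows the opposite.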

Puzzle: Fixing Timing Violation

Timing violations can manifest due to a plethora of reasons. It is important for an STA Engineer to understand the violating path and model the constraints properly before providing them to the Synthesis/PnR tools for optimization. Unnecessary optimization should be avoided in order:
  • To save on the die area;
  • To save on the leakage power;
  • To prevent unnecessary congestion.
The figure below shows a scenario. Assume the clock period to be 8ns and the setup time of the capture flop (here, FF3) be 0ns and the clock-to-Q delay of the launch flops (here, FF1 & FF2) be 0ns. The violating path is shown in the figure. The negative slack is 1ns. 



How would you fix the above violation? Please note that there are many possible solutions, but only one solution adheres to the above discussed constraints of leakage power, area and congestion.

Lock-Up Latch: Implication on Timing

Lock-Up Latches are important elements for any STA engineer closing timing on the DFT Modes, especially the Hold Timing Closure of the Shift Mode. While shifting, the scan chains come into the picture. These are nothing but chains of flops in which the output pin of one flop is connected to the Scan-Input or Test-Input pin of the next flop, and so on, forming a chain.
Now, imagine that we have two functionally asynchronous domains 1 and 2. Functionally asynchronous means that during the normal mode of operation (Functional Mode), the two domains do not interact with each other. However, very rarely do the designers have the liberty to make a separate scan chain for functionally asynchronous domains. Let's consider the following scenario: where domain 1 has the average clock latency of 3ns, and domain 2 has the average latency of 6ns. And the time period of the test clock is, let's say, 10ns.
Now, let's see the timing checks for this scenario. The output of the last flop of domain 1 is in the scan chain and is connected to the Test-Input pin of the first flop of domain 2. The check would be like:

Owing to the positive clock skew, the setup check would be relaxed, but hold would be critical. Two possible options to fix the hold timing:
  • Insert buffers to add sufficient delay, so that hold is finally met.
  • Add a Lock-Up Latch between the two flops where the scan chain crosses the functional domains.
The first might not be a robust solution because the delay of the buffers would vary across the PVT Corners and RC Corners. In the worst case, there might be a significant variation in the delays across corners and you might witness closure of timing in one corner, but a violation in another corner!
The second solution is more robust because it obviates the above scenario. Let's see how it does it.
 

Timing Check would be like:


Hence, both setup and hold checks are relaxed!!
[Please note that in the above circuit, the latch is a negative level triggered latch.]
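The numbers from the scenario above (3ns and 6ns latency, 10ns test clock) can be plugged into a small sketch. It rests on assumptions that are not stated in the post: a 50% duty cycle, zero cell and wire delays, zero setup and hold times, and a lock-up latch clocked by the domain 1 clock. The function names are hypothetical.

```python
def direct_checks(launch_lat, capture_lat, period):
    """FF1 drives FF2 directly; data changes at the launch clock edge."""
    hold = launch_lat - capture_lat                # same-edge check
    setup = (capture_lat + period) - launch_lat    # next-edge check
    return setup, hold

def lockup_checks(launch_lat, capture_lat, period):
    """A negative-level latch on the launch clock holds the data until
    the launch clock falls, i.e. half a period after the launch edge."""
    latch_opens = launch_lat + period / 2
    hold = latch_opens - capture_lat
    setup = (capture_lat + period) - latch_opens
    return setup, hold

print(direct_checks(3, 6, 10))   # (13, -3): hold is violated by 3 ns
print(lockup_checks(3, 6, 10))   # (8.0, 2.0): both checks have margin
```

Under these assumptions the latch converts a 3ns hold violation into a 2ns hold margin, while setup still has 8ns of margin. The margin comes from the clock period itself rather than from buffer delays, which is why it does not swing across corners the way a buffer-based fix can.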

DFT Modes: Perspective

As more and more transistors find their way onto a single SoC, Design for Testability (DFT) is becoming an increasingly important function of the SoC design cycle. As technology nodes shrink, the probability of faults occurring also increases, which makes DFT an indispensable function for modern sub-micron SoCs. What are the possible faults within an SoC and what are the ways to detect them? We will take them up briefly.
Imagine that you own a chip manufacturing company for the automotive industry. The end application can be something meant for infotainment, engine management, rear view camera, ethernet connectivity, power windows, or a critical application like a collision detector or air-bag control. You wouldn't like to sell a faulty chip to your customers for two main reasons:
  • Trust of the customer which would impact the goodwill of the company.
  • Loss of business: Maybe because the customer opted for some other semiconductor vendor or even worse, the chip failed at the user end and he sued your customer who ultimately sued you!
Hence, it is pretty important to test the chip before shipping it out to the customers.

Types of fault and their detection:
  • Structural Fault: Basically refers to the faults due to faulty manufacturing at the fabs. Even a tiny dust particle can cause shorts or opens in an SoC.
    Let's try to understand it from our example. Let's say you manufactured the chip, but there is a fair probability that there are some structural inadequacies in the form of shorts or opens. Imagine any digital circuit: a single short or a single open can cause the entire functionality of the device to go haywire. Structural testing is done during the DFT tests or modes called Shift and Stuck-At Capture. We'll discuss these in detail in the upcoming posts. Note that these tests are conducted after manufacturing, before shipping the part to the customer.
  • Transition Faults: Signal transitions are nothing but the voltage levels while they switch from 'high' to 'low' or vice-versa. There is a designated time before the clock edge when all the signals should be stable at the input of the Flop (a very crude definition of setup time) and also a designated time after the clock edge when none of the signals should change their states at the input of the Flop (a very crude definition of hold time). Any such fault in the transition times (in other words, a setup or hold violation) is referred to as a transition fault.
    Going back to our example, suppose that you first filtered out the chips which had some structural fault. Now you would test the remaining chips for transition faults. What would happen if you shipped a chip with a transition fault to a customer? If it had a setup violation, the chip would not be able to work at the specified frequency. However, it would be able to work at a slower frequency. If it had a hold violation, the chip would not be able to work at all! One possible consequence from our example: in the event of a collision you would expect the air bag to open up within a few micro- or nano-seconds, but it might end up taking seconds! Unfortunately, that would be too late.
    The At-Speed test is used to screen the chip for transition faults.
    Broadly speaking, there are only two types of the faults as discussed above. However, there's another possibility which can arise. 

    Imagine that your car has an SoC which senses a collision and opens the air bag within a few micro-seconds of the collision. You would expect it to open up if such a scenario arises. But what if your car is, let's say, 6 years old and the chip is no longer functioning as expected? In this case, you would like to test the chip first. And if it is fine, you may proceed to ignite the engine and start the car. Such a scenario demands conducting a test while the chip is in the field. Such a DFT test is called the LBIST Test (Logic Built-In Self-Test). In an LBIST test, one would be testing the entire chip or a sub-part of it for structural and/or transition faults. Such a test for memory is referred to as the MBIST Test (Memory Built-In Self-Test).

    An important characteristic of a built-in self-test is that it is self-sufficient. Following a software trigger, it carries out the test internally, without any external input, and sends out a signature in terms of PASS/FAIL.

    A failed LBIST test on the air-bag chip, might flash a warning and can prevent the user from starting the car engine! It might sound cruel, but it can surely save your life!
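The self-sufficient nature of a built-in self-test can be sketched in software. The example below is a toy model built on assumptions of my own: patterns come from a 4-bit LFSR (a common choice for on-chip pattern generation), the "logic under test" is a made-up 4-input function, and the signature is simply the responses shifted into one integer so that nothing aliases. A real LBIST controller compacts responses in hardware; none of the names here come from an actual tool.

```python
def lfsr_patterns(seed=0b0001, count=15):
    """4-bit LFSR: yields 15 distinct non-zero patterns from a non-zero seed."""
    state = seed
    for _ in range(count):
        yield state
        feedback = ((state >> 3) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0xF

def good_logic(p):
    """Hypothetical fault-free logic: out = (a AND b) OR (c XOR d)."""
    a, b, c, d = (p >> 3) & 1, (p >> 2) & 1, (p >> 1) & 1, p & 1
    return (a & b) | (c ^ d)

def faulty_logic(p):
    """Same logic with the internal node (a AND b) stuck-at-0."""
    c, d = (p >> 1) & 1, p & 1
    return c ^ d

def signature(logic):
    """Apply every pattern and shift each 1-bit response into the signature."""
    sig = 0
    for pattern in lfsr_patterns():
        sig = (sig << 1) | logic(pattern)
    return sig

GOLDEN = signature(good_logic)  # known-good signature, fixed at design time

def run_bist(logic):
    return "PASS" if signature(logic) == GOLDEN else "FAIL"

print(run_bist(good_logic))    # PASS
print(run_bist(faulty_logic))  # FAIL
```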

Routing: Basics

The routing process determines the precise paths for nets on the chip layout to interconnect the pins on the circuit blocks. Before discussing further, it would be prudent to discuss where Routing actually fits in the Physical Design flow.
After Synthesis (the conversion of RTL to gate-level netlist), the blocks and the instances are Placed, which, to some extent, is governed by the Floorplan. After Placement, Clock Tree is synthesized followed by Routing of the signal nets. The following flow chart summarizes the Physical Design Flow.



Objectives of the Routing Process:
  • To determine the necessary wiring, e.g., net topologies and specific routing segments, to connect these cells while respecting constraints like design rules.
  • To optimize routing objectives, e.g., minimizing total wire length and maximizing timing slack.

Routing is further divided into the following stages:
  • Global Routing: It defines the routing regions and generates a tentative route for each net. Each net is assigned to a set of routing regions. However, it does not specify the actual layout of wires and it is not sensitive to DRVs (design rule violations).
  • Detailed Routing: For each routing region (defined during Global Routing), each net passing through that region is assigned to particular routing tracks. The actual layout of wires is specified. It also tries to fix all DRV violations in the design.
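As a rough illustration of what "assigning each net to a set of routing regions" means, here is a toy global router. It is a sketch under my own assumptions: the chip is a grid of regions addressed as (column, row), every net has two pins, and each net is routed as a simple L-shape (horizontal first, then vertical). Real global routers are far more sophisticated; the function names and the nets are hypothetical.

```python
from collections import Counter

def global_route(src, dst):
    """Return the list of grid regions an L-shaped route passes through."""
    (x0, y0), (x1, y1) = src, dst
    step_x = 1 if x1 >= x0 else -1
    step_y = 1 if y1 >= y0 else -1
    path = [(x, y0) for x in range(x0, x1 + step_x, step_x)]
    path += [(x1, y) for y in range(y0 + step_y, y1 + step_y, step_y)]
    return path

def region_usage(nets):
    """Count how many nets cross each region: a crude congestion map."""
    usage = Counter()
    for src, dst in nets.values():
        usage.update(global_route(src, dst))
    return usage

nets = {"n1": ((0, 0), (2, 1)), "n2": ((1, 0), (1, 2))}
print(global_route(*nets["n1"]))     # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(region_usage(nets)[(1, 0)])    # 2, both nets cross region (1, 0)
```

Detailed routing would then take each region and assign the nets crossing it to specific tracks, which this sketch does not attempt.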

Puzzle: Divide by 3 Counter with 50% DC

It is pretty simple to make a clock divider with odd frequency division (let's say 3 or 5). But it doesn't have 50% duty cycle. Some modifications are essential to achieve that 50% duty cycle. You might argue, why so much fuss about 50%? To give you an insight into it, consider the following divided waveform with 66% DC:

As you can note from the above waveforms: 
  • NEG-TO-POS arc (i.e. any path launching from a negative edge triggered flop and being captured at positive edge triggered flop) would have least time to meet the setup time requirement and hence can be critical. 
  • On the other hand, POS-TO-POS and NEG-TO-NEG arcs are far more relaxed.
Same would be true for a divider with 33% duty cycle as well. So, it is preferable to use a divided clock with 50% duty cycle.
Can you design such a circuit which takes a clock signal of frequency f, and outputs another clock signal of frequency f/3 with 50% duty cycle?
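Without giving away the puzzle, the argument about duty cycle can be checked with a little arithmetic. The sketch below assumes ideal edges and measures time in periods of the input clock; `arc_times` is a hypothetical helper that returns the time available for each launch-to-capture arc on the divided clock.

```python
from fractions import Fraction

def arc_times(divided_period, duty):
    """Time between launch and capture edges for each arc type."""
    high = divided_period * duty
    low = divided_period - high
    return {
        "POS-TO-POS": divided_period,
        "NEG-TO-NEG": divided_period,
        "POS-TO-NEG": high,   # rising edge to the next falling edge
        "NEG-TO-POS": low,    # falling edge to the next rising edge
    }

# Divide-by-3 clock: period is 3 input-clock periods.
for duty in (Fraction(2, 3), Fraction(1, 3), Fraction(1, 2)):
    worst = min(arc_times(3, duty).values())
    print(f"duty={duty}: tightest arc gets {worst} input-clock period(s)")
```

With a 2/3 or 1/3 duty cycle, the tightest arc gets only one input-clock period, whereas a 1/2 duty cycle gives both half-cycle arcs 3/2 periods.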

Design for Testability: The Need for Modern VLSI Design

DFT is the acronym for Design for Testability. DFT is an important branch of VLSI design and in crude terms, it involves putting up a test structure on the chip itself to later assist while testing the device for various defects before shipping the part to the customer.
Have you ever wondered how the size of electronic devices keeps shrinking? Mobile phones used to be big and heavy, with basic minimal features, back in the 90s. But nowadays we have sleek phones, lighter in weight and with all sorts of features: camera, bluetooth, music player and, not to forget, faster processors. All that is possible because of the scaling of technology nodes. Technology node refers to the channel length of the transistors which form the constituents of your device. We are moving to reduced channel lengths, and some companies are working on technology nodes as small as 18nm. The smaller the channel length, the more difficult it is for the foundries to manufacture, and the higher the chances of manufacturing faults.
Possible manufacturing faults are: Opens and shorts.
The figure shows two metal lines, one of which got "open" while the other got "shorted". As we move to lower technology nodes, not only is the device size shrinking, but we can also pack more transistors on the same chip, and hence density is increasing. Manufacturing faults have therefore become inevitable. DFT techniques enable us to test for these faults (and other kinds as well).



Kinds of defects:
  • Opens and shorts, as mentioned above, can cause functional failures. A short where a node gets shorted to ground is referred to as a stuck-at-0 (SA0) fault, and one where the node gets shorted to the power supply is referred to as a stuck-at-1 (SA1) fault.
  • Speed Defect: May arise due to coupling of a net with the adjacent net and hence affecting the signal transition on it.
  • Leakage Defect: Where a path might exist between the power supply and ground and this would cause excessive leakage power dissipation in the device.
In a nutshell, DFT techniques are important, especially for deep sub-micron technology nodes (i.e. below 90nm). They can prevent shipping a defective part to the customer, which would otherwise cause revenue and goodwill loss for the semiconductor design companies.

Sample Problem on Setup and Hold

In the post Timing: Basics, we discussed the basics of setup and hold times: why it is necessary to meet the setup and hold timing requirements, and how frequency affects setup but does not affect hold.

Let us understand the concept with an example:



I hope the above waveforms are self explanatory.
Setup Slack in the above case (as inferred from the waveforms as well) is:

Setup Slack = Tclk - T(clk-2-q) - Tdata - T(su,FF2)
If this setup slack is positive, we say that setup time constraint is met. Note that setup slack depends upon the clock period and hence in turn frequency at which your design is clocked.
Let us consider hold timing:
Hold Slack = Tdata + T(clk-2-q) - T(ho,FF2)
As evident from the above equation, hold slack is independent of the frequency of the design.
Note:
  • Setup is the next cycle check. We take the setup time T(su,FF2) of FF2 into account while finding the setup slack at the input pin of FF2.
  • Hold is the same cycle check. We take the hold time T(ho,FF2) of FF2 into account while computing the hold slack at the input pin of FF2.
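The two equations above can be evaluated for a few clock periods to see the frequency dependence directly. The delay values below (in picoseconds) are invented for illustration, and the helper names are mine.

```python
T_CQ, T_DATA, T_SU, T_HO = 100, 600, 50, 30  # hypothetical delays in ps

def setup_slack(t_clk):
    """Setup Slack = Tclk - T(clk-2-q) - Tdata - T(su,FF2)"""
    return t_clk - T_CQ - T_DATA - T_SU

def hold_slack():
    """Hold Slack = Tdata + T(clk-2-q) - T(ho,FF2); no Tclk term."""
    return T_DATA + T_CQ - T_HO

def min_clock_period():
    """Smallest Tclk for which the setup slack is not negative."""
    return T_CQ + T_DATA + T_SU

for t_clk in (700, 750, 1000):
    print(f"Tclk={t_clk} ps: setup slack={setup_slack(t_clk)} ps, "
          f"hold slack={hold_slack()} ps")
print("minimum clock period:", min_clock_period(), "ps")
```

The hold slack stays the same in every row, while the setup slack crosses zero at the minimum clock period, which sets the maximum frequency for this path.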
Try and grasp this example. I shall introduce the concept of clock skew next.

State Retention Power Gating

The post titled Power Gating demonstrated the implementation of a Power Gating Cell and how it helps in minimizing the leakage power consumption of an SoC. Make sure you go through it once more. The basic rationale is to cut the direct path from the battery (VDD) to ground (GND). Though efficient in saving leakage power, the implementation discussed suffers from one major drawback: it does not retain the state! That means, once the power of the SoC is restored, the output of the power gated cell goes to 'X'. You can't really be sure whether it is a logic 1 or a logic 0. Do we care? Yes! Because if this X propagates into the design, the entire device can go into an unknown state! To prevent such a disastrous situation, the system software can simply reset the SoC. It would then boot up from scratch, making sure that all the devices are initialized.
This means, every time I decide to power gate a portion of my SoC, I'll have to reset that power gated portion once power is returned. This imposes a serious limitation to the application of the Power Gate discussed in the last post. How about designing one power gate which retains the state? But convince yourself that in order to do so, you'd need to spend, though small, some leakage power. Let's call this structure: State Retention Pseudo Power Gate. The term "pseudo" signifies that it would consume a little leakage power contrary to the previous structure which doesn't. But at the same time, you no longer need to reset the power gated portion of the SoC, because the standard cells retain their previous data!! Enough said! Let's discuss the implementation.

The above circuit has two parts. 
  • The one inside the red oval is same as the normal power gating structure. 
  • The one inside the green box (on the right) is the additional circuitry required to enable this device to retain its state.
Operation: Let's say before going into the SLEEP mode, the device had the output as logic 1. After entering the SLEEP mode (power off), the sleep transistors come into action and cut the power and ground rails of the device and hence save the leakage power. But the logic on the right (in green rectangle) is still ON! The output of the inverter would now become OUTPUT', i.e., logic 0. This would in turn enable the PMOS transistor Q1 and output would be restored back to logic 1.
The same is true when the output is logic 0 before power gating. In that case the NMOS transistor Q0 would come into action to help the output node retain its data.
Note that all this while, when the device is in sleep mode, the output node would continue to leak. By adding the additional circuitry, as demonstrated, we are basically creating a feedback loop, which helps in retaining the state. The hit, of course, is the leakage power of 4 transistors. However, the standard cell logic (in the red oval) is usually bulky. Even a simple 2-input NAND gate has 4 transistors itself, and gates with more inputs have more! The same technique can be applied to any sequential device like a Flip Flop, a latch or even a clock gating integrated cell.

Multi-Cycle Paths: Perspective & Intent

Multi-Cycle Paths, as the name suggests, are those paths which need not be timed in one clock cycle. It is easier said than done! Before we discuss further, let's talk about what Multi-Cycle Paths do not mean!!
Myth 1: All those paths which the STA team is unable to meet at the highest clock frequency, are potential multi-cycle paths.
Reality: Multi-cycle paths are driven by the architectural and design requirements. STA folks merely implement or appropriately said, model the intent in their timing tools! A path can be a multi-cycle path even if the STA team is able to meet timing at the highest clock frequency.
Myth 2: It is by some magic that the design teams conjure up how many cycles would be appropriate for a path to function! (Apologies for the hyperbole! :)) And the STA team follows the same in their constraints.
Reality: MCPs are driven by the intent. And implementation is governed by that intent which includes but is not limited to the number of run modes a particular SoC should support.
Consider the following scenario:
Normal mode, Low Power Mode and Ultra Low Power Modes can be considered to be the different run modes of the SoC. You can say that the customer can choose at what time which run mode would be better. Example: when performance is not critical, or your device can go to 'hibernate' mode, you (or the software) can allow the non-critical parts of the SoC to go into a Low Power Mode, and hence save power!
Consider the specifications:
  • Normal Mode: Most Critical IP & Not-So Critical IP would work at f MHz. Least Critical IP would work at f/2 MHz. Interaction between any two IPs would be at slower frequency.
  • Low Power Mode: Most Critical IP would work at f MHz. Not-So Critical IP & Least Critical IP would work at f/2 MHz. Interaction between any two IPs would be at slower frequency.
  • Ultra Low Power Mode: Most Critical IP would work at f MHz. Not-So Critical IP would work at f/2 MHz. And Least Critical IP would work at f/k MHz (k = 3 to 8). Interaction between any two IPs would be at slower frequency.
Consider the Low Power Mode. Any interaction within the Not-So Critical IP would be at slower frequency. However, any paths between the Most Critical IP and Not-So Critical IP would be Multicycle path of 2 in the low power mode. In this case, the clock at the Not-So Critical IP is gated selectively for every alternate clock cycle to implement an MCP. Hence data launched from Most Critical IP now effectively gets two clock cycles (of the faster clock) to reach the Not-So Critical IP. The following figure explains the intent:

This much for the intent! However, as we mentioned, the Least Critical IP would work at f/k MHz (k = 3 to 8) depending on the mode, so one might need an MCP of 2, 3, 4 and so on. This calls for a configurable implementation of multi-cycle paths. We shall cover it sometime later. Till then, you can mull over the intent part. You can also mail me in case you think of any such implementation at my<dot>personal<dot>log<at>gmail<dot>com. Adios!
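The effect of gating the capture clock can be sketched numerically. In the toy model below, only every n-th pulse of the fast clock reaches the capture flop, so data launched at time 0 is captured at n clock periods. The function names and the delay numbers (in picoseconds) are hypothetical, and skew is ignored.

```python
def capture_edges(period, n_cycles, count):
    """Rising edges seen by the capture flop when only every n-th
    pulse of the fast clock passes the clock gate (launch at t = 0)."""
    return [k * n_cycles * period for k in range(1, count + 1)]

def mcp_setup_slack(period, n_cycles, t_cq, t_data, t_setup):
    """Setup slack when the path is allowed n_cycles fast-clock cycles."""
    return n_cycles * period - (t_cq + t_data + t_setup)

# Hypothetical path: 1000 ps fast clock, 1400 ps of data path delay.
print(capture_edges(1000, 2, 3))                 # [2000, 4000, 6000]
print(mcp_setup_slack(1000, 1, 100, 1400, 50))   # -550: fails in one cycle
print(mcp_setup_slack(1000, 2, 100, 1400, 50))   # 450: meets with an MCP of 2
```

Making `n_cycles` a parameter hints at the configurable case: the same path could be checked with 2, 3, 4 and so on, depending on the run mode.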

Low Power Synthesis: Insertion of Clock Gating Cells

Power consumption is a growing concern for modern SoCs, and design engineers today face the arduous task of limiting the power dissipation of their SoCs. It would be unfair to think of the backend design cycle as a magical solution to all the power problems. However, modern synthesis EDA tools are smart enough to identify some key RTL constructs and synthesize a low power equivalent of the structure. We will take a look at one such RTL construct and its equivalent implementation for low power design.

Consider the following behavioral description:

always @(posedge clk)
begin
   // Load the register only when enable is high; otherwise q holds its value.
   if (enable == 1'b1)
      q[15:0] <= d[15:0];
end

One logical implementation and the corresponding low power implementation of the above description would be:

The synthesis tools find such RTL constructs and try and convert it into the low power implementation shown above. Please note that, the clock gating integrated cell (CGIC) also consumes power and the above implementation might not be an expedient solution if the above enable is mostly high, or even if the number of registers in the register set is small. Therefore, one needs to exercise caution while using or implementing such a structure!
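The caution above can be turned into a back-of-the-envelope check. The model below is a deliberate simplification with made-up numbers: it only compares the clock-pin power of the registers against the overhead of the CGIC, ignores everything else (for example the feedback muxes in the ungated version), and uses arbitrary power units. The function name and parameters are hypothetical.

```python
def gating_saving(n_regs, p_clk_pin, p_cgic, enable_duty):
    """Estimated power saved by gating the clock of a register bank.

    n_regs      : number of registers sharing the enable
    p_clk_pin   : clock-pin power of one register when clocked
    p_cgic      : power consumed by the clock gating cell itself
    enable_duty : fraction of cycles in which enable is high (0 to 1)
    """
    saved_at_registers = (1 - enable_duty) * n_regs * p_clk_pin
    return saved_at_registers - p_cgic

print(gating_saving(16, 1.0, 4.0, 0.25) > 0)  # True: 16 regs, enable rarely high
print(gating_saving(16, 1.0, 4.0, 0.90) > 0)  # False: enable is mostly high
print(gating_saving(2, 1.0, 4.0, 0.25) > 0)   # False: too few registers
```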

Two Pillars of DFT: Controllability & Observability

Just like Timing is built on two pillars, Setup & Hold, the entire DFT is built on two pillars: Controllability & Observability. Very often you would find DFT folks cribbing that they can't control a particular node, or don't have any mechanism to observe a particular node in question. You may like to review the previous post, DFT Modes: Perspective, before proceeding further.
Shifting our attention to the pillars of DFT, let's define the two.
  • Controllability: It is the ability to have a desired value (which would be one out of 0 or 1) at any particular node of the design. If the DFT folks have that ability, they say that that particular node is 'controllable'. Which in turn means that they can force a value of either 0 or 1, on that node!
  • Observability: It is the ability to actually observe the value at a particular node whether it is 0 or 1 by forcing some pre-defined inputs. Note that, unlike the circuit that we make on paper, the SoC is a colossal design and one can observe a node only via the output ports. So, DFT folks actually need a mechanism to excite a node and then fish that value out of the SoC via some output port and then 'observe' it!
Ideally, it is desired to have each and every node of the design controllable and observable. But reality continues to ruin the life of DFT folks! (Source: Calvin & Hobbes). It is not always possible or rather practical to have all the nodes in a design under your control, because of the sheer complexity that modern SoCs possess. And therefore, it is the reason you would hear them talk about 'Coverage'. Let's say coverage is 99%, this means that we have the ability to control and observe 99% of the nodes in the design (A pretty big number, indeed!).
Now let's take some simple examples.
In the above example, if we can control the flops such that the combo cloud results in 1 at both the inputs of the AND gate, we say that the node X is controllable for 1. Similarly, if we can control any input of the AND gate for 0, we say that node X is controllable for 0. Now let's say we wish to observe the output of FF1. If we can somehow replicate the value of FF1 by making the combo clouds and the AND gate transparent to the value at FF1, we say that the output of FF1 is observable. Intuition tells us that for the AND gate to be transparent, we should have controllability of the other input for 1. Because when one input of the AND gate is 1, whatever the value at the other input, it is simply passed on!!
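The AND gate argument can be checked exhaustively in a few lines. This is only a sketch: the combo clouds are left out, the gate output plays the role of node X, and the `stuck_at` parameter is my own way of modelling a fault at X.

```python
def and_gate(a, b, stuck_at=None):
    """2-input AND; stuck_at forces node X (the output) to 0 or 1."""
    return (a & b) if stuck_at is None else stuck_at

def detecting_patterns(stuck_at):
    """Input patterns for which a stuck-at fault at X changes the output."""
    return [(a, b) for a in (0, 1) for b in (0, 1)
            if and_gate(a, b) != and_gate(a, b, stuck_at)]

# Controllability: X = 1 needs both inputs at 1; X = 0 needs any input at 0.
print(detecting_patterns(0))   # [(1, 1)]: must control X to 1 to expose SA0
print(detecting_patterns(1))   # [(0, 0), (0, 1), (1, 0)]

# Observability: with the other input held at 1 the gate is transparent.
print([and_gate(a, 1) for a in (0, 1)])   # [0, 1]: follows input a
print([and_gate(a, 0) for a in (0, 1)])   # [0, 0]: input a is masked
```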

Dynamic and Internal Power

Dynamic Power
As the name indicates, dynamic power is dissipated when the signals going through the CMOS circuits change their logic state. At that moment, energy is drawn from the power supply to charge up the output node capacitance. Charging up the output capacitance causes a transition from 0V to Vdd. Taking an inverter as an example, part of the energy drawn from the power supply during charging is dissipated as heat in the PMOS transistor. The discharging process, on the other hand, causes the NMOS transistor to dissipate heat.

The output capacitance of a CMOS logic gate consists of the following components:

1) Output node capacitance of the logic gate: This is due to the drain diffusion region.
2) Total interconnect capacitance: This has a larger effect as the technology node shrinks.
3) Input node capacitance of the driven gate: This is due to the gate oxide capacitance.
The average power dissipation of the CMOS logic circuit can be mathematically expressed [2]. Integrating the instantaneous power over the period of interest, the energy $E_{VDD}$ taken from the supply during the transition is given by

\[
E_{VDD} = \int_0^{\infty} i_{VDD}(t)\, V_{DD}\, dt
        = V_{DD} \int_0^{\infty} C_L \frac{dv_{out}}{dt}\, dt
        = C_L V_{DD} \int_0^{V_{DD}} dv_{out}
        = C_L V_{DD}^2
\]

Similarly, integrating the instantaneous power over the period of interest, the energy $E_C$ stored in the capacitor at the end of the transition is given by

\[
E_C = \int_0^{\infty} i_{VDD}(t)\, v_{out}\, dt
    = \int_0^{\infty} C_L \frac{dv_{out}}{dt}\, v_{out}\, dt
    = C_L \int_0^{V_{DD}} v_{out}\, dv_{out}
    = \frac{C_L V_{DD}^2}{2}
\]

Therefore, the energy stored in the capacitor is $C_L V_{DD}^2 / 2$.

This implies that half of the energy supplied by the power source is stored in CL. The other half has been dissipated by the PMOS devices. This energy dissipation is independent of the size of the PMOS device. During the discharge phase the charge is removed from the capacitor, and its energy is dissipated in the NMOS device.
Each switching cycle takes a fixed amount of energy, $C_L V_{DD}^2$.

If a gate is switched on and off $f_n$ times per second, then $P_{dynamic} = C_L V_{DD}^2 f_n$,

where $f_n$ is the frequency of energy-consuming transitions. This is also called "switching activity".

In general we can write,
$P_{dynamic} = C_{eff} V_{DD}^2 f$
where $f$ is the maximum switching activity possible, i.e. the clock rate.
Hence,
\[
P_{avg} = \frac{1}{T}\left[\int_0^{T/2} V_{out}\left(-C_{load}\frac{dV_{out}}{dt}\right)dt + \int_{T/2}^{T} \left(V_{DD}-V_{out}\right)\left(C_{load}\frac{dV_{out}}{dt}\right)dt\right]
\]

i.e. $P_{avg} = \frac{1}{T}\, C_{load} V_{DD}^2$
i.e. $P_{avg} = C_{load} V_{DD}^2 F_{clk}$, where $F_{clk} = 1/T$.
Here, the energy required to charge the output node up to Vdd and to discharge the total output load capacitance down to ground level is integrated. The applied periodic input waveform, with period T, is assumed to have zero rise and fall times. Note that the average power is independent of transistor size and characteristics.
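As a quick numeric illustration of the Pavg = Cload.VDD2.Fclk relation, here is a minimal sketch. The load capacitance, supply voltages and clock frequency below are made-up values chosen only to make the arithmetic easy to follow; they do not come from any particular technology.

```python
def dynamic_power(c_load_farads, vdd_volts, freq_hz):
    """Average switching power P = C * VDD^2 * f, in watts."""
    return c_load_farads * vdd_volts ** 2 * freq_hz


# Hypothetical numbers: 10 fF load, 1 GHz clock, two supply voltages.
p_nominal = dynamic_power(10e-15, 1.0, 1e9)
p_scaled = dynamic_power(10e-15, 0.8, 1e9)

print(f"P at 1.0 V: {p_nominal * 1e6:.2f} uW")
print(f"P at 0.8 V: {p_scaled * 1e6:.2f} uW")
print(f"ratio     : {p_scaled / p_nominal:.2f}")
```

Because the supply voltage enters as a square, dropping it from 1.0 V to 0.8 V in this sketch cuts the switching power to 64% of its original value, which is why lowering Vdd is such an effective way of reducing dynamic power.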

Internal power

This is the power consumed by the cell when an input changes but the output does not change [3]. In logic gates, not every transition at a cell input necessarily leads to a change in the state of the output net. Also, the internal node voltage swing can be only Vi, which can be smaller than the full voltage swing of Vdd, leading to a partial voltage swing.
The steps mentioned below can be taken to reduce dynamic power:

1) Reduce power supply voltage Vdd
2) Reduce voltage swing in all nodes
3) Reduce the switching probability (transition factor)
4) Reduce load capacitance 


References
[1] Michael Keating, David Flynn, Robert Aitken, Alan Gibbons and Kaijian Shi, “Low Power Methodology Manual for System on Chip Design”, Springer Publications, New York, 2007, www.lpmm-book.org, 4/9/2007
[2] Jan M Rabaey, Anantha Chandrakasan and Borivoje Nikolic, "Digital Integrated Circuits A Design Perspective", 2nd Edition, 2005, Prentice Hall
[3] Astro, User Guide, Version X-2005.09, September 2005

Low Power Design Techniques

Michael Keating et al. [1] list several low power techniques to tackle the dynamic and static power consumption in modern SoC designs. Dynamic power control techniques include clock gating, multi voltage, variable frequency, and efficient circuits. Leakage power control techniques include power gating and multi Vt cells. Common methods supported by EDA tools include clock gating, gate sizing, low power placement, register clustering, low power CTS and multi Vt optimization.
Some of the low power techniques in use today are listed in the table below.

Different Low Power Techniques [3]



Trade-offs associated with the various power management techniques [2]
The above table summarizes the trade-offs associated with different power management techniques. Power gating and DVFS demand large methodology changes, whereas multi Vt and clock gating affect the methodology the least. Unless a large leakage reduction is necessary, it is generally beneficial to go with either multi Vt or clock gating techniques. Based on the design complexity and requirements, a combination of low power techniques can be adopted. Multi Vt optimization along with power gating is found to be efficient in some complex designs. Advanced improvements in the implementation (i.e. fabrication) technology have allowed substrate biasing techniques to be used heavily, as they do not pose any architectural or design verification challenges and also provide high leakage reduction.
References
[1] Michael Keating, David Flynn, Robert Aitken, Alan Gibbons and Kaijian Shi, “Low Power Methodology Manual for System on Chip Design”, Springer Publications, New York, 2007, www.lpmm-book.org, 4/9/2007
[2] Creating Low-Power Digital Integrated Circuits – The Implementation Phase, Cadence, 2007

Clock Gating

The clock tree can consume more than 50% of dynamic power. The components of this power are:
1) Power consumed by combinatorial logic whose values are changing on each clock edge,
2) Power consumed by flip-flops, and
3) Power consumed by the clock buffer tree in the design.

It is a good design idea to turn off the clock when it is not needed. Automatic clock gating is supported by modern EDA tools. They identify the circuits where clock gating can be inserted.


RTL clock gating works by identifying groups of flip-flops which share a common enable control signal. Traditional methodologies use this enable term to control the select on a multiplexer connected to the D port of the flip-flop or to control the clock enable pin on a flip-flop with clock enable capabilities. RTL clock gating uses this enable signal to control a clock gating circuit which is connected to the clock ports of all of the flip-flops with the common enable term. Therefore, if a bank of flip-flops which share a common enable term have RTL clock gating implemented, the flip-flops will consume zero dynamic power as long as this enable signal is false.
There are two types of clock gating styles available. They are:
1) Latch-based clock gating
2) Latch-free clock gating.
Latch free clock gating
The latch-free clock gating style uses a simple AND or OR gate (depending on the edge on which the flip-flops are triggered). Here, if the enable signal goes inactive in the middle of a clock pulse, or if it toggles multiple times, the gated clock output can either terminate prematurely or generate multiple clock pulses. This restriction makes the latch-free clock gating style inappropriate for our single-clock flip-flop based design.

Latch free clock gating
Latch based clock gating
The latch-based clock gating style adds a level-sensitive latch to the design to hold the enable signal from the active edge of the clock until the inactive edge of the clock. Since the latch captures the state of the enable signal and holds it until the complete clock pulse has been generated, the enable signal need only be stable around the rising edge of the clock, just as in the traditional ungated design style.

Latch based clock gating
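The difference between the two styles can be seen in a small behavioural sketch. This is a simplified model written for this post, not a library cell: it assumes rising-edge flops, an AND-type gate, a latch that is transparent while the clock is low, and waveforms sampled four times per clock phase. The enable waveform is a made-up example that glitches while the clock is high.

```python
def latch_free_gate(clk, en):
    """Latch-free style: gated clock is simply clk AND en, sample by sample."""
    return [c & e for c, e in zip(clk, en)]


def latch_based_gate(clk, en):
    """Latch-based style: enable passes through a latch that is transparent
    while clk is low and holds its value while clk is high."""
    latched = 0
    gated = []
    for c, e in zip(clk, en):
        if c == 0:
            latched = e          # latch follows enable during the low phase
        gated.append(c & latched)
    return gated


def count_rising_edges(wave):
    """Number of 0 -> 1 transitions, i.e. clock edges the flops would see."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)


# Two clock cycles, four samples per phase (hypothetical waveforms).
clk = [0, 0, 0, 0, 1, 1, 1, 1] * 2
# Enable is high before the first rising edge, then glitches while clk is high.
en = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

free = latch_free_gate(clk, en)
based = latch_based_gate(clk, en)
print("latch-free :", free, "rising edges =", count_rising_edges(free))
print("latch-based:", based, "rising edges =", count_rising_edges(based))
```

In this sketch the latch-free gate produces two narrow pulses where only one clock pulse was intended, while the latch-based gate produces one clean, full-width pulse because the enable is frozen for the whole high phase.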
Specific clock gating cells are required in the library so that the synthesis tools can use them. The availability of clock gating cells and their automatic insertion by EDA tools make this a simple low power technique. An advantage of this method is that clock gating does not require modifications to the RTL description.
References
[1] Frank Emnett and Mark Biegel, “Power Reduction Through RTL Clock Gating”, SNUG, San Jose, 2000
[2] PrimeTime User Guide

Power Gating

Power gating is the technique wherein circuit blocks that are not in
use are temporarily turned off to reduce the overall leakage power of
the chip. This temporary shutdown time is also called "low power
mode" or "inactive mode". When the circuit blocks are required for
operation once again, they are returned to "active mode". These two
modes are switched at the appropriate time and in a suitable manner
to maximize power savings while minimizing the impact on performance.
Thus the goal of power gating is to minimize leakage power by temporarily
cutting off power to selected blocks that are not required in that
mode.

Power gating affects the design architecture more than clock
gating does. It increases time delays, as power gated modes have to be
safely entered and exited. The possible amount of leakage power saving
in such a low power mode and the energy dissipated to enter and exit
it introduce some architectural trade-offs.
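One way to reason about this trade-off is a break-even estimate: shutting a block down only pays off if the leakage energy saved while it is off exceeds the energy spent entering and exiting the low power mode. The sketch below is a simplified model with made-up numbers (2 uJ of entry plus exit overhead, 1 mW of leakage saved); the function names are hypothetical.

```python
def break_even_idle_time(overhead_energy_j, leakage_saved_w):
    """Minimum idle time (seconds) for which power gating saves energy."""
    if leakage_saved_w <= 0:
        raise ValueError("leakage_saved_w must be positive")
    return overhead_energy_j / leakage_saved_w


def net_energy_saved(idle_time_s, overhead_energy_j, leakage_saved_w):
    """Energy saved (joules) by gating for idle_time_s; negative means a loss."""
    return leakage_saved_w * idle_time_s - overhead_energy_j


# Hypothetical block: 2 uJ to enter and exit the mode, 1 mW leakage saved.
OVERHEAD_J = 2e-6
SAVED_W = 1e-3

t_min = break_even_idle_time(OVERHEAD_J, SAVED_W)
print(f"break-even idle time: {t_min * 1e3:.1f} ms")
for idle in (1e-3, 10e-3):
    saved = net_energy_saved(idle, OVERHEAD_J, SAVED_W)
    print(f"idle {idle * 1e3:.0f} ms -> net saving {saved * 1e6:.1f} uJ")
```

With these assumed numbers, gating the block for 1 ms costs more energy than it saves, whereas gating it for 10 ms is a clear win. A real power management controller would also have to account for the wake-up latency.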

How to shut down the blocks? It can be accomplished either by software
or hardware. Driver software can schedule the power down operations.
Hardware timers can be utilized. A dedicated power management
controller is the other option.

An externally switched power supply is a very basic form of power gating
to achieve long term leakage power reduction. To shut off a block for
a small interval of time, internal power gating is suitable. CMOS
switches that provide power to the circuitry are controlled by power
gating controllers.

The outputs of a power gated block discharge slowly. Hence the output
voltage levels spend more time around the threshold voltage level. This
can lead to larger short circuit current in the gates they drive.

Isolation Cells
Isolation cells are used to prevent short circuit current. As the name
indicates these cells isolate power gated block from the normally on
block. Isolation cells are specially designed for low short circuit
current when input is at threshold voltage level. Isolation control
signals are provided by power gating controller.

Retention Registers
Retention registers are special low leakage flip-flops used to hold
the data of the main registers of the power gated block. Thus the internal
state of the block during power down mode can be retained and loaded
back to it when the block is reactivated. Retention registers are
always powered up. The retention strategy is design dependent. During
power gating, data can be retained and transferred back to the block when
power gating is withdrawn. The power gating controller controls the
retention mechanism, such as when to save the current contents of the
power gated block and when to restore them.


Reference

Michael Keating, David Flynn, Robert Aitken, Alan Gibbons, Kaijian
Shi,"Low Power Methodology Manual For System on Chip Design",
Electronic Edition,Springer, 2007. www.lpmm-book.org

Faulty Clock Gating: How "Not" to Gate the Clock

You would come across a plethora of technical literature on clock gating and its associated techniques. It does not come as a surprise, because clock gating is the most commonly employed design technique to save dynamic power. However, many implementations are faulty, in the sense that while they do gate the clock, they result in an overall increase in dynamic power consumption. We will discuss one such common technique, which negates all the power saving benefits of clock gating. You are advised to use your discretion before using it.
The basic rationale behind clock gating:
  • Even when the output of a flip-flop is not toggling, owing to the transitions (and hence charging/discharging of nodes) in the internal circuitry of the flip-flop, it still continues to dissipate dynamic power when it is being fed by a clock signal.
  • When the input of the flip-flop is not toggling or would not toggle, one can effectively gate the clock to that flip-flop for that particular time and save dynamic power. 
One logical implementation for the above problem statement (and this is indeed the implementation employed in many technical papers and patents) is depicted below:
Let's take a look at the above implementation. The XOR of the D input and the Q output of the flip-flop has been used as the enable signal for the clock gate CGIC. The logical explanation behind this is: when the output of the flop is the same as its input, which would be detected by XOR'ing the two, one can gate the clock to the flip-flop.
Example: Let's say initially Q = 1. Now D = 1, which means that the output of the flop is destined to stay at "1" for the next cycle as well. XOR'ing these two signals: Q XOR D = 0, so EN = 0 would gate the clock to the flip-flop. So, would that save power? Well, one would expect it that way. Let's take a look at why it would result in an increased power dissipation.

The circuit shown above is a trap! The actual circuit would be something like the one shown below:

  • As evident from the above figure, the XOR gate would continue to toggle for the entire time period of the clock and would become stable only "setup time" before the next clock edge. And during this entire duration, it would continue dissipating dynamic power. You might argue here that the power dissipated must be less than the power dissipated by an idle flop receiving clock. Well, that might be true for some technology, but the XOR gate is the most bulky gate (among all primitive gates) and I would say that this power, if not less, would at least be comparable to that of an idle flop receiving a clock signal.
  • Secondly, the circuit above uses a CGIC. Note that a CGIC comprises one latch and an AND gate, while a flop comprises two latches. The internal circuitry of the CGIC would continue to charge/discharge and hence dissipate power.
The sum of the above two power dissipations would overshadow the benefits one was expecting in the first place, and hence it is a common design trap. Beware of it.
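To make the first point concrete, here is a tiny sketch that counts how often the XOR output toggles within one clock period. The D waveform is a made-up example of an input that glitches a few times before settling ahead of the setup window; no real power numbers are implied.

```python
def xor_enable_wave(d_wave, q):
    """EN = D XOR Q, evaluated for every sample of D while Q is held."""
    return [d ^ q for d in d_wave]


def count_transitions(wave):
    """Number of 0->1 and 1->0 toggles in a sampled waveform."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a != b)


# Hypothetical: Q holds 1 for the cycle; D glitches and finally settles at 1.
q = 1
d_wave = [1, 0, 1, 1, 0, 0, 1, 1]

en_wave = xor_enable_wave(d_wave, q)
print("EN waveform   :", en_wave)
print("EN toggles    :", count_transitions(en_wave))
print("final EN value:", en_wave[-1])
```

In this example the final EN value is 0, so the clock does get gated for the next edge, yet the XOR output has already toggled four times during the cycle, and each of those toggles charges or discharges the XOR output node and the CGIC enable input.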

High Speed Counter Design

In this post, I'll talk about the limitations associated with the conventional binary counter design in terms of its maximum operating frequency, and also discuss an ingenious yet simple design (not invented by me!) which can operate at a very high frequency.
Conventional Binary Counter: The operating speed of any binary counter, or for that matter, any sequential circuit, is governed by the setup time limitation imposed by the combinatorial logic between any two registers (flip-flops). Note that:
  • Any higher-order counter bit toggles only when all the lower-order bits are logic 1.
  • The input for any higher-order counter bit is a function of all the lower-order bits and itself during the last clock cycle.
  • The operating speed for an n-bit counter is limited by the following equation, where $T_{clk}$ is the time period of the clock:

    $T_{clk} \geq T_{clk\text{-}to\text{-}q,\,FF0} + (n-2)\,T_{AND} + T_{XOR} + T_{su,\,FF(n-1)}$

  • The following figure shows the circuit for a 4-bit conventional binary counter. It must be noted that as the counter width increases, the operating frequency decreases.
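To see how quickly the AND chain eats into the clock period, here is a small sketch that evaluates the equation above. The cell delays (in picoseconds) are invented round numbers, not values from any real library.

```python
def conventional_min_period_ps(n_bits, t_clk_to_q, t_and, t_xor, t_setup):
    """Minimum clock period (ps) of an n-bit conventional binary counter:
    T(clk-to-q) + (n-2)*T(AND) + T(XOR) + T(setup)."""
    if n_bits < 2:
        raise ValueError("the equation assumes at least a 2-bit counter")
    return t_clk_to_q + (n_bits - 2) * t_and + t_xor + t_setup


# Hypothetical delays in picoseconds.
DELAYS = dict(t_clk_to_q=100, t_and=40, t_xor=60, t_setup=50)

for width in (4, 8, 16):
    period = conventional_min_period_ps(width, **DELAYS)
    print(f"{width:2d}-bit: T_clk >= {period} ps, f_max ~ {1e6 / period:.0f} MHz")
```

With these assumed delays, a 4-bit counter needs a period of at least 290 ps, while a 16-bit counter needs 770 ps, because every extra bit adds one more AND delay to the critical path.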
High Speed Binary Counter: How about designing a binary counter where there are no combinatorial cells between any two registers, so that such a design is able to achieve the highest operating frequency for a given technology node? For this counter the basic premise is:
  • Since the counting sequence for any counter bit is deterministic in nature, it should be possible to design a counter in a manner that: each bit is a function of only itself over all the previous clock cycles.
  • Johnson Counter enables us to design in such a manner that there is no combinatorial cell between any two registers. Let's have a look:
    Since the LSB i.e. Q0 toggles itself at every clock cycle, the 1-bit Johnson Counter can be used for Q0. Note that here we are using bit-by-bit synthesis instead of the conventional Karnaugh map approach to design our binary counter.
  • Similarly, higher order counter bits can be realized by higher order Johnson Counter, where the last bit would represent the binary counter bit. For Q1, the circuit would be:
  • The same can be extended in a recursive manner to design any n-bit binary counter.
  • Note that in this design, there are absolutely no combinatorial cells between any two registers, thereby making high operating speed possible.

    $T_{clk} \geq T_{clk\text{-}to\text{-}\overline{q},\,FF} + T_{su,\,FF}$

    What is the trade-off here? The answer is dynamic power dissipation. Note that a conventional n-bit counter would use n flops. However, for the proposed design, a 3-bit counter would need (1+2+4=) 7 flops, a 4-bit counter would need (1+2+4+8=) 15 flops and so on. This design might find practical application for lower order counter widths like 4-6.
    Above that, the design would dissipate too much power to be of any practical use.
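The idea can be sanity-checked with a short behavioural model. The sketch below assumes that counter bit Qk is taken from the last flop of a Johnson counter built from 2^k flops, with all flops reset to 0; the function names are made up for this illustration.

```python
def johnson_last_bit_stream(num_flops, cycles):
    """Value of the last flop of a Johnson counter, one sample per clock cycle.
    Each cycle the register shifts by one and the inverted last bit is fed
    back into the first flop, so no logic gate sits between two flops."""
    state = [0] * num_flops
    stream = []
    for _ in range(cycles):
        stream.append(state[-1])
        state = [1 - state[-1]] + state[:-1]
    return stream


def high_speed_counter(width, cycles):
    """Binary count sequence assembled from one Johnson counter per bit."""
    streams = [johnson_last_bit_stream(2 ** k, cycles) for k in range(width)]
    return [sum(streams[k][t] << k for k in range(width)) for t in range(cycles)]


def flop_count(width):
    """Total flops used: 1 + 2 + 4 + ... + 2^(width-1)."""
    return sum(2 ** k for k in range(width))


print("3-bit count sequence:", high_speed_counter(3, 16))
for w in (3, 4, 6, 8):
    print(f"{w}-bit counter: {flop_count(w)} flops (conventional: {w})")
```

The 3-bit model counts 0 through 7 and wraps around, and the flop count grows as 2^n - 1, which is why the approach stops being attractive beyond small counter widths.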

Integrated Clock and Power Gating

Clock Gating and Power Gating are two most commonly used design methods to save dynamic and leakage power respectively. How about integrating the two solutions such that they complement each other? In this post, I will talk about a simple way to do so.
Clock Gating is accomplished by using Clock Gating Integrated Cell (CGIC) which gates the clock to the sequential elements present in its fan-out when the enable signal is logic 0. Power Gating structures may be of two types: Simple Power Gating and State Retention Power Gating. Using the former technique, the output of the logic gates slowly leaks the charge at the output and thereby when the SLEEP signal is de-asserted, one cannot predict the logic value at the output. The latter technique is able to retain the state at the output which was last present before asserting the SLEEP signal.
Let's take up a few plausible scenarios:
  • Case I - Normal Case: This employs only conventional clock gating. It is depicted in the figure.


  • Case II - When one does not need to retain the states of the combinatorial cells or the sequential elements. One possible scenario could be the case of a standalone IP, which is not communicating with any other IP on the SoC. Here one can use simple power gating, where the SLEEP signal is derived from the CGIC itself using a latch, as depicted in the figure below. Doing so, we would save both dynamic and leakage power.


  • Case III - When one does not need to retain the states of the combinatorial cells, but the sequential outputs need to be safe-stated. A possible use-case could be where only the sequential outputs communicate with other IPs on the SoC. This can be accomplished by using State Retention Flip Flops instead of the conventional flip-flops.
  • Case IV - When both the combinatorial cells and the sequential cells interact with other IPs, but the previous values need not be retained. Since it is a classic case of interaction between a "switchable power domain" and an "always ON" domain, it entails the use of isolation cells at such power domain crossings. It must be noted that in such a case, the isolation cell would always be present in the always ON power domain, i.e., it would receive its VDD supply from the always ON power domain supply. This is because, when the switchable power domain is OFF, the isolation cell can function only if it receives the power supply!
Isolation Cells can be simple cells like AND or an OR gate, which receive one input in a way that, irrespective of the second input coming from the switchable power domain, the value would be controllable. For example, logic 0 for AND gate and logic 1 for an OR gate. I will try to take this up in a separate post.
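Until that post, here is a minimal logical sketch of the two clamp types. The signal names are hypothetical, and the value coming from the switched-off domain is modelled as an arbitrary 0 or 1, whereas in silicon it would be an undefined voltage.

```python
def and_isolation(data_from_switchable, iso_n):
    """AND-type isolation cell: iso_n = 0 clamps the output to logic 0."""
    return data_from_switchable & iso_n


def or_isolation(data_from_switchable, iso):
    """OR-type isolation cell: iso = 1 clamps the output to logic 1."""
    return data_from_switchable | iso


# Isolation active: the output is fixed whatever the switchable domain drives.
for unknown in (0, 1):
    print(f"input {unknown}: AND clamp -> {and_isolation(unknown, 0)}, "
          f"OR clamp -> {or_isolation(unknown, 1)}")

# Isolation inactive: both cells pass the data through unchanged.
for data in (0, 1):
    print(f"input {data}: AND pass -> {and_isolation(data, 1)}, "
          f"OR pass -> {or_isolation(data, 0)}")
```

The choice between the two styles depends on which value is safe for the always ON logic to see while the switchable domain is OFF.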