A chain fail is when one or more scan chains are non-operational; that is to say, data cannot be accurately shifted through the flip-flops, arranged as shift registers, that constitute a scan chain. The reasons they suck are many, but here are my Top 5:
- Scan chain fails usually dominate the logic yield loss early in the technology ramp, right when it’s most critical to get some actionable data
- Scan chain fails require a huge amount of data to perform fault simulation based chain diagnosis for defect localization
- Scan chain fails require specialized ATE hardware to perform defect localization without resorting to simulation based chain diagnosis
- Scan chain fails have all the same possible root cause explanations as logic fails (logic fails = scan chains are functional, silicon is not) and several additional ‘chain fail only’ root causes
- Scan chain fails are often mixed in with logic fails and product/yield engineers disregard this important distinction.
What Causes a Chain Fail?
Defects That Cause Chain Fails
A chain fail occurs when the data being shifted out of a scan chain is found by the Automatic Test Equipment (ATE) to be corrupted for a reasonably high percentage of cycles. Defects that cause chain fails come from:
- A defect in the scan flip-flop that affects shift functionality. Note that it is possible to have a defect in a flip-flop that only affects functional operation, and this would not be a chain fail.
- A defect on the net that connects the scan out signal from one flop to the scan in signal on another flop
- A defect in the scan enable tree that feeds one or more flops. The scan enable tree is huge (high critical area) and usually not signed off for timing performance.
- A defect in the clock tree that feeds one or more flops. The clock tree is also extremely high critical area, but is obviously signed off for timing performance.
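To make the effect of such a defect concrete, here is a minimal sketch of a scan chain modeled as a plain shift register, with one flop optionally stuck at a fixed value. The function name `shift_chain`, the 8-flop chain, and the repeating 0011 stimulus are all assumptions chosen for illustration; a real chain also starts in an unknown state rather than all zeros.

```python
def shift_chain(length, scan_in_bits, stuck_at=None):
    """Shift bits through a chain of `length` flops; return the scan-out stream.

    Flop 0 is nearest scan-in, flop length-1 drives scan-out.
    stuck_at: optional (flop_index, value); that flop always holds `value`.
    """
    flops = [0] * length                 # assumed reset state (illustrative only)
    if stuck_at is not None:
        flops[stuck_at[0]] = stuck_at[1]
    out = []
    for bit in scan_in_bits:
        out.append(flops[-1])            # value seen at scan-out this cycle
        flops = [bit] + flops[:-1]       # each flop captures its upstream neighbour
        if stuck_at is not None:
            flops[stuck_at[0]] = stuck_at[1]   # the defect overrides whatever was captured
    return out


stimulus = [0, 0, 1, 1] * 6              # hypothetical chain-test stimulus
good = shift_chain(8, stimulus)
bad = shift_chain(8, stimulus, stuck_at=(3, 0))
mismatches = sum(g != b for g, b in zip(good, bad))
print(f"{mismatches} of {len(good)} scan-out cycles corrupted")
```

Every bit that has to pass through the stuck flop comes out as the stuck value, which is why a single defect corrupts such a large share of the observed cycles.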
Parametrics That Cause Chain Fails
There is another equally important list of chain fails to consider. This class of fails can’t really be considered defects in the traditional sense because a parametric problem can’t typically be imaged in a failure analysis Scanning Electron Microscope (SEM) or Transmission Electron Microscope (TEM). However, parametric effects can be just as devastating to yield, and often are responsible for yield loss early in the process or product ramp. Examples include:
- Setup or Hold time faults. These can often be blamed on a timing mismatch that was either found by Static Timing Analysis (STA) and ignored, or missed by STA. I once worked on a yield problem where a setup fault could only be identified by STA when the design was simulated at the slow timing, high voltage, low temperature corner (no one simulates that corner for signoff) and yet the flop was failing in silicon.
- Power droop can cause a flop to lose its state. This may only affect a certain flop or set of flops in the design, but this lost state will affect the whole chain as each value is shifted through a weak flop. Power droop is especially problematic during scan testing and logic BIST testing because there is abnormally high activity on the chip (half the flops toggling simultaneously). I remember one time my classmate and I were trying to figure out why a part running logicBIST was running below its intended frequency. We would check the max frequency, solder a capacitor onto the loadboard, and recheck the max frequency. It went up every time we added a capacitor: power droop in action.
- Connection between ATE and scan chain inside the chip. It is possible to have a chain fail occurring because of effects outside of the silicon. Not usually for digital pins like scan chains, but it’s possible to have a bad calibration on the test head. Also I’ve observed chain fails that were induced because of a resistive connection between the probecard and the chip pads. You also can’t rule out the possibility of a bad package connection if you are working on final test units. Note that there is an extension of this where all chains are failing because power is not being supplied to the chip, like when I would forget to touch the probecard down on the wafer before I started testing.
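For the setup and hold item in the list above, the textbook slack checks can be sketched in a few lines. This is a generic illustration, not the analysis from the yield problem described; the function names and every number (in picoseconds) are hypothetical, and `skew` is assumed to mean capture clock arrival minus launch clock arrival.

```python
def setup_slack(period, clk_to_q, path_delay, setup_time, skew=0):
    """Setup slack: data must arrive setup_time before the next capture edge."""
    return (period + skew) - (clk_to_q + path_delay + setup_time)


def hold_slack(clk_to_q, path_delay, hold_time, skew=0):
    """Hold slack: data must stay stable hold_time after the same capture edge."""
    return (clk_to_q + path_delay) - (hold_time + skew)


# Hypothetical flop-to-flop shift path: almost no logic between the two flops.
shift_setup = setup_slack(period=20000, clk_to_q=120, path_delay=30,
                          setup_time=60, skew=150)
shift_hold = hold_slack(clk_to_q=120, path_delay=30, hold_time=40, skew=150)

print("setup slack (ps):", shift_setup)
print("hold slack (ps):", shift_hold)   # negative means a hold violation
```

Note that the clock period does not appear in the hold check, so in this sketch slowing down the shift clock would not rescue a chain that fails for hold.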
Some Details on Why Chain Fails Suck
- Chain Fails are often the dominant yield loss mechanism during product or process ramp. There are several reasons why this is true and they all serve to confuse the issue as well as the engineer. Firstly, let’s note the fact that scan chains are usually heavily routed in the upper metal layers. This is a natural consequence of the fact that scan enable and clock trees are for the most part considered global routes, and a scan out net can connect fairly distant flops because it’s a slow speed and non-functional connection. All of this ends up meaning that there is a whole lot of upper metal layer critical area tied up in nets that, if defective, would induce a chain fail. Early in a process ramp the defectivity is always high. Combine the high critical area of the chain fails with the high defectivity of the process ramp and you can see why there are so many chain fails at the beginning of a process. It’s not because they are systematic, it’s because they catch a lot of defects. Also, a chain fail will almost always mask the existence of a logic fail (the exception being compound diagnosis which can separate the two), so when there are a lot of defects it will look like a lot of chain fails. Do not be fooled into thinking that a high number of chain fails can only be the result of a systematic defect. Every design is different and it’s possible to have 30-70% of logic defects result in a chain fail with perfectly random defectivity.
- Scan chain fails require a huge amount of data to perform fault simulation based chain diagnosis for defect localization. Why is this? It’s because there are so many freaking failing bits! In actuality a lot of those bits could be ignored (we already know the chain is Stuck-at-0), but the tester isn’t smart enough to know which bits to ignore. Scan chain diagnosis, like logic diagnosis, needs a certain number of failing patterns (entire shifts through the chain) in order to have good resolution. Therefore the longer the scan chain, the more data needs to be collected. As an interesting sidenote, the opposite is also true (*). An interesting shift in the industry is occurring in this space with the emergence of zero overhead datalogging like the Teseda V520 and its kin. What used to be an impossible amount of data to collect for chain fail diagnosis (often tens of thousands of cycles) doesn’t really take that long with a symmetric capture buffer. This is leading to a more widespread adoption of chain diagnosis (both hardware and software) as a way of understanding yield loss. Unfortunately, for many existing ATE fleets this rapid data collection capability isn’t the case, which is why it stays on my list of things that suck.
- Scan chain fails require specialized ATE hardware to perform defect localization without resorting to simulation based chain diagnosis. It is possible to perform scan chain defect localization using only hardware. The company that I know of doing this is Verigy with their ChainAnalyzer, part of the Inovys acquisition. From what I have heard about this, it seems to be compelling technology which is able to automatically apply patterns to the scan input pins, and check values on the scan output pins, in order to localize which flip-flop is failing. The only reason I would call this a downside is that it requires specialized hardware, and if you don’t have the hardware, you can’t use it (obviously). To those of you that have the hardware and can use it, Kudos!
- Scan chain fails have all the same possible root cause explanations as logic fails and several additional ‘chain fail only’ root causes. Continuing on the line of thinking in the first item above, chain fails can come from random defects. For random defects, chain fails and logic fails have the same basic root cause. The problem comes when there really are systematic defect related chain fails that are buried in amongst all of the random defect related chain fails. Thus in order to tackle the systematic chain fails, one must first separate the random from the systematic. This is much easier said than done. Take for example a systematic defect that affects the chain functionality in a scan flip-flop. Since the flip-flop can be placed anywhere on the design, the defect will seem to occur randomly on the design; and yet it is a systematic chain fail. Separating random from systematic generally involves either zonal analysis (ITC05, ISTFA11) or design normalization (let me know if you have a great reference to this). The problem with design normalization for chain fails is that it’s not typically clear which locations result in a logic fail vs. a chain fail. I’ve seen a lot of people normalize by chain length, but they really should be normalizing by chain critical area (at the very least).
- Scan chain fails are often mixed in with logic fails and product/yield engineers disregard this important distinction. I actually consider this one to be the biggest reason chain fails suck, and yet the easiest to solve. The overwhelming majority of the time product and test engineers have no idea what % of the yield loss is related to chain fails vs. logic fails. They will end up datalogging 2 parts, both will be chain fails, and they will wrongly conclude that they have 100% systematic chain fallout. The right answer is to separate the Chain Test Pattern Set from the Stuck-At Pattern Set, test them individually, and assign each its own softbin. This will be a ~zero test time impact approach that will provide visibility on how much of the scan yield loss is chain related. To me this is a no-brainer. Even if the data never gets used, it’s free to collect: no datalogging required!
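The softbin idea in the last item can be sketched as follows. This is only an illustration of the bookkeeping, not any tester's actual API: the bin numbers, the function names, and the sample lot are all made up, and a real test program would do the binning in its own flow language.

```python
from collections import Counter

# Hypothetical softbin numbers; real test programs define their own.
BIN_PASS, BIN_CHAIN_FAIL, BIN_LOGIC_FAIL = 1, 20, 21


def assign_softbin(chain_pass, stuck_at_pass):
    """Bin one die from the separate chain-test and stuck-at results."""
    if not chain_pass:
        return BIN_CHAIN_FAIL      # chain test comes first: a chain fail masks logic results
    if not stuck_at_pass:
        return BIN_LOGIC_FAIL
    return BIN_PASS


def scan_loss_breakdown(results):
    """Return the share of scan fallout that is chain related vs. logic related."""
    bins = Counter(assign_softbin(c, s) for c, s in results)
    fails = bins[BIN_CHAIN_FAIL] + bins[BIN_LOGIC_FAIL]
    if fails == 0:
        return {"chain": 0.0, "logic": 0.0}
    return {"chain": bins[BIN_CHAIN_FAIL] / fails,
            "logic": bins[BIN_LOGIC_FAIL] / fails}


# Made-up lot: (chain_pass, stuck_at_pass) per die.
lot = [(True, True)] * 6 + [(False, False)] * 3 + [(True, False)] * 1
print(scan_loss_breakdown(lot))
```

With the two pattern sets binned separately, the chain vs. logic split falls straight out of the bin counts for every lot, with no datalog needed.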
By now I hope I’ve convinced you that chain fails suck, but can be conquered with patience and hard work (and some sort of tester). I’d be happy to hear about any stories you have to support my suckness conclusion, or solutions you have for additional desuckifying.
(*) When scan chains are compressed, like with Mentor Graphics TestKompress and Synopsys DFTMAX, less data volume is required to be collected for chain diagnosis because the number of patterns required is the same but the chain length is reduced by the compression factor!
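As a rough worked example of that footnote, the number of shift cycles to capture is approximately the number of failing patterns times the chain length. The sketch below assumes that simple model; the function name `diagnosis_cycles` and the figures (20 patterns, 10,000-flop chain, 50x compression) are hypothetical and not taken from any tool's documentation.

```python
import math


def diagnosis_cycles(num_patterns, chain_length, compression_factor=1):
    """Approximate shift cycles to capture for chain diagnosis.

    Assumes each pattern is one full shift through the chain, and that
    compression shortens the chain by `compression_factor`.
    """
    effective_length = math.ceil(chain_length / compression_factor)
    return num_patterns * effective_length


uncompressed = diagnosis_cycles(20, 10_000)
compressed = diagnosis_cycles(20, 10_000, compression_factor=50)
print(uncompressed, compressed)
```

The pattern count stays the same in both calls; only the effective chain length changes, which is where the data volume saving comes from.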