Clock-Skew-Aware Scan Chain Grouping for Mitigating Shift Timing Failures in Low-Power Scan Testing

Zhang, Yucong; Wen, Xiaoqing; Holst, Stefan; Miyase, Kohei; Kajihara, Seiji; Wunderlich, Hans-Joachim; Qian, Jun

Proceedings of the 27th IEEE Asian Test Symposium (ATS’18), Hefei, Anhui, China, 15-18 October 2018, pp. 149-154

doi: https://doi.org/10.1109/ATS.2018.00037

**Abstract:** High scan shift power often leads to excessive heat as well as other severe problems such as shift timing failures. Partial shift (shifting only a subset of scan chains at a time) is a widely-adopted general approach for mitigating shift power problems. Although partial shift is effective in avoiding excessive heat by reducing global switching activity, we show for the first time that it may actually worsen shift clock skews, thus increasing the risk of shift timing failures. This paper addresses this problem with an innovative method, namely Clock-Skew-Aware Scan Chain Grouping (CSA-SCG). CSA-SCG properly groups scan chains to be shifted simultaneously so as to reduce the imbalance of switching activity around the clock paths for neighboring scan flip-flops in the scan chains. Experiments on large ITC’99 benchmark circuits demonstrate the effectiveness of CSA-SCG for reducing scan shift clock skews to lower the risk of shift timing failures in partial shift.

Preprint

General Copyright Notice

This article may be used for research, teaching and private study purposes. Any substantial or systematic reproduction, re-distribution, re-selling, loan or sub-licensing, systematic supply or distribution in any form to anyone is expressly forbidden.

This is the author’s “personal copy” of the final, accepted version of the paper published by IEEE.¹

¹ IEEE COPYRIGHT NOTICE

©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Clock-Skew-Aware Scan Chain Grouping for Mitigating Shift Timing Failures in Low-Power Scan Testing


*Kyushu Institute of Technology, Iizuka, 820-8502, Japan
**University of Stuttgart, Stuttgart, 70569, Germany
***Advanced Micro Devices, Inc., Sunnyvale, CA 94088, USA

Abstract—High scan shift power often leads to excessive heat as well as shift timing failures. Partial shift (shifting a subset of scan chains at a time) is a widely-adopted approach for avoiding excessive heat by reducing global switching activity, we show for the first time that it may actually cause excessive IR-drop on some clock buffers and worsen shift clock skews, thus increasing the risk of shift timing failures. This paper addresses this problem with an innovative method, namely Clock-Skew-Aware Scan Chain Grouping (CSA-SCG). CSA-SCG properly groups scan chains to be shifted simultaneously so as to reduce the imbalance of switching activity around the clock paths for neighboring scan flip-flops in scan chains. Experiments on large ITC'99 benchmark circuits demonstrate the effectiveness of CSA-SCG for reducing scan shift clock skews to lower the risk of shift timing failures in partial shift.

Keywords—scan testing, switching activity, IR-drop, clock skew, shift timing failure, partial shift, scan chain grouping

1. Introduction

Scan design is the most widely used design-for-testability (DFT) technique for achieving high quality required with modern LSI circuits [1, 2]. However, during scan testing, which is based on the scan design, excessive switching activity is increasingly causing severe problems through elevated scan test power.

Scan testing has two modes, namely shift and capture, both having their own test power issues [2]. In this paper, we focus on shift power while various techniques for reducing capture power can be found in [3, 4]. Fig. 1 illustrates the three major shift power issues. The accumulative impact of switching activity may cause excessive heat due to excessive average power dissipation [2]. In addition, the instantaneous impact of switching activity may cause severe IR-drop that leads to two problems, namely test data corruption and shift timing failure. On the other hand, excessive IR-drop occurring at a scan flip-flop may flip its state, leading to test data corruption. On the other hand, switching activity around the clock paths for neighboring scan flip-flops (subsequent flip-flops in a scan chain) may change the delays of the corresponding clock buffers in an unbalanced manner, resulting in severe shift clock skews that may lead to shift timing failures due to hold and/or setup time violations [5, 6]. The difference between test data corruption and shift timing failure in this paper is that the former is caused by excessive IR-drop on flip-flops while the latter is caused by excessive IR-drop on clock buffers. Especially, the problem of IR-drop-induced shift timing failures is increasingly getting worse with the scaling down of supply voltages and process feature sizes [7]. Previously we proposed an effective method for mitigating the problem of test data corruption [8]. In this paper, we focus on the problem of shift timing failures.

Partial shift, which shifts only a subset of scan chains at a time, is a widely-adopted approach to reduce shift power. Numerous implementations, such as output gating [9], scan segmentation [10, 11], and scan chain disable in combination with test pattern reordering [12], have been proven effective for reducing average shift power so as to mitigate the excessive heat problem. Partial shift has also been applied for multi-core SoCs to reduce global switching activity [13]. In addition, partial shift has been applied for reducing IR-drop caused by scan shift operations [8, 14, 15].
Fig. 2. Impact of partial shift on shift clock skew (b19).

However, implementations, such as output gating, may introduce extra delay on flip-flops and increase circuit area. Furthermore, if not properly implemented, partial shift may actually increase the risk of shift timing failures despite reduced average shift switching activity.

Fig. 2 shows the result of an experiment for investigating the impact of partial shift on the risk of shift timing failures. The target circuit was the biggest ITC’99 benchmark circuit, b19, with 30 scan chains (only three of them are shown for clarity). Fig. 2(a) corresponds to conventional scan testing in which all the scan chains are shifted in one group (s1/s2/s3). Fig. 2(b) illustrates partial shift, in which the scan chains are divided into two groups (s1/s3, s2) and only one group is shifted at a time. We conducted gate-level simulation for 6,144 shift cycles with random patterns to estimate weighted switching activity (WSA) [4]. The WSA of a cell c is calculated as follows: WSAc = Tc · Cc, where Tc = 1 if a transition occurs at the output of c; otherwise Tc = 0. Ideally, the other parameter, Cc, reflects the output capacitance of c but due to lack of capacitance data, Cc was set to fanout count of c plus one in our experiments. We set the region of eight rows in Y-direction and 200 NAND2X1 cell-widths in X-direction around a clock buffer b as its sensitive area, and any switching activity inside this region affects the supply voltage of b. As shown in Fig. 2, partial shift reduced the average WSA by 40%, thus effectively reducing the risk of excessive heat. However, for two neighboring scan flip-flops (ff1 and ff2,1) in s3, the difference between the WSA values in the areas around the clock buffers of their clock paths was actually increased by 35% in partial shift. This larger WSA difference indicates a larger shift clock skew, thus leading to a higher risk of shift timing failures. This is a serious problem with partial shift although it can effectively reduce scan test heat.

In general, shift timing failures caused by excessive shift switching activity can be mitigated with four major approaches: (I) increasing design margins, (II) manipulating scan shift clocks, (III) manipulating test data, and (IV) applying design for testability (DFT). They are briefly reviewed as follows.

Approach-I includes the strengthening of the power distribution network and Approach-II includes the slowing-down of the scan clock speed [5]. However, both of them significantly increase design and test costs.

Several methods based on test data manipulation (Approach-III) are available for filling don’t-care bits in a test cube to reduce scan cell toggle rates [16–19]. In addition, a recent method can identify potential locations of IR-drop-induced shift timing failures, and by adjusting the test data or directly masking them, the method can avoid false testing results [20]. However, the disadvantages of these methods include (1) the degradation of fault coverage, (2) test vector count inflation, and (3) the limitation of applicability to only shift-in operations.

Several methods based on DFT (Approach-IV) are available for reducing instantaneous impact of switching activity, thus IR-drop, during scan shift operations. In [14], multiple staggered clocks are used to shift scan chains at slightly different times so as to reduce peak switching activity. In [21], scan segmentation in combination with clock gating is used to effectively reduce peak shift power. In [8], optimal scan chain grouping for reducing local IR-drop on flip-flops is used to mitigate the IR-drop-induced test data corruption. In [22], flip-flops are assigned to scan segments and shift clock domains so as to reduce peak and average shift power. In [15] scan segments are regrouped so as to reduce excessive IR-drop on individual clock paths. However, none of these methods can directly and effectively reduce scan clock skews caused by the imbalanced switching activity around the clock paths for neighboring scan flip-flops in the scan chains.

In this paper, we propose Clock-Skew-Aware Scan Chain Grouping (CSA-SCG), a new DFT-based method for effectively reducing the risk of shift timing failures by reducing scan clock skews in partial shift during both shift-in and shift-out. To the best of our knowledge, this is the first work directly addressing excessive scan clock skews in partial shift, which significantly improves the quality of low power scan testing.

The rest of this paper is organized as follows: Section 2 introduces a flexible cost function to estimate clock skews in partial shift and formally defines the CSA-SCG problem. Section 3 describes an efficient CSA-SCG algorithm. Section 4 presents experimental results and Section 5 concludes the paper.

2. Problem Formulation

2.1. Cost Function

Let C be the set of all cells (logic gates, flip-flops, and clock buffers) in a circuit. Let F ⊂ C be the set of all scan flip-flops and B ⊂ C be the set of all clock buffers. Let S be the set of all scan chains in the circuit.
of demand to switch the cell and the weight factor $c$ in the circuit which contains: technology and the power network design.

The impact area of $c \in \mathcal{C}$ is estimated by $d(c, C) = 2$, because among all the cells in $C$, only $c_1$ and $c_4$ affect the supply voltage of the clock buffer $c$. Since the clock path of $f \in S$ only contains one clock buffer $c_4$, the total IR-drop on the clock path of $f$ is $d(B, C) = 2$. Similarly, the total IR-drop on the clock path of $f$ can be estimated by $d(B, C) = 2$.

We estimate the clock skew between two neighboring flip-flops $f_i, f_{i+1} \in S$, $1 \leq i < n$ as follows:

$$\text{skew}(f_i, C(S')) = |d(B, C(S')) - d(B_{i+1}, C(S'))|$$

In Fig. 3, when $s_1$ and $s_2$ shift together, the clock skew between $f \in S$ and $f_{i+1}$ is $\text{skew}(f_{i+1}, C(S')) = 0$.

Let $F(S') \subseteq F$ be the subset of all flip-flops of the scan chains in $S'$. The cost of shifting a subset of scan chains $S' \subseteq S$ is determined by the pair of neighboring flip-flops which has the largest clock skew:

$$\text{skew}(S') = \max\{\text{skew}(f_i, C(S')) | f \in F(S')\}$$

It is important to note that this cost function only depends on the circuit layout and the netlist of a circuit to estimate the clock skew. Scan chain grouping under the assumption that only one group of scan chains are shifted at a time. It is independent of any specific test vectors. Therefore, any scan chain grouping algorithm based on this cost function is also test-vector-independent, making it more practical in a real test design flow.

### 2.2. Problem Statement

The primary goal of the proposed CSA-SCG method is to properly group scan chains to be shifted simultaneously as to reduce the imbalance of switching activity around the clock paths for neighboring scan flip-flops in the scan chains. The grouping problem can be formulated as follows: Given a full scan design with a set of scan chains $S$, the cost function $\text{skew} : \mathcal{P}(S) \rightarrow \mathbb{R}$ and the number of available groups $k$. A grouping $G = \{S_1, S_2, ..., S_k\}$ over $S$ should be found that minimizes:

$$\text{skew}(G) = \max\{\text{skew}(S') | S' \subseteq G\}$$

The definition of a grouping $(\cup_{S' \in S'} = S)$, and $S \cap S_j = \emptyset$ for all $S_i, S_j \in G, i \neq j$ ensures that when all groups of scan chains are clocked exactly once, a complete shift cycle is executed. Therefore, the lower the cost $\text{skew}(G)$, the lower the risk of shift timing failures.

The CSA-SCG problem is similar to the scan chain grouping for reducing local IR-drop at flip-flops [8], in that we need to find a partitioning over $S$ subjected to the constraints on the defined cost function. We have shown in our previous work [8] that scan chain grouping for reducing local IR-drop at flip-flops is NP-complete by reducing the problem of graph $k$-colorability to it. This indicates that CSA-SCG is an intractable problem as well. The algorithm proposed in [8] depends on the fact that adding a scan chain into a group
In CSA-SCG, however, whenever a scan chain is added into a group of scan chains, the resulting cost (maximum IR-drop at flip-flops) can only increase the cost. Thus, we propose a fast and effective heuristic scan chain grouping algorithm.

Algorithm 1 predictably terminates after \(|S| \times \log \min \{G, k\}\) iterations, where \(G\) is the number of groups and \(k\) the number of scan chains. As the last resort, lines 10-17 assign any chain to a group with the least cost. Finally, a swap procedure is applied to achieve better grouping through swapping a scan chain to another group.

The pseudo-code of the algorithm for CSA-SCG is shown in Algorithm 1. First, the grouping \(G\) is initialized to an empty set, which means there are no groups yet. Next, the outer loop iterates over the set of scan chains \(S\). In each iteration, the algorithm makes three attempts to assign the current scan chain to a group. If this reduces the cost, no chain is swapped. Finally, the output scan chain group is assigned to a group at a cost no greater than \(\text{skew}_{\text{thr}}\), the algorithm will choose the group that results in the least cost increase. Finally, a swap procedure is applied to achieve better grouping through swapping a scan chain to another group.

The number of scan chains can easily reach hundreds in modern large designs. However, as test time increases with the number of groups in most of the partial shift implementations, it is necessary to group a large number of scan chains into a limited number of groups. The computation time is predictably high to solve the problem with graph coloring solutions. Thus, we propose a fast and effective heuristic scan chain grouping algorithm.

The proposed algorithm for CSA-SCG (Algorithm 1) assigns each scan chain from the scan chain set \(S\) to one of the \(k\) available groups in a greedy manner. With each assignment of a scan chain into a group, the primary objective is for the cost \(\text{skew}\) to remain below a given cost threshold \(\text{skew}_{\text{thr}}\). The maximum acceptable cost \(\text{skew}_{\text{thr}}\) is decided by designers or DFT engineers. If a scan chain cannot be assigned to any group at a cost no greater than \(\text{skew}_{\text{thr}}\), the algorithm will choose the group that results in the least cost increase. Finally, a swap procedure is applied to achieve better grouping through swapping a scan chain to another group.

An example is shown in Fig. 5. The goal is to group the three scan chains \(|S| = 3\) in the example circuit shown in Fig. 3 into two groups \((k = 2)\) with a cost limit of \(\text{skew}_{\text{thr}} = 0\). In this example, the scan chain handled in the \(i\)th iteration is noted as \(s_i\). For \(s_1\), the grouping \(G\) is empty, a new group is added and \(s_1\) is placed into the group. At this point, swapping does not take effect as there is only one group. For \(s_2\), it is placed into the same group with \(s_1\) as this operation results in the cost \(\text{skew}_2 = 0\). In this step, as there is still only one group, the swap procedure does not come into force. For \(s_3\), a new group is generated, because placing \(s_3\) into the non-empty group has a cost of \(\text{skew}_{\text{temp}} = 1 > \text{skew}_{\text{thr}}\). In this step, as any chain swapping does not further reduce the cost, no chain is swapped. Finally, the output scan chain grouping is \(\{s_1, s_2\}, \{s_3\}\) with the cost \(\text{skew}_3 = 0\). Algorithm 1 predictably terminates after \(|S|\) iterations. In
one iteration, during the grouping procedure, the cost function \( \text{skew} \) is evaluated at most \( k \) times, during the swapping procedure, \( \text{skew} \) is evaluated at most \( (k-1) \cdot |S| \) times, making the time complexity as \( O(k \cdot |S| + k \cdot |S|^2) \). Given the fact that \( k \) is usually a small number (less than ten) \( k \) in practice, the CSA-SCG algorithm is sufficient to process large circuits which contain hundreds of scan chains.

4. Experimental Results

We conducted experiments on the six largest ITC’99 benchmark circuits to evaluate the algorithm for CSA-SCG (Algorithm 1) for its effectiveness in reducing shift clock skews. The algorithm was implemented in Java and executed on a workstation with 3.47GHz Intel(R) Xeon(R) X5690 CPU. The used memory never exceeded 4GB in the experiments. For ease of comparison, the execution time measurements were performed without thread-level parallelism.

The benchmark circuits were synthesized, placed and routed with the SAED90nm EDK Digital Standard Cell Library [23] in a standard commercial tool flow and under typical operating conditions. We used a standard full-scan test infrastructure with various numbers of scan chains for each benchmark circuit depending on its size. After scan chain insertion, a separate clock tree was generated for each scan chain so that each chain can be shifted independently from the other chains.

Table I shows the basic statistics of the synthesized circuits. Column \( |C| \) shows the number of cells in a circuit. Column \( |F| \) shows the number of flip-flops in a circuit. Column \( \text{area} \) shows the chip area after place and route. Columns \( |S| \) and \( \text{maxlen} \) show the number of inserted scan chains and the maximum chain length, respectively. We synthesized each circuit into multiple versions with different numbers of scan chains, i.e., 10, 30 and 50 chains for \( b18, b19 \); 10 and 30 chains for \( b17 \); 10 chains for \( b20-22 \). These different versions had the same number of cells and occupied almost the same amount of circuit area.

For the ease of verification, we chose rather simple parameters for the cost function. We set \( w(c) = 1.0 \) for all cells \( c \in C \) in a circuit. The proximity factor \( p(b, c) \) was set to 1.0 whenever the cell \( c \) was located within 8 rows in the y-direction and 200 NAND2X1 cell-widths in the x-direction of the clock buffer \( b \). If a cell was outside this rectangular area, the influence factor was set to 0.0. The size of the aggressor region was chosen rather large to demonstrate our algorithm with a lot of interactions between the scan chains. In practice, these weights can be readily selected by designers or test engineers based on a specific IR-drop model.

Table II compares the results of clock-skew-aware scan chain grouping (represented by ‘csa’) with random scan chain grouping (represented by ‘rnd’) used in most partial-shift-based low-power testing schemes. This table shows the reduction effects on the cost defined in Subsection 2.1 for estimating the clock skews between neighboring scan flip-flops in the scan chains for different benchmark circuits. The first two columns show the circuit name and the number of scan chains. Column \( k = 1 \) shows the estimated clock skew when all scan chains are shifted simultaneously, i.e., the value of \( \text{skew}(S) \). The results for various numbers of scan chain groups are shown in the remaining columns. In the experiments, \( \text{skew}_{\text{thr}} \) for \( b17-b19 \) was set to \( \text{skew}(S) \). For \( b20-22 \), the \( \text{skew}_{\text{thr}} \) was set to 70% of \( \text{skew}(S) \), as using \( \text{skew}(S) \) did not constrain the grouping algorithm enough and resulted in all the scan chains being grouped into the same group. For each group count \( k \), sub-column \( G_{\text{rnd}} \) shows the cost of random grouping, \( \text{skew}(G_{\text{rnd}}) \) (average over 128 random groupings). As our goal is to reduce clock skew in partial shift, we use a reduction ratio relative to the cost of random grouping, \( \text{skew}(G_{\text{csa}}) \), for the columns \( \Delta G_{\text{csa}} \).

\[
\Delta G_{\text{csa}} = 100\% \cdot \frac{\text{skew}(G_{\text{csa}}) - \text{skew}(G_{\text{rnd}})}{\text{skew}(G_{\text{rnd}})}
\]

The reduction ratio is negative when the cost is reduced and positive when the cost is increased compared to random grouping. It can be seen from Table II that, in most cases, the average cost of random grouping was much higher than \( \text{skew}(S) \). This confirms that partial shift can actually increase shift clock skews, resulting in higher risk of shift timing failures, if scan chains are not properly grouped. It can also be observed from Table II that, in all cases, the proposed method achieved better results than random grouping, especially for larger circuits. Table II also shows that the proposed method provided a continuous improvement with the number of groups. Furthermore, even with 2 available groups, the reduction ratios \( \Delta G_{\text{csa}} \) exceeded 30% in most cases. This is an important finding as test time increases with the number of groups in various DFT implementations. Finally, sub-columns \( t_{\text{csa}} \) show the execution time of the proposed algorithm. Table II shows that even for the biggest circuit \( b19 \), the execution of the proposed algorithm was completed within several minutes.

In summary, the experimental results clearly demonstrated the following important points:

- Although partial shift can effectively reduce average shift power, it may actually increase shift clock skews (thus resulting in higher risk of shift timing failures) if

| circuit | \( |C| \) | \( |F| \) | area | \( |S| \) | maxlen |
|---|---|---|---|---|---|
| b17 | 31k | 1317 | 0.37 mm\(^2\) | 10 | 132 |
| b18 | 103k | 3020 | 1.22 mm\(^2\) | 10 | 302 |
|    |      |      |      | 30 | 101 |
|    |      |      |      | 50 | 61 |
| b19 | 185k | 6042 | 2.21 mm\(^2\) | 10 | 605 |
|    |      |      |      | 30 | 202 |
|    |      |      |      | 50 | 121 |
| b20 | 34k | 430 | 0.41 mm\(^2\) | 10 | 43 |
| b21 | 34k | 430 | 0.38 mm\(^2\) | 10 | 43 |
| b22 | 53k | 613 | 0.60 mm\(^2\) | 10 | 62 |
scan chains are not properly grouped.

- The proposed algorithm for clock-skew-aware scan chain grouping can effectively reduce the maximum clock skew between neighboring scan flip-flops in the scan chains.

The proposed algorithm guides scan chaining by a static or structural cost metric, which is calculated from circuit layout and scan chain grouping regardless of test vectors. As a result, the proposed algorithm is fundamentally fast, making it highly scalable for large industrial circuits.

5. Conclusions

Partial shift is a widely-used technique to mitigate shift power problems; however, it may worsen shift clock skews, thus increasing the risk of shift timing failures, if scan chains are not properly grouped. We have proposed a flexible cost function to estimate shift clock skews based on only structural information. We have further proposed an effective method, namely Clock-Skew-Aware Scan Chain Grouping (CSA-SCG), for properly grouping scan chains so as to reduce shift clock skews. Experimental results have demonstrated that the proposed method reduced the estimated shift clock skew by up to 60% and in most cases over 30%. The proposed method is also fast and highly scalable.

Future work includes: (1) conducting circuit-level full-time simulations to further evaluate the capability of the proposed method, and (2) implementing the proposed method into an existing test design flow.

Acknowledgments

This work was supported in part by JSPS Grant-in-Aid for Scientific Research (B) #17H01176 and Grant-in-Aid for Young Scientists #18K18026.

References