Faster Constant-Time Decoder for MDPC Codes and Applications to BIKE KEM

(1)

Codes and Applications to BIKE KEM

Thales B. Paiva and Routo Terada

University of Sao Paulo, Sao Paulo, Brazil,{tpaiva,rt}@ime.usp.br

Abstract. BIKE is a code-based key encapsulation mechanism (KEM) that was recently selected as an alternate candidate by the NIST’s standardization process on post-quantum cryptography. This KEM is based on the Niederreiter scheme instantiated with QC-MDPC codes, and it uses the BGF decoder for key decapsulation.

We discovered important limitations of BGF that we describe in detail, and then we propose a new decoding algorithm for QC-MDPC codes called PickyFix. Our decoder uses two auxiliary iterations that are significantly different from previous approaches and we show how they can be implemented efficiently. We analyze our decoder with respect to both its error correction capacity and its performance in practice. When compared to BGF, our constant-time implementation of PickyFix achieves speedups of 1.18, 1.29, and 1.47 for the security levels 128, 192 and 256, respectively.

Keywords:Post-quantum cryptography · BIKE · MDPC · LDPC · constant-time decoding

1 Introduction

BIKE [ABB⁺21] is a code-based key encapsulation mechanism (KEM) selected as an alternate candidate for the NIST post quantum standardization process. The scheme consists of a variant of the Niederreiter [Nie86] scheme using quasi-cyclic moderate-density parity-check (QC-MDPC) codes instead of Goppa codes. As such, BIKE can be seen as a refinement of Misoczki’s et al. QC-MDPC McEliece [MTSB13].

The use of QC-MDCP [MTSB13] codes yields two advantages. The first one is that the public key is much smaller, since one needs only one row to represent a quasi-cyclic matrix in systematic form. The second is that matrix multiplication, and thus encoding, is much faster for quasi-cyclic matrices. However, QC-MDPC codes comes with an important disadvantage: their decoding algorithms have a non-zero probability of failure. This fact was exploited in the famous GJS [GJS16] key-recovery reaction attack, that provided the ground for side-channel attacks against QC-MDPC [RHHM17] and further attacks against other code-based encryption schemes [SSPB19,FHS⁺17].

To deal with this problem, BIKE’s original proposal [ABB⁺17] used ephemeral keys.

However, recent approaches on obtaining negligible decryption failure rate (DFR) [Til18, SV20a,Vas21], together with Hofheinz et al. [HHK17] CCA security conversions that accounts for decryption errors, motivated BIKE proponents to consider key-reuse. In particular, Sendrier and Vasseur [SV20a,Vas21] propose a framework that, under reasonable assumptions, allows them to find parameters where the DFR should be negligible using experiments and statistical analysis. This framework was used in BIKE’s last revision [ABB⁺21], which uses the state-of-the-art BGF decoder [DGK20c,DGK19] with parameters that supposedly achieve negligible DFR.

While trying to improve BGF’s performance, we noticed two limitations. The first one is that its performance cannot be improved by considering a lower number of iterations,

(2)

otherwise it breaks the main hypothesis for using Vasseur’s extrapolation framework [Vas21].

The second is that some of its iterations can be made more efficient by merging them into one iteration. After analyzing BGF’s strengths and weaknesses, we were able to derive a new and more efficient decoder.

Contribution. We propose a new decoding algorithm for QC-MDPC codes, called Picky- Fix. This decoder uses two auxiliary iterations that are significantly different from previous approaches: the FixFlip iteration, which flips a fixed number of bits, and the PickyFlip iteration, which uses different thresholds to flip ones and zeros. These iterations allow PickyFix to work with a lower number of iterations than BGF, which, together with our constant-time implementation, makes PickyFix achieve speedups of 1.18, 1.29, and 1.47 for the security levels 128, 192 and 256, respectively. The code and data are publicly available at https://github.com/thalespaiva/pickyfix.

Organization. We begin by quickly reminding some basic concepts from coding theory and QC-MDPC decoding in Section2. BIKE and its security parameters are presented in Section 3. Then, in Section 4, we analyze BGF in detail to show its strengths and weaknesses. Our proposed decoder PickyFix is introduced in Section5, and then we analyze its parameters and decoding performance in Section6. In Section7, we discuss how to implement PickyFix efficiently and in constant-time and then compare its performance with BGF. Finally, we conclude and discuss interesting future work in Section8.

2 Background

Abinary[n, k]-linear code is ak-dimensional linear subspace ofFⁿ₂, whereF2denotes the binary field. If Cis a binary [n, k]-linear code spanned by the rows of a matrix GofF^k×n₂ , we say thatGis a generator matrix ofC. Similarly, ifC is the kernel of a matrixHof F^r×n₂ , we say thatHis a parity-check matrix ofC. The Hammingweight of a vectorv, denoted by|v|, is the number of its non-zero entries. Thesyndromezof a vectorewith respect to a parity check matrix His the vectorz=eH^⊤. If the vectoreis sufficiently sparse and the linear code defined byH is sufficiently good, it may be possible to recover efrom the syndrome zby using efficient decoding algorithms. The support of a binary vectorv, denoted as supp (v), is the set supp (v) ={i:v_i= 1}.

A moderate-density parity-check (MDPC) code [MTSB13] is a linear code that admits a moderately sparse parity-check matrixH∈F^r×n₂ . The weight of each column ofHis set to be all equal to a fixed value d, and require that d=O(√

n). For applications in cryptography, it is particularly useful to consider quasi-cyclic MDPC (QC-MDPC) codes, because they allow for smaller keys and more efficient operations. BIKE [ABB⁺21] is defined over QC-MDPC codes with two circulant blocks, which are MDPC codes that admit a sparse parity check matrix of the form H= [H₀|H₁], where each r×rbinary matrixH₀ andH₁is circulant.

MDPC codes admit very efficient decoders, which are called bit-flipping decoders [Gal62].

All variants of bit-flipping decoders work based on the following observations. Letebe a sparse vector whose syndrome with respect to the sparse matrixHisz=eH^⊤. Suppose we do not knowebut want to recover it fromzusing our knowledge fromH. We know that z=P

i∈supp(e)H^⊤_i , whereH^⊤_i denotes the transpose of thei-th column ofH.

Now, since e and each column H^⊤_i are sparse, we can estimate the likelihood that ei= 1 by checking how closelyzmatches with columniofH: the more they are similar, the higher is the probability thatei= 1. The similarity measure for each columniis what is known as the unsatisfied parity-check (UPC) counter, denoted asupc_i, and it is equal to the size of the intersection of supp (z) and supp H^⊤_i

. The name UPC comes from the

(3)

Algorithm 1General bit-flipping decoding algorithm.

1: procedureGeneralBitFlipping(z,H)

2: Start with the partial error vector ˆe←0∈Fⁿ2

3: Initialize the number of iterationsit←0 4: Let the partial syndromes←z+ ˆeH=z

5: while s̸=0andit<maximum number of iterationsdo

6: Compute the UPC countersupc_iwith respect tosandH, fori= 1 ton 7: For eachi= 1 ton, flip bit ˆeiifupc_i is above a certain threshold 8: Update the partial syndromes←z+ ˆeH

9: it←it+ 1

10: if s=0 then 11: returnˆe 12: else

13: return⊥, indicating that the maximum number of iterations was reached

fact that the set supp (z) is sometimes called the set of unsatisfied equations, and therefore upc_i counts the number of unsatisfied equations that are caught byH^⊤_i .

Algorithm1shows the steps that a general bit-flipping algorithm performs when trying to obtaine fromzandH. The algorithm stops when it finds a vector ˆewith the same syndrome ase, or if the number of iterations exceeds some limit. Notice that the partial syndrome defines the objective syndrome in each iteration, and in the ideal case, vector ˆe gets closer and closer toeafter each iteration. Although most bit-flipping algorithms used in cryptography [Gal62,MTSB13,DGK19,SV19] can be framed in the general description above, they can vary significantly with respect to how the threshold for flipping bits is selected in each iteration.

3 BIKE

The purpose of a key encapsulation mechanism is to use public-key encryption algorithms to securely exchange a key between two parties. These parties can then use secret-key algorithms, which are much more efficient, to exchange large messages.

For a clearer presentation, we describe BIKE algorithms without the implicit-rejection Fujisaki-Okamoto transformation [HHK17], usually denoted byFO^̸⊥. However, notice that when discussing the experimental performance of our algorithm in Section7.3, we consider the full decapsulation with theFO^̸⊥ transformation applied.

3.1 Parameters and Algorithms

Setup. On input 1^λ, whereλis the security level, the setup algorithm returns parameters r, wandttaken from the parameter Table1. Parametersrandwwill define the family of QC-MDPC codes to be used whiletcontrols the weight of the error used for encryption, as will be detailed in the following sections. The table also provides the estimated decryption failure rates (DFR) for each parameters set according to Vasseur’s framework [Vas21].

Table 1: BIKE parameters for each security level.

Parameter set Security

levelλ r w d=w/2 t Decoder DFR estimate

BIKE Level 1 128 12,323 142 71 134 BGF 2⁻¹²⁸

BIKE Level 3 192 24,659 206 103 199 BGF 2⁻¹⁹²

BIKE Level 5 256 40,973 274 137 264 BGF 2⁻²⁵⁶

(4)

Key Generation. Leth₀ andh₁be two vectors of rbits of odd weightd=w/2. Build the circulant matricesH₀ andH₁ by takingh₀andh₁ as their first rows, correspondingly.

IfH₁ is not invertible, restart the process by selecting anotherh₁. Then the secret key is the sparse matrix H= [H₀|H₁]∈F^r×₂ ²^r and the public key is the dense circulant matrix H_Pub=H₁H⁻₀¹.

Notice that matricesHand [I|H_Pub] are both parity checks of the same quasi-cyclic linear code. However, the sparsity of the first one allows for efficient syndrome decoding using bit-flipping algorithms.

Encapsulation. Select two random binary vectorse₀ ande₁ such that |e₀|+|e₁|=t.

Then the key to be shared isk_Shared=H([e₀|e₁]), for some cryptographic hash function H. To encapsulate the keyk_Shared, compute the ciphertextc=e₀+e₁H^⊤_Pub∈F^r₂. Notice that ciphertextcthen corresponds to the syndrome of the low weight vector [e₀|e₁] with respect to the public parity-check matrix [I|H_Pub].

Decapsulation. Given the ciphertext c, the receiver, who knows the sparse parity-check matrixH, first compute the secret syndromez=cH^⊤₀. Notice that

z=cH^⊤₀ = e₀+e₁H^⊤_Pub

H^⊤₀ =e₀H^⊤₀ +e₁H^⊤_PubH^⊤₀ =e₀H^⊤₀ +e₁H^⊤₁.

Therefore, as mentioned by the end of Section2, the receiver can use some QC-MDPC bit-flipping decoding algorithm, together with their knowledge of the secret matrixHto recover the sparse vector [e₀|e₁] and compute the shared keyk_Shared=H([e₀|e₁]).

In the last revision of BIKE [ABB⁺21], the authors recommend the BGF decoding algorithm [DGK20c], which is the state-of-the art QC-MDPC decoder. Before introducing BGF, let us first discuss the security of BIKE and, in particular, why good decoders are very important to ensure BIKE’s security.

3.2 Security and Negligible Decryption Failure Rate

The security of the scheme is based on three hypotheses. The first two are standard conjectures for quasi-cyclic codes, namely the hardness of the syndrome decoding problem and the hardness of finding codewords of a fixed low weight. This ensures that one can neither recover the secret sparse matrices H₀ andH₁ fromH, nor the secret message [e₀|e₁] from ciphertextc. The third hypothesis is that the decryption failure rate (DFR) is negligible with respect to the security parameter. Although we cannot prove the third hypothesis, Vasseur [Vas21] proposed a framework that, under weaker hypothesis, allows one to get confident that some decoders achieve negligible DFR for selected parameter sets.

It is shown [TS16,Sen11] that parameters t and w are the most important when determining the security level, since they control the weight of the sparse vectors. Intuitively, ifwortare too small, it is easy to findh₀or the partial encryption errore₀by enumerating low weight vectors. But they may not be so large, with respect tor, otherwise the probability of failing to decrypt a ciphertext would be too high. Therefore, to define parameters (t, w, r), one typically fixes (t, w) sufficiently large to achieve high security levels, and then definer such that the decryption failure rate is low enough for the desired application.

In 2016, Guo et al. [GJS16] showed that decryption failures could lead to a full key recovery attack against schemes based on QC-MDPC codes. To deal with the potential vulnerability faced by schemes within which decryption failures occur, Hofheinz et al. [HHK17] refined the Fujisaki-Okamoto [FO99] transformation showing that a scheme whose decryption failure rate is below 2^−λ can be transformed into a CCA secure one.

Unlike for algebraic codes, such as Goppa or Reed-Solomon codes, whose decoders are guaranteed to decode all errors in vectors up to a given weight, we cannot yet give strong

(5)

9500 10000 10500 11000 11500 12000 Parameterr

−200

−150

−100

−50 0

log2(DFR) DFR = 2⁻¹²⁸

rext = 11159 (rA, pA)

(rB, pB)

DFR from simulations Real DFR

Extrapolated DFR

Figure 1: Illustration of Vasseur’s [Vas21] DFR extrapolation framework considering 128 bits of security and a hypothetical decoder.

mathematical guarantees on the error correction capability of decoders for QC-MDPC codes. Recently, Sendrier and Vasseur [SV20a,Vas21] proposed a method that, under reasonable hypotheses, allows one to use simulations and simple statistical analysis to find parameters (r, t, w) such that a QC-MDPC decoder fails with negligible probability with respect to some security parametersλ.

Lettandwbe fixed positive integers and let us consider a hypothetical QC-MDPC decoder D. Let DFRD(r) denote the decryption failure rate of D when decrypting a ciphertext generated at random with respect to a random QC-MDPC key with parameters (r, t, w). The main observation by Sendrier and Vasseur [SV20a] is that the curve log₂(DFRD(r)) is typically concave for practical QC-MDPC decoders and for all values of rsuch that DFRD(r) is high enough so that failures can be observed in simulations.

Vasseur’s [Vas21] model then makes the following assumption: for a given decoder Dand security levelλ, the curve log₂(DFR_D(r)) is concave in the region where DFR_D(r)≥2^−λ. This assumption is somewhat consistent with Tillich’s [Til18] asymptotic theoretical model for MDPC codes, which shows that the dominating term in log₂(DFR_D(r)) decreases linearly withr.

Figure1illustrates how Vasseur’s [Vas21] model can be used to estimate the block parameterrthat allows for negligible failure rate with respect to the security parameter λ= 128. First, one performs DFR simulations for increasing values of runtil it cannot see any decoding failure. Then, they take the last two points (r_A, p_A) and (r_B, p_B) in the log₂DFR plot such that a number of failures were observed and compute the line passing through them. According to the extrapolation hypothesis, the decoder fails with negligible probability forr=r_ext, the point where the line intercepts DFR = 2^−λ. Finally, choose parameterrto be the least primer≥r_ext such that 2 is primitive modulor. This avoids both squaring attacks [LJS⁺16] and other potential attacks based on the factorization of the cyclic polynomial ring¹ F2[X]/(X^r−1).

Since there is always some error in the DFR estimates, Vasseur [Vas21] uses confidence intervals for the observed DFR and compute a conservative extrapolation forras follows.

LetpAandpB be the DFRs forrAandrB, respectively, whererA< rB. Considerp⁻_A and p⁺_Bto be the lower and upper limit forpAandpB according to Binomial confidence intervals forpAand pB. Then a conservative extrapolation forr_ext is obtained by considering the

1The cyclic polynomial ringF2[X]/(X^r−1) is isomorphic to the ring of circulant matrices used in BIKE.

(6)

line passing through (r_A, p⁻_A) and (r_B, p⁺_B). Vasseur [Vas21] uses the Clopper-Pearson confidence interval together with posterior probabilities to obtain a narrower interval, with confidence level α= 0.01. In this work we use the same αwith the Clopper-Pearson interval, but we do not use the posterior probabilities. Even though this tends to give slightly more conservative estimates, it is easier to compute.

3.3 BGF: State-of-the-art QC-MDPC Decoder

BGF [DGK20c], which stands for Black-Gray-Flip, is one of the most efficient known decoders for QC-MDPC codes. This decoder is an improvement of the Black-Gray decoder first proposed by Sendrier and Misoczki in a previous version² of CAKE [BGG⁺17], a predecessor of BIKE.

As a decoding algorithm, BGF’s goal is to, given a syndrome ciphertextc=e₀+e₁H^⊤_Pub, recover the sparse error vectore= [e₀|e₁] using the secret sparse matrixH. The algorithm first computes the secret syndrome z =cH^⊤₀, then starts withe ← 0and performs a sequence ofN_Iteriterations, each of which updates its knowledge oneuntil eitherz=eH^⊤ or the number of iterations exceeds a certain limit N_Iter and a decoding failure occurs.

Before introducing BGF, let us first define its auxiliary procedures.

BGF Auxiliary Algorithms. BGF uses two bit-flipping auxiliary procedures:BitFlipIter andBitFlipMaskedIter, which are formally described in Algorithm2. These procedures are very similar to other iterative decoders, such as the original Gallager’s bit-flipping algorithm [Gal62].

Both algorithms flip bits of the partial error vectorewhen their corresponding UPC counters are above some threshold,τ₀forBitFlipIterandτ₁ forBitFlipMaskedIter. However they differ in some important points. First,BitFlipIternot only flips the bits, but it also marks the bits in either black or gray, using bit-masksBlackMaskandGrayMask.

Black bits are the ones that are flipped with a somewhat high confidence (upc_j ≥τ₀), while gray bits are the ones that were almost selected for flipping (τ₀>upc_j≥τ₀−δ), but did not make it because of a minor differenceδ. On the other hand,BitFlipMaskedIter is a simple bit flip iteration based on the UPC value, but it only flips bits that are marked 1 in a given maskMask.

The BGF algorithm. BGF is defined as Algorithm 3. Intuitively, the first call to BitFlipIter flips the bits for which it has a high confidence that they are wrong, by using a selective threshold function Thresh. Then it comes the two regret steps: first the black and then the gray. In the black regret, all the 1 bits added in the previous step that have an UPC strictly greater³ than (d+ 1)/2 will be flipped back to 0. The gray regret step is analogous, but now over the bits marked inGrayMask, which are called gray bits. These consist of 0 bits that were not flipped in the first step because their UPC were smaller than, but somewhat close to, the selected threshold.

After the first and most costly iteration ensured a good start, hopefully with only a small number of errors left to be corrected, BGF continues withN_Iter−1 iterations of BitFlipIter that will try to correct the remaining errors. Notice that the masks are not needed after this point, and thus, are ignored.

BGF Parameters. Table2shows the parametersδ,N_Iterand threshold functionThresh proposed for the different security levels together with their performance under our platform⁴. We considered the constant-time implementation provided in BIKE Additional

2Unfortunately, there appears to be no reference to the version in which the Black-Gray decoder appeared.

3Notice that this is done by choosingτ₁= (d+ 1)/2 + 1, since the flipping condition isupc_j≥τ₁.

4Intel^®Xeon^TMGold 5118 CPU at 2.30GHz.

(7)

Algorithm 2Auxiliary iterations used by BGF.

1: procedureBitFlipIter(H,z,e,s, τ₀) 2: BlackMask←0∈F²2^r

3: GrayMask←0∈F^2r₂ 4: forj= 1 to 2rdo

5: upc_j←

supp (s)∩supp H^⊤_j

6: if upc_j≥τ₀ then

7: ej←ej ▷Flips coordinatejofe

8: BlackMaskj←1 9: else if upc_j≥τ₀−δ then 10: GrayMask_j←1

11: s←z+eH^⊤ ▷Recomputes the partial syndrome

12: return e,s,BlackMask,GrayMask

13: procedureBitFlipMaskedIter(H,z,e,s,Mask, τ₁) 14: forj= 1 to 2rdo

15: upc_j←

supp (s)∩supp H^⊤j

16: if Maskj= 1 andupc_j≥τ₁ then

19: return e,s

Implementation [DGK20a] with minor changes to account for the updated threshold function in BIKE’s last revision [ABB⁺21].

Notice howδandN_Iter are the same in all security levels. The threshold function is an increasing linear function on the syndrome weight truncated above the minimum value (d+ 1)/2. Since the threshold function is used to determine when to flip a bit, this means that when the weight of the syndromesis large, fewer bits will be flipped.

Table 2: BGF parameters and their corresponding performance when considering the portable and AVX512 implementations.

Security

levelλ δ N_Iter Thresh(s) Cycles

Portable

Cycles AVX512 128 3 5 max (36,⌊0.00697220|s|+ 13.5300⌋) 10,955,732 1,323,322 192 3 5 max (52,⌊0.00526500|s|+ 15.2588⌋) 32,982,825 4,130,087 256 3 5 max (69,⌊0.00402312|s|+ 17.8785⌋) 94,902,236 11,497,288

4 Critical Analysis of BGF

In this section, we dive a little deeper into the BGF decoding algorithm. This allows us to better understand why BGF is effective, but, more importantly, it will show some of BGF’s weaknesses and lay the ground over which a better decoder can be designed.

It is well-known to be difficult to provide a theoretical analysis for QC-MDPC iterative decoders, because of the inherent dependency caused by the circulant matrices involved.

Therefore, our analysis is based on observations of BGF’s behavior in practice.

4.1 BGF’s First Iteration: The Black-Gray Step

Let us first discuss BGF’s first iteration and its importance for the extrapolation framework.

As described in the previous section, in the first iteration, BGF performs a sequence of 3

(8)

Algorithm 3The BGF decoding algorithm.

1: procedureBGF(H= [H₀|H₁],z=cH^⊤₀)

2: e←0∈F²2^r ▷Initializes the partial error vector

3: s←z ▷Initializes the partial syndrome

4: fori= 1 toN_Iterdo

5: ▷Every timeeandsare updated, it holds thats=z+eH^⊤ 6: if i= 1then

7: e,s,BlackMask,GrayMask←BitFlipIter(H,z,e,s, τ₀=Thresh(s)) 8: e,s←BitFlipMaskedIter(H,z,e,s,BlackMask, τ₁= (d+ 1)/2 + 1) 9: e,s←BitFlipMaskedIter(H,z,e,s,GrayMask, τ₁= (d+ 1)/2 + 1)

10: else

11: e,s←BitFlipIter(H,z,e,s, τ₀=Thresh(s)) ▷Ignores the black-gray masks 12: if eH^⊤=z then ▷This condition is equivalent tos=0 13: return e

14: else

15: return⊥ ▷Decoding failure

bit-flipping calls: oneBitFlipIterfollowed by twoBitFlipMaskedIter.

We know thatBitFlipIter flips all bits whose UPC counters are above a certain threshold. This makes it very sensible to the threshold selected, as illustrated in Figure2.

Consider the difference if, by chance, the thresholdτ₀= 76 was selected, then the number of errors made after callingBitFlipIter, that is, correct bits that would be incorrectly flipped, would be twice the number ifτ₀= 77 were selected.

This problem is particularly important under the extrapolation framework, where the algorithm needs not only to perform well, but also to improve its performance at a very fast rate for small, but increasing, values ofr. Therefore, BGF uses a very conservative threshold in the first iterationBitFlipIter. Additionally, the black and gray regretting phases, corresponding to the two calls ofBitFlipMaskedIter, also work by flipping a controlled number of bits: only those bits in the black or gray masks whose UPC is above (d+ 1)/2. This makes the whole first iteration very conservative.

40 50 60 70 80 90

UPC counter 0

10 20 30 40 50

Numberofoccurrences

UPC = 76 UPC = 77

Correct bit Incorrect bit

Figure 2: Histogram of the UPC counters for each of the 2rbits in the partial error vector e=0, in the beginning of the first iteration, separated by the cases when the bit is right or wrong. The values correspond to a real observation corresponding to the BIKE Level 5 security parameters.

(9)

9600 9800 10000 10200 10400 10600 Parameterr

2⁻¹³ 2⁻¹⁰ 2⁻⁷ 2⁻⁴ 2⁻¹

DecryptionFailureRate

Iterations 2 3 4 5

Figure 3: The impact of the number of BGF iterations on the DFR, considering parameter set BIKE Level 1 (t= 134,w= 142).

Even though a conservative first iteration is important to ensure a fast DFR decay when rincreases, it may result in useless iterations when ris close to the value when negligible DFR is reached. In particular, for the case presented in Figure 2, where r = 40,973, Threshreturnedτ₀= 86. This would result in no error being made afterBitFlipIter, but at the cost of flipping only a small number of bits, compared to the case whereτ₀= 80, for example.

This suggests that removing the black regret step may be a good starting point for optimization. For example, we could merge both black and gray regret steps into one iteration in such a way that the black regret is critical for smallr, but whenrgets larger, the gray regret steps gets more important than the black one. This is the key idea behind our PickyFlip iteration that we introduce in Section5.

4.2 The Number of Iterations and the Threshold Function

One straightforward method to improve the decapsulation performance would be to decrease the number of BGF iterations, at the cost of increasing the key sizes. Intuitively, one may think that there is a direct trade-off between the number of iterations and the block length parameter r: the DFR may not decay as fast when using a lower number of iterations, but one might be lucky to obtain a reasonable value ofrafter the extrapolation.

However, as discussed in the previous section, since the thresholds are so conservative, if the number of iterations is too small, the decoder may not be able to fully correct the errors even for large values ofr.

Figure3shows how the number of iterations affects the decay of the DFR as a function ofr. Notice how 2 iterations are not enough to allow for a complete decoding of errors of weightt= 134. Furthermore, the curve for 3 iterations does not appear to be concave, therefore it is not safe to use the extrapolation framework for this value. This odd behavior of the DFR curves for 2 and 3 iterations is caused by the following problem. On the one hand, increasingrshould make it easier to correct more errors, since there is more redundancy, but, for large values ofr, the thresholdτ₀used in the first iteration is so high that only very few errors are corrected in the first iteration.

Let us analyze the thresholdsτ₀in more detail. The average values of the thresholdsτ₀ used in each of BGF’s iteration are shown in Figure4, considering 10,000 decapsulations under BIKE Level 1 parameter set. We make three observations. First notice how the first

(10)

9000 9500 10000 10500 11000 11500 12000 12500 13000 Parameterr

37.5 40.0 42.5 45.0 47.5

Thresholdτ0

Iteration 1 2 3 4 5

Figure 4: Average values of thresholdτ₀in each of the 5 iterations of BGF, considering parameter set BIKE Level 1 (t = 134,w= 142).

threshold increases asrincreases. This is a consequence of the linear dependency of τ₀ on the syndrome weight|s|, which turns out to increase withr. The second observation is that the thresholds used in iterations 2 to 5 appear to converge to the floor (d+ 1)/2 = 36. This happens because the first iteration, in general, is able to flip a sufficiently large number of errors, and leave only fine adjustments for the next iterations to deal with. The third is that τ₀, for the second iteration, starts increasing afterr= 11,500. This is caused by the threshold in the first iteration being too high, which leaves a lot of errors to be corrected by the second iteration.

4.3 Impact of the Threshold on the Concavity Assumption

Back to the DFR curves, the non-concave behavior of the curves for 2 and 3 iterations raises a potentially deep problem with the BGF threshold: why should we expect the curves for 4 and 5 iterations to be concave as well? It is possible that we just cannot see an inflection point because it is located at a DFR smaller than what we can simulate.

To evaluate the concavity of the DFR curves for 5 iterations, we propose the following experiment. Consider BIKE Level 1 parameter set. Since we cannot see the inflection points for t = 134, we can exaggerate the error weight t so that we can see the DFR curve in the interval of interest. Ideally, it should be concave at least within all values of r <12,323, since this is the extrapolated value ofrfor BIKE Level 1.

As we can see in Figure5, this is not what happens fort= 151, 153 and 155, for BGF with 5 iterations. Therefore, considering our results regarding the non-concavity of BGF with 2 and 3 iterations, together with the non-concavity of BGF with 2 to 5 iterations whent= 151, we believe that it is not conservative to assume that the DFR curve for BGF is concave. We also tested BGF for levels 3 and 5, observing an analogous behavior fort= 220 andt= 300, respectively.

The main cause for this behavior appears to be the threshold function that depends on

|s|. We conclude that it is not safe to use it for the first, and most important, iteration, but Figure4 suggests that it might be used in further iterations, since it converges to (d+ 1)/2. Initially, we though that the threshold problem would be fixed by defining a maximum value forτ₀. In our exploratory tests, this indeed make concave DFR curves for exaggerated values of t, but the error correction was negatively affected. Therefore, we leave the problem of finding better thresholds for future work.

(11)

10000 10500 11000 11500 12000 12500 Parameterr

2⁻¹⁴ 2⁻¹¹ 2⁻⁸ 2⁻⁵ 2⁻² 2¹

DFR Parametert 155 153 151

Figure 5: The DFR plot for different values of tconsidering BIKE Level 1 parameter set.

Our approach to deal with the first iteration is simple: we do not use a simple threshold to flip bits. Instead of starting with a BitFlip iteration, we propose to start with FixFlip, a new type of iteration that works by flipping a predetermined number of bits that have the largest corresponding UPC.

5 PickyFix

In this section, we describe a new BIKE decoder called PickyFix. Similar to other iterative decoders for LDPC codes, PickyFix works by performing a sequence of iterations that progressively increases the knowledge of the secret sparse error used for encrypting.

However, it differs significantly in how it chooses which bits to flip in its iterations. We begin by defining two new types of auxiliary procedures: the FixFlip and PickyFlip iterations, that are the building blocks of our decoder.

5.1 The FixFlip Auxiliary Iteration

While the majority of previous bit-flip approaches are based on flipping all bits whose UPC counters are above a certain threshold, FixFlip flips a predetermined number of bits, denoted by n_Flips, that have the highest UPC counters. The formal description of a full iteration of FixFlip is described as Algorithm4.

Almost every step of the algorithm is standard for other bit-flipping algorithms. How- ever, despite its simplicity, one has to be careful with line3when implementing the FixFlip iteration, In Section7we discuss this issue and show how this can be done efficiently in linear time on rby using important observations on QC-MDPC parameters.

Algorithm 4The FixFlip iteration.

1: procedureFixFlipIter(H,z,e,s, n_Flips)

2: upc←

forj= 1 to 2r

▷We need the full UPC array 3: worst_indexes←list of the indexesjof then_Flipslargest values ofupc

4: forj inworst_indexesdo

7: return e,s

(12)

This iteration is very useful at the start of the decoding process, when there is a lot of uncertainty about the correctness of the bits. We can point two immediate advantages of using FixFlip. First, since the number of flips is fixed, the number of wrong flips done by this iteration is limited. This makes FixFlip useful for small values of r, which is an important property for decoders to be used in Vasseur’s [Vas21] DFR extrapolation framework. Second, and most important, FixFlip is immune to the problem of BGF’s first threshold that gets larger asrgrows, since it does not rely on a generic threshold function that depends only on|s|. In fact, the threshold function for FixFlip depends directly on the UPC values and the target numbern_Flipsof bits to flip.

5.2 The PickyFlip Auxiliary Iteration

PickyFlip is very similar to the BitFlip iteration, except that it uses 2 different threshold:

τ_In is used to flip zeros to ones andτ_Out to flip ones to zeros. In particular, PickyFlip requires that the threshold to flip a zero to a one is greater than or equal to the threshold to flip a one to zero. This makes it picky with respect to the support ofeand explains why we useinandout to differentiate the thresholds. The iteration is formally described as Algorithm5.

The power of this iteration is that the weight of e does not grow too much in one iteration because it is easier to give up on a 1 in the partial error vectorethan to accept one more. Additionally, the effect of one PickyFlip iteration is similar to the sequence of black regret and gray regret steps, for a sufficiently highr. Luckily, because of its similarity with the BitFlip iteration, it can be easily implemented by small adjustments of the code by Drucker et al. [DGK20a] in BIKE Additional Implementation.

5.3 The PickyFix Decoder

We are now ready to define a full decoder, which is described as Algorithm6. To allow for a direct comparison between PickyFix and BGF, we decided to define it in a similar fashion: the first iteration makes 3 calls of the auxiliary steps, which are then followed by single calls in the nextN_Iter−1 iterations.

The thresholdτ_Out for PickyFix is fixed as (d+ 1)/2 in every iteration, which is the value typically used as the minimum threshold for flipping bits. For the value ofτ_In, we decided to use the BGF’s auxiliary function Thresh which was carefully built by the BIKE team and is sufficiently restrictive for our use case.

FixFlip depends on the following parameters: the numbern_Flips of flips to be done by FixFlipIterand the numberN_Iterof iterations. These parameters depend on the security level and significantly impact the decoder’s performance. We analyze these parameters in the next section.

Algorithm 5The PickyFlip iteration.

1: procedurePickyFlipIter(H,z,e,s, τ_In, τ_Out) 2: forj= 1 to 2rdo

3: upc_j=

4: if ej= 0 andupc_j≥τ_In then

5: ej←ej

6: else if ej= 1 andupc_j≥τ_Outthen

7: ej←ej

9: return e,s

(13)

Algorithm 6The PickyFix decoding algorithm.

1: procedurePickyFix(H= [H₀|H₁],z=cH^⊤₀)

2: e←0∈F²2^r ▷Initializes the partial error vector

3: s←z ▷Initializes the partial syndrome

4: fori= 1 toN_Iterdo

5: ▷Every timeeandsare updated, it holds thats=z+eH^⊤ 6: if i= 1then

7: e,s←FixFlipIter(H,z,e,s, n_Flips)

8: e,s←PickyFlipIter(H,z,e,s, τ_In=Thresh(s), τ_Out= (d+ 1)/2) 9: e,s←PickyFlipIter(H,z,e,s, τ_In=Thresh(s), τ_Out= (d+ 1)/2)

10: else

11: e,s←PickyFlipIter(H,z,e,s, τ_In=Thresh(s), τ_Out= (d+ 1)/2)

12: if eH^⊤=z then ▷This condition is equivalent tos=0 13: return e

14: else

15: return⊥ ▷Decoding failure

6 Analysis

The main problem when searching for good parameters (n_Flips, N_Iter) is that they are not independent. For example, if n_Flips is too small, we may need a large number N_Iter of iterations to compensate. To simplify our search, we will take a greedy approach and break the search into two parts.

In this section, first we find good values forn_Flipsby focusing only on the first iteration and then show that these values indeed yield decoders with a concave DFR curve. Finally, we proceed to evaluate the decoder performance for different numberN_Iterof iterations.

6.1 Choosing the FixFlip Parameter

Intuitively, the best value of n_Flipsis the one that minimizes the number of errors left to be corrected by further PickyFlip iterations. Ideally, one could see how each possible value of n_Flipsaffects the DFR curves following the extrapolation framework, and choose the one that has the fastest decay. The problem of this approach is that these experiments are very expensive and could easily take months of computing power.

To deal with this problem, instead of counting decoding failures, we count the average number of uncorrected errors left, which can be estimated with a much smaller sample than what is needed for the DFR estimation. Consider the curves PickyFixⁿ₁^Flips(r) that represent the average number of errors left after the first iteration of PickyFix when the FixFlip iteration performs n_Flips bit flips. Similarly, define the curve BGF₁(r) as the average number of errors left after the first iteration of BGF.

Figure6shows selected curves, where the average number of errors left was obtained by simulations of 10,000 runs. Notice how each PickyFlip curve eventually leaves about 0 errors after the first iteration. Furthermore, we can see that BGF appears to stall its error correction in its first iteration asrincreases. In Level 1, BGF even starts to leave more errors for sufficiently large values ofr, which is a consequence of the very conservative threshold used in the first iteration that we discuss in Section4.1.

To obtain the best value ofn_Flips we used the following criteria: for each security level, select the valuen_Flips such that

PickyFixⁿ₁^Flips(r) = 0, (1)

for the lowest value ofr. Furthermore, we restricted the search forn_Flipsto multiples of 5

(14)

9600 9800 10000 10200 10400 10600 10800 11000 Parameterr

0 100 200 300

Errorsleft

Level 1

BGF1

PickyFix²⁵₁ PickyFix⁵⁵₁ PickyFix⁸⁰₁

19200 19400 19600 19800 20000 20200 20400 20600 20800 21000 Parameterr

0 200 400

Errorsleft

Level 3

BGF1

PickyFix³⁰₁ PickyFix⁶⁵₁ PickyFix⁹⁵₁

33200 33400 33600 33800 34000 34200 34400 34600 34800 35000 Parameterr

0 200 400 600

Errorsleft

Level 5

BGF1

PickyFix⁵⁰₁ PickyFix¹⁰⁰₁ PickyFix¹⁵⁰₁

Figure 6: Comparison of the number of uncorrected errors after the first iteration for BGF and FixFlip using different values of n_Flips, for the three BIKE parameter sets. The different values ofn_Flipsare indicated by the label format FixFlipⁿ₁^Flips.

to speed up the search. The best values ofn_Flipsobtained for each security level are shown in Table3, where 10,000 tests were performed to estimate PickyFixⁿ₁^Flips(r) for each r.

Let us now see how PickyFix behaves with respect to the concavity with an experiment similar to the one done in Section4.3for BGF. First notice that we could not uset= 155 because PickyFix was much better than BGF’s and its DFR quickly got to the point where no failure could be observed in our simulation. Therefore, we had to consider t= 160.

Figure7shows our results for this experiment. We invite the reader to compare this figure with Figure4.3and see that, not only PickyFix’s DFR appears to be concave in the same interval, but it also outperforms BGF with 5 iterations for a higher value oft. Furthermore, we also tested PickyFix for levels 3 and 5, usingt= 240 andt= 330, respectively, and the DFR curves appear to be concave, unlike the ones for BGF.

(15)

Table 3: The best values ofn_Flipsfor each security level. Valuer₀denotes the first value ofrwhen Equation1is satisfied.

Parameter set Security level Valuer₀ n_Flips PickyFixⁿ₁^Flips(r₀) BGF₁(r₀)

BIKE Level 1 128 11,001 55 0.0 63.97

BIKE Level 3 192 21,201 65 0.0 109.06

BIKE Level 5 256 35,001 100 0.0 105.79

10000 10500 11000 11500 12000 12500

Parameterr 2⁻¹⁸

2⁻¹⁴ 2⁻¹⁰ 2⁻⁶ 2⁻²

DFR

Iterations 2 3 4 5

Figure 7: The DFR curves for PickyFix when using 2 to 5 iterations considering the BIKE Level 1 parameter set witht= 160.

6.2 Achieving Negligible DFR

Now comes the most important evaluation of PickyFix, which consists of its decoding performance under the extrapolation framework. Our results are shown in Figure8. The number of tests to determine each DFR estimate was selected to be enough to obtain approximately 1000 failures (at least) for each point and can be found in data/setup/

dfr_experiment.csv.

Table5shows the results for the DFR extrapolation of the curves considered in Figure8, together with the performance of our constant-time implementations. The extrapolation was done for the last two points (r_A, p_A) and (r_B, p_B) where more than 1000 failures were observed and consideredα= 0.01 for the Clopper-Pearson method to build the confidence interval forp_Aandp_B.

We can see, from Table 5, that even with less than 5 iterations, the extrapolated parameterrfor each security level does not differ by much from the parameters proposed by the BIKE team using BGF (Table 1). However, since PickyFix also works with a reduced number of iterations, its performance can be significantly better.

From the results presented in this section, PickyFix looks like a promising decoder for BIKE. However, remember that the FixFlip auxiliary iteration used by PickyFix is inherently more complex than those used by BGF. In the next section, we describe how to efficiently implement PickyFix in constant-time and show that our decoder provides a major speedup over BGF for all security levels.

(16)

9600 9800 10000 10200 10400 10600 Parameterr

2⁰ 2⁻⁴ 2⁻⁸ 2⁻¹² 2⁻¹⁶ 2⁻²⁰

DFR

Level 1

Iterations 2 3 4 5

19200 19400 19600 19800 20000 20200 20400 20600 Parameterr

2⁰ 2⁻⁴ 2⁻⁸ 2⁻¹² 2⁻¹⁶ 2⁻²⁰

DFR

Level 3

Iterations 2 3 4 5

33400 33600 33800 34000 34200 34400 34600

Parameterr 2⁰

2⁻⁴ 2⁻⁸ 2⁻¹² 2⁻¹⁶ 2⁻²⁰

DFR

Level 5

Iterations 2 3 4 5

Figure 8: The DFR for PickyFix when using 2 to 5 iterations, considering all security levels.

7 Efficient Implementation in Constant Time

The efficient constant-time implementation proposed by the BIKE team is based on Chou’s [Cho16] QcBits with further improvements by Guimarães et al. [GAB19] and Drucker et al. [DGK20b,DG19]. Using these ideas, Drucker et al. [DGK19,DGK20c]

proposed the BGF implementation that is the best performing decoder up to this day, which is implemented in BIKE’s Additional Implementation [DGK20a].

We based our PickyFix implementation on Drucker’s et al. [DGK20a] code, which im- plements, in constant-time, most of the procedures required for both PickyFlip and FixFlip iterations. This includes, for example, the syndrome and UPC counters computations, and algorithms to flip bits given a threshold.

This section begins with a high-level description on how to adapt Drucker’s et al.

[DGK20a] implementation to perform the PickyFlip iteration in constant-time. Then we give a more detailed explanation on how to implement the procedures needed by FixFlip that are significantly different from what is used by previous decoders. We end this section

(17)

with a performance evaluation of our constant-time implementation, which is available at https://github.com/thalespaiva/pickyfix.

7.1 Implementing the PickyFlip Iteration

Remember that PickyFlip is similar to the BitFlip iteration, except that it uses a different threshold to flip zeros and ones. More specifically, consider theBitFlipIterdescribed in Algorithm2. Notice how if upc_j ≥τ₀ it invertse_j, but ifτ₀−δ≤upc_j< τ₀, it updates GrayMask_j = 1. BitFlip behavior is then very similar to PickyFix if we letτ₀=τ_In and δ=τ_In−τ_Out.

BIKE’s efficient implementation of BitFlip is based on QcBits [Cho16], and we implemented PickyFix by reusing their implementation. Since the details of this implementation are already described by Chou [Cho16], we give here only a brief description of how it works.

Suppose we want to flip all bits inewhose UPC counters are above a thresholdτ_In. First, all UPC counters are computed in bitsliced form. Since the UPC counters are lower than or equal tod=w/2, then⌈log₂(d)⌉slices are enough. Second, the implementation performs a bitsliced subtraction ofτ_In over all UPC counters. Therefore, the 0 bits in the last slice, which contains the most significant bits, indicate that the UPC was greater than or equal toτ_In, and thus the corresponding bit ineshould be flipped.

Notice that PickyFix performs the procedure above two times: one forτ_Inand other toτ_Out. However, the computation of UPC counters, which is the most costly step, is only done once for the two thresholds. The cost of the call is then very similar to the complexity ofBitFlipIter.

7.2 Implementing the FixFlip Iteration

Most of the steps needed by the FixFlip algorithm are common to all variants of the original bit-flipping decoder proposed by Gallager [Gal62]. Therefore, we can base our implementation in the most efficient constant-time implementations of QC-MDPC decoders, if we can efficiently implement the sorting step of FixFlip, corresponding to line 3 of Algorithm4.

Simply put, the main problem we need to solve is: given a list of UPC counters, flip then_Flipsbits that have the largest counters. This motivates us to call the set of indexes of entries to be flipped as a FixFlip set, which is formally defined bellow.

Definition 1 (FixFlip set). Consider a list of UPC countersU = (u₁, . . . , u₂r). A FixFlip setS with respect toU andn_Flipsis a set ofn_Flips indexes such thatui≥usfor alli /∈S and for alls∈S.

Notice that, in general, there are more than 1 FixFlip set for the same list of UPC counters. For example, for a list of UPC countersU = (3,5,2,3,7,1,3,1) andn_Flips= 4, thenS₁={1,2,4,5} andS₂={1,2,5,7}are two valid FixFlip sets. Furthermore, notice that any FixFlip set S forU can be constructed by the thresholdτ= 3 and the integer nτ = 1 by taking every indexiwhose UPC is strictly greater thanτ and also takingnτ

indexes whose UPC is equal toτ. The pair (τ, nτ) is then called a FixFlip threshold, and is formally defined next.

Definition 2(FixFlip threshold). LetU = (u₁, . . . , u₂r) be a list of UPC counters. A pair (τ, nτ) is a FixFlip threshold with respect toU andn_Flipsif, for any FixFlip setS can be partitioned intoS=S>τ∪S₌τ such thatS>τ ={s∈S:us> τ},S₌τ={s∈S:us=τ}

and|S₌τ|=nτ.

This notion helps us to reduce the problem of flipping the bits with the largest UPC values to finding a FixFlip threshold, as shown in Algorithm7. The idea of the algorithm

(18)

Algorithm 7Algorithm to flip then_Flipsentries ofewith largest UPC counters.

1: procedureFlipWorstFitEntries(n_Flips,upc) 2: τ, nτ ←FixFlipThreshold(n_Flips,upc) 3: Nτ← |{i:upc_i=τ}|

4: FlipFlagsForThreshold←Random binary vector ofNτ bits with weightnτ

5: η←0 ▷Counts the number of bits seen whoseupcisτ

6: fori= 1 to 2rdo 7: if upc_i> τ then

8: ei=ei

9: else if upc_i=τ then

10: η←η+ 1

11: if FlipFlagsForThreshold_η= 1 then

12: ei=ei

13: return e

is to flip all bits whose UPC is above τ, and use the arrayFlipFlagsForThresholdto control which set of nτ bits should be flipped among all of theNτ bits whose UPC is τ.

The conditionals in Algorithm7can be implemented in constant-time using condition masks. However, there are two aspects that are important to notice when converting the algorithm to a constant-time implementation. The first is that it is not trivial to implement FixFlipThresholdin constant-time. The second is that, to generate the random vector FlipFlagsForThresholdof fixed weight in line4, and to hide the accesses to indexη in line11, we need a tight upper bound onN_τ. In the next two sections, we describe how our constant-time implementation deals with these concerns.

7.2.1 Computing the FixFlip Threshold

The straightforward solution is to use general sorting algorithms, such as quicksort, to sort the indexes based on the corresponding UPC counters’ values, and then return the first n_Flips indexes. There are two problems with this approach. The first is that the average complexity would beO(rlogr) which would result in an iteration much costlier than that of BGF or BG. The second, and most problematic one, is that the algorithm would not be constant-time and timing attacks would be practical.

Notice that the values of the UPC counters are always in{0, . . . , d}, which is a relatively small range, and therefore counting sort is an interesting option that allows for linear sort. The problem with using counting sort in this cryptographic setting is that the constant-time implementation would not be efficient: for every counter, we need to touch all thed+ 1 buckets to avoid cache timing attacks, resulting inO(wr) complexity.

We can do better by analyzing the context in which FixFlip iteration is used. Since the weight tof the error vector is at mostt= 264, considering security level 5, then it is not necessary to allow for more than 264 flips in each FixFlip iteration. Furthermore, we already saw in Section 6.1that, in practice,n_Flips is typically much lower thantfor all security levels, and we can safely assumen_Flips<256. This means that, when performing the counting sort, we only need to count up to 255, since we need only to return the indexes corresponding to then_Flipslargest counters. Therefore, 8 bits are needed for each bucket.

Still, even if we can pack 8 buckets into one 64-bit register, we would need to touch all

⌈(d+ 1)/8⌉registers for each counting update. The number of registers would result in 9×2rand 18×2roperations, considering parameters for levels 1 and 5. But remember that we do not need to count all entries, and we can take what we call the reduced UPC counters approach, which is described next.

Figure9 shows how the algorithm works in a real decoding instance considering BIKE Level 5. Suppose we are given a listU = (u₁, . . . , u₂r) of UPC counters and we want to find

(19)

255⁺ 255⁺ 0 0 0 0 0 0

0 – 63 64 – 127 128 – 191 192 – 255 256 – 319 320 – 383 384 – 447 448 – 511

255⁺ 255⁺ 127 30 1 0 0 0

64 – 71 72 – 79 80 – 87 88 – 95 96 – 103 104 – 111 112 – 119 120 – 127

20 21 20 19 18 15 8 6

80 81 82 83 84 85 86 87

8-bit counters:

UPC Buckets:

8-bit counters:

UPC Buckets:

8-bit counters:

UPC Buckets:

64 bits

Figure 9: Using 3 levels of partial counting sorts to find the FixFlip threshold forn_Flips= 40, considering a real execution of the procedure under BIKE Level 5 parameter set. In this example, the FixFlip threshold corresponds toτ= 86 andnτ = 3.

the FixFlip threshold forU andn_Flips<256. To show our concrete efficient implementation, we assume the following conditions, that hold in the real world parameters.

1. Each UPC counterui≤d <512.

2. The number of bits to flip isn_Flips<256.

The FixFlip threshold is found in 3 counting steps, and each step uses only 8 buckets.

For the first step, each bucket i, where i goes from 0 to 7, corresponds to the UPC counters in the interval [64i,64i+ 63]. The algorithm then runs from u₁ tou₂rcounting the occurrences into the buckets, but with the following rule: the counting is done only in 8 bits, and it should not overflow. That is, the maximum count is 255 for each bucket.

Now suppose the resulting counts for each bucket is [255,255,0,0,0,0,0,0], and consider the casen_Flips= 40, just like in Figure9. Then the bucket where the FixFlip threshold lives must be Bucket 1, since Buckets 3 to 7 do not have any entry, and there are more than n_Flipsentries in Bucket 1. Using Bucketb₁= 1 selected in this step, the algorithm proceeds to the next step.

In the second step, the algorithm expands Bucketb₁, and the 8 counting buckets are zeroed. Now, each bucketiwill count the UPC counters in the interval [B₂+ 8i, B₂+ 8i+ 7], whereB₂= 64b₁. Again, the algorithm runs through the counters in reversed order until it finds where the FixFlip threshold lives. In the case considered in Figure 9, Bucket 3 is not enough to contain the threshold since it separates at most 31 UPC counters from the rest. Therefore, the search continues using Bucketb₂= 2.

In the third and last step, Bucketb₂ is expanded, and now each counting bucket will correspond to one UPC value. Formally, each Bucket iwill count occurrences of the UPC counterB₃+i, whereB₃=B₂+ 8b₂. If we consider the search in Figure 9, we can see that it stops atτ= 86, since it has found 6 + 30 + 1 = 37 UPC values aboveτ andn_τ= 3 UPC values equal to τ complete then_Flips= 40 bits to be flipped.

Now let us analyze why this algorithm is useful. Since each bucket uses only 8 bits, we can pack all the 8 buckets into a single 64-bits register. Therefore, each update on the counters updates a single register, which avoids the cache-timing attacks. Since 3 rounds are necessary, the threshold is found in about 3×2rtouches on the counting registers.