Relocating INTTs in DM/CGGI - possible optimization?

Hello!

Analyzing the implementation of DM/CGGI it seems to me that relocating the application of NTTs/INTTs could save 1 NTT and 2 INTTs per bootstrapping. I wanted to share the analysis in case it could result in a (slight) optimization of the implementation or to know whether there is something fundamentally wrong with my approach. (I think it would also apply to LMKCDEY, but I haven’t looked closely into that code.)

I will follow the BootstrapFunc and BootstrapFuncCore functions as a base for the following description, though the same would hold for the EvalBinGate and BootstrapGateCore functions. The life-cycle of the accumulator through a full bootstrapping procedure based on those functions can be abstracted to:


Initialization

  • Construct B from f and b [code]
  • (A, B) := (0, \mathrm{NTT}(B)) [code]

AddToAccX (called repeatedly, X \in {DM, CGGI})

  • Save (A_{NTT}, B_{NTT}) := (A, B) for X == CGGI

  • (A, B) := (\mathrm{INTT}(A), \mathrm{INTT}(B)) [DM, CGGI]

  • (A_0, B_0, \cdots, A_{dg-1}, B_{dg-1}) := \mathrm{SignedDecompose}(A, B) [DM, CGGI]

  • (A_0, B_0, \cdots, A_{dg-1}, B_{dg-1}) := (\mathrm{NTT}(A_0), \mathrm{NTT}(B_0), \dots, \mathrm{NTT}(A_{dg-1}), \mathrm{NTT}(B_{dg-1})) [DM, CGGI]

  • If X == DM:

    • (A, B) := (A_0, B_0, \dots, A_{dg-1}, B_{dg-1}) * \mathrm{ev} [code]
  • If X == CGGI:

    • (A, B) := (A_{NTT}, B_{NTT}) + (A_0, B_0, \dots, A_{dg-1}, B_{dg-1}) * \mathrm{ev}_1 * monomial + (A_0, B_0, \dots, A_{dg-1}, B_{dg-1}) * \mathrm{ev}_2 * monomialNeg [code]

Extraction

  • A := \mathrm{Transpose}(A) [code]

  • (A, B) := (\mathrm{INTT}(A), \mathrm{INTT}(B)) [code]

  • (a_{LWE}, b_{LWE}) := (A, B[0]) [code]


Note how the \mathrm{NTT} performed in the Initialization and the two \mathrm{INTTs} performed in the first call to AddToAcc are redundant. One way to avoid computing those is to remove the NTT from the Initialization, to move the 2 INTTs from the beginning of AddToAcc to the end of the function and to remove the 2 INTTs from the Extraction.

That approach would follow this structure:


Initialization

  • Construct B from f and b

AddToAccX (called repeatedly, X \in {DM, CGGI})

  • Save (A_{coef}, B_{coef}) := (A, B) for X == CGGI

  • (A_0, B_0, \cdots, A_{dg-1}, B_{dg-1}) := \mathrm{SignedDecompose}(A, B)

  • (A_0, B_0, \cdots, A_{dg-1}, B_{dg-1}) := (\mathrm{NTT}(A_0), \mathrm{NTT}(B_0), \dots, \mathrm{NTT}(A_{dg-1}), \mathrm{NTT}(B_{dg-1}))

  • If X == DM:

    • (A, B) := (A_0, B_0, \dots, A_{dg-1}, B_{dg-1}) * \mathrm{ev}
  • If X == CGGI:

    • (A, B) := (A_0, B_0, \dots, A_{dg-1}, B_{dg-1}) * \mathrm{ev}_1 * monomial + (A_0, B_0, \dots, A_{dg-1}, B_{dg-1}) * \mathrm{ev}_2 * monomialNeg
  • (A, B) := (\mathrm{INTT}(A), \mathrm{INTT}(B))

  • If X == CGGI:

    • (A, B) := (A, B) + (A_{coef}, B_{coef})

Extraction

  • A := \mathrm{SignedPermute}(A)

  • (a_{LWE}, b_{LWE}) := (A, B[0])


There are two issues to overcome when changing the positions of the INTTs:

  1. To update (A, B) in AddToAccCGGI we need to add the input value of the accumulator, that I denoted as (A_{NTT}, B_{NTT}) for the current version. In the proposed version, however, there is no access to an “NTT” version of the input accumulator, so we would have to add the input accumulator only after applying the INTTs.

  2. The current version of the Extraction transposes A before applying the INTTs to account for the mismatch in the keys, which is not possible in the proposed version. I am not really sure of how that transposition works (what is its effect on the vector?), but I understand that it should be possible to perform a similar permutation having the mask A in coefficient form. According to [MP21, Section 3.1], if no transposition is performed we would end up with a ciphertext encrypted under the key z = (z_0, -z_{N-1}, \dots, -z_{1}) \in \mathbb{Z}_Q^N (accounting for the embedding into Z_Q[X]/(X^N+1)). I understand that to obtain a ciphertext encrypted under the original key we could take the mask A in coefficient form, leave the 0th element untouched, and reverse and change the sign of the rest of the vector. That is the operation that I denoted by \mathrm{SignedPermute}(\cdot).

This way we would effectively save the computation of 1 NTT and 2 INTTs. I would like to know whether there is something wrong with the proposed approach.

Thanks in advance,

Antonio

I’m posting here the descriptions as code blocks, since some lines can’t be read completely in the previous post.

Current version:

// Initialization

B := Initialize(b, f)

(A, B) := (0, NTT(B))


// AddToAccX (Called repeatedly, X in {'DM', 'CGGI'})

(A_{NTT}, B_{NTT}) := (A, B) // for X == CGGI

(A, B) := (INTT(A), INTT(B))

(A_0, B_0,..., A_{dg-1}, B_{dg-1}) := SignedDecompose(A, B)

(A_0, B_0,..., A_{dg-1}, B_{dg-1}) := (NTT(A_0), NTT(B_0),..., NTT(A_{dg-1}), NTT(B_{dg-1}))

if X == 'DM':
    (A, B) := (A_0, B_0,..., A_{dg-1}, B_{dg-1}) * ev

if X == 'CGGI':
    (A, B) := (A_{NTT}, B_{NTT}) + (A_0, B_0,..., A_{dg-1}, B_{dg-1}) * ev1 * monomial + (A_0, B_0,..., A_{dg-1}, B_{dg-1}) * ev2 * monomialNeg


// Extraction

A := Transpose(A)

(A, B) := (INTT(A), INTT(B))

(a_{lwe}, b_{lwe}) := (A, B[0])

Proposed version:

// Initialization

B := Initialize(b, f)

(A, B) := (0, B)


// AddToAccX (Called repeatedly, X in {'DM', 'CGGI'})

(A_{coef}, B_{coef}) := (A, B) // for X == CGGI

(A_0, B_0,..., A_{dg-1}, B_{dg-1}) := SignedDecompose(A, B)

(A_0, B_0,..., A_{dg-1}, B_{dg-1}) := (NTT(A_0), NTT(B_0),..., NTT(A_{dg-1}), NTT(B_{dg-1}))

if X == 'DM':
    (A, B) := (A_0, B_0,..., A_{dg-1}, B_{dg-1}) * ev

if X == 'CGGI':
    (A, B) := (A_0, B_0,..., A_{dg-1}, B_{dg-1}) * ev1 * monomial + (A_0, B_0,..., A_{dg-1}, B_{dg-1}) * ev2 * monomialNeg

(A, B) := (INTT(A), INTT(B))

if X == 'CGGI':
    (A, B) := (A, B) + (A_{coef}, B_{coef})


// Extraction

A := SignedPermute(A)

(a_{lwe}, b_{lwe}) := (A, B[0])

Hi @amerigal ,

Thank you for the suggested optimization. Conceptually the optimization should work. However, the benefit we get from it seems negligible to me. In bootstrapping, we need n runs of AddToAcc. In each run of AddToAcc, we evaluate 2 d_g (I)NTTs. So in total, we get 2 d_g n (I)NTTs (in the optimized GINX or LMKCDEY cases). The smallest value of n we deal with is 446 and the smallest value of d_g is 2. So we get about 1,800 (I)NTTs in total. The gain of 3 (I)NTTs gives us at most a speed-up of 0.17% (in practice, it is less as the values of n and d_g are higher and there are other operations). We reserve 1-2% for variability in measuring the bootstrapping runtime. So the gain of this optimization is expected to be negligible in practice unless I am missing something in what you suggested.

Thank you for your response. I now see that the speed-up is negligible in practice and you are indeed not missing anything from my suggestion. In any case, knowing that the suggested approach should also work really helps me understanding the schemes and how they are implemented. I appreciate you took the time to review it.