Looking at the rotations in the square matrix multiplication algorithm, is my understanding correct: for a 128x128 matrix and a parameter set (log N = 16, log PQ = 1728, L = 20, dnum = 3), each distinct rotation index needs a key of size ~83 MB (dnum * 2 * N * log PQ / 8). We need rotations by i, -i, and i*rowsize for i = 1, …, 127. This adds up to 127 * 3 = 381 rotation keys, each of size ~83 MB, for a total of about 31.6 GB.
My question: is there any way to increase log N to 17 and perform 256x256 matrix multiplications without running into memory issues on a GPU?
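For reference, here is the quick back-of-the-envelope check behind those numbers, assuming each rotation key takes dnum * 2 * N * log PQ bits and that the algorithm needs rotations by i, -i, and i*rowsize for i = 1, …, d-1. The 256x256 line keeps log PQ and dnum unchanged, which is just an assumption; a real parameter set at log N = 17 might differ.

```python
# Back-of-the-envelope check of the rotation-key memory estimate above.
# Assumes each rotation key takes dnum * 2 * N * log(PQ) bits and that
# rotations by i, -i and i*rowsize are needed for i = 1..d-1.

def rotation_key_memory(log_N, log_PQ, dnum, d):
    N = 1 << log_N
    key_bytes = dnum * 2 * N * log_PQ // 8      # one rotation key
    num_keys = 3 * (d - 1)                      # {i, -i, i*d : i = 1..d-1}
    return key_bytes, num_keys, key_bytes * num_keys

MB, GB = 1e6, 1e9

# 128x128 case from above: roughly 85 MB per key, ~32 GB total
key, n, total = rotation_key_memory(log_N=16, log_PQ=1728, dnum=3, d=128)
print(f"128x128: {key / MB:.0f} MB/key, {n} keys, {total / GB:.1f} GB total")

# Hypothetical 256x256 case with log N = 17, keeping log PQ and dnum fixed
key, n, total = rotation_key_memory(log_N=17, log_PQ=1728, dnum=3, d=256)
print(f"256x256: {key / MB:.0f} MB/key, {n} keys, {total / GB:.1f} GB total")
```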
It seems you have your own GPU FHE library. Do you need to keep all the keys in GPU memory? Maybe you can push them to GPU memory only when needed, or prefetch them slightly ahead of the time they are needed.
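As a rough sketch of what I mean (the names upload_key_async, synchronize_upload, apply_rotation, and free_device_key are just placeholders for whatever your library exposes, not a real API): keep only the key currently in use plus the next one resident on the GPU, and overlap the host-to-device copy of the next key with the rotation that uses the current one.

```python
# Double-buffered key streaming sketch (hypothetical placeholder functions:
# upload_key_async, synchronize_upload, apply_rotation, free_device_key).

def rotate_all_with_streamed_keys(ct, rotation_indices, host_keys):
    """Compute ct rotated by each index while keeping at most two rotation
    keys on the GPU: the one in use and the one being prefetched."""
    rotated = {}
    pending = upload_key_async(host_keys[rotation_indices[0]])
    for step, idx in enumerate(rotation_indices):
        current = pending
        pending = None
        if step + 1 < len(rotation_indices):
            nxt_idx = rotation_indices[step + 1]
            pending = upload_key_async(host_keys[nxt_idx])  # overlap copy with compute
        synchronize_upload(current)            # wait until this key has arrived on the GPU
        rotated[idx] = apply_rotation(ct, idx, current)
        free_device_key(current)               # reclaim GPU memory for this key
    return rotated
```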
Looking at this from a different angle, would it make sense to decompose your 256x256 matrix into four 128x128 sub-matrices and compute the full product using 8 sub-matrix products and 4 sub-matrix additions?
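To make the block structure concrete, here is a plaintext NumPy version of that decomposition; in the encrypted setting each 128x128 block product would be one homomorphic matrix multiplication and each addition a ciphertext addition, but the bookkeeping is the same.

```python
import numpy as np

# Plaintext illustration of the 2x2 block decomposition of a 256x256 product.
d = 256
h = d // 2
A = np.random.rand(d, d)
B = np.random.rand(d, d)

A11, A12 = A[:h, :h], A[:h, h:]
A21, A22 = A[h:, :h], A[h:, h:]
B11, B12 = B[:h, :h], B[:h, h:]
B21, B22 = B[h:, :h], B[h:, h:]

# 8 sub-matrix products, 4 sub-matrix additions
C11 = A11 @ B11 + A12 @ B21
C12 = A11 @ B12 + A12 @ B22
C21 = A21 @ B11 + A22 @ B21
C22 = A21 @ B12 + A22 @ B22

C = np.block([[C11, C12], [C21, C22]])
assert np.allclose(C, A @ B)   # matches the full 256x256 product
```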
Hi, I’m not running any code on the GPU at the moment, but I’m planning ahead for our library.
Right now I am following your suggestion and decomposing the matrix. I will also look into the prefetching approach, thank you so much!
Do you think my calculations are correct (especially the required number of rotation keys) and reflect the actual key-material size, or did I overcount some part and end up with an inflated value?