GPU Acceleration of OpenFHE - GPU memory


I am investigating hardware acceleration of homomorphic operations on GPUs. The design paper suggests this should theoretically be supported through the hardware acceleration layer; however, I am not sure how it can be done in practice. The main difficulty is handling the movement of ciphertexts back and forth between main memory and GPU memory. I have been working on accelerating NativeVector operations, but this requires moving data back and forth for every operation, which erases any potential speedup. It seems there needs to be a way to keep ciphertexts resident on the GPU, but I am not sure how that can be handled using the current hardware acceleration layer.

Any thoughts or discussion would be greatly appreciated. Thank you!

Well, there is no direct GPU support in the HAL yet; currently there is just AVX512 (a CPU-based vector instruction set). However, we have a work-in-progress version of the HAL for integrating a PCI-based accelerator.
In that case we have to track which DCRTPolys are on the CPU vs. on the PCI accelerator, and synchronize their transfer back and forth. That is somewhat similar to the GPU problem, though on a GPU, unified memory tends to avoid the need to move data manually.

However, that won't be in a form that we can openly release for several months to a year.

However, I'll point out some things we have seen in the past: you will find that GPU ALUs generally have disappointing performance on FHE math, for a few reasons: 1) the need for modular arithmetic; 2) poor performance with 64-bit moduli (on most GPUs the ALU is 32-bit, and 64-bit arithmetic is emulated); 3) ciphertexts can be quite large, which means needing a large memory footprint on the GPU.

That said, if you are interested in the binFHE schemes, they run at 128-bit post-quantum security with 32-bit arithmetic and very modest vectors, 1k-4k elements in size. Basically one executes a whole mess of gates in parallel (see the transpiler sub-repo or the encrypted digital circuit repo). Here the gate math is trivial on a CPU, and it is the bootstrap that is costly. One might consider writing a completely vectorized version of bootstrapping over a large number of gates. That might map well to GPUs, since the length of the vector is the number of gates bootstrapped in parallel, something you have complete control over.