OpenFHE on AMD APUs

Hi,

I’ve been experimenting with running parts of the OpenFHE library on the AMD MI300A APU [1], which, unlike a discrete GPU, provides unified shared memory between its CPU and GPU cores.

After profiling and considering the work presented in [2], I modified the ApproxSwitchCRTBasis function. Specifically, I replaced 128-bit integer operations with a struct of two 64-bit integers, and rewrote the related arithmetic functions such as multiplication and Barrett reduction to operate on this new representation.
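To make the representation change concrete, here is a minimal C++ sketch of the idea, not the actual code from my branch or from OpenFHE. The names `U128`, `Mul64`, `ShiftRight`, `ComputeMu`, and `BarrettReduce` are hypothetical, and the sketch assumes a modulus `q` with exactly `n` bits where `2 <= n <= 60`, and an input `x < 2^(2n)`. It uses the textbook Barrett variant with a 64-bit `mu = floor(2^(2n) / q)`, which may differ from the variant OpenFHE uses internally.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 128-bit unsigned integer: two 64-bit halves.
struct U128 {
    uint64_t hi;
    uint64_t lo;
};

// Full 64x64 -> 128-bit product, built from four 32x32 -> 64-bit partial products.
inline U128 Mul64(uint64_t a, uint64_t b) {
    const uint64_t mask = 0xFFFFFFFFULL;
    const uint64_t a0 = a & mask, a1 = a >> 32;
    const uint64_t b0 = b & mask, b1 = b >> 32;
    const uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    // Middle column: three terms below 2^32 each, so the sum cannot overflow.
    const uint64_t mid = (p00 >> 32) + (p01 & mask) + (p10 & mask);
    U128 r;
    r.lo = (mid << 32) | (p00 & mask);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}

// Right shift by 0 < s < 64; the caller guarantees the result fits in 64 bits.
inline uint64_t ShiftRight(U128 x, int s) {
    return (x.hi << (64 - s)) | (x.lo >> s);
}

// Precompute mu = floor(2^(2n) / q) by bitwise long division (host side, done once).
inline uint64_t ComputeMu(uint64_t q, int n) {
    assert(n >= 2 && n <= 60);
    uint64_t quot = 0, rem = 0;
    for (int i = 2 * n; i >= 0; --i) {
        rem = (rem << 1) | (i == 2 * n ? 1u : 0u);  // only bit 2n of the dividend is set
        quot <<= 1;
        if (rem >= q) { rem -= q; quot |= 1; }
    }
    return quot;
}

// Textbook Barrett reduction: returns x mod q for x < 2^(2n).
inline uint64_t BarrettReduce(U128 x, uint64_t q, uint64_t mu, int n) {
    assert(n >= 2 && n <= 60);
    const uint64_t q1 = ShiftRight(x, n - 1);             // floor(x / 2^(n-1))
    const uint64_t q3 = ShiftRight(Mul64(q1, mu), n + 1); // estimate of floor(x / q)
    uint64_t r = x.lo - q3 * q;  // true remainder is below 3q, so 64-bit wraparound is exact
    while (r >= q) r -= q;       // at most two corrective subtractions
    return r;
}
```

For example, with `q = 17` and `n = 5`, `ComputeMu` gives 60 and `BarrettReduce(Mul64(16, 16), 17, 60, 5)` gives 1. The functions are plain C++; for a HIP build they would presumably need `__host__ __device__` qualifiers so that the same code compiles for both the CPU and GPU cores.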

For a ring dimension of 16384 and p and q size of 4, I measured the wall time of the modified, parallelized loop while executing a single multiplication with the BFV encryption scheme. So far, the performance is roughly the same as the default implementation. This measurement does not include the overhead of marshalling and unmarshalling data for the GPU computation.
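For anyone who wants to reproduce this kind of measurement, a wall-time helper along the following lines would do. This is a generic sketch, not the harness I used; `MeanWallTimeMs` is a hypothetical name, and `work` stands for whatever callable wraps the loop being timed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs `work` `reps` times and returns the mean
// wall time per run in milliseconds, using a monotonic clock.
template <typename F>
double MeanWallTimeMs(F&& work, std::size_t reps) {
    assert(reps > 0);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (std::size_t i = 0; i < reps; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / static_cast<double>(reps);
}
```

When the timed callable launches GPU work, it should synchronize with the device (for example via `hipDeviceSynchronize()` in HIP) before returning, since kernel launches are asynchronous and the clock would otherwise stop too early.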

I will continue experimenting; any feedback is appreciated.

[1] https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf

[2] "Towards GPU Accelerated FHE Computations", IEEE conference publication (IEEE Xplore)