OpenFHE on AMD APUs

Hi,

I’ve been experimenting with running parts of the OpenFHE library on the AMD MI300A APU [1], which, unlike a discrete GPU, provides unified shared memory between its CPU and GPU cores.

After profiling and considering the work presented in [2], I modified the ApproxSwitchCRTBasis function. Specifically, I replaced 128-bit integer operations with a struct of two 64-bit integers, and rewrote the related arithmetic functions such as multiplication and Barrett reduction to operate on this new representation.
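For anyone curious, the general idea can be sketched as below. This is a minimal illustration assuming a two-limb representation and a 64x64 -> 128-bit multiply built from 32-bit partial products; the names (`Uint128`, `Mul64`) are illustrative, not the actual OpenFHE or HIP types:

```cpp
#include <cstdint>

// Illustrative two-limb replacement for a native 128-bit integer.
// (Hypothetical names; not OpenFHE's internal representation.)
struct Uint128 {
    uint64_t lo;
    uint64_t hi;
};

// Full 64x64 -> 128-bit product computed from four 32-bit partial
// products, avoiding any native 128-bit type.
Uint128 Mul64(uint64_t a, uint64_t b) {
    const uint64_t mask = 0xFFFFFFFFULL;
    uint64_t aLo = a & mask, aHi = a >> 32;
    uint64_t bLo = b & mask, bHi = b >> 32;

    uint64_t p0 = aLo * bLo;  // low x low
    uint64_t p1 = aLo * bHi;  // cross terms
    uint64_t p2 = aHi * bLo;
    uint64_t p3 = aHi * bHi;  // high x high

    // Sum the 32-bit-aligned middle terms; this also collects the carry
    // that propagates into the high limb.
    uint64_t mid = (p0 >> 32) + (p1 & mask) + (p2 & mask);

    Uint128 r;
    r.lo = (p0 & mask) | (mid << 32);
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
```

Barrett reduction can then be rebuilt on top of this primitive in the same style, operating limb-by-limb on the `Uint128` result.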

For a ring dimension of 16384 and p and q sizes of 4, I measured the performance (wall time) of the modified parallelized loop by executing a single multiplication operation under the BFV encryption scheme. So far, performance is roughly on par with the default implementation. This figure excludes the overhead of marshalling and unmarshalling data for the GPU computation.
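For reference, the wall-time measurement was taken in the usual way with a steady clock around the loop under test. A minimal sketch (the kernel below is a stand-in, not OpenFHE's ApproxSwitchCRTBasis, and the parallelization pragma is omitted):

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Times one pass over the data and returns elapsed wall time in ms.
// The loop body is a placeholder for the parallelized kernel under test.
double TimeLoopMs(std::vector<uint64_t>& v) {
    auto start = std::chrono::steady_clock::now();

    for (auto& x : v) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    }

    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```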

I will continue experimenting; any feedback is appreciated.

[1] https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf

[2] Towards GPU Accelerated FHE Computations (IEEE conference publication, IEEE Xplore)

I just wanted to comment that I achieved a performance improvement of about 25% compared to the CPU (AMD EPYC 9374F), excluding data (un)marshalling overhead. The ring dimension was 32768. Ensuring that vector and array elements are stored in contiguous memory locations had the most influence on performance.
