Single Threading performs faster than Multi-Threading

Hi

I compiled the library with and without OpenMP support, and the profiled results show that the build without OpenMP runs faster than the build with OpenMP enabled.

Compiling the library with OpenMP disabled (presumably running single-threaded)

Compilation Command:

cmake -DWITH_OPENMP=OFF ..
make -j
./bin/

The profiled latency from different kernels is shown below:

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
CKKSrns_KeyGen                  2266 us         2266 us          309
CKKSrns_MultKeyGen              4387 us         4386 us          161
CKKSrns_EvalAtIndexKeyGen       4746 us         4745 us          147
CKKSrns_Encryption              1622 us         1622 us          433
CKKSrns_Decryption               160 us          160 us         4384
CKKSrns_Add                     31.5 us         31.5 us        22255
CKKSrns_AddInPlace              23.4 us         23.4 us        29659
CKKSrns_MultNoRelin              289 us          288 us         2425
CKKSrns_MultRelin               2770 us         2769 us          253
CKKSrns_Relin                   2531 us         2530 us          276
CKKSrns_RelinInPlace            2623 us         2621 us          267
CKKSrns_Rescale                  449 us          449 us         1555
CKKSrns_RescaleInPlace           439 us          439 us         1590
CKKSrns_EvalAtIndex             2456 us         2456 us          282

Compiling the library with OpenMP enabled (presumably running multi-threaded)

Compilation Command:

cmake ..
make -j
./bin/

The profiled latency from different kernels is shown below:

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
CKKSrns_KeyGen                  2561 us         2561 us          277
CKKSrns_MultKeyGen              4580 us         4580 us          153
CKKSrns_EvalAtIndexKeyGen       4880 us         4880 us          143
CKKSrns_Encryption              1623 us         1623 us          431
CKKSrns_Decryption               149 us          149 us         4656
CKKSrns_Add                     24.4 us         24.4 us        28618
CKKSrns_AddInPlace              18.7 us         18.7 us        37505
CKKSrns_MultNoRelin              185 us          185 us         3812
CKKSrns_MultRelin              57316 us        55202 us           12
CKKSrns_Relin                  57121 us        54685 us           10
CKKSrns_RelinInPlace           57108 us        55116 us           12
CKKSrns_Rescale                  453 us          453 us         1547
CKKSrns_RescaleInPlace           447 us          447 us         1564
CKKSrns_EvalAtIndex            56590 us        54281 us           13

Might I ask why the multi-threaded performance becomes worse on Relin, RelinInPlace, EvalAtIndex, etc.?

Best
Jianming
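
To put a number on the gap between the two tables above, here is a minimal sketch that divides the OpenMP-enabled time by the OpenMP-disabled time for a few kernels. The helper name slowdown is made up for illustration; the times (in us) are copied from the tables. With these inputs the four relinearization and rotation kernels come out roughly 20x to 23x slower with OpenMP, while MultNoRelin is faster (ratio below 1).

```shell
# slowdown BASE OMP: print OMP/BASE with one decimal place.
# (Hypothetical helper, not part of OpenFHE or its benchmark suite.)
slowdown() {
  awk -v base="$1" -v omp="$2" 'BEGIN { printf "%.1f\n", omp / base }'
}

# Times in microseconds from the two tables: OpenMP disabled first, enabled second.
echo "MultRelin:    $(slowdown 2770 57316)x"
echo "Relin:        $(slowdown 2531 57121)x"
echo "RelinInPlace: $(slowdown 2623 57108)x"
echo "EvalAtIndex:  $(slowdown 2456 56590)x"
echo "MultNoRelin:  $(slowdown 289 185)x"
```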

Hi @JianmingTong

It does not make sense to me (typically the numbers are close in the environments I ran this in). What environment did you compile/run it in? What OS and what compiler? What version of OpenFHE? Thanks

Thanks for the fast response!

Here is information on the CPU and compilation environment:
CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor

-- Building in Release mode
-- BUILD_UNITTESTS:  ON
-- BUILD_EXAMPLES:   ON
-- BUILD_BENCHMARKS: ON
-- BUILD_EXTRAS:     OFF
-- BUILD_STATIC:     OFF
-- BUILD_SHARED:     ON
-- GIT_SUBMOD_AUTO:  ON
-- WITH_BE2:         OFF
-- WITH_BE4:         OFF
-- WITH_NTL:         OFF
-- WITH_TCM:         OFF
-- WITH_OPENMP:      OFF
-- NATIVE_SIZE:      128
-- CKKS_M_FACTOR:    1
-- WITH_NATIVEOPT:   OFF
-- WITH_COVTEST:     OFF
-- WITH_NOISE_DEBUG: OFF
-- USE_MACPORTS:     OFF
-- BUILTIN_INFO_AVAILABLE is defined
***** INSTALL IS AT /usr/local; to change, run cmake with -DCMAKE_INSTALL_PREFIX=/your/path
-- Architecture is x86_64
-- NATIVEINT is set to 128
-- MATHBACKEND is set to 4
-- MATHBACKEND set to 4. Setting WITH_BE4 to ON
-- Submodule update
Skipping parallel because WITH_OPENMP=OFF
-- Failed to find LLVM FileCheck
-- git version: v1.5.5-14-ge451e50e normalized to 1.5.5.14
-- Version: 1.5.5.14
-- Performing Test HAVE_STD_REGEX -- success
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
-- Performing Test HAVE_POSIX_REGEX -- success
-- Performing Test HAVE_STEADY_CLOCK -- success
-- Configuring done (0.2s)
-- Generating done (0.1s)

For the one with OpenMP enabled, “WITH_OPENMP: ON” is set instead.

Best
Jianming

Just for completeness, can you try it with NATIVE_SIZE=64 (the default configuration), both with and without OpenMP? I just want to check whether this behavior is specific to NATIVE_SIZE=128.

Also, what is the OS? What is the compiler? (cmake shows it the first time you run it after clearing the build directory or running make clean.)
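
The two configurations requested above could be set up roughly as follows. This is only a sketch: configure_cmd is a made-up helper that prints the configure line rather than running it, and the option names NATIVE_SIZE and WITH_OPENMP are taken from the cmake output quoted earlier in the thread.

```shell
# configure_cmd SIZE OMP: print the cmake configure line for one build variant.
# (Hypothetical helper; it only prints the command, it does not run cmake.)
configure_cmd() {
  printf 'cmake -DNATIVE_SIZE=%s -DWITH_OPENMP=%s ..\n' "$1" "$2"
}

# One configure line per variant: OpenMP disabled, then enabled.
for omp in OFF ON; do
  configure_cmd 64 "$omp"
done
```

Running each printed line in its own empty build directory, followed by make, avoids picking up option values that cmake cached from an earlier configuration.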

Thanks so much for the help!

I’m running on an AMD 5950X, with OpenMP both enabled and disabled. I set NATIVE_SIZE to 64, and the profiled results are shown in the table below.

Degree                             65536   65536   65536
OpenMP enabled?                       No      No     Yes
NATIVE_SIZE                           64      64      64
SetScalingModSize                     48      32      32
SetBatchSize                           8      23      23
Multiplication & Relinearization    2933    2649    5153
Relinearization                     2585    2410   53998
Rescale                              459     427     425
Rotation                            2502    2380    5037
Rotation                            1325    1076    1710
Multiplication & Relinearization    7511    6681    8465
KeyGeneration                       1717    1610    1866

The build with OpenMP enabled performs worse than the build with OpenMP disabled.

System information:

AMD Ryzen 9 5950X 16-Core Processor
cmake version 3.28.0-rc4

Best
Jianming
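
One way to narrow this down, not suggested in the thread itself, would be to rerun the OpenMP-enabled build with the thread count capped through the standard OMP_NUM_THREADS environment variable and see how the timings change. The sketch below uses a made-up wrapper, run_with_threads, and a stand-in command that only echoes the setting; the real benchmark binary would be passed in its place (its path is truncated in the posts above, so it is not filled in here).

```shell
# run_with_threads N CMD [ARGS...]: run CMD with OMP_NUM_THREADS set to N.
# (Hypothetical wrapper; OMP_NUM_THREADS itself is a standard OpenMP variable.)
run_with_threads() {
  threads="$1"
  shift
  OMP_NUM_THREADS="$threads" "$@"
}

# Stand-in command that reports the value it sees; replace it with the benchmark binary.
for n in 1 2 4; do
  run_with_threads "$n" sh -c 'echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"'
done
```

If the capped single-thread run were to match the OpenMP-disabled numbers, that would point at threading overhead rather than at the build itself; this is an assumption to test, not a conclusion from the data above.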

Thanks. What OS/version are you using? What compiler/version (g++, clang++, or something else)? Once you provide the information, we will run some experiments to see if we can recreate the anomalous behavior you are reporting.
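
The details requested above can be gathered in one go with something like the following sketch. The function names tool_version and collect_env are made up for illustration, and the list of tools (g++, clang++, cmake) is an assumption about what might be installed; tools that are missing are reported as such.

```shell
# tool_version TOOL: print the first line of "TOOL --version", or note that TOOL is missing.
tool_version() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s: %s\n' "$1" "$("$1" --version 2>&1 | head -n 1)"
  else
    printf '%s: not found\n' "$1"
  fi
}

# collect_env: print kernel, distribution (when available), and tool versions.
collect_env() {
  printf 'kernel: %s\n' "$(uname -sr)"
  # /etc/os-release exists on most Linux distributions; skip it elsewhere.
  if [ -r /etc/os-release ]; then
    printf 'os: %s\n' "$(. /etc/os-release && printf '%s' "$PRETTY_NAME")"
  fi
  for tool in g++ clang++ cmake; do
    tool_version "$tool"
  done
}

collect_env
```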