OpenFHE Single thread


I am trying to use OpenFHE with a single core and a single thread. I’m studying computer architecture, and I want to use this library as an object of study.
I prefer to start with a single thread, where the report of a tool like perf will be easier to understand at first.
Is this possible?

I tried to build with the CMake flag -DWITH_OPENMP=OFF, but I got the same result…

The easiest way is to set OMP_NUM_THREADS=1 before running your binaries (it is a runtime environment variable). The syntax is `OMP_NUM_THREADS=1 ./your-binary`.


Please see the openfheorg/openfhe-development repository on GitHub for more details on benchmarking (if this is one of the goals of your study).


@mmazz it would be very helpful to us if you could share a short paragraph about your motivation for studying OpenFHE and what you are trying to do with it.


Thanks @ypolyakov, but is there a way to avoid OpenMP entirely? The flag I mentioned earlier didn’t seem to work. I want the most direct and clear assembly code.

@Caesar, I’m a physicist starting a PhD in computer science, in particular computer architecture. My adviser and I chose FHE as our main application.
We will try to find a particular application (autonomous vehicles, for example) that may use FHE and could be accelerated by hardware.
So my first step is to study OpenFHE (possibly focusing on CKKS to start with): how it is implemented, how it performs,
where CPU cycles are wasted, where the possible hot spots are to dig into, etc.
For now we only have a vague idea of what we want (I’m new to both architecture and FHE…), but in the near future it should become clearer.


Yes, you can pass -DWITH_OPENMP=OFF to completely disable OpenMP during compilation.


Hi @mmazz, not to do your homework for you, but I will share some insights from more than 10 years of getting FHE to run efficiently and of profiling heavily templated C++ code.

  1. The core bottleneck of fully HE applications is bootstrapping. Bootstrapping is extremely complex, and I would not start there.
  2. For applications without bootstrapping, and there are many, we set the multiplicative level so that any single chain of multiplications through the algorithm fits within that level. In these cases the bottleneck is key switching (used after vector multiplications or vector rotations). Hybrid Key Switching (HKS) is the best performer for CPU-based systems, and it is what we use in our CKKS/BGV/BFV family of schemes (if you want to work with straight Booleans, the DM family of schemes – TFHE/FHEW – is best; there will be more published on this soon).
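For point 2, choosing a multiplicative depth that matches the deepest multiplication chain, together with hybrid key switching, looks roughly like this with OpenFHE's CKKS parameter API. This is a sketch based on the library's published examples; the parameter values are placeholders, not recommendations:

```cpp
#include "openfhe.h"

using namespace lbcrypto;

// Sketch: configure CKKS so the deepest chain of multiplications in the
// algorithm fits within the multiplicative depth (no bootstrapping needed).
CryptoContext<DCRTPoly> MakeContext() {
    CCParams<CryptoContextCKKSRNS> parameters;
    parameters.SetMultiplicativeDepth(3);      // placeholder depth
    parameters.SetScalingModSize(50);          // placeholder CKKS scale bits
    parameters.SetKeySwitchTechnique(HYBRID);  // Hybrid Key Switching (HKS)

    CryptoContext<DCRTPoly> cc = GenCryptoContext(parameters);
    cc->Enable(PKE);
    cc->Enable(LEVELEDSHE);
    return cc;
}
```

This is a configuration fragment, not a runnable program; see the library's simple CKKS examples for complete key generation, encryption, and evaluation code.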

HKS is still complex. Understanding HKS and where its bottlenecks are is a reasonable target.
Within HKS, most operations are vectorized modular arithmetic. With large ring sizes (which are required for higher levels of security, and in CKKS for higher-precision operations such as machine learning training), the bottleneck is the NTT operation, which is strikingly similar in construction to the FFT.

If you simply used a profiler at any level of application code, you would see the NTT as the main bottleneck – but that doesn’t really show the whole picture: there are large data arrays in use, so cache effects also come into play.

FHE in general is what I would call a “high performance computing” problem. There is a very large amount of parallelism but there is also a large amount of data in the form of ciphertexts and keys.

One way to get a feel for the bottlenecks is to dive into the benchmark/src directory and look at the README there. It goes into detail on how to build the benchmarks and how the files are organized.

Hope this helps.

Dave Cousins, Director, Duality Labs.


Thanks! I don’t know why, but when I deleted the whole repository, cloned it, and compiled again with the flag, it worked! Thanks again!

@dcousins Hello! Thanks for your advice! I will take it into consideration! This is all very new but exciting for me. I haven’t gotten into HKS yet, but now I will.

I will check the benchmark README. For now, I made some runs with perf and saw that the implementation of the Chinese Remainder Theorem is one of the things that takes most of the work. My task for today is to understand how CKKS uses this theorem (that is my current level of understanding of FHE…).

Can you name the procedure that took most of the runtime (exact name)? The Chinese Remainder Theorem (CRT) should not be a performance bottleneck, except for interpolation (which you do not really need in FHE computation).


At least for the example simple-ckks-bootstrapping.cpp, the procedure:

intnat::ChineseRemainderTransformFTTNat<intnat::NativeVectorT<intnat::NativeIntegerT > >::ForwardTransformToBitReverseInPlace(intnat::NativeIntegerT const&, unsigned int, intnat::NativeVectorT<intnat::NativeIntegerT >*)

takes almost 33% of the work. It is a simple case with only bootstrapping, so it may be fixed work that gets diluted as more operations are added. For now I haven’t looked into it further. I’m stuck trying to get hotspot (a perf GUI) to resolve the locations of some symbols.

That is the NTT operation @dcousins mentioned. Your findings align with our expectations.