Scaling method vs EvalAdd performance

I am digging into some performance profiling for OpenFHE, and I noticed that, for the existing lib-benchmark.cpp, changing the scaling method from FIXEDMANUAL to FLEXIBLEAUTOEXT (which is the default choice) causes the runtime of EvalAdd to nearly triple.

diff --git a/benchmark/src/lib-benchmark.cpp b/benchmark/src/lib-benchmark.cpp
index b92dda15..a49ea70b 100644
--- a/benchmark/src/lib-benchmark.cpp
+++ b/benchmark/src/lib-benchmark.cpp
@@ -77,7 +77,7 @@ using namespace lbcrypto;
     CCParams<CryptoContextCKKSRNS> parameters;
     parameters.SetScalingModSize(48);
     parameters.SetBatchSize(8);
-    parameters.SetScalingTechnique(FIXEDMANUAL);
+    parameters.SetScalingTechnique(FLEXIBLEAUTOEXT);
     parameters.SetMultiplicativeDepth(mdepth);
     auto cc = GenCryptoContext(parameters);
     cc->Enable(PKE);

With this change, the benchmark reports:

CKKSrns_Add 82.5 us 82.4 us 7973

Whereas without this change:

CKKSrns_Add 31.3 us 31.3 us 21808

Is there an easy way to understand why the scaling method affects the performance here? I am bringing up EvalAdd because it is the simplest example of a nontrivial performance change I am seeing when using the default parameters versus the ones configured in the benchmark file.

For reference, I am building with

cmake -DCMAKE_BUILD_TYPE=Release -DWITH_NTL=OFF -DWITH_TCM=OFF -DMATHBACKEND=4 -DWITH_NATIVEOPT=OFF -DNATIVE_SIZE=64 -DBUILD_BENCHMARKS=ON -DBUILD_EXAMPLES=OFF -DBUILD_UNITTESTS=OFF ..

And my platform details are:

-- The C compiler identification is Clang 18.1.8
-- The CXX compiler identification is Clang 18.1.8
-- Architecture is x86_64
-- Found OpenMP_C: -fopenmp=libomp (found version "5.1")
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
(etc., ask if you need more info here)

I tried the same thing with Clang 19.1.7, just to see if the compiler version was the issue, but the values are roughly the same.

At first I thought it was purely because the (automatically selected) ring dimension was different in the two cases (8192 vs 16384) but even if I use HEstd_NotSet and fix the ring dimensions to be the same, I still see a 2x+ slowdown with the FLEXIBLEAUTOEXT scaling.

With FLEXIBLE* modes, there is extra logic that adjusts the scale of the input ciphertexts before the addition is performed. This is needed for cases where the input ciphertexts are at different levels and hence have different scaling factors associated with them. See pages 17 and 18 of ePrint 2020/1118 for more details.
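To make the cost difference concrete, here is a toy sketch of what "adjust the scale, then add" means for a single RNS limb. It is not OpenFHE code: the names Limb, MulMod, AddPlain and AddAdjusted are hypothetical, the adjustment factor is just an arbitrary scalar, and it assumes a modulus below 2^63 and a compiler with unsigned __int128 (GCC or Clang). It only illustrates that an adjusted addition performs an extra modular multiplication per coefficient on top of the modular addition; it does not claim to reproduce the exact code path or the exact slowdown seen in the benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model (hypothetical, not OpenFHE internals): one RNS limb of a
// ciphertext polynomial, stored as residues modulo q.
using Limb = std::vector<uint64_t>;

// (a * b) mod q using a 128-bit intermediate (GCC/Clang extension).
uint64_t MulMod(uint64_t a, uint64_t b, uint64_t q) {
    return static_cast<uint64_t>((static_cast<unsigned __int128>(a) * b) % q);
}

// Plain addition: one modular add per coefficient.
// Assumes all inputs are already reduced modulo q and q < 2^63.
Limb AddPlain(const Limb& a, const Limb& b, uint64_t q) {
    assert(a.size() == b.size());
    Limb out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        uint64_t s = a[i] + b[i];
        out[i] = (s >= q) ? s - q : s;
    }
    return out;
}

// Addition with a scale adjustment applied to the first operand:
// one extra modular multiplication per coefficient, then the same add.
Limb AddAdjusted(const Limb& a, const Limb& b, uint64_t adjust, uint64_t q) {
    assert(a.size() == b.size());
    Limb scaled(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        scaled[i] = MulMod(a[i], adjust % q, q);
    }
    return AddPlain(scaled, b, q);
}
```

In this model, timing AddPlain against AddAdjusted on limbs of the same length gives a rough feel for how much a per-coefficient multiplication adds relative to a bare addition. Any check the library performs to decide whether an adjustment is needed would be additional overhead that this sketch does not model.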