Hello everyone, I have two questions regarding logistic regression training using OpenFHE.
First, what is the reason for using Nesterov Accelerated Gradient Descent?
Second, papers on polynomially approximated nonlinear activations mention ‘training instabilities’. When using the OpenFHE examples, I didn’t experience any instabilities in my loss curves, even for 2nd-degree approximations to ReLU. How can I observe the instabilities they mention?
Hi seyda,
Nesterov’s accelerated gradient descent achieves the optimal O(1/k^2) convergence rate for first-order algorithms (i.e., algorithms that use only gradient information) on smooth convex problems. For strongly convex and smooth functions, the learning rate and momentum can be set in an optimal way to achieve the best possible rate for that class.
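For concreteness, here is a minimal plaintext sketch of the NAG update for logistic regression, written in NumPy rather than OpenFHE; the function names and the fixed learning-rate/momentum values are just placeholders, not the settings used in the OpenFHE example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nag_logistic(X, y, lr=0.1, momentum=0.9, iters=200):
    """Nesterov accelerated gradient descent on the logistic loss (plaintext sketch)."""
    n, d = X.shape
    w = np.zeros(d)   # current iterate
    v = np.zeros(d)   # velocity (momentum) term
    for _ in range(iters):
        lookahead = w + momentum * v                   # gradient is evaluated at the look-ahead point
        grad = X.T @ (sigmoid(X @ lookahead) - y) / n  # logistic-loss gradient, labels y in {0, 1}
        v = momentum * v - lr * grad
        w = w + v
    return w
```

The look-ahead step is what distinguishes NAG from plain momentum: the gradient is evaluated at w + momentum * v instead of at w.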
Training instabilities can refer to many things: numerical instability due to underflow or overflow, convergence issues because the gradient descent zig-zags (still related to numerical issues), or because the approximation is applied in one place but the gradient is taken with respect to the original function (or is approximated separately), etc. What kind of instabilities are you referring to? As long as the learning steps do not become too small or too large, and the functions remain convex, you should not encounter a lack of convergence (though with approximations the converged solution might not reach good accuracy).
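To make the last point concrete: if a low-degree polynomial stands in for the sigmoid inside the gradient formula, the resulting update is only an approximation of the true logistic-loss gradient, and the discrepancy grows quickly once the inputs leave the interval the polynomial was fitted on. A toy NumPy sketch (the fitting interval and degree below are arbitrary illustrative choices, not the ones used in OpenFHE):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Least-squares polynomial fit of the sigmoid on a fixed interval (illustrative values).
FIT_RANGE, DEGREE = 8.0, 3
zs = np.linspace(-FIT_RANGE, FIT_RANGE, 1000)
poly = np.poly1d(np.polyfit(zs, sigmoid(zs), DEGREE))

def grad_exact(X, y, w):
    """Exact logistic-loss gradient, labels y in {0, 1}."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def grad_approx(X, y, w):
    """Same formula, but with the polynomial standing in for the sigmoid."""
    return X.T @ (poly(X @ w) - y) / len(y)

# The two gradients agree closely while X @ w stays inside [-FIT_RANGE, FIT_RANGE],
# but diverge (the polynomial itself blows up) once the margins drift outside it.
```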
Thank you so much for your help. Do we prefer NAG only for faster convergence, or can it also improve the accuracy?
I mean the instabilities that occur only during encrypted training (apart from the numerical issues that arise in plaintext training as well). Do you know of any source that formally discusses the subject?
I have a hard time understanding this. Don’t we use weight regularization or dropout-style techniques so as not to overfit the data? Then why don’t coarse approximations also help (to some extent) to decrease the loss further? Do they fail when the gradients fall outside of the approximation range?
NAG has faster convergence; it will converge to the global optimum of the problem (if the problem is strongly convex). This global optimum will depend on, e.g., the chosen regularization parameter, which in turn impacts the accuracy.
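For concreteness (with labels $y_i \in \{0,1\}$ and sigmoid $\sigma$), the trained weights are the minimizer of the regularized objective, so changing the regularization parameter $\lambda$ changes which optimum you converge to:

$$
w^\star(\lambda) = \arg\min_{w} \; \frac{1}{n}\sum_{i=1}^{n}\Bigl[-y_i\log\sigma(x_i^\top w) - (1-y_i)\log\bigl(1-\sigma(x_i^\top w)\bigr)\Bigr] + \frac{\lambda}{2}\lVert w\rVert_2^2
$$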
These instabilities that happen only over encrypted inference/training might refer to the encrypted values falling outside of the pre-determined range as the computation gets deeper (https://arxiv.org/pdf/2107.12342 discusses some of the issues with replacing non-polynomial activation functions by polynomial approximations). This is always an issue when working with polynomial approximations. In principle, it can be mitigated by normalizing the encrypted data, but that of course comes with its own approximation if you want to divide by the encrypted norm rather than simply scale by a small public constant.
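As a rough plaintext sketch of the public-constant variant (the scaling heuristic, the assumed weight-norm bound, and the function name below are mine, not OpenFHE API): scale the features before encryption so that the inner products are expected to stay inside the interval the polynomial was fitted on.

```python
import numpy as np

def scale_features_to_range(X, fit_range=8.0, max_weight_norm=4.0):
    """Scale (public) features so that |x_i . w| is expected to stay within the
    polynomial's fitting interval; max_weight_norm is an assumed bound on ||w||."""
    row_norms = np.linalg.norm(X, axis=1)
    worst_case = max(row_norms.max() * max_weight_norm, 1e-12)  # Cauchy-Schwarz bound on |x_i . w|
    scale = min(1.0, fit_range / worst_case)  # a small public constant
    return X * scale, scale
```

In the encrypted setting this constant is public, so applying it is just a plaintext multiplication; dividing by an encrypted norm instead would itself require a polynomial approximation of the inverse.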
Finally, the ML community has converged to activation functions such as ReLU, sigmoid, or softmax because these have properties that give the best accuracy; tweaking them into coarse approximations will impact the accuracy (see, e.g., “Activation Function in Neural Networks: Sigmoid, Tanh, ReLU, Leaky ReLU, Parametric ReLU, ELU, Softmax, GeLU” on Medium).