LDLT L-Lipschitz Network Weight Parameterization Initialization

Marius F. R. Juston
Ramavarapu S. Sreenivas
Dustin Nottage
Ahmet Soylemezoglu
Main: 8 pages
17 figures
Bibliography: 2 pages
Appendix: 2 pages
Abstract

We analyze initialization dynamics for LDLT-based $\mathcal{L}$-Lipschitz layers by deriving the exact marginal output variance when the underlying parameter matrix $W_0\in\mathbb{R}^{m\times n}$ is initialized with IID Gaussian entries $\mathcal{N}(0,\sigma^2)$. The Wishart distribution $S=W_0W_0^\top\sim\mathcal{W}_m(n,\sigma^2\boldsymbol{I}_m)$ used for computing the marginal output variance is derived in closed form using expectations of zonal polynomials via James' theorem and a Laplace-integral expansion of $(\alpha\boldsymbol{I}_m+S)^{-1}$. We develop an Isserlis/Wick-based combinatorial expansion for $\mathbb{E}\left[\operatorname{tr}(S^k)\right]$ and provide explicit truncated moments up to $k=10$, which yield accurate series approximations for small-to-moderate $\sigma^2$. Monte Carlo experiments confirm the theoretical estimates. Furthermore, an empirical analysis quantifies that, under the standard He (Kaiming) initialization with scaling $1/\sqrt{n}$, the output variance is $0.41$, whereas the new parameterization with scaling $10/\sqrt{n}$ at $\alpha=1$ yields an output variance of $0.9$. These findings clarify why deep $\mathcal{L}$-Lipschitz networks suffer rapid information loss at initialization and offer practical prescriptions for choosing initialization hyperparameters to mitigate this effect. Finally, a hyperparameter sweep over optimizers, initialization scale, and depth on the Higgs boson classification dataset validates the results on real-world data: although the derivation ensures variance preservation, He initialization still performs better empirically.
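The series approximations above rest on moments $\mathbb{E}[\operatorname{tr}(S^k)]$ of the Wishart matrix $S=W_0W_0^\top$. The two lowest-order cases are standard identities, $\mathbb{E}[\operatorname{tr}(S)]=mn\sigma^2$ and $\mathbb{E}[\operatorname{tr}(S^2)]=mn(m+n+1)\sigma^4$, and can be checked by Monte Carlo in the same spirit as the paper's experiments. The sketch below is illustrative only (the dimensions $m$, $n$, the scale $\sigma^2$, and the sample count are arbitrary choices, not taken from the paper), and it covers only $k=1,2$ of the expansion carried to $k=10$ in the text.

```python
import numpy as np

# Monte Carlo check of two standard Wishart trace moments. For
# S = W0 W0^T with W0 in R^{m x n} having iid N(0, sigma2) entries,
# so that S ~ W_m(n, sigma2 * I_m), classical results give
#   E[tr(S)]   = m * n * sigma2
#   E[tr(S^2)] = m * n * (m + n + 1) * sigma2^2

def mc_trace_moments(m, n, sigma2, num_samples=20_000, seed=0):
    rng = np.random.default_rng(seed)
    # Batch of num_samples parameter matrices W0 with N(0, sigma2) entries.
    W = rng.normal(0.0, np.sqrt(sigma2), size=(num_samples, m, n))
    S = W @ W.transpose(0, 2, 1)              # batch of Wishart samples
    tr1 = np.trace(S, axis1=1, axis2=2)       # tr(S) per sample
    tr2 = np.trace(S @ S, axis1=1, axis2=2)   # tr(S^2) per sample
    return tr1.mean(), tr2.mean()

m, n, sigma2 = 4, 6, 0.5                      # illustrative dimensions/scale
mc1, mc2 = mc_trace_moments(m, n, sigma2)
exact1 = m * n * sigma2                       # = 12.0
exact2 = m * n * (m + n + 1) * sigma2**2      # = 66.0
```

With 20,000 samples the empirical means land within a fraction of a percent of the closed forms; the same template extends to higher $k$ by comparing against the truncated moments tabulated in the paper.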
