LDL^T L-Lipschitz Network Weight Parameterization Initialization
We analyze initialization dynamics for LDL^T-based L-Lipschitz layers by deriving the exact marginal output variance when the underlying parameter matrix is initialized with i.i.d. Gaussian entries. The Wishart distribution used for computing the marginal output variance is derived in closed form using expectations of zonal polynomials via James' theorem and a Laplace-integral expansion. We develop an Isserlis/Wick-based combinatorial expansion and provide explicit truncated moments, which yield accurate series approximations in the small-to-moderate regime. Monte Carlo experiments confirm the theoretical estimates. Furthermore, an empirical analysis quantifies the output variance under standard He (Kaiming) initialization and shows that the new parameterization, with a suitably chosen scale, preserves the output variance. The findings clarify why deep L-Lipschitz networks suffer rapid information loss at initialization and offer practical prescriptions for choosing initialization hyperparameters to mitigate this effect. Finally, to validate the results on real-world data, a hyperparameter sweep over optimizers, initialization scale, and depth was conducted on the Higgs boson classification dataset; although the derivation ensures variance preservation, the empirical results indicate that He initialization still performs better.
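The abstract's variance claims can be illustrated with a small Monte Carlo check of the standard He-initialization baseline it compares against. This is only a sketch under simple assumptions (a single ReLU layer, unit-variance Gaussian inputs, square weight matrix); the LDL^T parameterization itself is not specified in the abstract, so it is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256          # layer width (fan-in = fan-out = n, an illustrative choice)
trials = 2000    # independent (W, x) draws

second_moments = []
for _ in range(trials):
    # He initialization: W_ij ~ N(0, 2/n), so pre-activations have variance 2
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
    x = rng.normal(0.0, 1.0, size=n)    # unit-variance input
    y = np.maximum(W @ x, 0.0)          # ReLU activation
    second_moments.append(np.mean(y**2))

# For z ~ N(0, 2), E[relu(z)^2] = 2/2 = 1, i.e. the second moment is preserved
est = float(np.mean(second_moments))
print(f"E[y^2] ~= {est:.3f}  (He theory predicts 1.0)")
```

The estimate should land close to 1.0, matching the He prescription that the factor 2/n compensates for the halving of the second moment by ReLU.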