An Efficient and Optimal Method for Sparse Canonical Correlation Analysis
Canonical correlation analysis (CCA) is an important multivariate technique for exploring the relationship between two sets of variables, with applications in many fields. This paper considers the problem of estimating the subspaces spanned by sparse leading canonical correlation directions when the ambient dimensions are high. We propose a computationally efficient two-stage estimation procedure consisting of an initialization stage based on convex programming and a refinement stage based on the group Lasso. Moreover, we show that for data generated from sub-Gaussian distributions, our approach achieves optimal rates of convergence under mild conditions, by deriving both error bounds for the proposed estimator and matching minimax lower bounds. In particular, computing the estimator does not involve estimating the marginal covariance matrices of the two sets of variables, and its minimax rate optimality requires no structural assumption on the marginal covariance matrices as long as they are well conditioned. We also present encouraging numerical results on simulated data sets. The practical usefulness of the method is demonstrated by an application to a breast cancer data set.
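For orientation, the following is a minimal sketch of classical (non-sparse) CCA computed by whitening and a singular value decomposition. It is not the two-stage procedure proposed in the paper: unlike that procedure, it explicitly forms and inverts the sample marginal covariance matrices, which is only feasible when the sample size exceeds both dimensions. The function name `classical_cca` and the synthetic single-factor model with sparse loadings `u` and `v` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def classical_cca(X, Y, r):
    """Leading r sample canonical directions via whitening + SVD.

    Illustrative baseline only; requires n > max(p, q) so that the
    sample covariance matrices are invertible.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sx = Xc.T @ Xc / n          # sample covariance of X
    Sy = Yc.T @ Yc / n          # sample covariance of Y
    Sxy = Xc.T @ Yc / n         # sample cross-covariance

    def inv_sqrt(S):
        # Inverse symmetric square root via eigendecomposition.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Sx_is, Sy_is = inv_sqrt(Sx), inv_sqrt(Sy)
    A, s, Bt = np.linalg.svd(Sx_is @ Sxy @ Sy_is)
    U = Sx_is @ A[:, :r]        # directions for X, with U.T @ Sx @ U = I
    V = Sy_is @ Bt[:r].T        # directions for Y, with V.T @ Sy @ V = I
    return U, V, s[:r]          # s holds the sample canonical correlations

# Hypothetical synthetic data: one shared latent factor, sparse loadings.
rng = np.random.default_rng(0)
n, p, q = 2000, 8, 6
u = np.zeros(p); u[:2] = [1.0, -1.0]
v = np.zeros(q); v[:2] = [1.0, 1.0]
z = rng.standard_normal(n)
X = np.outer(z, u) + rng.standard_normal((n, p))
Y = np.outer(z, v) + rng.standard_normal((n, q))

U, V, rho = classical_cca(X, Y, r=2)
print("canonical correlations:", rho)
print("leading X direction:", np.round(U[:, 0], 2))
```

In this toy model the population leading canonical direction for X is proportional to `u`, so its support is the first two coordinates. The classical estimate is dense, with small but nonzero entries elsewhere, which is the gap that sparse CCA methods aim to close in high dimensions.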