Sign-Full Random Projections

Abstract

The method of 1-bit ("sign-sign") random projections has been a popular tool for efficient search and machine learning on large datasets. Given two $D$-dimensional data vectors $u, v \in \mathbb{R}^D$, one can generate $x = \sum_{i=1}^D u_i r_i$ and $y = \sum_{i=1}^D v_i r_i$, where $r_i \sim N(0,1)$ iid. The "collision probability" is $\Pr\left(\mathrm{sgn}(x)=\mathrm{sgn}(y)\right) = 1-\frac{\cos^{-1}\rho}{\pi}$, where $\rho = \rho(u,v)$ is the cosine similarity. We develop "sign-full" random projections by estimating $\rho$ from (e.g.) the expectation $E\left(\mathrm{sgn}(x)\,y\right)=\sqrt{\frac{2}{\pi}}\,\rho$, which can be further substantially improved by normalizing $y$. For nonnegative data, we recommend an interesting estimator based on $E\left(y_-\, 1_{x\geq 0} + y_+\, 1_{x<0}\right)$ and its normalized version. The recommended estimator almost matches the accuracy of the (computationally expensive) maximum likelihood estimator. At high similarity ($\rho \rightarrow 1$), the asymptotic variance of the recommended estimator is only $\frac{4}{3\pi} \approx 0.4$ times that of the estimator for sign-sign projections. At small $k$ (the number of projections) and high similarity, the improvement is even more substantial.
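As an illustrative sketch (not the paper's code), the sign-sign and basic sign-full estimators from the abstract can be compared with a short NumPy simulation. The setup, variable names, and parameter choices ($D$, $k$, the construction of $u$ and $v$) are our own; the two estimators follow the stated formulas, with $u$ and $v$ taken to be unit-norm so that $E(\mathrm{sgn}(x)y)=\sqrt{2/\pi}\,\rho$ holds directly:

```python
import numpy as np

rng = np.random.default_rng(0)
D, k, rho = 100, 200_000, 0.9  # dimension, number of projections, target similarity

# Construct unit vectors u, v with exact cosine similarity rho:
# v = rho*u + sqrt(1 - rho^2)*w, where w is a unit vector orthogonal to u.
u = rng.standard_normal(D)
u /= np.linalg.norm(u)
w = rng.standard_normal(D)
w -= (w @ u) * u
w /= np.linalg.norm(w)
v = rho * u + np.sqrt(1.0 - rho**2) * w

# k Gaussian projections: x_j = <u, r_j>, y_j = <v, r_j>, r_j ~ N(0, I_D).
R = rng.standard_normal((k, D))
x, y = R @ u, R @ v

# Sign-sign (1-bit) estimator: invert Pr(sgn(x)=sgn(y)) = 1 - acos(rho)/pi.
collision = np.mean(np.sign(x) == np.sign(y))
rho_signsign = np.cos(np.pi * (1.0 - collision))

# Sign-full estimator: E[sgn(x) y] = sqrt(2/pi) * rho for unit-norm u, v.
rho_signfull = np.sqrt(np.pi / 2.0) * np.mean(np.sign(x) * y)

print(f"sign-sign: {rho_signsign:.4f}, sign-full: {rho_signfull:.4f}")
```

With $k$ this large both estimates land close to the true $\rho=0.9$; the paper's point is the variance gap between such estimators, which shows up clearly when $k$ is small.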
