
Bottleneck-Minimal Indexing for Generative Document Retrieval

Abstract

We apply an information-theoretic perspective to reconsider generative document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR can be considered to involve information transmission from documents $X$ to queries $Q$, with the requirement to transmit more bits via the indexes $T$. By applying Shannon's rate-distortion theory, the optimality of indexing can be analyzed in terms of the mutual information, and the design of the indexes $T$ can then be regarded as a bottleneck in GDR. After reformulating GDR from this perspective, we empirically quantify the bottleneck underlying GDR. Finally, using the NQ320K and MARCO datasets, we evaluate our proposed bottleneck-minimal indexing method in comparison with various previous indexing methods, and we show that it outperforms those methods.
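
As a rough illustration of the quantity the analysis relies on, the mutual information between two discrete variables can be computed directly from a joint probability table. The sketch below is only a toy example and not the paper's indexing method: the function name `mutual_information` and both probability tables are hypothetical, chosen to show the two extremes (indexes that fully determine the document versus indexes that carry no information about it).

```python
import math


def mutual_information(joint):
    """Mutual information in bits for a joint probability table joint[i][j].

    Hypothetical helper for illustration; rows could stand for documents
    and columns for index values.
    """
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # zero-probability cells contribute nothing
                mi += p * math.log2(p / (row_marginal[i] * col_marginal[j]))
    return mi


# Toy assumption: each document has its own index value.
one_to_one = [[0.5, 0.0], [0.0, 0.5]]
# Toy assumption: the index value is independent of the document.
independent = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information(one_to_one))
print(mutual_information(independent))
```

In the first table the index carries one full bit about the document; in the second it carries none. Rate-distortion analysis is concerned with trade-offs between these extremes.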
