We introduce a deep learning model that can universally approximate regular conditional distributions (RCDs). The proposed model operates in three phases: first, it linearizes inputs from a given metric space $\mathcal{X}$ into Euclidean features via a feature map; next, a deep feedforward neural network processes these linearized features; finally, the network's outputs are transformed to the $1$-Wasserstein space $\mathcal{P}_1(\mathcal{Y})$ over the output space $\mathcal{Y}$ via a probabilistic extension of the attention mechanism of Bahdanau et al.\ (2014). Our model, called the \textit{probabilistic transformer (PT)}, can approximate any continuous function from $\mathcal{X}$ to $\mathcal{P}_1(\mathcal{Y})$ uniformly on compact sets, with quantitative approximation rates. We identify two ways in which the PT avoids the curse of dimensionality when approximating $\mathcal{P}_1(\mathcal{Y})$-valued functions. The first strategy constructs functions in $C(\mathcal{X},\mathcal{P}_1(\mathcal{Y}))$ which can be efficiently approximated by a PT, uniformly on any given compact subset of $\mathcal{X}$. In the second approach, given any function $f$ in $C(\mathcal{X},\mathcal{P}_1(\mathcal{Y}))$, we construct compact subsets of $\mathcal{X}$ whereon $f$ can be efficiently approximated by a PT.
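The three-phase pipeline can be illustrated with a minimal sketch. It assumes (as one common way to realize "probabilistic attention") that the output measure is represented as a finitely supported mixture $\sum_k w_k \delta_{y_k}$, with attention weights $w$ given by a softmax over the network's scores and fixed atoms $y_k$; the function names, the toy one-hidden-layer network, and the identity feature map below are hypothetical placeholders, not the paper's exact construction.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def probabilistic_transformer(x, feature_map, mlp, atoms):
    """Map an input x to a probability measure, represented as
    (weights, atoms), i.e. the mixture sum_k weights[k] * delta_{atoms[k]}."""
    z = feature_map(x)            # phase 1: linearize the input into Euclidean features
    scores = mlp(z)               # phase 2: deep feedforward network on the features
    weights = softmax(scores)     # phase 3: attention weights on the probability simplex
    return weights, atoms

# Toy instantiation (all components are illustrative assumptions):
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 3)), rng.normal(size=(5, 16))
feature_map = lambda x: np.asarray(x, dtype=float)       # identity feature map on R^3
mlp = lambda z: W2 @ np.maximum(W1 @ z, 0.0)             # one hidden ReLU layer
atoms = rng.normal(size=(5, 2))                          # fixed atoms y_1, ..., y_5 in R^2

weights, ys = probabilistic_transformer([0.1, -0.4, 0.7], feature_map, mlp, atoms)
print(weights.sum())  # the weights sum to 1, so the output is a probability measure
```

In this sketch only the network weights would be trained; the atoms are held fixed, which is one simple way to obtain outputs lying in the $1$-Wasserstein space.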