We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size s is a basic operation. In the -TCU model, we show that for inputs of size , the algorithm has depth at most and runs in time assuming tensor core units. Equivalently, the algorithm performs multiplications of square matrices of size s.
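To illustrate how a scan can be phrased in terms of small matrix products, the following is a minimal sequential sketch, not the algorithm analyzed in the paper. It assumes that multiplying a row of s values by an s x s upper-triangular all-ones matrix yields that row's inclusive prefix sums, then recursively scans the row totals to obtain offsets. The names `matmul` and `tcu_scan` are hypothetical, `matmul` merely stands in for one tensor core operation, and the final offset addition is done with plain additions for simplicity.

```python
def matmul(a, b):
    """Multiply two s x s matrices (stand-in for one tensor core operation)."""
    s = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(s)) for j in range(s)]
            for i in range(s)]


def tcu_scan(x, s=4):
    """Inclusive prefix sum of x, using s x s matrix products for local scans.

    Illustrative sketch only; assumes s >= 2.
    """
    n = len(x)
    if n == 0:
        return []
    # Upper-triangular all-ones matrix: (row @ U)[j] = row[0] + ... + row[j].
    U = [[1 if i <= j else 0 for j in range(s)] for i in range(s)]
    if n <= s:
        # Base case: place x in the first row of a single zero-padded tile.
        A = [list(x) + [0] * (s - n)] + [[0] * s for _ in range(s - 1)]
        return matmul(A, U)[0][:n]
    # Pad with zeros so the data fills whole s x s tiles.
    tile = s * s
    padded = list(x) + [0] * (-n % tile)
    rows = []
    for start in range(0, len(padded), tile):
        A = [padded[start + r * s:start + (r + 1) * s] for r in range(s)]
        rows.extend(matmul(A, U))  # each row becomes its own local scan
    # The last entry of each scanned row is that row's total.
    totals = [row[-1] for row in rows]
    # Recursively scan the totals; shifting by one gives each row's offset.
    offsets = [0] + tcu_scan(totals, s)[:-1]
    out = [v + off for row, off in zip(rows, offsets) for v in row]
    return out[:n]


print(tcu_scan([1, 2, 3, 4, 5], s=2))  # [1, 3, 6, 10, 15]
```

Each recursion level shrinks the input by roughly a factor of s, so the number of levels grows logarithmically in the input size with base s. In this sketch the tiles within a level are independent of one another and could be processed by separate units.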