Deep visual features often lie on nonlinear manifolds, where Euclidean distance can misrepresent true similarity. This mismatch is particularly detrimental to prototype-based interpretable fine-grained recognition, where subtle semantic distinctions are crucial. To mitigate it, we present GeoProto, a prototype-based recognition paradigm that grounds similarity in the intrinsic geometry of deep features. Concretely, we distill the latent manifold structure of each class into a diffusion space and, critically, devise a differentiable Nyström interpolation that makes this geometry accessible to both unseen samples and learnable prototypes. To maintain efficiency, we use compact per-class landmark sets that are updated periodically, which keeps the embedding synchronized with the evolving backbone and enables fast inference at scale. Comprehensive experiments on two benchmark datasets show that GeoProto yields prototypes that focus on semantically corresponding parts and significantly outperforms Euclidean prototype networks.
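To make the idea of a diffusion space with Nyström interpolation concrete, the following is a minimal NumPy sketch of a standard diffusion map fitted on a small landmark set, together with the classical Nyström out-of-sample extension. It is not the paper's implementation: the Gaussian kernel, the bandwidth `eps`, the diffusion time `t`, and all function names are illustrative assumptions. The extension uses only smooth operations on the query, so the same computation written in an autodiff framework would presumably be differentiable with respect to a query or a prototype.

```python
import numpy as np

def gaussian_affinity(A, B, eps):
    """Pairwise Gaussian kernel exp(-||a - b||^2 / eps); kernel choice is an assumption."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / eps)

def fit_diffusion_map(landmarks, eps, n_components, t=1):
    """Fit a diffusion map on one (hypothetical) per-class landmark set."""
    K = gaussian_affinity(landmarks, landmarks, eps)
    d = K.sum(axis=1)
    # Symmetric conjugate of the Markov matrix P = D^{-1} K, for a stable eigh.
    S = K / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(S)
    # Sort descending and drop the trivial top eigenpair (eigenvalue 1).
    order = np.argsort(evals)[::-1][1:n_components + 1]
    lam = evals[order]
    # Right eigenvectors of P are D^{-1/2} times the eigenvectors of S.
    psi = evecs[:, order] / np.sqrt(d)[:, None]
    return {"landmarks": landmarks, "eps": eps, "lam": lam, "psi": psi, "t": t}

def landmark_embedding(model):
    """Diffusion coordinates lam^t * psi of the landmarks themselves."""
    return model["psi"] * model["lam"] ** model["t"]

def nystrom_embed(model, queries):
    """Nystrom extension: psi(y) = (1 / lam) * sum_j p(y, x_j) * psi(x_j)."""
    K = gaussian_affinity(queries, model["landmarks"], model["eps"])
    P = K / K.sum(axis=1, keepdims=True)  # transition probabilities to landmarks
    psi_q = (P @ model["psi"]) / model["lam"]
    return psi_q * model["lam"] ** model["t"]
```

A useful sanity check on this sketch is that embedding the landmarks through `nystrom_embed` reproduces `landmark_embedding`, since the extension formula reduces to the eigenvector equation at the landmark points. Under these assumptions, a query and a prototype would then be compared by distance in the embedded space rather than in the raw feature space.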
