Generalizable Audio Deepfake Detection via Hierarchical Structure Learning and Feature Whitening in Poincaré sphere
Audio deepfake detection (ADD) faces critical generalization challenges due to the diversity of real-world spoofing attacks and domain variations. Existing methods rely primarily on Euclidean distances and therefore fail to adequately capture the intrinsic hierarchical structures associated with attack categories and domain factors. To address these issues, we design a novel framework, Poin-HierNet, that constructs domain-invariant hierarchical representations in the Poincaré sphere. Poin-HierNet includes three key components: 1) Poincaré Prototype Learning (PPL), which uses several data prototypes to align sample features and capture multilevel hierarchies beyond human labels; 2) Hierarchical Structure Learning (HSL), which leverages top prototypes to establish a tree-like hierarchical structure from the data prototypes; and 3) Poincaré Feature Whitening (PFW), which enhances domain invariance by applying feature whitening to suppress domain-sensitive features. We evaluate our approach on four datasets: ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, and In-The-Wild. Experimental results demonstrate that Poin-HierNet outperforms state-of-the-art methods in terms of Equal Error Rate (EER).
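To illustrate why a hyperbolic metric can suit hierarchical data better than a Euclidean one, the following minimal sketch computes the standard geodesic distance in the Poincaré ball model with curvature -1. It is not taken from Poin-HierNet: the function name `poincare_distance` and the example points are hypothetical, and we assume that the "Poincaré sphere" of the abstract refers to this ball model.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points of the open unit ball (curvature -1).

    Illustrative helper only; not part of the Poin-HierNet implementation.
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Hypothetical points: a "root" at the origin and two "leaves" near the boundary.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
d_euclidean = math.dist(leaf_a, leaf_b)
```

In this example the leaf-to-leaf hyperbolic distance is several times the Euclidean one and is close to the length of the path through the root, which is the tree-like behavior that motivates placing prototypes in hyperbolic space.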