Cross-attention for State-based model RWKV-7

Abstract

We introduce CrossWKV, a novel cross-attention mechanism for the state-based RWKV-7 model, designed to enhance the expressive power of text-to-image generation. Leveraging RWKV-7's linear-complexity Weighted Key-Value (WKV) architecture, CrossWKV integrates text and image modalities in a single pass, using a generalized delta rule with vector-valued gating and low-rank adaptation (LoRA) to achieve superior cross-modal alignment. Unlike Transformer-based models, CrossWKV employs a non-diagonal, input-dependent transition matrix that enables it to represent complex functions beyond the $\mathrm{TC}^0$ complexity class, including all regular languages, as demonstrated by its ability to perform state-tracking tasks such as $S_5$ permutation modeling. Evaluated within the Diffusion in RWKV-7 (DIR-7) framework on datasets such as LAION-5B and ImageNet, CrossWKV achieves a Fréchet Inception Distance (FID) of 2.88 and a CLIP score of 0.33 on ImageNet 256×256, matching state-of-the-art performance while offering robust generalization across diverse prompts. The model's enhanced expressivity, combined with constant memory usage and linear scaling in sequence length, positions it as a powerful solution for advanced cross-modal tasks, with potential applications in high-resolution generation and dynamic state manipulation. Code is available at this http URL and this https URL.
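
To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of how a CrossWKV-style cross-attention pass could look, assuming the publicly described RWKV-7 generalized delta-rule state update: text tokens write keys and values into a fixed-size state through a non-diagonal, input-dependent transition, and image queries then read that text-conditioned state in a single pass. All names (cross_wkv_sketch, q_img, k_txt, v_txt, w, a) and the exact update form are illustrative assumptions, not the paper's code; LoRA projections and multi-head structure are omitted.

import torch
import torch.nn.functional as F

def cross_wkv_sketch(q_img, k_txt, v_txt, w, a):
    """Minimal single-head sketch of a CrossWKV-style recurrence.

    Hypothetical illustration based on the public RWKV-7 (generalized
    delta rule) formulation, not the authors' implementation.

    Shapes (batch and heads omitted for clarity):
      q_img : (T_img, d)  queries from image tokens
      k_txt : (T_txt, d)  keys from text tokens
      v_txt : (T_txt, d)  values from text tokens
      w     : (T_txt, d)  per-channel decay gates in (0, 1)
      a     : (T_txt, d)  vector-valued in-context learning rates
    """
    T_txt, d = k_txt.shape
    S = torch.zeros(d, d)                            # constant-size WKV state
    for t in range(T_txt):
        k_hat = F.normalize(k_txt[t], dim=-1)
        # Non-diagonal, input-dependent transition: per-channel decay plus
        # a rank-1 "erase" term (delta-rule removal of old content).
        transition = torch.diag(w[t]) - torch.outer(k_hat, a[t] * k_hat)
        # Apply the transition, then write the current value (rank-1 update).
        S = S @ transition + torch.outer(v_txt[t], k_hat)
    # Cross-attention readout: each image query attends to the text state.
    return q_img @ S.T                               # (T_img, d)

# Toy usage: 8 text tokens conditioning 16 image tokens, width 32.
if __name__ == "__main__":
    T_txt, T_img, d = 8, 16, 32
    out = cross_wkv_sketch(
        torch.randn(T_img, d), torch.randn(T_txt, d), torch.randn(T_txt, d),
        torch.sigmoid(torch.randn(T_txt, d)), torch.sigmoid(torch.randn(T_txt, d)),
    )
    print(out.shape)  # torch.Size([16, 32])

Because the state has a fixed d×d size and each text token contributes a single rank-1 update, memory stays constant and compute grows linearly with the number of tokens, consistent with the constant-memory, linear-scaling claim in the abstract.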

@article{xiao2025_2504.14260,
  title={Cross-attention for State-based model RWKV-7},
  author={Liu Xiao and Li Zhiyuan and Lin Yueyu},
  journal={arXiv preprint arXiv:2504.14260},
  year={2025}
}