
EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models

Chuanrui Hu
Tong Li
Xingze Gao
Hongda Chen
Yi Bai
Dannong Xu
Tianwei Lin
Xinda Zhao
Xiaohong Li
Yunyun Han
Jian Pei
Yafeng Deng
Main text: 8 pages; bibliography: 2 pages; 3 figures; 6 tables
Abstract

Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through more than 1,000 QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26% accuracy; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.
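The similarity-based retrieval the abstract critiques typically ranks stored memory embeddings by cosine similarity to the query embedding. A minimal sketch of that baseline follows; the function name, array shapes, and toy vectors are illustrative assumptions, not details from the paper:

```python
import numpy as np

def retrieve(query_vec, memory_vecs, k=3):
    """Return indices of the k memories most cosine-similar to the query.

    query_vec: (d,) embedding of the user query.
    memory_vecs: (n, d) embeddings of stored memories.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    scores = m @ q
    # Highest-scoring memories first.
    return np.argsort(scores)[::-1][:k]

# Toy example: memory 0 matches the query exactly, memory 2 nearly so.
query = np.array([1.0, 0.0])
memories = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
print(retrieve(query, memories, k=2))  # → [0 2]
```

Such a ranker only surfaces memories that are lexically or semantically close to the query surface form, which is why, per the abstract, it misses memories that are only *implicitly* relevant to the question being asked.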
