RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

11 January 2026

Haonan Bian

Zhiyuan Yao

Sen Hu

Zishan Xu

Shaolei Zhang

Yifu Guo

Ziliang Yang

Xueran Han

Huacan Wang

Ronghao Chen

LLMAG


Main: 9 Pages

6 Figures

Bibliography: 3 Pages

11 Tables

Appendix: 6 Pages

Abstract

As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture **"long-term project-oriented"** interactions where agents must track evolving goals. To bridge this gap, we introduce **RealMem**, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios and uses natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at [this https URL](this https URL).
