KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

Tingyu Wu
Zhisheng Chen
Ziyan Weng
Shuhe Wang
Chenglong Li
Shuo Zhang
Sen Hu
Silin Wu
Qizhen Lan
Huacan Wang
Ronghao Chen
Main: 9 pages, Bibliography: 2 pages, Appendix: 7 pages, 2 figures, 4 tables
Abstract

Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at this https URL (KnowMeBench).
