Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time

Abstract

Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on a wide range of reasoning tasks, these evaluations often overlook a crucial dimension: time. In real-world scenarios, the correctness of an answer is frequently tied to temporal context. To address this gap, we present a novel framework and a dataset of over 8,000 events from 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform their instruction-tuned and synthetically trained counterparts on time-sensitive recall. We also find that even large-scale models remain brittle when facts are paraphrased, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work is a step toward time-aware language models that can adapt to the dynamic nature of real-world knowledge.
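
The following minimal sketch illustrates the general idea of time-conditioned fact recall: ask the same question anchored to different dates and score the answer against the fact that held on that day. It is not the paper's actual TimeShift protocol; the probe template, the two-row event table, and the query_model stub are hypothetical placeholders.

from datetime import date

# Hypothetical day-granular event table: (valid_from, valid_to, answer).
# The actual benchmark covers 8,000+ events from 2018-2024 across many domains.
US_PRESIDENT = [
    (date(2017, 1, 20), date(2021, 1, 20), "Donald Trump"),
    (date(2021, 1, 20), date(2025, 1, 20), "Joe Biden"),
]

def gold_answer(events, asked_on):
    """Return the fact that held on the given day, or None if out of range."""
    for start, end, answer in events:
        if start <= asked_on < end:
            return answer
    return None

def build_probe(question, asked_on):
    """Condition the question on an explicit date (illustrative template only)."""
    return f"It is {asked_on.isoformat()}. {question} Answer with a name only."

def score(model_answer, gold):
    """Exact match on the normalized answer string."""
    return model_answer.strip().lower() == gold.lower()

def query_model(prompt):
    """Stub standing in for any LLM call; always answers the same name."""
    return "Joe Biden"

for asked_on in (date(2019, 6, 1), date(2023, 6, 1)):
    prompt = build_probe("Who is the US President?", asked_on)
    gold = gold_answer(US_PRESIDENT, asked_on)
    print(asked_on, score(query_model(prompt), gold))  # False, then True

Scoring paraphrased answers, the brittleness noted in the abstract, would require a more forgiving matcher than exact string comparison.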

@article{herel2025_2409.13338,
  title={Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time},
  author={David Herel and Vojtech Bartek and Jiri Jirak and Tomas Mikolov},
  journal={arXiv preprint arXiv:2409.13338},
  year={2025}
}