arXiv: 2307.11088

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023
20 July 2023
Chen An
Shansan Gong
Ming Zhong
Xingjian Zhao
Mukai Li
Jun Zhang
Lingpeng Kong
Xipeng Qiu
ELM, ALM

Papers citing "L-Eval: Instituting Standardized Evaluation for Long Context Language Models"

50 / 137 papers shown
Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving
Hui Zeng
Daming Zhao
Pengfei Yang
WenXuan Hou
Tianyang Zheng
Hui Li
Weiye Ji
Jidong Zhai
128
1
0
08 Nov 2025
LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model
Wei Shao
Lingchao Zheng
Pengyu Wang
Peizhen Zheng
Jun Li
Yuwei Fan
50
0
0
07 Nov 2025
MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models
Keyan Zhou
Zecheng Tang
Lingfeng Ming
G. Zhou
Qiguang Chen
...
Zheming Yang
Libo Qin
Minghui Qiu
Juntao Li
Min Zhang
76
0
0
15 Oct 2025
Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
Mufei Li
Dongqi Fu
Limei Wang
Si Zhang
Hanqing Zeng
...
Xiaoxin He
Xavier Bresson
Yinglong Xia
Chonglin Sun
Pan Li
154
0
0
08 Oct 2025
Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles
Miao Li
Alexander Gurung
Irina Saparina
Mirella Lapata
RALM, LRM
81
1
0
25 Sep 2025
CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density
Daniel Kaiser
Arnoldo Frigessi
Ali Ramezani-Kebrya
Benjamin Ricaud
LRM
82
0
0
22 Sep 2025
Extending Automatic Machine Translation Evaluation to Book-Length Documents
Kuang-Da Wang
Shuoyang Ding
Chao-Han Huck Yang
Ping-Chun Hsieh
Wen-Chih Peng
Vitaly Lavrukhin
Boris Ginsburg
95
1
0
21 Sep 2025
Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs
Norman Paulsen
84
0
0
21 Sep 2025
Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts
Jiaqi Deng
Yuho Lee
Nicole Hee-Yeon Kim
Hyangsuk Min
Taewon Yun
Minjeong Ban
Kim Yul
Hwanjun Song
50
0
0
27 Aug 2025
LongReasonArena: A Long Reasoning Benchmark for Large Language Models
Jiayu Ding
Shuming Ma
Lei Cui
Nanning Zheng
Furu Wei
LRM, ELM
68
0
0
26 Aug 2025
The illusion of a perfect metric: Why evaluating AI's words is harder than it looks
Maria Paz Oliva
Adriana Correia
Ivan Vankov
Viktor Botev
ALM
84
0
0
19 Aug 2025
Positional Biases Shift as Inputs Approach Context Window Limits
Blerta Veseli
Julian Chibane
Mariya Toneva
Alexander Koller
80
2
0
10 Aug 2025
NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models
Hyeonseok Moon
Heuiseok Lim
LLMAG, RALM, LRM
97
0
0
30 Jul 2025
SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models
Wonjun Jeong
Dongseok Kim
Taegkeun Whangbo
175
0
0
24 Jul 2025
Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction
Tianyun Zhong
Guozhao Mo
Yanjiang Liu
Yihan Chen
Lingdi Kong
...
Hongyu Lin
Shiwei Ye
Le Sun
Ben He
RALM, LMTD
192
0
0
22 Jul 2025
Docopilot: Improving Multimodal Models for Document-Level Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Yuchen Duan
Zhe Chen
Yusong Hu
Weiyun Wang
Shenglong Ye
...
Qibin Hou
Tong Lu
Jiaming Song
Jifeng Dai
Wenhai Wang
120
8
0
19 Jul 2025
LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues
Haoyang Li
Zhanchao Xu
Yiming Li
Xuejia Chen
Darian Li
...
Cheng Deng
Jun Wang
Qing Li
Lei Chen
Mingxuan Yuan
160
1
0
18 Jul 2025
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
J. Wu
Gefei Gu
Yanan Zheng
Dit-Yan Yeung
Arman Cohan
LLMAG, ELM
154
3
0
13 Jul 2025
StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns
Luanbo Wan
Weizhi Ma
LLMAG, KELM
144
1
0
16 Jun 2025
LIFELONG SOTOPIA: Evaluating Social Intelligence of Language Agents Over Lifelong Social Interactions
Hitesh Goel
Hao Zhu
CLL
123
1
0
14 Jun 2025
Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs
Wanyun Cui
Mingwei Xu
129
0
0
04 Jun 2025
A Controllable Examination for Long-Context Language Models
Yijun Yang
Zeyu Huang
Wenhao Zhu
Zihan Qiu
Fei Yuan
Jeff Z. Pan
Ivan Titov
ELM
180
0
0
03 Jun 2025
ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Beong-woo Kwak
Minju Kim
Dongha Lim
Hyungjoo Chae
Dongjin Kang
Sunghwan Kim
Dongil Yang
Jinyoung Yeo
LLMAG, RALM
251
1
0
29 May 2025
MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhongzhan Huang
Guoming Ling
Shanshan Zhong
Hefeng Wu
Liang Lin
232
0
0
26 May 2025
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Wang Yang
Hongye Jin
Shaochen Zhong
Song Jiang
Qifan Wang
Vipin Chaudhary
Xiaotian Han
ELM
157
1
0
25 May 2025
Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find
Owen Bianchi
Mathew J. Koretsky
Maya Willey
Chelsea X. Alvarado
Tanay Nayak
Adi Asija
Nicole Kuznetsov
M. Nalls
F. Faghri
Daniel Khashabi
208
3
0
23 May 2025
SELF: Self-Extend the Context Length With Logistic Growth Function
Phat Thanh Dang
Saahil Thoppay
Wang Yang
Qifan Wang
Vipin Chaudhary
Xiaotian Han
211
0
0
22 May 2025
NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts
Abhay Gupta
Michael Lu
Kevin Zhu
Sean O'Brien
LRM
228
0
0
20 May 2025
Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning
Adam Štorek
Mukur Gupta
Samira Hajizadeh
Prashast Srivastava
Suman Jana
LRM
218
2
0
19 May 2025
PSC: Extending Context Window of Large Language Models via Phase Shift Calibration
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Wenqiao Zhu
Chao Xu
Lulu Wang
Jun Wu
206
2
0
18 May 2025
SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models
Peichao Lai
Jianchao Tan
Yi Lin
Lingling Zhang
Feiyang Ye
...
Zifei Shan
Bin Wang
Longji Xu
Wentao Zhang
Bin Cui
ELM, LRM
391
0
0
12 May 2025
Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
Yiming Du
Wenyu Huang
Danna Zheng
Zhaowei Wang
Sébastien Montella
Mirella Lapata
Kam-Fai Wong
Jeff Z. Pan
KELM, MU
553
16
0
01 May 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Qi Zhang
Tat-Seng Chua
Tianwei Zhang
ALM, ELM
456
21
0
26 Apr 2025
LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams
Yongxuan Wu
Runyu Chen
Peiyu Liu
Hongjin Qian
RALM
307
1
0
24 Apr 2025
Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks
Amey Hengle
Prasoon Bajpai
Soham Dan
Tanmoy Chakraborty
LRM
263
3
0
17 Apr 2025
EduPlanner: LLM-Based Multi-Agent Systems for Customized and Intelligent Instructional Design
IEEE Transactions on Learning Technologies (IEEE TLT), 2025
Xinsong Zhang
Chao Zhang
Jianwen Sun
Jun Xiao
Yi Yang
Yawei Luo
LLMAG, AI4Ed
169
11
0
07 Apr 2025
Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts
Yifei Yu
Qian Zhang
Lingfeng Qiao
Di Yin
Fang Li
Jie Wang
Zheyu Chen
Suncong Zheng
Xiaolong Liang
Xingwu Sun
317
6
0
07 Apr 2025
The Use of Gaze-Derived Confidence of Inferred Operator Intent in Adjusting Safety-Conscious Haptic Assistance
Jeremy D. Webb
Michael Bowman
Songpo Li
Xiaoli Zhang
260
0
0
04 Apr 2025
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs
Siqi Fan
Xiusheng Huang
Yiqun Yao
Xuezhi Fang
Kang Liu
Peng Han
Shuo Shang
Aixin Sun
Yequan Wang
LLMAG
211
2
0
30 Mar 2025
A Survey on Transformer Context Extension: Approaches and Evaluation
Yijun Liu
Jinzheng Yu
Yang Xu
Zhongyang Li
Qingfu Zhu
LLMAG
417
11
0
17 Mar 2025
Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring
Kezia Oketch
John P. Lalor
Yi Yang
Ahmed Abbasi
ELM
135
5
0
14 Mar 2025
CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
International Conference on Learning Representations (ICLR), 2025
Hao Cui
Zahra Shamsi
Gowoon Cheon
Xuejian Ma
Shutong Li
...
Eun-Ah Kim
M. Brenner
Viren Jain
Sameera Ponda
Subhashini Venugopalan
ELM, LRM
402
22
0
14 Mar 2025
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
Zhaofeng Wu
Michihiro Yasunaga
Andrew Cohen
Yoon Kim
Asli Celikyilmaz
Marjan Ghazvininejad
247
10
0
14 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Shehreen Azad
Vibhav Vineet
Yogesh S Rawat
VLM
975
11
0
11 Mar 2025
Shifting Long-Context LLMs Research from Input to Output
Yuhao Wu
Yushi Bai
Zhiqing Hu
Shangqing Tu
Ming Shan Hee
Juanzi Li
Roy Ka-wei Lee
271
13
0
06 Mar 2025
RSQ: Learning from Important Tokens Leads to Better Quantized LLMs
Yi-Lin Sung
Prateek Yadav
Jialu Li
Jaehong Yoon
Joey Tianyi Zhou
MQ
182
2
0
03 Mar 2025
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
International Conference on Learning Representations (ICLR), 2025
Xunhao Lai
Jianqiao Lu
Yao Luo
Yiyuan Ma
Xun Zhou
239
45
0
28 Feb 2025
PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation
Albert Gong
Kamilė Stankevičiūtė
Chao-gang Wan
Anmol Kabra
Raphael Thesmar
Johann Lee
Julius Klenke
Daniel Schwalbe-Koda
Kilian Q. Weinberger
LRM, RALM
252
4
0
27 Feb 2025
MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
Teng Lin
Yuyu Luo
Honglin Zhang
Jicheng Zhang
Chunlin Liu
Kaishun Wu
Nan Tang
RALM
273
6
0
26 Feb 2025
Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Shuliang Liu
Xinze Li
Zhenghao Liu
Shi Yu
Cheng Yang
Zheni Zeng
Zhiyuan Liu
Maosong Sun
Ge Yu
RALM
399
4
0
26 Feb 2025