2310.08394
Cited By
Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization
12 October 2023
Ondrej Skopek
Rahul Aralikatte
Sian Gooding
Victor Carbune
ELM
Papers citing
"Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization"
22 / 22 papers shown
Title
PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
Reya Vir
Shreya Shankar
Harrison Chase
Will Fu-Hinthorn
Aditya G. Parameswaran
AI4TS
32
0
0
20 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
29
0
0
16 Apr 2025
RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction
Jianhao Yan
Yun Luo
Yue Zhang
LLMAG
50
1
0
25 Feb 2025
Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation
Ryotaro Shimizu
Takashi Wada
Yu Wang
Johannes Kruse
Sean O'Brien
...
Yuya Yoshikawa
Yuki Saito
Fugee Tsung
M. Goto
Julian McAuley
14
0
0
17 Oct 2024
MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark
Elliot L. Epstein
Kaisheng Yao
Jing Li
Xinyi Bai
Hamid Palangi
LRM
42
0
0
26 Sep 2024
UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches
Chao Wang
Neo Wu
Lin Ning
Jiaxing Wu
Luyang Liu
Jun Xie
S. O’Banion
Bradley Green
40
0
0
30 Aug 2024
See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses
Yulong Chen
Yang Liu
Jianhao Yan
X. Bai
Ming Zhong
Yinghao Yang
Ziyi Yang
Chenguang Zhu
Yue Zhang
ALM
ELM
35
5
0
16 Aug 2024
The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
Xinyi Chen
Baohao Liao
Jirui Qi
Panagiotis Eustratiadis
Christof Monz
Arianna Bisazza
Maarten de Rijke
ALM
ELM
LRM
23
5
0
28 Jun 2024
Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation
Rem Hida
Junki Ohmura
Toshiyuki Sekiya
ELM
27
0
0
24 Jun 2024
METAL: Towards Multilingual Meta-Evaluation
Rishav Hada
Varun Gumma
Mohamed Ahmed
Kalika Bali
Sunayana Sitaram
ELM
22
2
0
02 Apr 2024
RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models
Jianhao Yan
Yun Luo
Yue Zhang
ALM
LRM
22
6
0
21 Feb 2024
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
Kenneth Li
Tianle Liu
Naomi Bashkansky
David Bau
Fernanda Viégas
Hanspeter Pfister
Martin Wattenberg
11
6
0
13 Feb 2024
Leveraging Large Language Models for NLG Evaluation: Advances and Challenges
Zhen Li
Xiaohan Xu
Tao Shen
Can Xu
Jia-Chen Gu
Yuxuan Lai
Chongyang Tao
Shuai Ma
LM&MA
ELM
26
9
0
13 Jan 2024
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization
Yixin Liu
Alexander R. Fabbri
Jiawen Chen
Yilun Zhao
Simeng Han
Shafiq R. Joty
Pengfei Liu
Dragomir R. Radev
Chien-Sheng Wu
Arman Cohan
ELM
36
43
0
15 Nov 2023
Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang
Hung-yi Lee
ALM
LM&MA
206
559
0
03 May 2023
Generative Agents: Interactive Simulacra of Human Behavior
J. Park
Joseph C. O'Brien
Carrie J. Cai
Meredith Ringel Morris
Percy Liang
Michael S. Bernstein
LM&Ro
AI4CE
209
1,701
0
07 Apr 2023
Large Language Models are Diverse Role-Players for Summarization Evaluation
Ning Wu
Ming Gong
Linjun Shou
Shining Liang
Daxin Jiang
57
44
0
27 Mar 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,261
0
28 Jan 2022
MReD: A Meta-Review Dataset for Structure-Controllable Text Generation
Chenhui Shen
Liying Cheng
Ran Zhou
Lidong Bing
Yang You
Luo Si
37
33
0
14 Oct 2021
SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems
Harrison Lee
Raghav Gupta
Abhinav Rastogi
Yuan Cao
Bin Zhang
Yonghui Wu
64
29
0
13 Oct 2021
Teaching Machines to Read and Comprehend
Karl Moritz Hermann
Tomás Kociský
Edward Grefenstette
L. Espeholt
W. Kay
Mustafa Suleyman
Phil Blunsom
170
3,504
0
10 Jun 2015