
TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation

4 October 2024
Jonathan Cook, Tim Rocktäschel, Jakob Foerster, Dennis Aumiller, Alex Wang

Papers citing "TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation"

19 papers
Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, Xueqi Cheng
15 Nov 2025
Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao
02 Oct 2025
LLMs Behind the Scenes: Enabling Narrative Scene Illustration
Melissa Roemmele, John Joon Young Chung, Taewook Kim, Yuqian Sun, Alex Calderwood, Max Kreminski
26 Sep 2025
TopoSizing: An LLM-aided Framework of Topology-based Understanding and Sizing for AMS Circuits
Ziming Wei, Zichen Kong, Yuan Wang, David Z. Pan, Xiyuan Tang
17 Sep 2025
Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
Momoka Furuhashi, Kouta Nakayama, Takashi Kodama, Saku Sugawara
21 Aug 2025
CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions
Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim
03 Aug 2025
Checklists Are Better Than Reward Models For Aligning Language Models
Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, Tongshuang Wu
24 Jul 2025
From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
Karen Zhou, John Michael Giorgi, P. Mani, Peng Xu, Davis Liang, Chenhao Tan
23 Jul 2025
Checklist Engineering Empowers Multilingual LLM Judges
Mohammad Ghiasvand Mohammadkhani, Hamid Beigy
09 Jul 2025
MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yongqi Fan, Yating Wang, Guandong Wang, Jie Zhai, Jingping Liu, Qi Ye, Tong Ruan
18 Jun 2025
Team Anotheroption at SemEval-2025 Task 8: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA
Nikolas Evkarpidi, Elena Tutubalina
11 Jun 2025
LLM-as-a-qualitative-judge: automating error analysis in natural language generation
Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian
10 Jun 2025
EvalAgent: Discovering Implicit Evaluation Criteria from the Web
Manya Wadhwa, Zayne Sprague, Chaitanya Malaviya, Philippe Laban, Junyi Jessy Li, Greg Durrett
21 Apr 2025
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li, Hanchen Li, Chenhao Tan
09 Apr 2025
When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
01 Apr 2025
REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
Alexander Pugachev, Alena Fenogenova, Vladislav Mikhailov, Ekaterina Artemova
17 Mar 2025
Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk
19 Feb 2025
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers
Benedikt Stroebl, Sayash Kapoor, Arvind Narayanan
26 Nov 2024
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, He Yan, Xiangju Lu, Junmin Zhu, Wei Zhang
11 Nov 2024