
TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation

4 October 2024
Jonathan Cook, Tim Rocktäschel, Jakob Foerster, Dennis Aumiller, Alex Wang

Papers citing "TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation"

19 papers
Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, Xueqi Cheng
15 Nov 2025
Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao
02 Oct 2025
LLMs Behind the Scenes: Enabling Narrative Scene Illustration
Melissa Roemmele, John Joon Young Chung, Taewook Kim, Yuqian Sun, Alex Calderwood, Max Kreminski
26 Sep 2025
TopoSizing: An LLM-aided Framework of Topology-based Understanding and Sizing for AMS Circuits
Ziming Wei, Zichen Kong, Yuan Wang, David Z. Pan, Xiyuan Tang
17 Sep 2025
Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
Momoka Furuhashi, Kouta Nakayama, Takashi Kodama, Saku Sugawara
21 Aug 2025
CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions
Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim
03 Aug 2025
Checklists Are Better Than Reward Models For Aligning Language Models
Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, Tongshuang Wu
24 Jul 2025
From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
Karen Zhou, John Michael Giorgi, P. Mani, Peng Xu, Davis Liang, Chenhao Tan
23 Jul 2025
Checklist Engineering Empowers Multilingual LLM Judges
Mohammad Ghiasvand Mohammadkhani, Hamid Beigy
09 Jul 2025
MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yongqi Fan, Yating Wang, Guandong Wang, Jie Zhai, Jingping Liu, Qi Ye, Tong Ruan
18 Jun 2025
Team Anotheroption at SemEval-2025 Task 8: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA
Nikolas Evkarpidi, Elena Tutubalina
11 Jun 2025
LLM-as-a-qualitative-judge: automating error analysis in natural language generation
Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian
10 Jun 2025
EvalAgent: Discovering Implicit Evaluation Criteria from the Web
Manya Wadhwa, Zayne Sprague, Chaitanya Malaviya, Philippe Laban, Junyi Jessy Li, Greg Durrett
21 Apr 2025
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li, Hanchen Li, Chenhao Tan
09 Apr 2025
When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
01 Apr 2025
REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
Alexander Pugachev, Alena Fenogenova, Vladislav Mikhailov, Ekaterina Artemova
17 Mar 2025
Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk
19 Feb 2025
Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers
Benedikt Stroebl, Sayash Kapoor, Arvind Narayanan
26 Nov 2024
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, He Yan, Xiangju Lu, Junmin Zhu, Wei Zhang
11 Nov 2024