The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
arXiv: 2406.05761 · 9 June 2024
Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
ELM · ALM · LM&MA
Papers citing "The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models"
38 papers shown
Trillion 7B Technical Report
Sungjun Han, Juyoung Suk, Suyeong An, Hyungguk Kim, Kyuseok Kim, Wonsuk Yang, Seungtaek Choi, Jamin Shin
21 Apr 2025

EvalAgent: Discovering Implicit Evaluation Criteria from the Web
Manya Wadhwa, Zayne Sprague, Chaitanya Malaviya, Philippe Laban, Junyi Jessy Li, Greg Durrett
21 Apr 2025

LRAGE: Legal Retrieval Augmented Generation Evaluation Tool
Minhu Park, Hongseok Oh, Eunkyung Choi, Wonseok Hwang
AILaw · RALM · ELM
02 Apr 2025

KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
Yoonshik Kim, Jaeyoon Jung
31 Mar 2025

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators
Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, ..., Kiril Gashteovski, Carolin (Haas) Lawrence, J. Hockenmaier, Graham Neubig, Sean Welleck
LRM
25 Mar 2025

Won: Establishing Best Practices for Korean Financial NLP
Guijin Son, Hyunwoo Ko, Haneral Jung, Chami Hwang
23 Mar 2025

REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
Alexander Pugachev, Alena Fenogenova, Vladislav Mikhailov, Ekaterina Artemova
17 Mar 2025

An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang
ELM
15 Feb 2025

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim
ELM · LRM
23 Oct 2024

CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, S. Zhang, Kai Chen
AILaw · ELM
21 Oct 2024

Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra
ELM
17 Oct 2024

JudgeBench: A Benchmark for Evaluating LLM-based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca A. Popa, Ion Stoica
ELM · ALM
16 Oct 2024

Self-rationalization improves LLM as a fine-grained judge
Prapti Trivedi, Aditya Gulati, Oliver Molenschot, Meghana Arakkal Rajeev, Rajkumar Ramamurthy, Keith Stevens, Tanveesh Singh Chaudhery, Jahnavi Jambholkar, James Y. Zou, Nazneen Rajani
LRM
07 Oct 2024

ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
Ippei Fujisawa, Sensho Nobe, Hiroki Seto, Rina Onda, Yoshiaki Uchida, Hiroki Ikoma, Pei-Chun Chien, Ryota Kanai
LRM
04 Oct 2024

Direct Judgement Preference Optimization
Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty
ELM
23 Sep 2024

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
ReLM · LRM
18 Sep 2024

AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents
Zhe Su, Xuhui Zhou, Sanketh Rangreji, Anubha Kabra, Julia Mendelsohn, Faeze Brahman, Maarten Sap
LLMAG
13 Sep 2024

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev
LLMAG
10 Sep 2024

From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan, D. Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth
ELM
06 Sep 2024

Industrial-Grade Smart Troubleshooting through Causal Technical Language Processing: a Proof of Concept
Alexandre Trilla, Ossee Yiboe, Nenad Mijatovic, Jordi Vitrià
30 Jul 2024

Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education
Seungyoon Kim, Seungone Kim
AI4Ed
24 Jul 2024

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Ori Yoran, S. Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
LLMAG
22 Jul 2024

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh Khapra
19 Jun 2024

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
MoMe · ALM · ELM
02 May 2024

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Y. Xiao, R. Leike, Lilian Weng, Johannes Heidecke, Alex Beutel
SILM
19 Apr 2024

A Safe Harbor for AI Evaluation and Red Teaming
Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, ..., Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, Peter Henderson
07 Mar 2024

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
Shivalika Singh, Freddie Vargus, Daniel D'souza, Börje F. Karlsson, Abinaya Mahendiran, ..., Max Bartolo, Julia Kreutzer, A. Ustun, Marzieh Fadaee, Sara Hooker
09 Feb 2024

Personalized Language Modeling from Personalized Human Feedback
Xinyu Li, Zachary C. Lipton, Liu Leqi
ALM
06 Feb 2024

OLMo: Accelerating the Science of Language Models
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Michael Kinney, ..., Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna Hajishirzi
OSLM
01 Feb 2024

JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Lianghui Zhu, Xinggang Wang, Xinlong Wang
ELM · ALM
26 Oct 2023

Language Models are Multilingual Chain-of-Thought Reasoners
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, ..., Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason W. Wei
ReLM · LRM
06 Oct 2022

Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Abulhair Saparov, He He
ELM · LRM · ReLM
03 Oct 2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark
23 Aug 2022

ClaimDiff: Comparing and Contrasting Claims on Contentious Issues
Miyoung Ko, Ingyu Seong, Hwaran Lee, Joonsuk Park, Minsuk Chang, Minjoon Seo
24 May 2022

Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason W. Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou
ReLM · BDL · LRM · AI4CE
21 Mar 2022

Towards Continual Knowledge Learning of Language Models
Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, Minjoon Seo
CLL · KELM
07 Oct 2021

The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann, Tosin P. Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, ..., Nishant Subramani, Wei-ping Xu, Diyi Yang, Akhila Yerukola, Jiawei Zhou
VLM
02 Feb 2021

Language Models as Knowledge Bases?
Fabio Petroni, Tim Rocktäschel, Patrick Lewis, A. Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel
KELM · AI4MH
03 Sep 2019