Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation

8 September 2023
Jiatong Li
Rui Li
Qi Liu
arXiv: 2309.04369

Papers citing "Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation"

AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery
Amirhossein Abaskohi
A. Ramesh
Shailesh Nanisetty
Chirag Goel
David Vazquez
Christopher Pal
Spandana Gella
Giuseppe Carenini
I. Laradji
24
0
0
10 Apr 2025
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Simin Chen
Yiming Chen
Zexin Li
Yifan Jiang
Zhongwei Wan
...
Dezhi Ran
Tianle Gu
H. Li
Tao Xie
Baishakhi Ray
41
2
0
23 Feb 2025
AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities
Fabrizio Davide
Pietro Torre
Andrea Gaggioli
ELM
88
0
0
12 Dec 2024
TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
H. Wang
Xiachong Feng
Lei Li
Z. Qin
Dianbo Sui
Lingpeng Kong
LRM
ELM
30
3
0
14 Oct 2024
Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games
Nathan Herr
Fernando Acero
Roberta Raileanu
María Pérez-Ortiz
Zhibin Li
LRM
53
2
0
05 Jul 2024
How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics
Nidhir Bhavsar
Jonathan Jordan
Sherzod Hakimov
David Schlangen
16
0
0
20 Jun 2024
clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents
Anne Beyer
Kranti Chalamalasetti
Sherzod Hakimov
Brielen Madureira
P. Sadler
David Schlangen
LLMAG
19
4
0
31 May 2024
PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
Jiatong Li
Renjun Hu
Kunzhe Huang
Yan Zhuang
Qi Liu
Mengxiao Zhu
Xing Shi
Wei Lin
KELM
36
4
0
30 May 2024
Risks and Opportunities of Open-Source Generative AI
Francisco Eiras
Aleksandar Petrov
Bertie Vidgen
Christian Schroeder de Witt
Fabio Pizzati
...
Matthew Jackson
Philip H. S. Torr
Trevor Darrell
Y. Lee
Jakob N. Foerster
37
18
0
14 May 2024
Near to Mid-term Risks and Opportunities of Open-Source Generative AI
Francisco Eiras
Aleksandar Petrov
Bertie Vidgen
Christian Schroeder de Witt
Fabio Pizzati
...
Paul Röttger
Philip H. S. Torr
Trevor Darrell
Y. Lee
Jakob N. Foerster
33
5
0
25 Apr 2024
How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments
Jen-tse Huang
E. Li
Man Ho Lam
Tian Liang
Wenxuan Wang
Youliang Yuan
Wenxiang Jiao
Xing Wang
Zhaopeng Tu
Michael R. Lyu
ELM
LLMAG
77
32
0
18 Mar 2024
Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform
Mingyue Cheng
Hao Zhang
Jiqian Yang
Qi Liu
Li Li
Xin Huang
Liwei Song
Zhi Li
Zhenya Huang
Enhong Chen
ELM
ALM
14
9
0
13 Mar 2024
Position: AI Evaluation Should Learn from How We Test Humans
Yan Zhuang
Q. Liu
Yuting Ning
Wei Huang
Rui Lv
Zhenya Huang
Guanhao Zhao
Zheng-Wei Zhang
ELM
ALM
62
21
0
18 Jun 2023
A Paradigm Shift: The Future of Machine Translation Lies with Large Language Models
Chenyang Lyu
Zefeng Du
Jitao Xu
Yitao Duan
Minghao Wu
Teresa Lynn
Alham Fikri Aji
Derek F. Wong
Siyou Liu
Longyue Wang
41
25
0
02 May 2023
Measuring Coding Challenge Competence With APPS
Dan Hendrycks
Steven Basart
Saurav Kadavath
Mantas Mazeika
Akul Arora
...
Collin Burns
Samir Puranik
Horace He
D. Song
Jacob Steinhardt
ELM
AIMat
ALM
194
614
0
20 May 2021