arXiv: 2107.00061
All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text

30 June 2021
Elizabeth Clark
Tal August
Sofia Serrano
Nikita Haduong
Suchin Gururangan
Noah A. Smith
    DeLMO

Papers citing "All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text"

50 / 220 papers shown
Humans can learn to detect AI-generated texts, or at least learn when they can't
Jiří Milička
Anna Marklová
Ondřej Drobil
Eva Pospíšilová
DeLMO
17
0
0
03 May 2025
The Viability of Crowdsourcing for RAG Evaluation
Lukas Gienapp
Tim Hagen
Maik Fröbe
Matthias Hagen
Benno Stein
Martin Potthast
Harrisen Scells
21
0
0
22 Apr 2025
MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks
Jaime Raldua Veuthey
Zainab Ali Majid
Suhas Hariharan
Jacob Haimes
ELM
26
0
0
18 Apr 2025
Labeling Messages as AI-Generated Does Not Reduce Their Persuasive Effects
Isabel O. Gallegos
Chen Shani
Weiyan Shi
Federico Bianchi
Izzy Gainsburg
Dan Jurafsky
Robb Willer
20
1
0
14 Apr 2025
Explorer: Robust Collection of Interactable GUI Elements
Iason Chaimalas
Arnas Vyšniauskas
Gabriel Brostow
26
0
0
12 Apr 2025
Can postgraduate translation students identify machine-generated text?
Michael Farrell
DeLMO
38
0
0
12 Apr 2025
TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models
Sher Badshah
Ali Emami
Hassan Sajjad
LLMAG
ELM
43
0
0
10 Apr 2025
From Speech to Summary: A Comprehensive Survey of Speech Summarization
Fabian Retkowski
Maike Züfle
Andreas Sudmann
Dinah Pfau
Jan Niehues
Alexander Waibel
39
0
0
10 Apr 2025
Toward Holistic Evaluation of Recommender Systems Powered by Generative Models
Yashar Deldjoo
Nikhil Mehta
M. Sathiamoorthy
Shuai Zhang
Pablo Castells
Julian McAuley
EGVM
ELM
64
1
0
09 Apr 2025
Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
Diana Galván-Sosa
Gabrielle Gaudeau
Pride Kavumba
Yunmeng Li
Hongyi Gu
Zheng Yuan
Keisuke Sakaguchi
P. Buttery
LRM
35
0
0
31 Mar 2025
Did ChatGPT or Copilot use alter the style of internet news headlines? A time series regression analysis
Chris Brogly
Connor McElroy
KELM
32
0
0
31 Mar 2025
SCORE: Story Coherence and Retrieval Enhancement for AI Narratives
Qiang Yi
Yangfan He
J. Wang
Xinyuan Song
Shiyao Qian
...
K. Li
Kuan Lu
Menghao Huo
Jiaqi Chen
Tianyu Shi
RALM
42
6
0
30 Mar 2025
Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models
Tom Kempton
Stuart Burrell
32
0
0
27 Mar 2025
Feature Extraction and Analysis for GPT-Generated Text
A. Selvioğlu
V. Adanova
M. Atagoziev
DeLMO
60
0
0
17 Mar 2025
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama
Naome A. Etori
Kevin Lu
Randu Karisa
Arturs Kanepajs
LRM
ELM
113
0
0
14 Mar 2025
DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering
Sher Badshah
Hassan Sajjad
60
1
0
11 Mar 2025
Detection Avoidance Techniques for Large Language Models
Sinclair Schneider
Florian Steuber
João A. G. Schneider
Gabi Dreo Rodosek
DeLMO
78
0
0
10 Mar 2025
Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
Jooyoung Lee
Xiaochen Zhu
Georgi Karadzhov
Tom Stafford
Andreas Vlachos
Dongwon Lee
56
0
0
06 Mar 2025
When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning
Yijiang River Dong
Tiancheng Hu
Yinhong Liu
Ahmet Üstün
Nigel Collier
78
1
0
26 Feb 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer
Punit Singh Koura
Binh Tang
R. Subramanian
Aaditya K. Singh
...
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
ALM
69
0
0
24 Feb 2025
Can AI mimic the human ability to define neologisms?
Georgios P. Georgiou
35
1
0
18 Feb 2025
From Text to Trust: Empowering AI-assisted Decision Making with Adaptive LLM-powered Analysis
Zhuoyan Li
Hangxiao Zhu
Zhuoran Lu
Ziang Xiao
Ming Yin
47
0
0
17 Feb 2025
Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages
Shreyan Biswas
Alexander Erlei
U. Gadiraju
103
4
0
13 Feb 2025
Reference-free Evaluation Metrics for Text Generation: A Survey
Takumi Ito
Kees van Deemter
Jun Suzuki
ELM
33
2
0
21 Jan 2025
Using Machine Learning to Distinguish Human-written from Machine-generated Creative Fiction
Andrea Cristina McGlinchey
Peter J Barclay
DeLMO
71
0
0
15 Dec 2024
QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization
Shiyue Zhang
David Wan
Arie Cattan
Ayal Klein
Ido Dagan
Mohit Bansal
81
0
0
10 Dec 2024
The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?
Sourav Banerjee
Ayushi Agarwal
Eishkaran Singh
ELM
73
2
0
02 Dec 2024
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Yew Ken Chia
Liying Cheng
Hou Pong Chan
Chaoqun Liu
Maojia Song
Sharifah Mahani Aljunied
Soujanya Poria
Lidong Bing
RALM
VLM
43
4
0
09 Nov 2024
How Performance Pressure Influences AI-Assisted Decision Making
Nikita Haduong
Noah A. Smith
14
0
0
21 Oct 2024
4-LEGS: 4D Language Embedded Gaussian Splatting
Gal Fiebelman
Tamir Cohen
Ayellet Morgenstern
Peter Hedman
Hadar Averbuch-Elor
3DGS
41
3
0
14 Oct 2024
Reverse Modeling in Large Language Models
S. Yu
Yuanchen Xu
Cunxiao Du
Yanying Zhou
Minghui Qiu
Q. Sun
Hao Zhang
Jiawei Wu
29
2
0
13 Oct 2024
The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making
Basile Garcia
Crystal Qian
Stefano Palminteri
ELM
50
1
0
09 Oct 2024
Conversate: Supporting Reflective Learning in Interview Practice Through Interactive Simulation and Dialogic Feedback
Taufiq Daryanto
Xiaohan Ding
Lance T Wilhelm
Sophia Stil
Kirk McInnis Knutsen
Eugenia H. Rho
18
0
0
08 Oct 2024
How Does the Disclosure of AI Assistance Affect the Perceptions of Writing?
Zhuoyan Li
Chen Liang
Jing Peng
Ming Yin
18
1
0
06 Oct 2024
Trying to be human: Linguistic traces of stochastic empathy in language models
Bennett Kleinberg
Jari Zegers
Jonas Festor
Stefana Vida
Julian Präsent
Riccardo Loconte
Sanne Peereboom
28
0
0
02 Oct 2024
Generative AI and Perceptual Harms: Who's Suspected of using LLMs?
Kowe Kadoma
D. Metaxa
Mor Naaman
32
3
0
01 Oct 2024
From Deception to Detection: The Dual Roles of Large Language Models in Fake News
Dorsaf Sallami
Yuan-Chen Chang
Esma Aïmeur
31
3
0
25 Sep 2024
Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
Jasper Dekoninck
Maximilian Baader
Martin Vechev
ALM
92
0
0
01 Sep 2024
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Raoyuan Zhao
Abdullatif Köksal
Yihong Liu
Leonie Weissweiler
Anna Korhonen
Hinrich Schütze
SyDa
36
1
0
30 Aug 2024
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
Dingyi Yang
Qin Jin
36
5
0
26 Aug 2024
CPS-TaskForge: Generating Collaborative Problem Solving Environments for Diverse Communication Tasks
Nikita Haduong
Irene Wang
Bo-Ru Lu
Prithviraj Ammanabrolu
Noah A. Smith
51
1
0
16 Aug 2024
Risks and NLP Design: A Case Study on Procedural Document QA
Nikita Haduong
Alice Gao
Noah A. Smith
24
3
0
16 Aug 2024
The Oscars of AI Theater: A Survey on Role-Playing with Language Models
Nuo Chen
Yan Wang
Yang Deng
Jia Li
26
15
0
16 Jul 2024
Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures?
Yingming Pu
Liping Huang
Tao Lin
Hongyu Chen
ELM
34
0
0
12 Jul 2024
The Career Interests of Large Language Models
Meng Hua
Yuan Cheng
Hengshu Zhu
50
0
0
11 Jul 2024
Paraphrase Types Elicit Prompt Engineering Capabilities
Jan Philip Wahle
Terry Ruas
Yang Xu
Bela Gipp
34
5
0
28 Jun 2024
PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models
Jiaming Ji
Donghai Hong
Borong Zhang
Boyuan Chen
Josef Dai
Boren Zheng
Tianyi Qiu
Boxun Li
Yaodong Yang
42
24
0
20 Jun 2024
Evaluation and Continual Improvement for an Enterprise AI Assistant
Akash Maharaj
Kun Qian
Uttaran Bhattacharya
Sally Fang
Horia Galatanu
...
Rachel Hanessian
Nishant Kapoor
Ken Russell
Shivakumar Vaithyanathan
Yunyao Li
28
4
0
15 Jun 2024
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
Tianle Zhang
Langtian Ma
Yuchen Yan
Yuchen Zhang
Kai Wang
...
Wenqi Shao
Yang You
Yu Qiao
Ping Luo
Kaipeng Zhang
VGen
61
2
0
13 Jun 2024
The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches
Bhashithe Abeysinghe
Ruhan Circi
ELM
29
21
0
05 Jun 2024