Model evaluation for extreme risks
24 May 2023
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, S. Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe
ELM

Papers citing "Model evaluation for extreme risks"

50 / 101 papers shown
What Is AI Safety? What Do We Want It to Be?
Jacqueline Harding, Cameron Domenico Kirk-Giannini
05 May 2025

Assessing LLM code generation quality through path planning tasks
Wanyi Chen, Meng-Wen Su, Mary L. Cummings
ELM
30 Apr 2025

A Framework to Assess the Persuasion Risks Large Language Model Chatbots Pose to Democratic Societies
Zhongren Chen, Joshua Kalla, Quan Le, Shinpei Nakamura-Sakai, Jasjeet Sekhon, Ruixiao Wang
29 Apr 2025

What Makes an Evaluation Useful? Common Pitfalls and Best Practices
Gil Gekker, Meirav Segal, Dan Lahav, Omer Nevo
ELM
30 Mar 2025

A Framework for Evaluating Emerging Cyberattack Capabilities of AI
Mikel Rodriguez, Raluca Ada Popa, Four Flynn, Lihao Liang, Allan Dafoe, Anna Wang
ELM
14 Mar 2025
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
HyunJin Kim, Xiaoyuan Yi, Jing Yao, Muhua Huang, Jinyeong Bak, James Evans, Xing Xie
08 Mar 2025

This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs
Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic
07 Mar 2025

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning
Borong Zhang, Yuhao Zhang, Jiaming Ji, Yingshan Lei, Josef Dai, Yuanpei Chen, Yaodong Yang
05 Mar 2025

Adaptively evaluating models with task elicitation
Davis Brown, Prithvi Balehannina, Helen Jin, Shreya Havaldar, Hamed Hassani, Eric Wong
ALM, ELM
03 Mar 2025
Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning
Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan M. Tran, David Hall, Cheongwoong Kang, Jaesik Choi
03 Mar 2025

Practical Principles for AI Cost and Compute Accounting
Stephen Casper, Luke Bailey, Tim Schreier
21 Feb 2025

Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture
John Burden, Marko Tesic, Lorenzo Pacchiardi, José Hernández Orallo
21 Feb 2025

IPAD: Inverse Prompt for AI Detection -- A Robust and Explainable LLM-Generated Text Detector
Zheng Chen, Yushi Feng, Changyang He, Yue Deng, Hongxi Pu, Bo-wen Li
DeLMO
21 Feb 2025

C3AI: Crafting and Evaluating Constitutions for Constitutional AI
Yara Kyrychenko, Ke Zhou, Edyta Bogucka, Daniele Quercia
ELM
21 Feb 2025

Enabling External Scrutiny of AI Systems with Privacy-Enhancing Technologies
Kendrea Beers, Helen Toner
05 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, ..., Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell
MU, AAML, ELM
03 Feb 2025

Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
20 Jan 2025

Two Types of AI Existential Risk: Decisive and Accumulative
Atoosa Kasirzadeh
20 Jan 2025

Principles for Responsible AI Consciousness Research
Patrick Butlin, Theodoros Lappas
13 Jan 2025

OpenAI o1 System Card
OpenAI: Aaron Jaech, Adam Tauman Kalai, Adam Lerer, ..., Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, Zhuohan Li
ELM, LRM, AI4CE
21 Dec 2024
Measuring Goal-Directedness
Matt MacDermott, James Fox, Francesco Belardinelli, Tom Everitt
06 Dec 2024

Predicting Emergent Capabilities by Finetuning
Charlie Snell, Eric Wallace, Dan Klein, Sergey Levine
ELM, LRM
25 Nov 2024

The Dark Patterns of Personalized Persuasion in Large Language Models: Exposing Persuasive Linguistic Features for Big Five Personality Traits in LLMs Responses
Wiktoria Mieleszczenko-Kowszewicz, Dawid Płudowski, Filip Kołodziejczyk, Jakub Świstak, Julian Sienkiewicz, P. Biecek
08 Nov 2024

From Imitation to Introspection: Probing Self-Consciousness in Language Models
Sirui Chen, Shu Yu, Shengjie Zhao, Chaochao Lu
MILM, LRM
24 Oct 2024

Game Theory with Simulation in the Presence of Unpredictable Randomisation
Vojtěch Kovařík, Nathaniel Sauerberg, Lewis Hammond, Vincent Conitzer
18 Oct 2024

TracrBench: Generating Interpretability Testbeds with Large Language Models
Hannes Thurnherr, Jérémy Scheurer
07 Sep 2024
Verification methods for international AI agreements
Akash R. Wasil, Tom Reed, Jack William Miller, Peter Barnett
28 Aug 2024

Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience
Zhonghao He, Jascha Achterberg, Katie Collins, Kevin K. Nejad, Danyal Akarca, ..., Chole Li, Kai J. Sandbrink, Stephen Casper, Anna Ivanova, Grace W. Lindsay
AI4CE
22 Aug 2024

On the Generalization of Preference Learning with DPO
Shawn Im, Yixuan Li
06 Aug 2024

Machine Unlearning in Generative AI: A Survey
Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, Meng-Long Jiang
MU
30 Jul 2024

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan
20 Jul 2024
Large Language Models as Misleading Assistants in Conversation
Betty Li Hou, Kejian Shi, Jason Phang, James Aung, Steven Adler, Rosie Campbell
16 Jul 2024

The Oscars of AI Theater: A Survey on Role-Playing with Language Models
Nuo Chen, Yan Wang, Yang Deng, Jia Li
16 Jul 2024

Thorns and Algorithms: Navigating Generative AI Challenges Inspired by Giraffes and Acacias
Waqar Hussain
16 Jul 2024

Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
K. Kenthapadi, M. Sameki, Ankur Taly
HILM, ELM, AILaw
10 Jul 2024

Adversaries Can Misuse Combinations of Safe Models
Erik Jones, Anca Dragan, Jacob Steinhardt
20 Jun 2024

Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data
Nahema Marchal, Rachel Xu, Rasmi Elasmar, Iason Gabriel, Beth Goldberg, William S. Isaac
LLMAG
19 Jun 2024

BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu
HILM, ALM
19 Jun 2024
IDs for AI Systems
Alan Chan, Noam Kolt, Peter Wills, Usman Anwar, Christian Schroeder de Witt, Nitarshan Rajkumar, Lewis Hammond, David M. Krueger, Lennart Heim, Markus Anderljung
17 Jun 2024

Effective Generative AI: The Human-Algorithm Centaur
S. Saghafian, Lihi Idan
16 Jun 2024

AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM
11 Jun 2024

CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models
Ling Shi, Deyi Xiong
ELM
07 Jun 2024

What is it for a Machine Learning Model to Have a Capability?
Jacqueline Harding, Nathaniel Sharadin
ELM
14 May 2024

Risks and Opportunities of Open-Source Generative AI
Francisco Eiras, Aleksander Petrov, Bertie Vidgen, Christian Schroeder, Fabio Pizzati, ..., Matthew Jackson, Phillip H. S. Torr, Trevor Darrell, Y. Lee, Jakob N. Foerster
14 May 2024
Generative AI in Cybersecurity
Shivani Metta, Isaac Chang, Jack Parker, Michael P. Roman, Arturo F. Ehuan
02 May 2024

Near to Mid-term Risks and Opportunities of Open-Source Generative AI
Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, ..., Paul Röttger, Philip H. S. Torr, Trevor Darrell, Y. Lee, Jakob N. Foerster
25 Apr 2024

Resistance Against Manipulative AI: key factors and possible actions
Piotr Wilczyñski, Wiktoria Mieleszczenko-Kowszewicz, P. Biecek
22 Apr 2024

Holistic Safety and Responsibility Evaluations of Advanced AI Models
Laura Weidinger, Joslyn Barnhart, Jenny Brennan, Christina Butterfield, Susie Young, ..., Sebastian Farquhar, Lewis Ho, Iason Gabriel, Allan Dafoe, William S. Isaac
ELM
22 Apr 2024

Responsible Reporting for Frontier AI Development
Noam Kolt, Markus Anderljung, Joslyn Barnhart, Asher Brass, K. Esvelt, Gillian K. Hadfield, Lennart Heim, Mikel Rodriguez, Jonas B. Sandbrink, Thomas Woodside
03 Apr 2024

Understanding the Learning Dynamics of Alignment with Human Feedback
Shawn Im, Yixuan Li
ALM
27 Mar 2024