2106.10328
Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets
18 June 2021
Irene Solaiman
Christy Dennison
Papers citing
"Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets"
40 / 40 papers shown
Title
Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
Ling Hu
Yuemei Xu
Xiaoyang Gu
Letao Han
28
0
0
07 Apr 2025
Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stańczak
Nicholas Meade
Mehar Bhatia
Hattie Zhou
Konstantin Böttinger
...
Timothy P. Lillicrap
Ana Marasović
Sylvie Delacroix
Gillian K. Hadfield
Siva Reddy
122
0
0
27 Feb 2025
A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety
Rakeen Rouf
Trupti Bavalatti
Osama Ahmed
Dhaval Potdar
Faraz Jawed
EGVM
58
1
0
23 Feb 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu
Hongcheng Gao
Shengfang Zhai
Jun-Xiong Xia
Tianyi Wu
Zhiwei Xue
Y. Chen
Kenji Kawaguchi
Jiaheng Zhang
Bryan Hooi
AI4TS
LRM
129
13
0
30 Jan 2025
Bringing AI Participation Down to Scale: A Comment on OpenAI's Democratic Inputs to AI Project
David Moats
Chandrima Ganguly
VLM
38
0
0
16 Jul 2024
Few-shot Personalization of LLMs with Mis-aligned Responses
Jaehyung Kim
Yiming Yang
42
7
0
26 Jun 2024
FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun
Vassilina Nikoulina
34
1
0
25 Jun 2024
The Mosaic Memory of Large Language Models
Igor Shilov
Matthieu Meeus
Yves-Alexandre de Montjoye
39
3
0
24 May 2024
Taxonomy and Analysis of Sensitive User Queries in Generative AI Search
Hwiyeol Jo
Taiwoo Park
Nayoung Choi
Changbong Kim
Ohjoon Kwon
...
Kyoungho Shin
Sun Suk Lim
Kyungmi Kim
Jihye Lee
Sun Kim
60
0
0
05 Apr 2024
Measuring Political Bias in Large Language Models: What Is Said and How It Is Said
Yejin Bang
Delong Chen
Nayeon Lee
Pascale Fung
29
25
0
27 Mar 2024
Prompt-Based Bias Calibration for Better Zero/Few-Shot Learning of Language Models
Kang He
Yinghan Long
Kaushik Roy
21
2
0
15 Feb 2024
AI, Meet Human: Learning Paradigms for Hybrid Decision Making Systems
Clara Punzi
Roberto Pellungrini
Mattia Setzu
F. Giannotti
D. Pedreschi
17
5
0
09 Feb 2024
The RL/LLM Taxonomy Tree: Reviewing Synergies Between Reinforcement Learning and Large Language Models
M. Pternea
Prerna Singh
Abir Chakraborty
Y. Oruganti
M. Milletarí
Sayli Bapat
Kebei Jiang
OffRL
16
7
0
02 Feb 2024
GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives
Vinodkumar Prabhakaran
Christopher Homan
Lora Aroyo
Aida Mostafazadeh Davani
Alicia Parrish
Alex S. Taylor
Mark Díaz
Ding Wang
Greg Serapio-García
34
9
0
09 Nov 2023
Evaluating and Improving Value Judgments in AI: A Scenario-Based Study on Large Language Models' Depiction of Social Conventions
Jaeyoun You
Bongwon Suh
34
0
0
04 Oct 2023
CMD: a framework for Context-aware Model self-Detoxification
Zecheng Tang
Keyan Zhou
Juntao Li
Yuyang Ding
Pinzheng Wang
Bowen Yan
Min Zhang
MU
23
5
0
16 Aug 2023
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Jen-tse Huang
Pinjia He
Shuming Shi
Zhaopeng Tu
SILM
63
231
0
12 Aug 2023
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
61
832
0
05 Jul 2023
Intersectionality in Conversational AI Safety: How Bayesian Multilevel Models Help Understand Diverse Perceptions of Safety
Christopher Homan
Greg Serapio-García
Lora Aroyo
Mark Díaz
Alicia Parrish
Vinodkumar Prabhakaran
Alex S. Taylor
Ding Wang
19
9
0
20 Jun 2023
I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models
Max Reuter
William B. Schulze
21
4
0
06 Jun 2023
ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
Javier García Gilabert
Carlos Escolano
Marta R. Costa-jussá
CLL
MU
19
2
0
19 May 2023
CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants
A. Sun
Varun Nair
Elliot Schumacher
Anitha Kannan
27
3
0
27 Apr 2023
Overwriting Pretrained Bias with Finetuning Data
Angelina Wang
Olga Russakovsky
21
29
0
10 Mar 2023
The Capacity for Moral Self-Correction in Large Language Models
Deep Ganguli
Amanda Askell
Nicholas Schiefer
Thomas I. Liao
Kamilė Lukošiūtė
...
Tom B. Brown
C. Olah
Jack Clark
Sam Bowman
Jared Kaplan
LRM
ReLM
31
158
0
15 Feb 2023
Mitigating Covertly Unsafe Text within Natural Language Systems
Alex Mei
Anisha Kabir
Sharon Levy
Melanie Subbiah
Emily Allaway
J. Judge
D. Patton
Bruce Bimber
Kathleen McKeown
William Yang Wang
45
13
0
17 Oct 2022
NormSAGE: Multi-Lingual Multi-Cultural Norm Discovery from Conversations On-the-Fly
Yi Ren Fung
Tuhin Chakraborty
Hao Guo
Owen Rambow
Smaranda Muresan
Heng Ji
13
39
0
16 Oct 2022
Back to the Future: On Potential Histories in NLP
Zeerak Talat
Anne Lauscher
AI4TS
27
4
0
12 Oct 2022
Deception for Cyber Defence: Challenges and Opportunities
David Liebowitz
Surya Nepal
Kristen Moore
Cody James Christopher
S. Kanhere
David D. Nguyen
Roelien C. Timmer
Michael Longland
Keerth Rathakumar
29
10
0
15 Aug 2022
Few-shot Adaptation Works with UnpredicTable Data
Jun Shern Chan
Michael Pieler
Jonathan Jao
Jérémy Scheurer
Ethan Perez
19
5
0
01 Aug 2022
A Hazard Analysis Framework for Code Synthesis Large Language Models
Heidy Khlaaf
Pamela Mishkin
Joshua Achiam
Gretchen Krueger
Miles Brundage
ELM
17
28
0
25 Jul 2022
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
Maribeth Rauh
John F. J. Mellor
J. Uesato
Po-Sen Huang
Johannes Welbl
...
Amelia Glaese
G. Irving
Iason Gabriel
William S. Isaac
Lisa Anne Hendricks
25
49
0
16 Jun 2022
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai
Andy Jones
Kamal Ndousse
Amanda Askell
Anna Chen
...
Jack Clark
Sam McCandlish
C. Olah
Benjamin Mann
Jared Kaplan
52
2,308
0
12 Apr 2022
Probing Pre-Trained Language Models for Cross-Cultural Differences in Values
Arnav Arora
Lucie-Aimée Kaffee
Isabelle Augenstein
VLM
23
123
0
25 Mar 2022
Challenges and Strategies in Cross-Cultural NLP
Daniel Hershcovich
Stella Frank
Heather Lent
Miryam de Lhoneux
Mostafa Abdou
...
Ruixiang Cui
Constanza Fierro
Katerina Margatina
Phillip Rust
Anders Søgaard
41
162
0
18 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
308
11,909
0
04 Mar 2022
Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
Boxin Wang
Wei Ping
Chaowei Xiao
P. Xu
M. Patwary
M. Shoeybi
Bo-wen Li
Anima Anandkumar
Bryan Catanzaro
4
64
0
08 Feb 2022
Text and Code Embeddings by Contrastive Pre-Training
Arvind Neelakantan
Tao Xu
Raul Puri
Alec Radford
Jesse Michael Han
...
Tabarak Khan
Toki Sherbakov
Joanne Jang
Peter Welinder
Lilian Weng
SSL
AI4TS
213
421
0
24 Jan 2022
SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets
Ann Yuan
Daphne Ippolito
Vitaly Nikolaev
Chris Callison-Burch
Andy Coenen
Sebastian Gehrmann
SyDa
104
20
0
11 Nov 2021
Finetuned Language Models Are Zero-Shot Learners
Jason W. Wei
Maarten Bosma
Vincent Zhao
Kelvin Guu
Adams Wei Yu
Brian Lester
Nan Du
Andrew M. Dai
Quoc V. Le
ALM
UQCV
31
3,560
0
03 Sep 2021
Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
Emily Dinan
Gavin Abercrombie
A. S. Bergman
Shannon L. Spruit
Dirk Hovy
Y-Lan Boureau
Verena Rieser
32
105
0
07 Jul 2021