A General Language Assistant as a Laboratory for Alignment

1 December 2021

Deep Ganguli

Papers citing "A General Language Assistant as a Laboratory for Alignment"

50 / 701 papers shown

Transferable Post-training via Inverse Value Learning

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

236

28 Oct 2024

Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models

...

277

26 Oct 2024

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

International Conference on Learning Representations (ICLR), 2024

571

23 Oct 2024

Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models

262

17 Oct 2024

Looking Inward: Language Models Can Learn About Themselves by Introspection

251

17 Oct 2024

A Survey on Data Synthesis and Augmentation for Large Language Models

...

425

16 Oct 2024

Exploring Model Kinship for Merging Large Language Models

477

16 Oct 2024

JudgeBench: A Benchmark for Evaluating LLM-based Judges

International Conference on Learning Representations (ICLR), 2024

Ion Stoica

713

146

16 Oct 2024

Improving Instruction-Following in Language Models through Activation Steering

International Conference on Learning Representations (ICLR), 2024

Alessandro Stolfo

Vidhisha Balachandran

443

15 Oct 2024

DeformPAM: Data-Efficient Learning for Long-horizon Deformable Object Manipulation via Preference-based Action Alignment

IEEE International Conference on Robotics and Automation (ICRA), 2024

Cewu Lu

323

15 Oct 2024

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models

Juseong Jin

Chang Wook Jeong

271

13 Oct 2024

RMB: Comprehensively Benchmarking Reward Models in LLM Alignment

International Conference on Learning Representations (ICLR), 2024

...

Xuanjing Huang

428

13 Oct 2024

Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Srikanth Doss Kadarundalagi Raghuram Doss

Lluís Marquez

Miguel Ballesteros

Yassine Benajiba

283

11 Oct 2024

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

634

10 Oct 2024

MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization

International Conference on Learning Representations (ICLR), 2024

588

10 Oct 2024

Self-Boosting Large Language Models with Synthetic Preference Data

International Conference on Learning Representations (ICLR), 2024

Qingxiu Dong

Zhifang Sui

242

09 Oct 2024

Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond

Shanshan Han

599

09 Oct 2024

WAPITI: A Watermark for Finetuned Open-Source LLMs

299

09 Oct 2024

Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning

Neural Information Processing Systems (NeurIPS), 2024

Xiaolin Ai

394

08 Oct 2024

Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation

Chaithanya Bandi

Abir Harrasse

LLMAG ELM

265

07 Oct 2024

Superficial Safety Alignment Hypothesis

Jianwei Li

Jung-Eun Kim

330

07 Oct 2024

Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Helena Bonaldi

Greta Damo

Nicolás Benjamín Ocampo

Elena Cabrio

S. Villata

Marco Guerini

172

04 Oct 2024

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

International Conference on Learning Representations (ICLR), 2024

979

03 Oct 2024

CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning

314

03 Oct 2024

DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life

International Conference on Learning Representations (ICLR), 2024

Yu Ying Chiu

Liwei Jiang

Yejin Choi

326

03 Oct 2024

Erasing Conceptual Knowledge from Language Models

444

03 Oct 2024

Generative Reward Models

245

02 Oct 2024

Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown

Xingzhou Lou

Dong Yan

Wei Shen

Yuzi Yan

Jian Xie

Junge Zhang

413

01 Oct 2024

Wait, but Tylenol is Acetaminophen... Investigating and Improving Language Models' Ability to Resist Requests for Misinformation

Lizhou Fan

Danielle S. Bitterman

LM&MA

252

30 Sep 2024

The Perfect Blend: Redefining RLHF with Mixture of Judges

Tengyu Xu

Eryk Helenowski

Karthik Abinav Sankararaman

...

386

30 Sep 2024

A Survey on the Honesty of Large Language Models

Siheng Li

Cheng Yang

Taiqiang Wu

Chufan Shi

Yuji Zhang

...

Ngai Wong

Wai Lam

297

27 Sep 2024

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

239

26 Sep 2024

Exposing Assumptions in AI Benchmarks through Cognitive Modelling

Jonathan H. Rystrøm

Kenneth C. Enevoldsen

189

25 Sep 2024

Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

222

25 Sep 2024

RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code

International Conference on Automated Software Engineering (ASE), 2024

Jiachi Chen

Yanlin Wang

Ting Chen

Zibin Zheng

AAML

121

23 Sep 2024

Direct Judgement Preference Optimization

374

23 Sep 2024

Contextualized AI for Cyber Defense: An Automated Survey using LLMs

International Conference on Security of Information and Networks (SIN), 2024

Christoforus Yoga Haryanto

200

20 Sep 2024

Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey

Sourav Verma

RALM 3DV

261

20 Sep 2024

Aligning Language Models Using Follow-up Likelihood as Reward Signal

AAAI Conference on Artificial Intelligence (AAAI), 2024

Chen Zhang

Haizhou Li

321

20 Sep 2024

Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino

Pacific Asia Conference on Language, Information and Computation (PACLIC), 2024

Hamsawardhini Rengarajan

William-Chandra Tjhi

Alham Fikri Aji

452

20 Sep 2024

Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models

The Web Conference (WWW), 2024

Yazhou Zhang

373

19 Sep 2024

AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

396

13 Sep 2024

Alignment of Diffusion Models: Fundamentals, Challenges, and Future

464

11 Sep 2024

AGR: Age Group fairness Reward for Bias Mitigation in LLMs

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Shuirong Cao

Ruoxi Cheng

Zhiqiang Wang

178

06 Sep 2024

Programming Refusal with Conditional Activation Steering

International Conference on Learning Representations (ICLR), 2024

Bruce W. Lee

Inkit Padhi

Karthikeyan N. Ramamurthy

502

06 Sep 2024

Efficient LLM Context Distillation

375

03 Sep 2024

CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

279

03 Sep 2024

User-Driven Value Alignment: Understanding Users' Perceptions and Strategies for Addressing Biased and Discriminatory Statements in AI Companions

International Conference on Human Factors in Computing Systems (CHI), 2024

Xuhui Zhou

314

01 Sep 2024

Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data

Spencer Whitehead

Jacob Phillips

Sean Hendryx

183

30 Aug 2024

Legilimens: Practical and Unified Content Moderation for Large Language Model Services

Conference on Computer and Communications Security (CCS), 2024

356

28 Aug 2024