
A General Language Assistant as a Laboratory for Alignment

1 December 2021

Deep Ganguli


Papers citing "A General Language Assistant as a Laboratory for Alignment"

50 / 701 papers shown

Human Preferences for Constructive Interactions in Language Model Alignment

229

05 Mar 2025

Alchemist: Towards the Design of Efficient Online Continual Learning System

397

03 Mar 2025

Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models

997

03 Mar 2025

Distributionally Robust Reinforcement Learning with Human Feedback

Debmalya Mandal

Paulius Sasnauskas

Goran Radanović

189

01 Mar 2025

Societal Alignment Frameworks Can Improve LLM Alignment

...

1.0K

27 Feb 2025

Choices Speak Louder than Questions

385

26 Feb 2025

Shh, don't say that! Domain Certification in LLMs

International Conference on Learning Representations (ICLR), 2025

347

26 Feb 2025

Advantage-Guided Distillation for Preference Alignment in Small Language Models

International Conference on Learning Representations (ICLR), 2025

454

25 Feb 2025

AMPO: Active Multi-Preference Optimization for Self-play Preference Selection

341

25 Feb 2025

Larger or Smaller Reward Margins to Select Preferences for Alignment?

212

25 Feb 2025

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

772

110

24 Feb 2025

Single-pass Detection of Jailbreaking Input in Large Language Models

262

24 Feb 2025

Dynamic LLM Routing and Selection based on User Preferences: Balancing Performance, Cost, and Ethics

International Journal of Computer Applications (IJCA), 2024

233

23 Feb 2025

Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

319

22 Feb 2025

An LLM-Based Approach for Insight Generation in Data Analysis

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

Alberto Sánchez Pérez

232

20 Feb 2025

Faster WIND: Accelerating Iterative Best-of-N Distillation for LLM Alignment

International Conference on Artificial Intelligence and Statistics (AISTATS), 2024

371

20 Feb 2025

STaR-SQL: Self-Taught Reasoner for Text-to-SQL

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

198

20 Feb 2025

Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Yilei Tu

Andrew Xue

Freda Shi

394

17 Feb 2025

Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding

International Journal of Computer Vision (IJCV), 2025

216

17 Feb 2025

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

286

17 Feb 2025

Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models

412

17 Feb 2025

Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?

523

16 Feb 2025

LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits

1.2K

12 Feb 2025

Trustworthy AI: Safety, Bias, and Privacy -- A Survey

376

11 Feb 2025

Safety Reasoning with Guidelines

457

06 Feb 2025

Evaluation of Large Language Models via Coupled Token Generation

Manuel Gomez Rodriguez

367

03 Feb 2025

GuardReasoner: Towards Reasoning-based LLM Safeguards

...

592

30 Jan 2025

Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking

International Conference on Learning Representations (ICLR), 2024

378

28 Jan 2025

Multi-Modality Transformer for E-Commerce: Inferring User Purchase Intention to Bridge the Query-Product Gap

BigData Congress [Services Society] (BSS), 2024

Srivatsa Mallapragada

285

28 Jan 2025

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Knowledge Discovery and Data Mining (KDD), 2023

452

149

28 Jan 2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

593

21 Jan 2025

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Neural Information Processing Systems (NeurIPS), 2024

423

20 Jan 2025

FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

422

11 Jan 2025

Predictable Artificial Intelligence

Lexin Zhou

Pablo Antonio Moreno Casares

Fernando Martínez-Plumed

...

Konstantinos Voudouris

José Hernández-Orallo

508

08 Jan 2025

Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

327

07 Jan 2025

PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations

551

126

03 Jan 2025

Multi-PA: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models

254

27 Dec 2024

Lies, Damned Lies, and Distributional Language Statistics: Persuasion and Deception with Large Language Models

Cameron R. Jones

Benjamin Bergen

460

22 Dec 2024

Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs

318

22 Dec 2024

SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

589

19 Dec 2024

Generative Prompt Internalization

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

206

24 Nov 2024

Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

385

19 Nov 2024

Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering

...

587

18 Nov 2024

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

546

15 Nov 2024

Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

573

12 Nov 2024

Can LLMs make trade-offs involving stipulated pain and pleasure states?

Blaise Agüera y Arcas

Jonathan Birch

218

01 Nov 2024

Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector

164

30 Oct 2024

f-PO: Generalizing Preference Optimization with f-divergence Minimization

International Conference on Artificial Intelligence and Statistics (AISTATS), 2024

386

29 Oct 2024

CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants

307

28 Oct 2024

Transferable Post-training via Inverse Value Learning

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

236

28 Oct 2024