
Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy

Main: 8 pages, 11 figures, 12 tables. Bibliography: 2 pages. Appendix: 23 pages.
Abstract

Large Language Models (LLMs) are gaining traction as a method to generate consensus statements and aggregate preferences in digital democracy experiments. Yet, LLMs could introduce critical vulnerabilities into these systems. Here, we examine the vulnerability and robustness of off-the-shelf consensus-generating LLMs to prompt-injection attacks, in which adversarial text is injected to amplify particular viewpoints, erase certain opinions, or divert the consensus toward unrelated or irrelevant topics. We construct attack-free and adversarial variants of prompts containing public policy questions and opinion texts, classify opinion and consensus valences with a fine-tuned BERT model, and estimate LLM-human majority agreement rates. Across topics, default LLaMA 3.1 8B Instruct, GPT-4.1 Nano, and Apertus 8B exhibit widespread vulnerability, especially when agreement and disagreement are finely balanced, for attacks that shift consensus toward positions aligned with GB-unionist conservative manifestos relative to pro-independence left manifestos, and for rational, instruction-like rhetorical strategies. A robustness pipeline combining GPT-OSS-SafeGuard injection detection, structured opinion representations, and GSPO-based reinforcement learning substantially reduces directional failures whenever the underlying consensus has a clear positive or negative valence. These findings advance our understanding of both the vulnerabilities and the potential defenses of consensus-generating LLMs in digital democracy applications.
