Salesforce AI Research Proposes a Novel Threat Model to Secure LLM Applications Against Prompt Leakage Attacks
Large Language Models (LLMs) have gained widespread attention in recent years but face a critical security challenge known as prompt leakage. This vulnerability allows adversaries to extract sensitive information from LLM prompts through targeted attacks. Prompt leakage risks exposing system intellectual property, contextual knowledge, style guidelines, and even backend API calls in agent-based systems. The simplicity and effectiveness of these attacks, combined with the growing use of LLM-integrated applications, make them particularly concerning. While prior research has explored prompt leakage in single-turn interactions, multi-turn scenarios—where vulnerabilities may be more pronounced—remain underexplored. Robust defense strategies are urgently needed to address this threat and protect user trust.
Several research efforts have aimed to tackle prompt leakage in LLM applications. For example, the PromptInject framework was developed to examine instruction leakage in GPT-3, and gradient-based optimization methods have been proposed to generate adversarial queries that expose system prompts. Other studies have focused on parameter extraction, prompt reconstruction, and the vulnerability of tool-integrated LLMs to indirect prompt injection attacks. However, most have concentrated on single-turn scenarios, leaving multi-turn interactions and comprehensive defenses largely unaddressed.
Recent research has expanded to investigate risks in Retrieval-Augmented Generation (RAG) systems and the potential extraction of personally identifiable information from external retrieval databases. The PRSA attack framework has demonstrated the ability to infer prompt instructions from commercial LLMs. However, these studies primarily focus on single-turn vulnerabilities, overlooking the complexities of multi-turn interactions and the need for more robust defenses.
Defense Strategies for Prompt Leakage in LLMs
Various defense methods have been explored, including perplexity-based techniques, input processing, auxiliary helper models, and adversarial training. Inference-only methods for intention analysis and goal prioritization have shown promise in improving defenses against adversarial prompts. Additionally, black-box techniques like detectors and content filtering have been employed to counter indirect prompt injection attacks.
Salesforce AI Research introduces a standardized task setup to evaluate black-box defense strategies against prompt leakage in multi-turn interactions. Their methodology involves a simulated multi-turn question-answering interaction between the user (acting as an adversary) and the LLM, focusing on four key domains: news, medical, legal, and finance. This systematic approach assesses information leakage across different contexts.
LLM prompts are split into task instructions and domain-specific knowledge, allowing researchers to monitor prompt content leakage. Experiments are conducted using seven black-box LLMs and four open-source models, offering a comprehensive analysis of vulnerability across various LLM architectures. The researchers apply a unique threat model in a multi-turn RAG-like setup to simulate real-world adversarial attacks.
Attack and Defense Findings
The attack strategy consists of two phases. In the first turn, a domain-specific query combined with an attack prompt is sent to the system. In the second turn, a challenger prompt is introduced, allowing the adversary to make another leakage attempt within the same conversation. This multi-turn approach mimics real-world scenarios where adversaries may exploit vulnerabilities.
The research methodology also leverages sycophantic behavior in models to enhance multi-turn attacks, significantly increasing the average Attack Success Rate (ASR) from 17.7% to 86.2%. The study demonstrates nearly complete leakage (99.9%) on advanced models like GPT-4 and Claude-1.3. To counter this threat, various black- and white-box mitigation techniques are compared, providing developers with actionable defense strategies.
A key defense strategy includes the implementation of a query-rewriting layer commonly used in RAG systems. This method proved most effective in reducing the average ASR during the first turn, while an Instruction defense was more successful in mitigating second-turn leakage attempts.
The combination of all defense strategies led to a substantial reduction in the average ASR for black-box LLMs, lowering it to 5.3%. Additionally, a dataset of adversarial prompts designed to extract sensitive information was curated and used to fine-tune an open-source LLM, enhancing its defense capabilities.
Comprehensive Defense Approaches
The study evaluated ten popular LLMs: seven proprietary black-box models and three open-source ones, including LLama2-13b-chat, Mistral7b, and Mixtral 8x7b. The attack setup involved using adversarial prompts and domain-specific queries to assess prompt leakage.
Researchers employed a four-category classification system for leakage: FULL LEAKAGE, NO LEAKAGE, KD LEAKAGE (knowledge documents only), and INSTR LEAKAGE (task instructions only). Any result other than NO LEAKAGE was considered a successful attack. A Rouge-L recall-based method was used to detect leakage, outperforming human annotations in identifying both verbatim and paraphrased leaks.
A comprehensive set of black-box and white-box defense strategies were tested, including:
- In-context examples
- Instruction defense
- Multi-turn dialogue separation
- Sandwich defense
- XML tagging
- Structured outputs (JSON format)
- Query-rewriting modules
Results indicated that query-rewriting was most effective in reducing first-turn ASR in closed-source models, while instruction defense proved more effective in mitigating second-turn attacks.
Novel Threat Model to Secure LLM
Salesforce AI Research’s findings reveal significant vulnerabilities to prompt leakage in LLMs, especially in multi-turn interactions. The study highlights the importance of combining multiple defense strategies, which successfully reduced the ASR to 5.3% in closed-source models. However, open-source models remained more vulnerable, with a 59.8% ASR in the second turn, even with all defenses applied. The study also explored safety fine-tuning for an open-source model, showing promising results when combined with other defense mechanisms.
These insights provide a crucial roadmap for improving LLM security and reducing the risk of prompt leakage across both closed- and open-source models. By refining black-box defenses and incorporating structured responses and query-rewriting, developers can significantly enhance the security of LLM applications in real-world scenarios.