Prompt injection is a type of attack where malicious users craft prompts that trick or manipulate a language model into:
- Ignoring system-level or developer instructions
- Producing harmful, biased, or manipulated content
- Bypassing safety mechanisms or revealing hidden data
As LLMs are increasingly embedded in chatbots, search, writing assistants, and decision-making tools, prompt injection threatens:
- 🔓 User privacy
- 🧨 Model safety
- 📉 Trustworthiness of responses
- 💸 Commercial fairness (e.g., biased recommendations)
While this document introduces the basics of prompt injection, defending against it requires:
- Prompt sanitization
- Clear separation between system and user inputs
- Use of classifiers like Prompt Guard
- Fine-tuned moderation models like LLaMA Guard
We welcome developers, researchers, and prompt engineers to collaborate!
💬 You can:
- Share new examples of prompt injection
- Suggest mitigation techniques
- Add links to academic papers or blog posts
- Build tools or datasets to detect/guard prompt attacks
Let’s make the future of AI more secure — together!