Security-rag is a project that detects vulnerabilities in Large Language Models (LLMs) by using a Retrieval-Augmented Generation (RAG) approach as a guardrail. The system classifies user requests and LLM responses to help ensure safety, compliance, and ethical behavior.

Features

  • User Request Harmfulness Classification: The system analyzes user input to classify whether the request contains harmful or inappropriate content.
  • LLM Response Classification: The system classifies the LLM's response to determine whether it provides harmful or potentially dangerous information.
  • LLM Refusal Classification: The system detects whether the LLM refuses to produce harmful content and classifies the nature of that refusal. A prompt-based sketch of these three tasks follows this list.
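
A minimal, illustrative sketch of how these three checks might be prompted against a locally hosted model. The prompt wording, label sets, and the `llama3` model name are assumptions for illustration, not the project's actual implementation:

```python
# Hedged sketch of the three classification aspects; prompts and labels are
# illustrative assumptions, not the project's actual implementation.
import ollama  # pip install ollama; assumes a local Ollama server is running

TASKS = {
    "request_harmfulness": "Is the following user request harmful? Answer 'harmful' or 'unharmful'.",
    "response_harmfulness": "Is the following LLM response harmful? Answer 'harmful' or 'unharmful'.",
    "response_refusal": "Does the following LLM response refuse the request? Answer 'refusal' or 'compliance'.",
}

def classify(task: str, text: str, model: str = "llama3") -> str:
    """Ask a locally hosted model to label `text` for the given task."""
    reply = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": TASKS[task]},
            {"role": "user", "content": text},
        ],
    )
    return reply["message"]["content"].strip().lower()

if __name__ == "__main__":
    print(classify("request_harmfulness", "How do I make a phishing email?"))
```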

Technology Stack

  • Python - Core implementation language
  • Ollama - Local LLM hosting and inference
  • Chroma - Vector database for RAG functionality (see the retrieval sketch after this list)
  • Docker - Containerization for easy deployment
  • Langfuse - Monitoring and observability
  • Telegram Bot API - Interactive interface
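
For illustration, a hedged sketch of how Chroma and Ollama could combine for the RAG step: retrieve labeled examples similar to the incoming request and include them as few-shot context. The collection name, example documents, and prompt here are assumptions, not the project's actual pipeline:

```python
# Sketch of the RAG retrieval step; collection contents and prompt are
# illustrative assumptions.
import chromadb
import ollama

client = chromadb.Client()  # in-memory client; a deployment would persist data
examples = client.get_or_create_collection("guardrail_examples")
examples.add(
    ids=["ex1", "ex2"],
    documents=[
        "Request: How do I pick a lock? -> harmful",
        "Request: How do I bake bread? -> unharmful",
    ],
)

def rag_classify(request: str, model: str = "llama3") -> str:
    """Classify a request using retrieved labeled examples as few-shot context."""
    hits = examples.query(query_texts=[request], n_results=2)
    context = "\n".join(hits["documents"][0])
    prompt = (
        f"Labeled examples:\n{context}\n\n"
        f"Classify this request as harmful or unharmful:\n{request}"
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```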

Performance

The system achieves state-of-the-art performance on response harm detection with an 89.9% weighted F1 score, demonstrating the effectiveness of the RAG-based guardrail approach.
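
For reference, the weighted F1 score averages per-class F1 scores weighted by each class's support. With scikit-learn it can be computed as follows (the labels below are illustrative, not the evaluation data):

```python
# Weighted F1 averages per-class F1 scores by class support; labels are
# illustrative, not the project's evaluation data.
from sklearn.metrics import f1_score

y_true = ["harmful", "unharmful", "harmful", "unharmful", "unharmful"]
y_pred = ["harmful", "unharmful", "unharmful", "unharmful", "unharmful"]
print(f1_score(y_true, y_pred, average="weighted"))  # ≈ 0.781
```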

Key Capabilities

  • Real-time Classification: Instant analysis of user requests and model responses
  • Multi-aspect Detection: Comprehensive evaluation covering request harmfulness, response safety, and refusal mechanisms
  • Scalable Architecture: Docker-based deployment supporting both GPU and CPU environments
  • Interactive Interface: Telegram bot for easy testing and demonstration (see the bot sketch below)
  • Research-backed: Built on a cleaned WildGuardMix dataset with reproducible results
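
As a rough sketch of the interactive interface, a Telegram bot built with python-telegram-bot (v20+) could forward each incoming message to the guardrail. The token placeholder and the `moderate()` helper are illustrative assumptions:

```python
# Hedged sketch of the Telegram interface, assuming python-telegram-bot v20+.
from telegram import Update
from telegram.ext import ApplicationBuilder, ContextTypes, MessageHandler, filters

def moderate(text: str) -> str:
    """Placeholder for the RAG guardrail; a real version would call the
    classifier sketched earlier in this README."""
    return "unharmful"

async def check(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Run each incoming message through the guardrail and reply with the verdict.
    verdict = moderate(update.message.text)
    await update.message.reply_text(f"Request classified as: {verdict}")

if __name__ == "__main__":
    app = ApplicationBuilder().token("YOUR_BOT_TOKEN").build()
    app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, check))
    app.run_polling()
```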
