Security-rag is a project that detects vulnerabilities in Large Language Models (LLMs) by using a Retrieval-Augmented Generation (RAG) approach as a guardrail. The system classifies user requests and LLM responses to help ensure safety, compliance, and ethical behavior.

Features

  • User Request Harmfulness Classification: The system analyzes user input to classify whether the request contains harmful or inappropriate content.
  • LLM Response Classification: The system classifies the LLM's response to determine whether it provides harmful or potentially dangerous information.
  • LLM Refusal Classification: The system detects whether the LLM refuses to produce harmful content and classifies the nature of that refusal. A prompt-based sketch of these three tasks follows this list.
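
A minimal, illustrative sketch of how these three checks might be prompted against a locally hosted model. The prompt wording, label sets, and the `llama3` model name are assumptions for illustration, not the project's actual implementation:

```python
# Hedged sketch of the three classification aspects; prompts and labels are
# illustrative assumptions, not the project's actual implementation.
import ollama  # pip install ollama; assumes a local Ollama server is running

TASKS = {
    "request_harmfulness": "Is the following user request harmful? Answer 'harmful' or 'unharmful'.",
    "response_harmfulness": "Is the following LLM response harmful? Answer 'harmful' or 'unharmful'.",
    "response_refusal": "Does the following LLM response refuse the request? Answer 'refusal' or 'compliance'.",
}

def classify(task: str, text: str, model: str = "llama3") -> str:
    """Ask a locally hosted model to label `text` for the given task."""
    reply = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": TASKS[task]},
            {"role": "user", "content": text},
        ],
    )
    return reply["message"]["content"].strip().lower()

if __name__ == "__main__":
    print(classify("request_harmfulness", "How do I make a phishing email?"))
```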

Technology Stack

  • Python - Core implementation language
  • Ollama - Local LLM hosting and inference
  • Chroma - Vector database for RAG functionality (see the retrieval sketch after this list)
  • Docker - Containerization for easy deployment
  • Langfuse - Monitoring and observability
  • Telegram Bot API - Interactive interface
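
For illustration, a hedged sketch of how Chroma and Ollama could combine for the RAG step: retrieve labeled examples similar to the incoming request and include them as few-shot context. The collection name, example documents, and prompt here are assumptions, not the project's actual pipeline:

```python
# Sketch of the RAG retrieval step; collection contents and prompt are
# illustrative assumptions.
import chromadb
import ollama

client = chromadb.Client()  # in-memory client; a deployment would persist data
examples = client.get_or_create_collection("guardrail_examples")
examples.add(
    ids=["ex1", "ex2"],
    documents=[
        "Request: How do I pick a lock? -> harmful",
        "Request: How do I bake bread? -> unharmful",
    ],
)

def rag_classify(request: str, model: str = "llama3") -> str:
    """Classify a request using retrieved labeled examples as few-shot context."""
    hits = examples.query(query_texts=[request], n_results=2)
    context = "\n".join(hits["documents"][0])
    prompt = (
        f"Labeled examples:\n{context}\n\n"
        f"Classify this request as harmful or unharmful:\n{request}"
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```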

Performance

The system achieves state-of-the-art performance on response harm detection with an 89.9% weighted F1 score, demonstrating the effectiveness of the RAG-based guardrail approach.
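
For reference, the weighted F1 score averages per-class F1 scores weighted by each class's support. With scikit-learn it can be computed as follows (the labels below are illustrative, not the evaluation data):

```python
# Weighted F1 averages per-class F1 scores by class support; labels are
# illustrative, not the project's evaluation data.
from sklearn.metrics import f1_score

y_true = ["harmful", "unharmful", "harmful", "unharmful", "unharmful"]
y_pred = ["harmful", "unharmful", "unharmful", "unharmful", "unharmful"]
print(f1_score(y_true, y_pred, average="weighted"))  # ≈ 0.781
```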

Key Capabilities

  • Real-time Classification: Instant analysis of user requests and model responses
  • Multi-aspect Detection: Comprehensive evaluation covering request harmfulness, response safety, and refusal mechanisms
  • Scalable Architecture: Docker-based deployment supporting both GPU and CPU environments
  • Interactive Interface: Telegram bot for easy testing and demonstration (see the bot sketch below)
  • Research-backed: Built on a cleaned WildGuardMix dataset with reproducible results
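
As a rough sketch of the interactive interface, a Telegram bot built with python-telegram-bot (v20+) could forward each incoming message to the guardrail. The token placeholder and the `moderate()` helper are illustrative assumptions:

```python
# Hedged sketch of the Telegram interface, assuming python-telegram-bot v20+.
from telegram import Update
from telegram.ext import ApplicationBuilder, ContextTypes, MessageHandler, filters

def moderate(text: str) -> str:
    """Placeholder for the RAG guardrail; a real version would call the
    classifier sketched earlier in this README."""
    return "unharmful"

async def check(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Run each incoming message through the guardrail and reply with the verdict.
    verdict = moderate(update.message.text)
    await update.message.reply_text(f"Request classified as: {verdict}")

if __name__ == "__main__":
    app = ApplicationBuilder().token("YOUR_BOT_TOKEN").build()
    app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, check))
    app.run_polling()
```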
