Lexi-RAG

Privacy-first legal Retrieval-Augmented Generation (RAG) system combining encrypted document storage, hybrid semantic + keyword retrieval, answer re-ranking, and role-enforced vector filtering.

Business Context

Legal teams manage highly sensitive documents that must remain encrypted at rest while still being searchable for fast and accurate case research.

Lexi-RAG was built to address a fundamental tension:

How can confidential legal data remain encrypted while still enabling high-quality semantic search and contextual AI reasoning?

The system delivers secure document ingestion, strict role-based access control, citation-backed responses, and context-aware legal dialogue designed for multi-tenant legal environments.

Engineering Architecture

Dual-Database Security Architecture

MongoDB stores AES-256-GCM encrypted legal documents, while Qdrant stores 1024-dimensional embeddings for semantic retrieval. Security is enforced at query time through indexed role-based filtering rather than post-processing.
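A minimal sketch of the encrypt-before-store half of this design, using Python's `cryptography` package. The storage calls to MongoDB and Qdrant are omitted, and the function names are illustrative rather than the actual Lexi-RAG API:

```python
# Sketch of the dual-write ingestion path: plaintext is encrypted with
# AES-256-GCM before it reaches MongoDB; only the embedding vector plus
# an access-control payload would go to Qdrant. Names are illustrative.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_document(plaintext: str, key: bytes) -> dict:
    """Encrypt a document for storage, with a fresh 96-bit nonce per record."""
    nonce = os.urandom(12)  # 12 bytes is the standard GCM nonce size
    ciphertext = AESGCM(key).encrypt(nonce, plaintext.encode("utf-8"), None)
    return {"nonce": nonce, "ciphertext": ciphertext}

def decrypt_document(record: dict, key: bytes) -> str:
    return AESGCM(key).decrypt(
        record["nonce"], record["ciphertext"], None
    ).decode("utf-8")

key = AESGCM.generate_key(bit_length=256)  # 32-byte AES-256 key
record = encrypt_document("Clause 7.2: confidentiality obligations", key)
```

GCM provides authenticated encryption, so any tampering with the stored ciphertext is detected at decryption time rather than silently producing garbage.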

Hybrid Semantic + Keyword Retrieval Pipeline

The hybrid design combines dense semantic embeddings (BGE-M3) with keyword-sensitive retrieval to handle both contextual legal reasoning and structured references such as section numbers or clauses. A context-aware router determines when to rely on conversational memory versus fresh vector retrieval.
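One common way to merge a dense (semantic) ranking with a keyword (lexical) ranking is reciprocal rank fusion; the sketch below assumes that fusion rule, though Lexi-RAG's exact merge logic may differ:

```python
# Illustrative reciprocal rank fusion (RRF): each document earns
# 1 / (k + rank) from every ranking it appears in, so items ranked
# highly by either retriever surface in the merged list.
def rrf_merge(dense: list[str], keyword: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense, keyword):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]    # semantic nearest neighbours
keyword_hits = ["doc_c", "doc_d", "doc_a"]  # e.g. exact hits on "Section 12(b)"
merged = rrf_merge(dense_hits, keyword_hits)
```

Documents found by both retrievers (here `doc_a` and `doc_c`) accumulate score from both lists and rise to the top, which is exactly the behaviour wanted for queries that mix contextual language with structured references.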

Local Embedding Engine (Security-Oriented Design)

The 2GB BGE-M3 embedding model is loaded once at startup and executed locally. This ensures embeddings are generated without exposing raw legal text externally, adding an additional privacy layer. BGE-M3 was selected for its ability to support both semantic similarity and keyword-aware retrieval.
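The load-once pattern can be sketched as below; `load_bge_m3` is a hypothetical stand-in for the real loader (e.g. FlagEmbedding's BGE-M3 model class), since loading 2GB of weights is not reproducible here:

```python
# Load-once pattern for a large local embedding model: the first call pays
# the full load cost, every later call reuses the same in-process instance,
# and raw legal text never leaves the process.
from functools import lru_cache

def load_bge_m3():
    # Hypothetical stand-in; the real loader reads ~2GB of model weights.
    return object()

@lru_cache(maxsize=1)
def get_embedder():
    return load_bge_m3()  # expensive load runs exactly once per process
```

In a FastAPI backend this call would typically be made during application startup (e.g. in a lifespan handler) so the first user request does not absorb the load latency.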

Role-Based Vector Filtering (RBAC at Query Time)

A 4-tier hierarchy (Partner → Associate → Staff → Client) is enforced directly within Qdrant payload filters. This guarantees zero cross-privilege leakage by embedding access rules directly into the vector search layer.
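A sketch of what query-time role enforcement can look like: the caller's tier is expanded to the set of tiers it may read, and that set becomes a payload filter evaluated inside the vector search itself. The field name and filter shape below mirror Qdrant's JSON filter syntax but are illustrative, not Lexi-RAG's actual schema:

```python
# Query-time RBAC: access rules travel with the vector query, so
# out-of-scope documents are excluded by the index itself rather than
# stripped out after retrieval.
HIERARCHY = ["client", "staff", "associate", "partner"]  # low -> high

def visible_roles(caller_role: str) -> list[str]:
    """A higher tier can read everything at or below its own level."""
    return HIERARCHY[: HIERARCHY.index(caller_role) + 1]

def role_filter(caller_role: str) -> dict:
    # Shaped like a Qdrant payload filter with a match-any condition.
    return {
        "must": [
            {"key": "access_role", "match": {"any": visible_roles(caller_role)}}
        ]
    }
```

Because the filter is part of the search request, a Client query can never even score a Partner-only chunk, which is what closes the leakage window that post-filtering leaves open.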

Answer Re-Ranking Layer

Retrieved chunks are re-ranked before final generation to ensure the most legally relevant and contextually aligned passages are prioritized. This improves citation precision and reduces noise from semantically similar but irrelevant chunks.
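The shape of that step is sketched below. The `relevance` function here is a toy lexical-overlap stand-in for whatever stronger scorer (e.g. a cross-encoder) the real pipeline uses; only the re-order-then-truncate structure is the point:

```python
# Generic re-ranking step: chunks arrive ordered by vector similarity;
# a second, stronger scorer re-orders them before generation.
def relevance(query: str, chunk: str) -> float:
    # Toy stand-in scorer: fraction of query tokens present in the chunk.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    return sorted(chunks, key=lambda ch: relevance(query, ch), reverse=True)[:top_k]

chunks = [
    "The parties agree to binding arbitration in court proceedings.",
    "Termination for convenience requires 30 days notice.",
    "Section 9 arbitration shall be governed by state law.",
]
ranked = rerank("arbitration under section 9", chunks, top_k=2)
```

Truncating to `top_k` after re-ranking is what trims the "semantically similar but irrelevant" chunks out of the generation context.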

Context-Aware Conversational Layer

The FastAPI backend maintains structured chat sessions, enabling multi-turn reasoning. The system preserves contextual continuity between the user and the AI while selectively triggering retrieval only when needed.
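A toy version of that routing decision, assuming a simple heuristic (the cue list and thresholds are illustrative, not the production logic): short anaphoric follow-ups are answered from conversational memory, while queries introducing new subject matter trigger a fresh vector search.

```python
import re

# Pronouns and back-references that suggest the user is continuing
# the previous thread rather than opening a new line of research.
FOLLOW_UP_CUES = {"it", "that", "this", "they", "above", "previous", "earlier"}

def needs_retrieval(query: str, has_history: bool) -> bool:
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    if has_history and tokens & FOLLOW_UP_CUES and len(tokens) <= 8:
        return False  # short anaphoric follow-up -> reuse conversational memory
    return True       # new topic -> fresh vector retrieval
```

Skipping retrieval on follow-ups is also what drives the latency and cost savings described under "Retrieval Frequency vs. Latency" below.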

Key Trade-offs

Encryption vs. Searchability

Adopted a dual-storage model (encrypted source text + unencrypted embeddings) to preserve retrieval quality while maintaining confidentiality. Research is ongoing to further strengthen encryption strategies without degrading retrieval performance.

Post-Filtering vs. Query-Time Enforcement

Moved RBAC enforcement directly into Qdrant's indexed payload filters to eliminate security gaps and avoid expensive post-retrieval filtering.

Retrieval Frequency vs. Latency

Implemented contextual routing to reduce unnecessary vector searches, lowering latency and infrastructure costs.

Model Size vs. Cost & Security

Selected Llama 3.1-8B via Groq for inference to balance security, cost efficiency, and low-latency responses. The lightweight 8B model provides sufficient reasoning capacity while remaining computationally economical.

Tech Stack & Tools

Languages

Python, TypeScript

Frameworks

FastAPI, LangChain, Next.js, React

Database

MongoDB, Qdrant

Tools

Beanie ODM, Motor

Other

BGE-M3 embeddings, Llama 3.1-8B (Groq), AES-256-GCM encryption, RBAC