Lexi-RAG
Business Context
Legal teams manage highly sensitive documents that must remain encrypted at rest while still being searchable for fast and accurate case research.
Lexi-RAG was built to resolve a fundamental tension: how can confidential legal data remain encrypted while still enabling high-quality semantic search and contextual AI reasoning?
The system delivers secure document ingestion, strict role-based access control, citation-backed responses, and context-aware legal dialogue designed for multi-tenant legal environments.
Engineering Architecture
Dual-Database Security Architecture
MongoDB stores AES-256-GCM encrypted legal documents, while Qdrant stores 1024-dimensional embeddings for semantic retrieval. Security is enforced at query time through indexed role-based filtering rather than post-processing.
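The MongoDB side of this split can be sketched as follows. This is a minimal illustration of the ingestion path, assuming the widely used third-party `cryptography` package; the record layout and function names are illustrative, not the system's actual schema.

```python
# Sketch of the dual-store ingestion path: the raw document body is
# AES-256-GCM encrypted before it reaches MongoDB, while only the embedding
# vector and its access-control payload would be sent to Qdrant.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_document(key: bytes, plaintext: str, doc_id: str) -> dict:
    """Return a MongoDB-ready record: ciphertext + nonce, never plaintext."""
    nonce = os.urandom(12)  # 96-bit nonce, unique per document
    aesgcm = AESGCM(key)
    # The document ID doubles as associated data, binding ciphertext to record.
    ciphertext = aesgcm.encrypt(nonce, plaintext.encode(), doc_id.encode())
    return {"_id": doc_id, "nonce": nonce, "ciphertext": ciphertext}

def decrypt_document(key: bytes, record: dict) -> str:
    aesgcm = AESGCM(key)
    plaintext = aesgcm.decrypt(
        record["nonce"], record["ciphertext"], record["_id"].encode()
    )
    return plaintext.decode()

key = AESGCM.generate_key(bit_length=256)
record = encrypt_document(key, "Clause 4.2: indemnification ...", "doc-001")
assert decrypt_document(key, record) == "Clause 4.2: indemnification ..."
```

Only the `{_id, nonce, ciphertext}` record lands in MongoDB; the plaintext exists solely in memory during embedding.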
Hybrid Semantic + Keyword Retrieval Pipeline
The hybrid design combines dense semantic embeddings (BGE-M3) with keyword-sensitive retrieval to handle both contextual legal reasoning and structured references such as section numbers or clauses. A context-aware router determines when to rely on conversational memory versus fresh vector retrieval.
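One common way to fuse the two ranked lists is reciprocal rank fusion (RRF); the sketch below uses it purely to illustrate how dense and keyword signals combine, and is not necessarily the exact fusion the pipeline uses.

```python
# Reciprocal rank fusion: blend a dense-vector ranking and a keyword ranking
# so that a chunk ranked high in either list rises in the final order.
from collections import defaultdict

def rrf_fuse(dense_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in (dense_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk matching an exact structured reference (say, a section number)
# can top the keyword list even if its embedding similarity is middling.
fused = rrf_fuse(["c2", "c5", "c9"], ["c7", "c2"])
```

Here `c2`, ranked in both lists, ends up first; `k` damps the influence of any single ranking.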
Local Embedding Engine (Security-Oriented Design)
The 2GB BGE-M3 embedding model is loaded once at startup and executed locally. This ensures embeddings are generated without exposing raw legal text externally, adding an additional privacy layer. BGE-M3 was selected for its ability to support both semantic similarity and keyword-aware retrieval.
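The load-once pattern can be sketched like this. `LocalEmbedder` and its placeholder `embed` are hypothetical stand-ins for the real BGE-M3 loader (e.g. FlagEmbedding's `BGEM3FlagModel`); only the singleton structure is the point.

```python
# Load-once pattern: the ~2GB model is expensive to initialize, so it is
# constructed a single time at startup and reused for every request.
# No text leaves the process; embeddings are computed locally.
from functools import lru_cache

class LocalEmbedder:
    def __init__(self) -> None:
        print("loading BGE-M3 weights ...")  # happens exactly once
        # self.model = BGEM3FlagModel("BAAI/bge-m3")  # real loader (assumption)

    def embed(self, text: str) -> list[float]:
        # Placeholder: the real model returns a 1024-dimensional dense vector.
        return [0.0] * 1024

@lru_cache(maxsize=1)
def get_embedder() -> LocalEmbedder:
    return LocalEmbedder()

assert get_embedder() is get_embedder()        # same instance on every call
assert len(get_embedder().embed("clause")) == 1024
```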
Role-Based Vector Filtering (RBAC at Query Time)
A 4-tier hierarchy (Partner → Associate → Staff → Client) is enforced within Qdrant payload filters. Embedding access rules directly into the vector search layer prevents cross-privilege leakage, rather than relying on post-retrieval checks.
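The tier check itself is simple; the sketch below computes which tiers a role may see. The field name `access_tier` is an assumption about the payload schema — with `qdrant_client`, the returned list would feed a payload filter (e.g. `Filter(must=[FieldCondition(key="access_tier", match=MatchAny(any=...))])`).

```python
# Query-time RBAC: a role sees its own tier and every tier below it.
# The resulting list becomes a Qdrant payload filter, so unauthorized
# chunks are excluded inside the vector search, not filtered afterward.
HIERARCHY = ["partner", "associate", "staff", "client"]  # highest → lowest privilege

def visible_tiers(role: str) -> list[str]:
    idx = HIERARCHY.index(role)
    return HIERARCHY[idx:]

assert visible_tiers("partner") == ["partner", "associate", "staff", "client"]
assert visible_tiers("client") == ["client"]
```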
Answer Re-Ranking Layer
Retrieved chunks are re-ranked before final generation to ensure the most legally relevant and contextually aligned passages are prioritized. This improves citation precision and reduces noise from semantically similar but irrelevant chunks.
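The re-ranking step can be sketched as below. `score_pair` is a trivial lexical-overlap heuristic standing in for a real relevance model (typically a cross-encoder); only the re-score-then-reorder structure reflects the pipeline.

```python
# Re-rank retrieved chunks against the query with a stronger relevance
# function, keeping only the top-k before generation.
def score_pair(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    return sorted(chunks, key=lambda ch: score_pair(query, ch), reverse=True)[:top_k]

chunks = [
    "Force majeure excuses performance during unforeseeable events.",
    "The indemnification clause survives termination of this agreement.",
    "Venue for disputes shall be the state of Delaware.",
]
top = rerank("does the indemnification clause survive termination", chunks, top_k=1)
```

The semantically adjacent but off-topic chunks drop out, which is exactly the noise reduction described above.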
Context-Aware Conversational Layer
The FastAPI backend maintains structured chat sessions, enabling multi-turn reasoning. The system preserves contextual continuity between the user and the AI while selectively triggering retrieval only when needed.
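The routing decision can be sketched as a per-turn predicate. The cue-phrase heuristics here are assumptions for illustration; a real router might use a classifier or the LLM itself to make this call.

```python
# Context-aware routing: decide per turn whether a fresh vector search is
# needed or whether the answer can be drawn from session memory.
FOLLOW_UP_CUES = ("what about", "and the", "elaborate", "why", "it", "that")

def needs_retrieval(query: str, history: list[str]) -> bool:
    q = query.lower().strip()
    if not history:            # first turn always retrieves
        return True
    # Short follow-ups that refer back to prior context reuse conversational
    # memory instead of triggering another vector search.
    if len(q.split()) <= 4 and any(cue in q for cue in FOLLOW_UP_CUES):
        return False
    return True

assert needs_retrieval("summarize the NDA term", []) is True
assert needs_retrieval("why?", ["summarize the NDA term"]) is False
```

Skipping retrieval on follow-ups is what lowers both latency and vector-search cost, as noted under the trade-offs below.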
Key Trade-offs
Encryption vs. Searchability
Adopted a dual-storage model (encrypted source text + unencrypted embeddings) to preserve retrieval quality while maintaining confidentiality. Because embeddings can still reveal semantic information about their source text, research is ongoing to strengthen encryption strategies without degrading retrieval performance.
Post-Filtering vs. Query-Time Enforcement
Moved RBAC enforcement directly into Qdrant's indexed payload filters to eliminate security gaps and avoid expensive post-retrieval filtering.
Retrieval Frequency vs. Latency
Implemented contextual routing to reduce unnecessary vector searches, lowering latency and infrastructure costs.
Model Size vs. Cost & Security
Selected Llama 3.1-8B via Groq for inference to balance security, cost efficiency, and low-latency responses. The lightweight 8B model provides sufficient reasoning capacity while remaining computationally economical.