RAGFlow - Hardened Self-Hosted RAG Engine with Deep Document AI
by Lynxroute
RAGFlow - CIS Level 1 hardened self-hosted RAG engine on Ubuntu 24.04 LTS with SBOM + CIS Report.
What is RAGFlow
RAGFlow is an open-source Retrieval-Augmented Generation engine for enterprise document intelligence. Its deep document layer (deepdoc) parses PDFs, Word, slides, spreadsheets, scanned books and images with layout recognition, table extraction and chunk-level citation tracking - every LLM answer is traceable back to a paragraph, page, table cell or figure in the source. RAGFlow ingests files into knowledge bases, builds hybrid vector and full-text indices in Elasticsearch, stores blobs in MinIO, and exposes a multi-tenant web UI, REST API and MCP server for chat, agents and programmatic retrieval. Works with any LLM provider (OpenAI, Anthropic, Azure OpenAI, Gemini, Ollama, vLLM and others) selected from a single dropdown. Apache-2.0 license, no vendor lock-in.
Why self-host RAGFlow
Self-hosting keeps every document, embedding and retrieval query inside your own Azure tenant - no per-seat SaaS fee, no third-party access to internal knowledge bases, no provider API key shared outside the VM. Suits regulated industries with GDPR or HIPAA data residency, legal and consulting practices, R&D groups querying proprietary research, and MSPs delivering private RAG inside customer subscriptions.
What this VM image adds
Security hardening:
- Self-registration with auto-close - first user becomes the workspace owner, then a background systemd service closes the signup endpoint automatically
- No pre-seeded admin, no default credentials on disk - the customer owns the identity layer end-to-end
- MySQL, Elasticsearch, MinIO and Valkey passwords rotated at first boot - 32 random characters each, written into /opt/ragflow/.env, never baked into the image
- Upstream Go admin server explicitly NOT enabled - the bundled license-tracker module is left off; the VM runs the Apache-2.0 Python server only
- All five containers reachable only via the docker bridge - host port 8088 bound to 127.0.0.1; the only public-facing endpoint is Nginx on TCP 443
- Nginx reverse proxy with TLS - HTTP to HTTPS redirect, hardened cipher suite, WebSocket pass-through for streaming chat, 128 MB upload limit
- Provider API keys not baked in - configure OpenAI, Anthropic, Azure OpenAI, Ollama and others after first login in the web UI
- UFW firewall, fail2ban, AppArmor; CVE scan with Trivy before every release; Certbot pre-installed for Let's Encrypt
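The first-boot password rotation described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the image's actual first-boot script: the alphanumeric alphabet and the key name MYSQL_PASSWORD are hypothetical, and only the 32-character length and the /opt/ragflow/.env target come from this listing.

```python
import secrets
import string

# Assumption: alphanumeric characters only; the listing says "32 random
# characters" but does not name the alphabet.
ALPHABET = string.ascii_letters + string.digits


def new_password(length: int = 32) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def rotate_env(env_text: str, keys: list[str]) -> str:
    """Return env_text with the value of every listed KEY=value line replaced.

    Lines whose key is not listed (including comments) are kept unchanged.
    """
    out = []
    for line in env_text.splitlines():
        name, sep, _old_value = line.partition("=")
        if sep and name.strip() in keys:
            out.append(f"{name.strip()}={new_password()}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Calling rotate_env on the text of an .env file with a list of key names returns the new file content; writing it back with restrictive permissions (for example 0600, owned by root) is left to the caller.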
OS hardening (CIS Level 1):
- CIS Ubuntu 24.04 LTS Level 1 Benchmark applied via ansible-lockdown
- auditd, SSH key-only access, kernel hardening (SYN cookies, ASLR, rp_filter, TCP BBR), /tmp as tmpfs (nosuid, nodev, noexec)
- Azure IMDS endpoints - egress rules pre-configured (169.254.169.254, 168.63.129.16)
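The kernel settings named above correspond to standard Linux sysctl keys, so they can be spot-checked from inside the VM. The sketch below is one hedged way to do that; the expected values are assumptions (typical CIS-style settings), not values confirmed by this listing, and should be adjusted to whatever the tailored CIS profile documents.

```python
from pathlib import Path
from typing import Callable

# Assumed target values; the listing names the features, not the exact values.
EXPECTED = {
    "net.ipv4.tcp_syncookies": "1",            # SYN cookies
    "kernel.randomize_va_space": "2",          # full ASLR
    "net.ipv4.conf.all.rp_filter": "1",        # reverse-path filtering
    "net.ipv4.tcp_congestion_control": "bbr",  # TCP BBR
}


def sysctl_path(key: str) -> Path:
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path("/proc/sys") / key.replace(".", "/")


def read_sysctl(key: str) -> str:
    """Read the live value of a sysctl key from /proc/sys."""
    return sysctl_path(key).read_text().strip()


def check(expected: dict[str, str],
          reader: Callable[[str], str] = read_sysctl) -> dict[str, tuple[str, str]]:
    """Return {key: (wanted, actual)} for every key whose value differs."""
    mismatches = {}
    for key, want in expected.items():
        got = reader(key)
        if got != want:
            mismatches[key] = (want, got)
    return mismatches
```

Running check(EXPECTED) on the VM returns an empty dict when every assumed value matches; any entry in the result shows the wanted and the actual value side by side.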
Compliance artifacts (inside the VM):
- SBOM - CycloneDX 1.6 at /etc/lynxroute/sbom.json with SHA-256 of the ragflow image and NTIA-compliant supplier metadata
- CIS Conformance Report - OpenSCAP HTML at /etc/lynxroute/cis-report.html
- Tailored CIS profile - /usr/share/doc/lynxroute/CIS_TAILORED_PROFILE.md
- Server credentials file - /root/ragflow-credentials.txt with web UI URL, per-instance backing-store passwords and stack-management commands
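For audit work it can help to pull component names, suppliers and SHA-256 digests out of the SBOM. The sketch below reads the CycloneDX JSON fields components, hashes and supplier; which components the file actually contains is not stated in this listing, so the function makes no assumption about them, and load_sbom is a hypothetical helper name.

```python
import json
from pathlib import Path

SBOM_PATH = Path("/etc/lynxroute/sbom.json")  # path given in this listing


def load_sbom(path: Path = SBOM_PATH) -> dict:
    """Load the CycloneDX JSON document from disk."""
    return json.loads(path.read_text())


def summarize_sbom(bom: dict) -> list[dict]:
    """Return one row per component: name, version, supplier and SHA-256."""
    rows = []
    for comp in bom.get("components", []):
        # CycloneDX stores hashes as a list of {"alg": ..., "content": ...}.
        sha256 = next(
            (h.get("content") for h in comp.get("hashes", [])
             if h.get("alg") == "SHA-256"),
            None,
        )
        rows.append({
            "name": comp.get("name"),
            "version": comp.get("version"),
            "supplier": (comp.get("supplier") or {}).get("name"),
            "sha256": sha256,
        })
    return rows
```

On the VM, summarize_sbom(load_sbom()) returns the rows; a missing supplier or hash shows up as None rather than raising, which makes gaps easy to spot.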
Quick Start
- Deploy VM from Azure Marketplace (Standard_D4s_v3 or larger - RAGFlow needs 16 GB RAM minimum)
- Open the NSG: allow TCP 443 from your own IP (/32) only until you have registered the workspace owner; TCP 80 for Let's Encrypt validation; TCP 22 from your management IPs
- SSH: ssh -i key.pem azureuser@<PUBLIC_IP>, then read the credentials with sudo cat /root/ragflow-credentials.txt
- Open https://<PUBLIC_IP>, accept the self-signed cert, click Sign up - first registered user becomes the workspace owner. Registration auto-closes within ~30 s
- Log in, open Model Providers and configure your preferred LLM. No model API keys ship in the VM
- Create a knowledge base, upload documents (PDF, DOCX, PPTX, scanned images) and start chatting with citation-grounded answers
- Public TLS: sudo certbot --nginx -d your.domain.com
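After the NSG rules are in place, a quick reachability check from your workstation can confirm the exposure matches the steps above: 22, 80 and 443 open from allowed addresses, and 8088 closed because it is bound to 127.0.0.1 on the host. This is a minimal sketch using only the Python standard library; the function names are illustrative, and a closed result can also mean the NSG or UFW dropped the packet.

```python
import socket


def is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def port_report(host: str, ports: tuple[int, ...] = (22, 80, 443, 8088)) -> dict[int, bool]:
    """Probe each port and return {port: reachable}."""
    return {port: is_open(host, port) for port in ports}
```

Calling port_report with the VM's public IP from an allowed address would be expected to show 8088 as False; if it shows True, the loopback binding or firewall rules deserve a second look.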
Persistent storage lives at /opt/ragflow/{mysql,es,minio,redis} - mount an Azure Managed Disk at any of those paths for production-scale document libraries.