RAGFlow - Hardened Self-Hosted RAG Engine with Deep Document AI

by Lynxroute

RAGFlow - CIS Level 1 hardened self-hosted RAG engine on Ubuntu 24.04 LTS with SBOM + CIS Report.

What is RAGFlow

RAGFlow is an open-source Retrieval-Augmented Generation engine for enterprise document intelligence. Its deep document layer (deepdoc) parses PDFs, Word documents, slides, spreadsheets, scanned books and images with layout recognition, table extraction and chunk-level citation tracking, so every LLM answer is traceable back to a paragraph, page, table cell or figure in the source. RAGFlow ingests files into knowledge bases, builds hybrid vector and full-text indices in Elasticsearch, stores blobs in MinIO, and exposes a multi-tenant web UI, REST API and MCP server for chat, agents and programmatic retrieval. It works with any supported LLM provider (OpenAI, Anthropic, Azure OpenAI, Gemini, Ollama, vLLM and others), selected from a single dropdown. Apache-2.0 license, no vendor lock-in.

Why self-host RAGFlow

Self-hosting keeps every document, embedding and retrieval query inside your own Azure tenant - no per-seat SaaS fee, no third-party access to internal knowledge bases, no provider API key shared outside the VM. Suits regulated industries with GDPR or HIPAA data residency, legal and consulting practices, R&D groups querying proprietary research, and MSPs delivering private RAG inside customer subscriptions.

What this VM image adds

Security hardening:

  • Self-registration with auto-close - first user becomes the workspace owner, then a background systemd service closes the signup endpoint automatically
  • No pre-seeded admin, no default credentials on disk - the customer owns the identity layer end-to-end
  • MySQL, Elasticsearch, MinIO and Valkey passwords rotated at first boot - 32 random characters each, written into /opt/ragflow/.env, never baked into the image
  • Upstream Go admin server explicitly NOT enabled - the bundled license-tracker module is left off; the VM runs the Apache-2.0 Python server only
  • All five containers reachable only via the docker bridge - host port 8088 bound to 127.0.0.1; the only public-facing endpoint is Nginx on TCP 443
  • Nginx reverse proxy with TLS - HTTP to HTTPS redirect, hardened cipher suite, WebSocket pass-through for streaming chat, 128 MB upload limit
  • Provider API keys not baked in - configure OpenAI, Anthropic, Azure OpenAI, Ollama and others after first login in the web UI
  • UFW firewall, fail2ban, AppArmor; CVE scan with Trivy before every release; Certbot pre-installed for Let's Encrypt
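
The first-boot password rotation described above could be sketched roughly as follows. This is an illustration only, not the script shipped in the image: the variable names in SERVICES and the alphanumeric alphabet are assumptions, and only the 32-character length and the /opt/ragflow/.env location come from this listing.

```python
import secrets
import string

# Assumed alphabet; the image may use a different character set.
ALPHABET = string.ascii_letters + string.digits

# Hypothetical .env keys, one per backing store; the real names may differ.
SERVICES = ("MYSQL_PASSWORD", "ELASTIC_PASSWORD", "MINIO_PASSWORD", "REDIS_PASSWORD")


def generate_password(length: int = 32) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def rotate(env_text: str) -> str:
    """Return env_text with every key in SERVICES set to a fresh password.

    Existing assignments are replaced in place; missing keys are appended.
    All other lines (including comments) are kept unchanged.
    """
    fresh = {key: generate_password() for key in SERVICES}
    seen = set()
    out = []
    for line in env_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in fresh:
            out.append(f"{key}={fresh[key]}")
            seen.add(key)
        else:
            out.append(line)
    for key in SERVICES:
        if key not in seen:
            out.append(f"{key}={fresh[key]}")
    return "\n".join(out) + "\n"
```

A first-boot unit would read the .env file, pass its text through rotate(), write the result back with restrictive permissions, and only then start the containers.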

OS hardening (CIS Level 1):

  • CIS Level 1 hardened - CIS Ubuntu 24.04 LTS Level 1 Benchmark via ansible-lockdown
  • auditd, SSH key-only access, kernel hardening (SYN cookies, ASLR, rp_filter, TCP BBR), /tmp as tmpfs (nosuid, nodev, noexec)
  • Azure platform endpoints - egress rules pre-configured for IMDS (169.254.169.254) and the Azure WireServer (168.63.129.16)
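
To spot-check the kernel settings named in the list above on a running VM, a small audit along these lines may help. The expected values are assumptions based on common CIS guidance and on the features the listing names (SYN cookies, ASLR, rp_filter, TCP BBR); the image's tailored profile is the authoritative source.

```python
from pathlib import Path

# Assumed target values; compare against the tailored CIS profile on the VM.
EXPECTED = {
    "net.ipv4.tcp_syncookies": "1",            # SYN cookies enabled
    "kernel.randomize_va_space": "2",          # full ASLR
    "net.ipv4.conf.all.rp_filter": "1",        # strict reverse-path filtering
    "net.ipv4.tcp_congestion_control": "bbr",  # TCP BBR
}


def read_sysctl(key: str, root: Path = Path("/proc/sys")) -> str:
    """Read one sysctl value from procfs (dots in the key map to directories)."""
    return (root / key.replace(".", "/")).read_text().strip()


def audit(reader=read_sysctl) -> dict:
    """Return {key: (expected, actual)} for every setting that does not match."""
    mismatches = {}
    for key, want in EXPECTED.items():
        try:
            got = reader(key)
        except OSError:
            got = None  # key missing or unreadable on this kernel
        if got != want:
            mismatches[key] = (want, got)
    return mismatches
```

Calling audit() with no arguments on the VM returns an empty dict when all four values match the assumed targets.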

Compliance artifacts (inside the VM):

  • SBOM - CycloneDX 1.6 at /etc/lynxroute/sbom.json with SHA-256 of the ragflow image and NTIA-compliant supplier metadata
  • CIS Conformance Report - OpenSCAP HTML at /etc/lynxroute/cis-report.html
  • Tailored CIS profile - /usr/share/doc/lynxroute/CIS_TAILORED_PROFILE.md
  • Server credentials file - /root/ragflow-credentials.txt with web UI URL, per-instance backing-store passwords and stack-management commands
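
As a rough sketch of how the SBOM might be inspected, the function below collects SHA-256 digests from a CycloneDX JSON document. It relies only on standard CycloneDX fields (bomFormat, components, hashes with alg and content); which component names actually appear in /etc/lynxroute/sbom.json is not stated in this listing, so treat the lookup as hypothetical.

```python
import json


def sha256_components(sbom: dict) -> dict:
    """Map component name -> SHA-256 digest for components that carry one."""
    if sbom.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX document")
    found = {}
    for comp in sbom.get("components", []):
        for entry in comp.get("hashes", []):
            if entry.get("alg") == "SHA-256":
                found[comp.get("name", "<unnamed>")] = entry.get("content")
    return found


def load_sbom(path: str) -> dict:
    """Parse an SBOM file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

On the VM, sha256_components(load_sbom("/etc/lynxroute/sbom.json")) would list the recorded digests, which can then be compared with the digest Docker reports for the running ragflow image.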

Quick Start

  1. Deploy VM from Azure Marketplace (Standard_D4s_v3 or larger - RAGFlow needs 16 GB RAM minimum)
  2. Open NSG: TCP 443 from YOUR IP/32 only until you have registered the workspace owner; TCP 80 for Let's Encrypt; TCP 22 from your management IPs
  3. SSH: ssh -i key.pem azureuser@<PUBLIC_IP>, then read the credentials file: sudo cat /root/ragflow-credentials.txt
  4. Open https://<PUBLIC_IP>, accept the self-signed cert, click Sign up - first registered user becomes the workspace owner. Registration auto-closes within ~30 s
  5. Log in, open Model Providers and configure your preferred LLM. No model API keys ship in the VM
  6. Create a knowledge base, upload documents (PDF, DOCX, PPTX, scanned images) and start chatting with citation-grounded answers
  7. Public TLS: sudo certbot --nginx -d your.domain.com

Persistent storage lives at /opt/ragflow/{mysql,es,minio,redis} - mount an Azure Managed Disk at any of those paths for production-scale document libraries.