vLLM with WebUI - Hardened Self-Hosted LLM Server
by Lynxroute
vLLM with Open WebUI - CIS Level 1 hardened on Ubuntu 24.04 LTS with SBOM and CIS Report.
What is vLLM
vLLM is an open-source, high-throughput inference engine for large language models, built in Python on top of PyTorch. It implements PagedAttention, continuous batching, and tensor parallelism, and serves Hugging Face transformer models whose architectures it supports (Llama, Mistral, Qwen, Phi, Gemma, OPT, GPT-J, Falcon, and 100+ more) through an OpenAI-compatible REST API. Existing OpenAI clients (openai-python, openai-node, LangChain, LlamaIndex, AnythingLLM) need no code changes: set the base URL to https://<PUBLIC_IP>/v1 and pass the local API key as the Bearer token.
This image ships the CPU build of vLLM bundled with Open WebUI, a browser chat front end pre-wired to the local vLLM server. The default model, facebook/opt-125m (~250 MB), is preloaded so chat and API work immediately with no Hugging Face token required; a different vLLM-supported model can be swapped in via /etc/vllm/server.env.
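As a rough illustration of what "point the base URL at this VM and pass the Bearer token" means on the wire, the sketch below builds a text-completion request with only the Python standard library. The helper name build_completion_request, the example address, and the max_tokens value are assumptions made for illustration; the /v1/completions path and the Authorization header follow the OpenAI API convention that vLLM exposes.

```python
import json
import urllib.request


def build_completion_request(base_url: str, api_key: str, prompt: str,
                             model: str = "facebook/opt-125m") -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible completions endpoint.

    base_url is expected to end in /v1, e.g. https://<PUBLIC_IP>/v1.
    """
    payload = {
        "model": model,      # default model named in this listing
        "prompt": prompt,
        "max_tokens": 64,    # arbitrary illustrative limit
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request is a matter of passing it to urllib.request.urlopen; with the default self-signed certificate the call also needs an SSL context that either trusts that certificate or skips verification.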
Why self-host vLLM
Self-hosting keeps every prompt, document, embedding, and API key inside your own tenant. No third-party SaaS sees your customer data, internal knowledge bases, or model traffic. Recommended for teams with data residency requirements, organisations under regulated frameworks (HIPAA, GDPR, ISO 27001), and AI labs that need full visibility into the inference path. Apache-2.0 (vLLM) and MIT (Open WebUI) - fully auditable, no vendor lock-in.
What this VM image adds
Security hardening:
- Random 32-byte API key generated at first boot - written to /root/vllm-credentials.txt, never baked into the image; the same key is injected into Open WebUI so the chat UI authenticates to vLLM transparently
- vLLM bound to 127.0.0.1:8000 - reachable only through Nginx with TLS, with --api-key Bearer auth enforced on every /v1/* request
- Open WebUI bound to 127.0.0.1:8080 - reachable only through Nginx with TLS
- First registered user in Open WebUI becomes the workspace administrator - no admin baked in
- Nginx reverse proxy - self-signed TLS, HTTP-to-HTTPS redirect, WebSocket upgrade for streaming chat, security headers (X-Content-Type-Options, X-Frame-Options, Referrer-Policy)
- Loading splash page - served while the model warms up on first request
- Anonymous telemetry disabled - VLLM_NO_USAGE_STATS, DO_NOT_TRACK, ANONYMIZED_TELEMETRY
- UFW firewall - only TCP 22, 80, 443 exposed; 8000 and 8080 explicitly denied
- fail2ban - SSH brute-force protection
- AppArmor - mandatory access control
- Trivy CVE scan - every image is scanned for vulnerabilities before release
- Trivy secret scan - blocks any image that ships with leaked credentials
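The first-boot key generation in the list above can be pictured with a minimal sketch like the following. It is not the image's actual provisioning script: the function names and the API_KEY= line format are hypothetical, and hex encoding is only one plausible way to render 32 random bytes.

```python
import os
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return num_bytes of OS-level randomness, hex encoded (64 characters for 32 bytes)."""
    return secrets.token_hex(num_bytes)


def write_credentials(path: str, api_key: str) -> None:
    """Write the key to a new file readable and writable by its owner only.

    O_EXCL makes the call fail if the file already exists, so a later boot
    cannot silently replace a key that clients already use.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"API_KEY={api_key}\n")  # hypothetical file format
```

Generating the key at boot rather than at build time means every deployed VM gets a distinct secret and nothing sensitive is stored in the published image.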
OS hardening (CIS Level 1):
- CIS Level 1 hardened - CIS Ubuntu 24.04 LTS Level 1 Benchmark via ansible-lockdown
- auditd - system call auditing for critical paths
- SSH hardening - PasswordAuthentication disabled, key-only access
- Kernel hardening and tuning - SYN cookies, ASLR, rp_filter (reverse-path filtering), TCP BBR congestion control
- /tmp as tmpfs - nosuid, nodev, noexec
- Azure platform endpoints - egress rules pre-configured for IMDS (169.254.169.254) and the Azure WireServer (168.63.129.16)
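To spot-check the kernel settings listed above on a running VM, a small script along these lines may help. The sysctl keys are standard Linux names, but the expected values are assumptions about how this image is configured (rp_filter, for instance, could legitimately be 1 or 2), and the function names are illustrative.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumed target values; adjust to match the image's actual configuration.
EXPECTED: Dict[str, str] = {
    "net.ipv4.tcp_syncookies": "1",             # SYN cookies enabled
    "kernel.randomize_va_space": "2",           # full ASLR
    "net.ipv4.conf.all.rp_filter": "1",         # strict reverse-path filtering
    "net.ipv4.tcp_congestion_control": "bbr",   # TCP BBR
}


def read_sysctl(key: str) -> str:
    """Read a sysctl value from /proc/sys (dots in the key become path separators)."""
    return Path("/proc/sys", *key.split(".")).read_text().strip()


def find_mismatches(expected: Dict[str, str],
                    reader: Callable[[str], str] = read_sysctl) -> Dict[str, str]:
    """Return {key: actual_value} for every setting that differs from expected."""
    mismatches = {}
    for key, want in expected.items():
        got = reader(key)
        if got != want:
            mismatches[key] = got
    return mismatches
```

Calling find_mismatches(EXPECTED) on the VM returns an empty dictionary when every value matches the assumed targets.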
Compliance artifacts (inside the VM):
- SBOM - CycloneDX 1.6 at /etc/lynxroute/sbom.json
- CIS Conformance Report - OpenSCAP HTML at /etc/lynxroute/cis-report.html
- Tailored CIS profile - /usr/share/doc/lynxroute/CIS_TAILORED_PROFILE.md
- Credentials file - /root/vllm-credentials.txt with the API key and connection details
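The SBOM is plain CycloneDX JSON, so it can be inspected without special tooling. The sketch below counts components by their type field; the top-level components array with per-entry type is part of the CycloneDX JSON format, while the function names are illustrative and the actual contents of this image's SBOM are not assumed.

```python
import json
from collections import Counter


def load_sbom(path: str = "/etc/lynxroute/sbom.json") -> dict:
    """Parse the CycloneDX JSON document shipped inside the VM."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_components_by_type(bom: dict) -> Counter:
    """Count SBOM components per CycloneDX type (library, application, ...)."""
    return Counter(
        component.get("type", "unknown")
        for component in bom.get("components", [])
    )
```

On the VM, count_components_by_type(load_sbom()) gives a quick overview of how many libraries, applications, and other component types the image contains.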
Quick Start
- Deploy VM (Standard_D4s_v3 recommended; minimum Standard_D2s_v3 with 8 GB RAM)
- Open the NSG: allow TCP 443 only from your own IP until you have registered the first (administrator) account; allow SSH (TCP 22) only from your management IPs
- SSH: ssh -i key.pem <username>@<PUBLIC_IP> (default user: azureuser)
- Read connection details: sudo cat /root/vllm-credentials.txt
- Open https://<PUBLIC_IP>/, accept the self-signed certificate, click "Sign up" - the first registered user becomes the workspace administrator
To call the OpenAI-compatible API directly: curl -k https://<PUBLIC_IP>/v1/models -H "Authorization: Bearer <API_KEY>" (the -k flag skips verification of the self-signed certificate). For production, replace the self-signed TLS certificate with a CA-signed one.
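The same /v1/models call can be made from Python. The sketch below covers the two pieces that differ from a normal HTTPS request: a TLS context that copes with the self-signed certificate, and parsing of the OpenAI-style model list. The function names are illustrative, and the ca_file option assumes you have copied the server certificate to the client yourself.

```python
import json
import ssl
from typing import List, Optional


def make_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a TLS context for talking to the VM.

    With ca_file, the given certificate is trusted and verification stays on
    (the certificate's subject must still match the address you connect to).
    Without it, verification is disabled, which is the equivalent of curl -k
    and should be limited to first-time testing.
    """
    if ca_file is not None:
        return ssl.create_default_context(cafile=ca_file)
    context = ssl.create_default_context()
    context.check_hostname = False       # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def parse_model_ids(response_body: bytes) -> List[str]:
    """Extract model ids from an OpenAI-style list response ({"data": [{"id": ...}]})."""
    document = json.loads(response_body)
    return [entry["id"] for entry in document.get("data", [])]
```

A request built with urllib.request.Request and the Authorization header can then be sent with urllib.request.urlopen(request, context=make_tls_context()) and its body passed to parse_model_ids.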