Paperless-ngx - Hardened Document Management with OCR
by Lynxroute
Paperless-ngx - CIS Level 1 hardened document management with OCR on Ubuntu 24.04 LTS, shipped with an SBOM.
What is Paperless-ngx
Paperless-ngx is an open-source document management system that turns paper documents, PDFs, and scanned images into a fully searchable online archive. It uses Tesseract OCR to extract text from every uploaded file, then makes that text available through full-text search and applies automatic tagging, correspondent detection, and configurable retention rules. The web UI runs on Django, asynchronous OCR and indexing run on Celery workers backed by Redis, and metadata lives in PostgreSQL. To consume a document, drop it in a watched folder, post it via the REST API, or upload it through the web UI. Within seconds it is OCRed, tagged, archived, and searchable.
Why self-host Paperless-ngx
Self-hosting keeps every invoice, contract, scanned ID card, tax record, and onboarding form on infrastructure you control - no third-party visibility into your documents and no per-user SaaS fees that grow with your team. It is well suited to accountants and bookkeepers archiving client paperwork, legal practices that need durable searchable records, healthcare practices subject to retention requirements, regulated organisations working under GDPR, HIPAA, or ISO 27001, and homelab users archiving years of personal paperwork.
What this VM image adds
Security hardening:
- Admin credentials generated per instance - unique random password issued at first boot, written to a root-only credentials file, never the same on two deployments
- Paperless services run as non-root - dedicated paperless system user, no shell, locked home directory, UMask=0027 enforced via systemd
- granian ASGI server bound to localhost - the Django app listens on 127.0.0.1:8000, exposed externally only through the hardened nginx reverse proxy
- PostgreSQL and Redis on 127.0.0.1 only - the document database and the Celery broker are never reachable from the network
- Per-instance Django SECRET_KEY and database password - generated at first boot, so tokens and sessions cannot be replayed against any other deployment
- Self-signed TLS pre-installed - nginx serves HTTPS on day one, HTTP redirects to HTTPS, ready for certbot replacement
- CVE scan - every image is scanned with Trivy before release, and vulnerable Python packages are pinned to patched versions on top of upstream requirements
- UFW firewall - only ports 80, 443, and 22 open; internal port 8000 explicitly local
- fail2ban - SSH brute-force protection
- AppArmor - mandatory access control
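The network exposure described in this list can be spot-checked from a client machine with a small TCP probe. The following is a minimal sketch, not something shipped in the image: the function names are illustrative, and the result for each port also depends on your own NSG rules. With the defaults above, 80 and 443 should accept connections from permitted networks, 22 only from your management IPs, and 8000 should never be reachable from outside the VM.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(host: str, ports=(22, 80, 443, 8000), timeout: float = 3.0) -> dict:
    """Map each port to whether it accepted a TCP connection."""
    return {port: is_port_open(host, port, timeout) for port in ports}


# Example (replace the placeholder with your VM's public IP):
#   for port, reachable in probe("<PUBLIC_IP>").items():
#       print(port, "open" if reachable else "closed or filtered")
```

A port reported as closed here may be dropped by UFW, by the NSG, or simply not bound to a public interface; the probe cannot tell these cases apart.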
OS hardening (CIS Level 1):
- CIS Level 1 hardened - CIS Ubuntu 24.04 LTS Level 1 Benchmark via ansible-lockdown, 0 FAIL rules in shipped image
- auditd - system call auditing for critical paths
- SSH hardening - PasswordAuthentication disabled, key-only access
- Kernel hardening and tuning - SYN cookies, ASLR, rp_filter, plus TCP BBR congestion control
- /tmp as tmpfs - nosuid, nodev, noexec
- Azure platform endpoints - egress rules pre-configured for IMDS (169.254.169.254) and the Azure WireServer (168.63.129.16)
Compliance artifacts (inside the VM):
- SBOM - CycloneDX 1.6 at /etc/lynxroute/sbom.json
- CIS Conformance Report - OpenSCAP HTML at /etc/lynxroute/cis-report.html
- Tailored CIS profile - /usr/share/doc/lynxroute/CIS_TAILORED_PROFILE.md
- Server credentials file - /root/paperless-ngx-credentials.txt with public IP, web UI URL, and the per-instance admin password
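Because the SBOM is CycloneDX JSON, its package inventory can be read with a few lines of standard-library Python. This is a minimal sketch that assumes only the standard top-level `components` array with `name` and `version` fields; the helper names are illustrative, and the file may need root privileges to read.

```python
import json
from pathlib import Path

SBOM_PATH = Path("/etc/lynxroute/sbom.json")  # path as stated in the listing


def load_sbom(path: Path = SBOM_PATH) -> dict:
    """Load a CycloneDX JSON document from disk."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def list_components(sbom: dict) -> list:
    """Return sorted (name, version) pairs from the SBOM's components array."""
    components = sbom.get("components", [])
    return sorted((c.get("name", "?"), c.get("version", "?")) for c in components)


# Example, run on the VM:
#   for name, version in list_components(load_sbom()):
#       print(f"{name}=={version}")
```

The resulting list can be compared against a vulnerability advisory to see whether an affected package version is present in the image.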
Quick Start
- Deploy VM from Azure Marketplace (Standard_D2s_v3 or larger recommended; Standard_D4s_v3 for heavy OCR throughput)
- Open NSG: TCP 80 and 443 from your client networks - SSH 22 from your management IPs only
- Wait 3 to 5 minutes after first boot for OCR dependencies, database migrations, and admin user creation to complete - the nginx startup page auto-refreshes every 15 seconds
- SSH: ssh -i key.pem <username>@<PUBLIC_IP> (username set during VM creation, default: azureuser)
- Read connection details: sudo cat /root/paperless-ngx-credentials.txt - contains web UI URL and the unique admin password
- Open https://<PUBLIC_IP> and accept the self-signed certificate warning
- Replace the self-signed TLS certificate with a CA-signed certificate for production use (certbot or your existing PKI)
- Ingest documents: upload via the web UI, post to /api/documents/post_document/, or drop files into /opt/paperless-ngx/consume/ on the VM
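The API ingestion path in the last step can be sketched with the Python standard library. This is an illustrative example rather than an official client: it assumes the endpoint accepts a multipart form with the file in a `document` field and an optional `title` field, and that HTTP Basic authentication with the admin credentials is enabled. The function names are hypothetical. Certificate verification is switched off only so the sketch works against the self-signed certificate; drop `insecure_context()` once a CA-signed certificate is installed.

```python
import base64
import ssl
import urllib.request
import uuid
from typing import Optional


def build_multipart(filename: str, data: bytes, title: Optional[str] = None):
    """Encode a file and optional title as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    if title is not None:
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="title"\r\n\r\n'
        parts.append(header.encode("utf-8") + title.encode("utf-8") + b"\r\n")
    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def insecure_context() -> ssl.SSLContext:
    """TLS context that skips verification. For the self-signed cert ONLY."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def upload_document(base_url, username, password, filename, data,
                    title=None, context=None) -> str:
    """POST one document to /api/documents/post_document/ and return the response body."""
    body, content_type = build_multipart(filename, data, title)
    credentials = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/documents/post_document/",
        data=body,
        method="POST",
        headers={"Content-Type": content_type, "Authorization": f"Basic {credentials}"},
    )
    with urllib.request.urlopen(request, context=context, timeout=60) as response:
        return response.read().decode("utf-8")


# Example (placeholders as in the steps above):
#   with open("invoice.pdf", "rb") as fh:
#       print(upload_document("https://<PUBLIC_IP>", "admin", "<password>",
#                             "invoice.pdf", fh.read(), title="Invoice",
#                             context=insecure_context()))
```

The upload is processed asynchronously, so a successful response means the document was queued for consumption, not that OCR has finished.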