
Document Intelligence

by Stealth Labs LTD

SharePoint content extractor for integration with Purview

SharePoint Content Scraper with AI Chat Interface is an enterprise-grade Azure solution that transforms your SharePoint environment into a searchable, AI-powered knowledge base. Deployed as a fully managed Azure Container App, it combines intelligent document extraction with a RAG-powered chat interface, allowing your team to instantly find and understand content across your entire SharePoint estate.

Unlike simple document crawlers, this solution provides a complete document intelligence platform:

  • Ask Questions, Get Answers: Chat with your SharePoint documents using natural language. The AI retrieves relevant documents and synthesizes accurate answers with source citations.
  • Multi-Site Management: Configure and monitor multiple SharePoint sites from a single dashboard. Add, enable, disable, or trigger scans for individual sites without redeployment.
  • Real-Time Visibility: Watch scanning progress live with detailed statistics, error tracking, and site-by-site status updates.

Key Features

AI-Powered Chat Interface

  • RAG (Retrieval-Augmented Generation): Ask questions about your documents in plain English
  • Source Citations: Every answer includes links to the source documents
  • Smart Suggestions: AI-generated questions based on your indexed content
  • Context-Aware Responses: Uses multiple documents to provide comprehensive answers
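
The retrieve-then-cite flow described above can be illustrated with a small sketch. It is not the product's implementation: the document set, the URLs, the bag-of-words scoring, and the function names (retrieve, build_prompt) are all assumptions chosen for illustration, and a real deployment would more likely use embeddings and a language model call.

```python
import math
import re
from collections import Counter

# Hypothetical indexed documents: (source URL, extracted text).
DOCS = [
    ("https://contoso.example/sites/hr/remote-work.docx",
     "Remote work policy: employees may work remotely up to three days per week."),
    ("https://contoso.example/sites/it/security.pdf",
     "Security procedures: rotate passwords and report phishing emails."),
    ("https://contoso.example/sites/legal/abc-contract.pdf",
     "Contract with ABC Corporation covering support services."),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, docs, top_k=2):
    """Return up to top_k (url, text) pairs that share terms with the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), url, text) for url, text in docs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(url, text) for score, url, text in scored[:top_k] if score > 0]


def build_prompt(question, hits):
    """Assemble a prompt that numbers each source so the answer can cite it."""
    context = "\n".join(
        f"[{i}] {text} (source: {url})" for i, (url, text) in enumerate(hits, 1)
    )
    return (
        "Answer using only the numbered sources and cite them.\n"
        f"{context}\nQuestion: {question}"
    )
```

Calling retrieve with a question and passing the result to build_prompt yields text that could be sent to a chat model; the numbered sources are what make the citations in the answer traceable.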

Multi-Site Management Dashboard

  • Centralized Configuration: Manage all SharePoint sites from one interface
  • Per-Site Controls: Enable/disable scanning, trigger immediate scans, or delete sites
  • Status Monitoring: Track scanning status (pending/scanning/active/error) per site
  • Document Counts: See how many documents each site contributes

Intelligent Document Processing

  • 10+ File Formats: PDF, Word, Excel, PowerPoint, text files, JSON, CSV, and code files
  • Full Content Extraction: Text, tables, embedded content, and metadata
  • Smart Incremental Updates: Only reprocesses changed files using delta detection
  • High Performance: Processes 30+ files per second with concurrent workers
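
As a rough illustration of incremental updates with concurrent workers, the sketch below compares a stored fingerprint per file against the current listing and hands only the changed files to a thread pool. The fingerprint values, the extract placeholder, and the worker count are assumptions; the listing does not say how the product detects changes or schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def changed_paths(previous, current):
    """Return paths that are new or whose fingerprint differs from the last scan.

    Both arguments map a file path to a change fingerprint (for example an
    ETag or a last-modified timestamp; which one is used is an assumption).
    """
    return sorted(path for path, tag in current.items() if previous.get(path) != tag)


def extract(path):
    """Hypothetical stand-in for real content extraction."""
    return (path, "extracted")


def process_changed(previous, current, max_workers=4):
    """Extract only the changed files, using a pool of worker threads."""
    paths = changed_paths(previous, current)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input paths.
        return list(pool.map(extract, paths))
```

Unchanged files never reach the worker pool, which is what keeps repeat scans cheap compared with a full crawl.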

AI-Powered Analysis (Optional)

  • Document Summaries: Auto-generate concise summaries using Azure OpenAI
  • Security Risk Detection: Flag exposed credentials, PII, and compliance issues
  • Stale Content Identification: Highlight documents not updated in 5+ years
  • Compliance Support: Awareness of GDPR, HIPAA, SOC 2, and PCI DSS requirements
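
The stale-content rule (documents not updated in five or more years) could be expressed as a simple date comparison. This is a sketch under assumptions: the function names are hypothetical, and how the product actually reads modification dates or defines the threshold is not stated in the listing.

```python
from datetime import datetime, timezone


def years_ago(now, years):
    """Return the same calendar date `years` earlier, handling 29 February."""
    try:
        return now.replace(year=now.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to the 28th.
        return now.replace(year=now.year - years, day=28)


def is_stale(last_modified, now, years=5):
    """True when the document was last modified at least `years` years ago."""
    return last_modified <= years_ago(now, years)


def stale_documents(docs, now, years=5):
    """Filter a mapping of path -> last-modified datetime down to stale paths."""
    return sorted(path for path, modified in docs.items() if is_stale(modified, now, years))
```

Passing the current time as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.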

Real-Time Progress Dashboard

  • Live Statistics: Documents processed, failed, sites scanned
  • Current Activity: See exactly which site and library is being scanned
  • Error Tracking: View recent errors with affected documents and timestamps
  • Document Analytics: Breakdown by file type and site distribution

Use Cases

Knowledge Discovery

  • "What is our company policy on remote work?"
  • "Find all contracts mentioning ABC Corporation"
  • "What security procedures do we have documented?"

Content Auditing

  • Inventory all documents across SharePoint sites
  • Identify stale content needing review
  • Track document growth and distribution

Compliance & Security

  • Discover documents with exposed credentials or PII
  • Identify outdated security policies
  • Support GDPR data subject access requests

Migration Planning

  • Understand content distribution before migrations
  • Identify document types and volumes per site
  • Plan storage and bandwidth requirements


Prerequisites

Required:

  1. Microsoft Entra ID (formerly Azure AD) App Registration
    • Grant permission (application type)
    • For tenant-wide scanning, grant admin consent
    • Create a client secret

Optional (for AI features):

  2. Azure OpenAI Service
    • Deployed GPT-4o or GPT-3.5-turbo model
    • API endpoint and key