SharePoint Intel
by Stealth Labs LTD
SharePoint content extractor for integration with Purview
Overview
SharePoint Content Scraper is a powerful, enterprise-grade solution for extracting and analyzing content from Microsoft SharePoint Online. Deployed as a fully managed Azure Container App, it automatically scans your SharePoint sites, extracts document metadata and content, and stores everything in Azure Cosmos DB for easy querying and analysis.
Key Features
Comprehensive SharePoint Integration
- Scan specific SharePoint sites or your entire tenant
- Extract from document libraries, lists, and subsites
- Support for all major file types (PDF, Word, Excel, PowerPoint, text files, and more)
- Automatic discovery of all accessible sites when tenant-wide scanning is enabled
AI-Powered Document Summaries (Optional)
- Generate intelligent summaries using your own Azure OpenAI service (Bring-Your-Own credentials)
- Support for GPT-4o and GPT-3.5-turbo models
- Automatic fallback if AI services are unavailable
- Optional security analysis to identify PII, exposed credentials, and compliance risks
Enterprise Security
- Azure Key Vault integration for secure secrets management
- Managed Identity support for passwordless authentication
- Entra ID (Azure AD) app registration authentication
- All data stored in your own Azure subscription
- No data leaves your environment
Reliable Data Storage
- Azure Cosmos DB serverless for cost-effective storage
- Pay only for what you use (approximately $0.25 per million request units; pricing varies by region)
- Automatic schema flexibility for various document types
- Built-in retry logic and error handling
- Change feed support for downstream processing
Performance and Scalability
- Multiple performance tiers (Basic, Standard, Large)
- Configurable CPU and memory allocation
- Concurrent file processing for faster extraction
- Rate limiting and retry mechanisms
- Handles large SharePoint environments with thousands of documents
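As an illustration of the rate limiting and retry behaviour listed above, here is a minimal sketch of retry with exponential backoff. It is not the product's actual implementation: `ThrottledError`, `call_with_retry`, and the default attempt count and delay are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical error standing in for a throttled (HTTP 429/503) response."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ThrottledError with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            # Give up once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. A production version would typically also honour any `Retry-After` header returned by the service.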
Rich Metadata Extraction
- File names, sizes, and extensions
- Created and modified dates
- Document age calculations
- SharePoint-specific metadata (sites, libraries, lists)
- Full text content extraction
- Custom metadata fields
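To make the metadata fields above more concrete, the following sketch shows one way an extracted record and its document age calculation might look. The `DocumentRecord` class, its field names, and the sample values are assumptions for illustration only and do not describe the product's actual Cosmos DB schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import PurePosixPath


@dataclass
class DocumentRecord:
    """Hypothetical shape of one extracted document (illustrative field names)."""
    name: str
    size_bytes: int
    site: str
    library: str
    created: datetime
    modified: datetime

    @property
    def extension(self) -> str:
        # Lowercase file extension without the leading dot.
        return PurePosixPath(self.name).suffix.lstrip(".").lower()

    def age_days(self, now: datetime) -> int:
        # Whole days elapsed since the last modification.
        return (now - self.modified).days


# Sample values are made up for the example.
record = DocumentRecord(
    name="Policy.DOCX",
    size_bytes=48_213,
    site="https://contoso.sharepoint.com/sites/hr",
    library="Documents",
    created=datetime(2023, 1, 10, tzinfo=timezone.utc),
    modified=datetime(2023, 3, 1, tzinfo=timezone.utc),
)
```

Measuring age from the modified date rather than the created date is a design choice in this sketch; a retention policy might reasonably use either.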
Monitoring and Observability
- Azure Log Analytics integration
- Application Insights for detailed telemetry
- Comprehensive logging of all operations
- Error tracking and alerting capabilities
- Real-time monitoring dashboard
Use Cases
Content Discovery and Inventory
- Audit all documents across your SharePoint environment
- Identify stale or outdated content
- Track document creation and modification patterns
Compliance and Security
- Discover documents containing sensitive information
- Identify outdated security policies or credentials
- Track document age for retention policies
- GDPR, HIPAA, and SOC 2 compliance support
Knowledge Management
- Build searchable document repositories
- Generate AI summaries for quick content review
- Enable full-text search across all documents
- Support migration planning and content analysis
Data Analytics
- Export SharePoint data to analytics platforms
- Track content growth over time
- Analyze document types and usage patterns
Prerequisites
Before Deployment:
Entra ID App Registration (Required)
- Create an app registration in Entra ID (Azure AD)
- Grant and permissions
- For tenant-wide scanning, grant permission
- Create a client secret
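For orientation, an app registration with a client secret is typically used through the OAuth 2.0 client credentials flow. The sketch below only builds the token request for the Microsoft identity platform v2.0 endpoint and does not send it. The helper name `build_token_request` is hypothetical, and the Microsoft Graph scope is an assumption, since the listing does not state which API the app calls.

```python
from urllib.parse import urlencode


def build_token_request(
    tenant_id: str, client_id: str, client_secret: str
) -> tuple[str, bytes]:
    """Return the token endpoint URL and form-encoded body for a client credentials request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Assumption: the app targets Microsoft Graph with its pre-consented permissions.
        "scope": "https://graph.microsoft.com/.default",
    }).encode("ascii")
    return url, body
```

The returned pair can be sent as an HTTP POST with the `Content-Type: application/x-www-form-urlencoded` header. In practice the client secret should be read from Azure Key Vault rather than hard-coded.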
Azure OpenAI Service (Optional – for AI features)
- Existing Azure OpenAI resource with deployed model
- Supported models: GPT-4o, GPT-3.5-turbo
- API key and endpoint URL