Microsoft Marketplace | cloud solutions, AI apps, and agents

Version 1.2.0+ Free Support on Ubuntu 24.04

Deequ is an open-source data quality library developed by Amazon that runs on top of Apache Spark. It lets organizations define, automate, and enforce data quality checks on large datasets, using Spark’s distributed computing to scale verification across a cluster. This helps keep data in production pipelines reliable, consistent, and complete.

Features of Deequ with Apache Spark:

Data quality framework: Enables definition of constraints and metrics such as completeness, uniqueness, and consistency to validate data quality.
Distributed computation: Utilizes Apache Spark’s distributed processing for scalable verification of large datasets.
Historical tracking: Supports storing verification results over time to monitor trends in data quality.
Declarative API: Allows users to define checks and analysis in a concise and readable manner, while integrating seamlessly into Spark pipelines.
Environment support: Compatible with Ubuntu 24.04 and Python virtual environments for isolated PyDeequ package management.
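
To make the constraint and metric idea above more concrete, here is a minimal plain-Python sketch of what completeness and uniqueness measure and how a check turns a metric into a pass or fail. It is not Deequ's API and does not use Spark. The function names, sample columns, and thresholds are all hypothetical, and the uniqueness definition is simplified (it assumes the column has no missing values).

```python
from collections import Counter

def completeness(values):
    """Fraction of values that are present (not None)."""
    return sum(v is not None for v in values) / len(values)

def uniqueness(values):
    """Fraction of values that occur exactly once (simplified: assumes no None)."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)

# Hypothetical sample columns, for illustration only.
emails = ["a@example.com", None, "c@example.com", "d@example.com"]
order_ids = [101, 102, 102, 104]

# Metrics are plain numbers computed from the data.
metrics = {
    "completeness(email)": completeness(emails),
    "uniqueness(order_id)": uniqueness(order_ids),
}

# A constraint pairs a metric with an assertion on its value.
# The thresholds below are arbitrary examples.
constraints = {
    "completeness(email)": lambda m: m >= 0.7,
    "uniqueness(order_id)": lambda m: m == 1.0,
}

results = {
    name: "Success" if is_ok(metrics[name]) else "Failure"
    for name, is_ok in constraints.items()
}

for name, status in results.items():
    print(f"{name} = {metrics[name]:.2f} -> {status}")
```

In this sketch one of four emails is missing, so completeness is 0.75 and passes the 0.7 threshold. Order ID 102 appears twice, so only two of four values are unique and the uniqueness constraint fails. A framework such as Deequ computes comparable metrics as distributed Spark jobs rather than as in-memory Python loops.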

Deequ Usage with PySpark:
$ sudo su
$ source /opt/pydeequ-venv/bin/activate
$ /opt/spark/bin/pyspark --packages com.amazon.deequ:deequ:2.0.7-spark-3.5
Once Spark starts successfully, test the installation at the PySpark prompt:
>>> import pydeequ
>>> print(pydeequ.__version__)

Disclaimer: Deequ is an independent open-source project developed by Amazon and is not affiliated with, endorsed by, or sponsored by Apache Spark or any Linux distribution.

Deequ with Apache Spark

by bCloud LLC
