Deequ with Apache Spark

by bCloud LLC

Version 1.2.0+ Free Support on Ubuntu 24.04

Deequ is a data quality framework developed by Amazon that runs on top of Apache Spark. It lets organizations define, automate, and enforce data quality checks on large datasets, using Spark’s distributed computing to scale the analysis. Deequ helps ensure the reliability, consistency, and completeness of data in production pipelines.

Features of Deequ with Apache Spark:

  • Data quality framework: Enables definition of constraints and metrics such as completeness, uniqueness, and consistency to validate data quality.
  • Distributed computation: Utilizes Apache Spark’s distributed processing for scalable verification of large datasets.
  • Historical tracking: Supports storing verification results over time to monitor trends in data quality.
  • Declarative API: Allows users to define checks and analysis in a concise and readable manner, while integrating seamlessly into Spark pipelines.
  • Environment compatibility: Runs on Ubuntu 24.04 and uses a Python virtual environment for isolated PyDeequ package management.
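
To make the completeness and uniqueness metrics named above more concrete, here is a minimal plain-Python sketch of what they measure on a single column. It does not call Deequ, which computes such metrics on Spark DataFrames. The function names are hypothetical, and the definitions are assumptions for illustration: completeness is the fraction of non-null entries, and uniqueness is the fraction of non-null entries whose value occurs exactly once. Deequ's own analyzers may treat nulls and empty inputs differently in detail.

```python
from collections import Counter
from typing import Optional, Sequence


def completeness(values: Sequence[Optional[object]]) -> float:
    """Fraction of entries that are not None.

    Returns 0.0 for an empty column (a choice made for this sketch).
    """
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def uniqueness(values: Sequence[Optional[object]]) -> float:
    """Fraction of non-null entries whose value occurs exactly once.

    Nulls are excluded before counting (an assumption of this sketch).
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    counts = Counter(non_null)
    return sum(1 for c in counts.values() if c == 1) / len(non_null)


# Hypothetical sample column: one missing value and one duplicate.
emails = ["a@x.io", None, "b@x.io", "a@x.io"]
print(completeness(emails))  # 3 of 4 entries are present
print(uniqueness(emails))    # 1 of 3 non-null entries occurs exactly once
```

A data quality check is then a threshold on such a metric, for example requiring completeness to equal 1.0 for a mandatory column.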

Deequ Usage with PySpark:

$ sudo su
$ source /opt/pydeequ-venv/bin/activate
$ /opt/spark/bin/pyspark --packages com.amazon.deequ:deequ:2.0.7-spark-3.5
Once Spark starts successfully, test the installation at the PySpark prompt:

>>> import pydeequ
>>> print(pydeequ.__version__)
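
Beyond the version check, a first verification run in the same PySpark session might look like the sketch below. It uses PyDeequ's `Check` and `VerificationSuite` classes. The helper name `run_basic_checks`, the column names `id` and `amount`, and the sample rows are hypothetical. The `SPARK_VERSION` line is an assumption: some PyDeequ releases read that environment variable when the package is imported, so the sketch sets a default only if none is present.

```python
import os

# Assumption: some PyDeequ releases read SPARK_VERSION at import time.
# setdefault leaves any value that is already exported untouched.
os.environ.setdefault("SPARK_VERSION", "3.5")


def run_basic_checks(spark, df):
    """Run a few illustrative constraints on df and return the results as a DataFrame."""
    # Imported inside the function so that defining it does not require Spark.
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationResult, VerificationSuite

    # Hypothetical column names; replace "id" and "amount" with your own.
    check = (
        Check(spark, CheckLevel.Error, "basic checks")
        .isComplete("id")          # no missing values in id
        .isUnique("id")            # no duplicate ids
        .isNonNegative("amount")   # amount is never below zero
    )
    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    return VerificationResult.checkResultsAsDataFrame(spark, result)


# In the PySpark shell, where `spark` already exists, usage could be:
#   df = spark.createDataFrame([(1, 10.0), (2, 5.5), (2, -1.0)], ["id", "amount"])
#   run_basic_checks(spark, df).show(truncate=False)
```

The returned DataFrame lists each constraint with its status, which makes it straightforward to log the results or fail a pipeline step when a check does not pass.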

Disclaimer: Deequ is an independent open-source project developed by Amazon and is not affiliated with, endorsed by, or sponsored by Apache Spark or any Linux distribution.