Data Quality Scans
Data Quality scans check data against metrics such as completeness, uniqueness, validity, NULL counts, and consistency, or against any custom rule you need, such as the length of a string.
Masthead can integrate your own data quality scans or use Dataplex, a native Google Cloud solution, to power and execute the data quality rules, which are, in essence, SQL queries. Let's look into both of these options.
None. Masthead does not request permissions, nor does it read or edit clients' data. The read-only permissions should be granted to Dataplex, a Google Cloud native service, while Masthead only helps to create and manage rules existing in Dataplex.
It can be enabled with just one click. You need to activate the Cloud Dataplex API in the connected Google Cloud Project. The link to this is provided in the Masthead UI, under the Data Quality tab. Additionally, rules must be manually created for each table you wish to monitor. Once the rules are created and scheduled, they will execute automatically.
This depends on each business and the metrics that are crucial for it. Data Quality rules are scheduled SQL queries. There are pre-set rules for aspects like completeness, accuracy, consistency, validity, uniqueness, and null checks.
However, any metric within any dimension can be monitored through a custom SQL rule.
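As an illustration of a custom rule, here is a minimal sketch of a string-length check. It uses Python's built-in sqlite3 module as a stand-in for BigQuery, and the `customers` table, the `country_code` column, and the two-character rule are all hypothetical.

```python
import sqlite3

# Hypothetical custom rule: country_code must be present and exactly 2 characters.
# The query returns a single number, the count of rows that violate the rule.
RULE_SQL = """
SELECT COUNT(*) AS violations
FROM customers
WHERE country_code IS NULL OR LENGTH(country_code) <> 2
"""


def run_rule(conn: sqlite3.Connection, sql: str) -> int:
    """Run a rule query that returns one number and return that number."""
    return conn.execute(sql).fetchone()[0]


# In-memory sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country_code TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "US"), (2, "DEU"), (3, None), (4, "FR")],
)

violations = run_rule(conn, RULE_SQL)
print(violations)
```

With this sample data, the rows with `DEU` and the missing value are counted as violations.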
Dataplex (with Masthead Data Quality) is a SQL-based solution, in contrast to Masthead Monitors, which are based on logs and metadata. Unlike Dataplex, Masthead Monitors provide automated anomaly detection across the entire data platform by identifying anomalies in time-series tables and errors in pipelines in real time. Additionally, they do not increase cloud costs because they do not run SQL queries.
At Masthead, we believe Dataplex is the optimal solution for Google Cloud users to implement data quality checks. The benefits include:
It's 100x cheaper. You pay only for SQL execution, which avoids the 100x additional cost of purchasing another SQL-first solution on top of the compute costs for the executed queries.
The safest approach. Doesn't expose data to a third-party vendor, which also eliminates the need to spend time with security and legal teams for onboarding a third-party tool.
No additional vendor lock-in.
Follow these integration steps to enrich the Data Quality view of your data assets in Masthead with the results of any additional quality checks.
Implement Data Quality Rules: orchestrate the quality check rules and save the results into the table for Masthead.
Share the scan results schema mapping with Masthead.
See results in the context of your data assets.
When deploying custom data quality scans, the results must be stored in a dedicated table for each table being checked. This structure facilitates organized tracking and analysis of quality metrics.
Table Naming: You can choose an arbitrary name for the quality check table.
Required Label: It's crucial to apply the following label to the data quality tables:
Location: If your data is stored in multiple cloud regions, create a quality check table in each region. Masthead will read the tables across all your regions.
Schema: You define the schema, but Masthead needs the following fields:
table_reference (REQUIRED): The full identifier of the table being checked, in the form project_name.dataset_name.table_name.
scan_name (OPTIONAL): Name of the scan run. Not required if there is only one scan per table.
rule_name (REQUIRED): Name of the quality rule.
value (REQUIRED): Result of the check, for example a count of events.
timestamp (REQUIRED): Check execution time.
Example scan query:
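A minimal sketch of such a scan, writing one row per rule with the schema fields listed above. It runs against Python's built-in sqlite3 as a stand-in for BigQuery. The `orders` source table, the `orders_quality_checks` results table, the `my_project.sales.orders` reference, and the scan and rule names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table with one NULL customer_id and one duplicated order_id.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10), (2, None), (3, 12), (3, 12)],
)

# Results table holding the required fields plus the optional scan_name.
conn.execute("""
CREATE TABLE orders_quality_checks (
    table_reference TEXT NOT NULL,
    scan_name       TEXT,
    rule_name       TEXT NOT NULL,
    value           REAL NOT NULL,
    timestamp       TEXT NOT NULL
)
""")

# One scan that evaluates two rules and appends one result row per rule.
SCAN_SQL = """
INSERT INTO orders_quality_checks
    (table_reference, scan_name, rule_name, value, timestamp)
SELECT 'my_project.sales.orders', 'daily_orders_scan', 'customer_id_null_count',
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END), CURRENT_TIMESTAMP
FROM orders
UNION ALL
SELECT 'my_project.sales.orders', 'daily_orders_scan', 'order_id_duplicate_count',
       COUNT(*) - COUNT(DISTINCT order_id), CURRENT_TIMESTAMP
FROM orders
"""
conn.execute(SCAN_SQL)

results = conn.execute(
    "SELECT rule_name, value FROM orders_quality_checks ORDER BY rule_name"
).fetchall()
for rule_name, value in results:
    print(rule_name, value)
```

In a real deployment, the same INSERT ... SELECT pattern would be scheduled in your own orchestration tool and would target the labeled quality check table in BigQuery.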
Masthead Access Permissions
Service Account: The service account principal used for this integration is:
masthead-quality-checks@masthead-prod.iam.gserviceaccount.com
Required Roles: The service account requires the following role to read data from the quality check tables:
BigQuery Data Viewer (roles/bigquery.dataViewer)
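One way to grant this role is with a BigQuery SQL GRANT statement. The sketch below only renders the statement text. It assumes you manage access through SQL rather than the console, and the table name passed in is hypothetical.

```python
# Service account principal named in this guide.
MASTHEAD_SA = "masthead-quality-checks@masthead-prod.iam.gserviceaccount.com"


def grant_viewer_sql(table: str) -> str:
    """Render a BigQuery GRANT statement giving Masthead read access to one table.

    `table` is a full identifier in the form project_name.dataset_name.table_name.
    """
    return (
        "GRANT `roles/bigquery.dataViewer` "
        f"ON TABLE `{table}` "
        f'TO "serviceAccount:{MASTHEAD_SA}";'
    )


# Hypothetical quality check table; replace with your own.
statement = grant_viewer_sql("my_project.quality.orders_quality_checks")
print(statement)
```

Run the printed statement in BigQuery once for each quality check table, or grant the role at the dataset level if that fits your access model better.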
Scans and Data Synchronization Schedule
You can run data quality checks on any schedule you choose. Masthead will determine the optimal frequency for syncing the scan results and running its analysis.
Can we sample the data?
Yes. Sampling makes data quality checks more cost-efficient. As with any SQL query, you can filter the data being checked based on its dimensions.
Data Quality with Dataplex has an easy-to-use UI-based setting that allows you to limit the scope and sample the data queried.
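For custom SQL scans, filtering by a dimension can be sketched as follows. The example again uses Python's built-in sqlite3 as a stand-in for BigQuery, with a hypothetical `events` table filtered by `event_date`. In BigQuery, a filter like this typically lowers cost only when the table is partitioned or clustered on the filtered column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("2024-01-01", None),
        ("2024-01-01", 1),
        ("2024-01-02", 2),
        ("2024-01-02", None),
        ("2024-01-02", 3),
    ],
)


def null_user_count(conn: sqlite3.Connection, since: str) -> int:
    """Count rows with a missing user_id, restricted to event_date >= since."""
    row = conn.execute(
        "SELECT COUNT(*) FROM events WHERE user_id IS NULL AND event_date >= ?",
        (since,),
    ).fetchone()
    return row[0]


# Check the whole table versus only the most recent day.
full_scan = null_user_count(conn, "0000-01-01")
latest_day = null_user_count(conn, "2024-01-02")
print(full_scan, latest_day)
```

Limiting the check to the newest data is a common choice for scheduled scans, since older rows were already covered by earlier runs.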
How much does it cost to run data quality checks?
Masthead itself doesn't charge for running data quality checks. Put simply, customers pay only for the execution of rules.
The cost of each data quality check corresponds to the volume of data processed in BigQuery, priced according to the data location and the pricing plan selected for the particular Google Cloud project.
Dataplex pricing is pay-as-you-go, and the cost of running a data quality check is based on Dataplex premium processing pricing.
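For checks billed by the volume of data processed, a rough estimate is bytes processed per run, times the price per TiB, times the number of runs. The sketch below assumes that billing model and ignores minimum charges and free tiers. The price used in the example call is a placeholder, not a quoted rate, so check the current pricing for your location and plan.

```python
TIB = 1024 ** 4  # bytes in one tebibyte
GIB = 1024 ** 3  # bytes in one gibibyte


def estimate_monthly_cost(bytes_per_run: int, runs_per_month: int, usd_per_tib: float) -> float:
    """Estimate monthly cost for a check billed per TiB of data processed."""
    return bytes_per_run / TIB * usd_per_tib * runs_per_month


# Hypothetical inputs: a daily check scanning 50 GiB at a placeholder 5.00 USD per TiB.
cost = estimate_monthly_cost(bytes_per_run=50 * GIB, runs_per_month=30, usd_per_tib=5.0)
print(f"{cost:.2f} USD per month")
```

Reducing `bytes_per_run` through sampling or filtering lowers the estimate proportionally.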
After creating a rule in Masthead Data Quality with Dataplex, will it appear in Dataplex in Google Cloud?
Yes. Once a rule is created in Masthead, it is available in the Dataplex UI, and vice versa. Rules created in Masthead can be modified or deleted in Dataplex at any time.