Data Quality Scans

Data Quality scans check data against metrics such as completeness, uniqueness, validity, NULL counts, and consistency, or against any custom rule you need, such as the length of a string.

Masthead can integrate your own data quality scans or use Dataplex, a native Google Cloud solution, to power and execute data quality rules, which are in essence SQL queries. Let's look at both options.

Data Scans with Dataplex

What data does Masthead query with Dataplex?

None. Masthead does not request permissions, nor does it read or edit clients' data. The read-only permissions should be granted to Dataplex, a Google Cloud native service, while Masthead only helps to create and manage rules existing in Dataplex.

Is Dataplex available automatically?

It can be enabled with just one click. You need to activate the Cloud Dataplex API in the connected Google Cloud Project. The link to this is provided in the Masthead UI, under the Data Quality tab. Additionally, rules must be manually created for each table you wish to monitor. Once the rules are created and scheduled, they will execute automatically.

Masthead Data Quality

What metrics can we monitor via Dataplex?

This depends on each business and the metrics that are crucial for it. Data Quality rules are scheduled SQL queries. There are pre-set rules for aspects like completeness, accuracy, consistency, validity, uniqueness, and null checks.

Data Quality rule creation

However, any metric within any dimension can be monitored through a custom SQL rule.
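As an illustration of the kind of check a custom rule can express, here is a minimal sketch of a string-length check written as a plain BigQuery query. It is not Dataplex rule syntax; the choice of column (event_name) and the threshold of 40 characters are assumptions made for the example.

```sql
-- Sketch only: share of rows whose event_name is at most 40 characters long.
-- The column and the threshold are illustrative assumptions.
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(LENGTH(event_name) <= 40) AS passing_rows,
  SAFE_DIVIDE(COUNTIF(LENGTH(event_name) <= 40), COUNT(*)) AS pass_ratio
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20201108`;
```

A rule built on this idea would compare pass_ratio against a threshold that suits the business.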

Data Quality row check rule creation

Why would I need Masthead to integrate with Dataplex?

Dataplex (with Masthead Data Quality) is a SQL-based solution, in contrast to Masthead Monitors, which are based on logs and metadata. Unlike Dataplex, Masthead Monitors provide automated anomaly detection across the entire data platform by identifying anomalies in time-series tables and errors in pipelines in real time. Additionally, Masthead Monitors do not increase cloud costs because they do not run SQL queries.

At Masthead, we believe Dataplex is the optimal solution for Google Cloud users to implement data quality checks. The benefits include:

  • It's 100x cheaper. You pay only for SQL execution, which avoids the 100x additional cost of purchasing another SQL-first solution on top of the compute costs for the executed queries.

  • The safest approach. It doesn't expose data to a third-party vendor, which also eliminates the need to spend time with security and legal teams on onboarding a third-party tool.

  • No additional vendor lock-in.

Custom Data Scans

Follow these integration steps to extend the Data Quality view of your data assets in Masthead with the results of any additional quality checks.

  1. Implement Data Quality Rules: orchestrate the quality check rules and save the results into a table.

  2. Share the scan results table with Masthead.

  3. See results in the context of your data assets.

When deploying custom data quality scans, the results must be stored in a single table for all tables being checked. This structure facilitates organized tracking and analysis of quality metrics.

  • Table Naming: You can choose an arbitrary name for the quality check table.

  • Required Label: It's crucial to apply the following label so that Masthead can recognize the table for data scans:

labels: {
    masthead: data_quality_table,
    ...
}
  • Location: If you have data stored in multiple cloud regions, create a results table in each region. Masthead will read the tables across all your regions.

  • Schema: Masthead requires the following data to be included:

| Column | Type | Description |
| --- | --- | --- |
| table_reference | STRING, REQUIRED | The full identifier of the table being checked, in the form project_name.dataset_name.table_name |
| scan_name | STRING, OPTIONAL | Name of the scan run. Not required if there is only one scan per table. |
| rule_name | STRING, REQUIRED | Name of the quality rule |
| value | FLOAT64, REQUIRED | Count of events, for example |
| timestamp | TIMESTAMP, REQUIRED | Check execution time |

  • Retrospective data: Include at least 14 days of historical data to begin analyzing quality incidents immediately. Without it, analysis may be delayed by up to 14 days.
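The requirements above can be sketched as a single BigQuery DDL statement. The project, dataset, and table names (my_project, my_dataset, data_quality_results) are placeholders chosen for this example; only the column definitions and the masthead label come from the requirements listed here.

```sql
-- Sketch of a results table; all names before the column list are hypothetical.
CREATE TABLE IF NOT EXISTS `my_project.my_dataset.data_quality_results` (
  table_reference STRING NOT NULL,   -- project_name.dataset_name.table_name
  scan_name STRING,                  -- optional when there is one scan per table
  rule_name STRING NOT NULL,         -- name of the quality rule
  value FLOAT64 NOT NULL,            -- result of the check, e.g. a count of events
  timestamp TIMESTAMP NOT NULL       -- check execution time
)
OPTIONS (
  -- Label that lets Masthead recognize the table for data scans.
  labels = [('masthead', 'data_quality_table')]
);
```

If your data spans several regions, the same statement would be run once per regional dataset.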

Example scan query:

SELECT
  *,
  CURRENT_TIMESTAMP() AS timestamp
FROM (
  SELECT
    'bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*' AS table_reference,
    'main_daily_check' AS scan_name,
    COUNT(DISTINCT user_pseudo_id) AS unique_users,
    CAST(MAX(ecommerce.purchase_revenue) AS INT64) AS max_purchase_revenue
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  WHERE _TABLE_SUFFIX = '20201108'
)
UNPIVOT(value FOR rule_name IN (unique_users, max_purchase_revenue))
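To persist the output, the scan query can be wrapped in an INSERT statement. The following is a minimal sketch, assuming a results table named my_project.my_dataset.data_quality_results (a hypothetical name) that follows the schema described above; the explicit cast to FLOAT64 is added here to match the declared type of the value column.

```sql
-- Sketch only: the target table name is a placeholder.
INSERT INTO `my_project.my_dataset.data_quality_results`
  (table_reference, scan_name, rule_name, value, timestamp)
SELECT
  table_reference,
  scan_name,
  rule_name,
  CAST(value AS FLOAT64) AS value,       -- UNPIVOT yields INT64 here
  CURRENT_TIMESTAMP() AS timestamp       -- check execution time
FROM (
  SELECT
    'bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*' AS table_reference,
    'main_daily_check' AS scan_name,
    COUNT(DISTINCT user_pseudo_id) AS unique_users,
    CAST(MAX(ecommerce.purchase_revenue) AS INT64) AS max_purchase_revenue
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
  WHERE _TABLE_SUFFIX = '20201108'
)
UNPIVOT(value FOR rule_name IN (unique_users, max_purchase_revenue));
```

Each run appends one row per rule, so the table accumulates a time series for every rule_name.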

Masthead Access Permissions

  • Service Account: The service account principal used for this integration is:

[email protected]

  • Required Roles: To read data from the results table, the service account needs the BigQuery Data Viewer role (how to add permissions)
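One way to grant this role at the table level is a BigQuery GRANT statement. This is a sketch under assumptions: the table name is hypothetical, and MASTHEAD_SERVICE_ACCOUNT_EMAIL is a placeholder to be replaced with the service account principal given above.

```sql
-- Sketch only: replace the placeholder principal and the hypothetical table name.
GRANT `roles/bigquery.dataViewer`
ON TABLE `my_project.my_dataset.data_quality_results`
TO 'serviceAccount:MASTHEAD_SERVICE_ACCOUNT_EMAIL';
```

The same role can also be granted through the Google Cloud console or at the dataset level, depending on how your project manages access.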

Scheduling

You can run data quality checks on any schedule you choose. Masthead will determine the optimal frequency for syncing and analyzing the data scan results.

Commonly asked questions

  • Can we sample the data?

Yes. Sampling makes data quality checks more cost-efficient. You can filter the data as in any SQL query, based on data dimensions.

Data Quality with Dataplex has an easy-to-use UI-based setting that allows you to limit the scope and sample the data queried.
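For a custom scan, the filtering described above might look like the following sketch. The metrics, the platform value, and the 1-in-10 user sample are assumptions chosen for illustration.

```sql
-- Sketch only: a check limited to one daily shard, one platform,
-- and a deterministic subset of users.
SELECT
  COUNT(DISTINCT user_pseudo_id) AS sampled_users,
  COUNTIF(event_name IS NULL) AS null_event_names
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE _TABLE_SUFFIX = '20201108'                        -- limits which shards are read
  AND platform = 'WEB'                                  -- filter on a data dimension
  AND MOD(FARM_FINGERPRINT(user_pseudo_id), 10) = 0;    -- keeps about 1 in 10 users
```

The shard filter is the part that reduces the amount of data read; the row-level filters narrow which rows are evaluated, and hashing the user ID keeps the sample stable between runs.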

  • How much does it cost to run data quality checks?

Masthead itself doesn't charge for running data quality checks. Put simply, customers pay only for the execution of rules.

The cost of each data quality check corresponds to the cost of the data processed in BigQuery, given the data location and the pricing plan of the particular Google Cloud project.

Dataplex pricing is pay-as-you-go, and the cost of running a data quality check is based on Dataplex premium processing pricing.

  • After creating a rule in Masthead Data Quality with Dataplex, will it appear in Dataplex in Google Cloud?

Yes, once a rule is created in Masthead, it will be available in the Dataplex UI, and vice versa. Rules established in Masthead can be modified or deleted in Dataplex at any time.
