Data Quality
This functionality checks data against metrics such as completeness, uniqueness, validity, NULLs, and consistency, or against any custom rule you might need, such as the length of a string.
Masthead leverages the Dataplex API, a native Google Cloud solution, to power and execute the data quality rules, which are, in essence, SQL queries.
What data does Masthead query?
None. Masthead does not request permissions to read or edit clients' data, and it does neither. Read-only permissions are granted to Dataplex, a Google Cloud native service, while Masthead only helps to create and manage the rules that exist in Dataplex.
Is Data Quality available automatically?
No, but it can be enabled with just one click. You need to activate the Cloud Dataplex API in the connected Google Cloud Project. The link to this is provided in the Masthead UI, under the Data Quality tab. Additionally, rules must be manually created for each table you wish to monitor. Once the rules are created and scheduled, they will execute automatically.
What metrics can we monitor via data quality?
This depends on each business and the metrics that are crucial for it. Data Quality rules are scheduled SQL queries. There are pre-set rules for aspects like completeness, accuracy, consistency, validity, uniqueness, and NULL checks. However, any metric within any dimension can be monitored through a custom SQL rule.
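To make the idea of "rules as SQL queries" concrete, here is a minimal sketch of three checks (completeness, uniqueness, and a custom string-length rule) expressed as SQL. It runs against an in-memory SQLite table purely for illustration; the `orders` table, its columns, the `pass_ratio` helper, and the way each ratio is defined are all assumptions made for this example, not the SQL that Dataplex actually generates.

```python
import sqlite3

# Hypothetical sample table; a real rule would target a BigQuery table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "b@example.com"), (3, "c@example.com")],
)


def pass_ratio(connection, condition, table="orders"):
    """Return the fraction of rows in `table` that satisfy a SQL row condition."""
    passed, total = connection.execute(
        f"SELECT SUM(CASE WHEN {condition} THEN 1 ELSE 0 END), COUNT(*) FROM {table}"
    ).fetchone()
    return passed / total if total else 1.0


# Completeness: share of rows where the column is populated.
completeness = pass_ratio(conn, "email IS NOT NULL")

# Uniqueness (as defined in this sketch): distinct values divided by non-NULL values.
distinct_ids, total_ids = conn.execute(
    "SELECT COUNT(DISTINCT order_id), COUNT(order_id) FROM orders"
).fetchone()
uniqueness = distinct_ids / total_ids

# Custom rule: populated emails must not exceed 254 characters.
custom = pass_ratio(conn, "email IS NULL OR LENGTH(email) <= 254")
```

Each ratio can then be compared with a threshold chosen by the business, for example requiring completeness of 1.0 on a mandatory column.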
Why would I need Masthead to integrate with Dataplex?
Dataplex (Data Quality) is a SQL-based solution, in contrast to Masthead, which is based on logs and metadata. Unlike Dataplex, Masthead provides automated anomaly detection across the entire data platform by identifying anomalies in time-series tables and errors in pipelines in real time. Additionally, Masthead does not increase clients' cloud costs, since it does not run SQL queries.
At Masthead, we believe Dataplex is the optimal solution for GCP (Google Cloud Platform) users to implement data quality checks. The benefits include:
It's 100x cheaper. You pay only for SQL execution, which avoids the 100x additional cost of purchasing another SQL-first solution on top of the compute costs for the executed queries.
The safest approach. You do not expose your data to a new third-party vendor, which eliminates the need to spend time with your security and legal teams onboarding a third-party tool for data quality checks in your data warehouse.
No vendor lock-in. You do not need to grant any third-party vendor permission to query or edit your data, so the data stays within Google Cloud, where your data quality checks will also remain.
Commonly asked questions
Can we sample the data?
Yes. To make data quality checks more cost-efficient, an easy-to-use UI-based setting lets you limit the scope and sample the data queried. You can also filter the data by its dimensions, as you would in any SQL query.
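For readers who prefer to see the two settings as data, the sketch below assembles a scan specification with a sampling percentage and a row filter. The field names (`rules`, `samplingPercent`, `rowFilter`, `nonNullExpectation`) follow our reading of the Dataplex data quality spec and should be verified against the current API reference; the `build_scan_spec` helper, the column names, and the requirement that the percentage be greater than zero are assumptions made only for this example.

```python
def build_scan_spec(rules, sampling_percent=100.0, row_filter=None):
    """Assemble a data quality spec dict (hypothetical helper, not a Masthead API)."""
    # This sketch uses a stricter convention of its own: the percentage must be in (0, 100].
    if not 0 < sampling_percent <= 100:
        raise ValueError("sampling_percent must be greater than 0 and at most 100")
    spec = {"rules": list(rules), "samplingPercent": sampling_percent}
    if row_filter:
        # A SQL condition, like a WHERE clause, restricting the rows that are checked.
        spec["rowFilter"] = row_filter
    return spec


# Hypothetical rule: the `email` column must always be populated.
rules = [
    {
        "column": "email",
        "dimension": "COMPLETENESS",
        "nonNullExpectation": {},
        "threshold": 1.0,
    }
]

# Check a 10% sample of rows from 2024 onwards (`order_date` is an assumed column).
spec = build_scan_spec(
    rules,
    sampling_percent=10.0,
    row_filter="order_date >= '2024-01-01'",
)
```

Sampling trades some coverage for lower cost, so a narrow row filter combined with a full scan can be the better choice when only recent partitions matter.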
How much does it cost to run data quality checks?
Masthead itself does not charge for running data quality checks. Dataplex pricing is pay-as-you-go, and the cost of running a data quality check is based on the Dataplex processing price. The cost of each check is equivalent to the cost of the data processed in Google BigQuery at the chosen data location and under the pricing plan selected for the particular Google Cloud project. Put simply, customers pay only for the execution of rules.
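As a rough back-of-the-envelope illustration of that pay-per-execution model, the sketch below estimates the cost of one check from the volume of data it processes. The `estimate_check_cost` function is hypothetical, the price passed in is a placeholder rather than a quoted Google Cloud rate, and the assumption that billed volume scales linearly with the sampling percentage is a simplification; use the price that applies to your project's location and pricing plan.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def estimate_check_cost(table_bytes, price_per_tib, sampling_percent=100.0):
    """Estimate the cost of one rule execution (illustrative only).

    Assumes billed volume is proportional to the sampled share of the table,
    which is a simplification of how processing is actually metered.
    """
    scanned_bytes = table_bytes * sampling_percent / 100.0
    return scanned_bytes / TIB * price_per_tib


# Placeholder inputs: a 2 TiB table, a 10% sample, and a made-up price per TiB.
cost = estimate_check_cost(
    table_bytes=2 * TIB,
    price_per_tib=5.0,
    sampling_percent=10.0,
)
```

Multiplying the per-run estimate by the number of scheduled runs per month gives a first approximation of the monthly spend for a rule.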
After creating a rule in Masthead, will it appear in Dataplex in Google Cloud?
Yes. Once a rule is created in Masthead, it is also available in Dataplex, and vice versa. Rules created in Masthead can be modified or deleted in Dataplex at any time.