Data Anomaly Detection
This functionality helps to detect anomalies or outliers in time series data by accessing logs and metadata of Google BigQuery and utilizing machine learning.
None. Masthead does not request permissions, nor does it read or edit clients' data.
Up to 6 hours, depending on the number of tables in BigQuery. During deployment, Masthead parses retrospective logs to learn the update patterns of every time-series table within BigQuery.
Freshness – the recency of a table update. Masthead automatically identifies the frequency of each table update by using GCP logs.
Volume – the volume of data received per insert and per aggregation step.
Data Quality – custom data property changes. Masthead analyzes the results of regularly scheduled data scans to detect anomalies.
No. For Freshness and Volume, Masthead parses retrospective logs and detects time series automatically. Only Data Quality anomaly detection requires the configuration of custom data scans.
Our automated freshness monitoring system tracks the frequency of updates on a table and notifies you when an expected update is missed by creating an incident in the Masthead UI and sending an alert to the Slack channel.
Masthead analyzes one month's worth of retrospective logs to inform its ML model, ensuring that the freshness metric is applied to all time series tables within 5-6 hours of Masthead's deployment.
Freshness anomalies are detected in real-time. As Masthead examines log patterns, any deviation from the expected data ingestion schedule immediately triggers an alert about the anomaly.
In the Incident tab, you will find:
The name of the table where the incident occurred.
The location of the table: its project and dataset.
A graph indicating when the table failed to update, marked in red. Each blue bar represents an update event for the table.
Below each incident, there is a display showing the duration of the missed updates and the expected update frequency based on past patterns.
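Masthead's actual model is not public, but the underlying idea of frequency-based freshness detection can be illustrated with a conceptual BigQuery SQL sketch. The `update_log` table and its columns below are hypothetical placeholders, not Masthead's schema; the 2x-median threshold is likewise an illustrative assumption.

```sql
-- Conceptual sketch only: flag a table as stale when the time since its
-- last update exceeds twice its median update interval.
-- `my_project.monitoring.update_log` and its columns are hypothetical.
WITH intervals AS (
  SELECT
    table_name,
    TIMESTAMP_DIFF(
      update_time,
      LAG(update_time) OVER (PARTITION BY table_name ORDER BY update_time),
      MINUTE
    ) AS minutes_between_updates
  FROM `my_project.monitoring.update_log`
),
expected AS (
  SELECT
    table_name,
    -- APPROX_QUANTILES(x, 2)[OFFSET(1)] is the median interval
    APPROX_QUANTILES(minutes_between_updates, 2)[OFFSET(1)] AS median_interval
  FROM intervals
  WHERE minutes_between_updates IS NOT NULL
  GROUP BY table_name
)
SELECT
  l.table_name,
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(l.update_time), MINUTE)
    AS minutes_since_update,
  e.median_interval
FROM `my_project.monitoring.update_log` AS l
JOIN expected AS e USING (table_name)
GROUP BY l.table_name, e.median_interval
HAVING minutes_since_update > 2 * e.median_interval;
```

A production system would additionally account for weekday/weekend seasonality and irregular schedules, which is where the ML model goes beyond a fixed threshold like this.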
Our automated volume anomaly detection examines variations in the number of rows within a table and provides real-time alerts for unexpected data volume changes. This includes significant additions or deletions of data or any unusual patterns in row changes.
Masthead analyzes one month's worth of retrospective logs to inform its ML model, ensuring that the volume metric is applied to all time series tables within 5-6 hours of Masthead's deployment.
Volume anomalies are detected in real-time. When Masthead analyzes log patterns, any deviation from the expected data range during ingestion instantly triggers an anomaly alert, both in the Masthead interface and on Slack.
In the example above, you can see a table consistently adding rows and a sudden drop below the expected range during two ingestions.
In the Incident tab, you will find:
The name of the table where the incident occurred.
The location of the table: its project and dataset.
A graph indicating when the table failed to update, marked in red. Each blue bar represents an update event for the table.
Below each incident, there is a display showing the duration of the missed updates and the expected update frequency based on past patterns.
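The idea of range-based volume detection can also be sketched in BigQuery SQL. This is not Masthead's implementation; the `insert_log` table, its columns, and the three-standard-deviation threshold are all illustrative assumptions.

```sql
-- Conceptual sketch only: flag ingestions whose inserted row count falls
-- more than 3 standard deviations from that table's historical average.
-- `my_project.monitoring.insert_log` and its columns are hypothetical.
WITH stats AS (
  SELECT
    table_name,
    AVG(rows_inserted)    AS avg_rows,
    STDDEV(rows_inserted) AS stddev_rows
  FROM `my_project.monitoring.insert_log`
  GROUP BY table_name
)
SELECT
  l.table_name,
  l.insert_time,
  l.rows_inserted
FROM `my_project.monitoring.insert_log` AS l
JOIN stats AS s USING (table_name)
WHERE s.stddev_rows > 0
  AND ABS(l.rows_inserted - s.avg_rows) > 3 * s.stddev_rows;
```

In practice an ML model learns a per-table expected range rather than a single global threshold, so both sudden drops and unusual spikes are caught relative to each table's own pattern.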
This section outlines the requirements for integrating custom data quality scans. By leveraging custom data quality rules, you can implement specific validation queries tailored to your business needs. The results of these checks are stored in dedicated tables within your project, providing a clear record of data quality status over time.
Quality Checks Results Storage
When deploying custom data quality scans, the results must be stored in a dedicated table for each table being checked. This structure facilitates organized tracking and analysis of quality metrics.
Table Naming: You can choose an arbitrary name for the quality check table.
Required Label: It's crucial to apply the following label to the data quality tables:
Location: when you have data stored in multiple cloud regions, create a table for each region. Masthead will read the tables across all your regions.
Schema: The schema is defined by you, but the following fields are required:
table reference (REQUIRED) – The full identifier of the table being checked, e.g. project_name.dataset_name.table_name.
scan name (OPTIONAL) – The name of the scan run. Not required if there is only one scan per table.
rule name (REQUIRED) – The name of the quality rule.
value (REQUIRED) – The metric value, for example a count of events.
timestamp (REQUIRED) – The check execution time.
Example scan query:
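As a hypothetical illustration only, a scan query could count NULL customer IDs and write one row matching the field layout described above. Every table name, column name, and value below is a placeholder; your actual schema and rules are defined by you.

```sql
-- Hypothetical example: a quality rule counting NULL customer IDs and
-- appending the result to a quality check table. All names are placeholders.
INSERT INTO `my_project.data_quality.orders_checks`
  (table_reference, scan_name, rule_name, value, timestamp)
SELECT
  'my_project.sales.orders' AS table_reference,
  'daily_orders_scan'       AS scan_name,
  'null_customer_id_count'  AS rule_name,
  COUNT(*)                  AS value,
  CURRENT_TIMESTAMP()       AS timestamp
FROM `my_project.sales.orders`
WHERE customer_id IS NULL;
```

Scheduling such a query (for example via BigQuery scheduled queries or your orchestrator) on any cadence is sufficient; Masthead reads the results table on its own sync schedule.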
Masthead Access Permissions
Service Account: The service account principal used for this integration is:
masthead-quality-checks@masthead-prod.iam.gserviceaccount.com
Required Roles: The service account requires the following access permissions (guide) to read data from the table:
BigQuery Data Viewer (roles/bigquery.dataViewer)
Scans and Data Synchronization Schedule
Data quality checks can be executed on any schedule on your side. Masthead will determine the optimal frequency to sync the data scan results and run analysis.
Integration Steps
Implement Data Quality Rules: orchestrate the quality check rules and save their results into the results table for Masthead.
Share the scan results schema mapping with Masthead.
Monitoring & Alerting: Masthead will analyze the scan results, identify thresholds and triggers for the incidents, and send notifications via your notification service integration.
What is the look-back period during onboarding?
Masthead uses 4 weeks of retrospective logs available in the audit log.
Do you access the table schema?
No, to collect the necessary data points, Masthead uses only logs.
Which tables can be monitored in the data warehouse using Masthead's anomaly detection?
All tables that were updated on a regular cadence during the month prior to Masthead's deployment will be monitored automatically.