Insights

insights table

The core insights data table that provides actionable recommendations for cost optimization and data management.

Description

An alpha version of the core insights data table for Masthead's customers. This table consolidates recommendations for cost optimization by identifying unused and dead-end tables, along with potential actions to reduce storage costs and improve data management efficiency.

Schema

Column name
Data type
Value

category

STRING

Category of the insight (currently only 'Cost'; other categories may be added in the future)

type

STRING

Type of insight; valid values include 'Dead end', 'Unused', etc.

subtype

STRING

Specific subtype of the insight (e.g., 'Dead end table', 'Unused table')

tenant

STRING

Account/tenant identifier

project_id

STRING

Google Cloud project ID

target_resource

STRING

Name of the target resource

last_updated_time

TIMESTAMP

When this insight was last updated

overview

JSON

JSON object with insight details; see "Overview object" section below for structure

operations

JSON

Array of recommended actions to address the insight (see "Operations object" section)

Insight Types

Cost Category

Dead End

  • Leaf dead end table - Tables that are not used by any downstream processes and are at the end of the data lineage

  • Dead end table - Tables that have no downstream dependencies and may be candidates for removal

  • Dead End pipelines - Pipelines that do not produce any useful output and are not used by any downstream processes. (Not yet implemented)

Unused Tables

  • Unused table - Tables that haven't been accessed or queried within the configured time window

Switching Storage Classes

  • Storage billing model - Datasets that can be moved to a more cost-effective storage billing model based on their update patterns and data compression. (Not yet implemented)

Compute Model Adjustments

  • Compute Model per pipeline - Recommendations for adjusting compute model assignment based on resources usage. (Not yet implemented)

Overview object

The overview field contains detailed information about each insight:

{
  "resource_type": "TABLE|VIEW",
  "resource_name": "project.dataset.table_name",
  "num_bytes": 1234567890,
  "potential_savings": null
}

Operations object

The operations field contains recommended actions:

[
  {
    "action": "remove",
    "resource_type": "TABLE|VIEW",
    "resource_name": "project.dataset.table_name"
  }
]

Table usage examples

Find dead end tables

-- Get all dead end tables with their details
SELECT
  subtype,
  project_id,
  target_resource,
  INT64(overview.num_bytes) AS num_bytes,
  STRING(overview.resource_type) AS resource_type,
  last_updated_time
FROM `masthead-prod`.account_name.insights
WHERE category = 'Cost'
  AND type = 'Dead end'
ORDER BY num_bytes DESC;

Find unused tables

-- Get all unused tables sorted by size
SELECT
  project_id,
  target_resource,
  INT64(overview.num_bytes) AS num_bytes,
  STRING(overview.resource_type) AS resource_type,
  last_updated_time
FROM `masthead-prod`.account_name.insights
WHERE category = 'Cost'
  AND type = 'Unused'
ORDER BY num_bytes DESC;

Calculate potential storage savings

-- Calculate total storage used by dead end and unused tables
SELECT
  type,
  COUNT(*) AS resource_count,
  SUM(SAFE.INT64(overview.num_bytes)) AS total_bytes,
  ROUND(SUM(SAFE.INT64(overview.num_bytes)) / POW(1024, 3), 2) AS total_gb
FROM `masthead-prod`.account_name.insights
WHERE category = 'Cost'
  AND subtype IN ('Dead end table', 'Unused table')
  AND overview.num_bytes IS NOT NULL
GROUP BY type
ORDER BY total_bytes DESC;

Get insights by project

-- View insights grouped by project
SELECT
  project_id,
  type,
  COUNT(*) AS insight_count,
  SUM(SAFE.INT64(overview.num_bytes)) AS total_bytes
FROM `masthead-prod`.account_name.insights
WHERE category = 'Cost'
  AND subtype IN ('Dead end table', 'Unused table')
  AND overview.num_bytes IS NOT NULL
GROUP BY project_id, type
ORDER BY total_bytes DESC;

Recent insights

-- Get insights updated in the last 7 days
SELECT
  category,
  type,
  subtype,
  project_id,
  target_resource,
  last_updated_time
FROM `masthead-prod`.account_name.insights
WHERE DATE(last_updated_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
ORDER BY last_updated_time DESC;

Limitations

  • Alpha Version: This is an early alpha version and the schema may change as we gather feedback.

  • Data Freshness: Insights are updated daily based on usage patterns and lineage analysis.

  • Data Retention: Insights are based on activity within the account's configured lookback window (30 days by default).

  • False Positives: Some tables marked as "unused" may be accessed through external tools not visible to Masthead's monitoring.

Best Practices

  1. Review Before Action: Always manually verify insights before taking action on production data.

  2. Test Removals: Consider moving tables to a staging area before permanent deletion.

  3. Check Dependencies: Cross-reference with the lineage table to understand data dependencies.

  4. Monitor Trends: Regular review of insights can help identify data management patterns and opportunities.

  5. Coordinate with Teams: Ensure data owners are aware before removing resources they may depend on.

Last updated