Let me tell you a story. You're a data engineer at a growing startup. Your team dumps CSV files into S3 — sales reports, user events, transaction logs. Fifty files. Then five hundred. Then five thousand. One morning, a data analyst asks you: "Where's the Q3 revenue data?" And you realise... you have no idea. Is it in s3://data/reports/revenue_q3.csv or s3://analytics/sales/2025-q3/ or s3://data-lake-prod/revenue/year=2025/quarter=3/?
Nobody knows which files have which columns. Nobody knows what data type "revenue" is — is it a float? An integer? A string with a dollar sign? Nobody knows when the data was last updated, who created it, or whether it's even valid anymore.
You have a data swamp, not a data lake.
This is the problem a metastore solves.
What is a Metastore?
A metastore is a catalogue of metadata — it stores information about your data, not the data itself. Think of it as the card catalogue in a library. The books (data files) are on the shelves (S3, ADLS, GCS). The catalogue tells you: which shelf, what topic, how many pages, when it was published, and who wrote it.
Without a metastore, every query engine (Spark, Presto, Trino, Athena) would need to scan every file to figure out what's inside. With a metastore, they just ask the catalogue: "What are the columns in the sales.revenue table? Where are the files?" and go straight to reading data. It's the difference between opening every book in the library versus checking the catalogue first.
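To make that concrete, here's the lookup from PySpark, a minimal sketch assuming Spark 3.4+ and a SparkSession (spark) already wired to your metastore, using the sales.revenue table defined later in this post:
# Ask the metastore, not the data files, what the table looks like
for col in spark.catalog.listColumns("sales.revenue"):
    print(col.name, col.dataType, col.isPartition)

# And where the files live, straight from the catalogue
spark.sql("DESCRIBE FORMATTED sales.revenue").show(truncate=False)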
The Evolution of Metastores
Apache Hive Metastore — The Original
The Hive Metastore (HMS) was built at Facebook as part of Apache Hive (open-sourced in 2008) and became the de facto standard that every data tool speaks. Even today, Spark, Presto, Trino, Flink, and dozens of other engines can talk to a Hive Metastore.
How Hive Metastore Works
-- Hive DDL: Creating a table in the metastore
CREATE EXTERNAL TABLE sales.revenue (
transaction_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
currency STRING,
payment_method STRING,
created_at TIMESTAMP
)
PARTITIONED BY (year INT, quarter INT)
STORED AS PARQUET
LOCATION 's3://data-lake-prod/sales/revenue/';
-- Add a partition (tells metastore where to find the data)
ALTER TABLE sales.revenue ADD PARTITION (year=2025, quarter=3)
LOCATION 's3://data-lake-prod/sales/revenue/year=2025/quarter=3/';
-- Now ANY engine connected to this metastore can query:
SELECT SUM(amount) FROM sales.revenue WHERE year=2025 AND quarter=3;
-- Spark, Presto, Trino, Athena — they all understand this schema
# Running Hive Metastore standalone (no Hive needed!)
# Most teams run HMS as a standalone service
# Docker Compose for Hive Metastore + PostgreSQL backend
version: '3.8'
services:
  metastore-db:
    image: postgres:15
    environment:
      POSTGRES_DB: metastore
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive123
    volumes:
      - metastore-data:/var/lib/postgresql/data
  metastore:
    image: apache/hive:4.0.0
    depends_on:
      - metastore-db
    environment:
      SERVICE_NAME: metastore
      DB_DRIVER: postgres
      SERVICE_OPTS: >-
        -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://metastore-db:5432/metastore
        -Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver
        -Djavax.jdo.option.ConnectionUserName=hive
        -Djavax.jdo.option.ConnectionPassword=hive123
    # The apache/hive image doesn't bundle a Postgres JDBC driver;
    # mount one into the container, e.g.:
    # volumes:
    #   - ./postgresql.jar:/opt/hive/lib/postgres.jar
    ports:
      - "9083:9083"  # Thrift API port
volumes:
  metastore-data:
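Once it's up, any engine can point at thrift://localhost:9083. A minimal PySpark sketch (the URI matches the port mapping above):
# Connect Spark to the standalone Hive Metastore
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("hive.metastore.uris", "thrift://localhost:9083")
         .enableHiveSupport()  # use the Hive catalogue, not Spark's in-memory default
         .getOrCreate())

spark.sql("SHOW DATABASES").show()  # databases now come from the shared metastore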
Hive Metastore: Pros & Cons
- Pros: open source, battle-tested, and universally supported; virtually every query engine speaks its Thrift protocol, and it runs anywhere.
- Cons: you operate it yourself (a service plus its backend database), there's no built-in access control or data lineage, and the backing database becomes a bottleneck once tables accumulate many thousands of partitions.
AWS Glue Data Catalog — Managed Hive Metastore
AWS Glue Data Catalog is essentially a managed Hive Metastore hosted by AWS. You don't run any servers, don't manage a database, don't worry about backups. AWS handles everything. And it's deeply integrated with the AWS ecosystem — Athena, Redshift Spectrum, EMR, Lake Formation, and Glue ETL all share the same catalogue.
# AWS Glue: Creating a table via AWS CLI
aws glue create-table \
--database-name sales \
--table-input '{
"Name": "revenue",
"StorageDescriptor": {
"Columns": [
{"Name": "transaction_id", "Type": "bigint"},
{"Name": "customer_id", "Type": "bigint"},
{"Name": "amount", "Type": "decimal(10,2)"},
{"Name": "currency", "Type": "string"},
{"Name": "created_at", "Type": "timestamp"}
],
"Location": "s3://data-lake-prod/sales/revenue/",
"InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
"OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
}
},
"PartitionKeys": [
{"Name": "year", "Type": "int"},
{"Name": "quarter", "Type": "int"}
],
"TableType": "EXTERNAL_TABLE"
}'
# Now query via Athena (SQL over S3 — serverless):
# SELECT SUM(amount) FROM sales.revenue WHERE year=2025;
# No servers to manage. Pay per query.
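The same query can be driven from code. A boto3 sketch, where the region and results bucket are placeholders:
# Kick off the Athena query programmatically
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is illustrative
resp = athena.start_query_execution(
    QueryString="SELECT SUM(amount) FROM revenue WHERE year=2025",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)
print(resp["QueryExecutionId"])  # poll get_query_execution with this ID for results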
# Glue Crawlers: Auto-discover schemas
# Instead of manually defining tables, let Glue scan your S3 bucket:
aws glue create-crawler \
--name sales-crawler \
--role GlueServiceRole \
--database-name sales \
--targets '{
"S3Targets": [
{"Path": "s3://data-lake-prod/sales/"}
]
}'
# Run the crawler
aws glue start-crawler --name sales-crawler
# Glue scans the Parquet files, infers the schema, and creates
# tables + partitions in the Data Catalog automatically!
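You can check what the crawler registered by reading the catalogue back, a short boto3 sketch using the names from the examples above:
# Read table metadata back from the Glue Data Catalog
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="sales", Name="revenue")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
print("Location:", table["StorageDescriptor"]["Location"])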
# Terraform for Glue Catalog
resource "aws_glue_catalog_database" "sales" {
  name = "sales"
}

resource "aws_glue_catalog_table" "revenue" {
  name          = "revenue"
  database_name = aws_glue_catalog_database.sales.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "parquet"
  }

  storage_descriptor {
    location      = "s3://data-lake-prod/sales/revenue/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns {
      name = "transaction_id"
      type = "bigint"
    }
    columns {
      name = "amount"
      type = "decimal(10,2)"
    }
  }

  partition_keys {
    name = "year"
    type = "int"
  }
}
AWS Lake Formation — Governance on Top of Glue
Glue Data Catalog tells you what data exists. Lake Formation adds who can access it. It provides fine-grained access control: table-level, column-level, even row-level filtering. This is how enterprises manage data lakes at scale.
# Lake Formation: Grant access to a table
aws lakeformation grant-permissions \
--principal '{"DataLakePrincipalIdentifier": "arn:aws:iam::123456789:role/analyst-role"}' \
--resource '{"Table": {"DatabaseName": "sales", "Name": "revenue"}}' \
--permissions '["SELECT"]' \
--permissions-with-grant-option '[]'
# Column-level access:
aws lakeformation grant-permissions \
--principal '{"DataLakePrincipalIdentifier": "arn:aws:iam::123456789:role/analyst-role"}' \
--resource '{"TableWithColumns": {"DatabaseName": "sales", "Name": "customers", "ColumnNames": ["name", "city"]}}' \
--permissions '["SELECT"]'
# Analysts can see name and city, but NOT email, phone, or SSN
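And to audit what a principal actually holds, list its permissions, a minimal boto3 sketch reusing the role ARN from the CLI examples:
# List the Lake Formation permissions granted to the analyst role
import boto3

lf = boto3.client("lakeformation")
resp = lf.list_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789:role/analyst-role"}
)
for entry in resp["PrincipalResourcePermissions"]:
    print(entry["Resource"], entry["Permissions"])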
Databricks Unity Catalog — The Governance-First Approach
Databricks Unity Catalog takes a fundamentally different approach. Instead of bolting governance onto an existing catalogue (like Lake Formation does with Glue), Unity Catalog was built from the ground up for governance. It provides a three-level namespace, centralised access control, data lineage, and cross-workspace sharing.
-- Unity Catalog: Creating objects
-- Three-level namespace: catalog.schema.table
-- Create a catalog (top-level container)
CREATE CATALOG production;
-- Create a schema within the catalog
CREATE SCHEMA production.sales;
-- Create a managed table (Unity Catalog manages the storage)
CREATE TABLE production.sales.revenue (
transaction_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
currency STRING,
created_at TIMESTAMP
)
USING DELTA; -- Delta Lake format (ACID transactions!)
-- Or create an external table (you manage storage)
CREATE TABLE production.sales.legacy_revenue
USING PARQUET
LOCATION 's3://data-lake-prod/sales/revenue/';
-- Grant access
GRANT SELECT ON TABLE production.sales.revenue TO analyst_group;
GRANT USE SCHEMA ON SCHEMA production.sales TO analyst_group;
GRANT USE CATALOG ON CATALOG production TO analyst_group;
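-- The mask function referenced below must exist first. A minimal sketch
-- of one (the regex and the 'admins' group check are illustrative):
CREATE FUNCTION mask_email(email STRING) RETURNS STRING
RETURN CASE WHEN is_account_group_member('admins') THEN email
            ELSE regexp_replace(email, '^(.)[^@]*(@.*)$', '$1***$2') END;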
-- Column-level masking
ALTER TABLE production.sales.customers
ALTER COLUMN email SET MASK mask_email;
-- Analysts see: j***@example.com instead of john@example.com
Apache Polaris (Iceberg REST Catalog) — The Open Future
Apache Polaris (originally Snowflake's Polaris Catalog, donated to Apache in 2024) is the newest entrant. It's built specifically for Apache Iceberg tables and provides a vendor-neutral REST API. If you believe the future is open table formats (Iceberg), Polaris is worth watching closely.
# Polaris uses a REST API instead of Thrift (like Hive)
# This makes it language-agnostic and cloud-agnostic
# Create a catalog via REST API
curl -X POST https://polaris.example.com/api/management/v1/catalogs \
-H "Content-Type: application/json" \
-d '{
"catalog": {
"name": "production",
"type": "INTERNAL",
"properties": {
"default-base-location": "s3://data-lake/production/"
},
"storageConfigInfo": {
"storageType": "S3",
"allowedLocations": ["s3://data-lake/production/"]
}
}
}'
# Spark connects to Polaris using the Iceberg REST catalog.
# Catalog settings must be in place before the session starts
# (spark.conf.set at runtime won't register a new catalog), so
# configure them on the builder, with the Iceberg runtime jar on the classpath:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://polaris.example.com/api/catalog")
    .config("spark.sql.catalog.polaris.warehouse", "production")
    .getOrCreate())

# Then query normally:
spark.sql("SELECT * FROM polaris.sales.revenue WHERE year = 2025")
The Big Comparison
| Feature | Hive Metastore | AWS Glue | Unity Catalog | Polaris |
|---|---|---|---|---|
| Type | Open source | Managed (AWS) | Managed (Databricks) | Open source |
| Protocol | Thrift | Hive-compatible + AWS API | REST + Hive-compatible | REST (Iceberg spec) |
| Table formats | Hive tables | Hive + Iceberg | Delta Lake + Iceberg | Iceberg (native) |
| Access control | None built-in | Lake Formation | Built-in (column-level) | Pluggable |
| Data lineage | No | No | Yes (automatic) | No |
| Multi-engine | Everything speaks Hive | AWS services only | Databricks + Iceberg clients | Any Iceberg client |
| Cloud | Any / self-hosted | AWS only | AWS + Azure + GCP | Any / self-hosted |
| Ops overhead | High (self-manage) | Zero (managed) | Zero (managed) | Medium (self-manage) |
| Cost | Free (infra cost only) | Per-request pricing | Included with Databricks | Free (infra cost only) |
When to Use What
- Hive Metastore: you self-host, need multi-cloud portability, and want maximum engine compatibility.
- AWS Glue Data Catalog: you're all-in on AWS and want zero ops overhead; add Lake Formation when you need fine-grained access control.
- Unity Catalog: you're on Databricks and want governance (access control, lineage, sharing) built in.
- Apache Polaris: you're betting on Apache Iceberg and want an open, vendor-neutral REST catalogue.
The Modern Lakehouse Architecture
The modern data stack has converged on a pattern called the lakehouse. Instead of separate data lakes and data warehouses, you have one storage layer (S3/ADLS/GCS) with a table format (Delta Lake, Iceberg, Hudi) that provides warehouse-like features (ACID transactions, schema enforcement, time travel) on top of a data lake. The metastore is the central nervous system of this architecture.
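Time travel in one line makes the point concrete. A minimal PySpark sketch with Delta Lake, assuming a SparkSession as before; the table path is illustrative and the Delta runtime must be on the cluster:
# Read the current table, then the same table as of an earlier version
current = spark.read.format("delta").load("s3://data-lake-prod/sales/revenue/")
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)  # or .option("timestampAsOf", "2025-09-30")
      .load("s3://data-lake-prod/sales/revenue/"))
print(current.count() - v0.count())  # rows added since the first version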
Practical Tips from Production
- Partition wisely: Over-partitioning (thousands of tiny partitions, each holding one small file) kills query performance. Aim for partitions with 100 MB+ of data each. Common strategy: partition by date (year/month/day), not by a high-cardinality key like user_id.
- Use Iceberg or Delta Lake, not raw Parquet: Raw Parquet tables in Hive have no ACID, no schema evolution, no time travel. Iceberg and Delta Lake solve all of this. Migration is a one-time cost with massive long-term benefits.
- Automate schema discovery: Use Glue Crawlers, or write Spark jobs that register new partitions automatically (see the sketch after this list). Don't manually run ALTER TABLE ADD PARTITION for every new data drop.
- Name things properly from day one: production.sales.revenue is much better than db1.tbl_rev_v2_final_FINAL. You'll thank yourself in 6 months.
- Start with governance: Don't wait until you have 500 tables to add access control. Set it up from day one. It's 10x harder to retrofit than to build in from the start.
- Monitor metastore health: Slow metastore = slow queries everywhere. Monitor Thrift API latency, backend DB connections, and catalogue API rate limits.
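Here's the partition-registration job promised above, a minimal sketch assuming a Hive-compatible metastore and the date-partitioned layout from the earlier examples:
# Register partitions that exist in S3 but are missing from the metastore
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-partitions")
         .enableHiveSupport()  # talk to the configured Hive-compatible metastore
         .getOrCreate())

# Scan the table's location and add any unregistered partitions
spark.sql("MSCK REPAIR TABLE sales.revenue")

# Or register a single partition explicitly after a data drop:
spark.sql("""
    ALTER TABLE sales.revenue
    ADD IF NOT EXISTS PARTITION (year=2025, quarter=3)
    LOCATION 's3://data-lake-prod/sales/revenue/year=2025/quarter=3/'
""")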
Back to our story from the beginning. That analyst who couldn't find Q3 revenue data? With a metastore, the answer is one SQL query away: SELECT * FROM production.sales.revenue WHERE quarter=3. The metastore knows where the files are, what the columns are, and who's allowed to see them. No more hunting through S3 buckets. No more guessing column types. No more data swamp. Just a clean, governed, queryable lake.