Let me tell you a story. You're a data engineer at a growing startup. Your team dumps CSV files into S3 — sales reports, user events, transaction logs. Fifty files. Then five hundred. Then five thousand. One morning, a data analyst asks you: "Where's the Q3 revenue data?" And you realise... you have no idea. Is it in s3://data/reports/revenue_q3.csv or s3://analytics/sales/2025-q3/ or s3://data-lake-prod/revenue/year=2025/quarter=3/?
Nobody knows which files have which columns. Nobody knows what data type "revenue" is — is it a float? An integer? A string with a dollar sign? Nobody knows when the data was last updated, who created it, or whether it's even valid anymore.
You have a data swamp, not a data lake.
This is the problem a metastore solves.
What is a Metastore?
A metastore is a catalogue of metadata — it stores information about your data, not the data itself. Think of it as the card catalogue in a library. The books (data files) are on the shelves (S3, ADLS, GCS). The catalogue tells you: which shelf, what topic, how many pages, when it was published, and who wrote it.
Without a metastore, every query engine (Spark, Presto, Trino, Athena) would need to scan every file to figure out what's inside. With a metastore, they just ask the catalogue: "What are the columns in the sales.revenue table? Where are the files?" and go straight to reading data. It's the difference between opening every book in the library versus checking the catalogue first.
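To make that concrete, here's the lookup from PySpark, a minimal sketch assuming Spark 3.4+ and a SparkSession (spark) already wired to your metastore, using the sales.revenue table defined later in this post:
# Ask the metastore, not the data files, what the table looks like
for col in spark.catalog.listColumns("sales.revenue"):
    print(col.name, col.dataType, col.isPartition)

# And where the files live, straight from the catalogue
spark.sql("DESCRIBE FORMATTED sales.revenue").show(truncate=False)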
The Evolution of Metastores
Apache Hive Metastore — The Original
The Hive Metastore (HMS) was built at Facebook as part of Apache Hive (open-sourced in 2008) and became the de facto standard that every data tool speaks. Even today, Spark, Presto, Trino, Flink, and dozens of other engines can talk to a Hive Metastore.
How Hive Metastore Works
-- Hive DDL: Creating a table in the metastore
CREATE EXTERNAL TABLE sales.revenue (
transaction_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
currency STRING,
payment_method STRING,
created_at TIMESTAMP
)
PARTITIONED BY (year INT, quarter INT)
STORED AS PARQUET
LOCATION 's3://data-lake-prod/sales/revenue/';
-- Add a partition (tells metastore where to find the data)
ALTER TABLE sales.revenue ADD PARTITION (year=2025, quarter=3)
LOCATION 's3://data-lake-prod/sales/revenue/year=2025/quarter=3/';
-- Now ANY engine connected to this metastore can query:
SELECT SUM(amount) FROM sales.revenue WHERE year=2025 AND quarter=3;
-- Spark, Presto, Trino, Athena — they all understand this schema
# Running Hive Metastore standalone (no Hive needed!)
# Most teams run HMS as a standalone service
# Docker Compose for Hive Metastore + PostgreSQL backend
version: '3.8'
services:
  metastore-db:
    image: postgres:15
    environment:
      POSTGRES_DB: metastore
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive123
    volumes:
      - metastore-data:/var/lib/postgresql/data
  metastore:
    image: apache/hive:4.0.0
    depends_on:
      - metastore-db
    environment:
      SERVICE_NAME: metastore
      DB_DRIVER: postgres
      SERVICE_OPTS: >-
        -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://metastore-db:5432/metastore
        -Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver
        -Djavax.jdo.option.ConnectionUserName=hive
        -Djavax.jdo.option.ConnectionPassword=hive123
    # The apache/hive image doesn't bundle a Postgres JDBC driver;
    # mount one into the container, e.g.:
    # volumes:
    #   - ./postgresql.jar:/opt/hive/lib/postgres.jar
    ports:
      - "9083:9083"  # Thrift API port
volumes:
  metastore-data:
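Once it's up, any engine can point at thrift://localhost:9083. A minimal PySpark sketch (the URI matches the port mapping above):
# Connect Spark to the standalone Hive Metastore
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("hive.metastore.uris", "thrift://localhost:9083")
         .enableHiveSupport()  # use the Hive catalogue, not Spark's in-memory default
         .getOrCreate())

spark.sql("SHOW DATABASES").show()  # databases now come from the shared metastore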
Hive Metastore: Pros & Cons
- Pros: open source, battle-tested, and universally supported; virtually every query engine speaks its Thrift protocol, and it runs anywhere.
- Cons: you operate it yourself (a service plus its backend database), there's no built-in access control or data lineage, and the backing database becomes a bottleneck once tables accumulate many thousands of partitions.
AWS Glue Data Catalog — Managed Hive Metastore
AWS Glue Data Catalog is essentially a managed Hive Metastore hosted by AWS. You don't run any servers, don't manage a database, don't worry about backups. AWS handles everything. And it's deeply integrated with the AWS ecosystem — Athena, Redshift Spectrum, EMR, Lake Formation, and Glue ETL all share the same catalogue.
# AWS Glue: Creating a table via AWS CLI
aws glue create-table \
--database-name sales \
--table-input '{
"Name": "revenue",
"StorageDescriptor": {
"Columns": [
{"Name": "transaction_id", "Type": "bigint"},
{"Name": "customer_id", "Type": "bigint"},
{"Name": "amount", "Type": "decimal(10,2)"},
{"Name": "currency", "Type": "string"},
{"Name": "created_at", "Type": "timestamp"}
],
"Location": "s3://data-lake-prod/sales/revenue/",
"InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
"OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
}
},
"PartitionKeys": [
{"Name": "year", "Type": "int"},
{"Name": "quarter", "Type": "int"}
],
"TableType": "EXTERNAL_TABLE"
}'
# Now query via Athena (SQL over S3 — serverless):
# SELECT SUM(amount) FROM sales.revenue WHERE year=2025;
# No servers to manage. Pay per query.
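The same query can be driven from code. A boto3 sketch, where the region and results bucket are placeholders:
# Kick off the Athena query programmatically
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is illustrative
resp = athena.start_query_execution(
    QueryString="SELECT SUM(amount) FROM revenue WHERE year=2025",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)
print(resp["QueryExecutionId"])  # poll get_query_execution with this ID for results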
# Glue Crawlers: Auto-discover schemas
# Instead of manually defining tables, let Glue scan your S3 bucket:
aws glue create-crawler \
--name sales-crawler \
--role GlueServiceRole \
--database-name sales \
--targets '{
"S3Targets": [
{"Path": "s3://data-lake-prod/sales/"}
]
}'
# Run the crawler
aws glue start-crawler --name sales-crawler
# Glue scans the Parquet files, infers the schema, and creates
# tables + partitions in the Data Catalog automatically!
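You can check what the crawler registered by reading the catalogue back, a short boto3 sketch using the names from the examples above:
# Read table metadata back from the Glue Data Catalog
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="sales", Name="revenue")["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
print("Location:", table["StorageDescriptor"]["Location"])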
# Terraform for Glue Catalog
resource "aws_glue_catalog_database" "sales" {
  name = "sales"
}

resource "aws_glue_catalog_table" "revenue" {
  name          = "revenue"
  database_name = aws_glue_catalog_database.sales.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "parquet"
  }

  storage_descriptor {
    location      = "s3://data-lake-prod/sales/revenue/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns {
      name = "transaction_id"
      type = "bigint"
    }
    columns {
      name = "amount"
      type = "decimal(10,2)"
    }
  }

  partition_keys {
    name = "year"
    type = "int"
  }
}
AWS Lake Formation — Governance on Top of Glue
Glue Data Catalog tells you what data exists. Lake Formation adds who can access it. It provides fine-grained access control: table-level, column-level, even row-level filtering. This is how enterprises manage data lakes at scale.
# Lake Formation: Grant access to a table
aws lakeformation grant-permissions \
--principal '{"DataLakePrincipalIdentifier": "arn:aws:iam::123456789:role/analyst-role"}' \
--resource '{"Table": {"DatabaseName": "sales", "Name": "revenue"}}' \
--permissions '["SELECT"]' \
--permissions-with-grant-option '[]'
# Column-level access:
aws lakeformation grant-permissions \
--principal '{"DataLakePrincipalIdentifier": "arn:aws:iam::123456789:role/analyst-role"}' \
--resource '{"TableWithColumns": {"DatabaseName": "sales", "Name": "customers", "ColumnNames": ["name", "city"]}}' \
--permissions '["SELECT"]'
# Analysts can see name and city, but NOT email, phone, or SSN
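And to audit what a principal actually holds, list its permissions, a minimal boto3 sketch reusing the role ARN from the CLI examples:
# List the Lake Formation permissions granted to the analyst role
import boto3

lf = boto3.client("lakeformation")
resp = lf.list_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789:role/analyst-role"}
)
for entry in resp["PrincipalResourcePermissions"]:
    print(entry["Resource"], entry["Permissions"])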
Databricks Unity Catalog — The Governance-First Approach
Databricks Unity Catalog takes a fundamentally different approach. Instead of bolting governance onto an existing catalogue (like Lake Formation does with Glue), Unity Catalog was built from the ground up for governance. It provides a three-level namespace, centralised access control, data lineage, and cross-workspace sharing.
-- Unity Catalog: Creating objects
-- Three-level namespace: catalog.schema.table
-- Create a catalog (top-level container)
CREATE CATALOG production;
-- Create a schema within the catalog
CREATE SCHEMA production.sales;
-- Create a managed table (Unity Catalog manages the storage)
CREATE TABLE production.sales.revenue (
transaction_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
currency STRING,
created_at TIMESTAMP
)
USING DELTA; -- Delta Lake format (ACID transactions!)
-- Or create an external table (you manage storage)
CREATE TABLE production.sales.legacy_revenue
USING PARQUET
LOCATION 's3://data-lake-prod/sales/revenue/';
-- Grant access
GRANT SELECT ON TABLE production.sales.revenue TO analyst_group;
GRANT USE SCHEMA ON SCHEMA production.sales TO analyst_group;
GRANT USE CATALOG ON CATALOG production TO analyst_group;
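-- The mask function referenced below must exist first. A minimal sketch
-- of one (the regex and the 'admins' group check are illustrative):
CREATE FUNCTION mask_email(email STRING) RETURNS STRING
RETURN CASE WHEN is_account_group_member('admins') THEN email
            ELSE regexp_replace(email, '^(.)[^@]*(@.*)$', '$1***$2') END;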
-- Column-level masking
ALTER TABLE production.sales.customers
ALTER COLUMN email SET MASK mask_email;
-- Analysts see: j***@example.com instead of john@example.com
Apache Polaris (Iceberg REST Catalog) — The Open Future
Apache Polaris (originally Snowflake's Polaris Catalog, donated to Apache in 2024) is the newest entrant. It's built specifically for Apache Iceberg tables and provides a vendor-neutral REST API. If you believe the future is open table formats (Iceberg), Polaris is worth watching closely.
# Polaris uses a REST API instead of Thrift (like Hive)
# This makes it language-agnostic and cloud-agnostic
# Create a catalog via REST API
curl -X POST https://polaris.example.com/api/management/v1/catalogs \
-H "Content-Type: application/json" \
-d '{
"catalog": {
"name": "production",
"type": "INTERNAL",
"properties": {
"default-base-location": "s3://data-lake/production/"
},
"storageConfigInfo": {
"storageType": "S3",
"allowedLocations": ["s3://data-lake/production/"]
}
}
}'
# Spark connects to Polaris using the Iceberg REST catalog.
# Catalog settings must be in place before the session starts
# (spark.conf.set at runtime won't register a new catalog), so
# configure them on the builder, with the Iceberg runtime jar on the classpath:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://polaris.example.com/api/catalog")
    .config("spark.sql.catalog.polaris.warehouse", "production")
    .getOrCreate())

# Then query normally:
spark.sql("SELECT * FROM polaris.sales.revenue WHERE year = 2025")
The Big Comparison
| Feature | Hive Metastore | AWS Glue | Unity Catalog | Polaris |
|---|---|---|---|---|
| Type | Open source | Managed (AWS) | Managed (Databricks) | Open source |
| Protocol | Thrift | Hive-compatible + AWS API | REST + Hive-compatible | REST (Iceberg spec) |
| Table formats | Hive tables | Hive + Iceberg | Delta Lake + Iceberg | Iceberg (native) |
| Access control | None built-in | Lake Formation | Built-in (column-level) | Pluggable |
| Data lineage | No | No | Yes (automatic) | No |
| Multi-engine | Everything speaks Hive | AWS services only | Databricks + Iceberg clients | Any Iceberg client |
| Cloud | Any / self-hosted | AWS only | AWS + Azure + GCP | Any / self-hosted |
| Ops overhead | High (self-manage) | Zero (managed) | Zero (managed) | Medium (self-manage) |
| Cost | Free (infra cost only) | Per-request pricing | Included with Databricks | Free (infra cost only) |
When to Use What
- Hive Metastore: you self-host, need multi-cloud portability, and want maximum engine compatibility.
- AWS Glue Data Catalog: you're all-in on AWS and want zero ops overhead; add Lake Formation when you need fine-grained access control.
- Unity Catalog: you're on Databricks and want governance (access control, lineage, sharing) built in.
- Apache Polaris: you're betting on Apache Iceberg and want an open, vendor-neutral REST catalogue.
The Modern Lakehouse Architecture
The modern data stack has converged on a pattern called the lakehouse. Instead of separate data lakes and data warehouses, you have one storage layer (S3/ADLS/GCS) with a table format (Delta Lake, Iceberg, Hudi) that provides warehouse-like features (ACID transactions, schema enforcement, time travel) on top of a data lake. The metastore is the central nervous system of this architecture.
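Time travel in one line makes the point concrete. A minimal PySpark sketch with Delta Lake, assuming a SparkSession as before; the table path is illustrative and the Delta runtime must be on the cluster:
# Read the current table, then the same table as of an earlier version
current = spark.read.format("delta").load("s3://data-lake-prod/sales/revenue/")
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)  # or .option("timestampAsOf", "2025-09-30")
      .load("s3://data-lake-prod/sales/revenue/"))
print(current.count() - v0.count())  # rows added since the first version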
Practical Tips from Production
- Partition wisely: Over-partitioning (thousands of tiny partitions, each holding one small file) kills query performance. Aim for partitions with 100 MB+ of data each. Common strategy: partition by date (year/month/day), not by a high-cardinality key like user_id.
- Use Iceberg or Delta Lake, not raw Parquet: Raw Parquet tables in Hive have no ACID, no schema evolution, no time travel. Iceberg and Delta Lake solve all of this. Migration is a one-time cost with massive long-term benefits.
- Automate schema discovery: Use Glue Crawlers, or write Spark jobs that register new partitions automatically (see the sketch after this list). Don't manually run ALTER TABLE ADD PARTITION for every new data drop.
- Name things properly from day one: production.sales.revenue is much better than db1.tbl_rev_v2_final_FINAL. You'll thank yourself in 6 months.
- Start with governance: Don't wait until you have 500 tables to add access control. Set it up from day one. It's 10x harder to retrofit than to build in from the start.
- Monitor metastore health: Slow metastore = slow queries everywhere. Monitor Thrift API latency, backend DB connections, and catalogue API rate limits.
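Here's the partition-registration job promised above, a minimal sketch assuming a Hive-compatible metastore and the date-partitioned layout from the earlier examples:
# Register partitions that exist in S3 but are missing from the metastore
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-partitions")
         .enableHiveSupport()  # talk to the configured Hive-compatible metastore
         .getOrCreate())

# Scan the table's location and add any unregistered partitions
spark.sql("MSCK REPAIR TABLE sales.revenue")

# Or register a single partition explicitly after a data drop:
spark.sql("""
    ALTER TABLE sales.revenue
    ADD IF NOT EXISTS PARTITION (year=2025, quarter=3)
    LOCATION 's3://data-lake-prod/sales/revenue/year=2025/quarter=3/'
""")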
Back to our story from the beginning. That analyst who couldn't find Q3 revenue data? With a metastore, the answer is one SQL query away: SELECT * FROM production.sales.revenue WHERE quarter=3. The metastore knows where the files are, what the columns are, and who's allowed to see them. No more hunting through S3 buckets. No more guessing column types. No more data swamp. Just a clean, governed, queryable lake.