Greenplum

Greenplum is a massively parallel processing (MPP) database built on PostgreSQL, designed for large-scale analytics and data warehouse workloads.

Overview

  • Versions: 7.1.0, 7.0.0, 6.30.1
  • Cluster Support: ✅ Yes (MPP architecture)
  • Use Cases: MPP data warehouse, large-scale analytics, business intelligence
  • Features: Distributed queries, petabyte-scale, PostgreSQL-compatible

Key Features

  • Massively Parallel Processing (MPP): Distribute queries across multiple nodes for fast analytics
  • PostgreSQL Compatible: Leverage PostgreSQL tools, drivers, and SQL syntax
  • Petabyte Scale: Handle massive datasets efficiently
  • Columnar Storage: Optimize for analytical queries with column-oriented storage
  • Advanced Analytics: Built-in support for machine learning and statistical functions
  • Distributed Architecture: Coordinator and segment nodes for parallel processing
  • Data Distribution: Hash, random, or replicated table distribution
  • Partitioning: Time-based and list partitioning for large tables
  • External Tables: Query data in S3, HDFS, and other external sources
  • Parallel Data Loading: High-speed data ingestion with gpfdist

Deployment Modes

Greenplum is designed for cluster deployments:

Standalone (Development)

  • Single coordinator and single segment
  • Suitable for development and testing
  • Limited scalability
  • Lower resource requirements

Cluster (Production MPP)

  • Recommended for production workloads
  • Multiple segment nodes for parallel processing
  • 3-10 segment nodes configurable
  • Coordinator node for query planning
  • High availability with mirror segments
  • Horizontal scaling capability

Resource Tiers

Tier     CPU   Memory   Disk     Segments   Best For
Small    0.5   1GB      10GB     1          Development, testing
Medium   1     2GB      25GB     2          Small analytics workloads
Large    2     4GB      50GB     4          Production analytics
XLarge   4     8GB      100GB    8          Large-scale data warehouse

Creating a Greenplum Add-on

  1. Navigate to Add-ons → Create Add-on
  2. Select Greenplum as the type
  3. Choose a version (7.1.0, 7.0.0, or 6.30.1)
  4. Select deployment mode:
    • Standalone: For development/testing
    • Cluster: For production (coordinator + segment nodes)
  5. Configure:
    • Label: Descriptive name (e.g., "Analytics Warehouse")
    • Description: Purpose and notes
    • Environment: Development or Production
    • Resource Tier: Based on your workload requirements
    • Segment Count: Number of segment nodes (cluster mode)
  6. Configure backups:
    • Schedule: Daily recommended for production
    • Retention: 7+ days for production
  7. Click Create Add-on

Connection Information

After deployment, connection details are available in the add-on details page and automatically injected into your apps via STRONGLY_SERVICES.

Connection String Format

postgresql://username:password@coordinator-host:5432/database

Accessing Connection Details

import os
import json
import psycopg2

# Parse STRONGLY_SERVICES
services = json.loads(os.environ.get('STRONGLY_SERVICES', '{}'))

# Get Greenplum add-on connection
gp_addon = services['addons']['addon-id']

# Connect using connection string (Greenplum is PostgreSQL-compatible)
conn = psycopg2.connect(gp_addon['connectionString'])
cursor = conn.cursor()

# Or connect using individual parameters
conn = psycopg2.connect(
    host=gp_addon['host'],
    port=gp_addon['port'],
    database=gp_addon['database'],
    user=gp_addon['username'],
    password=gp_addon['password']
)
cursor = conn.cursor()

# Execute query
cursor.execute("SELECT version()")
version = cursor.fetchone()
print(f"Greenplum version: {version[0]}")

cursor.close()
conn.close()

Core Concepts

Distributed Architecture

Greenplum uses a coordinator-segment architecture (see the inspection query after this list):

  • Coordinator (Master): Query planning, client connections, metadata
  • Segments: Store data and execute queries in parallel
  • Interconnect: High-speed network for data exchange between segments
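
You can inspect this layout from any SQL client by querying the gp_segment_configuration catalog; content = -1 identifies the coordinator, and role 'p'/'m' marks primary and mirror segments:

-- List the coordinator and all segments with their role and status
SELECT content, role, preferred_role, status, hostname, port
FROM gp_segment_configuration
ORDER BY content, role;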

Data Distribution

Tables can be distributed across segments using different strategies:

-- Hash distribution (default, best for most cases)
CREATE TABLE sales (
    id SERIAL,
    customer_id INT,
    amount DECIMAL(10,2),
    sale_date DATE
) DISTRIBUTED BY (customer_id);

-- Random distribution (good for staging tables or tables without a natural hash key)
CREATE TABLE countries (
    code CHAR(2),
    name VARCHAR(100)
) DISTRIBUTED RANDOMLY;

-- Replicated distribution (full copy on every segment, ideal for small lookup tables)
CREATE TABLE small_lookup (
    id INT,
    value VARCHAR(100)
) DISTRIBUTED REPLICATED;
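
To confirm how an existing table is distributed, you can query the gp_distribution_policy catalog (column names below follow the Greenplum 6/7 catalog; adjust for other versions):

-- Check the distribution policy of a table
SELECT c.relname, p.policytype, p.distkey
FROM gp_distribution_policy p
JOIN pg_class c ON c.oid = p.localoid
WHERE c.relname = 'sales';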

Partitioning

Partition large tables for better query performance:

-- Range partitioning by date
CREATE TABLE sales (
    id SERIAL,
    customer_id INT,
    amount DECIMAL(10,2),
    sale_date DATE
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (sale_date)
(
    PARTITION sales_2023 START ('2023-01-01') END ('2024-01-01'),
    PARTITION sales_2024 START ('2024-01-01') END ('2025-01-01'),
    DEFAULT PARTITION sales_other
);

-- Add a new partition (only possible when the table has no DEFAULT partition)
ALTER TABLE sales ADD PARTITION sales_2025
    START ('2025-01-01') END ('2026-01-01');

-- With a DEFAULT partition (as above), split it instead
ALTER TABLE sales SPLIT DEFAULT PARTITION
    START ('2025-01-01') END ('2026-01-01')
    INTO (PARTITION sales_2025, DEFAULT PARTITION);

-- Drop an old partition
ALTER TABLE sales DROP PARTITION sales_2023;
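
A key benefit is partition elimination: when a query filters on the partition key, only the matching partitions are scanned. A quick way to confirm this is to EXPLAIN a date-filtered query and check that the plan touches only the expected partitions:

-- Only the 2024 partitions should be scanned for this predicate
EXPLAIN
SELECT SUM(amount)
FROM sales
WHERE sale_date >= '2024-01-01' AND sale_date < '2024-07-01';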

Common Operations

Creating Tables

-- Fact table with hash distribution and partitioning
CREATE TABLE fact_orders (
    order_id BIGINT,
    customer_id INT,
    product_id INT,
    quantity INT,
    amount DECIMAL(10,2),
    order_date DATE,
    region VARCHAR(50)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
(
    START ('2024-01-01') END ('2025-01-01')
    EVERY (INTERVAL '1 month')
);

-- Dimension table with replication
CREATE TABLE dim_products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(200),
    category VARCHAR(100),
    price DECIMAL(10,2)
) DISTRIBUTED REPLICATED;

-- Columnar storage for analytics
CREATE TABLE analytics_events (
    event_id BIGINT,
    user_id INT,
    event_type VARCHAR(50),
    event_data JSON,
    timestamp TIMESTAMP
)
WITH (appendonly=true, orientation=column, compresstype=zstd)
DISTRIBUTED BY (user_id)
PARTITION BY RANGE (timestamp)
(
    START ('2024-01-01') END ('2025-01-01')
    EVERY (INTERVAL '1 day')
);

Loading Data

COPY Command:

-- Load from CSV file
COPY sales FROM '/path/to/sales.csv' CSV HEADER;

-- Load with specific columns
COPY sales (customer_id, amount, sale_date)
FROM '/path/to/data.csv' CSV;

gpfdist (High-speed parallel loading):

-- Create external table pointing to gpfdist
CREATE EXTERNAL TABLE ext_sales (
    customer_id INT,
    amount DECIMAL(10,2),
    sale_date DATE
)
LOCATION ('gpfdist://filehost:8080/sales*.csv')
FORMAT 'CSV' (HEADER);

-- Load data in parallel
INSERT INTO sales SELECT * FROM ext_sales;

External Tables (S3, HDFS):

-- S3 external table (requires PXF)
CREATE EXTERNAL TABLE s3_data (
    id INT,
    name VARCHAR(100),
    value DECIMAL(10,2)
)
LOCATION ('pxf://my-bucket/data/*.csv?PROFILE=s3:text')
FORMAT 'CSV';

-- Query external data
SELECT * FROM s3_data WHERE value > 100;

Analytical Queries

-- Aggregation across segments
SELECT
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM fact_orders
WHERE order_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;

-- Window functions
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM fact_orders
ORDER BY customer_id, order_date;

-- Join optimization
SELECT
    o.order_id,
    p.product_name,
    o.quantity,
    o.amount
FROM fact_orders o
INNER JOIN dim_products p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01';

Vacuum and Analyze

-- Analyze table for query optimization
ANALYZE fact_orders;

-- Vacuum to reclaim space
VACUUM fact_orders;

-- Full vacuum and analyze
VACUUM FULL ANALYZE fact_orders;
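
To decide which tables actually need maintenance, the gp_toolkit schema provides diagnostic views; for example, gp_toolkit.gp_bloat_diag flags heap tables whose on-disk size is well above the expected size (a sketch, assuming the standard gp_toolkit installation):

-- Find bloated heap tables that are candidates for VACUUM
SELECT bdinspname AS schema_name, bdirelname AS table_name, bdidiag AS diagnosis
FROM gp_toolkit.gp_bloat_diag
ORDER BY bdirelpages DESC;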

Advanced Analytics

Window Functions

-- Ranking
SELECT
    product_id,
    product_name,
    revenue,
    RANK() OVER (ORDER BY revenue DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank,
    ROW_NUMBER() OVER (ORDER BY revenue DESC) AS row_num
FROM product_revenue;

-- Moving average
SELECT
    sale_date,
    amount,
    AVG(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7day
FROM daily_sales;

Common Table Expressions (CTEs)

-- Recursive CTE
WITH RECURSIVE subordinates AS (
    SELECT employee_id, name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    SELECT e.employee_id, e.name, e.manager_id, s.level + 1
    FROM employees e
    INNER JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates ORDER BY level, name;

-- Multiple CTEs
WITH
monthly_sales AS (
    SELECT
        DATE_TRUNC('month', sale_date) AS month,
        SUM(amount) AS total
    FROM sales
    GROUP BY 1
),
growth AS (
    SELECT
        month,
        total,
        LAG(total) OVER (ORDER BY month) AS prev_month,
        (total - LAG(total) OVER (ORDER BY month)) / LAG(total) OVER (ORDER BY month) * 100 AS growth_pct
    FROM monthly_sales
)
SELECT * FROM growth WHERE growth_pct > 10;

MADlib (Machine Learning)

Greenplum includes MADlib for in-database machine learning:

-- Linear regression
-- Linear regression: train a model
SELECT madlib.linregr_train(
    'sales',                      -- source table
    'sales_model',                -- output model table
    'amount',                     -- dependent variable
    'ARRAY[1, quantity, price]'   -- independent variables
);

-- Predict using the trained coefficients
SELECT madlib.linregr_predict(
    m.coef,
    ARRAY[1, 10, 50.00]
) AS predicted_amount
FROM sales_model m;

-- K-means clustering (returns centroids for 5 clusters)
SELECT * FROM madlib.kmeans_random(
    'customers',   -- source table
    'features',    -- column containing the feature array
    5              -- number of clusters
);

Performance Optimization

Distribution Keys

Choose distribution keys carefully:

-- Good: Distribute by join key
CREATE TABLE orders (
    order_id BIGINT,
    customer_id INT,
    ...
) DISTRIBUTED BY (customer_id);

CREATE TABLE customers (
    customer_id INT,
    ...
) DISTRIBUTED BY (customer_id);

-- Now joins are co-located (no data movement)
SELECT * FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
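
To verify that a join is co-located, check its plan: a co-located join shows no Redistribute Motion or Broadcast Motion step for the joined tables. A quick sanity check:

-- The plan for this join should contain no Redistribute or Broadcast Motion,
-- because both tables share the same distribution key
EXPLAIN
SELECT o.order_id, c.customer_id
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;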

Columnar Storage

Use columnar storage for analytical workloads:

-- Append-optimized columnar table
CREATE TABLE analytics_data (
    date DATE,
    dimension1 VARCHAR(100),
    dimension2 VARCHAR(100),
    metric1 DECIMAL(10,2),
    metric2 DECIMAL(10,2)
)
WITH (
    appendonly=true,
    orientation=column,
    compresstype=zstd,
    compresslevel=5
)
DISTRIBUTED BY (date);

Indexes

-- B-tree index (use sparingly in Greenplum)
CREATE INDEX idx_orders_date ON orders(order_date);

-- Bitmap index (better for low-cardinality columns)
CREATE INDEX idx_orders_status ON orders USING bitmap(status);

-- Partial index
CREATE INDEX idx_active_orders ON orders(order_date)
WHERE status = 'active';

Query Optimization

-- Use EXPLAIN to analyze query plan
EXPLAIN SELECT * FROM orders WHERE order_date >= '2024-01-01';

-- EXPLAIN ANALYZE to see actual execution
EXPLAIN ANALYZE
SELECT
    customer_id,
    SUM(amount) AS total
FROM orders
GROUP BY customer_id;

-- Check data skew
SELECT gp_segment_id, COUNT(*)
FROM orders
GROUP BY gp_segment_id;

Backup & Restore

Greenplum add-ons use pg_dump for backups, with MPP-aware backup capabilities.

Backup Configuration

  • Tool: pg_dump (MPP-aware)
  • Format: .sql
  • Type: Logical backup with parallel dump
  • Storage: AWS S3 (s3://strongly-backups/backups/<addon-id>/)

Manual Backup

  1. Go to add-on details page
  2. Click Backup Now
  3. Monitor progress in job logs
  4. Backup saved as backup-YYYYMMDDHHMMSS.sql

Scheduled Backups

Configure during add-on creation or in settings:

  • Daily backups: Recommended for production
  • Retention: 7-14 days minimum for production
  • Custom cron: For specific schedules

Restore Process

  1. Navigate to Backups tab
  2. Select backup from list
  3. Click Restore
  4. Confirm (add-on will stop temporarily)
  5. Data restored across all segments
  6. Add-on automatically restarts

Data Loss

Restoring from backup replaces ALL current data. Create a current backup first if needed.

Monitoring

Monitor your Greenplum add-on through the Strongly platform:

  • CPU Usage: Track CPU utilization across segments
  • Memory Usage: Monitor memory consumption
  • Disk Space: Watch disk utilization per segment
  • Query Performance: Slow query detection
  • Data Skew: Check for uneven data distribution
  • Segment Health: Status of coordinator and segment nodes

System Views

-- Check segment configuration
SELECT * FROM gp_segment_configuration;

-- Check database size
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Check table sizes
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema', 'gp_toolkit')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

-- Check data skew
SELECT gp_segment_id, COUNT(*)
FROM your_table
GROUP BY gp_segment_id
ORDER BY gp_segment_id;

-- Active queries
SELECT
    pid,
    usename,
    datname,
    query_start,
    now() - query_start AS duration,
    query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;

Best Practices

  1. Choose Distribution Keys Wisely: Use join columns and high-cardinality columns
  2. Use Partitioning: Partition large tables by date or other logical divisions
  3. Columnar for Analytics: Use columnar storage for analytical tables
  4. Replicate Small Tables: Replicate dimension tables to avoid data movement
  5. Regular ANALYZE: Keep statistics up-to-date for query optimization
  6. Monitor Data Skew: Ensure even data distribution across segments (see the query sketch after this list)
  7. Batch Operations: Load data in bulk using gpfdist or COPY
  8. Vacuum Regularly: Reclaim space and maintain performance
  9. Use External Tables: Query data in S3/HDFS without loading
  10. Optimize Queries: Use EXPLAIN to understand query plans
  11. Backup Production: Enable daily backups for production databases
  12. Test Restores: Verify backup restoration procedures
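
As a concrete example of practice 6, gp_toolkit exposes a per-table skew coefficient; larger values indicate a more uneven spread of rows across segments (assumes a standard gp_toolkit installation):

-- Tables with the most uneven row distribution across segments
SELECT skcnamespace AS schema_name, skcrelname AS table_name, skccoeff AS skew_coefficient
FROM gp_toolkit.gp_skew_coefficients
ORDER BY skccoeff DESC
LIMIT 10;

Note that this view counts rows per segment for each table, so it can take a while on large databases.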

Migration from PostgreSQL

Greenplum is PostgreSQL-compatible, but consider these differences:

-- PostgreSQL: Single server
-- Greenplum: Distributed, specify distribution

-- PostgreSQL
CREATE TABLE users (id INT, name VARCHAR(100));

-- Greenplum (specify distribution)
CREATE TABLE users (id INT, name VARCHAR(100))
DISTRIBUTED BY (id);

-- PostgreSQL: Any indexes work well
-- Greenplum: Prefer bitmap indexes for low-cardinality

-- PostgreSQL
CREATE INDEX idx_status ON orders(status);

-- Greenplum
CREATE INDEX idx_status ON orders USING bitmap(status);

-- Update/Delete operations
-- PostgreSQL: Row-level operations efficient
-- Greenplum: Better to use INSERT SELECT and truncate/reload
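
In practice, the comments above translate into a rewrite-and-reload pattern: build the desired result with INSERT ... SELECT into a staging table, then truncate and reload the original. A sketch (the status column and filter are illustrative):

-- Bulk "delete" without row-level DELETEs: keep only the rows you want
CREATE TABLE orders_staging (LIKE orders) DISTRIBUTED BY (customer_id);

INSERT INTO orders_staging
SELECT * FROM orders WHERE status <> 'cancelled';

TRUNCATE orders;
INSERT INTO orders SELECT * FROM orders_staging;
DROP TABLE orders_staging;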

Troubleshooting

Connection Issues

# Test connection
psql "postgresql://username:password@host:5432/database"

# Check the configured connection limit
psql "postgresql://username:password@host:5432/database" -c "SHOW max_connections;"

Performance Issues

-- Find slow queries
SELECT
    pid,
    usename,
    now() - query_start AS duration,
    query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;

-- Check for data skew
SELECT gp_segment_id, COUNT(*) as row_count
FROM large_table
GROUP BY gp_segment_id
ORDER BY row_count DESC;

-- Analyze table statistics
ANALYZE VERBOSE your_table;

Segment Issues

-- Check segment status
SELECT * FROM gp_segment_configuration
WHERE status <> 'u' OR mode <> 's';

-- Check for mirroring issues
SELECT * FROM gp_segment_configuration
WHERE preferred_role <> role;

Use Cases

Data Warehouse

-- Star schema with fact and dimension tables
CREATE TABLE fact_sales (
    sale_id BIGINT,
    date_id INT,
    product_id INT,
    customer_id INT,
    store_id INT,
    quantity INT,
    amount DECIMAL(10,2)
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (date_id)
(START (20240101) END (20250101) EVERY (100));

CREATE TABLE dim_date (
    date_id INT PRIMARY KEY,
    date DATE,
    year INT,
    quarter INT,
    month INT,
    day INT,
    day_of_week INT
) DISTRIBUTED REPLICATED;

-- Complex analytical query
SELECT
    d.year,
    d.quarter,
    SUM(f.amount) AS revenue,
    COUNT(DISTINCT f.customer_id) AS unique_customers
FROM fact_sales f
INNER JOIN dim_date d ON f.date_id = d.date_id
GROUP BY d.year, d.quarter
ORDER BY d.year, d.quarter;

Real-time Analytics

-- Append-optimized table for continuous ingestion
CREATE TABLE event_stream (
    event_id BIGINT,
    user_id INT,
    event_type VARCHAR(50),
    timestamp TIMESTAMP,
    data JSON
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (user_id)
PARTITION BY RANGE (timestamp)
(
    START ('2024-01-01') END ('2025-01-01')
    EVERY (INTERVAL '1 day')
);

-- Real-time aggregation
SELECT
    event_type,
    COUNT(*) AS event_count,
    COUNT(DISTINCT user_id) AS unique_users
FROM event_stream
WHERE timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY event_type;

Support

For issues or questions:

  • Check add-on logs in the Strongly dashboard
  • Review Greenplum official documentation
  • Contact Strongly support through the platform