Greenplum
Greenplum is a massively parallel processing (MPP) database built on PostgreSQL, designed for large-scale analytics and data warehouse workloads.
Overview
- Versions: 7.1.0, 7.0.0, 6.30.1
- Cluster Support: ✅ Yes (MPP architecture)
- Use Cases: MPP data warehouse, large-scale analytics, business intelligence
- Features: Distributed queries, petabyte-scale, PostgreSQL-compatible
Key Features
- Massively Parallel Processing (MPP): Distribute queries across multiple nodes for fast analytics
- PostgreSQL Compatible: Leverage PostgreSQL tools, drivers, and SQL syntax
- Petabyte Scale: Handle massive datasets efficiently
- Columnar Storage: Optimize for analytical queries with column-oriented storage
- Advanced Analytics: Built-in support for machine learning and statistical functions
- Distributed Architecture: Coordinator and segment nodes for parallel processing
- Data Distribution: Hash, random, or replicated table distribution
- Partitioning: Time-based and list partitioning for large tables
- External Tables: Query data in S3, HDFS, and other external sources
- Parallel Data Loading: High-speed data ingestion with gpfdist
Deployment Modes
Greenplum is designed for cluster deployments:
Standalone (Development)
- Single coordinator and single segment
- Suitable for development and testing
- Limited scalability
- Lower resource requirements
Cluster (Production MPP)
- Recommended for production workloads
- Multiple segment nodes for parallel processing
- 3-10 segment nodes configurable
- Coordinator node for query planning
- High availability with mirror segments
- Horizontal scaling capability
Resource Tiers
| Tier | CPU | Memory | Disk | Segments | Best For |
|---|---|---|---|---|---|
| Small | 0.5 | 1GB | 10GB | 1 | Development, testing |
| Medium | 1 | 2GB | 25GB | 2 | Small analytics workloads |
| Large | 2 | 4GB | 50GB | 4 | Production analytics |
| XLarge | 4 | 8GB | 100GB | 8 | Large-scale data warehouse |
Creating a Greenplum Add-on
- Navigate to Add-ons → Create Add-on
- Select Greenplum as the type
- Choose a version (7.1.0, 7.0.0, or 6.30.1)
- Select deployment mode:
- Standalone: For development/testing
- Cluster: For production (coordinator + segment nodes)
- Configure:
- Label: Descriptive name (e.g., "Analytics Warehouse")
- Description: Purpose and notes
- Environment: Development or Production
- Resource Tier: Based on your workload requirements
- Segment Count: Number of segment nodes (cluster mode)
- Configure backups:
- Schedule: Daily recommended for production
- Retention: 7+ days for production
- Click Create Add-on
Connection Information
After deployment, connection details are available on the add-on details page and are automatically injected into your apps via the STRONGLY_SERVICES environment variable.
Connection String Format
postgresql://username:password@coordinator-host:5432/database
Accessing Connection Details
Python:
import os
import json
import psycopg2
# Parse STRONGLY_SERVICES
services = json.loads(os.environ.get('STRONGLY_SERVICES', '{}'))
# Get Greenplum add-on connection
gp_addon = services['addons']['addon-id']
# Connect using connection string (Greenplum is PostgreSQL-compatible)
conn = psycopg2.connect(gp_addon['connectionString'])
cursor = conn.cursor()
# Or connect using individual parameters instead:
# conn = psycopg2.connect(
#     host=gp_addon['host'],
#     port=gp_addon['port'],
#     database=gp_addon['database'],
#     user=gp_addon['username'],
#     password=gp_addon['password']
# )
# Execute query
cursor.execute("SELECT version()")
version = cursor.fetchone()
print(f"Greenplum version: {version[0]}")
cursor.close()
conn.close()
Node.js:
const { Pool } = require('pg');
// Parse STRONGLY_SERVICES
const services = JSON.parse(process.env.STRONGLY_SERVICES || '{}');
const gpAddon = services.addons['addon-id'];
// Connect using connection string
const pool = new Pool({
  connectionString: gpAddon.connectionString
});
// Or connect using individual parameters instead:
// const pool = new Pool({
//   host: gpAddon.host,
//   port: gpAddon.port,
//   database: gpAddon.database,
//   user: gpAddon.username,
//   password: gpAddon.password
// });
// Execute query
async function main() {
  const result = await pool.query('SELECT version()');
  console.log('Greenplum version:', result.rows[0].version);
  await pool.end();
}
main().catch(console.error);
Go:
package main
import (
"database/sql"
"encoding/json"
"fmt"
"os"
_ "github.com/lib/pq"
)
type Services struct {
Addons map[string]Addon `json:"addons"`
}
type Addon struct {
ConnectionString string `json:"connectionString"`
Host string `json:"host"`
Port int `json:"port"`
Database string `json:"database"`
Username string `json:"username"`
Password string `json:"password"`
}
func main() {
var services Services
json.Unmarshal([]byte(os.Getenv("STRONGLY_SERVICES")), &services)
gpAddon := services.Addons["addon-id"]
// Connect using connection string
db, err := sql.Open("postgres", gpAddon.ConnectionString)
if err != nil {
panic(err)
}
defer db.Close()
// Execute query
var version string
err = db.QueryRow("SELECT version()").Scan(&version)
if err != nil {
panic(err)
}
fmt.Println("Greenplum version:", version)
}
Core Concepts
Distributed Architecture
Greenplum uses a coordinator-segment architecture:
- Coordinator (Master): Query planning, client connections, metadata
- Segments: Store data and execute queries in parallel
- Interconnect: High-speed network for data exchange between segments
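You can inspect this layout directly from SQL. The gp_segment_configuration system view (also used in the Monitoring section below) lists the coordinator and every segment instance; content = -1 identifies the coordinator, and role distinguishes primaries (p) from mirrors (m):
-- List coordinator and segment instances
SELECT content, role, preferred_role, hostname, port, status
FROM gp_segment_configuration
ORDER BY content, role;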
Data Distribution
Tables can be distributed across segments using different strategies:
-- Hash distribution (default, best for most cases)
CREATE TABLE sales (
id SERIAL,
customer_id INT,
amount DECIMAL(10,2),
sale_date DATE
) DISTRIBUTED BY (customer_id);
-- Random distribution (use when no column makes a good hash key)
CREATE TABLE countries (
code CHAR(2),
name VARCHAR(100)
) DISTRIBUTED RANDOMLY;
-- Replicated distribution (copy to all segments)
CREATE TABLE small_lookup (
id INT,
value VARCHAR(100)
) DISTRIBUTED REPLICATED;
Partitioning
Partition large tables for better query performance:
-- Range partitioning by date
CREATE TABLE sales (
id SERIAL,
customer_id INT,
amount DECIMAL(10,2),
sale_date DATE
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (sale_date)
(
PARTITION sales_2023 START ('2023-01-01') END ('2024-01-01'),
PARTITION sales_2024 START ('2024-01-01') END ('2025-01-01'),
DEFAULT PARTITION sales_other
);
-- Add new partition (if the table has a DEFAULT partition, use SPLIT DEFAULT PARTITION instead)
ALTER TABLE sales ADD PARTITION sales_2025
START ('2025-01-01') END ('2026-01-01');
-- Drop old partition
ALTER TABLE sales DROP PARTITION sales_2023;
Common Operations
Creating Tables
-- Fact table with hash distribution and partitioning
CREATE TABLE fact_orders (
order_id BIGINT,
customer_id INT,
product_id INT,
quantity INT,
amount DECIMAL(10,2),
order_date DATE,
region VARCHAR(50)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
(
START ('2024-01-01') END ('2025-01-01')
EVERY (INTERVAL '1 month')
);
-- Dimension table with replication
CREATE TABLE dim_products (
product_id INT PRIMARY KEY,
product_name VARCHAR(200),
category VARCHAR(100),
price DECIMAL(10,2)
) DISTRIBUTED REPLICATED;
-- Columnar storage for analytics
CREATE TABLE analytics_events (
event_id BIGINT,
user_id INT,
event_type VARCHAR(50),
event_data JSON,
timestamp TIMESTAMP
)
WITH (appendonly=true, orientation=column, compresstype=zstd)
DISTRIBUTED BY (user_id)
PARTITION BY RANGE (timestamp)
(
START ('2024-01-01') END ('2025-01-01')
EVERY (INTERVAL '1 day')
);
Loading Data
COPY Command:
-- Load from CSV file
COPY sales FROM '/path/to/sales.csv' CSV HEADER;
-- Load with specific columns
COPY sales (customer_id, amount, sale_date)
FROM '/path/to/data.csv' CSV;
gpfdist (High-speed parallel loading):
-- Create external table pointing to gpfdist
CREATE EXTERNAL TABLE ext_sales (
customer_id INT,
amount DECIMAL(10,2),
sale_date DATE
)
LOCATION ('gpfdist://filehost:8080/sales*.csv')
FORMAT 'CSV' (HEADER);
-- Load data in parallel
INSERT INTO sales SELECT * FROM ext_sales;
External Tables (S3, HDFS):
-- S3 external table (requires PXF)
CREATE EXTERNAL TABLE s3_data (
id INT,
name VARCHAR(100),
value DECIMAL(10,2)
)
LOCATION ('pxf://my-bucket/data/*.csv?PROFILE=s3:text')
FORMAT 'CSV';
-- Query external data
SELECT * FROM s3_data WHERE value > 100;
Analytical Queries
-- Aggregation across segments
SELECT
region,
COUNT(*) as order_count,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value
FROM fact_orders
WHERE order_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;
-- Window functions
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as running_total
FROM fact_orders
ORDER BY customer_id, order_date;
-- Join optimization
SELECT
o.order_id,
p.product_name,
o.quantity,
o.amount
FROM fact_orders o
INNER JOIN dim_products p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01';
Vacuum and Analyze
-- Analyze table for query optimization
ANALYZE fact_orders;
-- Vacuum to reclaim space
VACUUM fact_orders;
-- Full vacuum and analyze
VACUUM FULL ANALYZE fact_orders;
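If the gp_toolkit administrative schema is available on your add-on (it ships with standard Greenplum installations), its bloat diagnostics view is a quick way to decide which tables are actually worth vacuuming:
-- Tables that gp_toolkit flags as bloated and worth a VACUUM
SELECT * FROM gp_toolkit.gp_bloat_diag;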
Advanced Analytics
Window Functions
-- Ranking
SELECT
product_id,
product_name,
revenue,
RANK() OVER (ORDER BY revenue DESC) as rank,
DENSE_RANK() OVER (ORDER BY revenue DESC) as dense_rank,
ROW_NUMBER() OVER (ORDER BY revenue DESC) as row_num
FROM product_revenue;
-- Moving average
SELECT
sale_date,
amount,
AVG(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as moving_avg_7day
FROM daily_sales;
Common Table Expressions (CTEs)
-- Recursive CTE
WITH RECURSIVE subordinates AS (
SELECT employee_id, name, manager_id, 1 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.name, e.manager_id, s.level + 1
FROM employees e
INNER JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates ORDER BY level, name;
-- Multiple CTEs
WITH
monthly_sales AS (
SELECT
DATE_TRUNC('month', sale_date) as month,
SUM(amount) as total
FROM sales
GROUP BY 1
),
growth AS (
SELECT
month,
total,
LAG(total) OVER (ORDER BY month) as prev_month,
(total - LAG(total) OVER (ORDER BY month)) / LAG(total) OVER (ORDER BY month) * 100 as growth_pct
FROM monthly_sales
)
SELECT * FROM growth WHERE growth_pct > 10;
MADlib (Machine Learning)
Greenplum includes MADlib for in-database machine learning:
-- Linear regression
SELECT madlib.linregr_train(
'sales', -- source table
'sales_model', -- output model table
'amount', -- dependent variable
'ARRAY[1, quantity, price]' -- independent variables
);
-- Predict using the fitted model's coefficients
SELECT madlib.linregr_predict(
    coef,
    ARRAY[1, 10, 50.00]::float8[]
)
FROM sales_model;
-- K-means clustering (returns the fitted centroids and fit statistics)
SELECT * FROM madlib.kmeans_random(
    'customers',   -- source table
    'features',    -- column containing the feature array
    5              -- number of clusters
);
Performance Optimization
Distribution Keys
Choose distribution keys carefully:
-- Good: Distribute by join key
CREATE TABLE orders (
order_id BIGINT,
customer_id INT,
...
) DISTRIBUTED BY (customer_id);
CREATE TABLE customers (
customer_id INT,
...
) DISTRIBUTED BY (customer_id);
-- Now joins are co-located (no data movement)
SELECT * FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
Columnar Storage
Use columnar storage for analytical workloads:
-- Append-optimized columnar table
CREATE TABLE analytics_data (
date DATE,
dimension1 VARCHAR(100),
dimension2 VARCHAR(100),
metric1 DECIMAL(10,2),
metric2 DECIMAL(10,2)
)
WITH (
appendonly=true,
orientation=column,
compresstype=zstd,
compresslevel=5
)
DISTRIBUTED BY (date);
Indexes
-- B-tree index (use sparingly in Greenplum)
CREATE INDEX idx_orders_date ON orders(order_date);
-- Bitmap index (better for low-cardinality columns)
CREATE INDEX idx_orders_status ON orders USING bitmap(status);
-- Partial index
CREATE INDEX idx_active_orders ON orders(order_date)
WHERE status = 'active';
Query Optimization
-- Use EXPLAIN to analyze query plan
EXPLAIN SELECT * FROM orders WHERE order_date >= '2024-01-01';
-- EXPLAIN ANALYZE to see actual execution
EXPLAIN ANALYZE
SELECT
customer_id,
SUM(amount) as total
FROM orders
GROUP BY customer_id;
-- Check data skew
SELECT gp_segment_id, COUNT(*)
FROM orders
GROUP BY gp_segment_id;
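If the gp_toolkit schema is available, it also provides a shortcut for the per-table skew check above. Note that the view computes skew by scanning the tables, so it can take a while on large databases:
-- Per-table skew summary (larger coefficients indicate more uneven distribution)
SELECT * FROM gp_toolkit.gp_skew_coefficients;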
Backup & Restore
Greenplum add-ons use pg_dump for backups, with MPP-aware backup capabilities.
Backup Configuration
- Tool: pg_dump (MPP-aware)
- Format: .sql
- Type: Logical backup with parallel dump
- Storage: AWS S3 (s3://strongly-backups/backups/<addon-id>/)
Manual Backup
- Go to add-on details page
- Click Backup Now
- Monitor progress in job logs
- Backup saved as backup-YYYYMMDDHHMMSS.sql
Scheduled Backups
Configure during add-on creation or in settings:
- Daily backups: Recommended for production
- Retention: 7-14 days minimum for production
- Custom cron: For specific schedules
Restore Process
- Navigate to Backups tab
- Select backup from list
- Click Restore
- Confirm (add-on will stop temporarily)
- Data restored across all segments
- Add-on automatically restarts
Restoring from backup replaces ALL current data. Create a current backup first if needed.
Monitoring
Monitor your Greenplum add-on through the Strongly platform:
- CPU Usage: Track CPU utilization across segments
- Memory Usage: Monitor memory consumption
- Disk Space: Watch disk utilization per segment
- Query Performance: Slow query detection
- Data Skew: Check for uneven data distribution
- Segment Health: Status of coordinator and segment nodes
System Views
-- Check segment configuration
SELECT * FROM gp_segment_configuration;
-- Check database size
SELECT pg_size_pretty(pg_database_size(current_database()));
-- Check table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema', 'gp_toolkit')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
-- Check data skew
SELECT gp_segment_id, COUNT(*)
FROM your_table
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
-- Active queries
SELECT
pid,
usename,
datname,
query_start,
now() - query_start AS duration,
query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
Best Practices
- Choose Distribution Keys Wisely: Use join columns and high-cardinality columns
- Use Partitioning: Partition large tables by date or other logical divisions
- Columnar for Analytics: Use columnar storage for analytical tables
- Replicate Small Tables: Replicate dimension tables to avoid data movement
- Regular ANALYZE: Keep statistics up-to-date for query optimization
- Monitor Data Skew: Ensure even data distribution across segments
- Batch Operations: Load data in bulk using gpfdist or COPY
- Vacuum Regularly: Reclaim space and maintain performance
- Use External Tables: Query data in S3/HDFS without loading
- Optimize Queries: Use EXPLAIN to understand query plans
- Backup Production: Enable daily backups for production databases
- Test Restores: Verify backup restoration procedures
Migration from PostgreSQL
Greenplum is PostgreSQL-compatible, but consider these differences:
-- PostgreSQL: Single server
-- Greenplum: Distributed, specify distribution
-- PostgreSQL
CREATE TABLE users (id INT, name VARCHAR(100));
-- Greenplum (specify distribution)
CREATE TABLE users (id INT, name VARCHAR(100))
DISTRIBUTED BY (id);
-- PostgreSQL: Any indexes work well
-- Greenplum: Prefer bitmap indexes for low-cardinality
-- PostgreSQL
CREATE INDEX idx_status ON orders(status);
-- Greenplum
CREATE INDEX idx_status ON orders USING bitmap(status);
-- Update/Delete operations
-- PostgreSQL: Row-level operations efficient
-- Greenplum: Better to use INSERT SELECT and truncate/reload
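As a minimal sketch of that reload pattern (the orders columns below are illustrative; adjust the column list to your schema), rebuild the corrected rows into a staging table, then truncate and reload instead of running a large UPDATE:
-- Rebuild rows instead of updating them in place (illustrative columns)
CREATE TABLE orders_fixed AS
SELECT
    order_id,
    customer_id,
    COALESCE(status, 'unknown') AS status
FROM orders
DISTRIBUTED BY (customer_id);
TRUNCATE orders;
INSERT INTO orders SELECT * FROM orders_fixed;
DROP TABLE orders_fixed;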
Troubleshooting
Connection Issues
# Test connection
psql "postgresql://username:password@host:5432/database"
-- Check connection limits (run inside psql)
SELECT * FROM pg_settings WHERE name = 'max_connections';
Performance Issues
-- Find slow queries
SELECT
pid,
usename,
now() - query_start AS duration,
query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
-- Check for data skew
SELECT gp_segment_id, COUNT(*) as row_count
FROM large_table
GROUP BY gp_segment_id
ORDER BY row_count DESC;
-- Analyze table statistics
ANALYZE VERBOSE your_table;
Segment Issues
-- Check segment status
SELECT * FROM gp_segment_configuration
WHERE status <> 'u' OR mode <> 's';
-- Check for mirroring issues
SELECT * FROM gp_segment_configuration
WHERE preferred_role <> role;
Use Cases
Data Warehouse
-- Star schema with fact and dimension tables
CREATE TABLE fact_sales (
sale_id BIGINT,
date_id INT,
product_id INT,
customer_id INT,
store_id INT,
quantity INT,
amount DECIMAL(10,2)
)
DISTRIBUTED BY (sale_id)  -- high-cardinality key avoids skew; dimensions are replicated, so join co-location is not needed
PARTITION BY RANGE (date_id)
(START (20240101) END (20250101) EVERY (100));
CREATE TABLE dim_date (
date_id INT PRIMARY KEY,
date DATE,
year INT,
quarter INT,
month INT,
day INT,
day_of_week INT
) DISTRIBUTED REPLICATED;
-- Complex analytical query
SELECT
d.year,
d.quarter,
SUM(f.amount) as revenue,
COUNT(DISTINCT f.customer_id) as unique_customers
FROM fact_sales f
INNER JOIN dim_date d ON f.date_id = d.date_id
GROUP BY d.year, d.quarter
ORDER BY d.year, d.quarter;
Real-time Analytics
-- Append-optimized table for continuous ingestion
CREATE TABLE event_stream (
event_id BIGINT,
user_id INT,
event_type VARCHAR(50),
timestamp TIMESTAMP,
data JSON
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (user_id)
PARTITION BY RANGE (timestamp)
(
START ('2024-01-01') END ('2025-01-01')
EVERY (INTERVAL '1 day')
);
-- Real-time aggregation
SELECT
event_type,
COUNT(*) as event_count,
COUNT(DISTINCT user_id) as unique_users
FROM event_stream
WHERE timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY event_type;
Support
For issues or questions:
- Check add-on logs in the Strongly dashboard
- Review Greenplum official documentation
- Contact Strongly support through the platform