Greenplum

Greenplum is a massively parallel processing (MPP) database built on PostgreSQL, designed for large-scale analytics and data warehouse workloads.

Overview

  • Versions: 7.1.0, 7.0.0, 6.30.1
  • Cluster Support: ✅ Yes (MPP architecture)
  • Use Cases: MPP data warehouse, large-scale analytics, business intelligence
  • Features: Distributed queries, petabyte-scale, PostgreSQL-compatible

Key Features

  • Massively Parallel Processing (MPP): Distribute queries across multiple nodes for fast analytics
  • PostgreSQL Compatible: Leverage PostgreSQL tools, drivers, and SQL syntax
  • Petabyte Scale: Handle massive datasets efficiently
  • Columnar Storage: Optimize for analytical queries with column-oriented storage
  • Advanced Analytics: Built-in support for machine learning and statistical functions
  • Distributed Architecture: Coordinator and segment nodes for parallel processing
  • Data Distribution: Hash, random, or replicated table distribution
  • Partitioning: Time-based and list partitioning for large tables
  • External Tables: Query data in S3, HDFS, and other external sources
  • Parallel Data Loading: High-speed data ingestion with gpfdist

Deployment Modes

Greenplum is designed for cluster deployments:

Standalone (Development)

  • Single coordinator and single segment
  • Suitable for development and testing
  • Limited scalability
  • Lower resource requirements

Cluster (Production MPP)

  • Recommended for production workloads
  • Multiple segment nodes for parallel processing
  • 3-10 segment nodes configurable
  • Coordinator node for query planning
  • High availability with mirror segments
  • Horizontal scaling capability

Resource Tiers

Tier     CPU   Memory   Disk     Segments   Best For
Small    0.5   1GB      10GB     1          Development, testing
Medium   1     2GB      25GB     2          Small analytics workloads
Large    2     4GB      50GB     4          Production analytics
XLarge   4     8GB      100GB    8          Large-scale data warehouse

Creating a Greenplum Add-on

  1. Navigate to Add-ons → Create Add-on
  2. Select Greenplum as the type
  3. Choose a version (7.1.0, 7.0.0, or 6.30.1)
  4. Select deployment mode:
    • Standalone: For development/testing
    • Cluster: For production (coordinator + segment nodes)
  5. Configure:
    • Label: Descriptive name (e.g., "Analytics Warehouse")
    • Description: Purpose and notes
    • Environment: Development or Production
    • Resource Tier: Based on your workload requirements
    • Segment Count: Number of segment nodes (cluster mode)
  6. Configure backups:
    • Schedule: Daily recommended for production
    • Retention: 7+ days for production
  7. Click Create Add-on

Connection Information

After deployment, connection details are available in the add-on details page and automatically injected into your apps via STRONGLY_SERVICES.

Connection String Format

postgresql://username:password@coordinator-host:5432/database

Accessing Connection Details

import os
import json
import psycopg2

# Parse STRONGLY_SERVICES
services = json.loads(os.environ.get('STRONGLY_SERVICES', '{}'))

# Get Greenplum add-on connection
gp_addon = services['addons']['addon-id']

# Connect using connection string (Greenplum is PostgreSQL-compatible)
conn = psycopg2.connect(gp_addon['connectionString'])
cursor = conn.cursor()

# Or connect using individual parameters
conn = psycopg2.connect(
    host=gp_addon['host'],
    port=gp_addon['port'],
    database=gp_addon['database'],
    user=gp_addon['username'],
    password=gp_addon['password']
)
cursor = conn.cursor()

# Execute query
cursor.execute("SELECT version()")
version = cursor.fetchone()
print(f"Greenplum version: {version[0]}")

cursor.close()
conn.close()

Core Concepts

Distributed Architecture

Greenplum uses a coordinator-segment architecture (see the inspection query after this list):

  • Coordinator (Master): Query planning, client connections, metadata
  • Segments: Store data and execute queries in parallel
  • Interconnect: High-speed network for data exchange between segments
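
You can inspect this layout from any SQL client by querying the gp_segment_configuration catalog; content = -1 identifies the coordinator, and role 'p'/'m' marks primary and mirror segments:

-- List the coordinator and all segments with their role and status
SELECT content, role, preferred_role, status, hostname, port
FROM gp_segment_configuration
ORDER BY content, role;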

Data Distribution

Tables can be distributed across segments using different strategies:

-- Hash distribution (default, best for most cases)
CREATE TABLE sales (
    id SERIAL,
    customer_id INT,
    amount DECIMAL(10,2),
    sale_date DATE
) DISTRIBUTED BY (customer_id);

-- Random distribution (good for staging tables or tables without a natural hash key)
CREATE TABLE countries (
    code CHAR(2),
    name VARCHAR(100)
) DISTRIBUTED RANDOMLY;

-- Replicated distribution (full copy on every segment, ideal for small lookup tables)
CREATE TABLE small_lookup (
    id INT,
    value VARCHAR(100)
) DISTRIBUTED REPLICATED;
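
To confirm how an existing table is distributed, you can query the gp_distribution_policy catalog (column names below follow the Greenplum 6/7 catalog; adjust for other versions):

-- Check the distribution policy of a table
SELECT c.relname, p.policytype, p.distkey
FROM gp_distribution_policy p
JOIN pg_class c ON c.oid = p.localoid
WHERE c.relname = 'sales';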

Partitioning

Partition large tables for better query performance:

-- Range partitioning by date
CREATE TABLE sales (
    id SERIAL,
    customer_id INT,
    amount DECIMAL(10,2),
    sale_date DATE
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (sale_date)
(
    PARTITION sales_2023 START ('2023-01-01') END ('2024-01-01'),
    PARTITION sales_2024 START ('2024-01-01') END ('2025-01-01'),
    DEFAULT PARTITION sales_other
);

-- Add a new partition (only possible when the table has no DEFAULT partition)
ALTER TABLE sales ADD PARTITION sales_2025
    START ('2025-01-01') END ('2026-01-01');

-- With a DEFAULT partition (as above), split it instead
ALTER TABLE sales SPLIT DEFAULT PARTITION
    START ('2025-01-01') END ('2026-01-01')
    INTO (PARTITION sales_2025, DEFAULT PARTITION);

-- Drop an old partition
ALTER TABLE sales DROP PARTITION sales_2023;
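
A key benefit is partition elimination: when a query filters on the partition key, only the matching partitions are scanned. A quick way to confirm this is to EXPLAIN a date-filtered query and check that the plan touches only the expected partitions:

-- Only the 2024 partitions should be scanned for this predicate
EXPLAIN
SELECT SUM(amount)
FROM sales
WHERE sale_date >= '2024-01-01' AND sale_date < '2024-07-01';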

Common Operations

Creating Tables

-- Fact table with hash distribution and partitioning
CREATE TABLE fact_orders (
    order_id BIGINT,
    customer_id INT,
    product_id INT,
    quantity INT,
    amount DECIMAL(10,2),
    order_date DATE,
    region VARCHAR(50)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
(
    START ('2024-01-01') END ('2025-01-01')
    EVERY (INTERVAL '1 month')
);

-- Dimension table with replication
CREATE TABLE dim_products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(200),
    category VARCHAR(100),
    price DECIMAL(10,2)
) DISTRIBUTED REPLICATED;

-- Columnar storage for analytics
CREATE TABLE analytics_events (
    event_id BIGINT,
    user_id INT,
    event_type VARCHAR(50),
    event_data JSON,
    timestamp TIMESTAMP
)
WITH (appendonly=true, orientation=column, compresstype=zstd)
DISTRIBUTED BY (user_id)
PARTITION BY RANGE (timestamp)
(
    START ('2024-01-01') END ('2025-01-01')
    EVERY (INTERVAL '1 day')
);

Loading Data

COPY Command:

-- Load from CSV file
COPY sales FROM '/path/to/sales.csv' CSV HEADER;

-- Load with specific columns
COPY sales (customer_id, amount, sale_date)
FROM '/path/to/data.csv' CSV;

gpfdist (High-speed parallel loading):

-- Create external table pointing to gpfdist
CREATE EXTERNAL TABLE ext_sales (
    customer_id INT,
    amount DECIMAL(10,2),
    sale_date DATE
)
LOCATION ('gpfdist://filehost:8080/sales*.csv')
FORMAT 'CSV' (HEADER);

-- Load data in parallel
INSERT INTO sales SELECT * FROM ext_sales;

External Tables (S3, HDFS):

-- S3 external table (requires PXF)
CREATE EXTERNAL TABLE s3_data (
    id INT,
    name VARCHAR(100),
    value DECIMAL(10,2)
)
LOCATION ('pxf://my-bucket/data/*.csv?PROFILE=s3:text')
FORMAT 'CSV';

-- Query external data
SELECT * FROM s3_data WHERE value > 100;

Analytical Queries

-- Aggregation across segments
SELECT
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM fact_orders
WHERE order_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;

-- Window functions
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM fact_orders
ORDER BY customer_id, order_date;

-- Join optimization
SELECT
    o.order_id,
    p.product_name,
    o.quantity,
    o.amount
FROM fact_orders o
INNER JOIN dim_products p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01';

Vacuum and Analyze

-- Analyze table for query optimization
ANALYZE fact_orders;

-- Vacuum to reclaim space
VACUUM fact_orders;

-- Full vacuum and analyze
VACUUM FULL ANALYZE fact_orders;
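
To decide which tables actually need maintenance, the gp_toolkit schema provides diagnostic views; for example, gp_toolkit.gp_bloat_diag flags heap tables whose on-disk size is well above the expected size (a sketch, assuming the standard gp_toolkit installation):

-- Find bloated heap tables that are candidates for VACUUM
SELECT bdinspname AS schema_name, bdirelname AS table_name, bdidiag AS diagnosis
FROM gp_toolkit.gp_bloat_diag
ORDER BY bdirelpages DESC;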

Advanced Analytics

Window Functions

-- Ranking
SELECT
    product_id,
    product_name,
    revenue,
    RANK() OVER (ORDER BY revenue DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank,
    ROW_NUMBER() OVER (ORDER BY revenue DESC) AS row_num
FROM product_revenue;

-- Moving average
SELECT
    sale_date,
    amount,
    AVG(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7day
FROM daily_sales;

Common Table Expressions (CTEs)

-- Recursive CTE
WITH RECURSIVE subordinates AS (
    SELECT employee_id, name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    SELECT e.employee_id, e.name, e.manager_id, s.level + 1
    FROM employees e
    INNER JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates ORDER BY level, name;

-- Multiple CTEs
WITH
monthly_sales AS (
    SELECT
        DATE_TRUNC('month', sale_date) AS month,
        SUM(amount) AS total
    FROM sales
    GROUP BY 1
),
growth AS (
    SELECT
        month,
        total,
        LAG(total) OVER (ORDER BY month) AS prev_month,
        (total - LAG(total) OVER (ORDER BY month)) / LAG(total) OVER (ORDER BY month) * 100 AS growth_pct
    FROM monthly_sales
)
SELECT * FROM growth WHERE growth_pct > 10;

MADlib (Machine Learning)

Greenplum includes MADlib for in-database machine learning:

-- Linear regression
-- Linear regression: train a model
SELECT madlib.linregr_train(
    'sales',                      -- source table
    'sales_model',                -- output model table
    'amount',                     -- dependent variable
    'ARRAY[1, quantity, price]'   -- independent variables
);

-- Predict using the trained coefficients
SELECT madlib.linregr_predict(
    m.coef,
    ARRAY[1, 10, 50.00]
) AS predicted_amount
FROM sales_model m;

-- K-means clustering (returns centroids for 5 clusters)
SELECT * FROM madlib.kmeans_random(
    'customers',   -- source table
    'features',    -- column containing the feature array
    5              -- number of clusters
);

Performance Optimization

Distribution Keys

Choose distribution keys carefully:

-- Good: Distribute by join key
CREATE TABLE orders (
    order_id BIGINT,
    customer_id INT,
    ...
) DISTRIBUTED BY (customer_id);

CREATE TABLE customers (
    customer_id INT,
    ...
) DISTRIBUTED BY (customer_id);

-- Now joins are co-located (no data movement)
SELECT * FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
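
To verify that a join is co-located, check its plan: a co-located join shows no Redistribute Motion or Broadcast Motion step for the joined tables. A quick sanity check:

-- The plan for this join should contain no Redistribute or Broadcast Motion,
-- because both tables share the same distribution key
EXPLAIN
SELECT o.order_id, c.customer_id
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;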

Columnar Storage

Use columnar storage for analytical workloads:

-- Append-optimized columnar table
CREATE TABLE analytics_data (
    date DATE,
    dimension1 VARCHAR(100),
    dimension2 VARCHAR(100),
    metric1 DECIMAL(10,2),
    metric2 DECIMAL(10,2)
)
WITH (
    appendonly=true,
    orientation=column,
    compresstype=zstd,
    compresslevel=5
)
DISTRIBUTED BY (date);

Indexes

-- B-tree index (use sparingly in Greenplum)
CREATE INDEX idx_orders_date ON orders(order_date);

-- Bitmap index (better for low-cardinality columns)
CREATE INDEX idx_orders_status ON orders USING bitmap(status);

-- Partial index
CREATE INDEX idx_active_orders ON orders(order_date)
WHERE status = 'active';

Query Optimization

-- Use EXPLAIN to analyze query plan
EXPLAIN SELECT * FROM orders WHERE order_date >= '2024-01-01';

-- EXPLAIN ANALYZE to see actual execution
EXPLAIN ANALYZE
SELECT
    customer_id,
    SUM(amount) AS total
FROM orders
GROUP BY customer_id;

-- Check data skew
SELECT gp_segment_id, COUNT(*)
FROM orders
GROUP BY gp_segment_id;

Backup & Restore

Greenplum add-ons use pg_dump for backups, with MPP-aware backup capabilities.

Backup Configuration

  • Tool: pg_dump (MPP-aware)
  • Format: .sql
  • Type: Logical backup with parallel dump
  • Storage: AWS S3 (s3://strongly-backups/backups/<addon-id>/)

Manual Backup

  1. Go to add-on details page
  2. Click Backup Now
  3. Monitor progress in job logs
  4. Backup saved as backup-YYYYMMDDHHMMSS.sql

Scheduled Backups

Configure during add-on creation or in settings:

  • Daily backups: Recommended for production
  • Retention: 7-14 days minimum for production
  • Custom cron: For specific schedules

Restore Process

  1. Navigate to Backups tab
  2. Select backup from list
  3. Click Restore
  4. Confirm (add-on will stop temporarily)
  5. Data restored across all segments
  6. Add-on automatically restarts

Data Loss

Restoring from backup replaces ALL current data. Create a current backup first if needed.

Monitoring

Monitor your Greenplum add-on through the Strongly platform:

  • CPU Usage: Track CPU utilization across segments
  • Memory Usage: Monitor memory consumption
  • Disk Space: Watch disk utilization per segment
  • Query Performance: Slow query detection
  • Data Skew: Check for uneven data distribution
  • Segment Health: Status of coordinator and segment nodes

System Views

-- Check segment configuration
SELECT * FROM gp_segment_configuration;

-- Check database size
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Check table sizes
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema', 'gp_toolkit')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

-- Check data skew
SELECT gp_segment_id, COUNT(*)
FROM your_table
GROUP BY gp_segment_id
ORDER BY gp_segment_id;

-- Active queries
SELECT
    pid,
    usename,
    datname,
    query_start,
    now() - query_start AS duration,
    query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;

Best Practices

  1. Choose Distribution Keys Wisely: Use join columns and high-cardinality columns
  2. Use Partitioning: Partition large tables by date or other logical divisions
  3. Columnar for Analytics: Use columnar storage for analytical tables
  4. Replicate Small Tables: Replicate dimension tables to avoid data movement
  5. Regular ANALYZE: Keep statistics up-to-date for query optimization
  6. Monitor Data Skew: Ensure even data distribution across segments (see the query sketch after this list)
  7. Batch Operations: Load data in bulk using gpfdist or COPY
  8. Vacuum Regularly: Reclaim space and maintain performance
  9. Use External Tables: Query data in S3/HDFS without loading
  10. Optimize Queries: Use EXPLAIN to understand query plans
  11. Backup Production: Enable daily backups for production databases
  12. Test Restores: Verify backup restoration procedures
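
As a concrete example of practice 6, gp_toolkit exposes a per-table skew coefficient; larger values indicate a more uneven spread of rows across segments (assumes a standard gp_toolkit installation):

-- Tables with the most uneven row distribution across segments
SELECT skcnamespace AS schema_name, skcrelname AS table_name, skccoeff AS skew_coefficient
FROM gp_toolkit.gp_skew_coefficients
ORDER BY skccoeff DESC
LIMIT 10;

Note that this view counts rows per segment for each table, so it can take a while on large databases.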

Migration from PostgreSQL

Greenplum is PostgreSQL-compatible, but consider these differences:

-- PostgreSQL: Single server
-- Greenplum: Distributed, specify distribution

-- PostgreSQL
CREATE TABLE users (id INT, name VARCHAR(100));

-- Greenplum (specify distribution)
CREATE TABLE users (id INT, name VARCHAR(100))
DISTRIBUTED BY (id);

-- PostgreSQL: Any indexes work well
-- Greenplum: Prefer bitmap indexes for low-cardinality

-- PostgreSQL
CREATE INDEX idx_status ON orders(status);

-- Greenplum
CREATE INDEX idx_status ON orders USING bitmap(status);

-- Update/Delete operations
-- PostgreSQL: Row-level operations efficient
-- Greenplum: Better to use INSERT SELECT and truncate/reload
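
In practice, the comments above translate into a rewrite-and-reload pattern: build the desired result with INSERT ... SELECT into a staging table, then truncate and reload the original. A sketch (the status column and filter are illustrative):

-- Bulk "delete" without row-level DELETEs: keep only the rows you want
CREATE TABLE orders_staging (LIKE orders) DISTRIBUTED BY (customer_id);

INSERT INTO orders_staging
SELECT * FROM orders WHERE status <> 'cancelled';

TRUNCATE orders;
INSERT INTO orders SELECT * FROM orders_staging;
DROP TABLE orders_staging;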

Troubleshooting

Connection Issues

# Test connection
psql "postgresql://username:password@host:5432/database"

# Check the configured connection limit
psql "postgresql://username:password@host:5432/database" -c "SHOW max_connections;"

Performance Issues

-- Find slow queries
SELECT
    pid,
    usename,
    now() - query_start AS duration,
    query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;

-- Check for data skew
SELECT gp_segment_id, COUNT(*) as row_count
FROM large_table
GROUP BY gp_segment_id
ORDER BY row_count DESC;

-- Analyze table statistics
ANALYZE VERBOSE your_table;

Segment Issues

-- Check segment status
SELECT * FROM gp_segment_configuration
WHERE status <> 'u' OR mode <> 's';

-- Check for mirroring issues
SELECT * FROM gp_segment_configuration
WHERE preferred_role <> role;

Use Cases

Data Warehouse

-- Star schema with fact and dimension tables
CREATE TABLE fact_sales (
    sale_id BIGINT,
    date_id INT,
    product_id INT,
    customer_id INT,
    store_id INT,
    quantity INT,
    amount DECIMAL(10,2)
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (date_id)
(START (20240101) END (20250101) EVERY (100));

CREATE TABLE dim_date (
    date_id INT PRIMARY KEY,
    date DATE,
    year INT,
    quarter INT,
    month INT,
    day INT,
    day_of_week INT
) DISTRIBUTED REPLICATED;

-- Complex analytical query
SELECT
    d.year,
    d.quarter,
    SUM(f.amount) AS revenue,
    COUNT(DISTINCT f.customer_id) AS unique_customers
FROM fact_sales f
INNER JOIN dim_date d ON f.date_id = d.date_id
GROUP BY d.year, d.quarter
ORDER BY d.year, d.quarter;

Real-time Analytics

-- Append-optimized table for continuous ingestion
CREATE TABLE event_stream (
    event_id BIGINT,
    user_id INT,
    event_type VARCHAR(50),
    timestamp TIMESTAMP,
    data JSON
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (user_id)
PARTITION BY RANGE (timestamp)
(
    START ('2024-01-01') END ('2025-01-01')
    EVERY (INTERVAL '1 day')
);

-- Real-time aggregation
SELECT
    event_type,
    COUNT(*) AS event_count,
    COUNT(DISTINCT user_id) AS unique_users
FROM event_stream
WHERE timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY event_type;

Support

For issues or questions:

  • Check add-on logs in the Strongly dashboard
  • Review Greenplum official documentation
  • Contact Strongly support through the platform