
Google BigQuery Configuration

Connect to Google BigQuery for analytics, large-scale queries, and business intelligence workloads.

Connection Parameters

Required Fields

  • Project ID: GCP project ID
  • Service Account Key: Complete JSON key file content

Optional Fields

  • Dataset ID: Default dataset to use when a query doesn't specify one

Configuration Example

When creating a BigQuery data source, provide the following information:

Field | Example Value | Notes
Name | prod-bigquery | Unique identifier
Label | Production BigQuery | Display name
Project ID | my-gcp-project | GCP project ID
Dataset ID | analytics | Optional default dataset
Service Account Key | { "type": "service_account", ... } | Full JSON key
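
The usage examples later on this page read this configuration from the STRONGLY_SERVICES environment variable. As a rough sketch, the entry for the data source above is assumed to look like the following; the field names are inferred from those examples, not from an official schema:

# Assumed shape of the STRONGLY_SERVICES payload for this data source.
# Only the fields the examples below actually read (project_id and
# service_account_key) are shown; treat this as an illustration.
strongly_services = {
    "datasources": {
        "prod-bigquery": {
            "credentials": {
                "project_id": "my-gcp-project",
                "service_account_key": '{ "type": "service_account", ... }'
            }
        }
    }
}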

Creating a Service Account Key

To create a service account key for BigQuery:

  1. Go to GCP Console
  2. Navigate to IAM & Admin → Service Accounts
  3. Click Create Service Account or select an existing one
  4. Grant necessary roles (see Required Permissions)
  5. Click the Keys tab → Add Key → Create New Key
  6. Select JSON format
  7. Download the JSON key file
  8. Copy and paste the entire JSON content into the Service Account Key field
Security

Keep your service account key secure. Never commit it to version control or share it publicly.

Required Permissions

Grant the following IAM roles to your service account:

For Read-Only Access

- BigQuery Data Viewer (roles/bigquery.dataViewer)
- BigQuery Job User (roles/bigquery.jobUser)

For Read-Write Access

- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Job User (roles/bigquery.jobUser)

For Full Access

- BigQuery Admin (roles/bigquery.admin)
Least Privilege

Use the minimum required permissions. For most applications, read-only access is sufficient.

Usage in Applications

Python Example

import os, json
from google.cloud import bigquery
from google.oauth2 import service_account

# Parse STRONGLY_SERVICES environment variable
services = json.loads(os.getenv('STRONGLY_SERVICES', '{}'))
datasources = services.get('datasources', {})

# Get BigQuery data source
bq_config = datasources['prod-bigquery']
creds_dict = json.loads(bq_config['credentials']['service_account_key'])

# Create credentials from service account key
credentials = service_account.Credentials.from_service_account_info(creds_dict)

# Create BigQuery client
client = bigquery.Client(
    credentials=credentials,
    project=bq_config['credentials']['project_id']
)

# Execute query
query = """
SELECT *
FROM `my-project.analytics.users`
WHERE active = TRUE
"""
results = client.query(query).result()

# Process results
for row in results:
    print(f"{row.id}: {row.name}")

Python with Pandas

import os, json
import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account

# Parse STRONGLY_SERVICES environment variable
services = json.loads(os.getenv('STRONGLY_SERVICES', '{}'))
datasources = services.get('datasources', {})

# Get BigQuery data source
bq_config = datasources['prod-bigquery']
creds_dict = json.loads(bq_config['credentials']['service_account_key'])

# Create credentials
credentials = service_account.Credentials.from_service_account_info(creds_dict)

# Create client
client = bigquery.Client(
    credentials=credentials,
    project=bq_config['credentials']['project_id']
)

# Query into DataFrame
query = """
SELECT *
FROM `my-project.analytics.users`
WHERE active = TRUE
LIMIT 1000
"""
df = client.query(query).to_dataframe()
print(df.head())

Node.js Example

const { BigQuery } = require('@google-cloud/bigquery');

// Parse STRONGLY_SERVICES environment variable
const services = JSON.parse(process.env.STRONGLY_SERVICES || '{}');
const datasources = services.datasources || {};

// Get BigQuery data source
const bqConfig = datasources['prod-bigquery'];
const credentials = JSON.parse(bqConfig.credentials.service_account_key);

// Create BigQuery client
const bigquery = new BigQuery({
  projectId: bqConfig.credentials.project_id,
  credentials: credentials
});

// Execute query
const query = `
SELECT *
FROM \`my-project.analytics.users\`
WHERE active = TRUE
`;

async function runQuery() {
  const [rows] = await bigquery.query(query);
  console.log('Rows:');
  rows.forEach(row => console.log(row));
}

runQuery();

Java Example

import com.google.auth.oauth2.ServiceAccountCredentials;
import com.google.cloud.bigquery.*;
import org.json.*;
import java.io.*;

public class BigQueryExample {
    public static void main(String[] args) throws Exception {
        // Parse STRONGLY_SERVICES environment variable
        String servicesJson = System.getenv("STRONGLY_SERVICES");
        JSONObject services = new JSONObject(servicesJson);
        JSONObject datasources = services.getJSONObject("datasources");
        JSONObject bqConfig = datasources.getJSONObject("prod-bigquery");
        JSONObject creds = bqConfig.getJSONObject("credentials");

        // Parse service account key
        String keyJson = creds.getString("service_account_key");
        InputStream keyStream = new ByteArrayInputStream(keyJson.getBytes());
        ServiceAccountCredentials credentials =
                ServiceAccountCredentials.fromStream(keyStream);

        // Create BigQuery client
        BigQuery bigquery = BigQueryOptions.newBuilder()
                .setCredentials(credentials)
                .setProjectId(creds.getString("project_id"))
                .build()
                .getService();

        // Execute query
        String query = "SELECT * FROM `my-project.analytics.users` WHERE active = TRUE";
        QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query).build();
        TableResult results = bigquery.query(queryConfig);

        // Process results
        for (FieldValueList row : results.iterateAll()) {
            System.out.println(row.toString());
        }
    }
}

Query Syntax

Fully Qualified Table Names

Always use fully qualified table names in the format:

`project-id.dataset-id.table-name`

Example:

SELECT COUNT(*)
FROM `my-project.analytics.users`
WHERE created_date >= '2024-01-01'

Standard SQL vs Legacy SQL

BigQuery supports two SQL dialects. Always use Standard SQL (default):

# Standard SQL (recommended)
query = """
SELECT * FROM `project.dataset.table`
WHERE date >= '2024-01-01'
"""

# Legacy SQL (not recommended)
query = """
#legacySQL
SELECT * FROM [project:dataset.table]
WHERE date >= '2024-01-01'
"""

Cost Optimization

Query Cost Estimation

Before running expensive queries, estimate costs:

# Dry run to estimate bytes processed
job_config = bigquery.QueryJobConfig(dry_run=True)
query_job = client.query(query, job_config=job_config)

print(f"This query will process {query_job.total_bytes_processed} bytes")
# Approximate cost: $5 per TB processed
estimated_cost = (query_job.total_bytes_processed / (1024**4)) * 5
print(f"Estimated cost: ${estimated_cost:.2f}")

Optimize Query Costs

  1. Use Partitioning: Query only necessary partitions
SELECT * FROM `project.dataset.table`
WHERE _PARTITIONTIME >= TIMESTAMP('2024-01-01')
  2. Select Specific Columns: Avoid SELECT *
-- Bad (expensive)
SELECT * FROM `project.dataset.table`

-- Good (cheaper)
SELECT id, name, email FROM `project.dataset.table`
  3. Use Clustering: Optimize queries with WHERE clauses
CREATE TABLE `project.dataset.table`
PARTITION BY DATE(created_date)
CLUSTER BY user_id, region
AS SELECT * FROM source_table
  4. Limit Results During Development:
SELECT * FROM `project.dataset.table`
LIMIT 100
  5. Use Table Preview: Free preview of table data
# Preview table (free)
table = client.get_table("project.dataset.table")
rows = client.list_rows(table, max_results=10)
for row in rows:
    print(row)

Common Issues

Permission Denied

  • Verify service account has required BigQuery roles
  • Check project ID is correct
  • Ensure service account key is valid and not expired
  • Grant dataset-level permissions if needed:
# Grant dataset access
bq show --format=prettyjson project:dataset > /tmp/dataset.json
# Edit JSON to add service account email to access list
bq update --source /tmp/dataset.json project:dataset
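
If you prefer to stay in Python, a similar dataset-level grant can be sketched with the BigQuery client's access-entry API; the dataset name and service account email below are placeholders:

# Append a READER access entry for the service account, then push back
# only the modified access list.
dataset = client.get_dataset("my-project.analytics")
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="my-service-account@my-project.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])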

Invalid Credentials

  • Ensure service account key JSON is complete and valid
  • Verify the key hasn't been deleted or revoked
  • Check that the service account still exists
  • Regenerate key if necessary
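
A quick way to confirm the key JSON itself is well formed is to try building credentials from it before creating a client. A sketch, reusing the creds_dict parsed in the examples above; the required-field set is a heuristic, not the full key schema:

from google.oauth2 import service_account

# Spot-check a few fields every service account key file contains
required = {"type", "project_id", "private_key", "client_email"}
missing = required - creds_dict.keys()
if missing:
    raise ValueError(f"Service account key is missing fields: {missing}")

# Raises ValueError if the key material itself is malformed
credentials = service_account.Credentials.from_service_account_info(creds_dict)
print(f"Key parses correctly for {credentials.service_account_email}")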

Table Not Found

  • Verify table name is fully qualified: `project.dataset.table`
  • Check if table exists in BigQuery console
  • Ensure service account has access to the dataset
  • Verify dataset ID in your query matches actual dataset name
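
You can also check what the service account actually sees by listing datasets and tables with the same client; the dataset name below is illustrative:

# Datasets visible to this service account in the configured project
for dataset in client.list_datasets():
    print(dataset.dataset_id)

# Tables in the dataset referenced by the failing query
for table in client.list_tables("my-project.analytics"):
    print(table.full_table_id)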

Query Timeout

  • Break large queries into smaller chunks
  • Use appropriate WHERE clauses to filter data
  • Consider using materialized views for complex queries
  • Increase query timeout in your client:
job_config = bigquery.QueryJobConfig(
    use_query_cache=True,
    use_legacy_sql=False,
    maximum_bytes_billed=1000000000  # 1 GB limit
)
query_job = client.query(query, job_config=job_config)
results = query_job.result(timeout=300)  # 5 minute timeout

Best Practices

  1. Use Standard SQL: Always use Standard SQL, not Legacy SQL
  2. Partition Large Tables: Use date partitioning for time-series data
  3. Cluster Tables: Use clustering for frequently filtered columns
  4. Select Specific Columns: Avoid SELECT * to reduce costs
  5. Use Query Cache: Enable query cache to avoid reprocessing
  6. Monitor Costs: Set up billing alerts and monitor query costs
  7. Least Privilege: Grant minimal required permissions to service accounts
  8. Secure Keys: Never commit service account keys to version control
  9. Query Preview: Use preview for development to avoid charges
  10. Batch Queries: Batch multiple queries when possible to reduce overhead
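
One way to act on the last point is BigQuery's batch priority, which queues a query to run when idle capacity is available instead of interactively. A minimal sketch, reusing the Python client from the earlier examples:

# Submit at batch priority; BigQuery schedules the job when slots are free
job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)
batch_job = client.query(
    "SELECT COUNT(*) AS n FROM `my-project.analytics.users`",
    job_config=job_config,
)
rows = batch_job.result()  # Blocks until the batch job finishes
for row in rows:
    print(row.n)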