Logs, Metrics & Monitoring

Monitor your application's health, performance, and resource usage in real-time.

Overview

The Strongly platform provides comprehensive monitoring capabilities:

  • Real-time Logs: Stream application logs in real-time
  • Resource Metrics: CPU, memory, disk usage
  • Health Checks: Kubernetes liveness and readiness probes
  • Request Metrics: Request rates, response times, error rates
  • Autoscaling Events: Track scaling decisions and triggers

Viewing Logs

Real-time Log Streaming

  1. Navigate to your app details page
  2. Click Logs tab
  3. View real-time log output from all instances

Features:

  • Live streaming (updates every 2-3 seconds)
  • Multi-instance aggregation
  • Searchable and filterable
  • Downloadable for analysis

Log Levels

Applications should implement structured logging with levels:

Node.js Example:

const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.json(),
  transports: [
    new winston.transports.Console({
      format: winston.format.simple()
    })
  ]
});

logger.error('Database connection failed', { error: err.message });
logger.warn('High memory usage detected', { usage: '85%' });
logger.info('User logged in', { userId: 123 });
logger.debug('Processing request', { requestId: 'abc123' });

Python Example:

import logging
import os

# Configure logging
logging.basicConfig(
    level=os.environ.get('LOG_LEVEL', 'INFO'),
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

logger = logging.getLogger(__name__)

logger.error(f'Database connection failed: {err}')
logger.warning('High memory usage: 85%')
logger.info(f'User logged in: {user_id}')
logger.debug(f'Processing request: {request_id}')

Best Practices for Logging

  1. Use Structured Logging: JSON format for easy parsing
  2. Include Context: Request IDs, user IDs, timestamps
  3. Appropriate Levels: DEBUG for development, INFO for production
  4. Avoid Sensitive Data: Never log passwords, tokens, or PII
  5. Log Errors with Stack Traces: Include full error details

Good Logging:

logger.info('User login successful', {
  userId: user.id,
  ip: req.ip,
  timestamp: new Date().toISOString()
});

logger.error('Payment processing failed', {
  orderId: order.id,
  amount: order.amount,
  error: err.message,
  stack: err.stack
});

Bad Logging:

console.log('User logged in');  // No context
console.log(user); // Too much data, potential PII
console.log('Error: ' + err); // No stack trace

Resource Metrics

CPU Usage

Monitor CPU consumption across instances:

  • Current Usage: Real-time CPU percentage
  • Average Usage: Average over last 5 minutes
  • Peak Usage: Highest CPU usage in time window
  • Throttling: When CPU limit is reached

Metrics:

  • cpu_usage_percent: Percentage of allocated CPU
  • cpu_usage_cores: Absolute CPU cores used
  • cpu_throttled_seconds: Time spent throttled

Memory Usage

Track memory consumption and prevent OOM errors:

  • Current Usage: Real-time memory consumption
  • Average Usage: Average over last 5 minutes
  • Peak Usage: Highest memory usage in time window
  • OOM Events: Out-of-memory kills

Metrics:

  • memory_usage_percent: Percentage of allocated memory
  • memory_usage_bytes: Absolute memory used
  • memory_oom_kills: Count of OOM events

Disk Usage

Monitor disk space consumption:

  • Current Usage: Disk space used
  • Available: Remaining disk space
  • I/O Metrics: Read/write operations

Metrics:

  • disk_usage_percent: Percentage of allocated disk
  • disk_usage_bytes: Absolute disk space used
  • disk_io_read_bytes: Bytes read from disk
  • disk_io_write_bytes: Bytes written to disk

Network Metrics

Track network traffic:

  • Inbound Traffic: Bytes received
  • Outbound Traffic: Bytes sent
  • Connection Count: Active connections

Metrics:

  • network_rx_bytes: Bytes received
  • network_tx_bytes: Bytes transmitted
  • network_connections: Active connections

Health Checks

Kubernetes uses health checks to monitor application health:

Liveness Probe

Determines if the application is running. If it fails, Kubernetes restarts the container.

Configuration (from manifest):

health_check:
  path: /health
  port: 3000
  initial_delay: 10     # Wait before first check
  period: 30            # Check every 30 seconds
  timeout: 3            # Timeout after 3 seconds
  failure_threshold: 3  # Restart after 3 failures

Implementation:

// Express.js
app.get('/health', (req, res) => {
  res.status(200).json({
    status: 'ok',
    timestamp: new Date().toISOString()
  });
});

Readiness Probe

Determines if the application is ready to receive traffic. If it fails, Kubernetes stops sending requests.

Use for:

  • Database connection checks
  • External dependency checks
  • Startup tasks completion

Implementation:

// Express.js
app.get('/ready', async (req, res) => {
  try {
    // Check database connection
    await db.ping();

    // Check external API
    await fetch('https://api.example.com/health');

    res.status(200).json({
      status: 'ready',
      checks: {
        database: 'ok',
        externalApi: 'ok'
      }
    });
  } catch (err) {
    res.status(503).json({
      status: 'not ready',
      error: err.message
    });
  }
});

Health Check Status

View health check results:

  • Healthy: All checks passing
  • Unhealthy: Some checks failing
  • Unknown: No data or probes not configured

Request Metrics

Request Rate

Track incoming request rate:

  • Requests per Second: Current request rate
  • Requests per Minute: Aggregated over 1 minute
  • Request Count: Total requests over time period
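The windowed rates above can be sketched in a few lines: count only the events whose timestamps still fall inside a sliding window. The `RateTracker` class and its timestamps below are illustrative, not the platform's actual collector.

```python
# Sliding-window request rate: count events from the last `window` seconds.
import time
from collections import deque

class RateTracker:
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()

    def record(self, now=None):
        self.events.append(time.monotonic() if now is None else now)

    def per_minute(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop events that have left the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

tracker = RateTracker()
for t in (0, 10, 30, 59, 61):
    tracker.record(now=t)
print(tracker.per_minute(now=61))  # 4: the event at t=0 has aged out
```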

Response Time

Monitor application performance:

  • Average Response Time: Mean response time
  • P50 (Median): 50% of requests faster than this
  • P95: 95% of requests faster than this
  • P99: 99% of requests faster than this

Example Metrics:

Average: 45ms
P50: 35ms
P95: 120ms
P99: 250ms
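To make the percentile definitions concrete, here is a small nearest-rank sketch: PXX is the smallest sampled value that at least XX% of requests were faster than or equal to. The sample values are made up for illustration.

```python
# Nearest-rank percentiles over a list of sampled response times (ms).
import statistics

samples_ms = [35, 38, 40, 42, 45, 47, 52, 60, 95, 120, 180, 250]

def percentile(data, p):
    """Smallest value >= p percent of the sorted samples (nearest rank)."""
    ordered = sorted(data)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

print('Average:', round(statistics.mean(samples_ms)), 'ms')
print('P50:', percentile(samples_ms, 50), 'ms')
print('P95:', percentile(samples_ms, 95), 'ms')
print('P99:', percentile(samples_ms, 99), 'ms')
```

Production systems usually compute these from histograms (as in the Prometheus examples below) rather than raw samples, but the definition is the same.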

Error Rate

Track failed requests:

  • Error Count: Total errors in time period
  • Error Rate: Percentage of failed requests
  • Error Types: Breakdown by status code (4xx, 5xx)

Metrics:

  • http_requests_total: Total HTTP requests
  • http_requests_errors: Failed HTTP requests
  • http_request_duration_seconds: Response time histogram
  • http_requests_by_status: Requests grouped by status code
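The error rate is derived from the two counters above. A minimal sketch (function name is ours, not a platform API):

```python
# Error rate as a percentage of all requests, guarding against
# division by zero when no traffic has arrived yet.
def error_rate(total_requests, failed_requests):
    if total_requests == 0:
        return 0.0
    return 100.0 * failed_requests / total_requests

print(error_rate(2000, 37))  # 1.85 -> below a 5% alert threshold
```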

Monitoring Dashboard

App Details Page

View comprehensive monitoring data:

  1. Overview Tab:

    • Current status (Running, Stopped, Failed)
    • Instance count and health
    • Quick metrics summary
  2. Metrics Tab:

    • CPU usage chart
    • Memory usage chart
    • Network traffic chart
    • Disk usage chart
  3. Logs Tab:

    • Real-time log streaming
    • Search and filter
    • Download logs
  4. Scaling Tab (if autoscaling enabled):

    • Current vs desired replicas
    • Scaling events history
    • CPU/memory thresholds
    • Scaling triggers

Custom Metrics

Prometheus Metrics

Expose custom metrics for Prometheus scraping:

Node.js Example:

const promClient = require('prom-client');

// Create custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status']
});

const activeUsers = new promClient.Gauge({
  name: 'active_users_total',
  help: 'Number of active users'
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

// Instrument requests
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });

  next();
});

Python Example:

import time

from flask import request
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# Create custom metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

active_users = Gauge(
    'active_users_total',
    'Number of active users'
)

# Expose metrics endpoint
@app.route('/metrics')
def metrics():
    return generate_latest()

# Instrument requests
@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - request.start_time

    http_requests_total.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()

    http_request_duration.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(duration)

    return response

Alerting

Set up alerts for critical events:

Alert Types

  1. Resource Alerts:

    • High CPU usage (> 80%)
    • High memory usage (> 85%)
    • Disk space low (< 10%)
  2. Application Alerts:

    • High error rate (> 5%)
    • Slow response time (P95 > 1s)
    • Health check failures
  3. Scaling Alerts:

    • Scaled to max replicas
    • Frequent scaling events
    • Scaling failures

Alert Configuration

Configure alerts in app settings:

alerts:
  - name: high_cpu_usage
    metric: cpu_usage_percent
    threshold: 80
    duration: 5m
    severity: warning

  - name: high_error_rate
    metric: http_error_rate
    threshold: 5
    duration: 2m
    severity: critical

  - name: health_check_failed
    metric: health_check_failures
    threshold: 3
    duration: 1m
    severity: critical

Troubleshooting with Logs

Common Patterns

Application Crashes:

# Search for error logs
Error: Cannot read property 'id' of undefined
at /app/server.js:45:23

# Check stack trace for root cause
# Fix code and redeploy

High Memory Usage:

# Look for memory-related warnings
FATAL ERROR: Reached heap limit
Allocation failed - JavaScript heap out of memory

# Increase memory limit in manifest
# Or optimize application code

Connection Issues:

# Search for connection errors
Error: connect ECONNREFUSED 10.0.2.15:5432
# Check STRONGLY_SERVICES configuration
# Verify service is running

Performance Optimization

Identifying Bottlenecks

  1. High CPU:

    • Check slow endpoints
    • Optimize algorithms
    • Add caching
    • Scale horizontally
  2. High Memory:

    • Check for memory leaks
    • Optimize data structures
    • Implement pagination
    • Increase memory limit
  3. Slow Response:

    • Add database indexes
    • Implement caching
    • Optimize queries
    • Use connection pooling
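Of the remedies above, caching is the easiest to demonstrate in isolation. A minimal sketch using the standard library's `lru_cache`; `expensive_lookup` is a hypothetical stand-in for a slow query or computation.

```python
# Memoize a slow lookup: repeated calls with the same argument are
# served from an in-process LRU cache instead of recomputed.
from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_lookup(user_id):
    # Imagine a slow database query here.
    return {'id': user_id, 'name': f'user-{user_id}'}

expensive_lookup(42)  # computed (cache miss)
expensive_lookup(42)  # served from cache (cache hit)
print(expensive_lookup.cache_info().hits)  # 1
```

Note that an in-process cache is per instance; with multiple replicas, a shared cache (e.g. Redis) keeps hit rates consistent across instances.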

Monitoring Checklist

  • ✅ Health check endpoint implemented
  • ✅ Structured logging in place
  • ✅ Custom metrics exposed
  • ✅ Resource limits configured appropriately
  • ✅ Alerts set up for critical metrics
  • ✅ Log retention policy defined
  • ✅ Regular log review process

Best Practices

  1. Log Everything Important: Request IDs, user actions, errors
  2. Monitor Proactively: Set up alerts before issues occur
  3. Review Metrics Regularly: Weekly review of performance trends
  4. Optimize Based on Data: Use metrics to guide optimization efforts
  5. Test Health Checks: Ensure health endpoints work correctly
  6. Rotate Logs: Implement log rotation to manage disk space
  7. Secure Metrics: Don't expose sensitive data in metrics
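For item 6, log rotation can be handled by the platform or by the application itself. One application-side sketch using the Python standard library's `RotatingFileHandler` (the path and size limits here are illustrative):

```python
# Rotate log files by size: cap each file and keep a fixed number
# of rotated backups so logs can't fill the disk.
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    'app.log',
    maxBytes=10 * 1024 * 1024,  # start a new file after ~10 MB
    backupCount=5,              # keep at most 5 rotated files
)
handler.setFormatter(logging.Formatter(
    '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
))

logger = logging.getLogger('app')
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info('Log rotation configured')
```

In containerized deployments it is often simpler to log to stdout (as the examples above do) and let the platform handle retention.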

Next Steps