Logs, Metrics & Monitoring
Monitor your application's health, performance, and resource usage in real-time.
Overview
The Strongly platform provides comprehensive monitoring capabilities:
- Real-time Logs: Stream application logs in real-time
- Resource Metrics: CPU, memory, disk usage
- Health Checks: Kubernetes liveness and readiness probes
- Request Metrics: Request rates, response times, error rates
- Autoscaling Events: Track scaling decisions and triggers
Viewing Logs
Real-time Log Streaming
- Navigate to your app details page
- Click the Logs tab
- View real-time log output from all instances
Features:
- Live streaming (updates every 2-3 seconds)
- Multi-instance aggregation
- Searchable and filterable
- Downloadable for analysis
Log Levels
Applications should implement structured logging with levels:
Node.js Example:
const winston = require('winston');
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.json(),
  transports: [
    new winston.transports.Console({
      format: winston.format.simple()
    })
  ]
});
logger.error('Database connection failed', { error: err.message });
logger.warn('High memory usage detected', { usage: '85%' });
logger.info('User logged in', { userId: 123 });
logger.debug('Processing request', { requestId: 'abc123' });
Python Example:
import logging
import os
# Configure logging
logging.basicConfig(
    level=os.environ.get('LOG_LEVEL', 'INFO'),
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
logger.error(f'Database connection failed: {err}')
logger.warning('High memory usage: 85%')
logger.info(f'User logged in: {user_id}')
logger.debug(f'Processing request: {request_id}')
Best Practices for Logging
- Use Structured Logging: JSON format for easy parsing
- Include Context: Request IDs, user IDs, timestamps
- Appropriate Levels: DEBUG for development, INFO for production
- Avoid Sensitive Data: Never log passwords, tokens, or PII
- Log Errors with Stack Traces: Include full error details
Good Logging:
logger.info('User login successful', {
  userId: user.id,
  email: user.email,
  ip: req.ip,
  timestamp: new Date().toISOString()
});
logger.error('Payment processing failed', {
  orderId: order.id,
  amount: order.amount,
  error: err.message,
  stack: err.stack
});
Bad Logging:
console.log('User logged in'); // No context
console.log(user); // Too much data, potential PII
console.log('Error: ' + err); // No stack trace
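To help with the Avoid Sensitive Data practice above, one option in Node.js is a custom winston format that redacts known sensitive fields before entries are written. This is a minimal sketch, assuming a winston logger like the one shown earlier; the field list is illustrative:
const winston = require('winston');

// Fields that must never reach the log output (illustrative list)
const SENSITIVE_FIELDS = ['password', 'token', 'authorization', 'creditCard'];

// Custom winston format that replaces top-level sensitive values
const redact = winston.format((info) => {
  for (const field of SENSITIVE_FIELDS) {
    if (field in info) {
      info[field] = '[REDACTED]';
    }
  }
  return info;
});

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(redact(), winston.format.json()),
  transports: [new winston.transports.Console()]
});

// The token value is replaced before it is written
logger.info('Session created', { userId: 123, token: 'abc-secret' });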
Resource Metrics
CPU Usage
Monitor CPU consumption across instances:
- Current Usage: Real-time CPU percentage
- Average Usage: Average over last 5 minutes
- Peak Usage: Highest CPU usage in time window
- Throttling: When CPU limit is reached
Metrics:
- cpu_usage_percent: Percentage of allocated CPU
- cpu_usage_cores: Absolute CPU cores used
- cpu_throttled_seconds: Time spent throttled
Memory Usage
Track memory consumption and prevent OOM errors:
- Current Usage: Real-time memory consumption
- Average Usage: Average over last 5 minutes
- Peak Usage: Highest memory usage in time window
- OOM Events: Out-of-memory kills
Metrics:
- memory_usage_percent: Percentage of allocated memory
- memory_usage_bytes: Absolute memory used
- memory_oom_kills: Count of OOM events
Disk Usage
Monitor disk space consumption:
- Current Usage: Disk space used
- Available: Remaining disk space
- I/O Metrics: Read/write operations
Metrics:
- disk_usage_percent: Percentage of allocated disk
- disk_usage_bytes: Absolute disk space used
- disk_io_read_bytes: Bytes read from disk
- disk_io_write_bytes: Bytes written to disk
Network Metrics
Track network traffic:
- Inbound Traffic: Bytes received
- Outbound Traffic: Bytes sent
- Connection Count: Active connections
Metrics:
- network_rx_bytes: Bytes received
- network_tx_bytes: Bytes transmitted
- network_connections: Active connections
Health Checks
Kubernetes uses health checks to monitor application health:
Liveness Probe
Determines if the application is running. If it fails, Kubernetes restarts the container.
Configuration (from manifest):
health_check:
  path: /health
  port: 3000
  initial_delay: 10     # Wait before first check
  period: 30            # Check every 30 seconds
  timeout: 3            # Timeout after 3 seconds
  failure_threshold: 3  # Restart after 3 failures
Implementation:
// Express.js
app.get('/health', (req, res) => {
  res.status(200).json({
    status: 'ok',
    timestamp: new Date().toISOString()
  });
});
Readiness Probe
Determines if the application is ready to receive traffic. If it fails, Kubernetes stops sending requests.
Use for:
- Database connection checks
- External dependency checks
- Startup tasks completion
Implementation:
// Express.js
app.get('/ready', async (req, res) => {
  try {
    // Check database connection
    await db.ping();
    // Check external API responds successfully (fetch only rejects on network errors)
    const apiRes = await fetch('https://api.example.com/health');
    if (!apiRes.ok) {
      throw new Error(`External API returned ${apiRes.status}`);
    }
    res.status(200).json({
      status: 'ready',
      checks: {
        database: 'ok',
        externalApi: 'ok'
      }
    });
  } catch (err) {
    res.status(503).json({
      status: 'not ready',
      error: err.message
    });
  }
});
Health Check Status
View health check results:
- Healthy: All checks passing
- Unhealthy: Some checks failing
- Unknown: No data or probes not configured
Request Metrics
Request Rate
Track incoming request rate:
- Requests per Second: Current request rate
- Requests per Minute: Aggregated over 1 minute
- Request Count: Total requests over time period
Response Time
Monitor application performance:
- Average Response Time: Mean response time
- P50 (Median): 50% of requests complete faster than this
- P95: 95% of requests complete faster than this
- P99: 99% of requests complete faster than this
Example Metrics:
Average: 45ms
P50: 35ms
P95: 120ms
P99: 250ms
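To make the percentile definitions concrete, the sketch below computes P50/P95/P99 from a set of recorded response times using the nearest-rank method; the sample values are made up:
// Nearest-rank percentile over an array of response times (ms)
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[index];
}

// Hypothetical response times collected over a time window
const responseTimes = [12, 28, 31, 33, 35, 38, 40, 45, 120, 250];

console.log('P50:', percentile(responseTimes, 50), 'ms'); // 50% of requests finished within this time
console.log('P95:', percentile(responseTimes, 95), 'ms');
console.log('P99:', percentile(responseTimes, 99), 'ms');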
Error Rate
Track failed requests:
- Error Count: Total errors in time period
- Error Rate: Percentage of failed requests
- Error Types: Breakdown by status code (4xx, 5xx)
Metrics:
- http_requests_total: Total HTTP requests
- http_requests_errors: Failed HTTP requests
- http_request_duration_seconds: Response time histogram
- http_requests_by_status: Requests grouped by status code
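As a rough illustration of how an error rate is derived, the following sketch keeps simple in-process counters in an Express middleware and reports the percentage of failed requests. The Prometheus metrics described under Custom Metrics below are the more durable approach:
// Minimal in-process request and error counters (illustrative only)
let totalRequests = 0;
let errorRequests = 0;

app.use((req, res, next) => {
  res.on('finish', () => {
    totalRequests += 1;
    if (res.statusCode >= 400) {
      errorRequests += 1; // treat 4xx and 5xx responses as errors
    }
  });
  next();
});

// Error rate as a percentage of all requests seen so far
function currentErrorRate() {
  return totalRequests === 0 ? 0 : (errorRequests / totalRequests) * 100;
}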
Monitoring Dashboard
App Details Page
View comprehensive monitoring data:
- Overview Tab:
  - Current status (Running, Stopped, Failed)
  - Instance count and health
  - Quick metrics summary
- Metrics Tab:
  - CPU usage chart
  - Memory usage chart
  - Network traffic chart
  - Disk usage chart
- Logs Tab:
  - Real-time log streaming
  - Search and filter
  - Download logs
- Scaling Tab (if autoscaling enabled):
  - Current vs desired replicas
  - Scaling events history
  - CPU/memory thresholds
  - Scaling triggers
Custom Metrics
Prometheus Metrics
Expose custom metrics for Prometheus scraping:
Node.js Example:
const promClient = require('prom-client');

// Create custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status']
});

const activeUsers = new promClient.Gauge({
  name: 'active_users_total',
  help: 'Number of active users'
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

// Instrument requests
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });
  next();
});
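prom-client can also export default Node.js runtime metrics (heap usage, CPU, event loop lag) alongside the custom metrics above:
// Export default Node.js process metrics on the same /metrics endpoint
promClient.collectDefaultMetrics();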
Python Example:
import time
from flask import request, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST

# Assumes an existing Flask app instance named `app`

# Create custom metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)
active_users = Gauge(
    'active_users_total',
    'Number of active users'
)

# Expose metrics endpoint
@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

# Instrument requests
@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - request.start_time
    http_requests_total.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()
    http_request_duration.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(duration)
    return response
Alerting
Set up alerts for critical events:
Alert Types
- Resource Alerts:
  - High CPU usage (> 80%)
  - High memory usage (> 85%)
  - Disk space low (< 10% free)
- Application Alerts:
  - High error rate (> 5%)
  - Slow response time (P95 > 1s)
  - Health check failures
- Scaling Alerts:
  - Scaled to max replicas
  - Frequent scaling events
  - Scaling failures
Alert Configuration
Configure alerts in app settings:
alerts:
  - name: high_cpu_usage
    metric: cpu_usage_percent
    threshold: 80
    duration: 5m
    severity: warning
  - name: high_error_rate
    metric: http_error_rate
    threshold: 5
    duration: 2m
    severity: critical
  - name: health_check_failed
    metric: health_check_failures
    threshold: 3
    duration: 1m
    severity: critical
Troubleshooting with Logs
Common Patterns
Application Crashes:
# Search for error logs
Error: Cannot read property 'id' of undefined
at /app/server.js:45:23
# Check stack trace for root cause
# Fix code and redeploy
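For this particular crash, the usual fix before redeploying is to guard the property access; a minimal sketch (variable names are hypothetical):
// Before: throws when user is undefined
// const id = user.id;

// After: optional chaining returns undefined instead of throwing
const id = user?.id;
if (!id) {
  logger.warn('Request received without a valid user');
}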
High Memory Usage:
# Look for memory-related warnings
FATAL ERROR: Reached heap limit
Allocation failed - JavaScript heap out of memory
# Increase memory limit in manifest
# Or optimize application code
Connection Issues:
# Search for connection errors
Error: connect ECONNREFUSED 10.0.2.15:5432
# Check STRONGLY_SERVICES configuration
# Verify service is running
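If the dependency is simply still starting up, retrying the connection with a short backoff usually avoids a crash loop. A minimal sketch; connectWithRetry and its parameters are illustrative, not a platform API:
// Retry an async connect function a few times before giving up (illustrative)
async function connectWithRetry(connect, retries = 5, delayMs = 2000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await connect();
    } catch (err) {
      logger.warn('Connection failed, retrying', { attempt, error: err.message });
      await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
    }
  }
  throw new Error('Could not connect after retries');
}

// Example: wait for the database before starting the server
// await connectWithRetry(() => db.connect());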
Performance Optimization
Identifying Bottlenecks
- High CPU:
  - Check slow endpoints
  - Optimize algorithms
  - Add caching
  - Scale horizontally
- High Memory:
  - Check for memory leaks
  - Optimize data structures
  - Implement pagination
  - Increase memory limit
- Slow Response:
  - Add database indexes
  - Implement caching
  - Optimize queries
  - Use connection pooling (see the pooling sketch below)
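As an example of the pooling point above, a minimal setup with node-postgres (assuming a PostgreSQL service; the DATABASE_URL variable, pool size, and route are illustrative):
const { Pool } = require('pg');

// Reuse a fixed pool of connections instead of opening one per request
const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // assumed connection string variable
  max: 10,                  // maximum concurrent connections
  idleTimeoutMillis: 30000  // close idle clients after 30 seconds
});

// Hypothetical route using the pool; clients are checked out and returned automatically
app.get('/users/:id', async (req, res) => {
  const result = await pool.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
  res.json(result.rows[0]);
});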
Monitoring Checklist
- ✅ Health check endpoint implemented
- ✅ Structured logging in place
- ✅ Custom metrics exposed
- ✅ Resource limits configured appropriately
- ✅ Alerts set up for critical metrics
- ✅ Log retention policy defined
- ✅ Regular log review process
Best Practices
- Log Everything Important: Request IDs, user actions, errors
- Monitor Proactively: Set up alerts before issues occur
- Review Metrics Regularly: Weekly review of performance trends
- Optimize Based on Data: Use metrics to guide optimization efforts
- Test Health Checks: Ensure health endpoints work correctly
- Rotate Logs: Implement log rotation to manage disk space (see the sketch below)
- Secure Metrics: Don't expose sensitive data in metrics
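For the Rotate Logs practice, one common Node.js option is the winston-daily-rotate-file transport; a minimal sketch (file names and retention are illustrative, and apps that log only to stdout can rely on the platform's log handling instead):
const winston = require('winston');
require('winston-daily-rotate-file'); // adds the DailyRotateFile transport

// Rotate log files daily and keep two weeks of history
const rotatingFile = new winston.transports.DailyRotateFile({
  filename: 'app-%DATE%.log',
  datePattern: 'YYYY-MM-DD',
  maxSize: '20m',
  maxFiles: '14d'
});

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.json(),
  transports: [new winston.transports.Console(), rotatingFile]
});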