# Grafana Stack Integration
Clean Stack integrates with the Grafana observability stack (Grafana, Prometheus, Loki, and Tempo) to provide a complete observability solution.
## Stack Overview

- **Grafana**: dashboards and visualization
- **Prometheus**: metrics storage and querying
- **Loki**: log aggregation
- **Tempo**: distributed trace storage
## Quick Start

- Start the observability stack:

  ```bash
  bun run platform:observability
  ```

- Access Grafana:
  - URL: http://localhost:3000
  - Default credentials:
    - Username: `admin`
    - Password: `admin`
## Pre-configured Dashboards

1. **Service Overview**
   - Request rates and latencies
   - Error rates
   - Resource usage
   - Cache performance
2. **Distributed Tracing**
   - End-to-end request flows
   - Service dependencies
   - Performance bottlenecks
   - Error analysis
3. **Log Analytics**
   - Structured log search
   - Log correlation
   - Pattern analysis
   - Alert configuration
## Custom Dashboard Creation

### 1. Metrics Dashboard

```typescript
// Define a custom histogram for HTTP request durations
const requestDuration = metrics.createHistogram('http_request_duration', {
  description: 'HTTP request duration',
  unit: 'ms',
  boundaries: [10, 50, 100, 200, 500, 1000]
});
```
```typescript
// Use it in your middleware: the finally block ensures the
// duration is recorded even when the request fails
app.use(async (ctx, next) => {
  const startTime = Date.now();
  try {
    await next();
  } finally {
    const duration = Date.now() - startTime;
    requestDuration.record(duration, {
      path: ctx.path,
      method: ctx.method,
      status: ctx.status
    });
  }
});
```
Then in Grafana:

- Add a new panel
- Query:

  ```promql
  rate(http_request_duration_bucket[5m])
  ```

- Visualization: Heatmap
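For a latency panel next to the heatmap, a percentile query over the same histogram works well. A minimal sketch, assuming the exported buckets keep the `http_request_duration_bucket` name used above:

```promql
# 95th-percentile request duration per path over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_bucket[5m])) by (le, path))
```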
### 2. Trace Analysis

```typescript
// Add custom attributes to spans so traces can be searched
// and grouped by business context
const span = tracer.startSpan('process-order');
span.setAttribute('order.id', orderId);
span.setAttribute('customer.type', customerType);
span.setAttribute('order.value', orderValue);
```
In Grafana Tempo:

- Search by attribute
- Create a Service Graph
- Analyze the Flame Graph
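If the bundled Tempo version supports TraceQL (an assumption about the stack; TraceQL ships with Tempo 2.x), the attributes set above can be queried directly. The values here are illustrative:

```traceql
{ span.order.id = "order-123" && span.customer.type = "premium" }
```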
### 3. Log Queries

```typescript
// Structured logging: emit machine-parseable fields
// that Loki's json parser can filter on
logger.info('Order processed', {
  orderId: 'order-123',
  processingTime: 150,
  customerTier: 'premium'
});
```
In Grafana Loki:

```logql
{service="order-service"}
  | json
  | processingTime > 100
  | customerTier = "premium"
```
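The same pipeline can feed a metrics panel. For example, a log-derived rate of premium-tier orders (a sketch using standard LogQL over the labels and fields above):

```logql
sum(rate({service="order-service"} | json | customerTier = "premium" [5m]))
```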
## Alert Configuration

### 1. High Latency Alert

```yaml
# Configured in the Grafana UI:
alert:
  name: High Service Latency
  condition: avg_over_time(http_request_duration_seconds[5m]) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Service experiencing high latency
```
### 2. Error Rate Alert

```yaml
alert:
  name: High Error Rate
  condition: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
```
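If you manage alerts as Prometheus rule files instead of through the Grafana UI (an alternative setup, not something the stack requires), the same error-rate alert could be written as:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: More than 5% of requests are failing
```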
## Data Retention

Default retention periods:

- Metrics (Prometheus): 15 days
- Logs (Loki): 7 days
- Traces (Tempo): 3 days
Prometheus retention is a command-line flag, so it can be overridden directly in docker-compose:

```yaml
prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'
```

Loki and Tempo read retention from their own config files rather than from docker-compose:

```yaml
# Loki config file
table_manager:
  retention_period: 168h
```

```yaml
# Tempo config file
compactor:
  compaction:
    block_retention: 72h
```
## Best Practices

1. **Dashboard Organization**
   - Use folders for different teams/services
   - Standardize naming conventions
   - Include documentation panels
2. **Query Optimization**
   - Use recording rules for complex queries (see the sketch after this list)
   - Limit high-cardinality labels
   - Set appropriate time ranges
3. **Alert Management**
   - Define clear severity levels
   - Include runbooks in alerts
   - Configure proper notification channels
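A recording rule precomputes an expensive expression so dashboards query the cheap, pre-aggregated series instead. A minimal Prometheus sketch, assuming the request-duration histogram defined earlier; the rule name is illustrative:

```yaml
groups:
  - name: http-aggregations
    rules:
      # Precompute the per-path p95 so dashboards avoid running
      # histogram_quantile over raw buckets on every refresh
      - record: path:http_request_duration:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_bucket[5m])) by (le, path))
```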
## Troubleshooting

### Common Issues

1. **Missing Data**
   - Check collector connectivity (see the checks after this list)
   - Verify port configurations
   - Ensure correct label matching
2. **Dashboard Performance**
   - Optimize time ranges
   - Use appropriate refresh intervals
   - Minimize panel count
3. **Alert Issues**
   - Validate alert conditions
   - Check notification settings
   - Review alert history
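Two quick connectivity checks, assuming Prometheus and Loki are exposed on their default ports (9090 and 3100; adjust if the stack maps them differently):

```bash
# List Prometheus scrape targets and their health
curl -s http://localhost:9090/api/v1/targets

# Confirm Loki is up and ready to receive logs
curl -s http://localhost:3100/ready
```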
## Security Considerations

1. **Access Control**

   ```yaml
   grafana:
     env:
       GF_AUTH_DISABLE_LOGIN_FORM: "false"
       GF_AUTH_ANONYMOUS_ENABLED: "false"
   ```
2. **Network Security**
   - Use TLS for data transmission (see the sketch after this item)
   - Implement proper authentication
   - Restrict network access
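   A minimal TLS sketch using Grafana's standard `GF_SERVER_*` environment variables; the certificate paths are illustrative and must point at files mounted into the container:

   ```yaml
   grafana:
     env:
       GF_SERVER_PROTOCOL: "https"
       GF_SERVER_CERT_FILE: "/etc/grafana/certs/grafana.crt" # illustrative path
       GF_SERVER_CERT_KEY: "/etc/grafana/certs/grafana.key"  # illustrative path
   ```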
3. **Data Protection**
   - Configure data retention
   - Implement log sanitization (see the sketch below)
   - Manage sensitive labels
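A minimal log-sanitization sketch, assuming the structured logger shown earlier; the `SENSITIVE_KEYS` set and `sanitize` helper are illustrative, not part of Clean Stack:

```typescript
// Illustrative helper: redact known-sensitive fields from log
// metadata before it is written (and shipped to Loki)
const SENSITIVE_KEYS = new Set(['password', 'token', 'email']);

function sanitize(meta: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(meta).map(([key, value]): [string, unknown] =>
      SENSITIVE_KEYS.has(key) ? [key, '[REDACTED]'] : [key, value]
    )
  );
}

// Usage: sensitive values never reach the log store
logger.info('User signed in', sanitize({ userId: 'user-42', token: 'abc123' }));
```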