Grafana Stack Integration

Clean Stack integrates with the Grafana observability stack (Grafana, Prometheus, Loki, and Tempo) to provide a complete observability solution.

Stack Overview

Grafana provides the visualization layer; Prometheus stores metrics, Loki stores logs, and Tempo stores traces. Together they cover all three observability signals for Clean Stack services.

Quick Start

  1. Start the observability stack:

     bun run platform:observability

  2. Access Grafana in your browser (Grafana listens on port 3000 by default, so typically http://localhost:3000).

Pre-configured Dashboards

1. Service Overview

  • Request rates and latencies
  • Error rates
  • Resource usage
  • Cache performance

2. Distributed Tracing

  • End-to-end request flows
  • Service dependencies
  • Performance bottlenecks
  • Error analysis

3. Log Analytics

  • Structured log search
  • Log correlation
  • Pattern analysis
  • Alert configuration

Custom Dashboard Creation

1. Metrics Dashboard

// Define a custom histogram for request timing
const requestDuration = metrics.createHistogram('http_request_duration', {
  description: 'HTTP request duration',
  unit: 'ms',
  boundaries: [10, 50, 100, 200, 500, 1000]
});

// Record the duration of every request, even when the handler throws
app.use(async (ctx, next) => {
  const startTime = Date.now();
  try {
    await next();
  } finally {
    const duration = Date.now() - startTime;
    requestDuration.record(duration, {
      path: ctx.path,
      method: ctx.method,
      status: ctx.status
    });
  }
});

Then in Grafana:

  1. Add a new panel
  2. Query: sum(rate(http_request_duration_bucket[5m])) by (le)
  3. Visualization: Heatmap

2. Trace Analysis

// Add custom attributes to spans; end the span so it gets exported
const span = tracer.startSpan('process-order');
try {
  span.setAttribute('order.id', orderId);
  span.setAttribute('customer.type', customerType);
  span.setAttribute('order.value', orderValue);
  // ... order processing ...
} finally {
  span.end();
}

In Grafana Tempo:

  1. Search by attribute
  2. Create Service Graph
  3. Analyze Flame Graph
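
The attributes set above become searchable. A sketch of a TraceQL attribute search (assuming Tempo 2.x with TraceQL enabled; the values are illustrative):

{ name = "process-order" && span.customer.type = "premium" && span.order.value > 1000 }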

3. Log Queries

// Structured logging: fields become queryable labels after `| json`
logger.info('Order processed', {
  orderId: 'order-123',
  processingTime: 150,
  customerTier: 'premium'
});

In Grafana Loki:

{service="order-service"}
  | json
  | processingTime > 100
  | customerTier = "premium"
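
The same pipeline also works in LogQL metric queries. For example, the per-second rate of slow orders over the last 5 minutes, useful for dashboards or log-based alerts (field names match the log entry above):

sum(rate({service="order-service"} | json | processingTime > 100 [5m]))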

Alert Configuration

1. High Latency Alert

# In Grafana UI:
alert:
  name: High Service Latency
  condition: histogram_quantile(0.95, sum(rate(http_request_duration_bucket[5m])) by (le)) > 500
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Service experiencing high latency (p95 above 500ms)

2. Error Rate Alert

alert:
  name: High Error Rate
  condition: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
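
Per the alert-management practices below, a runbook link can travel with the alert as an annotation. A sketch (runbook_url is a conventional annotation name; the URL is a placeholder):

annotations:
  summary: More than 5% of requests are failing
  runbook_url: https://example.com/runbooks/high-error-rate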

Data Retention

Default retention periods:

  • Metrics (Prometheus): 15 days
  • Logs (Loki): 7 days
  • Traces (Tempo): 3 days

Configure in docker-compose:

prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'

loki:
  config:
    table_manager:
      retention_period: 168h

tempo:
  retention_period: 72h

Best Practices

  1. Dashboard Organization

    • Use folders for different teams/services
    • Standardize naming conventions
    • Include documentation panels
  2. Query Optimization

    • Use recording rules for complex queries (see the sketch after this list)
    • Limit high-cardinality labels
    • Set appropriate time ranges
  3. Alert Management

    • Define clear severity levels
    • Include runbooks in alerts
    • Configure proper notification channels
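
As an example of the recording-rule point above, a Prometheus rule file might precompute p95 latency from the histogram defined earlier (the group and rule names are illustrative):

groups:
  - name: clean-stack-http
    rules:
      - record: path:http_request_duration:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_bucket[5m])) by (le, path))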

Troubleshooting

Common Issues

  1. Missing Data

    • Check collector connectivity (see the query below)
    • Verify port configurations
    • Ensure correct label matching
  2. Dashboard Performance

    • Optimize time ranges
    • Use appropriate refresh intervals
    • Minimize panel count
  3. Alert Issues

    • Validate alert conditions
    • Check notification settings
    • Review alert history
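
For the missing-data case, Prometheus's built-in up metric is a quick first check: it is 1 for every target Prometheus can scrape, so this Explore query lists unreachable targets:

up == 0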

Security Considerations

  1. Access Control

    grafana:
      environment:
        GF_AUTH_DISABLE_LOGIN_FORM: "false"
        GF_AUTH_ANONYMOUS_ENABLED: "false"
  2. Network Security

    • Use TLS for data transmission (see the sketch below)
    • Implement proper authentication
    • Restrict network access
  3. Data Protection

    • Configure data retention
    • Implement log sanitization
    • Manage sensitive labels
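
A minimal sketch of enabling TLS on the Grafana container itself (certificate paths are placeholders; GF_* variables map to grafana.ini server settings):

grafana:
  environment:
    GF_SERVER_PROTOCOL: "https"
    GF_SERVER_CERT_FILE: "/etc/grafana/certs/grafana.crt"
    GF_SERVER_CERT_KEY: "/etc/grafana/certs/grafana.key"
  volumes:
    - ./certs:/etc/grafana/certs:ro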