When an alert fires at 3 AM, the on-call engineer's first task isn't fixing the problem—it's figuring out what the problem actually is. They open dashboards, check recent deployments, scan logs, and correlate signals across services.
This investigation phase often takes longer than the fix itself. AI can automate most of this work, delivering a preliminary diagnosis by the time the engineer opens their laptop.
This guide walks through setting up automated incident triage using IncidentFox.
## What We're Building
By the end of this guide, you'll have:
- AI-powered alert enrichment that gathers context automatically
- Automated root cause analysis for incoming incidents
- Slack notifications with investigation summaries
- Integration with your existing PagerDuty alerts
## Prerequisites
Before starting, you'll need:
- A Kubernetes cluster (for running IncidentFox)
- PagerDuty account with API access
- Slack workspace with permission to add apps
- Access to your monitoring stack (Prometheus, Datadog, or similar)
- Basic familiarity with kubectl and Helm
## Deploy IncidentFox

First, add the IncidentFox Helm repository:

```bash
helm repo add incidentfox https://charts.incidentfox.ai
helm repo update
```
Create a values file for your configuration:
```yaml
# values.yaml
config:
  llm:
    provider: openai  # or anthropic, azure, local
    apiKey: ${OPENAI_API_KEY}

integrations:
  pagerduty:
    enabled: true
    apiKey: ${PAGERDUTY_API_KEY}
  slack:
    enabled: true
    botToken: ${SLACK_BOT_TOKEN}
    appToken: ${SLACK_APP_TOKEN}
  prometheus:
    enabled: true
    url: http://prometheus:9090

storage:
  type: postgresql
  connectionString: ${DATABASE_URL}
```
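Note that a plain `helm install -f values.yaml` will not expand `${OPENAI_API_KEY}`-style placeholders; Helm does no environment substitution in values files, so you need a render step first (for example `envsubst`, or a small script). A minimal sketch of that render step using Python's standard library — `render_values` is an illustrative helper, not an IncidentFox tool:

```python
from string import Template

def render_values(template_text: str, env: dict) -> str:
    """Substitute ${VAR} placeholders in a values file with environment values."""
    return Template(template_text).substitute(env)

# Render a one-line fragment of the values file shown above.
template = "apiKey: ${OPENAI_API_KEY}"
print(render_values(template, {"OPENAI_API_KEY": "sk-example"}))  # apiKey: sk-example
```

You would write the rendered output to a temporary file (or pipe it to `helm install -f -`) so the secrets never land in version control.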
Deploy with Helm:
```bash
helm install incidentfox incidentfox/incidentfox \
  --namespace incidentfox \
  --create-namespace \
  -f values.yaml
```
Verify the deployment:
```bash
kubectl get pods -n incidentfox
```
## Configure PagerDuty Integration
IncidentFox needs to receive alerts from PagerDuty. Set up a webhook:
- In PagerDuty, go to Services → select your service → Integrations
- Click Add Integration → Generic Webhooks (v3)
- Set the webhook URL to your IncidentFox endpoint: `https://your-incidentfox-domain/api/webhooks/pagerduty`
- Select which events to send (at minimum: `incident.triggered`)
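To see roughly what arrives at that endpoint, here's a sketch of filtering PagerDuty v3 webhook events for `incident.triggered`. The payload shape follows PagerDuty's v3 webhook format (an `event` object with an `event_type` field); the handler function itself is illustrative, not part of IncidentFox:

```python
import json

def is_triggered_incident(payload: str) -> bool:
    """Return True when a PagerDuty v3 webhook body is an incident.triggered event."""
    event = json.loads(payload).get("event", {})
    return event.get("event_type") == "incident.triggered"

# Example v3-style webhook body (fields abbreviated for illustration).
body = json.dumps({
    "event": {
        "event_type": "incident.triggered",
        "data": {"id": "PXXXXXX", "title": "High error rate on checkout"},
    }
})
print(is_triggered_incident(body))  # True
```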
Test the integration by triggering a test alert. You should see it appear in IncidentFox logs:
```bash
kubectl logs -n incidentfox -l app=incidentfox -f
```
## Connect Your Monitoring Stack

IncidentFox needs access to your observability data to investigate incidents. Configure your data sources.
IncidentFox needs access to your observability data to investigate incidents. Configure your data sources.
### Prometheus

If you're using Prometheus, IncidentFox can query metrics directly:

```yaml
# In your values.yaml
integrations:
  prometheus:
    enabled: true
    url: http://prometheus.monitoring:9090
```
### Datadog

For Datadog, provide API credentials:

```yaml
integrations:
  datadog:
    enabled: true
    apiKey: ${DATADOG_API_KEY}
    appKey: ${DATADOG_APP_KEY}
    site: datadoghq.com  # or datadoghq.eu
```
### Logs

Connect your log aggregator:

```yaml
integrations:
  logs:
    provider: loki  # or elasticsearch, cloudwatch
    url: http://loki.monitoring:3100
```
After updating your values, upgrade the deployment:
```bash
helm upgrade incidentfox incidentfox/incidentfox \
  --namespace incidentfox \
  -f values.yaml
```
## Set Up Slack Notifications
Configure IncidentFox to post investigation summaries to Slack.
### Create a Slack App

- Go to api.slack.com/apps → Create New App
- Choose From scratch, name it "IncidentFox"
- Under OAuth & Permissions, add these scopes: `chat:write`, `channels:read`, `groups:read`
- Install the app to your workspace
- Copy the Bot User OAuth Token
### Configure the Channel

In your values.yaml:

```yaml
integrations:
  slack:
    enabled: true
    botToken: xoxb-your-bot-token
    defaultChannel: "#incidents"  # Where to post summaries
```
### Test the Integration
Trigger a test alert and verify IncidentFox posts to Slack. The message should include:
- Alert summary
- Affected service
- Probable root cause (if identified)
- Links to relevant dashboards
## Add Your Team's Knowledge

IncidentFox becomes more useful when it understands your specific environment. Add context through runbooks and historical incidents.
### Import Runbooks

If you have existing runbooks (Markdown, Confluence, Notion), import them:

```bash
incidentfox kb import --source ./runbooks/
```
Or connect directly to Confluence:
```yaml
knowledge:
  confluence:
    enabled: true
    url: https://your-company.atlassian.net
    email: ${CONFLUENCE_EMAIL}
    apiToken: ${CONFLUENCE_API_TOKEN}
    spaces: ["SRE", "RUNBOOKS"]
```
### Import Historical Incidents

Past incidents help IncidentFox recognize patterns:

```bash
incidentfox incidents import --source pagerduty --since 2025-01-01
```
This imports incident data including:
- Alert details
- Timeline
- Resolution notes
- Post-incident reports
The more history you provide, the better IncidentFox can identify similar issues and suggest proven fixes.
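One simple way to picture "identifying similar issues" is as text similarity between the new alert and past incident titles. The toy sketch below uses Jaccard similarity over word sets; IncidentFox's actual matching method isn't specified here, so treat this purely as an illustration of the idea:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two incident titles."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def most_similar(new_title: str, history: list[str]) -> str:
    """Pick the past incident whose title best matches the new alert."""
    return max(history, key=lambda past: jaccard(new_title, past))

history = [
    "payment service latency spike after deploy",
    "disk full on postgres primary",
]
print(most_similar("latency spike in payment service", history))
# payment service latency spike after deploy
```

The more incidents in `history`, the more likely a genuinely related past case surfaces — which is the intuition behind importing as much history as you can.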
## Configure Triage Rules
Customize how IncidentFox handles different alert types.
Create a triage configuration:
```yaml
# triage-rules.yaml
rules:
  - name: high-severity-immediate
    match:
      severity: [critical, high]
    actions:
      - investigate: full
      - notify:
          channel: "#incidents-critical"
          mention: "@oncall"

  - name: database-alerts
    match:
      service: ["postgres", "redis", "mysql"]
    actions:
      - investigate: full
      - runbook: database-troubleshooting
      - notify:
          channel: "#dba-alerts"

  - name: low-severity-batch
    match:
      severity: [low, warning]
    actions:
      - investigate: basic
      - batch:
          window: 15m
          channel: "#alerts-digest"
```
Apply the configuration:
```bash
kubectl create configmap triage-rules \
  --from-file=triage-rules.yaml \
  -n incidentfox

kubectl rollout restart deployment/incidentfox -n incidentfox
```
## Test the Full Flow
Now test the complete automated triage flow:
- Trigger a test alert in PagerDuty (or wait for a real one)
- Watch the logs to see IncidentFox receive and process it:

  ```bash
  kubectl logs -n incidentfox -l app=incidentfox -f
  ```

- Check Slack for the investigation summary
- Review the analysis in the IncidentFox UI: `https://your-incidentfox-domain/incidents`
A successful triage should show:
- Automatic data collection from your monitoring tools
- Correlation with recent changes
- Similar past incidents (if any)
- Probable root cause with confidence score
- Suggested remediation steps
## Tune and Iterate
After running for a few incidents:
### Review Accuracy
Check how accurate the AI's root cause analysis is. If it's often wrong about certain alert types, you may need:
- Better runbook documentation for those scenarios
- More historical incident data
- Adjusted correlation rules
### Reduce Noise
If IncidentFox is over-alerting or creating too many low-value notifications:
- Adjust triage rules to batch low-severity alerts
- Tune the similarity threshold for "related alerts"
- Add suppression rules for known flaky alerts
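The batching idea from the `low-severity-batch` rule can be pictured as grouping alerts into fixed time windows and posting one digest per window. A minimal sketch (timestamps in seconds; the grouping scheme is illustrative, not IncidentFox's actual implementation):

```python
from collections import defaultdict

def batch_alerts(alerts: list[tuple[float, str]], window_seconds: float = 900) -> dict:
    """Group (timestamp, message) alerts into fixed 15-minute windows for digest posting."""
    batches = defaultdict(list)
    for ts, message in alerts:
        batches[int(ts // window_seconds)].append(message)  # window index = ts // width
    return dict(batches)

alerts = [(0, "disk 81%"), (120, "disk 82%"), (1000, "disk 83%")]
print(batch_alerts(alerts))  # {0: ['disk 81%', 'disk 82%'], 1: ['disk 83%']}
```

Three alerts become two digest posts instead of three pages — the essence of noise reduction by batching.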
### Expand Coverage
Once confident in the setup:
- Add more services and alert sources
- Enable additional integrations (Grafana, Sentry, etc.)
- Consider enabling automated remediation for well-understood issues
## What Success Looks Like
After implementing automated triage, you should see:
- Faster initial response: Engineers wake up to a summary instead of starting from scratch.
- Reduced investigation time: The AI has already gathered context and tested common hypotheses.
- More consistent triage: Every incident gets the same thorough initial analysis, regardless of which engineer is on call.
- Better knowledge capture: Runbooks and past incidents become actively useful rather than forgotten documentation.
## Troubleshooting

### IncidentFox isn't receiving alerts
- Check the PagerDuty webhook configuration
- Verify network connectivity from PagerDuty to your IncidentFox endpoint
- Check the ingress/load balancer logs
### Investigation quality is poor
- Ensure monitoring integrations are working (test Prometheus queries)
- Import more historical incidents
- Add domain-specific runbooks
### Slack notifications aren't appearing
- Verify the bot token has correct permissions
- Check the bot is invited to the target channel
- Review IncidentFox logs for Slack API errors
## Next Steps
Once basic triage is working:
- Add more data sources: Connect additional monitoring tools, log systems, and deployment pipelines
- Build team-specific agents: Create specialized AI agents for different teams (database, payments, infrastructure) with domain-specific knowledge
- Enable advanced features: Dependency mapping, predictive alerting, automated remediation
- Measure impact: Track MTTR, investigation time, and engineer satisfaction to quantify the value
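MTTR here is just the mean of (resolution time − trigger time) across incidents, which you can compute from whatever incident export you have. A small sketch (the incident tuples are example data):

```python
from datetime import datetime, timedelta

def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolution: average of (resolved - triggered) across incidents."""
    total = sum((resolved - triggered for triggered, resolved in incidents), timedelta())
    return total / len(incidents)

incidents = [
    (datetime(2025, 3, 1, 3, 0), datetime(2025, 3, 1, 3, 40)),   # 40 minutes
    (datetime(2025, 3, 2, 14, 0), datetime(2025, 3, 2, 14, 20)),  # 20 minutes
]
print(mttr(incidents))  # 0:30:00
```

Track this before and after enabling automated triage to put a number on the improvement.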
## Conclusion
Automated incident triage reduces the cognitive load on on-call engineers by handling the initial investigation work. Instead of manually correlating signals across tools, engineers receive a preliminary diagnosis with supporting evidence.
The setup requires some upfront work—integrations, knowledge import, rule configuration—but pays off in faster incident response and more consistent triage quality.
Start with a subset of your alerts, validate the analysis quality, and expand from there.