When an alert fires at 3 AM, the on-call engineer's first task isn't fixing the problem—it's figuring out what the problem actually is. They open dashboards, check recent deployments, scan logs, and correlate signals across services.
This investigation phase often takes longer than the fix itself. AI can automate most of this work, delivering a preliminary diagnosis by the time the engineer opens their laptop.
This guide walks through setting up automated incident triage using IncidentFox.
## What We're Building
By the end of this guide, you'll have:
- AI-powered alert enrichment that gathers context automatically
- Automated root cause analysis for incoming incidents
- Slack notifications with investigation summaries
- Integration with your existing PagerDuty alerts
## Prerequisites
Before starting, you'll need:
- A Kubernetes cluster (for running IncidentFox)
- PagerDuty account with API access
- Slack workspace with permission to add apps
- Access to your monitoring stack (Prometheus, Datadog, or similar)
- Basic familiarity with kubectl and Helm
## Deploy IncidentFox

First, add the IncidentFox Helm repository:

```bash
helm repo add incidentfox https://charts.incidentfox.ai
helm repo update
```
Create a values file for your configuration:
```yaml
# values.yaml
config:
  llm:
    provider: openai  # or anthropic, azure, local
    apiKey: ${OPENAI_API_KEY}

integrations:
  pagerduty:
    enabled: true
    apiKey: ${PAGERDUTY_API_KEY}
  slack:
    enabled: true
    botToken: ${SLACK_BOT_TOKEN}
    appToken: ${SLACK_APP_TOKEN}
  prometheus:
    enabled: true
    url: http://prometheus:9090

storage:
  type: postgresql
  connectionString: ${DATABASE_URL}
```
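Note that a plain `helm install -f values.yaml` will not expand `${OPENAI_API_KEY}`-style placeholders; Helm does no environment substitution in values files, so you need a render step first (for example `envsubst`, or a small script). A minimal sketch of that render step using Python's standard library — `render_values` is an illustrative helper, not an IncidentFox tool:

```python
from string import Template

def render_values(template_text: str, env: dict) -> str:
    """Substitute ${VAR} placeholders in a values file with environment values."""
    return Template(template_text).substitute(env)

# Render a one-line fragment of the values file shown above.
template = "apiKey: ${OPENAI_API_KEY}"
print(render_values(template, {"OPENAI_API_KEY": "sk-example"}))  # apiKey: sk-example
```

You would write the rendered output to a temporary file (or pipe it to `helm install -f -`) so the secrets never land in version control.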
Deploy with Helm:
```bash
helm install incidentfox incidentfox/incidentfox \
  --namespace incidentfox \
  --create-namespace \
  -f values.yaml
```
Verify the deployment:
```bash
kubectl get pods -n incidentfox
```
## Configure PagerDuty Integration
IncidentFox needs to receive alerts from PagerDuty. Set up a webhook:
- In PagerDuty, go to Services → select your service → Integrations
- Click Add Integration → Generic Webhooks (v3)
- Set the webhook URL to your IncidentFox endpoint: `https://your-incidentfox-domain/api/webhooks/pagerduty`
- Select which events to send (at minimum: `incident.triggered`)
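To see roughly what arrives at that endpoint, here's a sketch of filtering PagerDuty v3 webhook events for `incident.triggered`. The payload shape follows PagerDuty's v3 webhook format (an `event` object with an `event_type` field); the handler function itself is illustrative, not part of IncidentFox:

```python
import json

def is_triggered_incident(payload: str) -> bool:
    """Return True when a PagerDuty v3 webhook body is an incident.triggered event."""
    event = json.loads(payload).get("event", {})
    return event.get("event_type") == "incident.triggered"

# Example v3-style webhook body (fields abbreviated for illustration).
body = json.dumps({
    "event": {
        "event_type": "incident.triggered",
        "data": {"id": "PXXXXXX", "title": "High error rate on checkout"},
    }
})
print(is_triggered_incident(body))  # True
```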
Test the integration by triggering a test alert. You should see it appear in IncidentFox logs:
```bash
kubectl logs -n incidentfox -l app=incidentfox -f
```
## Connect Your Monitoring Stack

IncidentFox needs access to your observability data to investigate incidents. Configure your data sources.
IncidentFox needs access to your observability data to investigate incidents. Configure your data sources.
### Prometheus

If you're using Prometheus, IncidentFox can query metrics directly:

```yaml
# In your values.yaml
integrations:
  prometheus:
    enabled: true
    url: http://prometheus.monitoring:9090
```
### Datadog

For Datadog, provide API credentials:

```yaml
integrations:
  datadog:
    enabled: true
    apiKey: ${DATADOG_API_KEY}
    appKey: ${DATADOG_APP_KEY}
    site: datadoghq.com  # or datadoghq.eu
```
### Logs

Connect your log aggregator:

```yaml
integrations:
  logs:
    provider: loki  # or elasticsearch, cloudwatch
    url: http://loki.monitoring:3100
```
After updating your values, upgrade the deployment:
```bash
helm upgrade incidentfox incidentfox/incidentfox \
  --namespace incidentfox \
  -f values.yaml
```
## Set Up Slack Notifications
Configure IncidentFox to post investigation summaries to Slack.
### Create a Slack App

- Go to api.slack.com/apps → Create New App
- Choose From scratch, name it "IncidentFox"
- Under OAuth & Permissions, add these scopes: `chat:write`, `channels:read`, `groups:read`
- Install the app to your workspace
- Copy the Bot User OAuth Token
### Configure the Channel

In your values.yaml:

```yaml
integrations:
  slack:
    enabled: true
    botToken: xoxb-your-bot-token
    defaultChannel: "#incidents"  # Where to post summaries
```
### Test the Integration
Trigger a test alert and verify IncidentFox posts to Slack. The message should include:
- Alert summary
- Affected service
- Probable root cause (if identified)
- Links to relevant dashboards
## Add Your Team's Knowledge

IncidentFox becomes more useful when it understands your specific environment. Add context through runbooks and historical incidents.
### Import Runbooks

If you have existing runbooks (Markdown, Confluence, Notion), import them:

```bash
incidentfox kb import --source ./runbooks/
```
Or connect directly to Confluence:
```yaml
knowledge:
  confluence:
    enabled: true
    url: https://your-company.atlassian.net
    email: ${CONFLUENCE_EMAIL}
    apiToken: ${CONFLUENCE_API_TOKEN}
    spaces: ["SRE", "RUNBOOKS"]
```
### Import Historical Incidents

Past incidents help IncidentFox recognize patterns:

```bash
incidentfox incidents import --source pagerduty --since 2025-01-01
```
This imports incident data including:
- Alert details
- Timeline
- Resolution notes
- Post-incident reports
The more history you provide, the better IncidentFox can identify similar issues and suggest proven fixes.
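One simple way to picture "identifying similar issues" is as text similarity between the new alert and past incident titles. The toy sketch below uses Jaccard similarity over word sets; IncidentFox's actual matching method isn't specified here, so treat this purely as an illustration of the idea:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two incident titles."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def most_similar(new_title: str, history: list[str]) -> str:
    """Pick the past incident whose title best matches the new alert."""
    return max(history, key=lambda past: jaccard(new_title, past))

history = [
    "payment service latency spike after deploy",
    "disk full on postgres primary",
]
print(most_similar("latency spike in payment service", history))
# payment service latency spike after deploy
```

The more incidents in `history`, the more likely a genuinely related past case surfaces — which is the intuition behind importing as much history as you can.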
## Configure Triage Rules
Customize how IncidentFox handles different alert types.
Create a triage configuration:
```yaml
# triage-rules.yaml
rules:
  - name: high-severity-immediate
    match:
      severity: [critical, high]
    actions:
      - investigate: full
      - notify:
          channel: "#incidents-critical"
          mention: "@oncall"

  - name: database-alerts
    match:
      service: ["postgres", "redis", "mysql"]
    actions:
      - investigate: full
      - runbook: database-troubleshooting
      - notify:
          channel: "#dba-alerts"

  - name: low-severity-batch
    match:
      severity: [low, warning]
    actions:
      - investigate: basic
      - batch:
          window: 15m
          channel: "#alerts-digest"
```
Apply the configuration:
```bash
kubectl create configmap triage-rules \
  --from-file=triage-rules.yaml \
  -n incidentfox

kubectl rollout restart deployment/incidentfox -n incidentfox
```
## Test the Full Flow
Now test the complete automated triage flow:
- Trigger a test alert in PagerDuty (or wait for a real one)
- Watch the logs to see IncidentFox receive and process it:

  ```bash
  kubectl logs -n incidentfox -l app=incidentfox -f
  ```

- Check Slack for the investigation summary
- Review the analysis in the IncidentFox UI: `https://your-incidentfox-domain/incidents`
A successful triage should show:
- Automatic data collection from your monitoring tools
- Correlation with recent changes
- Similar past incidents (if any)
- Probable root cause with confidence score
- Suggested remediation steps
## Tune and Iterate
After running for a few incidents:
### Review Accuracy
Check how accurate the AI's root cause analysis is. If it's often wrong about certain alert types, you may need:
- Better runbook documentation for those scenarios
- More historical incident data
- Adjusted correlation rules
### Reduce Noise
If IncidentFox is over-alerting or creating too many low-value notifications:
- Adjust triage rules to batch low-severity alerts
- Tune the similarity threshold for "related alerts"
- Add suppression rules for known flaky alerts
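The batching idea from the `low-severity-batch` rule can be pictured as grouping alerts into fixed time windows and posting one digest per window. A minimal sketch (timestamps in seconds; the grouping scheme is illustrative, not IncidentFox's actual implementation):

```python
from collections import defaultdict

def batch_alerts(alerts: list[tuple[float, str]], window_seconds: float = 900) -> dict:
    """Group (timestamp, message) alerts into fixed 15-minute windows for digest posting."""
    batches = defaultdict(list)
    for ts, message in alerts:
        batches[int(ts // window_seconds)].append(message)  # window index = ts // width
    return dict(batches)

alerts = [(0, "disk 81%"), (120, "disk 82%"), (1000, "disk 83%")]
print(batch_alerts(alerts))  # {0: ['disk 81%', 'disk 82%'], 1: ['disk 83%']}
```

Three alerts become two digest posts instead of three pages — the essence of noise reduction by batching.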
### Expand Coverage
Once confident in the setup:
- Add more services and alert sources
- Enable additional integrations (Grafana, Sentry, etc.)
- Consider enabling automated remediation for well-understood issues
## What Success Looks Like
After implementing automated triage, you should see:
- Faster initial response: Engineers wake up to a summary instead of starting from scratch.
- Reduced investigation time: The AI has already gathered context and tested common hypotheses.
- More consistent triage: Every incident gets the same thorough initial analysis, regardless of which engineer is on call.
- Better knowledge capture: Runbooks and past incidents become actively useful rather than forgotten documentation.
## Troubleshooting

### IncidentFox isn't receiving alerts
- Check the PagerDuty webhook configuration
- Verify network connectivity from PagerDuty to your IncidentFox endpoint
- Check the ingress/load balancer logs
### Investigation quality is poor
- Ensure monitoring integrations are working (test Prometheus queries)
- Import more historical incidents
- Add domain-specific runbooks
### Slack notifications aren't appearing
- Verify the bot token has correct permissions
- Check the bot is invited to the target channel
- Review IncidentFox logs for Slack API errors
## Next Steps
Once basic triage is working:
- Add more data sources: Connect additional monitoring tools, log systems, and deployment pipelines
- Build team-specific agents: Create specialized AI agents for different teams (database, payments, infrastructure) with domain-specific knowledge
- Enable advanced features: Dependency mapping, predictive alerting, automated remediation
- Measure impact: Track MTTR, investigation time, and engineer satisfaction to quantify the value
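MTTR here is just the mean of (resolution time − trigger time) across incidents, which you can compute from whatever incident export you have. A small sketch (the incident tuples are example data):

```python
from datetime import datetime, timedelta

def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolution: average of (resolved - triggered) across incidents."""
    total = sum((resolved - triggered for triggered, resolved in incidents), timedelta())
    return total / len(incidents)

incidents = [
    (datetime(2025, 3, 1, 3, 0), datetime(2025, 3, 1, 3, 40)),   # 40 minutes
    (datetime(2025, 3, 2, 14, 0), datetime(2025, 3, 2, 14, 20)),  # 20 minutes
]
print(mttr(incidents))  # 0:30:00
```

Track this before and after enabling automated triage to put a number on the improvement.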
## Conclusion
Automated incident triage reduces the cognitive load on on-call engineers by handling the initial investigation work. Instead of manually correlating signals across tools, engineers receive a preliminary diagnosis with supporting evidence.
The setup requires some upfront work—integrations, knowledge import, rule configuration—but pays off in faster incident response and more consistent triage quality.
Start with a subset of your alerts, validate the analysis quality, and expand from there.