API Endpoints
This document describes the main ARF API endpoints and the request/response contracts used by the control plane.
POST /api/v1/v1/incidents/evaluate
Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.
This endpoint is advisory only. It does not apply remediation, mutate infrastructure, or execute any healing action.
Purpose
The endpoint takes a current incident snapshot, estimates risk, chooses a deterministic action, and explains the expected effect of that action on latency using a heuristic counterfactual model.
The implementation is intentionally simple:
- no fitted Structural Causal Model is used
- no machine learning model is required
- no historical training step is performed
- no action execution is triggered
Request schema
The request body must match the ReliabilityEvent model.
{
"component": "string",
"latency_p99": "number",
"error_rate": "number",
"service_mesh": "string",
"cpu_util": "number | null",
"memory_util": "number | null"
}
Fields
component
: Name of the service or component being evaluated.
latency_p99
: The current 99th percentile latency value. The endpoint uses this value for risk scoring, for the deterministic action threshold, and as the explained metric in the causal explanation.
error_rate
: The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.
service_mesh
: Optional service mesh name. Defaults to "default".
cpu_util
: Optional CPU utilization value. Present in the request model, but not used by the current decision logic.
memory_util
: Optional memory utilization value. Present in the request model, but not used by the current decision logic.
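The field list above can be mirrored in a few lines of Python. This is only a sketch: the real ReliabilityEvent is a Pydantic model whose definition is not reproduced in this document, so the class name `ReliabilityEventSketch` and the use of a standard-library dataclass are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ReliabilityEventSketch:
    """Illustrative stand-in for the ReliabilityEvent request model (not the real class)."""

    component: str
    latency_p99: float
    error_rate: float
    service_mesh: str = "default"         # optional, documented default
    cpu_util: Optional[float] = None      # accepted, but unused by the decision logic
    memory_util: Optional[float] = None   # accepted, but unused by the decision logic


# Only the three required fields are supplied; the rest fall back to defaults.
event = ReliabilityEventSketch(component="payment-service", latency_p99=450, error_rate=0.25)
payload = asdict(event)  # plain dict that can be serialized as the JSON request body
```

Unlike the Pydantic model, a dataclass performs no type validation, so this sketch documents the shape of the request rather than the server-side checks.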
Response schema
The endpoint returns a JSON object with three top-level sections.
{
"healing_intent": {
"action": "string",
"component": "string",
"parameters": {},
"justification": "string",
"confidence": 0.85,
"risk_score": 0.0,
"status": "oss_advisory_only"
},
"causal_explanation": {
"factual_outcome": 0.0,
"counterfactual_outcome": 0.0,
"effect": 0.0,
"explanation_text": "string",
"is_model_based": false,
"warnings": ["string"]
},
"utility_decision": {
"best_action": "string",
"expected_utility": 0.5,
"explanation": "string"
}
}
healing_intent
action
: The selected action. In the current implementation this is either restart_container or no_action.
component
: The input component name.
parameters
: Action parameters. The current implementation returns an empty object.
justification
: Human-readable explanation built from the causal explanation.
confidence
: Fixed confidence value returned by the endpoint. The current implementation uses 0.85.
risk_score
: Heuristic risk score computed from latency and error rate.
status
: Always oss_advisory_only, indicating that the response is informational and not executable.
causal_explanation
factual_outcome
: The observed outcome value from the request context. The endpoint uses latency_p99 as the explained metric.
counterfactual_outcome
: The estimated value under the proposed alternative action.
effect
: The signed difference counterfactual_outcome - factual_outcome. A negative value means the metric is expected to decrease under the proposed action.
explanation_text
: Natural-language explanation of the counterfactual effect.
is_model_based
: Always false in the current implementation.
warnings
: A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.
utility_decision
best_action
: The selected action, repeated for convenience.
expected_utility
: Fixed utility value returned by the current implementation. The endpoint uses 0.5.
explanation
: Brief explanation that the choice came from heuristic latency and error thresholds.
Deterministic decision logic
The endpoint uses the following rule to choose the action:
optimal_action = RESTART_CONTAINER
if latency_p99 > 500 OR error_rate > 0.15
else NO_ACTION
In the implementation, this is encoded as:
- restart_container when latency_p99 > 500 or error_rate > 0.15
- no_action otherwise
No probabilistic policy or learned policy is involved.
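The rule above can be sketched as a small pure function. The enum name `HealingAction`, the constant names, and `choose_action` are hypothetical; only the two thresholds and the two action strings come from this document.

```python
from enum import Enum


class HealingAction(Enum):
    """Hypothetical enum; the values match the documented action strings."""

    RESTART_CONTAINER = "restart_container"
    NO_ACTION = "no_action"


LATENCY_P99_THRESHOLD = 500   # same unit as the latency_p99 request field
ERROR_RATE_THRESHOLD = 0.15


def choose_action(latency_p99: float, error_rate: float) -> HealingAction:
    """Return restart_container if either threshold is strictly exceeded."""
    if latency_p99 > LATENCY_P99_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return HealingAction.RESTART_CONTAINER
    return HealingAction.NO_ACTION
```

Both comparisons are strict, so under this sketch `choose_action(500, 0.15)` yields `NO_ACTION`, while `choose_action(450, 0.25)` yields `RESTART_CONTAINER` because the error rate alone crosses its threshold.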
Heuristic risk score
The risk score is computed as:
risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)
Properties of this score:
- normalized to the interval [0, 1]
- weighted more heavily toward latency than error rate
- clipped at 1.0
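A direct transcription of the formula above may help when checking values by hand. The function name `risk_score` is chosen for illustration, and the sketch assumes non-negative inputs, since the formula only clips the upper bound.

```python
def risk_score(latency_p99: float, error_rate: float) -> float:
    """Heuristic risk: 70% weight on latency (scaled by 1000), 30% on error rate, capped at 1.0."""
    return min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)


low = risk_score(200, 0.10)      # 0.2 * 0.7 + 0.10 * 0.3, roughly 0.17
capped = risk_score(2000, 0.50)  # raw value 1.55 is clipped to 1.0
```

Because latency is divided by 1000 before weighting, a latency_p99 of about 1429 or more saturates the score on its own, regardless of the error rate.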
Counterfactual model
The causal explainer uses a deterministic multiplicative heuristic:
counterfactual_outcome = factual_outcome * (1 + effect_frac)
Where:
- factual_outcome is the observed metric value
- effect_frac is read from a fixed internal action-impact table
- the effect is multiplicative, not additive
For latency, the current action-impact mapping includes the following examples:
- restart_container → latency_effect = -0.15
- scale_out → latency_effect = -0.20
- rollback → latency_effect = -0.25
- circuit_breaker → latency_effect = -0.05
- traffic_shift → latency_effect = -0.10
- alert_team → latency_effect = 0.0
- no_action → latency_effect = 0.0
For error rate, the table includes a separate error_rate_effect per action, but the current endpoint calls the explainer with outcome_metric="latency", so the returned counterfactual explanation is latency-based.
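To make the multiplicative model concrete, the sketch below applies the latency effects listed above. The dictionary name `LATENCY_EFFECT` and the function `counterfactual_latency` are hypothetical; the internal table's real structure, and how the explainer treats unknown actions, are not described in this document.

```python
from typing import Tuple

# Latency effects copied from the list above; the dict itself is an illustrative stand-in.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual_outcome: float, action: str) -> Tuple[float, float]:
    """Return (counterfactual_outcome, effect) for a latency value and an action name."""
    effect_frac = LATENCY_EFFECT[action]  # raises KeyError for actions outside the table
    counterfactual = factual_outcome * (1 + effect_frac)
    return counterfactual, counterfactual - factual_outcome


cf, effect = counterfactual_latency(450, "restart_container")  # about 382.5 and -67.5
```

Since the effect scales with the observed value, the same action predicts a larger absolute latency change for a slower service than for a faster one.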
Uncertainty interval
The explainer applies a fixed uncertainty margin equal to 10% of the absolute estimated effect, centered on the counterfactual outcome.
Let:
effect = counterfactual_outcome - factual_outcome
ci_half = abs(effect) * 0.1
confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)
This interval is heuristic only. It is not a calibrated statistical confidence interval.
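The three assignments above translate directly into a helper. The name `heuristic_interval` is an assumption made for this sketch.

```python
from typing import Tuple


def heuristic_interval(factual_outcome: float, counterfactual_outcome: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds placed 10% of |effect| on either side of the counterfactual."""
    effect = counterfactual_outcome - factual_outcome
    ci_half = abs(effect) * 0.1
    return (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)


lower, upper = heuristic_interval(450, 382.5)  # approximately (375.75, 389.25)
```

When the effect is zero, as with no_action, the interval collapses to a single point at the counterfactual outcome.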
How the endpoint uses the explainer
The endpoint constructs a local state object and passes it to the explainer:
current_state["latency"] = event.latency_p99
current_state["error_rate"] = event.error_rate
current_state["last_action"] = {"action_type": "no_action"}
It then creates:
proposed_action = {"action_type": optimal_action.value, "params": {}}
and calls:
CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")
The resulting explanation is returned as the causal_explanation section, and its text is used to build the healing_intent justification.
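The call sequence above can be traced end to end with a stand-in explainer. Everything named `*_sketch` below is hypothetical and is not the real CausalExplainer; the text template and the "Causal: " prefix are inferred from the example response in this document, and the effect fraction is passed in explicitly rather than looked up.

```python
def explain_latency_sketch(proposed_action: dict, current_state: dict, effect_frac: float) -> dict:
    """Hypothetical stand-in for CausalExplainer().explain_healing_intent(..., "latency")."""
    factual = current_state["latency"]
    counterfactual = factual * (1 + effect_frac)
    effect = counterfactual - factual
    previous = current_state["last_action"]["action_type"]
    text = (
        f"If we apply {proposed_action['action_type']} instead of {previous}, "
        f"latency would change from {factual:.2f} to {counterfactual:.2f} "
        f"(Δ = {effect:.2f}). Based on heuristic causal model."
    )
    return {
        "factual_outcome": factual,
        "counterfactual_outcome": counterfactual,
        "effect": effect,
        "explanation_text": text,
        "is_model_based": False,
    }


# State and action built the same way the endpoint is described as building them.
current_state = {
    "latency": 450,
    "error_rate": 0.25,
    "last_action": {"action_type": "no_action"},
}
proposed_action = {"action_type": "restart_container", "params": {}}

explanation = explain_latency_sketch(proposed_action, current_state, effect_frac=-0.15)
justification = "Causal: " + explanation["explanation_text"]  # prefix assumed from the example
```

Note that current_state carries error_rate even though, with the latency metric selected, only the latency entry influences the explanation in this sketch.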
Validation and error behavior
The endpoint uses Pydantic validation through the ReliabilityEvent model.
Expected behavior:
- valid requests return HTTP 200
- invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs
The current implementation does not define a custom error schema for validation failures.
Advisory-only behavior
The response includes:
"status": "oss_advisory_only"
This means:
- the endpoint recommends an action
- it does not perform the action
- it does not mutate incident state
- it does not trigger remediation workflows by itself
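A client that consumes the response may want to confirm the advisory marker before showing a recommendation. The helper names below are hypothetical, and the sketch assumes the response has already been parsed into a Python dict.

```python
from typing import Optional


def is_advisory_only(response: dict) -> bool:
    """True when the response carries the documented advisory-only status marker."""
    return response.get("healing_intent", {}).get("status") == "oss_advisory_only"


def recommended_action(response: dict) -> Optional[str]:
    """Return the recommended action name, or None if the response is not marked advisory."""
    if not is_advisory_only(response):
        return None
    return response["healing_intent"]["action"]


sample = {"healing_intent": {"action": "restart_container", "status": "oss_advisory_only"}}
action = recommended_action(sample)  # "restart_container"; nothing is executed
```

Executing the recommendation, if desired, remains the caller's responsibility; the endpoint itself never does so.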
Notes on implementation scope
The current endpoint is intentionally narrow:
- it bases the action choice on only two fields: latency_p99 and error_rate
- it ignores cpu_util, memory_util, and service_mesh in the decision logic
- it always uses the latency metric in the causal explainer call
- it returns a fixed expected_utility value of 0.5
Example request
curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" \
  -H "Content-Type: application/json" \
  -d '{
"component": "payment-service",
"latency_p99": 450,
"error_rate": 0.25,
"service_mesh": "default",
"cpu_util": 0.85,
"memory_util": 0.90
}'
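The same request can be issued from Python using only the standard library. This is a minimal sketch: the helper names `build_request` and `evaluate_incident` are hypothetical, and the URL simply repeats the one in the curl example, which assumes a server running locally on port 8000.

```python
import json
import urllib.request

URL = "http://localhost:8000/api/v1/v1/incidents/evaluate"  # copied from the curl example


def build_request(event: dict, url: str = URL) -> urllib.request.Request:
    """Build the POST request with a JSON body; nothing is sent yet."""
    body = json.dumps(event).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def evaluate_incident(event: dict, url: str = URL) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(event, url), timeout=10) as resp:
        return json.load(resp)


event = {
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90,
}
request = build_request(event)
# result = evaluate_incident(event)  # uncomment when the API server is running
```

A body that fails validation would surface here as a `urllib.error.HTTPError` raised by `urlopen`, since the server rejects it before the handler runs.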
Example response shape
{
"healing_intent": {
"action": "restart_container",
"component": "payment-service",
"parameters": {},
"justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
"confidence": 0.85,
"risk_score": 0.39,
"status": "oss_advisory_only"
},
"causal_explanation": {
"factual_outcome": 450,
"counterfactual_outcome": 382.5,
"effect": -67.5,
"explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
"is_model_based": false,
"warnings": [
"Using heuristic causal model (no fitted SCM)."
]
},
"utility_decision": {
"best_action": "restart_container",
"expected_utility": 0.5,
"explanation": "Heuristic decision based on latency/error thresholds"
}
}
Cross-reference
See docs/examples.md for a worked numerical example and README.md for a shorter overview.