# API Endpoints

This document describes the main ARF API endpoints and the request/response contracts used by the control plane.

## POST `/api/v1/v1/incidents/evaluate`

Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.

This endpoint is **advisory only**. It does not apply remediation, mutate infrastructure, or execute any healing action.

### Purpose

The endpoint takes a current incident snapshot, estimates risk, chooses a deterministic action, and explains the expected effect of that action on latency using a heuristic counterfactual model.

The implementation is intentionally simple:

- no fitted Structural Causal Model is used
- no machine learning model is required
- no historical training step is performed
- no action execution is triggered

### Request schema

The request body must match the `ReliabilityEvent` model.

```json
{
  "component": "string",
  "latency_p99": "number",
  "error_rate": "number",
  "service_mesh": "string",
  "cpu_util": "number | null",
  "memory_util": "number | null"
}
```

#### Fields

`component`
: Name of the service or component being evaluated.

`latency_p99`
: The current 99th percentile latency value. The endpoint uses this value both for risk scoring and for the causal explanation.

`error_rate`
: The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.

`service_mesh`
: Optional service mesh name. Defaults to `"default"`.

`cpu_util`
: Optional CPU utilization value. Present in the request model, but not used by the current decision logic.

`memory_util`
: Optional memory utilization value. Present in the request model, but not used by the current decision logic.

### Response schema

The endpoint returns a JSON object with three top-level sections.
```json
{
  "healing_intent": {
    "action": "string",
    "component": "string",
    "parameters": {},
    "justification": "string",
    "confidence": 0.85,
    "risk_score": 0.0,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 0.0,
    "counterfactual_outcome": 0.0,
    "effect": 0.0,
    "explanation_text": "string",
    "is_model_based": false,
    "warnings": ["string"]
  },
  "utility_decision": {
    "best_action": "string",
    "expected_utility": 0.5,
    "explanation": "string"
  }
}
```

#### `healing_intent`

`action`
: The selected action. In the current implementation this is either `restart_container` or `no_action`.

`component`
: The input component name.

`parameters`
: Action parameters. The current implementation returns an empty object.

`justification`
: Human-readable explanation built from the causal explanation.

`confidence`
: Fixed confidence value returned by the endpoint. The current implementation uses `0.85`.

`risk_score`
: Heuristic risk score computed from latency and error rate.

`status`
: Always `oss_advisory_only`, indicating that the response is informational and not executable.

#### `causal_explanation`

`factual_outcome`
: The observed outcome value from the request context. The endpoint uses `latency_p99` as the explained metric.

`counterfactual_outcome`
: The estimated value under the proposed alternative action.

`effect`
: The difference between counterfactual and factual outcomes.

`explanation_text`
: Natural-language explanation of the counterfactual effect.

`is_model_based`
: Always `false` in the current implementation.

`warnings`
: A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.

#### `utility_decision`

`best_action`
: The selected action, repeated for convenience.

`expected_utility`
: Fixed utility value returned by the current implementation. The endpoint uses `0.5`.

`explanation`
: Brief explanation that the choice came from heuristic latency and error thresholds.
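Client code that consumes this response may want typed access to the three sections. The following is a minimal sketch using only the Python standard library; the class names and the `parse_evaluate_response` helper are hypothetical and are not part of the ARF codebase. It mirrors the documented fields and rejects any payload whose `status` is not `oss_advisory_only`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class HealingIntent:
    action: str
    component: str
    parameters: dict[str, Any]
    justification: str
    confidence: float
    risk_score: float
    status: str


@dataclass(frozen=True)
class CausalExplanation:
    factual_outcome: float
    counterfactual_outcome: float
    effect: float
    explanation_text: str
    is_model_based: bool
    warnings: list[str]


@dataclass(frozen=True)
class UtilityDecision:
    best_action: str
    expected_utility: float
    explanation: str


@dataclass(frozen=True)
class EvaluateResponse:
    healing_intent: HealingIntent
    causal_explanation: CausalExplanation
    utility_decision: UtilityDecision


def parse_evaluate_response(payload: dict[str, Any]) -> EvaluateResponse:
    """Build typed objects from the decoded JSON body.

    Unknown or missing keys raise TypeError/KeyError, which is a deliberate
    choice in this sketch so that schema drift is noticed early.
    """
    response = EvaluateResponse(
        healing_intent=HealingIntent(**payload["healing_intent"]),
        causal_explanation=CausalExplanation(**payload["causal_explanation"]),
        utility_decision=UtilityDecision(**payload["utility_decision"]),
    )
    # The documentation states that status is always "oss_advisory_only".
    if response.healing_intent.status != "oss_advisory_only":
        raise ValueError(f"unexpected status: {response.healing_intent.status!r}")
    return response
```

Strict keyword unpacking is a trade-off: it fails loudly if the server adds a field, so a more tolerant client would filter the payload keys before constructing each dataclass.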
### Deterministic decision logic

The endpoint uses the following rule to choose the action:

```text
optimal_action = RESTART_CONTAINER if latency_p99 > 500 OR error_rate > 0.15 else NO_ACTION
```

In the implementation, this is encoded as:

- `restart_container` when `latency_p99 > 500` or `error_rate > 0.15`
- `no_action` otherwise

No probabilistic policy or learned policy is involved.

### Heuristic risk score

The risk score is computed as:

```text
risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)
```

Properties of this score:

- normalized to the interval `[0, 1]`
- weighted more heavily toward latency than error rate
- clipped at `1.0`

### Counterfactual model

The causal explainer uses a deterministic multiplicative heuristic:

```text
counterfactual_outcome = factual_outcome * (1 + effect_frac)
```

Where:

- `factual_outcome` is the observed metric value
- `effect_frac` is read from a fixed internal action-impact table
- the effect is multiplicative, not additive

For latency, the current action-impact mapping includes the following examples:

- `restart_container` → `latency_effect = -0.15`
- `scale_out` → `latency_effect = -0.20`
- `rollback` → `latency_effect = -0.25`
- `circuit_breaker` → `latency_effect = -0.05`
- `traffic_shift` → `latency_effect = -0.10`
- `alert_team` → `latency_effect = 0.0`
- `no_action` → `latency_effect = 0.0`

For error rate, the table includes a separate `error_rate_effect` per action, but the current endpoint calls the explainer with `outcome_metric="latency"`, so the returned counterfactual explanation is latency-based.

### Uncertainty interval

The explainer applies a fixed uncertainty margin of ±10% around the estimated effect.

Let:

```text
effect = counterfactual_outcome - factual_outcome
ci_half = abs(effect) * 0.1
confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)
```

This interval is heuristic only. It is not a calibrated statistical confidence interval.
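The decision rule, risk score, counterfactual model, and uncertainty interval described above can be combined into one small function. The sketch below is an illustration written from the formulas in this document, not the ARF source: the names `LATENCY_EFFECT`, `Evaluation`, and `evaluate` are hypothetical, and the table contains only the latency examples listed above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Latency effects copied from the documented action-impact examples.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


@dataclass(frozen=True)
class Evaluation:
    action: str
    risk_score: float
    factual_outcome: float
    counterfactual_outcome: float
    effect: float
    confidence_interval: tuple[float, float]


def choose_action(latency_p99: float, error_rate: float) -> str:
    """Threshold rule: restart when either metric exceeds its limit."""
    if latency_p99 > 500 or error_rate > 0.15:
        return "restart_container"
    return "no_action"


def risk_score(latency_p99: float, error_rate: float) -> float:
    """Weighted sum of latency and error rate, clipped at 1.0."""
    return min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)


def evaluate(latency_p99: float, error_rate: float) -> Evaluation:
    action = choose_action(latency_p99, error_rate)
    factual = latency_p99
    # Multiplicative effect: the outcome is scaled, not shifted.
    counterfactual = factual * (1 + LATENCY_EFFECT[action])
    effect = counterfactual - factual
    # Fixed 10% margin on the size of the effect, centred on the estimate.
    ci_half = abs(effect) * 0.1
    return Evaluation(
        action=action,
        risk_score=risk_score(latency_p99, error_rate),
        factual_outcome=factual,
        counterfactual_outcome=counterfactual,
        effect=effect,
        confidence_interval=(counterfactual - ci_half, counterfactual + ci_half),
    )
```

Because both thresholds use strict inequalities, a request with `latency_p99` of exactly 500 and `error_rate` of exactly 0.15 yields `no_action`, and in that case the interval collapses to a single point because the effect is zero.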
### How the endpoint uses the explainer

The endpoint constructs a local state object and passes it to the explainer:

- `current_state["latency"] = event.latency_p99`
- `current_state["error_rate"] = event.error_rate`
- `current_state["last_action"] = {"action_type": "no_action"}`

It then creates:

- `proposed_action = {"action_type": optimal_action.value, "params": {}}`

and calls:

```text
CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")
```

The resulting explanation feeds the `justification` field of `healing_intent` and is returned as the `causal_explanation` section of the response.

### Validation and error behavior

The endpoint uses Pydantic validation through the `ReliabilityEvent` model.

Expected behavior:

- valid requests return HTTP 200
- invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs

The current implementation does not define a custom error schema for validation failures, so callers should expect FastAPI's default HTTP 422 validation error response.

### Advisory-only behavior

The response includes:

```json
"status": "oss_advisory_only"
```

This means:

- the endpoint recommends an action
- it does not perform the action
- it does not mutate incident state
- it does not trigger remediation workflows by itself

### Notes on implementation scope

The current endpoint is intentionally narrow:

- it bases the action choice on only two fields: `latency_p99` and `error_rate`
- it ignores `cpu_util`, `memory_util`, and `service_mesh` in the decision logic
- it always uses the latency metric in the causal explainer call
- it returns a fixed `expected_utility` value of `0.5`

### Example request

```bash
curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" \
  -H "Content-Type: application/json" \
  -d '{
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90
  }'
```

### Example response shape

```json
{
  "healing_intent": {
    "action": "restart_container",
    "component": "payment-service",
    "parameters": {},
    "justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "confidence": 0.85,
    "risk_score": 0.39,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 450,
    "counterfactual_outcome": 382.5,
    "effect": -67.5,
    "explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "is_model_based": false,
    "warnings": [
      "Using heuristic causal model (no fitted SCM)."
    ]
  },
  "utility_decision": {
    "best_action": "restart_container",
    "expected_utility": 0.5,
    "explanation": "Heuristic decision based on latency/error thresholds"
  }
}
```

### Cross-reference

See `docs/examples.md` for a worked numerical example and `README.md` for a shorter overview.
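
### Example request in Python

For callers who prefer Python to `curl`, the example request could be issued with the standard library alone. This is a minimal sketch: the helper names `build_evaluate_request` and `evaluate_incident` are hypothetical, and the base URL `http://localhost:8000` is taken from the `curl` example rather than from any deployment guidance.

```python
import json
import urllib.request

# Path as documented for this endpoint.
EVALUATE_PATH = "/api/v1/v1/incidents/evaluate"


def build_evaluate_request(base_url: str, event: dict) -> urllib.request.Request:
    """Create the POST request without sending it."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + EVALUATE_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate_incident(base_url: str, event: dict, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response body.

    urllib raises urllib.error.HTTPError for non-2xx responses, which
    includes requests rejected by validation.
    """
    request = build_evaluate_request(base_url, event)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `evaluate_incident("http://localhost:8000", {"component": "payment-service", "latency_p99": 450, "error_rate": 0.25})` against a running instance should return a dictionary with the three top-level sections described in the response schema.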