| # API Endpoints | |
| This document describes the main ARF API endpoints and the request/response contracts used by the control plane. | |
| ## POST `/api/v1/v1/incidents/evaluate` | |
| Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision. | |
| This endpoint is **advisory only**. It does not apply remediation, mutate infrastructure, or execute any healing action. | |
| ### Purpose | |
| The endpoint takes a snapshot of the current incident, computes a heuristic risk score, selects an action with a fixed threshold rule, and explains the expected effect of that action on latency using a heuristic counterfactual model. | |
| The implementation is intentionally simple: | |
| - no fitted Structural Causal Model is used | |
| - no machine learning model is required | |
| - no historical training step is performed | |
| - no action execution is triggered | |
| ### Request schema | |
| The request body must match the `ReliabilityEvent` model. | |
| ```json | |
| { | |
| "component": "string", | |
| "latency_p99": "number", | |
| "error_rate": "number", | |
| "service_mesh": "string", | |
| "cpu_util": "number | null", | |
| "memory_util": "number | null" | |
| } | |
| ``` | |
| #### Fields | |
| `component` | |
| : Name of the service or component being evaluated. | |
| `latency_p99` | |
| : The current 99th percentile latency value. The endpoint uses this value both for risk scoring and for the causal explanation. | |
| `error_rate` | |
| : The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold. | |
| `service_mesh` | |
| : Optional service mesh name. Defaults to `"default"`. | |
| `cpu_util` | |
| : Optional CPU utilization value. Present in the request model, but not used by the current decision logic. | |
| `memory_util` | |
| : Optional memory utilization value. Present in the request model, but not used by the current decision logic. | |
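The fields above can be mirrored client-side when building payloads. The following is a minimal sketch using a plain dataclass; `ReliabilityEventSketch` is a hypothetical stand-in and not the actual Pydantic `ReliabilityEvent` model. It assumes the two utilization fields default to null when omitted.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ReliabilityEventSketch:
    """Hypothetical stand-in that mirrors the documented request fields."""

    component: str
    latency_p99: float
    error_rate: float
    service_mesh: str = "default"        # documented default
    cpu_util: Optional[float] = None     # assumed default; unused by the decision logic
    memory_util: Optional[float] = None  # assumed default; unused by the decision logic


# Build a request body with only the required fields set.
event = ReliabilityEventSketch(
    component="payment-service",
    latency_p99=450,
    error_rate=0.25,
)
payload = asdict(event)
```

Serializing `payload` with `json.dumps` gives a body with the same keys as the request schema.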
| ### Response schema | |
| The endpoint returns a JSON object with three top-level sections. | |
| ```json | |
| { | |
| "healing_intent": { | |
| "action": "string", | |
| "component": "string", | |
| "parameters": {}, | |
| "justification": "string", | |
| "confidence": 0.85, | |
| "risk_score": 0.0, | |
| "status": "oss_advisory_only" | |
| }, | |
| "causal_explanation": { | |
| "factual_outcome": 0.0, | |
| "counterfactual_outcome": 0.0, | |
| "effect": 0.0, | |
| "explanation_text": "string", | |
| "is_model_based": false, | |
| "warnings": ["string"] | |
| }, | |
| "utility_decision": { | |
| "best_action": "string", | |
| "expected_utility": 0.5, | |
| "explanation": "string" | |
| } | |
| } | |
| ``` | |
| #### `healing_intent` | |
| `action` | |
| : The selected action. In the current implementation this is either `restart_container` or `no_action`. | |
| `component` | |
| : The input component name. | |
| `parameters` | |
| : Action parameters. The current implementation returns an empty object. | |
| `justification` | |
| : Human-readable explanation built from the causal explanation. | |
| `confidence` | |
| : Fixed confidence value returned by the endpoint. The current implementation uses `0.85`. | |
| `risk_score` | |
| : Heuristic risk score computed from latency and error rate. | |
| `status` | |
| : Always `oss_advisory_only`, indicating that the response is informational and not executable. | |
| #### `causal_explanation` | |
| `factual_outcome` | |
| : The observed outcome value from the request context. The endpoint uses `latency_p99` as the explained metric. | |
| `counterfactual_outcome` | |
| : The estimated value under the proposed alternative action. | |
| `effect` | |
| : The difference `counterfactual_outcome - factual_outcome`. A negative value means the proposed action is expected to reduce the metric. | |
| `explanation_text` | |
| : Natural-language explanation of the counterfactual effect. | |
| `is_model_based` | |
| : Always `false` in the current implementation. | |
| `warnings` | |
| : A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based. | |
| #### `utility_decision` | |
| `best_action` | |
| : The selected action, repeated for convenience. | |
| `expected_utility` | |
| : Fixed utility value returned by the current implementation. The endpoint uses `0.5`. | |
| `explanation` | |
| : Brief explanation that the choice came from heuristic latency and error thresholds. | |
| ### Deterministic decision logic | |
| The endpoint uses the following rule to choose the action: | |
| ```text | |
| optimal_action = RESTART_CONTAINER | |
| if latency_p99 > 500 OR error_rate > 0.15 | |
| else NO_ACTION | |
| ``` | |
| In the implementation, this is encoded as: | |
| - `restart_container` when `latency_p99 > 500` or `error_rate > 0.15` | |
| - `no_action` otherwise | |
| No probabilistic policy or learned policy is involved. | |
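The threshold rule above can be sketched in Python as follows. The enum name `HealingAction` and the constant names are assumptions made for illustration; only the two thresholds and the two action strings come from this document.

```python
from enum import Enum


class HealingAction(Enum):
    """Hypothetical enum; the values match the documented action strings."""

    RESTART_CONTAINER = "restart_container"
    NO_ACTION = "no_action"


LATENCY_P99_THRESHOLD = 500   # documented latency threshold
ERROR_RATE_THRESHOLD = 0.15   # documented error-rate threshold


def choose_action(latency_p99: float, error_rate: float) -> HealingAction:
    """Return RESTART_CONTAINER if either threshold is strictly exceeded."""
    if latency_p99 > LATENCY_P99_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return HealingAction.RESTART_CONTAINER
    return HealingAction.NO_ACTION
```

Both comparisons are strict, so an event sitting exactly at `latency_p99 = 500` and `error_rate = 0.15` maps to `no_action`.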
| ### Heuristic risk score | |
| The risk score is computed as: | |
| ```text | |
| risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3) | |
| ``` | |
| Properties of this score: | |
| - falls in the interval `[0, 1]`, provided `latency_p99` and `error_rate` are non-negative (only the upper bound is enforced) | |
| - weighted more heavily toward latency than error rate | |
| - clipped at `1.0` | |
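As a worked illustration, the formula can be written as a small Python function. The function name `heuristic_risk` is an assumption; the arithmetic follows the formula above term by term.

```python
def heuristic_risk(latency_p99: float, error_rate: float) -> float:
    """Weighted sum of scaled latency and error rate, capped at 1.0."""
    latency_term = (latency_p99 / 1000) * 0.7  # latency carries weight 0.7
    error_term = error_rate * 0.3              # error rate carries weight 0.3
    return min(1.0, latency_term + error_term)


# latency_p99 = 450, error_rate = 0.25:
#   0.45 * 0.7 + 0.25 * 0.3 = 0.315 + 0.075 = 0.39
moderate = heuristic_risk(450, 0.25)

# A very slow event saturates the score at the 1.0 cap.
saturated = heuristic_risk(2000, 0.5)
```

Note that a `latency_p99` of about 1429 or more reaches the cap on its own, regardless of the error rate.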
| ### Counterfactual model | |
| The causal explainer uses a deterministic multiplicative heuristic: | |
| ```text | |
| counterfactual_outcome = factual_outcome * (1 + effect_frac) | |
| ``` | |
| Where: | |
| - `factual_outcome` is the observed metric value | |
| - `effect_frac` is read from a fixed internal action-impact table | |
| - the effect is multiplicative, not additive | |
| For latency, the current action-impact mapping includes the following examples: | |
| - `restart_container` → `latency_effect = -0.15` | |
| - `scale_out` → `latency_effect = -0.20` | |
| - `rollback` → `latency_effect = -0.25` | |
| - `circuit_breaker` → `latency_effect = -0.05` | |
| - `traffic_shift` → `latency_effect = -0.10` | |
| - `alert_team` → `latency_effect = 0.0` | |
| - `no_action` → `latency_effect = 0.0` | |
| For error rate, the table includes a separate `error_rate_effect` per action, but the current endpoint calls the explainer with `outcome_metric="latency"`, so the returned counterfactual explanation is latency-based. | |
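The multiplicative rule and the latency mapping listed above can be sketched as follows. The names `LATENCY_EFFECT` and `counterfactual_latency` are hypothetical, and raising `KeyError` for an unknown action is an assumption of this sketch, not documented behavior.

```python
# Latency effect fractions copied from the list above.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual_outcome: float, action: str) -> float:
    """Apply the multiplicative heuristic: factual * (1 + effect_frac)."""
    effect_frac = LATENCY_EFFECT[action]  # KeyError on unknown actions (assumption)
    return factual_outcome * (1 + effect_frac)


# 450 * (1 - 0.15) = 382.5, so the effect is 382.5 - 450 = -67.5
estimate = counterfactual_latency(450, "restart_container")
effect = estimate - 450
```

Because the effect is multiplicative, the same action yields a larger absolute reduction at higher observed latency.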
| ### Uncertainty interval | |
| The explainer applies a fixed uncertainty margin: the interval half-width is 10% of the absolute estimated effect, and the interval is centered on the counterfactual outcome. | |
| Let: | |
| ```text | |
| effect = counterfactual_outcome - factual_outcome | |
| ci_half = abs(effect) * 0.1 | |
| confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half) | |
| ``` | |
| This interval is heuristic only. It is not a calibrated statistical confidence interval. | |
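A short sketch of the interval computation, using the same three steps as the definition above. The function name `heuristic_interval` and the `margin` parameter are assumptions introduced for illustration.

```python
from typing import Tuple


def heuristic_interval(
    factual_outcome: float,
    counterfactual_outcome: float,
    margin: float = 0.1,  # the fixed 10% margin described above
) -> Tuple[float, float]:
    """Return (low, high) centered on the counterfactual outcome."""
    effect = counterfactual_outcome - factual_outcome
    ci_half = abs(effect) * margin
    return (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)


# factual = 450, counterfactual = 382.5:
#   effect = -67.5, ci_half = 6.75, interval = (375.75, 389.25)
low, high = heuristic_interval(450, 382.5)
```

When the effect is zero, as for `no_action`, the interval collapses to a single point at the counterfactual outcome.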
| ### How the endpoint uses the explainer | |
| The endpoint constructs a local state object and passes it to the explainer: | |
| - `current_state["latency"] = event.latency_p99` | |
| - `current_state["error_rate"] = event.error_rate` | |
| - `current_state["last_action"] = {"action_type": "no_action"}` | |
| It then creates: | |
| - `proposed_action = {"action_type": optimal_action.value, "params": {}}` | |
| and calls: | |
| ```text | |
| CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency") | |
| ``` | |
| The resulting explanation text is used to build the `justification` field of `healing_intent`, and the explanation itself is returned under `causal_explanation`. | |
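The call shape described above can be exercised with a stand-in explainer. `StubExplainer` and `build_explainer_inputs` are hypothetical: the stub copies only the method name and argument order from the text, and the dictionary it returns is an assumed shape, not the real `CausalExplainer` return type.

```python
class StubExplainer:
    """Hypothetical stand-in for CausalExplainer (call shape only)."""

    # Subset of the latency effect fractions documented for the explainer.
    LATENCY_EFFECT = {"restart_container": -0.15, "no_action": 0.0}

    def explain_healing_intent(self, proposed_action, current_state, outcome_metric):
        factual = current_state[outcome_metric]
        frac = self.LATENCY_EFFECT[proposed_action["action_type"]]
        counterfactual = factual * (1 + frac)
        return {
            "factual_outcome": factual,
            "counterfactual_outcome": counterfactual,
            "effect": counterfactual - factual,
        }


def build_explainer_inputs(latency_p99, error_rate, action_name):
    """Assemble the two dictionaries the endpoint passes to the explainer."""
    current_state = {
        "latency": latency_p99,
        "error_rate": error_rate,
        "last_action": {"action_type": "no_action"},
    }
    proposed_action = {"action_type": action_name, "params": {}}
    return proposed_action, current_state


proposed_action, current_state = build_explainer_inputs(450, 0.25, "restart_container")
explanation = StubExplainer().explain_healing_intent(
    proposed_action, current_state, "latency"
)
```

The baseline action in `current_state["last_action"]` is always `no_action`, so the counterfactual compares the proposed action against doing nothing.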
| ### Validation and error behavior | |
| The endpoint uses Pydantic validation through the `ReliabilityEvent` model. | |
| Expected behavior: | |
| - valid requests return HTTP 200 | |
| - invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs | |
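| The current implementation does not define a custom error schema for validation failures, so FastAPI's default validation error response applies (HTTP 422 with a `detail` array describing each invalid field). | |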
| ### Advisory-only behavior | |
| The response includes: | |
| ```json | |
| "status": "oss_advisory_only" | |
| ``` | |
| This means: | |
| - the endpoint recommends an action | |
| - it does not perform the action | |
| - it does not mutate incident state | |
| - it does not trigger remediation workflows by itself | |
| ### Notes on implementation scope | |
| The current endpoint is intentionally narrow: | |
| - it bases the action choice on only two fields: `latency_p99` and `error_rate` | |
| - it ignores `cpu_util`, `memory_util`, and `service_mesh` in the decision logic | |
| - it always uses the latency metric in the causal explainer call | |
| - it returns a fixed `expected_utility` value of `0.5` | |
| ### Example request | |
| ```bash | |
| curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" -H "Content-Type: application/json" -d '{ | |
| "component": "payment-service", | |
| "latency_p99": 450, | |
| "error_rate": 0.25, | |
| "service_mesh": "default", | |
| "cpu_util": 0.85, | |
| "memory_util": 0.90 | |
| }' | |
| ``` | |
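The same request can be built from Python with only the standard library. This is a sketch under the same assumptions as the curl command (a server listening on `localhost:8000`); the helper name `send` is hypothetical, and nothing is sent until it is called.

```python
import json
import urllib.request

URL = "http://localhost:8000/api/v1/v1/incidents/evaluate"

payload = {
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90,
}

# Build the POST request; constructing it does not open a connection.
request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)


def send(req: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `send(request)` against a running instance should return a response body with the three top-level sections described in the response schema.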
| ### Example response shape | |
| ```json | |
| { | |
| "healing_intent": { | |
| "action": "restart_container", | |
| "component": "payment-service", | |
| "parameters": {}, | |
| "justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.", | |
| "confidence": 0.85, | |
| "risk_score": 0.39, | |
| "status": "oss_advisory_only" | |
| }, | |
| "causal_explanation": { | |
| "factual_outcome": 450, | |
| "counterfactual_outcome": 382.5, | |
| "effect": -67.5, | |
| "explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.", | |
| "is_model_based": false, | |
| "warnings": [ | |
| "Using heuristic causal model (no fitted SCM)." | |
| ] | |
| }, | |
| "utility_decision": { | |
| "best_action": "restart_container", | |
| "expected_utility": 0.5, | |
| "explanation": "Heuristic decision based on latency/error thresholds" | |
| } | |
| } | |
| ``` | |
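A client can sanity-check a response against the invariants this document states. The function `check_advisory_response` below is a hypothetical helper, not part of the API; it returns a list of problems instead of raising.

```python
def check_advisory_response(body: dict) -> list:
    """Return a list of violated invariants (an empty list means none found)."""
    intent = body["healing_intent"]
    causal = body["causal_explanation"]
    utility = body["utility_decision"]
    problems = []

    # The endpoint is advisory only, so the status is fixed.
    if intent["status"] != "oss_advisory_only":
        problems.append("unexpected status")

    # best_action repeats the selected action.
    if intent["action"] != utility["best_action"]:
        problems.append("action mismatch")

    # effect is defined as counterfactual_outcome - factual_outcome.
    expected = causal["counterfactual_outcome"] - causal["factual_outcome"]
    if abs(causal["effect"] - expected) > 1e-6:
        problems.append("effect inconsistent with outcomes")

    # The current implementation never uses a fitted model.
    if causal["is_model_based"]:
        problems.append("unexpected model-based explanation")

    return problems


sample = {
    "healing_intent": {"action": "restart_container", "status": "oss_advisory_only"},
    "causal_explanation": {
        "factual_outcome": 450,
        "counterfactual_outcome": 382.5,
        "effect": -67.5,
        "is_model_based": False,
    },
    "utility_decision": {"best_action": "restart_container"},
}
problems = check_advisory_response(sample)
```

A check like this is most useful in integration tests, where a change to the fixed values or field names should be noticed early.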
| ### Cross-reference | |
| See `docs/examples.md` for a worked numerical example and `README.md` for a shorter overview. | |