| # API Endpoints |
|
|
| This document describes the main ARF API endpoints and the request/response contracts used by the control plane. |
|
|
| ## POST `/api/v1/v1/incidents/evaluate` |
|
|
| Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision. |
|
|
| This endpoint is **advisory only**. It does not apply remediation, mutate infrastructure, or execute any healing action. |
|
|
| ### Purpose |
|
|
| The endpoint takes a snapshot of the current incident, estimates risk, deterministically chooses an action, and explains the expected effect of that action on latency using a heuristic counterfactual model. |
|
|
| The implementation is intentionally simple: |
|
|
| - no fitted Structural Causal Model is used |
| - no machine learning model is required |
| - no historical training step is performed |
| - no action execution is triggered |
|
|
| ### Request schema |
|
|
| The request body must match the `ReliabilityEvent` model. |
|
|
| ```json |
| { |
| "component": "string", |
| "latency_p99": "number", |
| "error_rate": "number", |
| "service_mesh": "string", |
| "cpu_util": "number | null", |
| "memory_util": "number | null" |
| } |
| ``` |
|
|
| #### Fields |
|
|
| `component` |
| : Name of the service or component being evaluated. |
|
|
| `latency_p99` |
| : The current 99th percentile latency value. The endpoint uses this value both for risk scoring and for the causal explanation. |
|
|
| `error_rate` |
| : The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold. |
|
|
| `service_mesh` |
| : Optional service mesh name. Defaults to `"default"`. |
|
|
| `cpu_util` |
| : Optional CPU utilization value. Present in the request model, but not used by the current decision logic. |
|
|
| `memory_util` |
| : Optional memory utilization value. Present in the request model, but not used by the current decision logic. |
|
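For client code, the fields above can be mirrored in a small stand-in type. The following is a minimal sketch that uses a standard-library dataclass; it is not the actual Pydantic `ReliabilityEvent` model, and the names `ReliabilityEventPayload` and `to_payload` are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ReliabilityEventPayload:
    """Client-side stand-in for the request body (hypothetical, not the real model)."""

    component: str
    latency_p99: float
    error_rate: float
    service_mesh: str = "default"        # documented default
    cpu_util: Optional[float] = None     # accepted, but unused by the decision logic
    memory_util: Optional[float] = None  # accepted, but unused by the decision logic

    def to_payload(self) -> dict:
        """Return a JSON-serializable dict keyed by the documented field names."""
        return asdict(self)


event = ReliabilityEventPayload(
    component="payment-service", latency_p99=450, error_rate=0.25
)
payload = event.to_payload()
```

Omitted optional fields are serialized as `null`, and `service_mesh` falls back to `"default"`.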
|
| ### Response schema |
|
|
| The endpoint returns a JSON object with three top-level sections. |
|
|
| ```json |
| { |
| "healing_intent": { |
| "action": "string", |
| "component": "string", |
| "parameters": {}, |
| "justification": "string", |
| "confidence": 0.85, |
| "risk_score": 0.0, |
| "status": "oss_advisory_only" |
| }, |
| "causal_explanation": { |
| "factual_outcome": 0.0, |
| "counterfactual_outcome": 0.0, |
| "effect": 0.0, |
| "explanation_text": "string", |
| "is_model_based": false, |
| "warnings": ["string"] |
| }, |
| "utility_decision": { |
| "best_action": "string", |
| "expected_utility": 0.5, |
| "explanation": "string" |
| } |
| } |
| ``` |
|
|
| #### `healing_intent` |
| |
| `action` |
| : The selected action. In the current implementation this is either `restart_container` or `no_action`. |
| |
| `component` |
| : The input component name. |
| |
| `parameters` |
| : Action parameters. The current implementation returns an empty object. |
| |
| `justification` |
| : Human-readable explanation built from the causal explanation. |
| |
| `confidence` |
| : Fixed confidence value returned by the endpoint. The current implementation uses `0.85`. |
| |
| `risk_score` |
| : Heuristic risk score computed from latency and error rate. |
|
|
| `status` |
| : Always `oss_advisory_only`, indicating that the response is informational and not executable. |
|
|
| #### `causal_explanation` |
| |
| `factual_outcome` |
| : The observed outcome value from the request context. The endpoint uses `latency_p99` as the explained metric. |
|
|
| `counterfactual_outcome` |
| : The estimated value under the proposed alternative action. |
|
|
| `effect` |
| : The difference `counterfactual_outcome - factual_outcome`. A negative value means the proposed action is expected to reduce the metric. |
|
|
| `explanation_text` |
| : Natural-language explanation of the counterfactual effect. |
|
|
| `is_model_based` |
| : Always `false` in the current implementation. |
|
|
| `warnings` |
| : A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based. |
|
|
| #### `utility_decision` |
| |
| `best_action` |
| : The selected action, repeated from `healing_intent.action` for convenience. |
|
|
| `expected_utility` |
| : Fixed utility value returned by the current implementation. The endpoint uses `0.5`. |
|
|
| `explanation` |
| : Brief explanation that the choice came from heuristic latency and error thresholds. |
|
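The fixed values described above lend themselves to a quick client-side sanity check. The sketch below is illustrative only: `check_advisory_response` is a hypothetical helper, and the invariants it tests are the ones this document states for the current implementation.

```python
def check_advisory_response(body: dict) -> list:
    """Return a list of problems; an empty list means the documented invariants hold."""
    problems = []
    intent = body.get("healing_intent", {})
    causal = body.get("causal_explanation", {})
    utility = body.get("utility_decision", {})

    if intent.get("status") != "oss_advisory_only":
        problems.append("status is not oss_advisory_only")
    if intent.get("action") not in ("restart_container", "no_action"):
        problems.append("unexpected action")
    if causal.get("is_model_based") is not False:
        problems.append("is_model_based should be false")
    if utility.get("best_action") != intent.get("action"):
        problems.append("best_action does not match healing_intent.action")
    return problems


sample = {
    "healing_intent": {"action": "no_action", "status": "oss_advisory_only"},
    "causal_explanation": {"is_model_based": False},
    "utility_decision": {"best_action": "no_action"},
}
issues = check_advisory_response(sample)
```

A check like this would need updating if a later version adds actions or a fitted model.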
|
| ### Deterministic decision logic |
|
|
| The endpoint uses the following rule to choose the action: |
|
|
| ```text |
| optimal_action = RESTART_CONTAINER |
| if latency_p99 > 500 OR error_rate > 0.15 |
| else NO_ACTION |
| ``` |
|
|
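| In the implementation, this is encoded with strict comparisons, so a value exactly at a threshold (`500` or `0.15`) does not trigger a restart on its own: |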
|
|
| - `restart_container` when `latency_p99 > 500` or `error_rate > 0.15` |
| - `no_action` otherwise |
|
|
| No probabilistic policy or learned policy is involved. |
|
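The rule above can be written as a short function. This is a sketch of the documented thresholds, not the endpoint's actual source; the function and constant names are assumptions.

```python
LATENCY_THRESHOLD = 500      # from the documented rule
ERROR_RATE_THRESHOLD = 0.15  # from the documented rule


def choose_action(latency_p99: float, error_rate: float) -> str:
    """Pick an action with strict comparisons against both thresholds."""
    if latency_p99 > LATENCY_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return "restart_container"
    return "no_action"


high_error = choose_action(450, 0.25)   # error rate alone exceeds its threshold
at_boundary = choose_action(500, 0.15)  # neither comparison is strictly exceeded
```

Here `high_error` is `"restart_container"` and `at_boundary` is `"no_action"`.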
|
| ### Heuristic risk score |
|
|
| The risk score is computed as: |
|
|
| ```text |
| risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3) |
| ``` |
|
|
| Properties of this score: |
|
|
| - lies in the interval `[0, 1]` for non-negative inputs; only the upper bound is enforced |
| - weighted more heavily toward latency (`0.7`, after dividing by `1000`) than error rate (`0.3`) |
| - clipped at `1.0` |
|
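A direct transcription of the formula makes the weighting and the clipping easy to check by hand. The function name is an assumption; the arithmetic is the documented formula.

```python
def risk_score(latency_p99: float, error_rate: float) -> float:
    """Heuristic risk: 70% scaled latency plus 30% error rate, capped at 1.0."""
    return min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)


# 0.45 * 0.7 + 0.25 * 0.3 = 0.315 + 0.075
moderate = round(risk_score(450, 0.25), 4)

# 2.0 * 0.7 + 0.5 * 0.3 = 1.55, which is clipped to 1.0
saturated = risk_score(2000, 0.5)
```

With these inputs, `moderate` is `0.39` and `saturated` is `1.0`.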
|
| ### Counterfactual model |
|
|
| The causal explainer uses a deterministic multiplicative heuristic: |
|
|
| ```text |
| counterfactual_outcome = factual_outcome * (1 + effect_frac) |
| ``` |
|
|
| Where: |
|
|
| - `factual_outcome` is the observed metric value |
| - `effect_frac` is read from a fixed internal action-impact table |
| - the effect is multiplicative, not additive |
|
|
| For latency, the current action-impact mapping includes the following examples: |
|
|
| - `restart_container` → `latency_effect = -0.15` |
| - `scale_out` → `latency_effect = -0.20` |
| - `rollback` → `latency_effect = -0.25` |
| - `circuit_breaker` → `latency_effect = -0.05` |
| - `traffic_shift` → `latency_effect = -0.10` |
| - `alert_team` → `latency_effect = 0.0` |
| - `no_action` → `latency_effect = 0.0` |
|
|
| For error rate, the table includes a separate `error_rate_effect` per action, but the current endpoint calls the explainer with `outcome_metric="latency"`, so the returned counterfactual explanation is latency-based. |
|
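The multiplicative heuristic and the latency column of the action-impact table can be sketched as follows. The dictionary and function names are hypothetical, and treating an unknown action as having zero effect is an assumption made here, not documented behavior.

```python
# Latency effects as listed above.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual_outcome: float, action_type: str) -> tuple:
    """Return (counterfactual_outcome, effect) for a proposed action."""
    # Assumption: actions missing from the table are treated as having no effect.
    effect_frac = LATENCY_EFFECT.get(action_type, 0.0)
    counterfactual = factual_outcome * (1 + effect_frac)
    return counterfactual, counterfactual - factual_outcome


cf, effect = counterfactual_latency(450, "restart_container")
```

For a factual latency of `450`, this gives a counterfactual of about `382.5` and an effect of about `-67.5`.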
|
| ### Uncertainty interval |
|
|
| The explainer applies a fixed uncertainty margin equal to 10% of the absolute estimated effect, centered on the counterfactual outcome. |
|
|
| Let: |
|
|
| ```text |
| effect = counterfactual_outcome - factual_outcome |
| ci_half = abs(effect) * 0.1 |
| confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half) |
| ``` |
|
|
| This interval is heuristic only. It is not a calibrated statistical confidence interval. |
|
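As a worked instance of the definitions above, the following sketch computes the interval for a factual latency of `450` and a counterfactual of `382.5`. The function name and the `margin` parameter are assumptions.

```python
def heuristic_interval(factual_outcome: float, counterfactual_outcome: float,
                       margin: float = 0.1) -> tuple:
    """Return (low, high) around the counterfactual, sized by the effect magnitude."""
    effect = counterfactual_outcome - factual_outcome
    ci_half = abs(effect) * margin
    return (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)


# effect = -67.5, so ci_half = 6.75
low, high = heuristic_interval(450, 382.5)
```

The result is approximately `(375.75, 389.25)`. When the effect is zero, as with `no_action`, the interval collapses to a single point.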
|
| ### How the endpoint uses the explainer |
|
|
| The endpoint constructs a local state object and passes it to the explainer: |
|
|
| - `current_state["latency"] = event.latency_p99` |
| - `current_state["error_rate"] = event.error_rate` |
| - `current_state["last_action"] = {"action_type": "no_action"}` |
|
|
| It then creates: |
|
|
| - `proposed_action = {"action_type": optimal_action.value, "params": {}}` |
|
|
| and calls: |
|
|
| ```text |
| CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency") |
| ``` |
|
|
| The resulting explanation is returned as `causal_explanation`, and its text is reused to build `healing_intent.justification`. |
|
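The wiring described above can be sketched end to end with a stand-in explainer. `explain_latency` is a hypothetical substitute for `CausalExplainer().explain_healing_intent(...)`, whose real signature and return type are not shown in this document. The text format and the `"Causal: "` prefix are inferred from the example response.

```python
EFFECTS = {"restart_container": -0.15, "no_action": 0.0}  # subset of the impact table


def explain_latency(proposed_action: dict, current_state: dict) -> dict:
    """Hypothetical stand-in for the real explainer, restricted to latency."""
    factual = float(current_state["latency"])
    action = proposed_action["action_type"]
    baseline = current_state["last_action"]["action_type"]
    counterfactual = factual * (1 + EFFECTS.get(action, 0.0))
    effect = counterfactual - factual
    text = (
        f"If we apply {action} instead of {baseline}, latency would change "
        f"from {factual:.2f} to {counterfactual:.2f} (Δ = {effect:.2f}). "
        "Based on heuristic causal model."
    )
    return {
        "factual_outcome": factual,
        "counterfactual_outcome": counterfactual,
        "effect": effect,
        "explanation_text": text,
        "is_model_based": False,
    }


current_state = {
    "latency": 450,
    "error_rate": 0.25,
    "last_action": {"action_type": "no_action"},
}
proposed_action = {"action_type": "restart_container", "params": {}}

explanation = explain_latency(proposed_action, current_state)
justification = "Causal: " + explanation["explanation_text"]
```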
|
| ### Validation and error behavior |
|
|
| The endpoint uses Pydantic validation through the `ReliabilityEvent` model. |
|
|
| Expected behavior: |
|
|
| - valid requests return HTTP 200 |
| - invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs |
|
|
| The current implementation does not define a custom error schema for validation failures, so FastAPI's default validation error response (HTTP 422 with a `detail` array) should apply. |
|
|
| ### Advisory-only behavior |
|
|
| The response includes: |
|
|
| ```json |
| "status": "oss_advisory_only" |
| ``` |
|
|
| This means: |
|
|
| - the endpoint recommends an action |
| - it does not perform the action |
| - it does not mutate incident state |
| - it does not trigger remediation workflows by itself |
|
|
| ### Notes on implementation scope |
|
|
| The current endpoint is intentionally narrow: |
|
|
| - it bases the action choice on only two fields: `latency_p99` and `error_rate` |
| - it ignores `cpu_util`, `memory_util`, and `service_mesh` in the decision logic |
| - it always uses the latency metric in the causal explainer call |
| - it returns a fixed `expected_utility` value of `0.5` |
|
|
| ### Example request |
|
|
| ```bash |
| curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" -H "Content-Type: application/json" -d '{ |
| "component": "payment-service", |
| "latency_p99": 450, |
| "error_rate": 0.25, |
| "service_mesh": "default", |
| "cpu_util": 0.85, |
| "memory_util": 0.90 |
| }' |
| ``` |
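The same request can be built from Python with the standard library only. This is a sketch: `build_request` is a hypothetical helper, and the host and port are taken from the `curl` example above.

```python
import json
import urllib.request

URL = "http://localhost:8000/api/v1/v1/incidents/evaluate"


def build_request(payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the evaluate endpoint without sending it."""
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request({
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90,
})

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```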
|
|
| ### Example response shape |
|
|
| ```json |
| { |
| "healing_intent": { |
| "action": "restart_container", |
| "component": "payment-service", |
| "parameters": {}, |
| "justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.", |
| "confidence": 0.85, |
| "risk_score": 0.39, |
| "status": "oss_advisory_only" |
| }, |
| "causal_explanation": { |
| "factual_outcome": 450, |
| "counterfactual_outcome": 382.5, |
| "effect": -67.5, |
| "explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.", |
| "is_model_based": false, |
| "warnings": [ |
| "Using heuristic causal model (no fitted SCM)." |
| ] |
| }, |
| "utility_decision": { |
| "best_action": "restart_container", |
| "expected_utility": 0.5, |
| "explanation": "Heuristic decision based on latency/error thresholds" |
| } |
| } |
| ``` |
|
|
| ### Cross-reference |
|
|
| See `docs/examples.md` for a worked numerical example and `README.md` for a shorter overview. |
|
|