# API Endpoints
This document describes the main ARF API endpoints and the request/response contracts used by the control plane.
## POST `/api/v1/v1/incidents/evaluate`
Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.
This endpoint is **advisory only**. It does not apply remediation, mutate infrastructure, or execute any healing action.
### Purpose
The endpoint takes a snapshot of the current incident, estimates risk, deterministically chooses an action, and explains the expected effect of that action on latency using a heuristic counterfactual model.
The implementation is intentionally simple:
- no fitted Structural Causal Model is used
- no machine learning model is required
- no historical training step is performed
- no action execution is triggered
### Request schema
The request body must match the `ReliabilityEvent` model.
```json
{
"component": "string",
"latency_p99": "number",
"error_rate": "number",
"service_mesh": "string",
"cpu_util": "number | null",
"memory_util": "number | null"
}
```
#### Fields
`component`
: Name of the service or component being evaluated.
`latency_p99`
: The current 99th percentile latency value. The endpoint uses this value both for risk scoring and for the causal explanation.
`error_rate`
: The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.
`service_mesh`
: Optional service mesh name. Defaults to `"default"`.
`cpu_util`
: Optional CPU utilization value. Present in the request model, but not used by the current decision logic.
`memory_util`
: Optional memory utilization value. Present in the request model, but not used by the current decision logic.
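For client code, the field list above can be mirrored with a plain dataclass. This is only a sketch: `ReliabilityEventSketch` is a hypothetical name, it is not the real Pydantic `ReliabilityEvent` model, and the `None` defaults for `cpu_util` and `memory_util` are an assumption based on the `number | null` types in the schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ReliabilityEventSketch:
    """Hypothetical client-side mirror of the documented request fields."""

    component: str
    latency_p99: float
    error_rate: float
    service_mesh: str = "default"       # documented default
    cpu_util: Optional[float] = None    # assumed default; unused by the decision logic
    memory_util: Optional[float] = None # assumed default; unused by the decision logic


# Build a JSON-serializable payload for the request body.
event = ReliabilityEventSketch(component="payment-service", latency_p99=450, error_rate=0.25)
payload = asdict(event)
```

Fields left unset fall back to the defaults, so `payload["service_mesh"]` is `"default"` here.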
### Response schema
The endpoint returns a JSON object with three top-level sections.
```json
{
"healing_intent": {
"action": "string",
"component": "string",
"parameters": {},
"justification": "string",
"confidence": 0.85,
"risk_score": 0.0,
"status": "oss_advisory_only"
},
"causal_explanation": {
"factual_outcome": 0.0,
"counterfactual_outcome": 0.0,
"effect": 0.0,
"explanation_text": "string",
"is_model_based": false,
"warnings": ["string"]
},
"utility_decision": {
"best_action": "string",
"expected_utility": 0.5,
"explanation": "string"
}
}
```
#### `healing_intent`
`action`
: The selected action. In the current implementation this is either `restart_container` or `no_action`.
`component`
: The input component name.
`parameters`
: Action parameters. The current implementation returns an empty object.
`justification`
: Human-readable explanation built from the causal explanation.
`confidence`
: Fixed confidence value returned by the endpoint. The current implementation uses `0.85`.
`risk_score`
: Heuristic risk score computed from latency and error rate.
`status`
: Always `oss_advisory_only`, indicating that the response is informational and not executable.
#### `causal_explanation`
`factual_outcome`
: The observed outcome value from the request context. The endpoint uses `latency_p99` as the explained metric.
`counterfactual_outcome`
: The estimated value under the proposed alternative action.
`effect`
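: The difference between the counterfactual and factual outcomes (`counterfactual_outcome - factual_outcome`). A negative value means the metric is expected to decrease.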
`explanation_text`
: Natural-language explanation of the counterfactual effect.
`is_model_based`
: Always `false` in the current implementation.
`warnings`
: A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.
#### `utility_decision`
`best_action`
: The selected action, repeated for convenience.
`expected_utility`
: Fixed utility value returned by the current implementation. The endpoint uses `0.5`.
`explanation`
: Brief explanation that the choice came from heuristic latency and error thresholds.
### Deterministic decision logic
The endpoint uses the following rule to choose the action:
```text
optimal_action = RESTART_CONTAINER
if latency_p99 > 500 OR error_rate > 0.15
else NO_ACTION
```
In the implementation, this is encoded as:
- `restart_container` when `latency_p99 > 500` or `error_rate > 0.15`
- `no_action` otherwise
No probabilistic policy or learned policy is involved.
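The rule above can be sketched in Python as follows. The enum name `HealingAction`, the function name `choose_action`, and the constant names are hypothetical; only the thresholds and the two action values come from this document.

```python
from enum import Enum


class HealingAction(str, Enum):
    """Hypothetical enum; the values match the documented action strings."""

    RESTART_CONTAINER = "restart_container"
    NO_ACTION = "no_action"


LATENCY_THRESHOLD = 500      # compared against latency_p99
ERROR_RATE_THRESHOLD = 0.15  # compared against error_rate


def choose_action(latency_p99: float, error_rate: float) -> HealingAction:
    """Return restart_container if either threshold is strictly exceeded."""
    if latency_p99 > LATENCY_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return HealingAction.RESTART_CONTAINER
    return HealingAction.NO_ACTION
```

Both comparisons are strict, so a request sitting exactly on both thresholds (`latency_p99 = 500`, `error_rate = 0.15`) yields `no_action`.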
### Heuristic risk score
The risk score is computed as:
```text
risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)
```
Properties of this score:
- bounded to the interval `[0, 1]`, assuming non-negative inputs
- weighted more heavily toward latency than error rate
- clipped at `1.0`
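A minimal sketch of the formula, assuming non-negative inputs. The function name `heuristic_risk` is hypothetical.

```python
def heuristic_risk(latency_p99: float, error_rate: float) -> float:
    """Weighted sum of scaled latency (0.7) and error rate (0.3), capped at 1.0."""
    return min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)


# Worked example: 0.45 * 0.7 + 0.25 * 0.3 = 0.315 + 0.075, about 0.39.
moderate = heuristic_risk(450, 0.25)

# Large inputs hit the cap: 2.0 * 0.7 + 1.0 * 0.3 = 1.7, clipped to 1.0.
saturated = heuristic_risk(2000, 1.0)
```

Because latency carries a weight of 0.7, a latency of 1000 with no errors already gives a score of about 0.7, while an error rate of 1.0 with no latency gives only about 0.3.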
### Counterfactual model
The causal explainer uses a deterministic multiplicative heuristic:
```text
counterfactual_outcome = factual_outcome * (1 + effect_frac)
```
Where:
- `factual_outcome` is the observed metric value
- `effect_frac` is read from a fixed internal action-impact table
- the effect is multiplicative, not additive
For latency, the current action-impact mapping includes the following examples:
- `restart_container`: `latency_effect = -0.15`
- `scale_out`: `latency_effect = -0.20`
- `rollback`: `latency_effect = -0.25`
- `circuit_breaker`: `latency_effect = -0.05`
- `traffic_shift`: `latency_effect = -0.10`
- `alert_team`: `latency_effect = 0.0`
- `no_action`: `latency_effect = 0.0`
For error rate, the table includes a separate `error_rate_effect` per action, but the current endpoint calls the explainer with `outcome_metric="latency"`, so the returned counterfactual explanation is latency-based.
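The multiplicative heuristic for latency can be sketched as below. The names `LATENCY_EFFECT` and `counterfactual_latency` are hypothetical; the table values are the latency examples listed above, and the internal table may contain more entries.

```python
# Latency effect fractions copied from the examples in this document.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual_outcome: float, action_type: str) -> tuple:
    """Return (counterfactual_outcome, effect) for a latency value and an action."""
    effect_frac = LATENCY_EFFECT[action_type]  # raises KeyError for unlisted actions
    counterfactual_outcome = factual_outcome * (1 + effect_frac)
    effect = counterfactual_outcome - factual_outcome
    return counterfactual_outcome, effect


# 450 * (1 - 0.15) is about 382.5, so the effect is about -67.5.
cf, effect = counterfactual_latency(450, "restart_container")
```

Since the effect is a fraction of the factual value, the same action produces a larger absolute change at higher latency, and `no_action` always leaves the value unchanged.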
### Uncertainty interval
The explainer attaches a fixed uncertainty margin equal to 10% of the absolute estimated effect, centered on the counterfactual outcome.
Let:
```text
effect = counterfactual_outcome - factual_outcome
ci_half = abs(effect) * 0.1
confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)
```
This interval is heuristic only. It is not a calibrated statistical confidence interval.
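The interval computation can be sketched directly from the definitions above. The function name `heuristic_interval` is hypothetical.

```python
def heuristic_interval(factual_outcome: float, counterfactual_outcome: float) -> tuple:
    """Return (lower, upper) bounds centered on the counterfactual outcome."""
    effect = counterfactual_outcome - factual_outcome
    ci_half = abs(effect) * 0.1  # half-width is 10% of the absolute effect
    return (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)


# effect = -67.5, ci_half is about 6.75, so the bounds are roughly (375.75, 389.25).
lower, upper = heuristic_interval(450, 382.5)
```

One consequence of this definition is that the interval collapses to a single point whenever the effect is zero, as it is for `no_action`.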
### How the endpoint uses the explainer
The endpoint constructs a local state object and passes it to the explainer:
- `current_state["latency"] = event.latency_p99`
- `current_state["error_rate"] = event.error_rate`
- `current_state["last_action"] = {"action_type": "no_action"}`
It then creates:
- `proposed_action = {"action_type": optimal_action.value, "params": {}}`
and calls:
```text
CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")
```
The resulting explanation is returned as the `causal_explanation` section of the response, and its text is used to build the `justification` field of `healing_intent`.
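The wiring described above can be sketched as follows. `StubCausalExplainer` is a hypothetical stand-in, not the real `CausalExplainer`: its return shape is an assumption that mirrors three of the documented `causal_explanation` fields, and `build_explainer_inputs` is an illustrative helper name.

```python
class StubCausalExplainer:
    """Hypothetical stand-in for CausalExplainer; the return shape is assumed."""

    # Only the two actions the endpoint can select, taken from the impact table.
    LATENCY_EFFECT = {"restart_container": -0.15, "no_action": 0.0}

    def explain_healing_intent(self, proposed_action, current_state, outcome_metric):
        factual = current_state[outcome_metric]
        effect_frac = self.LATENCY_EFFECT[proposed_action["action_type"]]
        counterfactual = factual * (1 + effect_frac)
        return {
            "factual_outcome": factual,
            "counterfactual_outcome": counterfactual,
            "effect": counterfactual - factual,
        }


def build_explainer_inputs(latency_p99, error_rate, action_type):
    """Assemble the two arguments exactly as listed in this section."""
    current_state = {
        "latency": latency_p99,
        "error_rate": error_rate,
        "last_action": {"action_type": "no_action"},
    }
    proposed_action = {"action_type": action_type, "params": {}}
    return proposed_action, current_state


proposed_action, current_state = build_explainer_inputs(450, 0.25, "restart_container")
explanation = StubCausalExplainer().explain_healing_intent(
    proposed_action, current_state, "latency"
)
```

Note that `last_action` is always `no_action` in the state the endpoint builds, so the comparison baseline is fixed regardless of the incident.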
### Validation and error behavior
The endpoint uses Pydantic validation through the `ReliabilityEvent` model.
Expected behavior:
- valid requests return HTTP 200
- invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs
The current implementation does not define a custom error schema for validation failures, so FastAPI's default HTTP 422 validation error response applies.
### Advisory-only behavior
The response includes:
```json
"status": "oss_advisory_only"
```
This means:
- the endpoint recommends an action
- it does not perform the action
- it does not mutate incident state
- it does not trigger remediation workflows by itself
### Notes on implementation scope
The current endpoint is intentionally narrow:
- it bases the action choice on only two fields: `latency_p99` and `error_rate`
- it ignores `cpu_util`, `memory_util`, and `service_mesh` in the decision logic
- it always uses the latency metric in the causal explainer call
- it returns a fixed `expected_utility` value of `0.5`
### Example request
```bash
curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" -H "Content-Type: application/json" -d '{
"component": "payment-service",
"latency_p99": 450,
"error_rate": 0.25,
"service_mesh": "default",
"cpu_util": 0.85,
"memory_util": 0.90
}'
```
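The same request can be built in Python with the standard library. This is a sketch that assumes the service is reachable at the `localhost:8000` address used in the curl example; `evaluate_incident` is a hypothetical helper name and is only defined here, not called.

```python
import json
import urllib.request

payload = {
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90,
}

request = urllib.request.Request(
    "http://localhost:8000/api/v1/v1/incidents/evaluate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)


def evaluate_incident(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body; requires a running server."""
    with urllib.request.urlopen(req) as response:
        return json.load(response)
```

Calling `evaluate_incident(request)` against a running instance returns the parsed response as a dictionary.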
### Example response shape
```json
{
"healing_intent": {
"action": "restart_container",
"component": "payment-service",
"parameters": {},
"justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
"confidence": 0.85,
"risk_score": 0.39,
"status": "oss_advisory_only"
},
"causal_explanation": {
"factual_outcome": 450,
"counterfactual_outcome": 382.5,
"effect": -67.5,
"explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
"is_model_based": false,
"warnings": [
"Using heuristic causal model (no fitted SCM)."
]
},
"utility_decision": {
"best_action": "restart_container",
"expected_utility": 0.5,
"explanation": "Heuristic decision based on latency/error thresholds"
}
}
```
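A client can check a parsed response against the invariants documented on this page. The sketch below is illustrative only; `check_evaluate_response` is a hypothetical helper, and the checks encode the current behavior described here, which may change.

```python
import math


def check_evaluate_response(body: dict) -> list:
    """Return a list of problems found in a parsed response (empty if none)."""
    problems = []
    intent = body["healing_intent"]
    causal = body["causal_explanation"]
    utility = body["utility_decision"]

    if intent["status"] != "oss_advisory_only":
        problems.append("unexpected status")
    if intent["action"] not in ("restart_container", "no_action"):
        problems.append("unexpected action")
    if utility["best_action"] != intent["action"]:
        problems.append("best_action does not match action")
    if not 0.0 <= intent["risk_score"] <= 1.0:
        problems.append("risk_score outside [0, 1]")

    # effect is documented as counterfactual minus factual.
    expected_effect = causal["counterfactual_outcome"] - causal["factual_outcome"]
    if not math.isclose(causal["effect"], expected_effect, abs_tol=1e-6):
        problems.append("effect is not counterfactual minus factual")
    if causal["is_model_based"]:
        problems.append("is_model_based should be false")
    return problems
```

Passing the result of `json.loads` on a response body returns an empty list when every documented invariant holds.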
### Cross-reference
See `docs/examples.md` for a worked numerical example and `README.md` for a shorter overview.