# API Endpoints
This document describes the main ARF API endpoints and the request/response contracts used by the control plane.
## POST `/api/v1/v1/incidents/evaluate`
Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.
This endpoint is **advisory only**. It does not apply remediation, mutate infrastructure, or execute any healing action.
### Purpose
The endpoint takes a current incident snapshot, estimates risk, chooses a deterministic action, and explains the expected effect of that action on latency using a heuristic counterfactual model.
The implementation is intentionally simple:
- no fitted Structural Causal Model is used
- no machine learning model is required
- no historical training step is performed
- no action execution is triggered
### Request schema
The request body must match the `ReliabilityEvent` model.
```json
{
  "component": "string",
  "latency_p99": "number",
  "error_rate": "number",
  "service_mesh": "string",
  "cpu_util": "number | null",
  "memory_util": "number | null"
}
```
#### Fields
`component`
: Name of the service or component being evaluated.
`latency_p99`
: The current 99th percentile latency value. The endpoint uses this value for risk scoring, for the deterministic action threshold, and for the causal explanation.
`error_rate`
: The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.
`service_mesh`
: Optional service mesh name. Defaults to `"default"`. Not used by the current decision logic.
`cpu_util`
: Optional CPU utilization value. Present in the request model, but not used by the current decision logic.
`memory_util`
: Optional memory utilization value. Present in the request model, but not used by the current decision logic.
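The field list above maps onto a model roughly like the following. This is a minimal sketch using a standard-library dataclass; the actual `ReliabilityEvent` is a Pydantic model whose exact field types and constraints are not shown in this document, and `ReliabilityEventSketch` is a hypothetical name used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReliabilityEventSketch:
    """Illustrative stand-in for the ReliabilityEvent request model."""

    component: str                        # service or component being evaluated
    latency_p99: float                    # current 99th percentile latency
    error_rate: float                     # current error rate
    service_mesh: str = "default"         # optional, defaults to "default"
    cpu_util: Optional[float] = None      # accepted, not used by the decision logic
    memory_util: Optional[float] = None   # accepted, not used by the decision logic


# Only the three required fields need to be supplied.
event = ReliabilityEventSketch(
    component="payment-service", latency_p99=450, error_rate=0.25
)
```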
### Response schema
The endpoint returns a JSON object with three top-level sections.
```json
{
  "healing_intent": {
    "action": "string",
    "component": "string",
    "parameters": {},
    "justification": "string",
    "confidence": 0.85,
    "risk_score": 0.0,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 0.0,
    "counterfactual_outcome": 0.0,
    "effect": 0.0,
    "explanation_text": "string",
    "is_model_based": false,
    "warnings": ["string"]
  },
  "utility_decision": {
    "best_action": "string",
    "expected_utility": 0.5,
    "explanation": "string"
  }
}
```
#### `healing_intent`
`action`
: The selected action. In the current implementation this is either `restart_container` or `no_action`.
`component`
: The input component name.
`parameters`
: Action parameters. The current implementation returns an empty object.
`justification`
: Human-readable explanation built from the causal explanation.
`confidence`
: Fixed confidence value returned by the endpoint. The current implementation uses `0.85`.
`risk_score`
: Heuristic risk score computed from latency and error rate.
`status`
: Always `oss_advisory_only`, indicating that the response is informational and not executable.
#### `causal_explanation`
`factual_outcome`
: The observed outcome value from the request context. The endpoint uses `latency_p99` as the explained metric.
`counterfactual_outcome`
: The estimated value under the proposed alternative action.
`effect`
: The difference `counterfactual_outcome - factual_outcome`. A negative value means the proposed action is expected to reduce the metric.
`explanation_text`
: Natural-language explanation of the counterfactual effect.
`is_model_based`
: Always `false` in the current implementation.
`warnings`
: A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.
#### `utility_decision`
`best_action`
: The selected action, repeated for convenience.
`expected_utility`
: Fixed utility value returned by the current implementation. The endpoint uses `0.5`.
`explanation`
: Brief explanation that the choice came from heuristic latency and error thresholds.
### Deterministic decision logic
The endpoint uses the following rule to choose the action:
```text
optimal_action = RESTART_CONTAINER
if latency_p99 > 500 OR error_rate > 0.15
else NO_ACTION
```
In the implementation, this is encoded as:
- `restart_container` when `latency_p99 > 500` or `error_rate > 0.15`
- `no_action` otherwise
No probabilistic policy or learned policy is involved.
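The rule can be sketched in Python as follows. The function and constant names are hypothetical; only the thresholds and the two action strings come from this document. Note that both comparisons are strict, so a value exactly at a threshold yields `no_action`.

```python
LATENCY_P99_THRESHOLD = 500    # same units as latency_p99 in the request
ERROR_RATE_THRESHOLD = 0.15


def choose_action(latency_p99: float, error_rate: float) -> str:
    """Return the advisory action for the documented threshold rule."""
    if latency_p99 > LATENCY_P99_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return "restart_container"
    return "no_action"


# Latency is below its threshold, but the error rate exceeds 0.15.
action = choose_action(latency_p99=450, error_rate=0.25)
```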
### Heuristic risk score
The risk score is computed as:
```text
risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)
```
Properties of this score:
- bounded to the interval `[0, 1]` for non-negative inputs
- weighted more heavily toward latency than error rate
- clipped at `1.0`
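A minimal sketch of the formula, with a hypothetical function name and illustrative input values that do not come from any real incident:

```python
def heuristic_risk(latency_p99: float, error_rate: float) -> float:
    """Weighted sum of scaled latency and error rate, clipped at 1.0."""
    return min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)


# 0.8 * 0.7 + 0.10 * 0.3 = 0.56 + 0.03 = 0.59
moderate = heuristic_risk(latency_p99=800, error_rate=0.10)

# 2.0 * 0.7 + 0.5 * 0.3 = 1.55, which is clipped to 1.0
saturated = heuristic_risk(latency_p99=2000, error_rate=0.5)
```

Because of the `0.7` weight, a `latency_p99` of roughly 1429 or more saturates the score on its own, regardless of the error rate.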
### Counterfactual model
The causal explainer uses a deterministic multiplicative heuristic:
```text
counterfactual_outcome = factual_outcome * (1 + effect_frac)
```
Where:
- `factual_outcome` is the observed metric value
- `effect_frac` is read from a fixed internal action-impact table
- the effect is multiplicative, not additive
For latency, the current action-impact mapping includes the following examples:
- `restart_container` → `latency_effect = -0.15`
- `scale_out` → `latency_effect = -0.20`
- `rollback` → `latency_effect = -0.25`
- `circuit_breaker` → `latency_effect = -0.05`
- `traffic_shift` → `latency_effect = -0.10`
- `alert_team` → `latency_effect = 0.0`
- `no_action` → `latency_effect = 0.0`
For error rate, the table includes a separate `error_rate_effect` per action, but the current endpoint calls the explainer with `outcome_metric="latency"`, so the returned counterfactual explanation is latency-based.
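The multiplicative heuristic can be illustrated with the latency values listed above. The dictionary and function names are hypothetical, and how the real explainer handles an action missing from its table is not described here; this sketch simply raises `KeyError`.

```python
# Latency effects copied from the list in this document.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual_outcome: float, action: str) -> float:
    """Apply the fractional effect multiplicatively to the observed latency."""
    effect_frac = LATENCY_EFFECT[action]  # KeyError for unknown actions (sketch only)
    return factual_outcome * (1 + effect_frac)


# 450 * (1 - 0.15) = 382.5
estimate = counterfactual_latency(450, "restart_container")
```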
### Uncertainty interval
The explainer attaches a fixed uncertainty margin to the counterfactual estimate: the half-width is 10% of the absolute estimated effect, centered on `counterfactual_outcome`.
Let:
```text
effect = counterfactual_outcome - factual_outcome
ci_half = abs(effect) * 0.1
confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)
```
This interval is heuristic only. It is not a calibrated statistical confidence interval.
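As a worked sketch of the definitions above (the function name is hypothetical), a latency of 450 with an estimated counterfactual of 382.5 gives an effect of -67.5, a half-width of 6.75, and an interval of approximately (375.75, 389.25):

```python
def heuristic_interval(factual_outcome: float, counterfactual_outcome: float):
    """Return the heuristic (low, high) band around the counterfactual estimate."""
    effect = counterfactual_outcome - factual_outcome
    ci_half = abs(effect) * 0.1
    return (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)


low, high = heuristic_interval(450, 382.5)

# When the estimated effect is zero (for example, no_action), the band has zero width.
flat = heuristic_interval(450, 450)
```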
### How the endpoint uses the explainer
The endpoint constructs a local state object and passes it to the explainer:
- `current_state["latency"] = event.latency_p99`
- `current_state["error_rate"] = event.error_rate`
- `current_state["last_action"] = {"action_type": "no_action"}`
It then creates:
- `proposed_action = {"action_type": optimal_action.value, "params": {}}`
and calls:
```text
CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")
```
The resulting explanation is returned as `causal_explanation`, and its text is used to build the `justification` field of `healing_intent`.
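The call sequence can be approximated with a stand-in function. Everything below is an assumption made for illustration: `explain_healing_intent_sketch` is not the real `CausalExplainer`, it supports only the latency metric, and the explanation wording and the `"Causal: "` prefix are inferred from the example response in this document.

```python
# Subset of the documented latency impact table.
LATENCY_EFFECT = {"restart_container": -0.15, "no_action": 0.0}


def explain_healing_intent_sketch(proposed_action, current_state, outcome_metric):
    """Hypothetical stand-in for CausalExplainer.explain_healing_intent (latency only)."""
    factual = current_state[outcome_metric]
    new_action = proposed_action["action_type"]
    old_action = current_state["last_action"]["action_type"]
    counterfactual = factual * (1 + LATENCY_EFFECT[new_action])
    effect = counterfactual - factual
    text = (
        f"If we apply {new_action} instead of {old_action}, {outcome_metric} "
        f"would change from {factual:.2f} to {counterfactual:.2f} "
        f"(Δ = {effect:.2f}). Based on heuristic causal model."
    )
    return {
        "factual_outcome": factual,
        "counterfactual_outcome": counterfactual,
        "effect": effect,
        "explanation_text": text,
        "is_model_based": False,
        "warnings": ["Using heuristic causal model (no fitted SCM)."],
    }


# State and proposed action built the way the endpoint is described to build them.
current_state = {
    "latency": 450,
    "error_rate": 0.25,
    "last_action": {"action_type": "no_action"},
}
proposed_action = {"action_type": "restart_container", "params": {}}

explanation = explain_healing_intent_sketch(proposed_action, current_state, "latency")
justification = "Causal: " + explanation["explanation_text"]
```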
### Validation and error behavior
The endpoint uses Pydantic validation through the `ReliabilityEvent` model.
Expected behavior:
- valid requests return HTTP 200
- invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs
The current implementation does not define a custom error schema for validation failures, so clients receive FastAPI's default validation error response (HTTP 422 with a `detail` array).
### Advisory-only behavior
The response includes:
```json
"status": "oss_advisory_only"
```
This means:
- the endpoint recommends an action
- it does not perform the action
- it does not mutate incident state
- it does not trigger remediation workflows by itself
### Notes on implementation scope
The current endpoint is intentionally narrow:
- it bases the action choice on only two fields: `latency_p99` and `error_rate`
- it ignores `cpu_util`, `memory_util`, and `service_mesh` in the decision logic
- it always uses the latency metric in the causal explainer call
- it returns a fixed `expected_utility` value of `0.5`
### Example request
```bash
curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" \
  -H "Content-Type: application/json" \
  -d '{
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90
  }'
```
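The same request can be issued from Python using only the standard library. This is a sketch: the helper names are hypothetical, and the base URL assumes a local server on port 8000 as in the `curl` command.

```python
import json
import urllib.request

ENDPOINT_PATH = "/api/v1/v1/incidents/evaluate"


def build_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the evaluate endpoint."""
    return urllib.request.Request(
        base_url + ENDPOINT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate_incident(base_url: str, payload: dict) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(base_url, payload)) as response:
        return json.loads(response.read().decode("utf-8"))


payload = {
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90,
}
request = build_request("http://localhost:8000", payload)
```

Calling `evaluate_incident("http://localhost:8000", payload)` requires a running server; `build_request` alone does not touch the network.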
### Example response shape
```json
{
  "healing_intent": {
    "action": "restart_container",
    "component": "payment-service",
    "parameters": {},
    "justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "confidence": 0.85,
    "risk_score": 0.39,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 450,
    "counterfactual_outcome": 382.5,
    "effect": -67.5,
    "explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "is_model_based": false,
    "warnings": [
      "Using heuristic causal model (no fitted SCM)."
    ]
  },
  "utility_decision": {
    "best_action": "restart_container",
    "expected_utility": 0.5,
    "explanation": "Heuristic decision based on latency/error thresholds"
  }
}
```
### Cross-reference
See `docs/examples.md` for a worked numerical example and `README.md` for a shorter overview.