# API Endpoints

This document describes the main ARF API endpoints and the request/response contracts used by the control plane.

## POST `/api/v1/v1/incidents/evaluate`

Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.

This endpoint is **advisory only**. It does not apply remediation, mutate infrastructure, or execute any healing action.

### Purpose

The endpoint takes a current incident snapshot, estimates risk, chooses a deterministic action, and explains the expected effect of that action on latency using a heuristic counterfactual model.

The implementation is intentionally simple:

- no fitted Structural Causal Model is used
- no machine learning model is required
- no historical training step is performed
- no action execution is triggered

### Request schema

The request body must match the `ReliabilityEvent` model.

```json
{
  "component": "string",
  "latency_p99": "number",
  "error_rate": "number",
  "service_mesh": "string",
  "cpu_util": "number | null",
  "memory_util": "number | null"
}
```

#### Fields

`component`
: Name of the service or component being evaluated.

`latency_p99`
: The current 99th percentile latency value. The endpoint uses this value both for risk scoring and for the causal explanation.

`error_rate`
: The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.

`service_mesh`
: Optional service mesh name. Defaults to `"default"`.

`cpu_util`
: Optional CPU utilization value. Present in the request model, but not used by the current decision logic.

`memory_util`
: Optional memory utilization value. Present in the request model, but not used by the current decision logic.
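The defaults described above can be mirrored on the client side with a plain dataclass. This is only a stand-in for illustration, not the actual `ReliabilityEvent` Pydantic model; the class name `ReliabilityEventPayload` is hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ReliabilityEventPayload:
    """Client-side stand-in for the request body (hypothetical name)."""

    component: str
    latency_p99: float
    error_rate: float
    service_mesh: str = "default"         # optional, defaults to "default"
    cpu_util: Optional[float] = None      # accepted, unused by the decision logic
    memory_util: Optional[float] = None   # accepted, unused by the decision logic


# Only the three required fields are supplied; the rest fall back to defaults.
payload = asdict(ReliabilityEventPayload("payment-service", 450, 0.25))
```

The resulting `payload` dictionary can be serialized with `json.dumps` and sent as the request body.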

### Response schema

The endpoint returns a JSON object with three top-level sections.

```json
{
  "healing_intent": {
    "action": "string",
    "component": "string",
    "parameters": {},
    "justification": "string",
    "confidence": 0.85,
    "risk_score": 0.0,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 0.0,
    "counterfactual_outcome": 0.0,
    "effect": 0.0,
    "explanation_text": "string",
    "is_model_based": false,
    "warnings": ["string"]
  },
  "utility_decision": {
    "best_action": "string",
    "expected_utility": 0.5,
    "explanation": "string"
  }
}
```

#### `healing_intent`

`action`
: The selected action. In the current implementation this is either `restart_container` or `no_action`.

`component`
: The input component name.

`parameters`
: Action parameters. The current implementation returns an empty object.

`justification`
: Human-readable explanation built from the causal explanation.

`confidence`
: Fixed confidence value returned by the endpoint. The current implementation uses `0.85`.

`risk_score`
: Heuristic risk score computed from latency and error rate.

`status`
: Always `oss_advisory_only`, indicating that the response is informational and not executable.

#### `causal_explanation`

`factual_outcome`
: The observed outcome value from the request context. The endpoint uses `latency_p99` as the explained metric.

`counterfactual_outcome`
: The estimated value under the proposed alternative action.

`effect`
: The difference `counterfactual_outcome - factual_outcome`. A negative value means the metric is expected to decrease under the proposed action.

`explanation_text`
: Natural-language explanation of the counterfactual effect.

`is_model_based`
: Always `false` in the current implementation.

`warnings`
: A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.

#### `utility_decision`

`best_action`
: The selected action, repeated for convenience.

`expected_utility`
: Fixed utility value returned by the current implementation. The endpoint uses `0.5`.

`explanation`
: Brief explanation that the choice came from heuristic latency and error thresholds.

### Deterministic decision logic

The endpoint uses the following rule to choose the action:

```text
optimal_action = RESTART_CONTAINER
if latency_p99 > 500 OR error_rate > 0.15
else NO_ACTION
```

In the implementation, this is encoded as:

- `restart_container` when `latency_p99 > 500` or `error_rate > 0.15`
- `no_action` otherwise

No probabilistic policy or learned policy is involved.
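A minimal sketch of the threshold rule above, written as a standalone function. The function and constant names are assumptions for illustration; only the thresholds and action names come from this document.

```python
LATENCY_THRESHOLD = 500       # compared against latency_p99
ERROR_RATE_THRESHOLD = 0.15   # compared against error_rate


def choose_action(latency_p99: float, error_rate: float) -> str:
    """Return the advisory action name using strict '>' comparisons."""
    if latency_p99 > LATENCY_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return "restart_container"
    return "no_action"
```

Because both comparisons are strict, an event sitting exactly on both thresholds (`latency_p99 = 500`, `error_rate = 0.15`) yields `no_action`.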

### Heuristic risk score

The risk score is computed as:

```text
risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)
```

Properties of this score:

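- stays within `[0, 1]` for non-negative inputs (the formula clips only the upper bound)
- weights latency (`0.7`) more heavily than error rate (`0.3`)
- saturates at `1.0`, which latency alone reaches once `latency_p99` exceeds roughly 1429 (`1000 / 0.7`)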
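The formula can be reproduced in a few lines. This is a sketch of the arithmetic only; the function name `risk_score` is an assumption, and the sample inputs are made up for illustration.

```python
def risk_score(latency_p99: float, error_rate: float) -> float:
    """Weighted sum of scaled latency and error rate, clipped at 1.0."""
    return min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)


# 300 / 1000 * 0.7 = 0.21 and 0.1 * 0.3 = 0.03, so the score is about 0.24.
moderate = risk_score(300, 0.1)

# 2000 / 1000 * 0.7 = 1.4 already exceeds 1.0, so the score is clipped.
saturated = risk_score(2000, 0.5)
```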

### Counterfactual model

The causal explainer uses a deterministic multiplicative heuristic:

```text
counterfactual_outcome = factual_outcome * (1 + effect_frac)
```

Where:

- `factual_outcome` is the observed metric value
- `effect_frac` is read from a fixed internal action-impact table
- the effect is multiplicative, not additive

For latency, the current action-impact mapping includes the following examples:

- `restart_container`: `latency_effect = -0.15`
- `scale_out`: `latency_effect = -0.20`
- `rollback`: `latency_effect = -0.25`
- `circuit_breaker`: `latency_effect = -0.05`
- `traffic_shift`: `latency_effect = -0.10`
- `alert_team`: `latency_effect = 0.0`
- `no_action`: `latency_effect = 0.0`

For error rate, the table includes a separate `error_rate_effect` per action, but the current endpoint calls the explainer with `outcome_metric="latency"`, so the returned counterfactual explanation is latency-based.
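The multiplicative heuristic can be sketched as a table lookup followed by one multiplication. The dictionary and function names below are assumptions; the effect values are the latency examples listed above, and the real table may contain more entries.

```python
# Latency effects copied from the examples listed in this section.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual: float, action: str) -> tuple[float, float]:
    """Return (counterfactual_outcome, effect) for a latency value.

    In this sketch an unknown action raises KeyError.
    """
    effect_frac = LATENCY_EFFECT[action]
    counterfactual = factual * (1 + effect_frac)
    return counterfactual, counterfactual - factual


# 450 * (1 - 0.15) is about 382.5, an effect of about -67.5.
cf, effect = counterfactual_latency(450, "restart_container")
```

Because the effect is multiplicative, the same action removes more absolute latency from a slow service than from a fast one.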

### Uncertainty interval

The explainer reports a fixed uncertainty interval centered on the counterfactual outcome, with a half-width equal to 10% of the absolute effect.

Let:

```text
effect = counterfactual_outcome - factual_outcome
ci_half = abs(effect) * 0.1
confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)
```

This interval is heuristic only. It is not a calibrated statistical confidence interval.
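As a worked illustration of the three definitions above, the following sketch computes the interval for a given pair of outcomes. The function name is an assumption.

```python
def heuristic_interval(factual: float, counterfactual: float) -> tuple[float, float]:
    """Interval centered on the counterfactual, half-width = 10% of |effect|."""
    effect = counterfactual - factual
    ci_half = abs(effect) * 0.1
    return (counterfactual - ci_half, counterfactual + ci_half)


# effect = 382.5 - 450 = -67.5, so ci_half is about 6.75
# and the interval is about (375.75, 389.25).
low, high = heuristic_interval(450, 382.5)
```

When the action has no effect (for example `no_action`), `ci_half` is zero and the interval collapses to a single point.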

### How the endpoint uses the explainer

The endpoint constructs a local state object and passes it to the explainer:

- `current_state["latency"] = event.latency_p99`
- `current_state["error_rate"] = event.error_rate`
- `current_state["last_action"] = {"action_type": "no_action"}`

It then creates:

- `proposed_action = {"action_type": optimal_action.value, "params": {}}`

and calls:

```text
CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")
```

The resulting explanation is returned as `causal_explanation`, and its text is reused (prefixed with `Causal:`) as the `justification` field of `healing_intent`.
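The two inputs described above can be assembled as plain dictionaries. The helper name `build_explainer_inputs` is hypothetical, and the sketch stops short of importing `CausalExplainer` because its module path is not given in this document.

```python
def build_explainer_inputs(latency_p99: float, error_rate: float, action: str):
    """Build (proposed_action, current_state) in the shape described above."""
    current_state = {
        "latency": latency_p99,
        "error_rate": error_rate,
        # The endpoint always treats the baseline as "no_action".
        "last_action": {"action_type": "no_action"},
    }
    proposed_action = {"action_type": action, "params": {}}
    return proposed_action, current_state


proposed_action, current_state = build_explainer_inputs(450, 0.25, "restart_container")
```

These two values, together with the metric name `"latency"`, are what the endpoint passes to `explain_healing_intent`.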

### Validation and error behavior

The endpoint uses Pydantic validation through the `ReliabilityEvent` model.

Expected behavior:

- valid requests return HTTP 200
- invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs

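The current implementation does not define a custom error schema for validation failures, so clients receive FastAPI's default validation error response (HTTP 422 with a `detail` list) unless the application overrides that handler.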

### Advisory-only behavior

The response includes:

```json
"status": "oss_advisory_only"
```

This means:

- the endpoint recommends an action
- it does not perform the action
- it does not mutate incident state
- it does not trigger remediation workflows by itself

### Notes on implementation scope

The current endpoint is intentionally narrow:

- it bases the action choice on only two fields: `latency_p99` and `error_rate`
- it ignores `cpu_util`, `memory_util`, and `service_mesh` in the decision logic
- it always uses the latency metric in the causal explainer call
- it returns a fixed `expected_utility` value of `0.5`

### Example request

```bash
curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" \
  -H "Content-Type: application/json" \
  -d '{
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90
  }'
```
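The same request can be issued from Python using only the standard library. This is a sketch under the same assumptions as the `curl` example (a server listening on `localhost:8000`); the helper names `build_request` and `evaluate_incident` are hypothetical.

```python
import json
import urllib.request

URL = "http://localhost:8000/api/v1/v1/incidents/evaluate"


def build_request(payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate_incident(payload: dict, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(payload), timeout=timeout) as resp:
        return json.load(resp)


request = build_request({
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90,
})
```

Calling `evaluate_incident(...)` with the same payload would perform the actual HTTP call; `build_request` alone has no network side effects.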

### Example response shape

```json
{
  "healing_intent": {
    "action": "restart_container",
    "component": "payment-service",
    "parameters": {},
    "justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "confidence": 0.85,
    "risk_score": 0.39,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 450,
    "counterfactual_outcome": 382.5,
    "effect": -67.5,
    "explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "is_model_based": false,
    "warnings": [
      "Using heuristic causal model (no fitted SCM)."
    ]
  },
  "utility_decision": {
    "best_action": "restart_container",
    "expected_utility": 0.5,
    "explanation": "Heuristic decision based on latency/error thresholds"
  }
}
```

### Cross-reference

See `docs/examples.md` for a worked numerical example and `README.md` for a shorter overview.