# API Endpoints

This document describes the main ARF API endpoints and the request/response contracts used by the control plane.

## POST `/api/v1/v1/incidents/evaluate`

Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.

This endpoint is **advisory only**. It does not apply remediation, mutate infrastructure, or execute any healing action.

### Purpose

The endpoint takes a current incident snapshot, estimates risk, chooses a deterministic action, and explains the expected effect of that action on latency using a heuristic counterfactual model.

The implementation is intentionally simple:

- no fitted Structural Causal Model is used
- no machine learning model is required
- no historical training step is performed
- no action execution is triggered

### Request schema

The request body must match the `ReliabilityEvent` model.

```json
{
  "component": "string",
  "latency_p99": "number",
  "error_rate": "number",
  "service_mesh": "string",
  "cpu_util": "number | null",
  "memory_util": "number | null"
}
```

#### Fields

`component`
: Name of the service or component being evaluated.

`latency_p99`
: The current 99th percentile latency value. The endpoint uses this value for risk scoring, for the deterministic action threshold, and for the causal explanation.

`error_rate`
: The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.

`service_mesh`
: Optional service mesh name. Defaults to `"default"`.

`cpu_util`
: Optional CPU utilization value. Present in the request model, but not used by the current decision logic.

`memory_util`
: Optional memory utilization value. Present in the request model, but not used by the current decision logic.
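As a rough illustration, the fields above can be mirrored on the client side with a standard-library dataclass. `IncidentPayload` is a hypothetical helper name, not the service's actual `ReliabilityEvent` Pydantic model, and it performs no validation of its own.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class IncidentPayload:
    """Hypothetical client-side mirror of the documented request fields."""

    component: str
    latency_p99: float
    error_rate: float
    service_mesh: str = "default"  # documented default
    cpu_util: Optional[float] = None  # accepted, but unused by the decision logic
    memory_util: Optional[float] = None  # accepted, but unused by the decision logic

    def to_json(self) -> str:
        # Serialize to the JSON body shape shown in the request schema.
        return json.dumps(asdict(self))


payload = IncidentPayload(component="payment-service", latency_p99=450, error_rate=0.25)
body = payload.to_json()
```

Omitted optional fields serialize as `null` here, which matches the `number | null` types in the schema.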

### Response schema

The endpoint returns a JSON object with three top-level sections.

```json
{
  "healing_intent": {
    "action": "string",
    "component": "string",
    "parameters": {},
    "justification": "string",
    "confidence": 0.85,
    "risk_score": 0.0,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 0.0,
    "counterfactual_outcome": 0.0,
    "effect": 0.0,
    "explanation_text": "string",
    "is_model_based": false,
    "warnings": ["string"]
  },
  "utility_decision": {
    "best_action": "string",
    "expected_utility": 0.5,
    "explanation": "string"
  }
}
```

#### `healing_intent`

`action`
: The selected action. In the current implementation this is either `restart_container` or `no_action`.

`component`
: The input component name.

`parameters`
: Action parameters. The current implementation returns an empty object.

`justification`
: Human-readable explanation built from the causal explanation.

`confidence`
: Fixed confidence value returned by the endpoint. The current implementation uses `0.85`.

`risk_score`
: Heuristic risk score computed from latency and error rate.

`status`
: Always `oss_advisory_only`, indicating that the response is informational and not executable.

#### `causal_explanation`

`factual_outcome`
: The observed outcome value from the request context. The endpoint uses `latency_p99` as the explained metric.

`counterfactual_outcome`
: The estimated value under the proposed alternative action.

`effect`
: The difference `counterfactual_outcome - factual_outcome`. A negative value means the proposed action is expected to lower the metric.

`explanation_text`
: Natural-language explanation of the counterfactual effect.

`is_model_based`
: Always `false` in the current implementation.

`warnings`
: A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.

#### `utility_decision`

`best_action`
: The selected action, repeated for convenience.

`expected_utility`
: Fixed utility value returned by the current implementation. The endpoint uses `0.5`.

`explanation`
: Brief explanation that the choice came from heuristic latency and error thresholds.

### Deterministic decision logic

The endpoint uses the following rule to choose the action:

```text
optimal_action = RESTART_CONTAINER
if latency_p99 > 500 OR error_rate > 0.15
else NO_ACTION
```

In the implementation, this is encoded as:

- `restart_container` when `latency_p99 > 500` or `error_rate > 0.15`
- `no_action` otherwise

No probabilistic policy or learned policy is involved.
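The rule can be sketched as a small pure function. The function and constant names below are illustrative and are not taken from the ARF codebase; only the thresholds and action strings come from this document.

```python
LATENCY_P99_THRESHOLD = 500  # same unit as latency_p99 in the request
ERROR_RATE_THRESHOLD = 0.15


def choose_action(latency_p99: float, error_rate: float) -> str:
    """Return the advisory action using strict greater-than comparisons."""
    if latency_p99 > LATENCY_P99_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return "restart_container"
    return "no_action"
```

Because both comparisons are strict, a request sitting exactly on a threshold (`latency_p99 = 500`, `error_rate = 0.15`) yields `no_action`.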

### Heuristic risk score

The risk score is computed as:

```text
risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)
```

Properties of this score:

- stays within the interval `[0, 1]` for non-negative inputs
- weighted more heavily toward latency than error rate
- clipped at `1.0`
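A minimal sketch of the formula, with an illustrative function name, makes the weighting easier to check by hand:

```python
def risk_score(latency_p99: float, error_rate: float) -> float:
    """Heuristic risk: weighted sum of scaled latency and error rate, capped at 1.0."""
    raw = (latency_p99 / 1000) * 0.7 + error_rate * 0.3
    return min(1.0, raw)


# Worked example: 0.45 * 0.7 + 0.25 * 0.3 = 0.315 + 0.075
example = risk_score(450, 0.25)
```

Note that the score is only capped from above. The formula has no lower clip, so it relies on the inputs being non-negative.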

### Counterfactual model

The causal explainer uses a deterministic multiplicative heuristic:

```text
counterfactual_outcome = factual_outcome * (1 + effect_frac)
```

Where:

- `factual_outcome` is the observed metric value
- `effect_frac` is read from a fixed internal action-impact table
- the effect is multiplicative, not additive

For latency, the current action-impact mapping includes the following examples:

- `restart_container` → `latency_effect = -0.15`
- `scale_out` → `latency_effect = -0.20`
- `rollback` → `latency_effect = -0.25`
- `circuit_breaker` → `latency_effect = -0.05`
- `traffic_shift` → `latency_effect = -0.10`
- `alert_team` → `latency_effect = 0.0`
- `no_action` → `latency_effect = 0.0`

For error rate, the table includes a separate `error_rate_effect` per action, but the current endpoint calls the explainer with `outcome_metric="latency"`, so the returned counterfactual explanation is latency-based.
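The multiplicative model can be sketched as follows. The dictionary copies only the latency effects listed above; the real action-impact table may hold more entries, and the names `LATENCY_EFFECT` and `counterfactual_latency` are assumptions made for this example.

```python
from typing import Tuple

# Latency effect fractions as listed in this document.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual: float, action: str) -> Tuple[float, float]:
    """Return (counterfactual_outcome, effect) for a proposed action."""
    effect_frac = LATENCY_EFFECT[action]
    counterfactual = factual * (1 + effect_frac)  # multiplicative, not additive
    return counterfactual, counterfactual - factual


cf, effect = counterfactual_latency(450, "restart_container")
```

For a factual latency of 450, `restart_container` gives 450 × 0.85 = 382.5, an effect of −67.5.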

### Uncertainty interval

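The explainer attaches a fixed uncertainty interval to the counterfactual outcome. The half-width is 10% of the absolute estimated effect, and the interval is centered on `counterfactual_outcome`.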

Let:

```text
effect = counterfactual_outcome - factual_outcome
ci_half = abs(effect) * 0.1
confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)
```

This interval is heuristic only. It is not a calibrated statistical confidence interval.
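The interval arithmetic can be sketched like this, with `heuristic_interval` as a hypothetical helper name:

```python
from typing import Tuple


def heuristic_interval(
    factual: float, counterfactual: float, margin: float = 0.1
) -> Tuple[float, float]:
    """Return (low, high) centered on the counterfactual outcome."""
    effect = counterfactual - factual
    ci_half = abs(effect) * margin  # half-width scales with the effect size
    return counterfactual - ci_half, counterfactual + ci_half


low, high = heuristic_interval(450, 382.5)
```

With a factual value of 450 and a counterfactual of 382.5, the half-width is 6.75 and the interval is roughly (375.75, 389.25). When the effect is zero, as with `no_action`, the interval collapses to a single point.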

### How the endpoint uses the explainer

The endpoint constructs a local state object and passes it to the explainer:

- `current_state["latency"] = event.latency_p99`
- `current_state["error_rate"] = event.error_rate`
- `current_state["last_action"] = {"action_type": "no_action"}`

It then creates:

- `proposed_action = {"action_type": optimal_action.value, "params": {}}`

and calls:

```text
CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")
```

The resulting explanation is returned as the `causal_explanation` section, and its text is used to build the `justification` field of `healing_intent`.
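The flow described in this section can be sketched end to end. `StubExplainer` below is a hypothetical stand-in written for this example; the real `CausalExplainer` may return a different type and is not reproduced here. The `evaluate` function name and the `"Causal: "` prefix (inferred from the example response in this document) are likewise assumptions.

```python
from typing import Any, Dict

# Latency effects for the two actions this endpoint can select (values from this document).
LATENCY_EFFECT = {"restart_container": -0.15, "no_action": 0.0}


class StubExplainer:
    """Hypothetical stand-in for CausalExplainer; the real class may differ."""

    def explain_healing_intent(
        self, proposed_action: Dict[str, Any], current_state: Dict[str, Any], outcome_metric: str
    ) -> Dict[str, Any]:
        action = proposed_action["action_type"]
        previous = current_state["last_action"]["action_type"]
        factual = current_state[outcome_metric]
        counterfactual = factual * (1 + LATENCY_EFFECT[action])
        effect = counterfactual - factual
        text = (
            f"If we apply {action} instead of {previous}, {outcome_metric} would change "
            f"from {factual:.2f} to {counterfactual:.2f} (Δ = {effect:.2f}). "
            "Based on heuristic causal model."
        )
        return {
            "factual_outcome": factual,
            "counterfactual_outcome": counterfactual,
            "effect": effect,
            "explanation_text": text,
            "is_model_based": False,
            "warnings": ["Using heuristic causal model (no fitted SCM)."],
        }


def evaluate(component: str, latency_p99: float, error_rate: float) -> Dict[str, Any]:
    # Deterministic rule and risk score as documented in the sections above.
    action = "restart_container" if latency_p99 > 500 or error_rate > 0.15 else "no_action"
    risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)

    # State and proposed action, built the way this section describes.
    current_state = {
        "latency": latency_p99,
        "error_rate": error_rate,
        "last_action": {"action_type": "no_action"},
    }
    proposed_action = {"action_type": action, "params": {}}
    explanation = StubExplainer().explain_healing_intent(proposed_action, current_state, "latency")

    return {
        "healing_intent": {
            "action": action,
            "component": component,
            "parameters": {},
            "justification": "Causal: " + explanation["explanation_text"],
            "confidence": 0.85,
            "risk_score": risk,
            "status": "oss_advisory_only",
        },
        "causal_explanation": explanation,
        "utility_decision": {
            "best_action": action,
            "expected_utility": 0.5,
            "explanation": "Heuristic decision based on latency/error thresholds",
        },
    }


response = evaluate("payment-service", 450, 0.25)
```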

### Validation and error behavior

The endpoint uses Pydantic validation through the `ReliabilityEvent` model.

Expected behavior:

- valid requests return HTTP 200
- invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs; by default FastAPI responds with HTTP 422

The current implementation does not define a custom error schema for validation failures.
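Since no custom schema is defined, a client will most likely see FastAPI's default validation error body, a `detail` list whose entries carry `loc`, `msg`, and `type`. The helper below is a hedged sketch for turning such a body into readable lines; `summarize_validation_errors` is a hypothetical name, and the sample body is illustrative rather than captured from this service.

```python
from typing import Any, Dict, List


def summarize_validation_errors(body: Dict[str, Any]) -> List[str]:
    """Flatten a FastAPI-style validation error body into 'field.path: message' strings."""
    lines = []
    for err in body.get("detail", []):
        path = ".".join(str(part) for part in err.get("loc", []))
        lines.append(f"{path}: {err.get('msg', '')}")
    return lines


# Illustrative body only; exact messages and types depend on the Pydantic version.
sample = {
    "detail": [
        {"loc": ["body", "latency_p99"], "msg": "Field required", "type": "missing"}
    ]
}
summary = summarize_validation_errors(sample)
```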

### Advisory-only behavior

The response includes:

```json
"status": "oss_advisory_only"
```

This means:

- the endpoint recommends an action
- it does not perform the action
- it does not mutate incident state
- it does not trigger remediation workflows by itself
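On the consuming side, one way to respect this contract is to check the status marker before reading the recommendation, and to treat the result as something to log or display rather than execute. The helper name below is hypothetical.

```python
from typing import Any, Dict


def advisory_recommendation(response: Dict[str, Any]) -> str:
    """Return the recommended action, refusing responses that lack the advisory marker."""
    intent = response["healing_intent"]
    if intent.get("status") != "oss_advisory_only":
        raise ValueError(f"unexpected status: {intent.get('status')!r}")
    # The caller decides what to do with this; nothing is executed here.
    return intent["action"]


recommended = advisory_recommendation(
    {"healing_intent": {"action": "restart_container", "status": "oss_advisory_only"}}
)
```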

### Notes on implementation scope

The current endpoint is intentionally narrow:

- it bases the action choice on only two fields: `latency_p99` and `error_rate`
- it ignores `cpu_util`, `memory_util`, and `service_mesh` in the decision logic
- it always uses the latency metric in the causal explainer call
- it returns a fixed `expected_utility` value of `0.5`

### Example request

```bash
curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" \
  -H "Content-Type: application/json" \
  -d '{
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90
  }'
```
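The same request can be built in Python with the standard library alone. This is a sketch that assumes the local URL used in the curl example; the `send` helper is illustrative and needs a running server, so it is defined here but not called.

```python
import json
import urllib.request

URL = "http://localhost:8000/api/v1/v1/incidents/evaluate"  # same URL as the curl example

payload = {
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90,
}

# Build the POST request without sending it.
request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```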

### Example response shape

```json
{
  "healing_intent": {
    "action": "restart_container",
    "component": "payment-service",
    "parameters": {},
    "justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "confidence": 0.85,
    "risk_score": 0.39,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 450,
    "counterfactual_outcome": 382.5,
    "effect": -67.5,
    "explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "is_model_based": false,
    "warnings": [
      "Using heuristic causal model (no fitted SCM)."
    ]
  },
  "utility_decision": {
    "best_action": "restart_container",
    "expected_utility": 0.5,
    "explanation": "Heuristic decision based on latency/error thresholds"
  }
}
```

### Cross-reference

See `docs/examples.md` for a worked numerical example and `README.md` for a shorter overview.