Agentic-Reliability-Framework-API / docs /docs_endpoints.md

petter2025

Upload folder using huggingface_hub (#3)

6d20eab about 23 hours ago

	# API Endpoints

	This document describes the main ARF API endpoints and the request/response contracts used by the control plane.

	## POST `/api/v1/v1/incidents/evaluate`

	Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.

	This endpoint is advisory only. It does not apply remediation, mutate infrastructure, or execute any healing action.

	### Purpose

	The endpoint takes a current incident snapshot, estimates risk, chooses a deterministic action, and explains the expected effect of that action on latency using a heuristic counterfactual model.

	The implementation is intentionally simple:

	- no fitted Structural Causal Model is used
	- no machine learning model is required
	- no historical training step is performed
	- no action execution is triggered

	### Request schema

	The request body must match the `ReliabilityEvent` model.

	```json
	{
	"component": "string",
	"latency_p99": "number",
	"error_rate": "number",
	"service_mesh": "string",
	"cpu_util": "number | null",
	"memory_util": "number | null"
	}
	```

	#### Fields

	`component`
	: Name of the service or component being evaluated.

	`latency_p99`
	: The current 99th percentile latency value. The endpoint uses this value both for risk scoring and for the causal explanation.

	`error_rate`
	: The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.

	`service_mesh`
	: Optional service mesh name. Defaults to `"default"`.

	`cpu_util`
	: Optional CPU utilization value. Present in the request model, but not used by the current decision logic.

	`memory_util`
	: Optional memory utilization value. Present in the request model, but not used by the current decision logic.
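For building request bodies on the client side, the fields above can be mirrored in a small stand-in type. This is a minimal sketch using a standard-library dataclass, not the service's Pydantic `ReliabilityEvent` model; the class name `ReliabilityEventPayload` and the `to_json` helper are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ReliabilityEventPayload:
    """Hypothetical client-side mirror of the documented request fields."""

    component: str
    latency_p99: float
    error_rate: float
    service_mesh: str = "default"       # documented default
    cpu_util: Optional[float] = None    # accepted, but unused by the decision logic
    memory_util: Optional[float] = None  # accepted, but unused by the decision logic

    def to_json(self) -> str:
        """Serialize the payload to a JSON request body."""
        return json.dumps(asdict(self))


payload = ReliabilityEventPayload("payment-service", latency_p99=450, error_rate=0.25)
print(payload.to_json())
```

Omitted optional fields serialize as `null`, which matches the `number | null` types in the schema.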

	### Response schema

	The endpoint returns a JSON object with three top-level sections.

	```json
	{
	"healing_intent": {
	"action": "string",
	"component": "string",
	"parameters": {},
	"justification": "string",
	"confidence": 0.85,
	"risk_score": 0.0,
	"status": "oss_advisory_only"
	},
	"causal_explanation": {
	"factual_outcome": 0.0,
	"counterfactual_outcome": 0.0,
	"effect": 0.0,
	"explanation_text": "string",
	"is_model_based": false,
	"warnings": ["string"]
	},
	"utility_decision": {
	"best_action": "string",
	"expected_utility": 0.5,
	"explanation": "string"
	}
	}
	```

	#### `healing_intent`

	`action`
	: The selected action. In the current implementation this is either `restart_container` or `no_action`.

	`component`
	: The input component name.

	`parameters`
	: Action parameters. The current implementation returns an empty object.

	`justification`
	: Human-readable explanation built from the causal explanation.

	`confidence`
	: Fixed confidence value returned by the endpoint. The current implementation uses `0.85`.

	`risk_score`
	: Heuristic risk score computed from latency and error rate.

	`status`
	: Always `oss_advisory_only`, indicating that the response is informational and not executable.

	#### `causal_explanation`

	`factual_outcome`
	: The observed outcome value from the request context. The endpoint uses `latency_p99` as the explained metric.

	`counterfactual_outcome`
	: The estimated value under the proposed alternative action.

	`effect`
	: The difference `counterfactual_outcome - factual_outcome`. A negative value means the proposed action is expected to reduce the metric.

	`explanation_text`
	: Natural-language explanation of the counterfactual effect.

	`is_model_based`
	: Always `false` in the current implementation.

	`warnings`
	: A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.

	#### `utility_decision`

	`best_action`
	: The selected action, repeated for convenience.

	`expected_utility`
	: Fixed utility value returned by the current implementation. The endpoint uses `0.5`.

	`explanation`
	: Brief explanation that the choice came from heuristic latency and error thresholds.

	### Deterministic decision logic

	The endpoint uses the following rule to choose the action:

	```text
	optimal_action = RESTART_CONTAINER
	if latency_p99 > 500 OR error_rate > 0.15
	else NO_ACTION
	```

	In the implementation, this is encoded as:

	- `restart_container` when `latency_p99 > 500` or `error_rate > 0.15`
	- `no_action` otherwise

	No probabilistic policy or learned policy is involved.
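The rule above can be sketched as a small pure function. The enum name `HealingAction` and the constant names are assumptions for illustration; only the two action strings and the two thresholds come from this document.

```python
from enum import Enum


class HealingAction(Enum):
    # Hypothetical enum; only the string values are documented.
    RESTART_CONTAINER = "restart_container"
    NO_ACTION = "no_action"


LATENCY_THRESHOLD = 500      # same unit as latency_p99
ERROR_RATE_THRESHOLD = 0.15


def choose_action(latency_p99: float, error_rate: float) -> HealingAction:
    """Apply the documented threshold rule; both comparisons are strict."""
    if latency_p99 > LATENCY_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return HealingAction.RESTART_CONTAINER
    return HealingAction.NO_ACTION


print(choose_action(450, 0.25).value)  # restart_container (error rate exceeds 0.15)
print(choose_action(500, 0.15).value)  # no_action (values equal to a threshold do not trigger)
```

Because the comparisons are strict, an event sitting exactly on both thresholds yields `no_action`.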

	### Heuristic risk score

	The risk score is computed as:

	```text
	risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)
	```

	Properties of this score:

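	- bounded to the interval `[0, 1]` for non-negative inputs (the formula caps the upper end only)
	- weighted more heavily toward latency (`0.7`) than error rate (`0.3`), with `latency_p99` divided by 1000 before weighting
	- saturates at `1.0` once the weighted sum reaches that value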
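As a minimal sketch, the formula translates directly into Python. The function name `risk_score` is an assumption; the weights and the cap are taken from the formula above.

```python
def risk_score(latency_p99: float, error_rate: float) -> float:
    """Weighted sum of scaled latency and error rate, capped at 1.0."""
    return min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)


# 300 / 1000 * 0.7 = 0.21, plus 0.10 * 0.3 = 0.03
print(round(risk_score(300, 0.10), 4))  # 0.24

# 2000 / 1000 * 0.7 = 1.4 already exceeds the cap
print(risk_score(2000, 0.5))  # 1.0
```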

	### Counterfactual model

	The causal explainer uses a deterministic multiplicative heuristic:

	```text
	counterfactual_outcome = factual_outcome * (1 + effect_frac)
	```

	Where:

	- `factual_outcome` is the observed metric value
	- `effect_frac` is read from a fixed internal action-impact table
	- the effect is multiplicative, not additive

	For latency, the current action-impact mapping includes the following examples:

	- `restart_container` → `latency_effect = -0.15`
	- `scale_out` → `latency_effect = -0.20`
	- `rollback` → `latency_effect = -0.25`
	- `circuit_breaker` → `latency_effect = -0.05`
	- `traffic_shift` → `latency_effect = -0.10`
	- `alert_team` → `latency_effect = 0.0`
	- `no_action` → `latency_effect = 0.0`

	For error rate, the table includes a separate `error_rate_effect` per action, but the current endpoint calls the explainer with `outcome_metric="latency"`, so the returned counterfactual explanation is latency-based.
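The latency mapping and the multiplicative formula can be combined into a short worked sketch. The dictionary name `LATENCY_EFFECT` and the function name are hypothetical; the effect values are the ones listed above, and the internal table in the explainer may be structured differently.

```python
from typing import Tuple

# Latency effects per action, as listed in this document.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual: float, action: str) -> Tuple[float, float]:
    """Return (counterfactual_outcome, effect) for a proposed action."""
    effect_frac = LATENCY_EFFECT[action]
    counterfactual = factual * (1 + effect_frac)  # multiplicative, not additive
    return counterfactual, counterfactual - factual


outcome, effect = counterfactual_latency(450.0, "restart_container")
print(round(outcome, 2), round(effect, 2))  # 382.5 -67.5
```

Since the effect scales with the factual value, the same action removes more absolute latency from a slower service than from a faster one.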

	### Uncertainty interval

	The explainer applies a fixed uncertainty margin equal to 10% of the absolute estimated effect, centered on the counterfactual outcome.

	Let:

	```text
	effect = counterfactual_outcome - factual_outcome
	ci_half = abs(effect) * 0.1
	confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)
	```

	This interval is heuristic only. It is not a calibrated statistical confidence interval.
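A worked sketch of the interval, assuming a hypothetical helper named `heuristic_interval` that takes the factual and counterfactual values:

```python
from typing import Tuple


def heuristic_interval(factual: float, counterfactual: float) -> Tuple[float, float]:
    """Return the fixed-margin interval around the counterfactual outcome."""
    effect = counterfactual - factual
    ci_half = abs(effect) * 0.1  # 10% of the effect's magnitude
    return counterfactual - ci_half, counterfactual + ci_half


# effect = 382.5 - 450 = -67.5, so ci_half = 6.75
low, high = heuristic_interval(450.0, 382.5)
print(round(low, 2), round(high, 2))  # 375.75 389.25
```

When the effect is zero (for example with `no_action`), the interval collapses to a single point.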

	### How the endpoint uses the explainer

	The endpoint constructs a local state object and passes it to the explainer:

	- `current_state["latency"] = event.latency_p99`
	- `current_state["error_rate"] = event.error_rate`
	- `current_state["last_action"] = {"action_type": "no_action"}`

	It then creates:

	- `proposed_action = {"action_type": optimal_action.value, "params": {}}`

	and calls:

	```text
	CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")
	```

	The resulting explanation is returned in full under `causal_explanation`, and its text is also used to build `healing_intent.justification`.

	### Validation and error behavior

	The endpoint uses Pydantic validation through the `ReliabilityEvent` model.

	Expected behavior:

	- valid requests return HTTP 200
	- invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs

	The current implementation does not define a custom error schema for validation failures, so FastAPI's default validation error response (HTTP 422 with a `detail` list) applies.

	### Advisory-only behavior

	The response includes:

	```json
	"status": "oss_advisory_only"
	```

	This means:

	- the endpoint recommends an action
	- it does not perform the action
	- it does not mutate incident state
	- it does not trigger remediation workflows by itself
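A client that consumes the response may want to check this marker before reading the recommendation. The following is a minimal sketch; the function names are hypothetical and not part of the API.

```python
from typing import Any, Dict


def is_advisory_only(response: Dict[str, Any]) -> bool:
    """True when the response carries the documented advisory status."""
    return response.get("healing_intent", {}).get("status") == "oss_advisory_only"


def recommended_action(response: Dict[str, Any]) -> str:
    """Return the recommended action name without executing anything."""
    if not is_advisory_only(response):
        raise ValueError("unexpected status; refusing to interpret response")
    return response["healing_intent"]["action"]


example = {"healing_intent": {"action": "restart_container", "status": "oss_advisory_only"}}
print(recommended_action(example))  # restart_container
```

Any execution of the recommended action remains the caller's responsibility.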

	### Notes on implementation scope

	The current endpoint is intentionally narrow:

	- it bases the action choice on only two fields: `latency_p99` and `error_rate`
	- it ignores `cpu_util`, `memory_util`, and `service_mesh` in the decision logic
	- it always uses the latency metric in the causal explainer call
	- it returns a fixed `expected_utility` value of `0.5`

	### Example request

	```bash
	curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" -H "Content-Type: application/json" -d '{
	"component": "payment-service",
	"latency_p99": 450,
	"error_rate": 0.25,
	"service_mesh": "default",
	"cpu_util": 0.85,
	"memory_util": 0.90
	}'
	```
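The same request can be issued from Python with the standard library. This is a sketch that assumes the API is reachable at the local URL used in the `curl` example; the helper names `build_request` and `evaluate_incident` are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict

URL = "http://localhost:8000/api/v1/v1/incidents/evaluate"


def build_request(event: Dict[str, Any], url: str = URL) -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate_incident(event: Dict[str, Any], url: str = URL, timeout: float = 10.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_request(event, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


event = {
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90,
}
request = build_request(event)
print(request.get_method(), request.full_url)
# Call evaluate_incident(event) once the API is running locally.
```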

	### Example response shape

	```json
	{
	"healing_intent": {
	"action": "restart_container",
	"component": "payment-service",
	"parameters": {},
	"justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
	"confidence": 0.85,
	"risk_score": 0.39,
	"status": "oss_advisory_only"
	},
	"causal_explanation": {
	"factual_outcome": 450,
	"counterfactual_outcome": 382.5,
	"effect": -67.5,
	"explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
	"is_model_based": false,
	"warnings": [
	"Using heuristic causal model (no fitted SCM)."
	]
	},
	"utility_decision": {
	"best_action": "restart_container",
	"expected_utility": 0.5,
	"explanation": "Heuristic decision based on latency/error thresholds"
	}
	}
	```

	### Cross-reference

	See `docs/examples.md` for a worked numerical example and `README.md` for a shorter overview.