API Endpoints

This document describes the main ARF API endpoints and the request/response contracts used by the control plane.

POST `/api/v1/v1/incidents/evaluate`

Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.

This endpoint is advisory only. It does not apply remediation, mutate infrastructure, or execute any healing action.

Purpose

The endpoint takes a current incident snapshot, estimates risk, chooses a deterministic action, and explains the expected effect of that action on latency using a heuristic counterfactual model.

The implementation is intentionally simple:

no fitted Structural Causal Model is used
no machine learning model is required
no historical training step is performed
no action execution is triggered

Request schema

The request body must match the ReliabilityEvent model.

{
  "component": "string",
  "latency_p99": "number",
  "error_rate": "number",
  "service_mesh": "string",
  "cpu_util": "number | null",
  "memory_util": "number | null"
}

Fields

component : Name of the service or component being evaluated.

latency_p99 : The current 99th percentile latency value. The endpoint uses this value for risk scoring, for the deterministic action threshold, and as the metric in the causal explanation.

error_rate : The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.

service_mesh : Optional service mesh name. Defaults to "default". Present in the request model, but not used by the current decision logic.

cpu_util : Optional CPU utilization value. Present in the request model, but not used by the current decision logic.

memory_util : Optional memory utilization value. Present in the request model, but not used by the current decision logic.
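
The field list above can be summarized as a typed structure. The sketch below is a minimal stand-in built on the standard-library dataclasses module, not the actual Pydantic ReliabilityEvent model; the class name ReliabilityEventSketch is hypothetical, and any validators or value constraints in the real model are not shown in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReliabilityEventSketch:
    """Hypothetical stand-in for ReliabilityEvent; field names follow the request schema."""

    component: str                        # service or component being evaluated
    latency_p99: float                    # current 99th percentile latency
    error_rate: float                     # current error rate
    service_mesh: str = "default"         # optional, defaults to "default"
    cpu_util: Optional[float] = None      # accepted, unused by the decision logic
    memory_util: Optional[float] = None   # accepted, unused by the decision logic


# Only the three required fields are supplied; the rest fall back to defaults.
event = ReliabilityEventSketch(
    component="payment-service", latency_p99=450, error_rate=0.25
)
print(event.service_mesh)
print(event.cpu_util)
```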

Response schema

The endpoint returns a JSON object with three top-level sections.

{
  "healing_intent": {
    "action": "string",
    "component": "string",
    "parameters": {},
    "justification": "string",
    "confidence": 0.85,
    "risk_score": 0.0,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 0.0,
    "counterfactual_outcome": 0.0,
    "effect": 0.0,
    "explanation_text": "string",
    "is_model_based": false,
    "warnings": ["string"]
  },
  "utility_decision": {
    "best_action": "string",
    "expected_utility": 0.5,
    "explanation": "string"
  }
}

`healing_intent`

action : The selected action. In the current implementation this is either restart_container or no_action.

component : The input component name.

parameters : Action parameters. The current implementation returns an empty object.

justification : Human-readable explanation built from the causal explanation.

confidence : Fixed confidence value returned by the endpoint. The current implementation uses 0.85.

risk_score : Heuristic risk score computed from latency and error rate.

status : Always oss_advisory_only, indicating that the response is informational and not executable.

`causal_explanation`

factual_outcome : The observed outcome value from the request context. The endpoint uses latency_p99 as the explained metric.

counterfactual_outcome : The estimated value under the proposed alternative action.

effect : The difference counterfactual_outcome - factual_outcome. A negative value means the proposed action is expected to reduce the metric.

explanation_text : Natural-language explanation of the counterfactual effect.

is_model_based : Always false in the current implementation.

warnings : A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.

`utility_decision`

best_action : The selected action, repeated for convenience.

expected_utility : Fixed utility value returned by the current implementation. The endpoint uses 0.5.

explanation : Brief explanation that the choice came from heuristic latency and error thresholds.
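
A client can check the documented relationships between these fields before trusting a response. The following is a sketch of such a check, assuming the response has already been parsed into a dict; the function name check_response_invariants is hypothetical and not part of ARF.

```python
def check_response_invariants(response: dict) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    intent = response["healing_intent"]
    causal = response["causal_explanation"]
    utility = response["utility_decision"]

    # status is documented as always "oss_advisory_only"
    if intent["status"] != "oss_advisory_only":
        problems.append("status is not oss_advisory_only")
    # is_model_based is documented as always false
    if causal["is_model_based"] is not False:
        problems.append("is_model_based is not false")
    # best_action repeats the selected action
    if utility["best_action"] != intent["action"]:
        problems.append("best_action differs from healing_intent.action")
    # effect is the counterfactual outcome minus the factual outcome
    expected_effect = causal["counterfactual_outcome"] - causal["factual_outcome"]
    if abs(causal["effect"] - expected_effect) > 1e-6:
        problems.append("effect is not counterfactual_outcome - factual_outcome")
    return problems


sample = {
    "healing_intent": {"action": "restart_container", "status": "oss_advisory_only"},
    "causal_explanation": {
        "factual_outcome": 450,
        "counterfactual_outcome": 382.5,
        "effect": -67.5,
        "is_model_based": False,
    },
    "utility_decision": {"best_action": "restart_container"},
}
print(check_response_invariants(sample))
```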

Deterministic decision logic

The endpoint uses the following rule to choose the action:

optimal_action = (
    RESTART_CONTAINER
    if latency_p99 > 500 or error_rate > 0.15
    else NO_ACTION
)

In the implementation, this is encoded as:

restart_container when latency_p99 > 500 or error_rate > 0.15
no_action otherwise

No probabilistic policy or learned policy is involved.
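
The rule above can be sketched as a small pure function. The function and constant names here are illustrative, not the names used in the ARF source; the thresholds and the strict comparisons come from the rule as stated.

```python
LATENCY_P99_THRESHOLD = 500   # restart when latency_p99 is strictly above this
ERROR_RATE_THRESHOLD = 0.15   # restart when error_rate is strictly above this


def choose_action(latency_p99: float, error_rate: float) -> str:
    """Return the action name selected by the documented threshold rule."""
    if latency_p99 > LATENCY_P99_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return "restart_container"
    return "no_action"


print(choose_action(450, 0.25))   # error rate exceeds its threshold
print(choose_action(500, 0.15))   # both values sit exactly on the thresholds
```

Because both comparisons are strict, a request that sits exactly on both thresholds yields no_action.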

Heuristic risk score

The risk score is computed as:

risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)

Properties of this score:

falls in the interval [0, 1] for non-negative inputs
weighted more heavily toward latency than error rate
clipped at 1.0
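
As a worked illustration of the formula, the sketch below computes the score for two hypothetical inputs. The function name risk_score is illustrative; the weights and the clipping come from the formula above.

```python
def risk_score(latency_p99: float, error_rate: float) -> float:
    """Heuristic risk: 70% weight on latency_p99 / 1000, 30% on error_rate, capped at 1.0."""
    return min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)


# 0.3 * 0.7 + 0.10 * 0.3 = 0.21 + 0.03, so roughly 0.24
moderate = risk_score(300, 0.10)
# 2.0 * 0.7 + 0.5 * 0.3 = 1.55 before clipping, so the cap applies
saturated = risk_score(2000, 0.5)
print(round(moderate, 4), saturated)
```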

Counterfactual model

The causal explainer uses a deterministic multiplicative heuristic:

counterfactual_outcome = factual_outcome * (1 + effect_frac)

Where:

factual_outcome is the observed metric value
effect_frac is read from a fixed internal action-impact table
the effect is multiplicative, not additive

For latency, the current action-impact mapping includes the following examples:

restart_container → latency_effect = -0.15
scale_out → latency_effect = -0.20
rollback → latency_effect = -0.25
circuit_breaker → latency_effect = -0.05
traffic_shift → latency_effect = -0.10
alert_team → latency_effect = 0.0
no_action → latency_effect = 0.0

For error rate, the table includes a separate error_rate_effect per action, but the current endpoint calls the explainer with outcome_metric="latency", so the returned counterfactual explanation is latency-based.
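
A minimal sketch of the multiplicative heuristic for latency, using the effect fractions listed above. The dictionary and function names are illustrative, and raising KeyError for an unlisted action is an assumption of this sketch rather than documented behavior.

```python
# Latency effect fractions copied from the action-impact examples above.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual_outcome: float, action: str) -> float:
    """Apply the multiplicative heuristic: factual * (1 + effect_frac)."""
    effect_frac = LATENCY_EFFECT[action]  # KeyError for unknown actions (sketch only)
    return factual_outcome * (1 + effect_frac)


restart = counterfactual_latency(450, "restart_container")   # about 382.5
unchanged = counterfactual_latency(450, "no_action")         # no change
print(round(restart, 2), unchanged)
```

Because the effect is multiplicative, the absolute latency change scales with the observed latency: the same action removes more milliseconds from a slow service than from a fast one.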

Uncertainty interval

The explainer applies a fixed uncertainty margin equal to 10% of the absolute estimated effect, centered on the counterfactual outcome.

Let:

effect = counterfactual_outcome - factual_outcome
ci_half = abs(effect) * 0.1
confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)

This interval is heuristic only. It is not a calibrated statistical confidence interval.
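
The interval definition above can be sketched as follows. The function name heuristic_interval is illustrative; the 0.1 factor comes from the formula.

```python
def heuristic_interval(factual_outcome: float, counterfactual_outcome: float) -> tuple:
    """Return (low, high): the counterfactual outcome plus or minus 10% of |effect|."""
    effect = counterfactual_outcome - factual_outcome
    ci_half = abs(effect) * 0.1
    return (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)


# effect = -67.5, ci_half = 6.75, so the interval is roughly (375.75, 389.25)
low, high = heuristic_interval(450, 382.5)
# A zero effect (for example no_action) collapses the interval to a single point.
flat = heuristic_interval(450, 450)
print(low, high, flat)
```

One consequence of tying the half-width to the effect size is that actions with no estimated effect report no uncertainty at all, which is another reason not to read the interval statistically.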

How the endpoint uses the explainer

The endpoint constructs a local state object and passes it to the explainer:

current_state["latency"] = event.latency_p99
current_state["error_rate"] = event.error_rate
current_state["last_action"] = {"action_type": "no_action"}

It then creates:

proposed_action = {"action_type": optimal_action.value, "params": {}}

and calls:

CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")

The resulting explanation is returned as causal_explanation, and its text is reused to build healing_intent.justification.
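
The wiring above can be exercised end to end with a stand-in explainer. Everything in this sketch is an assumption made for illustration: StubCausalExplainer is a hypothetical replacement for the real CausalExplainer, its dict return value is a guess at the shape, and the explanation wording is modeled on the example response in this document.

```python
from types import SimpleNamespace


class StubCausalExplainer:
    """Hypothetical stand-in; the real CausalExplainer is not shown in this document."""

    LATENCY_EFFECT = {"restart_container": -0.15, "no_action": 0.0}

    def explain_healing_intent(self, proposed_action, current_state, outcome_metric):
        factual = current_state[outcome_metric]
        action = proposed_action["action_type"]
        previous = current_state["last_action"]["action_type"]
        counterfactual = factual * (1 + self.LATENCY_EFFECT[action])
        effect = counterfactual - factual
        text = (
            f"If we apply {action} instead of {previous}, {outcome_metric} would "
            f"change from {factual:.2f} to {counterfactual:.2f} (Δ = {effect:.2f}). "
            "Based on heuristic causal model."
        )
        return {
            "factual_outcome": factual,
            "counterfactual_outcome": counterfactual,
            "effect": effect,
            "explanation_text": text,
        }


# Stand-in for the validated request object.
event = SimpleNamespace(latency_p99=450, error_rate=0.25)

# State and action are built the same way the endpoint is described as building them.
current_state = {
    "latency": event.latency_p99,
    "error_rate": event.error_rate,
    "last_action": {"action_type": "no_action"},
}
proposed_action = {"action_type": "restart_container", "params": {}}

explanation = StubCausalExplainer().explain_healing_intent(
    proposed_action, current_state, "latency"
)
print(explanation["explanation_text"])
```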

Validation and error behavior

The endpoint uses Pydantic validation through the ReliabilityEvent model.

Expected behavior:

valid requests return HTTP 200
invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs

The current implementation does not define a custom error schema for validation failures, so they use FastAPI's default response: HTTP 422 with a detail list describing each invalid field.

Advisory-only behavior

The response includes:

"status": "oss_advisory_only"

This means:

the endpoint recommends an action
it does not perform the action
it does not mutate incident state
it does not trigger remediation workflows by itself

Notes on implementation scope

The current endpoint is intentionally narrow:

it bases the action choice on only two fields: latency_p99 and error_rate
it ignores cpu_util, memory_util, and service_mesh in the decision logic
it always uses the latency metric in the causal explainer call
it returns a fixed expected_utility value of 0.5

Example request

curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" \
  -H "Content-Type: application/json" \
  -d '{
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90
  }'
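
The same request can be built from Python with only the standard library. This is a sketch, not an official client: the helper name build_evaluate_request is hypothetical, and the URL is the local address used in the curl command. The send step is left commented out because it needs a running server.

```python
import json
import urllib.request

EVALUATE_URL = "http://localhost:8000/api/v1/v1/incidents/evaluate"


def build_evaluate_request(payload: dict, url: str = EVALUATE_URL) -> urllib.request.Request:
    """Build a JSON POST request for the evaluate endpoint without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90,
}
request = build_evaluate_request(payload)

# To send it against a running instance:
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
print(request.get_method(), request.full_url)
```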

Example response shape

{
  "healing_intent": {
    "action": "restart_container",
    "component": "payment-service",
    "parameters": {},
    "justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "confidence": 0.85,
    "risk_score": 0.39,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 450,
    "counterfactual_outcome": 382.5,
    "effect": -67.5,
    "explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "is_model_based": false,
    "warnings": [
      "Using heuristic causal model (no fitted SCM)."
    ]
  },
  "utility_decision": {
    "best_action": "restart_container",
    "expected_utility": 0.5,
    "explanation": "Heuristic decision based on latency/error thresholds"
  }
}

Cross-reference

See docs/examples.md for a worked numerical example and README.md for a shorter overview.