API Endpoints

This document describes the main ARF API endpoints and the request/response contracts used by the control plane.

POST /api/v1/v1/incidents/evaluate

Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.

This endpoint is advisory only. It does not apply remediation, mutate infrastructure, or execute any healing action.

Purpose

The endpoint takes a current incident snapshot, estimates risk, chooses a deterministic action, and explains the expected effect of that action on latency using a heuristic counterfactual model.

The implementation is intentionally simple:

  • no fitted Structural Causal Model is used
  • no machine learning model is required
  • no historical training step is performed
  • no action execution is triggered

Request schema

The request body must match the ReliabilityEvent model.

{
  "component": "string",
  "latency_p99": "number",
  "error_rate": "number",
  "service_mesh": "string",
  "cpu_util": "number | null",
  "memory_util": "number | null"
}

Fields

component : Name of the service or component being evaluated.

latency_p99 : The current 99th percentile latency value. The endpoint uses this value both for risk scoring and for the causal explanation.

error_rate : The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.

service_mesh : Optional service mesh name. Defaults to "default".

cpu_util : Optional CPU utilization value. Present in the request model, but not used by the current decision logic.

memory_util : Optional memory utilization value. Present in the request model, but not used by the current decision logic.
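As a rough illustration of the field list above, the request shape can be mirrored with a standard-library dataclass. This is a stand-in, not the actual Pydantic ReliabilityEvent model: the class name `ReliabilityEventSketch`, the helper `event_from_payload`, and the `None` defaults for cpu_util and memory_util are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ReliabilityEventSketch:
    """Stand-in for the request body; not the real ReliabilityEvent model."""

    component: str
    latency_p99: float
    error_rate: float
    service_mesh: str = "default"         # documented default
    cpu_util: Optional[float] = None      # assumed default; unused by the decision logic
    memory_util: Optional[float] = None   # assumed default; unused by the decision logic


def event_from_payload(payload: Mapping[str, Any]) -> ReliabilityEventSketch:
    """Build the sketch from a decoded JSON body.

    Raises KeyError when a required field is missing.
    """
    return ReliabilityEventSketch(
        component=str(payload["component"]),
        latency_p99=float(payload["latency_p99"]),
        error_rate=float(payload["error_rate"]),
        service_mesh=payload.get("service_mesh", "default"),
        cpu_util=payload.get("cpu_util"),
        memory_util=payload.get("memory_util"),
    )
```

Only component, latency_p99, and error_rate are required in this sketch; the remaining fields fall back to their defaults.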

Response schema

The endpoint returns a JSON object with three top-level sections.

{
  "healing_intent": {
    "action": "string",
    "component": "string",
    "parameters": {},
    "justification": "string",
    "confidence": 0.85,
    "risk_score": 0.0,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 0.0,
    "counterfactual_outcome": 0.0,
    "effect": 0.0,
    "explanation_text": "string",
    "is_model_based": false,
    "warnings": ["string"]
  },
  "utility_decision": {
    "best_action": "string",
    "expected_utility": 0.5,
    "explanation": "string"
  }
}

healing_intent

action : The selected action. In the current implementation this is either restart_container or no_action.

component : The input component name.

parameters : Action parameters. The current implementation returns an empty object.

justification : Human-readable explanation built from the causal explanation.

confidence : Fixed confidence value returned by the endpoint. The current implementation uses 0.85.

risk_score : Heuristic risk score computed from latency and error rate.

status : Always oss_advisory_only, indicating that the response is informational and not executable.

causal_explanation

factual_outcome : The observed outcome value from the request context. The endpoint uses latency_p99 as the explained metric.

counterfactual_outcome : The estimated value under the proposed alternative action.

effect : The signed difference counterfactual_outcome - factual_outcome. A negative value means the action is expected to reduce the metric.

explanation_text : Natural-language explanation of the counterfactual effect.

is_model_based : Always false in the current implementation.

warnings : A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.

utility_decision

best_action : The selected action, repeated for convenience.

expected_utility : Fixed utility value returned by the current implementation. The endpoint uses 0.5.

explanation : Brief explanation that the choice came from heuristic latency and error thresholds.

Deterministic decision logic

The endpoint uses the following rule to choose the action:

optimal_action = RESTART_CONTAINER
if latency_p99 > 500 OR error_rate > 0.15
else NO_ACTION

In the implementation, this is encoded as:

  • restart_container when latency_p99 > 500 or error_rate > 0.15
  • no_action otherwise

No probabilistic policy or learned policy is involved.
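The rule above can be sketched as a small pure function. The enum name `HealingAction` and the function name `choose_action` are hypothetical; only the two thresholds and the two action values come from this document.

```python
from enum import Enum

LATENCY_P99_THRESHOLD = 500    # threshold from the rule above
ERROR_RATE_THRESHOLD = 0.15    # threshold from the rule above


class HealingAction(Enum):
    """Hypothetical enum; the values match the documented action strings."""

    RESTART_CONTAINER = "restart_container"
    NO_ACTION = "no_action"


def choose_action(latency_p99: float, error_rate: float) -> HealingAction:
    """Return restart_container if either threshold is strictly exceeded."""
    if latency_p99 > LATENCY_P99_THRESHOLD or error_rate > ERROR_RATE_THRESHOLD:
        return HealingAction.RESTART_CONTAINER
    return HealingAction.NO_ACTION
```

Both comparisons are strict, so a request sitting exactly on both thresholds (latency_p99 = 500, error_rate = 0.15) maps to no_action in this sketch.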

Heuristic risk score

The risk score is computed as:

risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)

Properties of this score:

  • bounded to the interval [0, 1] for non-negative inputs (the formula applies no lower clip)
  • weighted more heavily toward latency than error rate
  • clipped at 1.0
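A minimal sketch of the score as a function, assuming latency_p99 is expressed in the same unit the formula divides by 1000. The function name `risk_score` is chosen for the example.

```python
def risk_score(latency_p99: float, error_rate: float) -> float:
    """Heuristic risk: 70% weight on latency / 1000, 30% on error rate, capped at 1.0."""
    return min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)


# Worked arithmetic for latency_p99 = 300, error_rate = 0.10:
#   (300 / 1000) * 0.7 = 0.21
#   0.10 * 0.3         = 0.03
#   risk               = 0.24 (up to floating-point rounding)
moderate = risk_score(300, 0.10)

# For latency_p99 = 2000, error_rate = 0.5 the raw sum is 1.4 + 0.15 = 1.55,
# so the cap applies and the score is 1.0.
saturated = risk_score(2000, 0.5)
```

Because the latency term alone reaches 0.7 at a latency_p99 of 1000, the cap only takes effect once latency or error rate is well above the decision thresholds.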

Counterfactual model

The causal explainer uses a deterministic multiplicative heuristic:

counterfactual_outcome = factual_outcome * (1 + effect_frac)

Where:

  • factual_outcome is the observed metric value
  • effect_frac is read from a fixed internal action-impact table
  • the effect is multiplicative, not additive

For latency, the current action-impact mapping includes the following examples:

  • restart_container: latency_effect = -0.15
  • scale_out: latency_effect = -0.20
  • rollback: latency_effect = -0.25
  • circuit_breaker: latency_effect = -0.05
  • traffic_shift: latency_effect = -0.10
  • alert_team: latency_effect = 0.0
  • no_action: latency_effect = 0.0

For error rate, the table includes a separate error_rate_effect per action, but the current endpoint calls the explainer with outcome_metric="latency", so the returned counterfactual explanation is latency-based.
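The multiplicative heuristic can be sketched with the latency values listed above. The names `LATENCY_EFFECT` and `counterfactual_latency` are hypothetical, and how the real explainer treats an action missing from its table is not documented here, so this sketch simply raises KeyError.

```python
from typing import Tuple

# Latency effect fractions copied from the list above.
LATENCY_EFFECT = {
    "restart_container": -0.15,
    "scale_out": -0.20,
    "rollback": -0.25,
    "circuit_breaker": -0.05,
    "traffic_shift": -0.10,
    "alert_team": 0.0,
    "no_action": 0.0,
}


def counterfactual_latency(factual_outcome: float, action_type: str) -> Tuple[float, float]:
    """Return (counterfactual_outcome, effect) for a proposed action.

    The effect is multiplicative: counterfactual = factual * (1 + effect_frac).
    """
    effect_frac = LATENCY_EFFECT[action_type]  # KeyError for unknown actions (assumption)
    counterfactual_outcome = factual_outcome * (1 + effect_frac)
    effect = counterfactual_outcome - factual_outcome
    return counterfactual_outcome, effect
```

For example, a factual latency of 800 with rollback gives 800 * (1 - 0.25) = 600 and an effect of -200. Because the effect scales with the factual value, the same action yields a larger absolute change at higher latency.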

Uncertainty interval

The explainer applies a fixed uncertainty margin equal to 10% of the magnitude of the estimated effect, centered on the counterfactual outcome.

Let:

effect = counterfactual_outcome - factual_outcome
ci_half = abs(effect) * 0.1
confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)

This interval is heuristic only. It is not a calibrated statistical confidence interval.
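The three definitions above translate directly into code. This is a sketch with a hypothetical function name, `heuristic_interval`, not the explainer's actual method.

```python
from typing import Tuple


def heuristic_interval(factual_outcome: float, counterfactual_outcome: float) -> Tuple[float, float]:
    """Return the heuristic interval around the counterfactual outcome.

    The half-width is 10% of the absolute effect, as defined above.
    """
    effect = counterfactual_outcome - factual_outcome
    ci_half = abs(effect) * 0.1
    return (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)


# Worked arithmetic for factual = 450, counterfactual = 382.5:
#   effect  = 382.5 - 450  = -67.5
#   ci_half = 67.5 * 0.1   = 6.75
#   interval               = (375.75, 389.25)
low, high = heuristic_interval(450, 382.5)
```

One consequence of this definition is that an action with zero effect, such as no_action, produces an interval that collapses to a single point.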

How the endpoint uses the explainer

The endpoint constructs a local state object and passes it to the explainer:

  • current_state["latency"] = event.latency_p99
  • current_state["error_rate"] = event.error_rate
  • current_state["last_action"] = {"action_type": "no_action"}

It then creates:

  • proposed_action = {"action_type": optimal_action.value, "params": {}}

and calls:

CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")

The resulting explanation is returned as the causal_explanation section, and its text is used to build healing_intent.justification.
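The wiring described above can be sketched with a stand-in explainer. `StubCausalExplainer` is hypothetical and carries only two table entries; the real CausalExplainer's return type is not documented here, so a plain dict is assumed. The explanation wording and the "Causal: " prefix are inferred from the example response in this document.

```python
from typing import Any, Dict, Tuple

# Two entries from the latency action-impact table, enough for this sketch.
_LATENCY_EFFECT = {"restart_container": -0.15, "no_action": 0.0}


class StubCausalExplainer:
    """Hypothetical stand-in for CausalExplainer (same call shape, assumed dict result)."""

    def explain_healing_intent(
        self, proposed_action: Dict[str, Any], current_state: Dict[str, Any], outcome_metric: str
    ) -> Dict[str, Any]:
        action = proposed_action["action_type"]
        previous = current_state["last_action"]["action_type"]
        factual = current_state[outcome_metric]
        counterfactual = factual * (1 + _LATENCY_EFFECT[action])
        effect = counterfactual - factual
        text = (
            f"If we apply {action} instead of {previous}, {outcome_metric} would change "
            f"from {factual:.2f} to {counterfactual:.2f} (Δ = {effect:.2f}). "
            "Based on heuristic causal model."
        )
        return {
            "factual_outcome": factual,
            "counterfactual_outcome": counterfactual,
            "effect": effect,
            "explanation_text": text,
            "is_model_based": False,
            "warnings": ["Using heuristic causal model (no fitted SCM)."],
        }


def explain_for_event(latency_p99: float, error_rate: float, action_type: str) -> Tuple[Dict[str, Any], str]:
    """Build the state and proposed action as described above, then call the explainer."""
    current_state = {
        "latency": latency_p99,
        "error_rate": error_rate,
        "last_action": {"action_type": "no_action"},
    }
    proposed_action = {"action_type": action_type, "params": {}}
    explanation = StubCausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")
    justification = "Causal: " + explanation["explanation_text"]  # prefix inferred from the example
    return explanation, justification
```

Note that last_action is always set to no_action, so every explanation in this flow compares the proposed action against doing nothing.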

Validation and error behavior

The endpoint uses Pydantic validation through the ReliabilityEvent model.

Expected behavior:

  • valid requests return HTTP 200
  • invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs

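The current implementation does not define a custom error schema for validation failures, so FastAPI's default validation response (HTTP 422 with a detail array) is what clients should expect.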

Advisory-only behavior

The response includes:

"status": "oss_advisory_only"

This means:

  • the endpoint recommends an action
  • it does not perform the action
  • it does not mutate incident state
  • it does not trigger remediation workflows by itself

Notes on implementation scope

The current endpoint is intentionally narrow:

  • it bases the action choice on only two fields: latency_p99 and error_rate
  • it ignores cpu_util, memory_util, and service_mesh in the decision logic
  • it always uses the latency metric in the causal explainer call
  • it returns a fixed expected_utility value of 0.5

Example request

curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" \
  -H "Content-Type: application/json" \
  -d '{
    "component": "payment-service",
    "latency_p99": 450,
    "error_rate": 0.25,
    "service_mesh": "default",
    "cpu_util": 0.85,
    "memory_util": 0.90
  }'
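The same request can be issued from Python using only the standard library. This is a sketch: the helper names `build_evaluate_request` and `evaluate_incident` are hypothetical, and the URL is the local address used in the curl example, which may differ in a real deployment.

```python
import json
import urllib.request
from typing import Any, Dict

# URL taken from the curl example above; adjust host and port as needed.
EVALUATE_URL = "http://localhost:8000/api/v1/v1/incidents/evaluate"


def build_evaluate_request(event: Dict[str, Any], url: str = EVALUATE_URL) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate_incident(event: Dict[str, Any], url: str = EVALUATE_URL, timeout: float = 10.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_evaluate_request(event, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `evaluate_incident` with the payload from the curl example should return the three top-level sections described under Response schema; non-2xx responses surface as `urllib.error.HTTPError`.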

Example response shape

{
  "healing_intent": {
    "action": "restart_container",
    "component": "payment-service",
    "parameters": {},
    "justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "confidence": 0.85,
    "risk_score": 0.39,
    "status": "oss_advisory_only"
  },
  "causal_explanation": {
    "factual_outcome": 450,
    "counterfactual_outcome": 382.5,
    "effect": -67.5,
    "explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
    "is_model_based": false,
    "warnings": [
      "Using heuristic causal model (no fitted SCM)."
    ]
  },
  "utility_decision": {
    "best_action": "restart_container",
    "expected_utility": 0.5,
    "explanation": "Heuristic decision based on latency/error thresholds"
  }
}

Cross-reference

See docs/examples.md for a worked numerical example and README.md for a shorter overview.