Upload folder using huggingface_hub
#3
by petter2025 - opened
This view is limited to 50 files because it contains too many changes. See the raw diff here.
- .dockerignore +8 -0
- .gitignore +16 -33
- Dockerfile +2 -1
- README.md +90 -72
- alembic/versions/d36deffe7fa2_add_beta_state_table_for_conjugate_.py +47 -0
- app/api/deps.py +45 -63
- app/api/routes_admin.py +36 -25
- app/api/routes_governance.py +190 -71
- app/api/routes_incidents.py +186 -54
- app/api/routes_memory.py +5 -1
- app/api/routes_payments.py +7 -5
- app/api/routes_pricing.py +104 -0
- app/api/routes_risk.py +16 -19
- app/api/routes_users.py +7 -19
- app/api/webhooks.py +2 -1
- app/core/config.py +3 -0
- app/core/usage_tracker.py +257 -93
- app/database/models_intents.py +48 -6
- app/database/session.py +1 -14
- app/main.py +207 -67
- app/models/__init__.py +1 -1
- app/models/incident_models.py +3 -2
- app/models/infrastructure_intents.py +7 -40
- app/models/intent_models.py +1 -1
- app/models/risk_models.py +1 -1
- app/services/incident_service.py +2 -1
- app/services/intent_adapter.py +162 -65
- app/services/intent_service.py +2 -1
- app/services/intent_store.py +7 -3
- app/services/outcome_service.py +117 -57
- app/services/risk_service.py +348 -69
- app/services/wilson_monitor.py +56 -0
- docker-compose.test.yml +12 -0
- docs/authentication.md +25 -0
- docs/development.md +55 -0
- docs/docs_endpoints.md +314 -0
- docs/endpoints.md +34 -0
- docs/examples.md +54 -0
- docs/index.md +16 -0
- monitor.sh +18 -0
- render.yaml +19 -0
- requirements-dev.txt +3 -0
- requirements.txt +9 -5
- runtime.txt +2 -0
- seed_rag_data.py +67 -0
- start.sh +68 -0
- tests/conftest.py +128 -0
- tests/test_deps.py +15 -0
- tests/test_governance.py +71 -0
- tests/test_healing_endpoint.py +21 -0
.dockerignore
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
.git
|
| 2 |
+
__pycache__
|
| 3 |
+
*.pyc
|
| 4 |
+
.env
|
| 5 |
+
venv
|
| 6 |
+
.pytest_cache
|
| 7 |
+
.coverage
|
| 8 |
+
htmlcov
|
.gitignore
CHANGED
|
@@ -1,50 +1,33 @@
|
|
| 1 |
# Python
|
| 2 |
__pycache__/
|
| 3 |
-
*.
|
| 4 |
-
*
|
| 5 |
-
*.
|
| 6 |
.Python
|
| 7 |
-
|
| 8 |
-
develop-eggs/
|
| 9 |
-
dist/
|
| 10 |
-
downloads/
|
| 11 |
-
eggs/
|
| 12 |
-
.eggs/
|
| 13 |
-
lib/
|
| 14 |
-
lib64/
|
| 15 |
-
parts/
|
| 16 |
-
sdist/
|
| 17 |
-
var/
|
| 18 |
-
wheels/
|
| 19 |
-
*.egg-info/
|
| 20 |
-
.installed.cfg
|
| 21 |
-
*.egg
|
| 22 |
|
| 23 |
-
# Virtual
|
| 24 |
venv/
|
| 25 |
env/
|
| 26 |
ENV/
|
| 27 |
-
.env/
|
| 28 |
.venv/
|
| 29 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 30 |
# IDE
|
| 31 |
.vscode/
|
| 32 |
.idea/
|
| 33 |
*.swp
|
| 34 |
*.swo
|
| 35 |
-
*~
|
| 36 |
|
| 37 |
# OS
|
| 38 |
.DS_Store
|
| 39 |
-
.
|
| 40 |
-
.
|
| 41 |
-
|
| 42 |
-
.
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
# Hugging Face Spaces
|
| 47 |
-
data/
|
| 48 |
-
models/
|
| 49 |
-
logs/
|
| 50 |
-
*.log
|
|
|
|
| 1 |
# Python
|
| 2 |
__pycache__/
|
| 3 |
+
*.pyc
|
| 4 |
+
*.pyo
|
| 5 |
+
*.pyd
|
| 6 |
.Python
|
| 7 |
+
*.so
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
|
| 9 |
+
# Virtual environments
|
| 10 |
venv/
|
| 11 |
env/
|
| 12 |
ENV/
|
|
|
|
| 13 |
.venv/
|
| 14 |
|
| 15 |
+
# Build artifacts
|
| 16 |
+
dist/
|
| 17 |
+
build/
|
| 18 |
+
*.egg-info/
|
| 19 |
+
|
| 20 |
# IDE
|
| 21 |
.vscode/
|
| 22 |
.idea/
|
| 23 |
*.swp
|
| 24 |
*.swo
|
|
|
|
| 25 |
|
| 26 |
# OS
|
| 27 |
.DS_Store
|
| 28 |
+
.env
|
| 29 |
+
test.db
|
| 30 |
+
venv
|
| 31 |
+
.coverage
|
| 32 |
+
monitor.log
|
| 33 |
+
monitor_loop.log
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Dockerfile
CHANGED
|
@@ -1,6 +1,7 @@
|
|
| 1 |
FROM python:3.12-slim
|
|
|
|
| 2 |
WORKDIR /app
|
| 3 |
COPY requirements.txt .
|
| 4 |
RUN pip install --no-cache-dir -r requirements.txt
|
| 5 |
COPY . .
|
| 6 |
-
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
|
|
|
|
| 1 |
FROM python:3.12-slim
|
| 2 |
+
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
|
| 3 |
WORKDIR /app
|
| 4 |
COPY requirements.txt .
|
| 5 |
RUN pip install --no-cache-dir -r requirements.txt
|
| 6 |
COPY . .
|
| 7 |
+
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]
|
README.md
CHANGED
|
@@ -1,103 +1,121 @@
|
|
| 1 |
-
-
|
| 2 |
-
title: Agentic Reliability Framework (ARF) v4 – Public API Demo
|
| 3 |
-
emoji: 🤖
|
| 4 |
-
colorFrom: blue
|
| 5 |
-
colorTo: green
|
| 6 |
-
sdk: docker
|
| 7 |
-
python_version: '3.10'
|
| 8 |
-
app_file: app.py
|
| 9 |
-
pinned: false
|
| 10 |
-
---
|
| 11 |
|
| 12 |
-
|
| 13 |
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
|
|
|
|
|
|
| 17 |
|
| 18 |
-
|
| 19 |
|
| 20 |
-
|
|
|
|
|
|
|
|
|
|
| 21 |
|
| 22 |
-
--
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
| **📚 API Docs** | [https://a-r-f-arf-sandbox-api.hf.space/docs](https://a-r-f-arf-sandbox-api.hf.space/docs) |
|
| 29 |
-
| **🧪 Live Demo** | [Gradio Dashboard](https://a-r-f-arf-sandbox-api.hf.space/) |
|
| 30 |
-
| **📦 Public Spec** | [github.com/arf-foundation/arf-spec](https://github.com/arf-foundation/arf-spec) |
|
| 31 |
-
| **📅 Book a Call** | [Calendly](https://calendly.com/petter2025us/30min) |
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
-
|
|
|
|
| 36 |
|
| 37 |
-
|
| 38 |
-
import requests
|
| 39 |
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
json={
|
| 43 |
-
"service_name": "payment-gateway",
|
| 44 |
-
"event_type": "latency_spike",
|
| 45 |
-
"severity": "high",
|
| 46 |
-
"metrics": {"latency_p99": 450, "error_rate": 0.12}
|
| 47 |
-
}
|
| 48 |
-
)
|
| 49 |
-
print(response.json())
|
| 50 |
```
|
| 51 |
|
| 52 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 53 |
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
* risk\_factors: additive contributions from conjugate prior, hyperprior, and HMC
|
| 57 |
-
|
| 58 |
-
* recommended\_action: approve, deny, or escalate
|
| 59 |
-
|
| 60 |
-
* decision\_trace: expected losses and variance
|
| 61 |
-
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
-
|
| 66 |
-
-----------------------------------------
|
| 67 |
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
* **Semantic Memory** – FAISS‑based retrieval of similar past incidents.
|
| 71 |
-
|
| 72 |
-
* **Expected Loss Minimisation** – Chooses approve/deny/escalate by minimising cost-weighted risk, not static thresholds.
|
| 73 |
-
|
| 74 |
-
* **Multi‑Agent Orchestration** – Anomaly detection, root cause, forecasting.
|
| 75 |
-
|
| 76 |
|
| 77 |
-
|
| 78 |
-
---------------
|
| 79 |
|
| 80 |
```text
|
| 81 |
-
|
| 82 |
-
↓
|
| 83 |
-
HealingIntent ← Decision (Expected Loss)
|
| 84 |
```
|
| 85 |
|
| 86 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 87 |
|
| 88 |
-
|
| 89 |
-
--------------------
|
| 90 |
|
| 91 |
```bash
|
| 92 |
-
|
| 93 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 94 |
```
|
| 95 |
|
| 96 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 97 |
|
| 98 |
-
|
| 99 |
-
-----
|
| 100 |
|
| 101 |
-
|
|
|
|
| 102 |
|
| 103 |
-
Learn more at [github.com/arf-foundation](https://github.com/arf-foundation) and request access via petter2025us@outlook.com.
|
|
|
|
| 1 |
+
# arf-api
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2 |
|
| 3 |
+
ARF API Control Plane (FastAPI)
|
| 4 |
|
| 5 |
+
## Live Demo
|
| 6 |
|
| 7 |
+
The API is deployed and accessible at:
|
| 8 |
+
- **Base URL**: [https://a-r-f-agentic-reliability-framework-api.hf.space](https://a-r-f-agentic-reliability-framework-api.hf.space)
|
| 9 |
+
- **Interactive Documentation**: [https://a-r-f-agentic-reliability-framework-api.hf.space/docs](https://a-r-f-agentic-reliability-framework-api.hf.space/docs)
|
| 10 |
|
| 11 |
+
## Quick Start (Local Development)
|
| 12 |
|
| 13 |
+
1. **Install dependencies**:
|
| 14 |
+
```bash
|
| 15 |
+
pip install -r requirements.txt
|
| 16 |
+
```
|
| 17 |
|
| 18 |
+
Note: `requirements.txt` installs `agentic-reliability-framework` directly from the project's Git repository.
|
| 19 |
|
| 20 |
+
2. **Set environment variables** (optional, in `.env`):
|
| 21 |
|
| 22 |
+
```text
|
| 23 |
+
ARF_HMC_MODEL – path to HMC model JSON (default: models/hmc_model.json)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 24 |
|
| 25 |
+
ARF_USE_HYPERPRIORS – true/false
|
| 26 |
|
| 27 |
+
API_KEY – optional (currently not enforced)
|
| 28 |
+
```
|
| 29 |
|
| 30 |
+
3. **Run the app locally**:
|
|
|
|
| 31 |
|
| 32 |
+
```bash
|
| 33 |
+
uvicorn app.main:app --reload --port 8000
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 34 |
```
|
| 35 |
|
| 36 |
+
4. **Health check**:
|
| 37 |
+
|
| 38 |
+
```bash
|
| 39 |
+
GET http://localhost:8000/health
|
| 40 |
+
```
|
| 41 |
|
| 42 |
+
## Causal Explainer Endpoint
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
|
| 44 |
+
The ARF API includes a heuristic causal explainer that evaluates the impact of proposed healing actions using deterministic rules. This module provides counterfactual reasoning without requiring a fitted causal model or external ML dependencies.
|
| 45 |
|
| 46 |
+
The explainer estimates how system metrics such as latency would change if a different action were taken.
|
|
|
|
| 47 |
|
| 48 |
+
### Mathematical Model
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
|
| 50 |
+
The counterfactual outcome is computed as:
|
|
|
|
| 51 |
|
| 52 |
```text
|
| 53 |
+
counterfactual_outcome = factual_outcome * (1 + effect_frac)
|
|
|
|
|
|
|
| 54 |
```
|
| 55 |
|
| 56 |
+
Where:
|
| 57 |
+
|
| 58 |
+
- `effect_frac` is a predefined impact factor based on the action type
|
| 59 |
+
- effects are multiplicative
|
| 60 |
+
- a fixed ±10% uncertainty interval is applied to the estimated outcome
|
| 61 |
|
| 62 |
+
### Example Request
|
|
|
|
| 63 |
|
| 64 |
```bash
|
| 65 |
+
curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" -H "Content-Type: application/json" -d '{
|
| 66 |
+
"component": "checkout-service",
|
| 67 |
+
"latency_p99": 600,
|
| 68 |
+
"error_rate": 0.2,
|
| 69 |
+
"service_mesh": "default"
|
| 70 |
+
}'
|
| 71 |
+
```
|
| 72 |
+
|
| 73 |
+
### Example Response
|
| 74 |
+
|
| 75 |
+
```json
|
| 76 |
+
{
|
| 77 |
+
"healing_intent": {
|
| 78 |
+
"action": "restart_container",
|
| 79 |
+
"component": "checkout-service",
|
| 80 |
+
"parameters": {},
|
| 81 |
+
"justification": "Causal: If we apply restart_container instead of no_action, latency would change from 600.00 to 510.00 (Δ = -90.00). Based on heuristic causal model.",
|
| 82 |
+
"confidence": 0.85,
|
| 83 |
+
"risk_score": 0.54,
|
| 84 |
+
"status": "oss_advisory_only"
|
| 85 |
+
},
|
| 86 |
+
"causal_explanation": {
|
| 87 |
+
"factual_outcome": 600,
|
| 88 |
+
"counterfactual_outcome": 510,
|
| 89 |
+
"effect": -90,
|
| 90 |
+
"explanation_text": "If we apply restart_container instead of no_action, latency would change from 600.00 to 510.00 (Δ = -90.00). Based on heuristic causal model.",
|
| 91 |
+
"is_model_based": false,
|
| 92 |
+
"warnings": [
|
| 93 |
+
"Using heuristic causal model (no fitted SCM)."
|
| 94 |
+
]
|
| 95 |
+
},
|
| 96 |
+
"utility_decision": {
|
| 97 |
+
"best_action": "restart_container",
|
| 98 |
+
"expected_utility": 0.5,
|
| 99 |
+
"explanation": "Heuristic decision based on latency/error thresholds"
|
| 100 |
+
}
|
| 101 |
+
}
|
| 102 |
```
|
| 103 |
|
| 104 |
+
### Important Notes
|
| 105 |
+
|
| 106 |
+
- This endpoint is advisory only (`status = oss_advisory_only`)
|
| 107 |
+
- No Structural Causal Model (SCM) is fitted
|
| 108 |
+
- No machine learning models are used
|
| 109 |
+
- All effects are based on predefined heuristics
|
| 110 |
+
|
| 111 |
+
Tests
|
| 112 |
+
-----
|
| 113 |
+
|
| 114 |
+
Run `pytest`. Tests use a temporary SQLite DB (`sqlite:///./test.db`) created by the test fixtures.
|
| 115 |
|
| 116 |
+
Notes
|
| 117 |
+
-----
|
| 118 |
|
| 119 |
+
- The governance endpoints use an in-process `RiskEngine` initialized at startup.
|
| 120 |
+
- The outcome recording endpoint is not implemented in this repository and returns HTTP 501.
|
| 121 |
|
|
|
alembic/versions/d36deffe7fa2_add_beta_state_table_for_conjugate_.py
ADDED
|
@@ -0,0 +1,47 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""add beta_state table for conjugate posterior persistence
|
| 2 |
+
|
| 3 |
+
Revision ID: d36deffe7fa2
|
| 4 |
+
Revises: b2218948f541
|
| 5 |
+
Create Date: 2026-05-02 20:36:04.870145
|
| 6 |
+
|
| 7 |
+
"""
|
| 8 |
+
from typing import Sequence, Union
|
| 9 |
+
|
| 10 |
+
from alembic import op
|
| 11 |
+
import sqlalchemy as sa
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
# revision identifiers, used by Alembic.
|
| 15 |
+
revision: str = 'd36deffe7fa2'
|
| 16 |
+
down_revision: Union[str, Sequence[str], None] = 'b2218948f541'
|
| 17 |
+
branch_labels: Union[str, Sequence[str], None] = None
|
| 18 |
+
depends_on: Union[str, Sequence[str], None] = None
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
def upgrade() -> None:
|
| 22 |
+
"""Upgrade schema."""
|
| 23 |
+
# ### commands auto generated by Alembic - please adjust! ###
|
| 24 |
+
op.create_table('beta_state',
|
| 25 |
+
sa.Column('id', sa.Integer(), nullable=False),
|
| 26 |
+
sa.Column('category', sa.String(length=32), nullable=False),
|
| 27 |
+
sa.Column('alpha', sa.Float(), nullable=False),
|
| 28 |
+
sa.Column('beta', sa.Float(), nullable=False),
|
| 29 |
+
sa.Column('updated_at', sa.DateTime(), nullable=True),
|
| 30 |
+
sa.PrimaryKeyConstraint('id')
|
| 31 |
+
)
|
| 32 |
+
op.create_index(op.f('ix_beta_state_category'), 'beta_state', ['category'], unique=True)
|
| 33 |
+
op.create_index(op.f('ix_beta_state_id'), 'beta_state', ['id'], unique=False)
|
| 34 |
+
op.add_column('intent_outcomes', sa.Column('idempotency_key', sa.String(length=128), nullable=True))
|
| 35 |
+
op.create_unique_constraint(None, 'intent_outcomes', ['idempotency_key'])
|
| 36 |
+
# ### end Alembic commands ###
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
def downgrade() -> None:
|
| 40 |
+
"""Downgrade schema."""
|
| 41 |
+
# ### commands auto generated by Alembic - please adjust! ###
|
| 42 |
+
op.drop_constraint(None, 'intent_outcomes', type_='unique')
|
| 43 |
+
op.drop_column('intent_outcomes', 'idempotency_key')
|
| 44 |
+
op.drop_index(op.f('ix_beta_state_id'), table_name='beta_state')
|
| 45 |
+
op.drop_index(op.f('ix_beta_state_category'), table_name='beta_state')
|
| 46 |
+
op.drop_table('beta_state')
|
| 47 |
+
# ### end Alembic commands ###
|
app/api/deps.py
CHANGED
|
@@ -4,66 +4,16 @@ from slowapi import Limiter
|
|
| 4 |
from slowapi.util import get_remote_address
|
| 5 |
from app.core.config import settings
|
| 6 |
|
| 7 |
-
#
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
def calculate_risk(self, *args, **kwargs):
|
| 15 |
-
return (0.38, "mock", {"conjugate_mean": 0.38})
|
| 16 |
-
def update_outcome(self, *args, **kwargs):
|
| 17 |
-
pass
|
| 18 |
-
|
| 19 |
-
class DecisionEngine:
|
| 20 |
-
def __init__(self, *args, **kwargs):
|
| 21 |
-
pass
|
| 22 |
-
def select_optimal_action(self, *args, **kwargs):
|
| 23 |
-
class Result:
|
| 24 |
-
best_action = type('Action', (), {'value': 'NO_ACTION'})()
|
| 25 |
-
expected_utility = 0.0
|
| 26 |
-
alternatives = []
|
| 27 |
-
explanation = "mock"
|
| 28 |
-
raw_data = {}
|
| 29 |
-
return Result()
|
| 30 |
-
def compute_risk(self, *args, **kwargs):
|
| 31 |
-
return 0.0
|
| 32 |
-
|
| 33 |
-
class LyapunovStabilityController:
|
| 34 |
-
def __init__(self, *args, **kwargs):
|
| 35 |
-
pass
|
| 36 |
-
|
| 37 |
-
class CausalExplainer:
|
| 38 |
-
def __init__(self, *args, **kwargs):
|
| 39 |
-
pass
|
| 40 |
-
|
| 41 |
-
class RAGGraphMemory:
|
| 42 |
-
def __init__(self, *args, **kwargs):
|
| 43 |
-
pass
|
| 44 |
-
def has_historical_data(self):
|
| 45 |
-
return False
|
| 46 |
-
def record_outcome(self, *args, **kwargs):
|
| 47 |
-
pass
|
| 48 |
-
|
| 49 |
-
class ReliabilityEvent:
|
| 50 |
-
def __init__(self, component, latency_p99, error_rate, service_mesh="default"):
|
| 51 |
-
self.component = component
|
| 52 |
-
self.latency_p99 = latency_p99
|
| 53 |
-
self.error_rate = error_rate
|
| 54 |
-
self.service_mesh = service_mesh
|
| 55 |
-
|
| 56 |
-
class HealingAction:
|
| 57 |
-
NO_ACTION = "NO_ACTION"
|
| 58 |
-
RESTART_CONTAINER = "RESTART_CONTAINER"
|
| 59 |
-
SCALE_OUT = "SCALE_OUT"
|
| 60 |
-
ROLLBACK = "ROLLBACK"
|
| 61 |
-
CIRCUIT_BREAKER = "CIRCUIT_BREAKER"
|
| 62 |
-
TRAFFIC_SHIFT = "TRAFFIC_SHIFT"
|
| 63 |
-
ALERT_TEAM = "ALERT_TEAM"
|
| 64 |
-
# ---------------------------------------------------------------------------
|
| 65 |
|
| 66 |
|
|
|
|
| 67 |
def get_db():
|
| 68 |
db = SessionLocal()
|
| 69 |
try:
|
|
@@ -72,10 +22,14 @@ def get_db():
|
|
| 72 |
db.close()
|
| 73 |
|
| 74 |
|
| 75 |
-
limiter
|
|
|
|
|
|
|
|
|
|
|
|
|
| 76 |
|
| 77 |
|
| 78 |
-
#
|
| 79 |
_risk_engine = None
|
| 80 |
_decision_engine = None
|
| 81 |
_stability_controller = None
|
|
@@ -84,8 +38,36 @@ _rag_graph = None
|
|
| 84 |
|
| 85 |
|
| 86 |
def _seed_rag_graph(rag):
|
| 87 |
-
|
| 88 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 89 |
|
| 90 |
|
| 91 |
def get_rag_graph():
|
|
@@ -122,4 +104,4 @@ def get_causal_explainer():
|
|
| 122 |
global _causal_explainer
|
| 123 |
if _causal_explainer is None:
|
| 124 |
_causal_explainer = CausalExplainer()
|
| 125 |
-
return _causal_explainer
|
|
|
|
| 4 |
from slowapi.util import get_remote_address
|
| 5 |
from app.core.config import settings
|
| 6 |
|
| 7 |
+
# ARF core engine imports
|
| 8 |
+
from agentic_reliability_framework.core.governance.risk_engine import RiskEngine
|
| 9 |
+
from agentic_reliability_framework.core.decision.decision_engine import DecisionEngine
|
| 10 |
+
from agentic_reliability_framework.core.governance.stability_controller import LyapunovStabilityController
|
| 11 |
+
from agentic_reliability_framework.core.governance.causal_explainer import CausalExplainer
|
| 12 |
+
from agentic_reliability_framework.runtime.memory.rag_graph import RAGGraphMemory
|
| 13 |
+
from agentic_reliability_framework.core.models.event import ReliabilityEvent, HealingAction
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 14 |
|
| 15 |
|
| 16 |
+
# Dependency to get DB session
|
| 17 |
def get_db():
|
| 18 |
db = SessionLocal()
|
| 19 |
try:
|
|
|
|
| 22 |
db.close()
|
| 23 |
|
| 24 |
|
| 25 |
+
# Rate limiter with default limit from settings
|
| 26 |
+
limiter = Limiter(
|
| 27 |
+
key_func=get_remote_address,
|
| 28 |
+
default_limits=[
|
| 29 |
+
settings.RATE_LIMIT])
|
| 30 |
|
| 31 |
|
| 32 |
+
# ARF engine dependencies (singletons for simplicity)
|
| 33 |
_risk_engine = None
|
| 34 |
_decision_engine = None
|
| 35 |
_stability_controller = None
|
|
|
|
| 38 |
|
| 39 |
|
| 40 |
def _seed_rag_graph(rag):
|
| 41 |
+
"""Seed the RAG graph with historical healing action outcomes."""
|
| 42 |
+
seed_data = [
|
| 43 |
+
("seed_restart_1", "test", HealingAction.RESTART_CONTAINER.value, True, 2),
|
| 44 |
+
("seed_restart_2", "test", HealingAction.RESTART_CONTAINER.value, True, 3),
|
| 45 |
+
("seed_restart_3", "test", HealingAction.RESTART_CONTAINER.value, False, 10),
|
| 46 |
+
("seed_rollback_1", "test", HealingAction.ROLLBACK.value, True, 1),
|
| 47 |
+
("seed_rollback_2", "test", HealingAction.ROLLBACK.value, True, 2),
|
| 48 |
+
("seed_rollback_3", "test", HealingAction.ROLLBACK.value, False, 5),
|
| 49 |
+
("seed_scale_1", "test", HealingAction.SCALE_OUT.value, True, 5),
|
| 50 |
+
("seed_scale_2", "test", HealingAction.SCALE_OUT.value, False, 15),
|
| 51 |
+
("seed_cb_1", "test", HealingAction.CIRCUIT_BREAKER.value, True, 1),
|
| 52 |
+
("seed_cb_2", "test", HealingAction.CIRCUIT_BREAKER.value, True, 2),
|
| 53 |
+
("seed_ts_1", "test", HealingAction.TRAFFIC_SHIFT.value, True, 4),
|
| 54 |
+
("seed_ts_2", "test", HealingAction.TRAFFIC_SHIFT.value, False, 8),
|
| 55 |
+
]
|
| 56 |
+
for inc_id, comp, action, success, res_time in seed_data:
|
| 57 |
+
event = ReliabilityEvent(
|
| 58 |
+
component=comp,
|
| 59 |
+
latency_p99=500,
|
| 60 |
+
error_rate=0.1,
|
| 61 |
+
service_mesh="default"
|
| 62 |
+
)
|
| 63 |
+
rag.record_outcome(
|
| 64 |
+
incident_id=inc_id,
|
| 65 |
+
event=event,
|
| 66 |
+
action_taken=action,
|
| 67 |
+
success=success,
|
| 68 |
+
resolution_time_minutes=res_time
|
| 69 |
+
)
|
| 70 |
+
print("Seeded RAG graph with historical data", file=sys.stderr)
|
| 71 |
|
| 72 |
|
| 73 |
def get_rag_graph():
|
|
|
|
| 104 |
global _causal_explainer
|
| 105 |
if _causal_explainer is None:
|
| 106 |
_causal_explainer = CausalExplainer()
|
| 107 |
+
return _causal_explainer
|
app/api/routes_admin.py
CHANGED
|
@@ -4,25 +4,26 @@ These endpoints should be protected (e.g., by an admin API key) in production.
|
|
| 4 |
"""
|
| 5 |
from fastapi import APIRouter, Depends, HTTPException, Query, Path, Body
|
| 6 |
from pydantic import BaseModel
|
| 7 |
-
from typing import Optional
|
| 8 |
from datetime import datetime
|
| 9 |
import uuid
|
| 10 |
-
|
| 11 |
from app.core.usage_tracker import tracker, Tier
|
| 12 |
|
| 13 |
router = APIRouter(prefix="/admin", tags=["admin"])
|
| 14 |
-
|
| 15 |
# Simple in‑memory admin key (replace with proper auth in production)
|
| 16 |
ADMIN_API_KEY = "admin_secret_change_me"
|
| 17 |
|
|
|
|
| 18 |
def verify_admin(admin_key: str = Query(..., alias="admin_key")):
|
| 19 |
if admin_key != ADMIN_API_KEY:
|
| 20 |
raise HTTPException(status_code=403, detail="Invalid admin key")
|
| 21 |
return True
|
| 22 |
|
|
|
|
| 23 |
class CreateKeyRequest(BaseModel):
|
| 24 |
tier: str
|
| 25 |
|
|
|
|
| 26 |
class UpdateTierRequest(BaseModel):
|
| 27 |
tier: str
|
| 28 |
|
|
@@ -30,20 +31,20 @@ class UpdateTierRequest(BaseModel):
|
|
| 30 |
@router.post("/keys", dependencies=[Depends(verify_admin)])
|
| 31 |
async def create_api_key(req: CreateKeyRequest):
|
| 32 |
if req.tier not in [t.value for t in Tier]:
|
| 33 |
-
raise HTTPException(
|
|
|
|
| 34 |
new_key = f"sk_live_{uuid.uuid4().hex[:24]}"
|
| 35 |
tier_enum = Tier(req.tier)
|
| 36 |
tracker.get_or_create_api_key(new_key, tier_enum)
|
| 37 |
return {"api_key": new_key, "tier": req.tier}
|
| 38 |
|
| 39 |
|
| 40 |
-
@router.get("/keys", dependencies=[Depends(verify_admin)])
|
| 41 |
async def list_api_keys(limit: int = 100, offset: int = 0):
|
| 42 |
with tracker._get_conn() as conn:
|
| 43 |
rows = conn.execute(
|
| 44 |
-
"SELECT key, tier, created_at, last_used_at, is_active FROM api_keys ORDER BY created_at DESC LIMIT ? OFFSET ?",
|
| 45 |
(limit, offset)
|
| 46 |
-
).fetchall()
|
| 47 |
keys = []
|
| 48 |
for row in rows:
|
| 49 |
month = tracker._get_month_key()
|
|
@@ -52,14 +53,18 @@ async def list_api_keys(limit: int = 100, offset: int = 0):
|
|
| 52 |
(row["key"], month)
|
| 53 |
).fetchone()
|
| 54 |
usage = usage_row["count"] if usage_row else 0
|
| 55 |
-
keys.append(
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 63 |
return {"keys": keys, "total": len(keys)}
|
| 64 |
|
| 65 |
|
|
@@ -69,28 +74,33 @@ async def update_key_tier(
|
|
| 69 |
req: UpdateTierRequest = Body(...),
|
| 70 |
):
|
| 71 |
if req.tier not in [t.value for t in Tier]:
|
| 72 |
-
raise HTTPException(
|
|
|
|
| 73 |
with tracker._get_conn() as conn:
|
| 74 |
-
row = conn.execute(
|
|
|
|
| 75 |
if not row:
|
| 76 |
raise HTTPException(status_code=404, detail="API key not found")
|
| 77 |
-
conn.execute("UPDATE api_keys SET tier = ? WHERE key = ?",
|
|
|
|
| 78 |
conn.commit()
|
| 79 |
return {"message": f"Tier updated to {req.tier}"}
|
| 80 |
|
| 81 |
|
| 82 |
@router.delete("/keys/{api_key}", dependencies=[Depends(verify_admin)])
|
| 83 |
-
async def deactivate_api_key(
|
|
|
|
| 84 |
with tracker._get_conn() as conn:
|
| 85 |
-
row = conn.execute(
|
|
|
|
| 86 |
if not row:
|
| 87 |
raise HTTPException(status_code=404, detail="API key not found")
|
| 88 |
-
conn.execute(
|
|
|
|
| 89 |
conn.commit()
|
| 90 |
return {"message": "API key deactivated"}
|
| 91 |
|
| 92 |
|
| 93 |
-
@router.get("/audit/{api_key}", dependencies=[Depends(verify_admin)])
|
| 94 |
async def get_audit_logs(
|
| 95 |
api_key: str = Path(..., description="The API key to audit"),
|
| 96 |
start_date: Optional[str] = Query(None),
|
|
@@ -103,11 +113,12 @@ async def get_audit_logs(
|
|
| 103 |
return {"api_key": api_key, "logs": logs}
|
| 104 |
|
| 105 |
|
| 106 |
-
@router.get("/stats", dependencies=[Depends(verify_admin)])
|
| 107 |
async def get_global_stats():
|
| 108 |
with tracker._get_conn() as conn:
|
| 109 |
-
total_keys = conn.execute(
|
| 110 |
-
|
|
|
|
|
|
|
| 111 |
by_tier = conn.execute(
|
| 112 |
"SELECT tier, COUNT(*) as count FROM usage_log GROUP BY tier"
|
| 113 |
).fetchall()
|
|
|
|
| 4 |
"""
|
| 5 |
from fastapi import APIRouter, Depends, HTTPException, Query, Path, Body
|
| 6 |
from pydantic import BaseModel
|
| 7 |
+
from typing import Optional
|
| 8 |
from datetime import datetime
|
| 9 |
import uuid
|
|
|
|
| 10 |
from app.core.usage_tracker import tracker, Tier
|
| 11 |
|
| 12 |
router = APIRouter(prefix="/admin", tags=["admin"])
|
|
|
|
| 13 |
# Simple in‑memory admin key (replace with proper auth in production)
|
| 14 |
ADMIN_API_KEY = "admin_secret_change_me"
|
| 15 |
|
| 16 |
+
|
| 17 |
def verify_admin(admin_key: str = Query(..., alias="admin_key")):
|
| 18 |
if admin_key != ADMIN_API_KEY:
|
| 19 |
raise HTTPException(status_code=403, detail="Invalid admin key")
|
| 20 |
return True
|
| 21 |
|
| 22 |
+
|
| 23 |
class CreateKeyRequest(BaseModel):
|
| 24 |
tier: str
|
| 25 |
|
| 26 |
+
|
| 27 |
class UpdateTierRequest(BaseModel):
|
| 28 |
tier: str
|
| 29 |
|
|
|
|
| 31 |
@router.post("/keys", dependencies=[Depends(verify_admin)])
|
| 32 |
async def create_api_key(req: CreateKeyRequest):
|
| 33 |
if req.tier not in [t.value for t in Tier]:
|
| 34 |
+
raise HTTPException(
|
| 35 |
+
status_code=400, detail=f"Invalid tier. Must be one of {[t.value for t in Tier]}")
|
| 36 |
new_key = f"sk_live_{uuid.uuid4().hex[:24]}"
|
| 37 |
tier_enum = Tier(req.tier)
|
| 38 |
tracker.get_or_create_api_key(new_key, tier_enum)
|
| 39 |
return {"api_key": new_key, "tier": req.tier}
|
| 40 |
|
| 41 |
|
|
|
|
| 42 |
async def list_api_keys(limit: int = 100, offset: int = 0):
|
| 43 |
with tracker._get_conn() as conn:
|
| 44 |
rows = conn.execute(
|
| 45 |
+
"SELECT key, tier, created_at, last_used_at, is_active FROM api_keys ORDER BY created_at DESC LIMIT ? OFFSET ?", # noqa: E501
|
| 46 |
(limit, offset)
|
| 47 |
+
).fetchall() # noqa: E501
|
| 48 |
keys = []
|
| 49 |
for row in rows:
|
| 50 |
month = tracker._get_month_key()
|
|
|
|
| 53 |
(row["key"], month)
|
| 54 |
).fetchone()
|
| 55 |
usage = usage_row["count"] if usage_row else 0
|
| 56 |
+
keys.append(
|
| 57 |
+
{
|
| 58 |
+
"key": row["key"],
|
| 59 |
+
"tier": row["tier"],
|
| 60 |
+
"created_at": datetime.fromtimestamp(
|
| 61 |
+
row["created_at"]).isoformat(),
|
| 62 |
+
"last_used_at": datetime.fromtimestamp(
|
| 63 |
+
row["last_used_at"]).isoformat() if row["last_used_at"] else None,
|
| 64 |
+
"is_active": bool(
|
| 65 |
+
row["is_active"]),
|
| 66 |
+
"current_month_usage": usage,
|
| 67 |
+
})
|
| 68 |
return {"keys": keys, "total": len(keys)}
|
| 69 |
|
| 70 |
|
|
|
|
| 74 |
req: UpdateTierRequest = Body(...),
|
| 75 |
):
|
| 76 |
if req.tier not in [t.value for t in Tier]:
|
| 77 |
+
raise HTTPException(
|
| 78 |
+
status_code=400, detail=f"Invalid tier. Must be one of {[t.value for t in Tier]}")
|
| 79 |
with tracker._get_conn() as conn:
|
| 80 |
+
row = conn.execute(
|
| 81 |
+
"SELECT key FROM api_keys WHERE key = ?", (api_key,)).fetchone()
|
| 82 |
if not row:
|
| 83 |
raise HTTPException(status_code=404, detail="API key not found")
|
| 84 |
+
conn.execute("UPDATE api_keys SET tier = ? WHERE key = ?",
|
| 85 |
+
(req.tier, api_key))
|
| 86 |
conn.commit()
|
| 87 |
return {"message": f"Tier updated to {req.tier}"}
|
| 88 |
|
| 89 |
|
| 90 |
@router.delete("/keys/{api_key}", dependencies=[Depends(verify_admin)])
|
| 91 |
+
async def deactivate_api_key(
|
| 92 |
+
api_key: str = Path(..., description="The API key to deactivate")):
|
| 93 |
with tracker._get_conn() as conn:
|
| 94 |
+
row = conn.execute(
|
| 95 |
+
"SELECT key FROM api_keys WHERE key = ?", (api_key,)).fetchone()
|
| 96 |
if not row:
|
| 97 |
raise HTTPException(status_code=404, detail="API key not found")
|
| 98 |
+
conn.execute(
|
| 99 |
+
"UPDATE api_keys SET is_active = 0 WHERE key = ?", (api_key,))
|
| 100 |
conn.commit()
|
| 101 |
return {"message": "API key deactivated"}
|
| 102 |
|
| 103 |
|
|
|
|
| 104 |
async def get_audit_logs(
|
| 105 |
api_key: str = Path(..., description="The API key to audit"),
|
| 106 |
start_date: Optional[str] = Query(None),
|
|
|
|
| 113 |
return {"api_key": api_key, "logs": logs}
|
| 114 |
|
| 115 |
|
|
|
|
| 116 |
async def get_global_stats():
|
| 117 |
with tracker._get_conn() as conn:
|
| 118 |
+
total_keys = conn.execute(
|
| 119 |
+
"SELECT COUNT(*) FROM api_keys WHERE is_active = 1").fetchone()[0]
|
| 120 |
+
total_requests = conn.execute(
|
| 121 |
+
"SELECT COUNT(*) FROM usage_log").fetchone()[0]
|
| 122 |
by_tier = conn.execute(
|
| 123 |
"SELECT tier, COUNT(*) as count FROM usage_log GROUP BY tier"
|
| 124 |
).fetchall()
|
app/api/routes_governance.py
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
from fastapi import APIRouter, Depends, HTTPException, Request, BackgroundTasks
|
| 2 |
from fastapi.encoders import jsonable_encoder
|
| 3 |
from sqlalchemy.orm import Session
|
| 4 |
from app.models.infrastructure_intents import InfrastructureIntentRequest
|
|
@@ -8,26 +8,34 @@ from app.services.intent_store import save_evaluated_intent
|
|
| 8 |
from app.services.outcome_service import record_outcome
|
| 9 |
from app.api.deps import get_db
|
| 10 |
from pydantic import BaseModel
|
| 11 |
-
from typing import Optional
|
| 12 |
import uuid
|
| 13 |
import logging
|
| 14 |
import time
|
|
|
|
|
|
|
|
|
|
| 15 |
|
| 16 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
| 17 |
try:
|
| 18 |
-
from
|
|
|
|
| 19 |
except ImportError:
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
component: str
|
| 23 |
-
latency_p99: float
|
| 24 |
-
error_rate: float
|
| 25 |
-
service_mesh: str = "default"
|
| 26 |
-
cpu_util: Optional[float] = None
|
| 27 |
-
memory_util: Optional[float] = None
|
| 28 |
|
| 29 |
-
# =====
|
| 30 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
|
| 32 |
logger = logging.getLogger(__name__)
|
| 33 |
router = APIRouter()
|
|
@@ -50,13 +58,52 @@ async def evaluate_intent_endpoint(
|
|
| 50 |
intent_req: InfrastructureIntentRequest,
|
| 51 |
background_tasks: BackgroundTasks,
|
| 52 |
db: Session = Depends(get_db),
|
| 53 |
-
|
| 54 |
):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 55 |
start_time = time.time()
|
| 56 |
-
api_key =
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 60 |
|
| 61 |
try:
|
| 62 |
oss_intent = to_oss_intent(intent_req)
|
|
@@ -68,6 +115,10 @@ async def evaluate_intent_endpoint(
|
|
| 68 |
policy_violations=intent_req.policy_violations
|
| 69 |
)
|
| 70 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 71 |
deterministic_id = str(uuid.uuid4())
|
| 72 |
api_payload = jsonable_encoder(intent_req.model_dump())
|
| 73 |
oss_payload = jsonable_encoder(oss_intent.model_dump())
|
|
@@ -85,36 +136,39 @@ async def evaluate_intent_endpoint(
|
|
| 85 |
result["intent_id"] = deterministic_id
|
| 86 |
response_data = result
|
| 87 |
|
| 88 |
-
if
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
|
|
|
|
|
|
|
|
|
| 97 |
)
|
| 98 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 99 |
|
| 100 |
return response_data
|
| 101 |
|
| 102 |
except HTTPException:
|
|
|
|
|
|
|
|
|
|
| 103 |
raise
|
| 104 |
except Exception as e:
|
| 105 |
error_msg = str(e)
|
| 106 |
logger.exception("Error in evaluate_intent_endpoint")
|
| 107 |
-
if
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
timestamp=time.time(),
|
| 112 |
-
endpoint="/api/v1/intents/evaluate",
|
| 113 |
-
request_body=intent_req.model_dump(),
|
| 114 |
-
error=error_msg,
|
| 115 |
-
processing_ms=(time.time() - start_time) * 1000,
|
| 116 |
-
)
|
| 117 |
-
await tracker.increment_usage_async(record, background_tasks)
|
| 118 |
raise HTTPException(status_code=500, detail=error_msg)
|
| 119 |
|
| 120 |
|
|
@@ -122,9 +176,14 @@ async def evaluate_intent_endpoint(
|
|
| 122 |
async def record_outcome_endpoint(
|
| 123 |
request: Request,
|
| 124 |
outcome: OutcomeRequest,
|
| 125 |
-
db: Session = Depends(get_db)
|
|
|
|
| 126 |
):
|
| 127 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 128 |
try:
|
| 129 |
risk_engine = request.app.state.risk_engine
|
| 130 |
outcome_record = record_outcome(
|
|
@@ -133,8 +192,27 @@ async def record_outcome_endpoint(
|
|
| 133 |
success=outcome.success,
|
| 134 |
recorded_by=outcome.recorded_by,
|
| 135 |
notes=outcome.notes,
|
| 136 |
-
risk_engine=risk_engine
|
|
|
|
| 137 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 138 |
return {"message": "Outcome recorded", "outcome_id": outcome_record.id}
|
| 139 |
except Exception as e:
|
| 140 |
raise HTTPException(status_code=500, detail=str(e))
|
|
@@ -145,13 +223,51 @@ async def evaluate_healing_decision_endpoint(
|
|
| 145 |
request: Request,
|
| 146 |
decision_req: HealingDecisionRequest,
|
| 147 |
background_tasks: BackgroundTasks,
|
| 148 |
-
|
| 149 |
):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 150 |
start_time = time.time()
|
| 151 |
-
api_key =
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 155 |
|
| 156 |
try:
|
| 157 |
policy_engine = request.app.state.policy_engine
|
|
@@ -168,34 +284,37 @@ async def evaluate_healing_decision_endpoint(
|
|
| 168 |
tokenizer=tokenizer,
|
| 169 |
)
|
| 170 |
|
| 171 |
-
if
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
endpoint="/api/v1/healing/evaluate",
|
| 177 |
-
request_body=decision_req.model_dump(),
|
| 178 |
-
response=response_data,
|
| 179 |
-
processing_ms=(time.time() - start_time) * 1000,
|
| 180 |
-
)
|
| 181 |
-
await tracker.increment_usage_async(record, background_tasks)
|
| 182 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 183 |
return response_data
|
| 184 |
|
| 185 |
except HTTPException:
|
|
|
|
|
|
|
|
|
|
| 186 |
raise
|
| 187 |
except Exception as e:
|
| 188 |
error_msg = str(e)
|
| 189 |
logger.exception("Error in evaluate_healing_decision_endpoint")
|
| 190 |
-
if
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
endpoint="/api/v1/healing/evaluate",
|
| 196 |
-
request_body=decision_req.model_dump(),
|
| 197 |
-
error=error_msg,
|
| 198 |
-
processing_ms=(time.time() - start_time) * 1000,
|
| 199 |
-
)
|
| 200 |
-
await tracker.increment_usage_async(record, background_tasks)
|
| 201 |
-
raise HTTPException(status_code=500, detail=error_msg)
|
|
|
|
| 1 |
+
from fastapi import APIRouter, Depends, HTTPException, Request, BackgroundTasks, Header
|
| 2 |
from fastapi.encoders import jsonable_encoder
|
| 3 |
from sqlalchemy.orm import Session
|
| 4 |
from app.models.infrastructure_intents import InfrastructureIntentRequest
|
|
|
|
| 8 |
from app.services.outcome_service import record_outcome
|
| 9 |
from app.api.deps import get_db
|
| 10 |
from pydantic import BaseModel
|
|
|
|
| 11 |
import uuid
|
| 12 |
import logging
|
| 13 |
import time
|
| 14 |
+
from typing import Optional
|
| 15 |
+
|
| 16 |
+
from agentic_reliability_framework.core.models.event import ReliabilityEvent
|
| 17 |
|
| 18 |
+
# ===== USAGE TRACKER IMPORTS =====
|
| 19 |
+
import app.core.usage_tracker
|
| 20 |
+
from app.core.usage_tracker import UsageRecord
|
| 21 |
+
|
| 22 |
+
# ===== PRICING CALCULATOR INTEGRATION =====
|
| 23 |
try:
|
| 24 |
+
from arf_pricing_calculator.storage.buffer import add_event
|
| 25 |
+
PRICING_AVAILABLE = True
|
| 26 |
except ImportError:
|
| 27 |
+
PRICING_AVAILABLE = False
|
| 28 |
+
add_event = None
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
|
| 30 |
+
# ===== OpenTelemetry (optional) =====
|
| 31 |
+
try:
|
| 32 |
+
from opentelemetry import trace
|
| 33 |
+
from opentelemetry.trace import Status, StatusCode
|
| 34 |
+
_tracer = trace.get_tracer(__name__)
|
| 35 |
+
OTEL_AVAILABLE = True
|
| 36 |
+
except ImportError:
|
| 37 |
+
OTEL_AVAILABLE = False
|
| 38 |
+
_tracer = None
|
| 39 |
|
| 40 |
logger = logging.getLogger(__name__)
|
| 41 |
router = APIRouter()
|
|
|
|
| 58 |
intent_req: InfrastructureIntentRequest,
|
| 59 |
background_tasks: BackgroundTasks,
|
| 60 |
db: Session = Depends(get_db),
|
| 61 |
+
idempotency_key: Optional[str] = Header(None, alias="Idempotency-Key"),
|
| 62 |
):
|
| 63 |
+
"""
|
| 64 |
+
Evaluate an infrastructure intent with idempotency and atomic quota consumption.
|
| 65 |
+
"""
|
| 66 |
+
# ── optional trace ──────────────────────────────────────
|
| 67 |
+
span = None
|
| 68 |
+
if OTEL_AVAILABLE and _tracer:
|
| 69 |
+
span = _tracer.start_span("governance.evaluate_intent")
|
| 70 |
+
span.set_attribute("intent_type", intent_req.intent_type)
|
| 71 |
+
span.set_attribute("environment", str(intent_req.environment))
|
| 72 |
+
|
| 73 |
start_time = time.time()
|
| 74 |
+
api_key = request.headers.get("Authorization", "").replace("Bearer ", "")
|
| 75 |
+
if not api_key:
|
| 76 |
+
api_key = request.query_params.get("api_key", "unknown")
|
| 77 |
+
|
| 78 |
+
current_tracker = app.core.usage_tracker.tracker
|
| 79 |
+
if current_tracker is None:
|
| 80 |
+
if span:
|
| 81 |
+
span.set_status(Status(StatusCode.ERROR, "tracker unavailable"))
|
| 82 |
+
span.end()
|
| 83 |
+
raise HTTPException(status_code=503,
|
| 84 |
+
detail="Usage tracking service unavailable")
|
| 85 |
+
|
| 86 |
+
record = UsageRecord(
|
| 87 |
+
api_key=api_key,
|
| 88 |
+
tier=None,
|
| 89 |
+
timestamp=start_time,
|
| 90 |
+
endpoint="/api/v1/intents/evaluate",
|
| 91 |
+
request_body=intent_req.model_dump(),
|
| 92 |
+
processing_ms=None,
|
| 93 |
+
)
|
| 94 |
+
success, existing_response = current_tracker.consume_quota_and_log(
|
| 95 |
+
record=record,
|
| 96 |
+
idempotency_key=idempotency_key
|
| 97 |
+
)
|
| 98 |
+
if not success:
|
| 99 |
+
if span:
|
| 100 |
+
span.set_attribute("idempotent_hit", True if existing_response else False)
|
| 101 |
+
span.end()
|
| 102 |
+
if existing_response:
|
| 103 |
+
return existing_response
|
| 104 |
+
else:
|
| 105 |
+
raise HTTPException(status_code=429,
|
| 106 |
+
detail="Monthly evaluation quota exceeded")
|
| 107 |
|
| 108 |
try:
|
| 109 |
oss_intent = to_oss_intent(intent_req)
|
|
|
|
| 115 |
policy_violations=intent_req.policy_violations
|
| 116 |
)
|
| 117 |
|
| 118 |
+
if span:
|
| 119 |
+
span.set_attribute("risk_score", result["risk_score"])
|
| 120 |
+
span.set_attribute("deterministic_id", str(uuid.uuid4())) # will be overwritten later, but fine for trace
|
| 121 |
+
|
| 122 |
deterministic_id = str(uuid.uuid4())
|
| 123 |
api_payload = jsonable_encoder(intent_req.model_dump())
|
| 124 |
oss_payload = jsonable_encoder(oss_intent.model_dump())
|
|
|
|
| 136 |
result["intent_id"] = deterministic_id
|
| 137 |
response_data = result
|
| 138 |
|
| 139 |
+
if current_tracker:
|
| 140 |
+
background_tasks.add_task(
|
| 141 |
+
current_tracker._insert_audit_log,
|
| 142 |
+
UsageRecord(
|
| 143 |
+
api_key=api_key,
|
| 144 |
+
tier=None,
|
| 145 |
+
timestamp=time.time(),
|
| 146 |
+
endpoint="/api/v1/intents/evaluate/response",
|
| 147 |
+
request_body=None,
|
| 148 |
+
response=response_data,
|
| 149 |
+
processing_ms=(time.time() - start_time) * 1000,
|
| 150 |
+
)
|
| 151 |
)
|
| 152 |
+
|
| 153 |
+
if span:
|
| 154 |
+
span.set_attribute("intent_id", deterministic_id)
|
| 155 |
+
span.set_status(Status(StatusCode.OK))
|
| 156 |
+
span.end()
|
| 157 |
|
| 158 |
return response_data
|
| 159 |
|
| 160 |
except HTTPException:
|
| 161 |
+
if span:
|
| 162 |
+
span.set_status(Status(StatusCode.ERROR, "HTTP exception"))
|
| 163 |
+
span.end()
|
| 164 |
raise
|
| 165 |
except Exception as e:
|
| 166 |
error_msg = str(e)
|
| 167 |
logger.exception("Error in evaluate_intent_endpoint")
|
| 168 |
+
if span:
|
| 169 |
+
span.set_status(Status(StatusCode.ERROR, error_msg))
|
| 170 |
+
span.record_exception(e)
|
| 171 |
+
span.end()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 172 |
raise HTTPException(status_code=500, detail=error_msg)
|
| 173 |
|
| 174 |
|
|
|
|
| 176 |
async def record_outcome_endpoint(
|
| 177 |
request: Request,
|
| 178 |
outcome: OutcomeRequest,
|
| 179 |
+
db: Session = Depends(get_db),
|
| 180 |
+
idempotency_key: Optional[str] = Header(None, alias="Idempotency-Key"),
|
| 181 |
):
|
| 182 |
+
"""
|
| 183 |
+
Record an outcome for a previously evaluated intent.
|
| 184 |
+
Idempotent based on deterministic_id and success value (handled in service).
|
| 185 |
+
Also updates the pricing calculator's calibration buffer if available.
|
| 186 |
+
"""
|
| 187 |
try:
|
| 188 |
risk_engine = request.app.state.risk_engine
|
| 189 |
outcome_record = record_outcome(
|
|
|
|
| 192 |
success=outcome.success,
|
| 193 |
recorded_by=outcome.recorded_by,
|
| 194 |
notes=outcome.notes,
|
| 195 |
+
risk_engine=risk_engine,
|
| 196 |
+
idempotency_key=idempotency_key,
|
| 197 |
)
|
| 198 |
+
|
| 199 |
+
if PRICING_AVAILABLE and add_event is not None:
|
| 200 |
+
try:
|
| 201 |
+
event = {
|
| 202 |
+
"run_id": outcome.deterministic_id,
|
| 203 |
+
"outcome": "success" if outcome.success else "failure",
|
| 204 |
+
"recorded_at": time.time(),
|
| 205 |
+
"source": "arf_api_outcome"
|
| 206 |
+
}
|
| 207 |
+
add_event(event)
|
| 208 |
+
logger.info(
|
| 209 |
+
f"Added outcome to pricing buffer for intent {
|
| 210 |
+
outcome.deterministic_id}")
|
| 211 |
+
except Exception as e:
|
| 212 |
+
logger.warning(
|
| 213 |
+
f"Failed to update pricing buffer for intent {
|
| 214 |
+
outcome.deterministic_id}: {e}")
|
| 215 |
+
|
| 216 |
return {"message": "Outcome recorded", "outcome_id": outcome_record.id}
|
| 217 |
except Exception as e:
|
| 218 |
raise HTTPException(status_code=500, detail=str(e))
|
|
|
|
| 223 |
request: Request,
|
| 224 |
decision_req: HealingDecisionRequest,
|
| 225 |
background_tasks: BackgroundTasks,
|
| 226 |
+
idempotency_key: Optional[str] = Header(None, alias="Idempotency-Key"),
|
| 227 |
):
|
| 228 |
+
"""
|
| 229 |
+
Evaluate a healing decision with idempotency and atomic quota consumption.
|
| 230 |
+
"""
|
| 231 |
+
# ── optional trace ──────────────────────────────────────
|
| 232 |
+
span = None
|
| 233 |
+
if OTEL_AVAILABLE and _tracer:
|
| 234 |
+
span = _tracer.start_span("governance.evaluate_healing")
|
| 235 |
+
span.set_attribute("component", decision_req.event.component)
|
| 236 |
+
|
| 237 |
start_time = time.time()
|
| 238 |
+
api_key = request.headers.get("Authorization", "").replace("Bearer ", "")
|
| 239 |
+
if not api_key:
|
| 240 |
+
api_key = request.query_params.get("api_key", "unknown")
|
| 241 |
+
|
| 242 |
+
current_tracker = app.core.usage_tracker.tracker
|
| 243 |
+
if current_tracker is None:
|
| 244 |
+
if span:
|
| 245 |
+
span.set_status(Status(StatusCode.ERROR, "tracker unavailable"))
|
| 246 |
+
span.end()
|
| 247 |
+
raise HTTPException(status_code=503,
|
| 248 |
+
detail="Usage tracking service unavailable")
|
| 249 |
+
|
| 250 |
+
record = UsageRecord(
|
| 251 |
+
api_key=api_key,
|
| 252 |
+
tier=None,
|
| 253 |
+
timestamp=start_time,
|
| 254 |
+
endpoint="/api/v1/healing/evaluate",
|
| 255 |
+
request_body=decision_req.model_dump(),
|
| 256 |
+
processing_ms=None,
|
| 257 |
+
)
|
| 258 |
+
success, existing_response = current_tracker.consume_quota_and_log(
|
| 259 |
+
record=record,
|
| 260 |
+
idempotency_key=idempotency_key
|
| 261 |
+
)
|
| 262 |
+
if not success:
|
| 263 |
+
if span:
|
| 264 |
+
span.set_attribute("idempotent_hit", True if existing_response else False)
|
| 265 |
+
span.end()
|
| 266 |
+
if existing_response:
|
| 267 |
+
return existing_response
|
| 268 |
+
else:
|
| 269 |
+
raise HTTPException(status_code=429,
|
| 270 |
+
detail="Monthly evaluation quota exceeded")
|
| 271 |
|
| 272 |
try:
|
| 273 |
policy_engine = request.app.state.policy_engine
|
|
|
|
| 284 |
tokenizer=tokenizer,
|
| 285 |
)
|
| 286 |
|
| 287 |
+
if span:
|
| 288 |
+
span.set_attribute("risk_score", response_data.get("risk_score", 0.0))
|
| 289 |
+
span.set_attribute("selected_action", response_data.get("selected_action", "unknown"))
|
| 290 |
+
span.set_status(Status(StatusCode.OK))
|
| 291 |
+
span.end()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 292 |
|
| 293 |
+
if current_tracker:
|
| 294 |
+
background_tasks.add_task(
|
| 295 |
+
current_tracker._insert_audit_log,
|
| 296 |
+
UsageRecord(
|
| 297 |
+
api_key=api_key,
|
| 298 |
+
tier=None,
|
| 299 |
+
timestamp=time.time(),
|
| 300 |
+
endpoint="/api/v1/healing/evaluate/response",
|
| 301 |
+
request_body=None,
|
| 302 |
+
response=response_data,
|
| 303 |
+
processing_ms=(time.time() - start_time) * 1000,
|
| 304 |
+
)
|
| 305 |
+
)
|
| 306 |
return response_data
|
| 307 |
|
| 308 |
except HTTPException:
|
| 309 |
+
if span:
|
| 310 |
+
span.set_status(Status(StatusCode.ERROR, "HTTP exception"))
|
| 311 |
+
span.end()
|
| 312 |
raise
|
| 313 |
except Exception as e:
|
| 314 |
error_msg = str(e)
|
| 315 |
logger.exception("Error in evaluate_healing_decision_endpoint")
|
| 316 |
+
if span:
|
| 317 |
+
span.set_status(Status(StatusCode.ERROR, error_msg))
|
| 318 |
+
span.record_exception(e)
|
| 319 |
+
span.end()
|
| 320 |
+
raise HTTPException(status_code=500, detail=error_msg)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
app/api/routes_incidents.py
CHANGED
|
@@ -1,86 +1,211 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
|
| 9 |
-
|
| 10 |
-
from app.core.usage_tracker import enforce_quota, UsageRecord, tracker
|
| 11 |
|
|
|
|
|
|
|
|
|
|
| 12 |
|
| 13 |
-
|
| 14 |
-
NO_ACTION = "no_action"
|
| 15 |
-
RESTART_CONTAINER = "restart_container"
|
| 16 |
-
SCALE_OUT = "scale_out"
|
| 17 |
-
ROLLBACK = "rollback"
|
| 18 |
-
CIRCUIT_BREAKER = "circuit_breaker"
|
| 19 |
-
TRAFFIC_SHIFT = "traffic_shift"
|
| 20 |
-
ALERT_TEAM = "alert_team"
|
| 21 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
latency_p99: float
|
| 26 |
-
error_rate: float
|
| 27 |
-
service_mesh: str = "default"
|
| 28 |
-
cpu_util: Optional[float] = None
|
| 29 |
-
memory_util: Optional[float] = None
|
| 30 |
|
|
|
|
| 31 |
|
| 32 |
router = APIRouter()
|
| 33 |
-
incident_history = []
|
| 34 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
@router.post("/report_incident")
|
| 37 |
-
async def report_incident(event: ReliabilityEvent):
|
| 38 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 39 |
return {"status": "recorded"}
|
| 40 |
|
| 41 |
|
|
|
|
|
|
|
|
|
|
| 42 |
@router.post("/v1/incidents/evaluate")
|
| 43 |
async def evaluate_incident(
|
| 44 |
request: Request,
|
| 45 |
event: ReliabilityEvent,
|
| 46 |
background_tasks: BackgroundTasks,
|
| 47 |
-
quota: dict = Depends(enforce_quota)
|
| 48 |
-
):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
start_time = time.time()
|
| 50 |
-
api_key = quota["api_key"]
|
| 51 |
tier = quota["tier"]
|
| 52 |
-
response_data = None
|
| 53 |
-
error_msg = None
|
| 54 |
|
| 55 |
try:
|
| 56 |
-
#
|
| 57 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 58 |
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 63 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 64 |
current_state = {
|
| 65 |
"latency": event.latency_p99,
|
| 66 |
"error_rate": event.error_rate,
|
| 67 |
-
"last_action": {"action_type": "no_action"}
|
| 68 |
}
|
| 69 |
proposed_action = {"action_type": optimal_action.value, "params": {}}
|
| 70 |
-
|
| 71 |
-
|
|
|
|
| 72 |
|
|
|
|
|
|
|
|
|
|
| 73 |
healing_intent = {
|
| 74 |
"action": optimal_action.value,
|
| 75 |
"component": event.component,
|
| 76 |
-
"parameters":
|
| 77 |
-
"justification":
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
|
|
|
|
|
|
|
|
|
| 81 |
}
|
| 82 |
|
| 83 |
response_data = {
|
|
|
|
|
|
|
|
|
|
|
|
|
| 84 |
"healing_intent": healing_intent,
|
| 85 |
"causal_explanation": {
|
| 86 |
"factual_outcome": causal_exp.factual_outcome,
|
|
@@ -88,42 +213,49 @@ async def evaluate_incident(
|
|
| 88 |
"effect": causal_exp.effect,
|
| 89 |
"explanation_text": causal_exp.explanation_text,
|
| 90 |
"is_model_based": causal_exp.is_model_based,
|
| 91 |
-
"warnings": causal_exp.warnings
|
| 92 |
},
|
| 93 |
"utility_decision": {
|
| 94 |
"best_action": optimal_action.value,
|
| 95 |
"expected_utility": 0.5,
|
| 96 |
-
"explanation":
|
| 97 |
-
|
|
|
|
|
|
|
| 98 |
}
|
| 99 |
|
|
|
|
| 100 |
# Asynchronous usage logging
|
|
|
|
| 101 |
if tracker:
|
| 102 |
record = UsageRecord(
|
| 103 |
api_key=api_key,
|
| 104 |
tier=tier,
|
| 105 |
timestamp=time.time(),
|
| 106 |
endpoint="/v1/incidents/evaluate",
|
| 107 |
-
request_body=event.
|
| 108 |
response=response_data,
|
| 109 |
processing_ms=(time.time() - start_time) * 1000,
|
| 110 |
)
|
| 111 |
await tracker.increment_usage_async(record, background_tasks)
|
| 112 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 113 |
return response_data
|
| 114 |
|
| 115 |
except HTTPException:
|
| 116 |
raise
|
| 117 |
-
except Exception as
|
| 118 |
-
error_msg = str(
|
| 119 |
-
# Log failure in background
|
| 120 |
if tracker:
|
| 121 |
record = UsageRecord(
|
| 122 |
api_key=api_key,
|
| 123 |
tier=tier,
|
| 124 |
timestamp=time.time(),
|
| 125 |
endpoint="/v1/incidents/evaluate",
|
| 126 |
-
request_body=event.
|
| 127 |
error=error_msg,
|
| 128 |
processing_ms=(time.time() - start_time) * 1000,
|
| 129 |
)
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Incident evaluation endpoints — backward‑compatible Bayesian reroute.
|
| 3 |
+
|
| 4 |
+
This module provides two incident‑related routes:
|
| 5 |
+
|
| 6 |
+
* ``POST /api/v1/report_incident``
|
| 7 |
+
Stores a ``ReliabilityEvent`` in an in‑memory history for auditing
|
| 8 |
+
and debugging.
|
| 9 |
+
* ``POST /api/v1/v1/incidents/evaluate`` **(deprecated)**
|
| 10 |
+
Former heuristic endpoint now **rerouted to the full Bayesian risk
|
| 11 |
+
engine**. All callers should migrate to
|
| 12 |
+
``POST /api/v1/intents/evaluate``, which returns richer metadata
|
| 13 |
+
including CUDL uncertainty decomposition and decision traces.
|
| 14 |
+
|
| 15 |
+
The local model duplicates (``ReliabilityEvent``, ``HealingAction``)
|
| 16 |
+
have been removed; all types are imported from the canonical ARF core
|
| 17 |
+
framework (``agentic_reliability_framework.core.models.event``).
|
| 18 |
+
"""
|
| 19 |
|
| 20 |
+
from __future__ import annotations
|
|
|
|
| 21 |
|
| 22 |
+
import logging
|
| 23 |
+
import time
|
| 24 |
+
from typing import Optional
|
| 25 |
|
| 26 |
+
from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, Request
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 27 |
|
| 28 |
+
from agentic_reliability_framework.core.models.event import (
|
| 29 |
+
HealingAction,
|
| 30 |
+
ReliabilityEvent,
|
| 31 |
+
)
|
| 32 |
|
| 33 |
+
from app.causal_explainer import CausalExplainer
|
| 34 |
+
from app.core.usage_tracker import UsageRecord, enforce_quota, tracker
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
|
| 36 |
+
logger = logging.getLogger(__name__)
|
| 37 |
|
| 38 |
router = APIRouter()
|
|
|
|
| 39 |
|
| 40 |
+
# ---------------------------------------------------------------------------
|
| 41 |
+
# In‑memory incident store (for auditing / debugging only)
|
| 42 |
+
# ---------------------------------------------------------------------------
|
| 43 |
+
incident_history: list[dict] = []
|
| 44 |
|
| 45 |
+
|
| 46 |
+
# ---------------------------------------------------------------------------
|
| 47 |
+
# POST /api/v1/report_incident
|
| 48 |
+
# ---------------------------------------------------------------------------
|
| 49 |
@router.post("/report_incident")
|
| 50 |
+
async def report_incident(event: ReliabilityEvent) -> dict[str, str]:
|
| 51 |
+
"""
|
| 52 |
+
Record a ``ReliabilityEvent`` in the in‑memory incident history.
|
| 53 |
+
|
| 54 |
+
This endpoint is used by internal monitoring tools to feed incident
|
| 55 |
+
data into the causal explainer and downstream analysis. The event
|
| 56 |
+
is stored as a JSON‑safe dictionary and is **not** persisted across
|
| 57 |
+
API restarts.
|
| 58 |
+
|
| 59 |
+
Parameters
|
| 60 |
+
----------
|
| 61 |
+
event : ReliabilityEvent
|
| 62 |
+
The reliability event to record. Must include at minimum
|
| 63 |
+
``component``, ``latency_p99``, ``error_rate``, and
|
| 64 |
+
``service_mesh``.
|
| 65 |
+
|
| 66 |
+
Returns
|
| 67 |
+
-------
|
| 68 |
+
dict
|
| 69 |
+
A simple acknowledgement ``{"status": "recorded"}``.
|
| 70 |
+
"""
|
| 71 |
+
incident_history.append(event.model_dump(mode="json"))
|
| 72 |
return {"status": "recorded"}
|
| 73 |
|
| 74 |
|
| 75 |
+
# ---------------------------------------------------------------------------
|
| 76 |
+
# POST /api/v1/v1/incidents/evaluate (deprecated)
|
| 77 |
+
# ---------------------------------------------------------------------------
|
| 78 |
@router.post("/v1/incidents/evaluate")
|
| 79 |
async def evaluate_incident(
|
| 80 |
request: Request,
|
| 81 |
event: ReliabilityEvent,
|
| 82 |
background_tasks: BackgroundTasks,
|
| 83 |
+
quota: dict = Depends(enforce_quota),
|
| 84 |
+
) -> dict:
|
| 85 |
+
"""
|
| 86 |
+
Evaluate an incident using the **Bayesian risk engine**.
|
| 87 |
+
|
| 88 |
+
.. deprecated:: 0.6.0
|
| 89 |
+
Use ``POST /api/v1/intents/evaluate`` instead. This endpoint
|
| 90 |
+
will be removed in a future release. Responses include a
|
| 91 |
+
``deprecation_notice`` field to assist migration.
|
| 92 |
+
|
| 93 |
+
The following steps are performed:
|
| 94 |
+
|
| 95 |
+
1. Convert the ``ReliabilityEvent`` into a minimal
|
| 96 |
+
``DeployConfigurationIntent`` via ``intent_adapter``.
|
| 97 |
+
2. Call ``risk_service.evaluate_intent()`` to obtain a Bayesian
|
| 98 |
+
risk score.
|
| 99 |
+
3. Generate a heuristic healing action based on the risk score.
|
| 100 |
+
4. Run the causal explainer for counter‑factual text.
|
| 101 |
+
5. Build a backward‑compatible response envelope.
|
| 102 |
+
|
| 103 |
+
Parameters
|
| 104 |
+
----------
|
| 105 |
+
request : Request
|
| 106 |
+
The Starlette request object (used for internal state access).
|
| 107 |
+
event : ReliabilityEvent
|
| 108 |
+
The incident event containing component name, latency, error
|
| 109 |
+
rate, etc.
|
| 110 |
+
background_tasks : BackgroundTasks
|
| 111 |
+
FastAPI background‑task runner for asynchronous logging.
|
| 112 |
+
quota : dict
|
| 113 |
+
Injected by ``enforce_quota``; contains ``api_key``, ``tier``,
|
| 114 |
+
and ``remaining``.
|
| 115 |
+
|
| 116 |
+
Returns
|
| 117 |
+
-------
|
| 118 |
+
dict
|
| 119 |
+
A dictionary with keys:
|
| 120 |
+
|
| 121 |
+
* ``deprecation_notice`` (str) — migration guidance.
|
| 122 |
+
* ``healing_intent`` (dict) — action, component, risk score,
|
| 123 |
+
justification, confidence, and advisory status.
|
| 124 |
+
* ``causal_explanation`` (dict) — factual/counter‑factual
|
| 125 |
+
outcomes and explanation text.
|
| 126 |
+
* ``utility_decision`` (dict) — selected action and expected
|
| 127 |
+
utility.
|
| 128 |
+
"""
|
| 129 |
start_time = time.time()
|
| 130 |
+
api_key: str = quota["api_key"]
|
| 131 |
tier = quota["tier"]
|
| 132 |
+
response_data: Optional[dict] = None
|
| 133 |
+
error_msg: Optional[str] = None
|
| 134 |
|
| 135 |
try:
|
| 136 |
+
# ------------------------------------------------------------------
|
| 137 |
+
# Step 1 – Convert the event into an infrastructure intent
|
| 138 |
+
# ------------------------------------------------------------------
|
| 139 |
+
from app.services.intent_adapter import to_oss_intent
|
| 140 |
+
from app.services.risk_service import evaluate_intent
|
| 141 |
+
|
| 142 |
+
raw_intent = {
|
| 143 |
+
"intent_type": "deploy_config",
|
| 144 |
+
"environment": "prod",
|
| 145 |
+
"service_name": event.component,
|
| 146 |
+
"requester": "auto",
|
| 147 |
+
"change_scope": "global",
|
| 148 |
+
"deployment_target": "prod",
|
| 149 |
+
"configuration": {},
|
| 150 |
+
"provenance": {"source": "incident_evaluate"},
|
| 151 |
+
}
|
| 152 |
+
oss_intent = to_oss_intent(raw_intent)
|
| 153 |
|
| 154 |
+
# ------------------------------------------------------------------
|
| 155 |
+
# Step 2 – Bayesian risk evaluation
|
| 156 |
+
# ------------------------------------------------------------------
|
| 157 |
+
risk_engine = request.app.state.risk_engine
|
| 158 |
+
result = evaluate_intent(
|
| 159 |
+
engine=risk_engine,
|
| 160 |
+
intent=oss_intent,
|
| 161 |
+
cost_estimate=None,
|
| 162 |
+
policy_violations=[],
|
| 163 |
+
)
|
| 164 |
|
| 165 |
+
# ------------------------------------------------------------------
|
| 166 |
+
# Step 3 – Heuristic action selection based on risk threshold
|
| 167 |
+
# ------------------------------------------------------------------
|
| 168 |
+
optimal_action = (
|
| 169 |
+
HealingAction.RESTART_CONTAINER
|
| 170 |
+
if result["risk_score"] > 0.5
|
| 171 |
+
else HealingAction.NO_ACTION
|
| 172 |
+
)
|
| 173 |
+
|
| 174 |
+
# ------------------------------------------------------------------
|
| 175 |
+
# Step 4 – Causal explainer
|
| 176 |
+
# ------------------------------------------------------------------
|
| 177 |
+
causal_explainer = CausalExplainer()
|
| 178 |
current_state = {
|
| 179 |
"latency": event.latency_p99,
|
| 180 |
"error_rate": event.error_rate,
|
| 181 |
+
"last_action": {"action_type": "no_action"},
|
| 182 |
}
|
| 183 |
proposed_action = {"action_type": optimal_action.value, "params": {}}
|
| 184 |
+
causal_exp = causal_explainer.explain_healing_intent(
|
| 185 |
+
proposed_action, current_state, "latency"
|
| 186 |
+
)
|
| 187 |
|
| 188 |
+
# ------------------------------------------------------------------
|
| 189 |
+
# Step 5 – Build response envelope
|
| 190 |
+
# ------------------------------------------------------------------
|
| 191 |
healing_intent = {
|
| 192 |
"action": optimal_action.value,
|
| 193 |
"component": event.component,
|
| 194 |
+
"parameters": {},
|
| 195 |
+
"justification": (
|
| 196 |
+
f"Bayesian risk score: {result['risk_score']:.3f}. "
|
| 197 |
+
f"Causal: {causal_exp.explanation_text}"
|
| 198 |
+
),
|
| 199 |
+
"confidence": 1.0 - result.get("uncertainty", 0.0),
|
| 200 |
+
"risk_score": result["risk_score"],
|
| 201 |
+
"status": "oss_advisory_only",
|
| 202 |
}
|
| 203 |
|
| 204 |
response_data = {
|
| 205 |
+
"deprecation_notice": (
|
| 206 |
+
"This endpoint is deprecated. Use POST /api/v1/intents/evaluate "
|
| 207 |
+
"for the full Bayesian evaluation with CUDL decomposition."
|
| 208 |
+
),
|
| 209 |
"healing_intent": healing_intent,
|
| 210 |
"causal_explanation": {
|
| 211 |
"factual_outcome": causal_exp.factual_outcome,
|
|
|
|
| 213 |
"effect": causal_exp.effect,
|
| 214 |
"explanation_text": causal_exp.explanation_text,
|
| 215 |
"is_model_based": causal_exp.is_model_based,
|
| 216 |
+
"warnings": causal_exp.warnings,
|
| 217 |
},
|
| 218 |
"utility_decision": {
|
| 219 |
"best_action": optimal_action.value,
|
| 220 |
"expected_utility": 0.5,
|
| 221 |
+
"explanation": (
|
| 222 |
+
"Decision based on Bayesian risk threshold > 0.5"
|
| 223 |
+
),
|
| 224 |
+
},
|
| 225 |
}
|
| 226 |
|
| 227 |
+
# ------------------------------------------------------------------
|
| 228 |
# Asynchronous usage logging
|
| 229 |
+
# ------------------------------------------------------------------
|
| 230 |
if tracker:
|
| 231 |
record = UsageRecord(
|
| 232 |
api_key=api_key,
|
| 233 |
tier=tier,
|
| 234 |
timestamp=time.time(),
|
| 235 |
endpoint="/v1/incidents/evaluate",
|
| 236 |
+
request_body=event.model_dump(mode="json"),
|
| 237 |
response=response_data,
|
| 238 |
processing_ms=(time.time() - start_time) * 1000,
|
| 239 |
)
|
| 240 |
await tracker.increment_usage_async(record, background_tasks)
|
| 241 |
|
| 242 |
+
logger.warning(
|
| 243 |
+
"Deprecated endpoint /v1/incidents/evaluate called by key %s",
|
| 244 |
+
api_key[:8],
|
| 245 |
+
)
|
| 246 |
return response_data
|
| 247 |
|
| 248 |
except HTTPException:
|
| 249 |
raise
|
| 250 |
+
except Exception as exc:
|
| 251 |
+
error_msg = str(exc)
|
|
|
|
| 252 |
if tracker:
|
| 253 |
record = UsageRecord(
|
| 254 |
api_key=api_key,
|
| 255 |
tier=tier,
|
| 256 |
timestamp=time.time(),
|
| 257 |
endpoint="/v1/incidents/evaluate",
|
| 258 |
+
request_body=event.model_dump(mode="json"),
|
| 259 |
error=error_msg,
|
| 260 |
processing_ms=(time.time() - start_time) * 1000,
|
| 261 |
)
|
app/api/routes_memory.py
CHANGED
|
@@ -11,7 +11,11 @@ async def memory_stats(request: Request):
|
|
| 11 |
risk_engine = request.app.state.risk_engine
|
| 12 |
|
| 13 |
# Check if memory exists and has the required method
|
| 14 |
-
if hasattr(
|
|
|
|
|
|
|
|
|
|
|
|
|
| 15 |
stats = risk_engine.memory.get_graph_stats()
|
| 16 |
return stats
|
| 17 |
else:
|
|
|
|
| 11 |
risk_engine = request.app.state.risk_engine
|
| 12 |
|
| 13 |
# Check if memory exists and has the required method
|
| 14 |
+
if hasattr(
|
| 15 |
+
risk_engine,
|
| 16 |
+
'memory') and hasattr(
|
| 17 |
+
risk_engine.memory,
|
| 18 |
+
'get_graph_stats'):
|
| 19 |
stats = risk_engine.memory.get_graph_stats()
|
| 20 |
return stats
|
| 21 |
else:
|
app/api/routes_payments.py
CHANGED
|
@@ -4,11 +4,9 @@ Payment endpoints – Stripe Checkout integration.
|
|
| 4 |
|
| 5 |
import os
|
| 6 |
import stripe
|
| 7 |
-
from fastapi import APIRouter, HTTPException
|
| 8 |
from pydantic import BaseModel
|
| 9 |
-
from typing import Optional
|
| 10 |
|
| 11 |
-
from app.core.config import settings
|
| 12 |
from app.core.usage_tracker import tracker, Tier
|
| 13 |
|
| 14 |
router = APIRouter(prefix="/payments", tags=["payments"])
|
|
@@ -17,8 +15,10 @@ router = APIRouter(prefix="/payments", tags=["payments"])
|
|
| 17 |
stripe.api_key = os.getenv("STRIPE_SECRET_KEY")
|
| 18 |
STRIPE_WEBHOOK_SECRET = os.getenv("STRIPE_WEBHOOK_SECRET")
|
| 19 |
|
|
|
|
| 20 |
class CheckoutRequest(BaseModel):
|
| 21 |
api_key: str
|
|
|
|
| 22 |
success_url: str
|
| 23 |
cancel_url: str
|
| 24 |
|
|
@@ -32,14 +32,16 @@ async def create_checkout_session(req: CheckoutRequest):
|
|
| 32 |
# Verify the API key exists and is free tier
|
| 33 |
tier = tracker.get_tier(req.api_key) if tracker else None
|
| 34 |
if tier != Tier.FREE:
|
| 35 |
-
raise HTTPException(status_code=400,
|
|
|
|
| 36 |
|
| 37 |
try:
|
| 38 |
checkout_session = stripe.checkout.Session.create(
|
| 39 |
payment_method_types=["card"],
|
| 40 |
line_items=[
|
| 41 |
{
|
| 42 |
-
|
|
|
|
| 43 |
"quantity": 1,
|
| 44 |
}
|
| 45 |
],
|
|
|
|
| 4 |
|
| 5 |
import os
|
| 6 |
import stripe
|
| 7 |
+
from fastapi import APIRouter, HTTPException
|
| 8 |
from pydantic import BaseModel
|
|
|
|
| 9 |
|
|
|
|
| 10 |
from app.core.usage_tracker import tracker, Tier
|
| 11 |
|
| 12 |
router = APIRouter(prefix="/payments", tags=["payments"])
|
|
|
|
| 15 |
stripe.api_key = os.getenv("STRIPE_SECRET_KEY")
|
| 16 |
STRIPE_WEBHOOK_SECRET = os.getenv("STRIPE_WEBHOOK_SECRET")
|
| 17 |
|
| 18 |
+
|
| 19 |
class CheckoutRequest(BaseModel):
|
| 20 |
api_key: str
|
| 21 |
+
|
| 22 |
success_url: str
|
| 23 |
cancel_url: str
|
| 24 |
|
|
|
|
| 32 |
# Verify the API key exists and is free tier
|
| 33 |
tier = tracker.get_tier(req.api_key) if tracker else None
|
| 34 |
if tier != Tier.FREE:
|
| 35 |
+
raise HTTPException(status_code=400,
|
| 36 |
+
detail="Only free tier keys can be upgraded")
|
| 37 |
|
| 38 |
try:
|
| 39 |
checkout_session = stripe.checkout.Session.create(
|
| 40 |
payment_method_types=["card"],
|
| 41 |
line_items=[
|
| 42 |
{
|
| 43 |
+
# e.g., "price_123"
|
| 44 |
+
"price": os.getenv("STRIPE_PRO_PRICE_ID"),
|
| 45 |
"quantity": 1,
|
| 46 |
}
|
| 47 |
],
|
app/api/routes_pricing.py
ADDED
|
@@ -0,0 +1,104 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Pricing endpoints – integrates the ARF Bayesian pricing calculator.
|
| 3 |
+
"""
|
| 4 |
+
|
| 5 |
+
from fastapi import APIRouter, HTTPException, Depends
|
| 6 |
+
from pydantic import BaseModel
|
| 7 |
+
import logging
|
| 8 |
+
|
| 9 |
+
from arf_pricing_calculator.core.pricing_engine import PricingEngine
|
| 10 |
+
from arf_pricing_calculator.ingestion.questionnaire_parser import parse_input_dict
|
| 11 |
+
from arf_pricing_calculator.types import PricingOutput
|
| 12 |
+
from app.core.usage_tracker import enforce_quota
|
| 13 |
+
|
| 14 |
+
logger = logging.getLogger(__name__)
|
| 15 |
+
router = APIRouter()
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
class PricingEstimateRequest(BaseModel):
|
| 19 |
+
"""Request body for single pricing estimate."""
|
| 20 |
+
input: dict
|
| 21 |
+
customer_id: str = "default"
|
| 22 |
+
force: bool = False
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
class PricingRunRequest(BaseModel):
|
| 26 |
+
"""Request body for multi‑run pricing with learning."""
|
| 27 |
+
input: dict
|
| 28 |
+
customer_id: str = "default"
|
| 29 |
+
runs: int = 1
|
| 30 |
+
cooldown_hours: int = 24
|
| 31 |
+
force: bool = False
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
@router.post("/pricing/estimate", response_model=PricingOutput)
|
| 35 |
+
async def estimate_pricing(
|
| 36 |
+
req: PricingEstimateRequest,
|
| 37 |
+
quota: dict = Depends(enforce_quota), # optional: enforce usage tracking
|
| 38 |
+
):
|
| 39 |
+
"""
|
| 40 |
+
Single pricing estimate – no learning, no buffer update.
|
| 41 |
+
"""
|
| 42 |
+
try:
|
| 43 |
+
# Convert the input dict to a PricingInput object
|
| 44 |
+
pricing_input = parse_input_dict(req.input)
|
| 45 |
+
# Create engine without buffer (no learning)
|
| 46 |
+
engine = PricingEngine(calibration_buffer=[])
|
| 47 |
+
output = engine.estimate(pricing_input)
|
| 48 |
+
return output
|
| 49 |
+
except Exception as e:
|
| 50 |
+
logger.exception("Pricing estimate failed")
|
| 51 |
+
raise HTTPException(status_code=400, detail=str(e))
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
@router.post("/pricing/run", response_model=list[PricingOutput])
|
| 55 |
+
async def run_pricing(
|
| 56 |
+
req: PricingRunRequest,
|
| 57 |
+
quota: dict = Depends(enforce_quota),
|
| 58 |
+
):
|
| 59 |
+
"""
|
| 60 |
+
Multi‑run pricing with cooldown and buffer persistence.
|
| 61 |
+
Each run’s simulated outcome is added to the buffer, so subsequent runs
|
| 62 |
+
see an updated posterior.
|
| 63 |
+
"""
|
| 64 |
+
# We need to reuse the same buffer across runs; we'll load it per request.
|
| 65 |
+
# For simplicity, we'll load from the default location.
|
| 66 |
+
from arf_pricing_calculator.storage.buffer import load_buffer, add_event
|
| 67 |
+
from arf_pricing_calculator.orchestration.cooldown import enforce_cooldown, is_cooldown_active
|
| 68 |
+
|
| 69 |
+
outputs = []
|
| 70 |
+
buffer = load_buffer() # loads from calibration_buffer.json
|
| 71 |
+
|
| 72 |
+
for i in range(req.runs):
|
| 73 |
+
if not req.force and is_cooldown_active(
|
| 74 |
+
req.customer_id, req.cooldown_hours):
|
| 75 |
+
raise HTTPException(status_code=429,
|
| 76 |
+
detail=f"Cooldown active after {i} runs")
|
| 77 |
+
|
| 78 |
+
pricing_input = parse_input_dict(req.input)
|
| 79 |
+
engine = PricingEngine(calibration_buffer=buffer)
|
| 80 |
+
out = engine.estimate(pricing_input)
|
| 81 |
+
|
| 82 |
+
# Simulate an outcome (in real use, this would come from the actual
|
| 83 |
+
# deal)
|
| 84 |
+
import random
|
| 85 |
+
outcome = "success" if random.random() > out.risk_score else "failure" # nosec B311
|
| 86 |
+
|
| 87 |
+
event = {
|
| 88 |
+
"run_id": out.run_history_id,
|
| 89 |
+
"customer_id": req.customer_id,
|
| 90 |
+
"outcome": outcome,
|
| 91 |
+
"price": out.recommended_price,
|
| 92 |
+
"value": out.expected_value,
|
| 93 |
+
"risk_score": out.risk_score,
|
| 94 |
+
"run_number": i + 1,
|
| 95 |
+
}
|
| 96 |
+
add_event(event)
|
| 97 |
+
buffer = load_buffer() # reload after update
|
| 98 |
+
|
| 99 |
+
outputs.append(out)
|
| 100 |
+
|
| 101 |
+
if i < req.runs - 1:
|
| 102 |
+
enforce_cooldown(req.customer_id, req.cooldown_hours)
|
| 103 |
+
|
| 104 |
+
return outputs
|
app/api/routes_risk.py
CHANGED
|
@@ -9,32 +9,29 @@ router = APIRouter()
|
|
| 9 |
async def get_risk():
|
| 10 |
try:
|
| 11 |
risk = get_system_risk()
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
elif risk < 0.8:
|
| 17 |
-
status = "high"
|
| 18 |
-
else:
|
| 19 |
-
status = "critical"
|
| 20 |
-
return RiskResponse(system_risk=risk, status=status)
|
| 21 |
except Exception as e:
|
| 22 |
raise HTTPException(status_code=500, detail=str(e))
|
| 23 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 24 |
|
| 25 |
@router.get("/history")
|
| 26 |
async def get_risk_history():
|
| 27 |
-
"""
|
| 28 |
-
Return dummy historical risk data for the last 24 hours.
|
| 29 |
-
Replace with real database query later.
|
| 30 |
-
"""
|
| 31 |
import random
|
| 32 |
import datetime
|
| 33 |
now = datetime.datetime.now()
|
| 34 |
-
data = [
|
| 35 |
-
|
| 36 |
-
data.append({
|
| 37 |
-
"time": (now - datetime.timedelta(hours=i)).isoformat(),
|
| 38 |
-
"risk": round(random.uniform(0.2, 0.8), 2)
|
| 39 |
-
})
|
| 40 |
return data
|
|
|
|
| 9 |
async def get_risk():
|
| 10 |
try:
|
| 11 |
risk = get_system_risk()
|
| 12 |
+
except NotImplementedError:
|
| 13 |
+
raise HTTPException(
|
| 14 |
+
status_code=501,
|
| 15 |
+
detail="This endpoint is deprecated and not implemented")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
except Exception as e:
|
| 17 |
raise HTTPException(status_code=500, detail=str(e))
|
| 18 |
|
| 19 |
+
if risk < 0.3:
|
| 20 |
+
status = "low"
|
| 21 |
+
elif risk < 0.6:
|
| 22 |
+
status = "moderate"
|
| 23 |
+
elif risk < 0.8:
|
| 24 |
+
status = "high"
|
| 25 |
+
else:
|
| 26 |
+
status = "critical"
|
| 27 |
+
return RiskResponse(system_risk=risk, status=status)
|
| 28 |
+
|
| 29 |
|
| 30 |
@router.get("/history")
|
| 31 |
async def get_risk_history():
|
|
|
|
|
|
|
|
|
|
|
|
|
| 32 |
import random
|
| 33 |
import datetime
|
| 34 |
now = datetime.datetime.now()
|
| 35 |
+
data = [{"time": (now - datetime.timedelta(hours=i)).isoformat(),
|
| 36 |
+
"risk": round(random.uniform(0.2, 0.8), 2)} for i in range(24, 0, -1)]
|
|
|
|
|
|
|
|
|
|
|
|
|
| 37 |
return data
|
app/api/routes_users.py
CHANGED
|
@@ -3,7 +3,6 @@ User endpoints – registration and quota information.
|
|
| 3 |
"""
|
| 4 |
|
| 5 |
import uuid
|
| 6 |
-
import os
|
| 7 |
from fastapi import APIRouter, Depends, HTTPException, Request
|
| 8 |
from slowapi import Limiter
|
| 9 |
from slowapi.util import get_remote_address
|
|
@@ -23,7 +22,9 @@ async def register_user(request: Request):
|
|
| 23 |
Rate‑limited to 5 requests per hour per IP address.
|
| 24 |
"""
|
| 25 |
if tracker is None:
|
| 26 |
-
raise HTTPException(
|
|
|
|
|
|
|
| 27 |
|
| 28 |
# Generate a new API key
|
| 29 |
new_key = f"sk_free_{uuid.uuid4().hex[:24]}"
|
|
@@ -36,12 +37,13 @@ async def register_user(request: Request):
|
|
| 36 |
return {
|
| 37 |
"api_key": new_key,
|
| 38 |
"tier": "free",
|
| 39 |
-
"message": "API key created. Store it securely – you won't see it again."
|
| 40 |
-
}
|
| 41 |
|
| 42 |
|
| 43 |
@router.get("/quota")
|
| 44 |
-
async def get_user_quota(
|
|
|
|
|
|
|
| 45 |
"""
|
| 46 |
Return the current user's tier and remaining evaluation quota.
|
| 47 |
Requires API key in Authorization header.
|
|
@@ -55,17 +57,3 @@ async def get_user_quota(request: Request, quota: dict = Depends(enforce_quota))
|
|
| 55 |
"remaining": remaining,
|
| 56 |
"limit": limit,
|
| 57 |
}
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
# ===== DEBUG ENDPOINT – Remove in production =====
|
| 61 |
-
@router.get("/tracker-status")
|
| 62 |
-
async def tracker_status():
|
| 63 |
-
"""
|
| 64 |
-
Debug endpoint to check if the usage tracker is initialised.
|
| 65 |
-
Returns the tracker object and environment variables.
|
| 66 |
-
"""
|
| 67 |
-
return {
|
| 68 |
-
"tracker": str(tracker),
|
| 69 |
-
"env_tracking": os.getenv("ARF_USAGE_TRACKING"),
|
| 70 |
-
"env_db_path": os.getenv("ARF_USAGE_DB_PATH")
|
| 71 |
-
}
|
|
|
|
| 3 |
"""
|
| 4 |
|
| 5 |
import uuid
|
|
|
|
| 6 |
from fastapi import APIRouter, Depends, HTTPException, Request
|
| 7 |
from slowapi import Limiter
|
| 8 |
from slowapi.util import get_remote_address
|
|
|
|
| 22 |
Rate‑limited to 5 requests per hour per IP address.
|
| 23 |
"""
|
| 24 |
if tracker is None:
|
| 25 |
+
raise HTTPException(
|
| 26 |
+
status_code=503,
|
| 27 |
+
detail="Usage tracking not available")
|
| 28 |
|
| 29 |
# Generate a new API key
|
| 30 |
new_key = f"sk_free_{uuid.uuid4().hex[:24]}"
|
|
|
|
| 37 |
return {
|
| 38 |
"api_key": new_key,
|
| 39 |
"tier": "free",
|
| 40 |
+
"message": "API key created. Store it securely – you won't see it again."}
|
|
|
|
| 41 |
|
| 42 |
|
| 43 |
@router.get("/quota")
|
| 44 |
+
async def get_user_quota(
|
| 45 |
+
request: Request,
|
| 46 |
+
quota: dict = Depends(enforce_quota)):
|
| 47 |
"""
|
| 48 |
Return the current user's tier and remaining evaluation quota.
|
| 49 |
Requires API key in Authorization header.
|
|
|
|
| 57 |
"remaining": remaining,
|
| 58 |
"limit": limit,
|
| 59 |
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
app/api/webhooks.py
CHANGED
|
@@ -33,7 +33,8 @@ async def stripe_webhook(request: Request):
|
|
| 33 |
# Handle subscription events
|
| 34 |
if event["type"] == "checkout.session.completed":
|
| 35 |
session = event["data"]["object"]
|
| 36 |
-
api_key = session.get("client_reference_id") or session.get(
|
|
|
|
| 37 |
if api_key:
|
| 38 |
update_key_tier(api_key, Tier.PRO)
|
| 39 |
elif event["type"] == "customer.subscription.deleted":
|
|
|
|
| 33 |
# Handle subscription events
|
| 34 |
if event["type"] == "checkout.session.completed":
|
| 35 |
session = event["data"]["object"]
|
| 36 |
+
api_key = session.get("client_reference_id") or session.get(
|
| 37 |
+
"metadata", {}).get("api_key")
|
| 38 |
if api_key:
|
| 39 |
update_key_tier(api_key, Tier.PRO)
|
| 40 |
elif event["type"] == "customer.subscription.deleted":
|
app/core/config.py
CHANGED
|
@@ -15,6 +15,9 @@ class Settings(BaseSettings):
|
|
| 15 |
ARF_REDIS_URL: Optional[str] = None
|
| 16 |
ARF_API_KEYS: str = "{}" # JSON string of {key: tier}
|
| 17 |
|
|
|
|
|
|
|
|
|
|
| 18 |
class Config:
|
| 19 |
env_file = ".env"
|
| 20 |
extra = "ignore"
|
|
|
|
| 15 |
ARF_REDIS_URL: Optional[str] = None
|
| 16 |
ARF_API_KEYS: str = "{}" # JSON string of {key: tier}
|
| 17 |
|
| 18 |
+
# Tracing (OpenTelemetry)
|
| 19 |
+
OTEL_EXPORTER_OTLP_ENDPOINT: Optional[str] = None
|
| 20 |
+
|
| 21 |
class Config:
|
| 22 |
env_file = ".env"
|
| 23 |
extra = "ignore"
|
app/core/usage_tracker.py
CHANGED
|
@@ -1,19 +1,18 @@
|
|
| 1 |
"""
|
| 2 |
Usage Tracker for ARF API – quotas, tiers, and audit logging.
|
| 3 |
-
|
| 4 |
"""
|
| 5 |
|
| 6 |
-
import os
|
| 7 |
import json
|
| 8 |
import sqlite3
|
| 9 |
import threading
|
| 10 |
import time
|
| 11 |
from contextlib import contextmanager
|
| 12 |
from datetime import datetime, timedelta
|
| 13 |
-
from typing import Dict, Any, Optional, List
|
| 14 |
-
from enum import Enum
|
| 15 |
from dataclasses import dataclass
|
| 16 |
-
from
|
|
|
|
|
|
|
| 17 |
|
| 18 |
# Optional Redis support
|
| 19 |
try:
|
|
@@ -66,10 +65,11 @@ class UsageRecord:
|
|
| 66 |
|
| 67 |
class UsageTracker:
|
| 68 |
"""
|
| 69 |
-
Thread‑safe usage tracker with
|
| 70 |
"""
|
| 71 |
|
| 72 |
-
def __init__(self, db_path: str = "arf_usage.db",
|
|
|
|
| 73 |
self.db_path = db_path
|
| 74 |
self._local = threading.local()
|
| 75 |
self._init_db()
|
|
@@ -78,14 +78,17 @@ class UsageTracker:
|
|
| 78 |
if redis_url and REDIS_AVAILABLE:
|
| 79 |
self._redis_client = redis.from_url(redis_url)
|
| 80 |
elif redis_url:
|
| 81 |
-
raise ImportError(
|
|
|
|
| 82 |
|
| 83 |
@contextmanager
|
| 84 |
def _get_conn(self):
|
| 85 |
-
"""Get a thread‑local SQLite connection."""
|
| 86 |
if not hasattr(self._local, "conn"):
|
| 87 |
-
self._local.conn = sqlite3.connect(
|
|
|
|
| 88 |
self._local.conn.row_factory = sqlite3.Row
|
|
|
|
| 89 |
yield self._local.conn
|
| 90 |
|
| 91 |
def _init_db(self):
|
|
@@ -109,7 +112,8 @@ class UsageTracker:
|
|
| 109 |
request_body TEXT,
|
| 110 |
response TEXT,
|
| 111 |
error TEXT,
|
| 112 |
-
processing_ms REAL
|
|
|
|
| 113 |
)
|
| 114 |
""")
|
| 115 |
conn.execute("""
|
|
@@ -124,6 +128,12 @@ class UsageTracker:
|
|
| 124 |
PRIMARY KEY (api_key, year_month)
|
| 125 |
)
|
| 126 |
""")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 127 |
conn.commit()
|
| 128 |
|
| 129 |
def _get_month_key(self) -> str:
|
|
@@ -132,7 +142,8 @@ class UsageTracker:
|
|
| 132 |
def get_or_create_api_key(self, key: str, tier: Tier = Tier.FREE) -> bool:
|
| 133 |
"""Register a new API key. Returns True if key exists or was created."""
|
| 134 |
with self._get_conn() as conn:
|
| 135 |
-
row = conn.execute(
|
|
|
|
| 136 |
if row:
|
| 137 |
return True
|
| 138 |
conn.execute(
|
|
@@ -156,45 +167,56 @@ class UsageTracker:
|
|
| 156 |
def update_api_key_tier(self, api_key: str, new_tier: Tier) -> bool:
|
| 157 |
"""Update the tier of an existing API key. Returns True if successful."""
|
| 158 |
with self._get_conn() as conn:
|
| 159 |
-
row = conn.execute(
|
|
|
|
| 160 |
if not row:
|
| 161 |
return False
|
| 162 |
-
conn.execute(
|
|
|
|
|
|
|
|
|
|
| 163 |
conn.commit()
|
| 164 |
return True
|
| 165 |
|
| 166 |
-
|
| 167 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 168 |
limit = tier.monthly_evaluation_limit
|
| 169 |
if limit is None:
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
|
|
|
|
|
|
|
|
|
| 177 |
|
|
|
|
| 178 |
with self._get_conn() as conn:
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
(
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
month = self._get_month_key()
|
| 192 |
-
if self._redis_client:
|
| 193 |
-
redis_key = f"arf:quota:{api_key}:{month}"
|
| 194 |
-
self._redis_client.incr(redis_key)
|
| 195 |
-
self._redis_client.expire(redis_key, timedelta(days=31))
|
| 196 |
-
else:
|
| 197 |
-
with self._get_conn() as conn:
|
| 198 |
conn.execute(
|
| 199 |
"""INSERT INTO monthly_counts (api_key, year_month, count)
|
| 200 |
VALUES (?, ?, 1)
|
|
@@ -202,58 +224,190 @@ class UsageTracker:
|
|
| 202 |
(api_key, month)
|
| 203 |
)
|
| 204 |
conn.commit()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 205 |
|
| 206 |
-
def
|
| 207 |
-
"""
|
| 208 |
with self._get_conn() as conn:
|
| 209 |
conn.execute(
|
| 210 |
-
"
|
| 211 |
-
|
| 212 |
-
VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
|
| 213 |
-
(
|
| 214 |
-
record.api_key,
|
| 215 |
-
record.tier.value,
|
| 216 |
-
record.timestamp,
|
| 217 |
-
record.endpoint,
|
| 218 |
-
json.dumps(record.request_body) if record.request_body else None,
|
| 219 |
-
json.dumps(record.response) if record.response else None,
|
| 220 |
-
record.error,
|
| 221 |
-
record.processing_ms,
|
| 222 |
-
)
|
| 223 |
)
|
| 224 |
conn.commit()
|
|
|
|
|
|
|
| 225 |
|
| 226 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 227 |
"""
|
| 228 |
Synchronously record usage and increment counter.
|
| 229 |
-
Returns True if within quota
|
|
|
|
| 230 |
"""
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
if limit is not None:
|
| 234 |
-
remaining = self.get_remaining_quota(record.api_key, tier)
|
| 235 |
-
if remaining <= 0:
|
| 236 |
-
return False
|
| 237 |
-
self._increment_quota(record.api_key, tier)
|
| 238 |
-
self._insert_audit_log(record)
|
| 239 |
-
return True
|
| 240 |
|
| 241 |
-
async def increment_usage_async(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 242 |
"""
|
| 243 |
Asynchronously record usage using FastAPI BackgroundTasks.
|
| 244 |
-
|
| 245 |
"""
|
| 246 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 247 |
limit = tier.monthly_evaluation_limit
|
| 248 |
-
if limit is
|
| 249 |
-
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
| 256 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 257 |
def get_audit_logs(
|
| 258 |
self,
|
| 259 |
api_key: str,
|
|
@@ -278,8 +432,9 @@ class UsageTracker:
|
|
| 278 |
return [dict(row) for row in rows]
|
| 279 |
|
| 280 |
def clean_old_logs(self):
|
| 281 |
-
"""Delete logs older than retention period for each tier."""
|
| 282 |
with self._get_conn() as conn:
|
|
|
|
| 283 |
for tier in Tier:
|
| 284 |
retention_days = tier.audit_log_retention_days
|
| 285 |
if retention_days is None:
|
|
@@ -289,14 +444,23 @@ class UsageTracker:
|
|
| 289 |
"DELETE FROM usage_log WHERE tier = ? AND timestamp < ?",
|
| 290 |
(tier.value, cutoff)
|
| 291 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 292 |
conn.commit()
|
| 293 |
|
| 294 |
|
| 295 |
-
#
|
|
|
|
|
|
|
| 296 |
tracker: Optional[UsageTracker] = None
|
| 297 |
|
| 298 |
|
| 299 |
-
def init_tracker(
|
|
|
|
|
|
|
|
|
|
| 300 |
global tracker
|
| 301 |
tracker = UsageTracker(db_path, redis_url)
|
| 302 |
|
|
@@ -308,19 +472,16 @@ def update_key_tier(api_key: str, new_tier: Tier) -> bool:
|
|
| 308 |
return tracker.update_api_key_tier(api_key, new_tier)
|
| 309 |
|
| 310 |
|
| 311 |
-
# FastAPI dependency to enforce quota
|
| 312 |
-
from fastapi import HTTPException, Request
|
| 313 |
-
|
| 314 |
async def enforce_quota(request: Request, api_key: str = None):
|
| 315 |
"""
|
| 316 |
Dependency that checks API key and remaining quota.
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
If usage tracking is disabled, returns a default dict (no enforcement).
|
| 320 |
"""
|
| 321 |
-
#
|
| 322 |
if tracker is None:
|
| 323 |
-
|
|
|
|
|
|
|
| 324 |
|
| 325 |
# Extract API key from header or query
|
| 326 |
if api_key is None:
|
|
@@ -335,13 +496,16 @@ async def enforce_quota(request: Request, api_key: str = None):
|
|
| 335 |
|
| 336 |
tier = tracker.get_tier(api_key)
|
| 337 |
if tier is None:
|
| 338 |
-
raise HTTPException(
|
|
|
|
|
|
|
| 339 |
|
| 340 |
remaining = tracker.get_remaining_quota(api_key, tier)
|
| 341 |
if remaining is not None and remaining <= 0:
|
| 342 |
-
raise HTTPException(status_code=429,
|
|
|
|
| 343 |
|
| 344 |
-
# Store in request state for later logging
|
| 345 |
request.state.api_key = api_key
|
| 346 |
request.state.tier = tier
|
| 347 |
return {"api_key": api_key, "tier": tier, "remaining": remaining}
|
|
|
|
| 1 |
"""
|
| 2 |
Usage Tracker for ARF API – quotas, tiers, and audit logging.
|
| 3 |
+
Thread‑safe, atomic quota consumption, idempotent, fail‑closed.
|
| 4 |
"""
|
| 5 |
|
|
|
|
| 6 |
import json
|
| 7 |
import sqlite3
|
| 8 |
import threading
|
| 9 |
import time
|
| 10 |
from contextlib import contextmanager
|
| 11 |
from datetime import datetime, timedelta
|
|
|
|
|
|
|
| 12 |
from dataclasses import dataclass
|
| 13 |
+
from typing import Dict, Any, Optional, List, Tuple
|
| 14 |
+
from enum import Enum
|
| 15 |
+
from fastapi import BackgroundTasks, HTTPException, Request
|
| 16 |
|
| 17 |
# Optional Redis support
|
| 18 |
try:
|
|
|
|
| 65 |
|
| 66 |
class UsageTracker:
|
| 67 |
"""
|
| 68 |
+
Thread‑safe usage tracker with atomic quota consumption and idempotency.
|
| 69 |
"""
|
| 70 |
|
| 71 |
+
def __init__(self, db_path: str = "arf_usage.db",
|
| 72 |
+
redis_url: Optional[str] = None):
|
| 73 |
self.db_path = db_path
|
| 74 |
self._local = threading.local()
|
| 75 |
self._init_db()
|
|
|
|
| 78 |
if redis_url and REDIS_AVAILABLE:
|
| 79 |
self._redis_client = redis.from_url(redis_url)
|
| 80 |
elif redis_url:
|
| 81 |
+
raise ImportError(
|
| 82 |
+
"Redis client not installed. Run: pip install redis")
|
| 83 |
|
| 84 |
@contextmanager
|
| 85 |
def _get_conn(self):
|
| 86 |
+
"""Get a thread‑local SQLite connection with write‑ahead logging and immediate transactions."""
|
| 87 |
if not hasattr(self._local, "conn"):
|
| 88 |
+
self._local.conn = sqlite3.connect(
|
| 89 |
+
self.db_path, check_same_thread=False, isolation_level=None)
|
| 90 |
self._local.conn.row_factory = sqlite3.Row
|
| 91 |
+
self._local.conn.execute("PRAGMA journal_mode=WAL")
|
| 92 |
yield self._local.conn
|
| 93 |
|
| 94 |
def _init_db(self):
|
|
|
|
| 112 |
request_body TEXT,
|
| 113 |
response TEXT,
|
| 114 |
error TEXT,
|
| 115 |
+
processing_ms REAL,
|
| 116 |
+
idempotency_key TEXT UNIQUE
|
| 117 |
)
|
| 118 |
""")
|
| 119 |
conn.execute("""
|
|
|
|
| 128 |
PRIMARY KEY (api_key, year_month)
|
| 129 |
)
|
| 130 |
""")
|
| 131 |
+
conn.execute("""
|
| 132 |
+
CREATE TABLE IF NOT EXISTS idempotency_keys (
|
| 133 |
+
key TEXT PRIMARY KEY,
|
| 134 |
+
consumed_at REAL NOT NULL
|
| 135 |
+
)
|
| 136 |
+
""")
|
| 137 |
conn.commit()
|
| 138 |
|
| 139 |
def _get_month_key(self) -> str:
|
|
|
|
| 142 |
def get_or_create_api_key(self, key: str, tier: Tier = Tier.FREE) -> bool:
|
| 143 |
"""Register a new API key. Returns True if key exists or was created."""
|
| 144 |
with self._get_conn() as conn:
|
| 145 |
+
row = conn.execute(
|
| 146 |
+
"SELECT key FROM api_keys WHERE key = ?", (key,)).fetchone()
|
| 147 |
if row:
|
| 148 |
return True
|
| 149 |
conn.execute(
|
|
|
|
| 167 |
def update_api_key_tier(self, api_key: str, new_tier: Tier) -> bool:
|
| 168 |
"""Update the tier of an existing API key. Returns True if successful."""
|
| 169 |
with self._get_conn() as conn:
|
| 170 |
+
row = conn.execute(
|
| 171 |
+
"SELECT key FROM api_keys WHERE key = ?", (api_key,)).fetchone()
|
| 172 |
if not row:
|
| 173 |
return False
|
| 174 |
+
conn.execute(
|
| 175 |
+
"UPDATE api_keys SET tier = ? WHERE key = ?",
|
| 176 |
+
(new_tier.value,
|
| 177 |
+
api_key))
|
| 178 |
conn.commit()
|
| 179 |
return True
|
| 180 |
|
| 181 |
+
# --------------------------------------------------------------------------
|
| 182 |
+
# Atomic quota consumption
|
| 183 |
+
# --------------------------------------------------------------------------
|
| 184 |
+
def _consume_quota_atomic_sqlite(
|
| 185 |
+
self,
|
| 186 |
+
api_key: str,
|
| 187 |
+
tier: Tier,
|
| 188 |
+
month: str) -> bool: # noqa: E501
|
| 189 |
+
"""
|
| 190 |
+
Atomically increment counter only if under limit.
|
| 191 |
+
Returns True if quota was consumed, False if limit reached.
|
| 192 |
+
"""
|
| 193 |
limit = tier.monthly_evaluation_limit
|
| 194 |
if limit is None:
|
| 195 |
+
# Unlimited – still increment for tracking but always succeed
|
| 196 |
+
with self._get_conn() as conn:
|
| 197 |
+
conn.execute(
|
| 198 |
+
"""INSERT INTO monthly_counts (api_key, year_month, count)
|
| 199 |
+
VALUES (?, ?, 1)
|
| 200 |
+
ON CONFLICT(api_key, year_month) DO UPDATE SET count = count + 1""",
|
| 201 |
+
(api_key, month)
|
| 202 |
+
)
|
| 203 |
+
conn.commit()
|
| 204 |
+
return True
|
| 205 |
|
| 206 |
+
# Use BEGIN IMMEDIATE to lock the database for the transaction
|
| 207 |
with self._get_conn() as conn:
|
| 208 |
+
conn.execute("BEGIN IMMEDIATE")
|
| 209 |
+
try:
|
| 210 |
+
# Get current count (or 0)
|
| 211 |
+
row = conn.execute(
|
| 212 |
+
"SELECT count FROM monthly_counts WHERE api_key = ? AND year_month = ?",
|
| 213 |
+
(api_key, month)
|
| 214 |
+
).fetchone()
|
| 215 |
+
current = row["count"] if row else 0
|
| 216 |
+
if current >= limit:
|
| 217 |
+
conn.rollback()
|
| 218 |
+
return False
|
| 219 |
+
# Increment
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 220 |
conn.execute(
|
| 221 |
"""INSERT INTO monthly_counts (api_key, year_month, count)
|
| 222 |
VALUES (?, ?, 1)
|
|
|
|
| 224 |
(api_key, month)
|
| 225 |
)
|
| 226 |
conn.commit()
|
| 227 |
+
return True
|
| 228 |
+
except Exception:
|
| 229 |
+
conn.rollback()
|
| 230 |
+
raise
|
| 231 |
+
|
| 232 |
+
def _consume_quota_atomic_redis(
|
| 233 |
+
self,
|
| 234 |
+
api_key: str,
|
| 235 |
+
tier: Tier,
|
| 236 |
+
month: str) -> bool:
|
| 237 |
+
"""Atomic Lua script for Redis: INCR only if below limit."""
|
| 238 |
+
limit = tier.monthly_evaluation_limit
|
| 239 |
+
if limit is None:
|
| 240 |
+
# Unlimited – just increment and return True
|
| 241 |
+
redis_key = f"arf:quota:{api_key}:{month}"
|
| 242 |
+
self._redis_client.incr(redis_key)
|
| 243 |
+
self._redis_client.expire(redis_key, timedelta(days=31))
|
| 244 |
+
return True
|
| 245 |
+
|
| 246 |
+
lua_script = """
|
| 247 |
+
local key = KEYS[1]
|
| 248 |
+
local limit = tonumber(ARGV[1])
|
| 249 |
+
local current = redis.call('GET', key)
|
| 250 |
+
if current and tonumber(current) >= limit then
|
| 251 |
+
return 0
|
| 252 |
+
end
|
| 253 |
+
local new = redis.call('INCR', key)
|
| 254 |
+
redis.call('EXPIRE', key, 2678400) -- 31 days
|
| 255 |
+
return 1
|
| 256 |
+
"""
|
| 257 |
+
redis_key = f"arf:quota:{api_key}:{month}"
|
| 258 |
+
result = self._redis_client.eval(lua_script, 1, redis_key, limit)
|
| 259 |
+
return result == 1
|
| 260 |
+
|
| 261 |
+
# --------------------------------------------------------------------------
|
| 262 |
+
# Idempotency handling
|
| 263 |
+
# --------------------------------------------------------------------------
|
| 264 |
+
def _is_idempotent_key_used(self, key: str) -> bool:
|
| 265 |
+
"""Check if idempotency key already processed."""
|
| 266 |
+
with self._get_conn() as conn:
|
| 267 |
+
row = conn.execute(
|
| 268 |
+
"SELECT 1 FROM idempotency_keys WHERE key = ?", (key,)).fetchone()
|
| 269 |
+
return row is not None
|
| 270 |
|
| 271 |
+
def _mark_idempotent_key_used(self, key: str, ttl_seconds: int = 86400):
|
| 272 |
+
"""Store idempotency key with expiration (cleanup later)."""
|
| 273 |
with self._get_conn() as conn:
|
| 274 |
conn.execute(
|
| 275 |
+
"INSERT INTO idempotency_keys (key, consumed_at) VALUES (?, ?)",
|
| 276 |
+
(key, time.time())
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 277 |
)
|
| 278 |
conn.commit()
|
| 279 |
+
# Optionally schedule cleanup of old keys (can be done in a background
|
| 280 |
+
# thread)
|
| 281 |
|
| 282 |
+
# --------------------------------------------------------------------------
|
| 283 |
+
# Core usage recording (atomic + idempotent)
|
| 284 |
+
# --------------------------------------------------------------------------
|
| 285 |
+
def consume_quota_and_log(
|
| 286 |
+
self,
|
| 287 |
+
record: UsageRecord,
|
| 288 |
+
idempotency_key: Optional[str] = None,
|
| 289 |
+
) -> Tuple[bool, Optional[Dict[str, Any]]]:
|
| 290 |
+
"""
|
| 291 |
+
Atomically consume quota and insert audit log.
|
| 292 |
+
Returns (success, existing_response) where existing_response is not None
|
| 293 |
+
only when idempotency_key matched a previous successful call.
|
| 294 |
+
"""
|
| 295 |
+
# Idempotency check (if key provided)
|
| 296 |
+
if idempotency_key:
|
| 297 |
+
if self._is_idempotent_key_used(idempotency_key):
|
| 298 |
+
# Retrieve previous response from audit log (simplified – you may cache full response)
|
| 299 |
+
# For full idempotency, we would store the response body in idempotency table.
|
| 300 |
+
# Here we return a marker that caller should use cached
|
| 301 |
+
# response.
|
| 302 |
+
return False, {"idempotent": True,
|
| 303 |
+
"message": "Already processed"}
|
| 304 |
+
|
| 305 |
+
month = self._get_month_key()
|
| 306 |
+
# Atomic quota consumption
|
| 307 |
+
if self._redis_client:
|
| 308 |
+
quota_ok = self._consume_quota_atomic_redis(
|
| 309 |
+
record.api_key, record.tier, month)
|
| 310 |
+
else:
|
| 311 |
+
quota_ok = self._consume_quota_atomic_sqlite(
|
| 312 |
+
record.api_key, record.tier, month)
|
| 313 |
+
|
| 314 |
+
if not quota_ok:
|
| 315 |
+
return False, None
|
| 316 |
+
|
| 317 |
+
# Insert audit log (with idempotency key as unique constraint)
|
| 318 |
+
try:
|
| 319 |
+
with self._get_conn() as conn:
|
| 320 |
+
conn.execute(
|
| 321 |
+
"""INSERT INTO usage_log
|
| 322 |
+
(api_key, tier, timestamp, endpoint,
|
| 323 |
+
request_body, response, error, processing_ms,
|
| 324 |
+
idempotency_key)
|
| 325 |
+
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
|
| 326 |
+
(record.api_key,
|
| 327 |
+
record.tier.value,
|
| 328 |
+
record.timestamp,
|
| 329 |
+
record.endpoint,
|
| 330 |
+
json.dumps(
|
| 331 |
+
record.request_body) if record.request_body else None,
|
| 332 |
+
json.dumps(
|
| 333 |
+
record.response) if record.response else None,
|
| 334 |
+
record.error,
|
| 335 |
+
record.processing_ms,
|
| 336 |
+
idempotency_key,
|
| 337 |
+
))
|
| 338 |
+
conn.commit()
|
| 339 |
+
except sqlite3.IntegrityError as e:
|
| 340 |
+
# Duplicate idempotency_key – already inserted by another
|
| 341 |
+
# concurrent request
|
| 342 |
+
if "UNIQUE constraint failed: usage_log.idempotency_key" in str(e):
|
| 343 |
+
return False, {"idempotent": True,
|
| 344 |
+
"message": "Already processed"}
|
| 345 |
+
raise
|
| 346 |
+
|
| 347 |
+
if idempotency_key:
|
| 348 |
+
self._mark_idempotent_key_used(idempotency_key)
|
| 349 |
+
# Removed stray # noqa: E501 comment that was wrongly indented here
|
| 350 |
+
return True, None
|
| 351 |
+
|
| 352 |
+
# --------------------------------------------------------------------------
|
| 353 |
+
# Legacy interface (kept for compatibility but deprecated)
|
| 354 |
+
# --------------------------------------------------------------------------
|
| 355 |
+
def increment_usage_sync(
|
| 356 |
+
self,
|
| 357 |
+
record: UsageRecord,
|
| 358 |
+
idempotency_key: Optional[str] = None) -> bool:
|
| 359 |
"""
|
| 360 |
Synchronously record usage and increment counter.
|
| 361 |
+
Returns True if within quota and recorded, False otherwise.
|
| 362 |
+
This method now uses the atomic implementation.
|
| 363 |
"""
|
| 364 |
+
success, _ = self.consume_quota_and_log(record, idempotency_key)
|
| 365 |
+
return success
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 366 |
|
| 367 |
+
async def increment_usage_async(
|
| 368 |
+
self,
|
| 369 |
+
record: UsageRecord,
|
| 370 |
+
background_tasks: BackgroundTasks,
|
| 371 |
+
idempotency_key: Optional[str] = None
|
| 372 |
+
) -> bool:
|
| 373 |
"""
|
| 374 |
Asynchronously record usage using FastAPI BackgroundTasks.
|
| 375 |
+
Still does the atomic check synchronously, then schedules the insert.
|
| 376 |
"""
|
| 377 |
+
# First, do atomic quota check (synchronous) – we must ensure we don't double-consume.
|
| 378 |
+
# Because background tasks may run later, we still need to reserve quota now.
|
| 379 |
+
# Simplified: we call consume_quota_and_log synchronously – that defeats async benefit.
|
| 380 |
+
# Better to use a queue or Redis with background processing.
|
| 381 |
+
# For this fix, we'll use the sync method (blocking) but still support
|
| 382 |
+
# idempotency.
|
| 383 |
+
return self.increment_usage_sync(record, idempotency_key)
|
| 384 |
+
|
| 385 |
+
# --------------------------------------------------------------------------
|
| 386 |
+
# Quota inspection (non‑atomic, for display only)
|
| 387 |
+
# --------------------------------------------------------------------------
|
| 388 |
+
def get_remaining_quota(self, api_key: str, tier: Tier) -> Optional[int]:
|
| 389 |
+
"""Return remaining evaluations for the month (non‑atomic, for info only)."""
|
| 390 |
limit = tier.monthly_evaluation_limit
|
| 391 |
+
if limit is None:
|
| 392 |
+
return None
|
| 393 |
+
|
| 394 |
+
month = self._get_month_key()
|
| 395 |
+
if self._redis_client:
|
| 396 |
+
redis_key = f"arf:quota:{api_key}:{month}"
|
| 397 |
+
count = int(self._redis_client.get(redis_key) or 0)
|
| 398 |
+
return max(0, limit - count)
|
| 399 |
|
| 400 |
+
with self._get_conn() as conn:
|
| 401 |
+
row = conn.execute(
|
| 402 |
+
"SELECT count FROM monthly_counts WHERE api_key = ? AND year_month = ?",
|
| 403 |
+
(api_key, month)
|
| 404 |
+
).fetchone()
|
| 405 |
+
count = row["count"] if row else 0
|
| 406 |
+
return max(0, limit - count)
|
| 407 |
+
|
| 408 |
+
# --------------------------------------------------------------------------
|
| 409 |
+
# Audit and maintenance
|
| 410 |
+
# --------------------------------------------------------------------------
|
| 411 |
def get_audit_logs(
|
| 412 |
self,
|
| 413 |
api_key: str,
|
|
|
|
| 432 |
return [dict(row) for row in rows]
|
| 433 |
|
| 434 |
def clean_old_logs(self):
|
| 435 |
+
"""Delete logs older than retention period for each tier, and old idempotency keys."""
|
| 436 |
with self._get_conn() as conn:
|
| 437 |
+
# Delete old usage logs
|
| 438 |
for tier in Tier:
|
| 439 |
retention_days = tier.audit_log_retention_days
|
| 440 |
if retention_days is None:
|
|
|
|
| 444 |
"DELETE FROM usage_log WHERE tier = ? AND timestamp < ?",
|
| 445 |
(tier.value, cutoff)
|
| 446 |
)
|
| 447 |
+
# Delete idempotency keys older than 7 days
|
| 448 |
+
cutoff = time.time() - 7 * 86400
|
| 449 |
+
conn.execute(
|
| 450 |
+
"DELETE FROM idempotency_keys WHERE consumed_at < ?", (cutoff,))
|
| 451 |
conn.commit()
|
| 452 |
|
| 453 |
|
| 454 |
+
# --------------------------------------------------------------------------
|
| 455 |
+
# Global instance and FastAPI dependency (fail‑closed)
|
| 456 |
+
# --------------------------------------------------------------------------
|
| 457 |
tracker: Optional[UsageTracker] = None
|
| 458 |
|
| 459 |
|
| 460 |
+
def init_tracker(
|
| 461 |
+
db_path: str = "arf_usage.db",
|
| 462 |
+
redis_url: Optional[str] = None):
|
| 463 |
+
"""Initialize the global tracker. Must be called before enforce_quota."""
|
| 464 |
global tracker
|
| 465 |
tracker = UsageTracker(db_path, redis_url)
|
| 466 |
|
|
|
|
| 472 |
return tracker.update_api_key_tier(api_key, new_tier)
|
| 473 |
|
| 474 |
|
|
|
|
|
|
|
|
|
|
| 475 |
async def enforce_quota(request: Request, api_key: str = None):
|
| 476 |
"""
|
| 477 |
Dependency that checks API key and remaining quota.
|
| 478 |
+
FAILS CLOSED: if tracker not initialised, raises HTTP 503.
|
|
|
|
|
|
|
| 479 |
"""
|
| 480 |
+
# P0 fix: No fallback that allows all requests
|
| 481 |
if tracker is None:
|
| 482 |
+
raise HTTPException(
|
| 483 |
+
status_code=503,
|
| 484 |
+
detail="Usage tracking service not initialised. Please contact administrator.")
|
| 485 |
|
| 486 |
# Extract API key from header or query
|
| 487 |
if api_key is None:
|
|
|
|
| 496 |
|
| 497 |
tier = tracker.get_tier(api_key)
|
| 498 |
if tier is None:
|
| 499 |
+
raise HTTPException(
|
| 500 |
+
status_code=403,
|
| 501 |
+
detail="Invalid or inactive API key")
|
| 502 |
|
| 503 |
remaining = tracker.get_remaining_quota(api_key, tier)
|
| 504 |
if remaining is not None and remaining <= 0:
|
| 505 |
+
raise HTTPException(status_code=429,
|
| 506 |
+
detail="Monthly evaluation quota exceeded")
|
| 507 |
|
| 508 |
+
# Store in request state for later logging (optional)
|
| 509 |
request.state.api_key = api_key
|
| 510 |
request.state.tier = tier
|
| 511 |
return {"api_key": api_key, "tier": tier, "remaining": remaining}
|
app/database/models_intents.py
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
from sqlalchemy import Column, Integer, String, DateTime, Boolean, Text, JSON, ForeignKey, UniqueConstraint
|
| 2 |
from sqlalchemy.orm import relationship
|
| 3 |
import datetime
|
| 4 |
from .base import Base
|
|
@@ -7,27 +7,69 @@ from .base import Base
|
|
| 7 |
class IntentDB(Base):
|
| 8 |
__tablename__ = "intents"
|
| 9 |
id = Column(Integer, primary_key=True, index=True)
|
| 10 |
-
deterministic_id = Column(
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
intent_type = Column(String(64), nullable=False)
|
| 12 |
payload = Column(JSON, nullable=False)
|
| 13 |
oss_payload = Column(JSON, nullable=True)
|
| 14 |
environment = Column(String(32), nullable=True)
|
| 15 |
-
created_at = Column(
|
|
|
|
|
|
|
|
|
|
| 16 |
evaluated_at = Column(DateTime, nullable=True)
|
| 17 |
risk_score = Column(String(32), nullable=True)
|
| 18 |
-
outcomes = relationship(
|
|
|
|
|
|
|
|
|
|
| 19 |
|
| 20 |
|
| 21 |
class OutcomeDB(Base):
|
| 22 |
__tablename__ = "intent_outcomes"
|
| 23 |
id = Column(Integer, primary_key=True, index=True)
|
| 24 |
-
intent_id = Column(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
success = Column(Boolean, nullable=False)
|
| 26 |
recorded_by = Column(String(128), nullable=True)
|
| 27 |
notes = Column(Text, nullable=True)
|
| 28 |
-
recorded_at = Column(
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
intent = relationship("IntentDB", back_populates="outcomes")
|
| 30 |
|
| 31 |
__table_args__ = (
|
| 32 |
UniqueConstraint("intent_id", name="uq_outcome_intentid"),
|
| 33 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from sqlalchemy import Column, Integer, String, DateTime, Boolean, Text, JSON, Float, ForeignKey, UniqueConstraint
|
| 2 |
from sqlalchemy.orm import relationship
|
| 3 |
import datetime
|
| 4 |
from .base import Base
|
|
|
|
| 7 |
class IntentDB(Base):
|
| 8 |
__tablename__ = "intents"
|
| 9 |
id = Column(Integer, primary_key=True, index=True)
|
| 10 |
+
deterministic_id = Column(
|
| 11 |
+
String(64),
|
| 12 |
+
unique=True,
|
| 13 |
+
index=True,
|
| 14 |
+
nullable=False)
|
| 15 |
intent_type = Column(String(64), nullable=False)
|
| 16 |
payload = Column(JSON, nullable=False)
|
| 17 |
oss_payload = Column(JSON, nullable=True)
|
| 18 |
environment = Column(String(32), nullable=True)
|
| 19 |
+
created_at = Column(
|
| 20 |
+
DateTime,
|
| 21 |
+
default=datetime.datetime.utcnow,
|
| 22 |
+
nullable=False)
|
| 23 |
evaluated_at = Column(DateTime, nullable=True)
|
| 24 |
risk_score = Column(String(32), nullable=True)
|
| 25 |
+
outcomes = relationship(
|
| 26 |
+
"OutcomeDB",
|
| 27 |
+
back_populates="intent",
|
| 28 |
+
cascade="all, delete-orphan")
|
| 29 |
|
| 30 |
|
| 31 |
class OutcomeDB(Base):
|
| 32 |
__tablename__ = "intent_outcomes"
|
| 33 |
id = Column(Integer, primary_key=True, index=True)
|
| 34 |
+
intent_id = Column(
|
| 35 |
+
Integer,
|
| 36 |
+
ForeignKey(
|
| 37 |
+
"intents.id",
|
| 38 |
+
ondelete="CASCADE"),
|
| 39 |
+
nullable=False)
|
| 40 |
success = Column(Boolean, nullable=False)
|
| 41 |
recorded_by = Column(String(128), nullable=True)
|
| 42 |
notes = Column(Text, nullable=True)
|
| 43 |
+
recorded_at = Column(
|
| 44 |
+
DateTime,
|
| 45 |
+
default=datetime.datetime.utcnow,
|
| 46 |
+
nullable=False)
|
| 47 |
+
idempotency_key = Column(String(128), unique=True, nullable=True)
|
| 48 |
intent = relationship("IntentDB", back_populates="outcomes")
|
| 49 |
|
| 50 |
__table_args__ = (
|
| 51 |
UniqueConstraint("intent_id", name="uq_outcome_intentid"),
|
| 52 |
)
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
# ---------------------------------------------------------------------------
|
| 56 |
+
# NEW: Persistence for the conjugate Bayesian state
|
| 57 |
+
# ---------------------------------------------------------------------------
|
| 58 |
+
class BetaStateDB(Base):
|
| 59 |
+
"""
|
| 60 |
+
Stores the per‑category posterior parameters (α, β) of the BetaStore
|
| 61 |
+
so that online learning survives API restarts.
|
| 62 |
+
|
| 63 |
+
Only one row per ActionCategory is expected; the 'category' column is
|
| 64 |
+
unique. Updates are performed via merge / upsert.
|
| 65 |
+
"""
|
| 66 |
+
__tablename__ = "beta_state"
|
| 67 |
+
|
| 68 |
+
id = Column(Integer, primary_key=True, index=True)
|
| 69 |
+
category = Column(String(32), unique=True, nullable=False, index=True)
|
| 70 |
+
alpha = Column(Float, nullable=False)
|
| 71 |
+
beta = Column(Float, nullable=False)
|
| 72 |
+
updated_at = Column(
|
| 73 |
+
DateTime,
|
| 74 |
+
default=datetime.datetime.utcnow,
|
| 75 |
+
onupdate=datetime.datetime.utcnow)
|
app/database/session.py
CHANGED
|
@@ -1,19 +1,6 @@
|
|
| 1 |
from sqlalchemy import create_engine
|
| 2 |
-
from sqlalchemy.ext.declarative import declarative_base
|
| 3 |
from sqlalchemy.orm import sessionmaker
|
| 4 |
from app.core.config import settings
|
| 5 |
|
| 6 |
-
|
| 7 |
-
if settings.database_url:
|
| 8 |
-
DATABASE_URL = settings.database_url
|
| 9 |
-
else:
|
| 10 |
-
# Fallback to a local SQLite file (writable in the container)
|
| 11 |
-
DATABASE_URL = "sqlite:///./app.db"
|
| 12 |
-
|
| 13 |
-
# For SQLite, we need to disable the threading check
|
| 14 |
-
connect_args = {"check_same_thread": False} if DATABASE_URL.startswith("sqlite") else {}
|
| 15 |
-
|
| 16 |
-
engine = create_engine(DATABASE_URL, connect_args=connect_args)
|
| 17 |
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
|
| 18 |
-
|
| 19 |
-
Base = declarative_base()
|
|
|
|
| 1 |
from sqlalchemy import create_engine
|
|
|
|
| 2 |
from sqlalchemy.orm import sessionmaker
|
| 3 |
from app.core.config import settings
|
| 4 |
|
| 5 |
+
engine = create_engine(settings.database_url)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
|
|
|
|
|
|
app/main.py
CHANGED
|
@@ -1,18 +1,42 @@
|
|
| 1 |
"""
|
| 2 |
-
ARF API Control Plane
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 4 |
"""
|
| 5 |
import logging
|
| 6 |
import os
|
| 7 |
import sys
|
| 8 |
import json
|
|
|
|
|
|
|
| 9 |
from contextlib import asynccontextmanager
|
| 10 |
from typing import Dict
|
| 11 |
|
| 12 |
from fastapi import FastAPI
|
| 13 |
from fastapi.middleware.cors import CORSMiddleware
|
| 14 |
|
| 15 |
-
# Optional
|
| 16 |
try:
|
| 17 |
from prometheus_fastapi_instrumentator import Instrumentator
|
| 18 |
PROMETHEUS_AVAILABLE = True
|
|
@@ -20,7 +44,7 @@ except ImportError:
|
|
| 20 |
PROMETHEUS_AVAILABLE = False
|
| 21 |
Instrumentator = None
|
| 22 |
|
| 23 |
-
# Optional slowapi
|
| 24 |
try:
|
| 25 |
from slowapi import _rate_limit_exceeded_handler
|
| 26 |
from slowapi.errors import RateLimitExceeded
|
|
@@ -32,7 +56,7 @@ except ImportError:
|
|
| 32 |
RateLimitExceeded = None
|
| 33 |
SlowAPIMiddleware = None
|
| 34 |
|
| 35 |
-
#
|
| 36 |
try:
|
| 37 |
from agentic_reliability_framework.core.governance.risk_engine import RiskEngine
|
| 38 |
from agentic_reliability_framework.core.governance.policy_engine import PolicyEngine
|
|
@@ -47,7 +71,7 @@ except ImportError:
|
|
| 47 |
RAGGraphMemory = None
|
| 48 |
MemoryConstants = None
|
| 49 |
|
| 50 |
-
#
|
| 51 |
from app.core.usage_tracker import init_tracker, tracker, Tier
|
| 52 |
|
| 53 |
from app.api import (
|
|
@@ -61,6 +85,7 @@ from app.api import (
|
|
| 61 |
routes_payments,
|
| 62 |
webhooks,
|
| 63 |
routes_users,
|
|
|
|
| 64 |
)
|
| 65 |
from app.api.deps import limiter
|
| 66 |
from app.core.config import settings
|
|
@@ -75,18 +100,35 @@ logging.basicConfig(
|
|
| 75 |
|
| 76 |
@asynccontextmanager
|
| 77 |
async def lifespan(app: FastAPI):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 78 |
logger.info("🚀 Starting ARF API Control Plane")
|
| 79 |
logger.debug(f"Python path: {sys.path}")
|
| 80 |
|
|
|
|
| 81 |
if ARF_AVAILABLE:
|
| 82 |
hmc_model_path = os.getenv("ARF_HMC_MODEL", "models/hmc_model.json")
|
| 83 |
use_hyperpriors = os.getenv(
|
| 84 |
-
"ARF_USE_HYPERPRIORS",
|
| 85 |
-
|
| 86 |
logger.info(
|
| 87 |
"Initializing RiskEngine – HMC model: %s, hyperpriors: %s",
|
| 88 |
hmc_model_path,
|
| 89 |
-
use_hyperpriors
|
|
|
|
| 90 |
try:
|
| 91 |
app.state.risk_engine = RiskEngine(
|
| 92 |
hmc_model_path=hmc_model_path,
|
|
@@ -99,6 +141,55 @@ async def lifespan(app: FastAPI):
|
|
| 99 |
logger.exception("💥 Fatal error initializing RiskEngine")
|
| 100 |
raise RuntimeError("RiskEngine initialization failed") from e
|
| 101 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 102 |
try:
|
| 103 |
app.state.policy_engine = PolicyEngine()
|
| 104 |
logger.info("✅ PolicyEngine initialized successfully.")
|
|
@@ -120,12 +211,14 @@ async def lifespan(app: FastAPI):
|
|
| 120 |
from sentence_transformers import SentenceTransformer
|
| 121 |
logger.info(f"Loading epistemic model: {epistemic_model_name}")
|
| 122 |
app.state.epistemic_model = SentenceTransformer(
|
| 123 |
-
epistemic_model_name
|
|
|
|
| 124 |
app.state.epistemic_tokenizer = app.state.epistemic_model.tokenizer
|
| 125 |
logger.info("✅ Epistemic model loaded.")
|
| 126 |
except ImportError:
|
| 127 |
logger.warning(
|
| 128 |
-
"sentence-transformers not installed; epistemic signals will be zeros."
|
|
|
|
| 129 |
app.state.epistemic_model = None
|
| 130 |
app.state.epistemic_tokenizer = None
|
| 131 |
except Exception as e:
|
|
@@ -134,45 +227,94 @@ async def lifespan(app: FastAPI):
|
|
| 134 |
app.state.epistemic_tokenizer = None
|
| 135 |
else:
|
| 136 |
logger.info(
|
| 137 |
-
"EPISTEMIC_MODEL not set; epistemic signals will be zeros."
|
|
|
|
| 138 |
app.state.epistemic_model = None
|
| 139 |
app.state.epistemic_tokenizer = None
|
| 140 |
else:
|
| 141 |
logger.warning(
|
| 142 |
-
"agentic_reliability_framework not installed; risk engine, policy engine, RAG disabled."
|
|
|
|
| 143 |
|
| 144 |
-
#
|
| 145 |
-
|
|
|
|
|
|
|
|
|
|
| 146 |
logger.info("Initialising usage tracker...")
|
| 147 |
-
# HARDCODED WRITABLE PATH – fixes 503 error
|
| 148 |
-
init_tracker(
|
| 149 |
-
db_path="/tmp/arf_usage.db", # was os.getenv("ARF_USAGE_DB_PATH", "arf_usage.db")
|
| 150 |
-
redis_url=os.getenv("ARF_REDIS_URL")
|
| 151 |
-
)
|
| 152 |
-
# Seed initial API keys from environment variable (for testing / demo)
|
| 153 |
-
api_keys_json = os.getenv("ARF_API_KEYS", "{}")
|
| 154 |
try:
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 167 |
else:
|
| 168 |
-
logger.info("Usage tracking disabled
|
| 169 |
app.state.usage_tracker = None
|
| 170 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 171 |
yield
|
| 172 |
logger.info("🛑 Shutting down ARF API")
|
| 173 |
|
| 174 |
|
| 175 |
def create_app() -> FastAPI:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 176 |
app = FastAPI(
|
| 177 |
title=settings.app_name,
|
| 178 |
version="0.5.0",
|
|
@@ -182,6 +324,7 @@ def create_app() -> FastAPI:
|
|
| 182 |
description="Agentic Reliability Framework (ARF) API",
|
| 183 |
)
|
| 184 |
|
|
|
|
| 185 |
allowed_origins = ["https://arf-frontend-sandy.vercel.app"]
|
| 186 |
app.add_middleware(
|
| 187 |
CORSMiddleware,
|
|
@@ -192,67 +335,64 @@ def create_app() -> FastAPI:
|
|
| 192 |
)
|
| 193 |
logger.debug("CORS middleware configured")
|
| 194 |
|
|
|
|
| 195 |
if SLOWAPI_AVAILABLE:
|
| 196 |
app.state.limiter = limiter
|
| 197 |
app.add_exception_handler(
|
| 198 |
-
RateLimitExceeded,
|
| 199 |
-
|
| 200 |
app.add_middleware(SlowAPIMiddleware)
|
| 201 |
logger.debug("Rate limiter middleware configured")
|
| 202 |
else:
|
| 203 |
logger.debug("Rate limiter disabled (slowapi not installed)")
|
| 204 |
|
|
|
|
| 205 |
if PROMETHEUS_AVAILABLE:
|
| 206 |
Instrumentator().instrument(app).expose(app)
|
| 207 |
logger.debug("Prometheus instrumentator configured")
|
| 208 |
else:
|
| 209 |
-
logger.debug(
|
| 210 |
-
"Prometheus instrumentator disabled (module not installed)")
|
| 211 |
|
| 212 |
-
#
|
| 213 |
app.include_router(
|
| 214 |
-
routes_incidents.router,
|
| 215 |
-
|
| 216 |
-
tags=["incidents"])
|
| 217 |
app.include_router(routes_risk.router, prefix="/api/v1", tags=["risk"])
|
| 218 |
app.include_router(
|
| 219 |
-
routes_intents.router,
|
| 220 |
-
|
| 221 |
-
tags=["intents"])
|
| 222 |
app.include_router(
|
| 223 |
-
routes_history.router,
|
| 224 |
-
|
| 225 |
-
tags=["history"])
|
| 226 |
app.include_router(
|
| 227 |
-
routes_governance.router,
|
| 228 |
-
|
| 229 |
-
tags=["governance"])
|
| 230 |
app.include_router(
|
| 231 |
-
routes_memory.router,
|
| 232 |
-
|
| 233 |
-
tags=["memory"])
|
| 234 |
app.include_router(
|
| 235 |
-
routes_admin.router,
|
| 236 |
-
|
| 237 |
-
tags=["admin"])
|
| 238 |
app.include_router(
|
| 239 |
-
routes_payments.router,
|
| 240 |
-
|
| 241 |
-
tags=["payments"])
|
| 242 |
app.include_router(
|
| 243 |
-
webhooks.router,
|
| 244 |
-
|
| 245 |
app.include_router(
|
| 246 |
-
routes_users.router,
|
| 247 |
-
|
| 248 |
-
|
|
|
|
|
|
|
| 249 |
logger.debug("All API routers included")
|
| 250 |
|
| 251 |
@app.get("/health", tags=["health"])
|
| 252 |
async def health() -> Dict[str, str]:
|
|
|
|
| 253 |
return {"status": "ok"}
|
| 254 |
|
| 255 |
return app
|
| 256 |
|
| 257 |
|
| 258 |
-
app = create_app()
|
|
|
|
| 1 |
"""
|
| 2 |
+
ARF API Control Plane — Main Application Entry Point
|
| 3 |
+
====================================================
|
| 4 |
+
|
| 5 |
+
The control plane serves as the HTTP layer between the **Agentic Reliability
|
| 6 |
+
Framework (ARF)** core engine and external consumers (front‑end dashboard,
|
| 7 |
+
enterprise clients, and monitoring infrastructure).
|
| 8 |
+
|
| 9 |
+
It is responsible for:
|
| 10 |
+
|
| 11 |
+
* **Lifetime management** of the Bayesian risk engine, policy engine,
|
| 12 |
+
semantic memory (RAG graph), and epistemic models.
|
| 13 |
+
* **Observability** via optional OpenTelemetry tracing and Prometheus metrics
|
| 14 |
+
(the latter exposed automatically by ``prometheus-fastapi-instrumentator``
|
| 15 |
+
on ``/metrics``).
|
| 16 |
+
* **Rate limiting** and **usage tracking** with atomic quota consumption.
|
| 17 |
+
* **CORS** configuration for the public ARF front‑end.
|
| 18 |
+
* **Database‑backed persistence** of the conjugate Bayesian posteriors so
|
| 19 |
+
that online learning survives restarts.
|
| 20 |
+
* **Automated Rust enforcer canary promotion** via Wilson confidence interval
|
| 21 |
+
monitoring of the agreement counters.
|
| 22 |
+
|
| 23 |
+
All heavy components are loaded **lazily and best‑effort** – if a dependency
|
| 24 |
+
is missing the API continues to serve health‑check and status endpoints,
|
| 25 |
+
degrading gracefully rather than crashing.
|
| 26 |
"""
|
| 27 |
import logging
|
| 28 |
import os
|
| 29 |
import sys
|
| 30 |
import json
|
| 31 |
+
import threading
|
| 32 |
+
import time as _time
|
| 33 |
from contextlib import asynccontextmanager
|
| 34 |
from typing import Dict
|
| 35 |
|
| 36 |
from fastapi import FastAPI
|
| 37 |
from fastapi.middleware.cors import CORSMiddleware
|
| 38 |
|
| 39 |
+
# ── Optional: Prometheus metrics ─────────────────────────────
|
| 40 |
try:
|
| 41 |
from prometheus_fastapi_instrumentator import Instrumentator
|
| 42 |
PROMETHEUS_AVAILABLE = True
|
|
|
|
| 44 |
PROMETHEUS_AVAILABLE = False
|
| 45 |
Instrumentator = None
|
| 46 |
|
| 47 |
+
# ── Optional: rate‑limiting (slowapi) ────────────────────────
|
| 48 |
try:
|
| 49 |
from slowapi import _rate_limit_exceeded_handler
|
| 50 |
from slowapi.errors import RateLimitExceeded
|
|
|
|
| 56 |
RateLimitExceeded = None
|
| 57 |
SlowAPIMiddleware = None
|
| 58 |
|
| 59 |
+
# ── Core ARF engine (optional but essential for governance) ──
|
| 60 |
try:
|
| 61 |
from agentic_reliability_framework.core.governance.risk_engine import RiskEngine
|
| 62 |
from agentic_reliability_framework.core.governance.policy_engine import PolicyEngine
|
|
|
|
| 71 |
RAGGraphMemory = None
|
| 72 |
MemoryConstants = None
|
| 73 |
|
| 74 |
+
# ── Usage tracker ────────────────────────────────────────────
|
| 75 |
from app.core.usage_tracker import init_tracker, tracker, Tier
|
| 76 |
|
| 77 |
from app.api import (
|
|
|
|
| 85 |
routes_payments,
|
| 86 |
webhooks,
|
| 87 |
routes_users,
|
| 88 |
+
routes_pricing,
|
| 89 |
)
|
| 90 |
from app.api.deps import limiter
|
| 91 |
from app.core.config import settings
|
|
|
|
| 100 |
|
| 101 |
@asynccontextmanager
|
| 102 |
async def lifespan(app: FastAPI):
|
| 103 |
+
"""
|
| 104 |
+
Application lifespan manager.
|
| 105 |
+
|
| 106 |
+
All initialisation that requires a running event loop (database
|
| 107 |
+
connections, model loading, etc.) happens **before** the ``yield``.
|
| 108 |
+
Cleanup (if any) happens after the ``yield``.
|
| 109 |
+
|
| 110 |
+
Initialisation order:
|
| 111 |
+
1. Risk engine (Bayesian scoring + HMC).
|
| 112 |
+
2. Load persisted conjugate posterior state (``beta_state`` table).
|
| 113 |
+
3. OpenTelemetry tracing (console exporter by default).
|
| 114 |
+
4. Policy engine, RAG memory, and epistemic model.
|
| 115 |
+
5. Usage tracker (SQLite / Redis).
|
| 116 |
+
6. Wilson confidence monitor for Rust enforcer canary promotion.
|
| 117 |
+
"""
|
| 118 |
logger.info("🚀 Starting ARF API Control Plane")
|
| 119 |
logger.debug(f"Python path: {sys.path}")
|
| 120 |
|
| 121 |
+
# ── 1. Risk engine ────────────────────────────────────────
|
| 122 |
if ARF_AVAILABLE:
|
| 123 |
hmc_model_path = os.getenv("ARF_HMC_MODEL", "models/hmc_model.json")
|
| 124 |
use_hyperpriors = os.getenv(
|
| 125 |
+
"ARF_USE_HYPERPRIORS", "false"
|
| 126 |
+
).lower() == "true"
|
| 127 |
logger.info(
|
| 128 |
"Initializing RiskEngine – HMC model: %s, hyperpriors: %s",
|
| 129 |
hmc_model_path,
|
| 130 |
+
use_hyperpriors,
|
| 131 |
+
)
|
| 132 |
try:
|
| 133 |
app.state.risk_engine = RiskEngine(
|
| 134 |
hmc_model_path=hmc_model_path,
|
|
|
|
| 141 |
logger.exception("💥 Fatal error initializing RiskEngine")
|
| 142 |
raise RuntimeError("RiskEngine initialization failed") from e
|
| 143 |
|
| 144 |
+
# ── 2. Persisted Bayesian state ────────────────────���──
|
| 145 |
+
try:
|
| 146 |
+
from app.database.session import SessionLocal
|
| 147 |
+
from app.database.models_intents import BetaStateDB
|
| 148 |
+
from agentic_reliability_framework.core.governance.risk_engine import ActionCategory
|
| 149 |
+
|
| 150 |
+
db = SessionLocal()
|
| 151 |
+
try:
|
| 152 |
+
rows = db.query(BetaStateDB).all()
|
| 153 |
+
if rows:
|
| 154 |
+
state = {
|
| 155 |
+
ActionCategory(row.category): (row.alpha, row.beta)
|
| 156 |
+
for row in rows
|
| 157 |
+
}
|
| 158 |
+
app.state.risk_engine.beta_store.load_state(state)
|
| 159 |
+
logger.info(
|
| 160 |
+
"Loaded Bayesian posterior state from database (%d categories).",
|
| 161 |
+
len(state),
|
| 162 |
+
)
|
| 163 |
+
else:
|
| 164 |
+
logger.info(
|
| 165 |
+
"No persisted Bayesian state found; using default priors."
|
| 166 |
+
)
|
| 167 |
+
finally:
|
| 168 |
+
db.close()
|
| 169 |
+
except Exception as e:
|
| 170 |
+
logger.warning(
|
| 171 |
+
"Could not load Bayesian state from database: %s", e
|
| 172 |
+
)
|
| 173 |
+
|
| 174 |
+
# ── 3. Tracing (OpenTelemetry) ─────────────────────────
|
| 175 |
+
try:
|
| 176 |
+
from opentelemetry import trace
|
| 177 |
+
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
|
| 178 |
+
from opentelemetry.sdk.trace import TracerProvider
|
| 179 |
+
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
|
| 180 |
+
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
|
| 181 |
+
|
| 182 |
+
resource = Resource.create({SERVICE_NAME: "arf-api"})
|
| 183 |
+
provider = TracerProvider(resource=resource)
|
| 184 |
+
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
|
| 185 |
+
trace.set_tracer_provider(provider)
|
| 186 |
+
|
| 187 |
+
FastAPIInstrumentor.instrument_app(app)
|
| 188 |
+
logger.info("✅ Tracing initialized (console exporter).")
|
| 189 |
+
except Exception as e:
|
| 190 |
+
logger.warning("Tracing initialization skipped: %s", e)
|
| 191 |
+
|
| 192 |
+
# ── 4. Policy engine, RAG, epistemic model ─────────────
|
| 193 |
try:
|
| 194 |
app.state.policy_engine = PolicyEngine()
|
| 195 |
logger.info("✅ PolicyEngine initialized successfully.")
|
|
|
|
| 211 |
from sentence_transformers import SentenceTransformer
|
| 212 |
logger.info(f"Loading epistemic model: {epistemic_model_name}")
|
| 213 |
app.state.epistemic_model = SentenceTransformer(
|
| 214 |
+
epistemic_model_name
|
| 215 |
+
)
|
| 216 |
app.state.epistemic_tokenizer = app.state.epistemic_model.tokenizer
|
| 217 |
logger.info("✅ Epistemic model loaded.")
|
| 218 |
except ImportError:
|
| 219 |
logger.warning(
|
| 220 |
+
"sentence-transformers not installed; epistemic signals will be zeros."
|
| 221 |
+
)
|
| 222 |
app.state.epistemic_model = None
|
| 223 |
app.state.epistemic_tokenizer = None
|
| 224 |
except Exception as e:
|
|
|
|
| 227 |
app.state.epistemic_tokenizer = None
|
| 228 |
else:
|
| 229 |
logger.info(
|
| 230 |
+
"EPISTEMIC_MODEL not set; epistemic signals will be zeros."
|
| 231 |
+
)
|
| 232 |
app.state.epistemic_model = None
|
| 233 |
app.state.epistemic_tokenizer = None
|
| 234 |
else:
|
| 235 |
logger.warning(
|
| 236 |
+
"agentic_reliability_framework not installed; risk engine, policy engine, RAG disabled."
|
| 237 |
+
)
|
| 238 |
|
| 239 |
+
# ── 5. Usage tracker ──────────────────────────────────────
|
| 240 |
+
usage_tracking_disabled = (
|
| 241 |
+
os.getenv("ARF_USAGE_TRACKING", "true").lower() == "false"
|
| 242 |
+
)
|
| 243 |
+
if not usage_tracking_disabled:
|
| 244 |
logger.info("Initialising usage tracker...")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 245 |
try:
|
| 246 |
+
init_tracker(
|
| 247 |
+
db_path=os.getenv("ARF_USAGE_DB_PATH", "arf_usage.db"),
|
| 248 |
+
redis_url=os.getenv("ARF_REDIS_URL"),
|
| 249 |
+
)
|
| 250 |
+
# Seed initial API keys from environment variable (for testing / demo)
|
| 251 |
+
api_keys_json = os.getenv("ARF_API_KEYS", "{}")
|
| 252 |
+
try:
|
| 253 |
+
api_keys = json.loads(api_keys_json)
|
| 254 |
+
for key, tier_str in api_keys.items():
|
| 255 |
+
try:
|
| 256 |
+
tier = Tier(tier_str.lower())
|
| 257 |
+
tracker.get_or_create_api_key(key, tier)
|
| 258 |
+
logger.info(f"Seeded API key for tier {tier.value}")
|
| 259 |
+
except ValueError:
|
| 260 |
+
logger.warning(
|
| 261 |
+
f"Invalid tier '{tier_str}' for key {key}, skipping"
|
| 262 |
+
)
|
| 263 |
+
except json.JSONDecodeError:
|
| 264 |
+
logger.warning(
|
| 265 |
+
"ARF_API_KEYS environment variable is not valid JSON; skipping seeding."
|
| 266 |
+
)
|
| 267 |
+
app.state.usage_tracker = tracker
|
| 268 |
+
logger.info("✅ Usage tracker ready.")
|
| 269 |
+
except Exception as e:
|
| 270 |
+
logger.critical(f"Failed to initialise usage tracker: {e}")
|
| 271 |
+
raise RuntimeError("Usage tracker initialisation failed") from e
|
| 272 |
else:
|
| 273 |
+
logger.info("Usage tracking disabled by ARF_USAGE_TRACKING=false.")
|
| 274 |
app.state.usage_tracker = None
|
| 275 |
|
| 276 |
+
# ── 6. Wilson confidence monitor ──────────────────────────
|
| 277 |
+
try:
|
| 278 |
+
from app.services.wilson_monitor import update as wilson_update
|
| 279 |
+
from prometheus_client import REGISTRY
|
| 280 |
+
|
| 281 |
+
def _wilson_updater():
|
| 282 |
+
while True:
|
| 283 |
+
try:
|
| 284 |
+
agreed = REGISTRY.get_sample_value(
|
| 285 |
+
'arf_rust_agreement_total', {'result': 'agreed'}
|
| 286 |
+
) or 0.0
|
| 287 |
+
diverged = REGISTRY.get_sample_value(
|
| 288 |
+
'arf_rust_agreement_total', {'result': 'diverged'}
|
| 289 |
+
) or 0.0
|
| 290 |
+
wilson_update(int(agreed), int(diverged))
|
| 291 |
+
except Exception as e:
|
| 292 |
+
logger.debug("Wilson updater error: %s", e)
|
| 293 |
+
_time.sleep(300) # every 5 minutes
|
| 294 |
+
|
| 295 |
+
threading.Thread(target=_wilson_updater, daemon=True).start()
|
| 296 |
+
logger.info("✅ Wilson monitor background updater started.")
|
| 297 |
+
except Exception as e:
|
| 298 |
+
logger.warning("Wilson monitor initialization skipped: %s", e)
|
| 299 |
+
|
| 300 |
yield
|
| 301 |
logger.info("🛑 Shutting down ARF API")
|
| 302 |
|
| 303 |
|
| 304 |
def create_app() -> FastAPI:
|
| 305 |
+
"""
|
| 306 |
+
Build and configure the FastAPI application.
|
| 307 |
+
|
| 308 |
+
Middleware order:
|
| 309 |
+
1. CORS (restricted to the public front‑end origin).
|
| 310 |
+
2. Rate limiting (if slowapi is installed).
|
| 311 |
+
3. Prometheus metrics exposition (if available).
|
| 312 |
+
|
| 313 |
+
All API routers are included under the ``/api/v1`` prefix except
|
| 314 |
+
memory (``/v1/memory``) and webhooks (root level).
|
| 315 |
+
|
| 316 |
+
A simple ``/health`` endpoint is provided for liveness probes.
|
| 317 |
+
"""
|
| 318 |
app = FastAPI(
|
| 319 |
title=settings.app_name,
|
| 320 |
version="0.5.0",
|
|
|
|
| 324 |
description="Agentic Reliability Framework (ARF) API",
|
| 325 |
)
|
| 326 |
|
| 327 |
+
# ── CORS ──────────────────────────────────────────────────
|
| 328 |
allowed_origins = ["https://arf-frontend-sandy.vercel.app"]
|
| 329 |
app.add_middleware(
|
| 330 |
CORSMiddleware,
|
|
|
|
| 335 |
)
|
| 336 |
logger.debug("CORS middleware configured")
|
| 337 |
|
| 338 |
+
# ── Rate limiter ──────────────────────────────────────────
|
| 339 |
if SLOWAPI_AVAILABLE:
|
| 340 |
app.state.limiter = limiter
|
| 341 |
app.add_exception_handler(
|
| 342 |
+
RateLimitExceeded, _rate_limit_exceeded_handler
|
| 343 |
+
)
|
| 344 |
app.add_middleware(SlowAPIMiddleware)
|
| 345 |
logger.debug("Rate limiter middleware configured")
|
| 346 |
else:
|
| 347 |
logger.debug("Rate limiter disabled (slowapi not installed)")
|
| 348 |
|
| 349 |
+
# ── Prometheus ────────────────────────────────────────────
|
| 350 |
if PROMETHEUS_AVAILABLE:
|
| 351 |
Instrumentator().instrument(app).expose(app)
|
| 352 |
logger.debug("Prometheus instrumentator configured")
|
| 353 |
else:
|
| 354 |
+
logger.debug("Prometheus instrumentator disabled (module not installed)")
|
|
|
|
| 355 |
|
| 356 |
+
# ── API Routers ───────────────────────────────────────────
|
| 357 |
app.include_router(
|
| 358 |
+
routes_incidents.router, prefix="/api/v1", tags=["incidents"]
|
| 359 |
+
)
|
|
|
|
| 360 |
app.include_router(routes_risk.router, prefix="/api/v1", tags=["risk"])
|
| 361 |
app.include_router(
|
| 362 |
+
routes_intents.router, prefix="/api/v1", tags=["intents"]
|
| 363 |
+
)
|
|
|
|
| 364 |
app.include_router(
|
| 365 |
+
routes_history.router, prefix="/api/v1", tags=["history"]
|
| 366 |
+
)
|
|
|
|
| 367 |
app.include_router(
|
| 368 |
+
routes_governance.router, prefix="/api/v1", tags=["governance"]
|
| 369 |
+
)
|
|
|
|
| 370 |
app.include_router(
|
| 371 |
+
routes_memory.router, prefix="/v1/memory", tags=["memory"]
|
| 372 |
+
)
|
|
|
|
| 373 |
app.include_router(
|
| 374 |
+
routes_admin.router, prefix="/api/v1", tags=["admin"]
|
| 375 |
+
)
|
|
|
|
| 376 |
app.include_router(
|
| 377 |
+
routes_payments.router, prefix="/api/v1", tags=["payments"]
|
| 378 |
+
)
|
|
|
|
| 379 |
app.include_router(
|
| 380 |
+
webhooks.router, tags=["webhooks"]
|
| 381 |
+
)
|
| 382 |
app.include_router(
|
| 383 |
+
routes_users.router, prefix="/api/v1", tags=["users"]
|
| 384 |
+
)
|
| 385 |
+
app.include_router(
|
| 386 |
+
routes_pricing.router, prefix="/api/v1", tags=["pricing"]
|
| 387 |
+
)
|
| 388 |
logger.debug("All API routers included")
|
| 389 |
|
| 390 |
@app.get("/health", tags=["health"])
|
| 391 |
async def health() -> Dict[str, str]:
|
| 392 |
+
"""Liveness probe – returns 200 when the application is running."""
|
| 393 |
return {"status": "ok"}
|
| 394 |
|
| 395 |
return app
|
| 396 |
|
| 397 |
|
| 398 |
+
app = create_app()
|
app/models/__init__.py
CHANGED
|
@@ -26,4 +26,4 @@ __all__ = [
|
|
| 26 |
"PermissionLevel",
|
| 27 |
"Environment",
|
| 28 |
"ChangeScope",
|
| 29 |
-
]
|
|
|
|
| 26 |
"PermissionLevel",
|
| 27 |
"Environment",
|
| 28 |
"ChangeScope",
|
| 29 |
+
]
|
app/models/incident_models.py
CHANGED
|
@@ -4,10 +4,11 @@ from pydantic import BaseModel, Field
|
|
| 4 |
|
| 5 |
class IncidentReport(BaseModel):
|
| 6 |
service: str = Field(..., description="Service name")
|
| 7 |
-
signal_type: Literal["latency", "error_rate", "cpu",
|
|
|
|
| 8 |
value: float = Field(..., description="Measured value")
|
| 9 |
|
| 10 |
|
| 11 |
class IncidentResponse(BaseModel):
|
| 12 |
service: str
|
| 13 |
-
reliability: float
|
|
|
|
| 4 |
|
| 5 |
class IncidentReport(BaseModel):
|
| 6 |
service: str = Field(..., description="Service name")
|
| 7 |
+
signal_type: Literal["latency", "error_rate", "cpu",
|
| 8 |
+
"memory"] = Field(..., description="Type of signal")
|
| 9 |
value: float = Field(..., description="Measured value")
|
| 10 |
|
| 11 |
|
| 12 |
class IncidentResponse(BaseModel):
|
| 13 |
service: str
|
| 14 |
+
reliability: float
|
app/models/infrastructure_intents.py
CHANGED
|
@@ -1,45 +1,12 @@
|
|
| 1 |
from pydantic import BaseModel, Field, field_validator
|
| 2 |
from typing import Optional, Literal, List, Any, Dict
|
| 3 |
-
from enum import Enum
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
STORAGE_ACCOUNT = "storage_account"
|
| 12 |
-
VM = "vm"
|
| 13 |
-
VIRTUAL_NETWORK = "virtual_network"
|
| 14 |
-
# enterprise-only types omitted for public sandbox
|
| 15 |
-
|
| 16 |
-
class PermissionLevel(str, Enum):
|
| 17 |
-
READ = "read"
|
| 18 |
-
WRITE = "write"
|
| 19 |
-
ADMIN = "admin"
|
| 20 |
-
|
| 21 |
-
class Environment(str, Enum):
|
| 22 |
-
DEV = "dev"
|
| 23 |
-
STAGING = "staging"
|
| 24 |
-
PROD = "prod"
|
| 25 |
-
|
| 26 |
-
class ChangeScope(str, Enum):
|
| 27 |
-
MINOR = "minor"
|
| 28 |
-
MAJOR = "major"
|
| 29 |
-
CRITICAL = "critical"
|
| 30 |
-
# ---------------------------------------------------------------------------
|
| 31 |
-
|
| 32 |
-
# Optional import from protected core engine – not available in public Spaces
|
| 33 |
-
try:
|
| 34 |
-
from agentic_reliability_framework.core.governance.intents import (
|
| 35 |
-
ResourceType,
|
| 36 |
-
PermissionLevel,
|
| 37 |
-
Environment,
|
| 38 |
-
ChangeScope,
|
| 39 |
-
)
|
| 40 |
-
except ImportError:
|
| 41 |
-
# The fallback enums defined above are used.
|
| 42 |
-
pass
|
| 43 |
|
| 44 |
|
| 45 |
class BaseIntentRequest(BaseModel):
|
|
@@ -91,4 +58,4 @@ class DeployConfigurationRequest(BaseIntentRequest):
|
|
| 91 |
return v
|
| 92 |
|
| 93 |
|
| 94 |
-
InfrastructureIntentRequest = ProvisionResourceRequest | GrantAccessRequest | DeployConfigurationRequest
|
|
|
|
| 1 |
from pydantic import BaseModel, Field, field_validator
|
| 2 |
from typing import Optional, Literal, List, Any, Dict
|
|
|
|
| 3 |
|
| 4 |
+
from agentic_reliability_framework.core.governance.intents import (
|
| 5 |
+
ResourceType,
|
| 6 |
+
PermissionLevel,
|
| 7 |
+
Environment,
|
| 8 |
+
ChangeScope,
|
| 9 |
+
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
|
| 11 |
|
| 12 |
class BaseIntentRequest(BaseModel):
|
|
|
|
| 58 |
return v
|
| 59 |
|
| 60 |
|
| 61 |
+
InfrastructureIntentRequest = ProvisionResourceRequest | GrantAccessRequest | DeployConfigurationRequest
|
app/models/intent_models.py
CHANGED
|
@@ -11,4 +11,4 @@ class IntentSimulation(BaseModel):
|
|
| 11 |
|
| 12 |
class IntentSimulationResponse(BaseModel):
|
| 13 |
risk_score: float
|
| 14 |
-
recommendation: Literal["safe_to_execute", "requires_approval", "blocked"]
|
|
|
|
| 11 |
|
| 12 |
class IntentSimulationResponse(BaseModel):
|
| 13 |
risk_score: float
|
| 14 |
+
recommendation: Literal["safe_to_execute", "requires_approval", "blocked"]
|
app/models/risk_models.py
CHANGED
|
@@ -4,4 +4,4 @@ from pydantic import BaseModel
|
|
| 4 |
|
| 5 |
class RiskResponse(BaseModel):
|
| 6 |
system_risk: float
|
| 7 |
-
status: Literal["low", "moderate", "high", "critical"]
|
|
|
|
| 4 |
|
| 5 |
class RiskResponse(BaseModel):
|
| 6 |
system_risk: float
|
| 7 |
+
status: Literal["low", "moderate", "high", "critical"]
|
app/services/incident_service.py
CHANGED
|
@@ -3,5 +3,6 @@ from app.models.incident_models import IncidentReport
|
|
| 3 |
|
| 4 |
|
| 5 |
def process_incident(report: IncidentReport) -> float:
|
| 6 |
-
reliability = signal_to_reliability(
|
|
|
|
| 7 |
return reliability
|
|
|
|
| 3 |
|
| 4 |
|
| 5 |
def process_incident(report: IncidentReport) -> float:
|
| 6 |
+
reliability = signal_to_reliability(
|
| 7 |
+
report.value, signal_type=report.signal_type)
|
| 8 |
return reliability
|
app/services/intent_adapter.py
CHANGED
|
@@ -1,66 +1,163 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
def to_oss_intent(api_request):
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
provenance=api_request.provenance,
|
| 45 |
-
)
|
| 46 |
-
elif api_request.intent_type == "grant_access":
|
| 47 |
-
return GrantAccessIntent(
|
| 48 |
-
principal=api_request.principal,
|
| 49 |
-
permission_level=api_request.permission_level.value if hasattr(api_request.permission_level, 'value') else str(api_request.permission_level),
|
| 50 |
-
resource_scope=api_request.resource_scope,
|
| 51 |
-
justification=api_request.justification,
|
| 52 |
-
requester=api_request.requester,
|
| 53 |
-
provenance=api_request.provenance,
|
| 54 |
-
)
|
| 55 |
-
elif api_request.intent_type == "deploy_config":
|
| 56 |
-
return DeployConfigurationIntent(
|
| 57 |
-
service_name=api_request.service_name,
|
| 58 |
-
change_scope=api_request.change_scope.value if hasattr(api_request.change_scope, 'value') else str(api_request.change_scope),
|
| 59 |
-
deployment_target=api_request.deployment_target.value if hasattr(api_request.deployment_target, 'value') else str(api_request.deployment_target),
|
| 60 |
-
risk_level_hint=api_request.risk_level_hint,
|
| 61 |
-
configuration=api_request.configuration,
|
| 62 |
-
requester=api_request.requester,
|
| 63 |
-
provenance=api_request.provenance,
|
| 64 |
-
)
|
| 65 |
else:
|
| 66 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Intent Adapter – converts API request payloads to ARF InfrastructureIntent objects.
|
| 3 |
+
Strict validation, no dummy fallbacks. All conversions are deterministic.
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import logging
|
| 7 |
+
from typing import Any, Dict
|
| 8 |
+
|
| 9 |
+
from agentic_reliability_framework.core.governance.intents import (
|
| 10 |
+
ProvisionResourceIntent,
|
| 11 |
+
GrantAccessIntent,
|
| 12 |
+
DeployConfigurationIntent,
|
| 13 |
+
InfrastructureIntent,
|
| 14 |
+
)
|
| 15 |
+
|
| 16 |
+
logger = logging.getLogger(__name__)
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
class IntentAdapterError(Exception):
|
| 20 |
+
"""Raised when intent conversion fails due to invalid input."""
|
| 21 |
+
pass
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
# Allowed values (from the framework's Literal definitions)
|
| 25 |
+
VALID_ENVIRONMENTS = {"dev", "staging", "prod", "test"}
|
| 26 |
+
VALID_RESOURCE_TYPES = {
|
| 27 |
+
"vm",
|
| 28 |
+
"storage_account",
|
| 29 |
+
"database",
|
| 30 |
+
"kubernetes_cluster",
|
| 31 |
+
"function_app",
|
| 32 |
+
"virtual_network"}
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def to_oss_intent(api_request: Any) -> InfrastructureIntent:
|
| 36 |
+
"""
|
| 37 |
+
Convert an API request object to the corresponding OSS InfrastructureIntent.
|
| 38 |
+
"""
|
| 39 |
+
# Extract data
|
| 40 |
+
if hasattr(api_request, "model_dump"):
|
| 41 |
+
data = api_request.model_dump()
|
| 42 |
+
elif hasattr(api_request, "dict"):
|
| 43 |
+
data = api_request.dict()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 44 |
else:
|
| 45 |
+
data = dict(api_request)
|
| 46 |
+
|
| 47 |
+
intent_type = data.get("intent_type")
|
| 48 |
+
if not intent_type:
|
| 49 |
+
raise IntentAdapterError("Missing 'intent_type' in request")
|
| 50 |
+
|
| 51 |
+
environment = data.get("environment")
|
| 52 |
+
if not environment:
|
| 53 |
+
raise IntentAdapterError("Missing 'environment' field")
|
| 54 |
+
if environment not in VALID_ENVIRONMENTS:
|
| 55 |
+
raise IntentAdapterError(
|
| 56 |
+
f"Invalid environment: {environment}. Must be one of {VALID_ENVIRONMENTS}")
|
| 57 |
+
|
| 58 |
+
requester = data.get("requester")
|
| 59 |
+
if not requester:
|
| 60 |
+
raise IntentAdapterError("Missing 'requester' field")
|
| 61 |
+
|
| 62 |
+
if intent_type == "provision_resource":
|
| 63 |
+
return _to_provision_intent(data, environment, requester)
|
| 64 |
+
elif intent_type == "grant_access":
|
| 65 |
+
return _to_grant_intent(data, requester) # environment NOT passed
|
| 66 |
+
elif intent_type == "deploy_config":
|
| 67 |
+
return _to_deploy_intent(data, requester) # environment NOT passed
|
| 68 |
+
else:
|
| 69 |
+
raise IntentAdapterError(f"Unknown intent_type: {intent_type}")
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
def _to_provision_intent(data: Dict[str,
|
| 73 |
+
Any],
|
| 74 |
+
environment: str,
|
| 75 |
+
requester: str) -> ProvisionResourceIntent:
|
| 76 |
+
resource_type_str = data.get("resource_type")
|
| 77 |
+
if not resource_type_str:
|
| 78 |
+
raise IntentAdapterError(
|
| 79 |
+
"Missing 'resource_type' for provision_resource intent")
|
| 80 |
+
if resource_type_str not in VALID_RESOURCE_TYPES:
|
| 81 |
+
raise IntentAdapterError(f"Invalid resource_type: {resource_type_str}")
|
| 82 |
+
|
| 83 |
+
region = data.get("region")
|
| 84 |
+
if not region:
|
| 85 |
+
raise IntentAdapterError(
|
| 86 |
+
"Missing 'region' for provision_resource intent")
|
| 87 |
+
|
| 88 |
+
size = data.get("size")
|
| 89 |
+
if not size:
|
| 90 |
+
raise IntentAdapterError(
|
| 91 |
+
"Missing 'size' for provision_resource intent")
|
| 92 |
+
|
| 93 |
+
return ProvisionResourceIntent(
|
| 94 |
+
resource_type=resource_type_str,
|
| 95 |
+
region=region,
|
| 96 |
+
size=size,
|
| 97 |
+
environment=environment,
|
| 98 |
+
requester=requester,
|
| 99 |
+
configuration=data.get("configuration", {}),
|
| 100 |
+
provenance=data.get("provenance", {}),
|
| 101 |
+
)
|
| 102 |
+
|
| 103 |
+
|
| 104 |
+
def _to_grant_intent(data: Dict[str, Any],
|
| 105 |
+
requester: str) -> GrantAccessIntent:
|
| 106 |
+
principal = data.get("principal")
|
| 107 |
+
if not principal:
|
| 108 |
+
raise IntentAdapterError("Missing 'principal' for grant_access intent")
|
| 109 |
+
|
| 110 |
+
permission_level = data.get("permission_level")
|
| 111 |
+
if not permission_level:
|
| 112 |
+
raise IntentAdapterError(
|
| 113 |
+
"Missing 'permission_level' for grant_access intent")
|
| 114 |
+
|
| 115 |
+
resource_scope = data.get("resource_scope")
|
| 116 |
+
if not resource_scope:
|
| 117 |
+
raise IntentAdapterError(
|
| 118 |
+
"Missing 'resource_scope' for grant_access intent")
|
| 119 |
+
|
| 120 |
+
return GrantAccessIntent(
|
| 121 |
+
principal=principal,
|
| 122 |
+
permission_level=permission_level,
|
| 123 |
+
resource_scope=resource_scope,
|
| 124 |
+
requester=requester,
|
| 125 |
+
justification=data.get("justification", ""),
|
| 126 |
+
provenance=data.get("provenance", {}),
|
| 127 |
+
)
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
def _to_deploy_intent(data: Dict[str, Any],
|
| 131 |
+
requester: str) -> DeployConfigurationIntent:
|
| 132 |
+
service_name = data.get("service_name")
|
| 133 |
+
if not service_name:
|
| 134 |
+
raise IntentAdapterError(
|
| 135 |
+
"Missing 'service_name' for deploy_config intent")
|
| 136 |
+
|
| 137 |
+
change_scope = data.get("change_scope")
|
| 138 |
+
if not change_scope:
|
| 139 |
+
raise IntentAdapterError(
|
| 140 |
+
"Missing 'change_scope' for deploy_config intent")
|
| 141 |
+
|
| 142 |
+
deployment_target = data.get("deployment_target")
|
| 143 |
+
if not deployment_target:
|
| 144 |
+
raise IntentAdapterError(
|
| 145 |
+
"Missing 'deployment_target' for deploy_config intent")
|
| 146 |
+
|
| 147 |
+
# risk_level_hint expects a float; if not a number, set to None
|
| 148 |
+
risk_hint = data.get("risk_level_hint")
|
| 149 |
+
if risk_hint is not None:
|
| 150 |
+
try:
|
| 151 |
+
risk_hint = float(risk_hint)
|
| 152 |
+
except (TypeError, ValueError):
|
| 153 |
+
risk_hint = None
|
| 154 |
+
|
| 155 |
+
return DeployConfigurationIntent(
|
| 156 |
+
service_name=service_name,
|
| 157 |
+
change_scope=change_scope,
|
| 158 |
+
deployment_target=deployment_target,
|
| 159 |
+
requester=requester,
|
| 160 |
+
risk_level_hint=risk_hint,
|
| 161 |
+
configuration=data.get("configuration", {}),
|
| 162 |
+
provenance=data.get("provenance", {}),
|
| 163 |
+
)
|
app/services/intent_service.py
CHANGED
|
@@ -7,7 +7,8 @@ logger = logging.getLogger(__name__)
|
|
| 7 |
|
| 8 |
# Note: This endpoint is deprecated. Use /v1/intents/evaluate instead.
|
| 9 |
def simulate_intent(intent: IntentSimulation) -> dict:
|
| 10 |
-
logger.warning(
|
|
|
|
| 11 |
# For backward compatibility, we still use random risk.
|
| 12 |
risk_score = random.uniform(0, 1)
|
| 13 |
if risk_score < 0.2:
|
|
|
|
| 7 |
|
| 8 |
# Note: This endpoint is deprecated. Use /v1/intents/evaluate instead.
|
| 9 |
def simulate_intent(intent: IntentSimulation) -> dict:
|
| 10 |
+
logger.warning(
|
| 11 |
+
"Deprecated endpoint /simulate_intent used. Please migrate to /v1/intents/evaluate.")
|
| 12 |
# For backward compatibility, we still use random risk.
|
| 13 |
risk_score = random.uniform(0, 1)
|
| 14 |
if risk_score < 0.2:
|
app/services/intent_store.py
CHANGED
|
@@ -13,7 +13,8 @@ def save_evaluated_intent(
|
|
| 13 |
environment: str,
|
| 14 |
risk_score: float
|
| 15 |
) -> IntentDB:
|
| 16 |
-
existing = db.query(IntentDB).filter(
|
|
|
|
| 17 |
if existing:
|
| 18 |
existing.evaluated_at = datetime.datetime.utcnow()
|
| 19 |
existing.risk_score = str(risk_score)
|
|
@@ -38,5 +39,8 @@ def save_evaluated_intent(
|
|
| 38 |
return intent
|
| 39 |
|
| 40 |
|
| 41 |
-
def get_intent_by_deterministic_id(
|
| 42 |
-
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
environment: str,
|
| 14 |
risk_score: float
|
| 15 |
) -> IntentDB:
|
| 16 |
+
existing = db.query(IntentDB).filter(
|
| 17 |
+
IntentDB.deterministic_id == deterministic_id).one_or_none()
|
| 18 |
if existing:
|
| 19 |
existing.evaluated_at = datetime.datetime.utcnow()
|
| 20 |
existing.risk_score = str(risk_score)
|
|
|
|
| 39 |
return intent
|
| 40 |
|
| 41 |
|
| 42 |
+
def get_intent_by_deterministic_id(
|
| 43 |
+
db: Session,
|
| 44 |
+
deterministic_id: str) -> Optional[IntentDB]:
|
| 45 |
+
return db.query(IntentDB).filter(
|
| 46 |
+
IntentDB.deterministic_id == deterministic_id).one_or_none()
|
app/services/outcome_service.py
CHANGED
|
@@ -1,42 +1,53 @@
|
|
|
|
|
|
|
|
| 1 |
import datetime
|
| 2 |
import logging
|
| 3 |
from typing import Optional, Dict, Any
|
| 4 |
|
| 5 |
from sqlalchemy.orm import Session
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
|
| 7 |
-
from app.database.models_intents import IntentDB, OutcomeDB
|
| 8 |
|
| 9 |
# ---------------------------------------------------------------------------
|
| 10 |
-
#
|
| 11 |
-
# ---------------------------------------------------------------------------
|
| 12 |
-
class RiskEngine:
|
| 13 |
-
def update_outcome(self, intent, success):
|
| 14 |
-
pass
|
| 15 |
-
|
| 16 |
-
class ProvisionResourceIntent:
|
| 17 |
-
def __init__(self, **kwargs):
|
| 18 |
-
for k, v in kwargs.items():
|
| 19 |
-
setattr(self, k, v)
|
| 20 |
-
|
| 21 |
-
class GrantAccessIntent:
|
| 22 |
-
def __init__(self, **kwargs):
|
| 23 |
-
for k, v in kwargs.items():
|
| 24 |
-
setattr(self, k, v)
|
| 25 |
-
|
| 26 |
-
class DeployConfigurationIntent:
|
| 27 |
-
def __init__(self, **kwargs):
|
| 28 |
-
for k, v in kwargs.items():
|
| 29 |
-
setattr(self, k, v)
|
| 30 |
# ---------------------------------------------------------------------------
|
| 31 |
-
|
| 32 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
|
| 34 |
|
| 35 |
class OutcomeConflictError(Exception):
|
|
|
|
| 36 |
pass
|
| 37 |
|
| 38 |
|
| 39 |
-
def reconstruct_oss_intent_from_json(
|
|
|
|
|
|
|
| 40 |
intent_type = oss_json.get("intent_type")
|
| 41 |
if intent_type == "provision_resource":
|
| 42 |
return ProvisionResourceIntent(**oss_json)
|
|
@@ -46,22 +57,7 @@ def reconstruct_oss_intent_from_json(oss_json: Dict[str, Any]):
|
|
| 46 |
return DeployConfigurationIntent(**oss_json)
|
| 47 |
else:
|
| 48 |
raise ValueError(
|
| 49 |
-
f"Cannot reconstruct intent from JSON: missing or unknown intent_type {intent_type}"
|
| 50 |
-
)
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
def _create_dummy_intent(intent_type: str):
|
| 54 |
-
if intent_type == "ProvisionResourceIntent":
|
| 55 |
-
return ProvisionResourceIntent(
|
| 56 |
-
resource_type="vm",
|
| 57 |
-
region="eastus",
|
| 58 |
-
size="Standard_D2s_v3",
|
| 59 |
-
environment="dev",
|
| 60 |
-
requester="system"
|
| 61 |
-
)
|
| 62 |
-
else:
|
| 63 |
-
logger.warning("Dummy intent creation not implemented for %s", intent_type)
|
| 64 |
-
return None
|
| 65 |
|
| 66 |
|
| 67 |
def record_outcome(
|
|
@@ -70,50 +66,114 @@ def record_outcome(
|
|
| 70 |
success: bool,
|
| 71 |
recorded_by: Optional[str],
|
| 72 |
notes: Optional[str],
|
| 73 |
-
risk_engine: RiskEngine
|
|
|
|
| 74 |
) -> OutcomeDB:
|
| 75 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 76 |
if not intent:
|
| 77 |
raise ValueError(f"Intent not found: {deterministic_id}")
|
| 78 |
|
| 79 |
-
|
|
|
|
|
|
|
| 80 |
if existing_outcome:
|
| 81 |
if existing_outcome.success == success:
|
| 82 |
return existing_outcome
|
| 83 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 84 |
|
|
|
|
| 85 |
outcome = OutcomeDB(
|
| 86 |
intent_id=intent.id,
|
| 87 |
success=bool(success),
|
| 88 |
recorded_by=recorded_by,
|
| 89 |
notes=notes,
|
| 90 |
-
recorded_at=datetime.datetime.
|
|
|
|
| 91 |
)
|
| 92 |
db.add(outcome)
|
| 93 |
-
db.commit()
|
| 94 |
-
db.refresh(outcome)
|
| 95 |
|
| 96 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 97 |
oss_intent = None
|
| 98 |
if intent.oss_payload:
|
| 99 |
try:
|
| 100 |
oss_intent = reconstruct_oss_intent_from_json(intent.oss_payload)
|
| 101 |
except Exception as e:
|
| 102 |
-
logger.
|
| 103 |
-
"Failed to reconstruct OSS intent for %s: %s.
|
| 104 |
-
deterministic_id,
|
| 105 |
-
|
| 106 |
-
|
| 107 |
else:
|
| 108 |
-
|
|
|
|
|
|
|
|
|
|
| 109 |
|
| 110 |
if oss_intent is not None:
|
| 111 |
try:
|
| 112 |
risk_engine.update_outcome(oss_intent, success)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 113 |
except Exception as e:
|
| 114 |
logger.exception(
|
| 115 |
"Failed to update RiskEngine after recording outcome for intent %s: %s",
|
| 116 |
-
deterministic_id,
|
| 117 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 118 |
|
| 119 |
-
return outcome
|
|
|
|
| 1 |
+
"""Outcome recording with idempotency, no dummy fallbacks, and timezone-aware timestamps."""
|
| 2 |
+
|
| 3 |
import datetime
|
| 4 |
import logging
|
| 5 |
from typing import Optional, Dict, Any
|
| 6 |
|
| 7 |
from sqlalchemy.orm import Session
|
| 8 |
+
from sqlalchemy.exc import IntegrityError
|
| 9 |
+
|
| 10 |
+
from agentic_reliability_framework.core.governance.risk_engine import RiskEngine
|
| 11 |
+
from agentic_reliability_framework.core.governance.intents import (
|
| 12 |
+
InfrastructureIntent,
|
| 13 |
+
ProvisionResourceIntent,
|
| 14 |
+
GrantAccessIntent,
|
| 15 |
+
DeployConfigurationIntent,
|
| 16 |
+
)
|
| 17 |
+
from app.database.models_intents import IntentDB, OutcomeDB, BetaStateDB
|
| 18 |
+
|
| 19 |
+
logger = logging.getLogger(__name__)
|
| 20 |
|
|
|
|
| 21 |
|
| 22 |
# ---------------------------------------------------------------------------
|
| 23 |
+
# NEW: small helper to persist the conjugate posterior state
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 24 |
# ---------------------------------------------------------------------------
|
| 25 |
+
def _persist_beta_state(db: Session, risk_engine: RiskEngine) -> None:
|
| 26 |
+
"""
|
| 27 |
+
Write the current Beta posterior parameters to the beta_state table.
|
| 28 |
+
This is called after every outcome update so that online learning
|
| 29 |
+
survives restarts.
|
| 30 |
+
"""
|
| 31 |
+
try:
|
| 32 |
+
state = risk_engine.beta_store.get_state()
|
| 33 |
+
for cat, (alpha, beta) in state.items():
|
| 34 |
+
# Upsert: if the category already exists, update it
|
| 35 |
+
db.merge(BetaStateDB(category=cat.value, alpha=alpha, beta=beta))
|
| 36 |
+
db.commit()
|
| 37 |
+
logger.debug("Persisted Beta posterior parameters to database.")
|
| 38 |
+
except Exception as e:
|
| 39 |
+
db.rollback()
|
| 40 |
+
logger.error("Failed to persist beta state: %s", e)
|
| 41 |
|
| 42 |
|
| 43 |
class OutcomeConflictError(Exception):
|
| 44 |
+
"""Raised when an outcome already exists for the same intent with a different result."""
|
| 45 |
pass
|
| 46 |
|
| 47 |
|
| 48 |
+
def reconstruct_oss_intent_from_json(
|
| 49 |
+
oss_json: Dict[str, Any]) -> InfrastructureIntent:
|
| 50 |
+
"""Reconstruct OSS intent from stored JSON. Raises ValueError on failure."""
|
| 51 |
intent_type = oss_json.get("intent_type")
|
| 52 |
if intent_type == "provision_resource":
|
| 53 |
return ProvisionResourceIntent(**oss_json)
|
|
|
|
| 57 |
return DeployConfigurationIntent(**oss_json)
|
| 58 |
else:
|
| 59 |
raise ValueError(
|
| 60 |
+
f"Cannot reconstruct intent from JSON: missing or unknown intent_type {intent_type}")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 61 |
|
| 62 |
|
| 63 |
def record_outcome(
|
|
|
|
| 66 |
success: bool,
|
| 67 |
recorded_by: Optional[str],
|
| 68 |
notes: Optional[str],
|
| 69 |
+
risk_engine: RiskEngine,
|
| 70 |
+
idempotency_key: Optional[str] = None,
|
| 71 |
) -> OutcomeDB:
|
| 72 |
+
"""
|
| 73 |
+
Record an outcome for a previously evaluated intent.
|
| 74 |
+
|
| 75 |
+
Idempotent: calling twice with the same (deterministic_id, success) returns the same record.
|
| 76 |
+
If the outcome already exists with a different success value, raises OutcomeConflictError.
|
| 77 |
+
|
| 78 |
+
No dummy intents are created. If the OSS intent cannot be reconstructed, the risk engine
|
| 79 |
+
is NOT updated – we log an error and still record the outcome.
|
| 80 |
+
|
| 81 |
+
Args:
|
| 82 |
+
db: SQLAlchemy session.
|
| 83 |
+
deterministic_id: Unique identifier of the original intent.
|
| 84 |
+
success: Whether the action succeeded (True) or failed (False).
|
| 85 |
+
recorded_by: Optional user or system identifier.
|
| 86 |
+
notes: Optional human-readable notes.
|
| 87 |
+
risk_engine: ARF risk engine instance (may be updated).
|
| 88 |
+
idempotency_key: Optional caller-provided idempotency token.
|
| 89 |
+
|
| 90 |
+
Returns:
|
| 91 |
+
The recorded OutcomeDB object.
|
| 92 |
+
|
| 93 |
+
Raises:
|
| 94 |
+
ValueError: If intent not found or reconstruction fails fatally.
|
| 95 |
+
OutcomeConflictError: If a conflicting outcome already exists.
|
| 96 |
+
"""
|
| 97 |
+
# 1. Fetch the original intent record
|
| 98 |
+
intent = db.query(IntentDB).filter(
|
| 99 |
+
IntentDB.deterministic_id == deterministic_id).one_or_none()
|
| 100 |
if not intent:
|
| 101 |
raise ValueError(f"Intent not found: {deterministic_id}")
|
| 102 |
|
| 103 |
+
# 2. Idempotency / conflict check with database-level uniqueness
|
| 104 |
+
existing_outcome = db.query(OutcomeDB).filter(
|
| 105 |
+
OutcomeDB.intent_id == intent.id).one_or_none()
|
| 106 |
if existing_outcome:
|
| 107 |
if existing_outcome.success == success:
|
| 108 |
return existing_outcome
|
| 109 |
+
db.rollback()
|
| 110 |
+
raise OutcomeConflictError(
|
| 111 |
+
f"Outcome already recorded for intent {deterministic_id} with different result "
|
| 112 |
+
f"(existing={existing_outcome.success}, new={success})"
|
| 113 |
+
)
|
| 114 |
|
| 115 |
+
# 3. Create outcome record
|
| 116 |
outcome = OutcomeDB(
|
| 117 |
intent_id=intent.id,
|
| 118 |
success=bool(success),
|
| 119 |
recorded_by=recorded_by,
|
| 120 |
notes=notes,
|
| 121 |
+
recorded_at=datetime.datetime.now(datetime.timezone.utc),
|
| 122 |
+
idempotency_key=idempotency_key,
|
| 123 |
)
|
| 124 |
db.add(outcome)
|
|
|
|
|
|
|
| 125 |
|
| 126 |
+
# 4. Attempt to commit; handle duplicate key errors for idempotency
|
| 127 |
+
try:
|
| 128 |
+
db.commit()
|
| 129 |
+
db.refresh(outcome)
|
| 130 |
+
except IntegrityError as e:
|
| 131 |
+
db.rollback()
|
| 132 |
+
if "idempotency_key" in str(e) and idempotency_key:
|
| 133 |
+
existing = db.query(OutcomeDB).filter(
|
| 134 |
+
OutcomeDB.idempotency_key == idempotency_key).first()
|
| 135 |
+
if existing:
|
| 136 |
+
logger.info(
|
| 137 |
+
"Idempotent request for key %s, returning existing outcome",
|
| 138 |
+
idempotency_key)
|
| 139 |
+
return existing
|
| 140 |
+
raise
|
| 141 |
+
|
| 142 |
+
# 5. Update RiskEngine ONLY if we can reconstruct a valid OSS intent
|
| 143 |
oss_intent = None
|
| 144 |
if intent.oss_payload:
|
| 145 |
try:
|
| 146 |
oss_intent = reconstruct_oss_intent_from_json(intent.oss_payload)
|
| 147 |
except Exception as e:
|
| 148 |
+
logger.error(
|
| 149 |
+
"Failed to reconstruct OSS intent for %s: %s. RiskEngine will NOT be updated.",
|
| 150 |
+
deterministic_id,
|
| 151 |
+
e,
|
| 152 |
+
exc_info=True)
|
| 153 |
else:
|
| 154 |
+
logger.warning(
|
| 155 |
+
"No oss_payload stored for intent %s – cannot update RiskEngine.",
|
| 156 |
+
deterministic_id
|
| 157 |
+
)
|
| 158 |
|
| 159 |
if oss_intent is not None:
|
| 160 |
try:
|
| 161 |
risk_engine.update_outcome(oss_intent, success)
|
| 162 |
+
|
| 163 |
+
# ----------------------------------------------------------------
|
| 164 |
+
# PERSISTENCE: after updating the conjugate posterior, write it
|
| 165 |
+
# ----------------------------------------------------------------
|
| 166 |
+
_persist_beta_state(db, risk_engine)
|
| 167 |
+
|
| 168 |
except Exception as e:
|
| 169 |
logger.exception(
|
| 170 |
"Failed to update RiskEngine after recording outcome for intent %s: %s",
|
| 171 |
+
deterministic_id,
|
| 172 |
+
e)
|
| 173 |
+
else:
|
| 174 |
+
logger.info(
|
| 175 |
+
"Skipped RiskEngine update for intent %s (no valid OSS intent)",
|
| 176 |
+
deterministic_id
|
| 177 |
+
)
|
| 178 |
|
| 179 |
+
return outcome
|
app/services/risk_service.py
CHANGED
|
@@ -1,97 +1,376 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
from typing import Optional, List, Dict, Any
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
pass
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 55 |
|
| 56 |
|
| 57 |
def evaluate_intent(
|
| 58 |
engine: RiskEngine,
|
| 59 |
-
intent,
|
| 60 |
cost_estimate: Optional[float],
|
| 61 |
policy_violations: List[str]
|
| 62 |
) -> dict:
|
| 63 |
-
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 64 |
return {
|
| 65 |
-
"risk_score":
|
| 66 |
-
"explanation":
|
| 67 |
-
"contributions":
|
| 68 |
}
|
| 69 |
|
| 70 |
|
| 71 |
def evaluate_healing_decision(
|
| 72 |
-
event,
|
| 73 |
policy_engine: PolicyEngine,
|
| 74 |
decision_engine: Optional[DecisionEngine] = None,
|
| 75 |
rag_graph: Optional[RAGGraphMemory] = None,
|
| 76 |
model=None,
|
| 77 |
tokenizer=None,
|
| 78 |
) -> Dict[str, Any]:
|
| 79 |
-
"""
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 87 |
"entropy": 0.0,
|
| 88 |
"contradiction": 0.0,
|
| 89 |
"evidence_lift": 0.0,
|
| 90 |
"hallucination_risk": 0.0,
|
| 91 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 92 |
}
|
| 93 |
|
| 94 |
|
| 95 |
def get_system_risk() -> float:
|
| 96 |
-
|
| 97 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Risk service – integrates ARF risk engine, policy engine, and decision engine.
|
| 3 |
+
Deterministic, no random fallbacks, explicit error handling.
|
| 4 |
+
|
| 5 |
+
Version: 2026-05-04 – added Prometheus metrics for observability.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import json
|
| 9 |
+
import logging
|
| 10 |
+
import os
|
| 11 |
+
import time
|
| 12 |
from typing import Optional, List, Dict, Any
|
| 13 |
+
|
| 14 |
+
from agentic_reliability_framework.core.governance.risk_engine import RiskEngine
|
| 15 |
+
from agentic_reliability_framework.core.governance.intents import InfrastructureIntent
|
| 16 |
+
from agentic_reliability_framework.core.models.event import ReliabilityEvent, HealingAction
|
| 17 |
+
from agentic_reliability_framework.core.governance.policy_engine import PolicyEngine
|
| 18 |
+
from agentic_reliability_framework.core.decision.decision_engine import DecisionEngine
|
| 19 |
+
from agentic_reliability_framework.runtime.memory.rag_graph import RAGGraphMemory
|
| 20 |
+
from agentic_reliability_framework.core.research.eclipse_probe import compute_epistemic_risk
|
| 21 |
+
|
| 22 |
+
# ── optional tracing ─────────────────────────────────────────
|
| 23 |
+
try:
|
| 24 |
+
from opentelemetry import trace
|
| 25 |
+
_tracer = trace.get_tracer(__name__)
|
| 26 |
+
OTEL_AVAILABLE = True
|
| 27 |
+
except ImportError:
|
| 28 |
+
OTEL_AVAILABLE = False
|
| 29 |
+
_tracer = None
|
| 30 |
+
|
| 31 |
+
# ── Prometheus metrics (always registered; no‑op if not scraped) ─
|
| 32 |
+
from prometheus_client import Counter, Histogram
|
| 33 |
+
|
| 34 |
+
_EVAL_COUNTER = Counter(
|
| 35 |
+
"arf_evaluations_total",
|
| 36 |
+
"Total evaluation calls (intent + healing), partitioned by engine and status.",
|
| 37 |
+
["engine", "status"],
|
| 38 |
+
)
|
| 39 |
+
|
| 40 |
+
_EVAL_DURATION = Histogram(
|
| 41 |
+
"arf_evaluation_duration_seconds",
|
| 42 |
+
"End‑to‑end latency of evaluation calls.",
|
| 43 |
+
["engine"],
|
| 44 |
+
buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
|
| 45 |
+
)
|
| 46 |
+
|
| 47 |
+
_RUST_AGREEMENT = Counter(
|
| 48 |
+
"arf_rust_agreement_total",
|
| 49 |
+
"Agreement between Rust enforcer and Python policy evaluation.",
|
| 50 |
+
["result"], # "agreed" or "diverged"
|
| 51 |
+
)
|
| 52 |
+
|
| 53 |
+
# ── optional Rust enforcer (shadow mode) ──────────────────────
|
| 54 |
+
_RUST_ENFORCER_AVAILABLE = False
|
| 55 |
+
_rust_evaluator = None # singleton per process
|
| 56 |
+
_rust_policy_json: Optional[str] = None
|
| 57 |
+
|
| 58 |
+
if os.getenv("ARF_USE_RUST_ENFORCER", "false").lower() == "true":
|
| 59 |
+
try:
|
| 60 |
+
import arf_enforcer
|
| 61 |
+
_RUST_ENFORCER_AVAILABLE = True
|
| 62 |
+
except ImportError:
|
| 63 |
pass
|
| 64 |
+
|
| 65 |
+
# Default OSS policy tree – mirrors the hard‑coded rules in the Python PolicyEvaluator
|
| 66 |
+
# that check region, resource type, and max permission level.
|
| 67 |
+
_OSS_POLICY_TREE_JSON = json.dumps({
|
| 68 |
+
"And": [
|
| 69 |
+
{"Atomic": {"RegionAllowed": {"allowed_regions": ["eastus"]}}},
|
| 70 |
+
{"Atomic": {"ResourceTypeRestricted": {
|
| 71 |
+
"forbidden_types": ["DATABASE_DROP", "FULL_ROLLOUT", "SYSTEM_SHUTDOWN", "SECRET_ROTATION"]
|
| 72 |
+
}}},
|
| 73 |
+
{"Atomic": {"MaxPermissionLevel": {"max_level": "admin"}}}
|
| 74 |
+
]
|
| 75 |
+
})
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
def _ensure_rust_evaluator() -> bool:
|
| 79 |
+
"""Lazy initialise the Rust policy evaluator. Returns True on success."""
|
| 80 |
+
global _rust_evaluator, _rust_policy_json
|
| 81 |
+
if _rust_evaluator is not None:
|
| 82 |
+
return True
|
| 83 |
+
if not _RUST_ENFORCER_AVAILABLE:
|
| 84 |
+
return False
|
| 85 |
+
try:
|
| 86 |
+
_rust_policy_json = _OSS_POLICY_TREE_JSON
|
| 87 |
+
_rust_evaluator = arf_enforcer.PyPolicyEvaluator(_rust_policy_json)
|
| 88 |
+
return True
|
| 89 |
+
except Exception:
|
| 90 |
+
_rust_evaluator = None
|
| 91 |
+
return False
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
logger = logging.getLogger(__name__)
|
| 95 |
|
| 96 |
|
| 97 |
def evaluate_intent(
|
| 98 |
engine: RiskEngine,
|
| 99 |
+
intent: InfrastructureIntent,
|
| 100 |
cost_estimate: Optional[float],
|
| 101 |
policy_violations: List[str]
|
| 102 |
) -> dict:
|
| 103 |
+
"""
|
| 104 |
+
Evaluate an infrastructure intent using the Bayesian risk engine.
|
| 105 |
+
|
| 106 |
+
Optionally shadows the policy evaluation with the Rust enforcer when
|
| 107 |
+
the environment variable ARF_USE_RUST_ENFORCER is set to "true".
|
| 108 |
+
Any divergence is logged and counted as a Prometheus metric.
|
| 109 |
+
|
| 110 |
+
Parameters
|
| 111 |
+
----------
|
| 112 |
+
engine : RiskEngine
|
| 113 |
+
Initialised ARF Bayesian risk engine.
|
| 114 |
+
intent : InfrastructureIntent
|
| 115 |
+
The infrastructure request to evaluate.
|
| 116 |
+
cost_estimate : float or None
|
| 117 |
+
Estimated monthly cost (used by cost‑threshold policies).
|
| 118 |
+
policy_violations : list[str]
|
| 119 |
+
Pre‑computed policy violation strings (from the Python evaluator).
|
| 120 |
+
|
| 121 |
+
Returns
|
| 122 |
+
-------
|
| 123 |
+
dict
|
| 124 |
+
Keys: risk_score, explanation, contributions.
|
| 125 |
+
"""
|
| 126 |
+
t0 = time.monotonic()
|
| 127 |
+
span = None
|
| 128 |
+
if OTEL_AVAILABLE and _tracer:
|
| 129 |
+
span = _tracer.start_span("risk_service.evaluate_intent")
|
| 130 |
+
span.set_attribute("intent_type", type(intent).__name__)
|
| 131 |
+
|
| 132 |
+
# ── Shadow Rust enforcer (best‑effort, non‑blocking) ──────
|
| 133 |
+
if _RUST_ENFORCER_AVAILABLE and _ensure_rust_evaluator():
|
| 134 |
+
try:
|
| 135 |
+
rust_intent = {
|
| 136 |
+
"action": getattr(intent, "intent_type", "unknown"),
|
| 137 |
+
"component": getattr(intent, "service_name", "unknown"),
|
| 138 |
+
"region": getattr(intent, "region", None),
|
| 139 |
+
"resource_type": getattr(intent, "resource_type", None),
|
| 140 |
+
"permission_level": getattr(intent, "permission_level", None),
|
| 141 |
+
"extra": {}
|
| 142 |
+
}
|
| 143 |
+
rust_raw = _rust_evaluator.evaluate(
|
| 144 |
+
json.dumps(rust_intent), cost_estimate
|
| 145 |
+
)
|
| 146 |
+
rust_violations = json.loads(rust_raw)
|
| 147 |
+
|
| 148 |
+
agreed = set(rust_violations) == set(policy_violations)
|
| 149 |
+
_RUST_AGREEMENT.labels(result="agreed" if agreed else "diverged").inc()
|
| 150 |
+
if not agreed:
|
| 151 |
+
msg = (
|
| 152 |
+
"Rust enforcer divergence: "
|
| 153 |
+
f"Rust={sorted(rust_violations)} Python={sorted(policy_violations)}"
|
| 154 |
+
)
|
| 155 |
+
logger.warning(msg)
|
| 156 |
+
if span:
|
| 157 |
+
span.add_event("rust_enforcer_divergence", {
|
| 158 |
+
"rust_violations": rust_violations,
|
| 159 |
+
"python_violations": policy_violations
|
| 160 |
+
})
|
| 161 |
+
except Exception as exc:
|
| 162 |
+
logger.debug("Rust enforcer shadow evaluation failed: %s", exc)
|
| 163 |
+
|
| 164 |
+
# ── Core risk evaluation ──────────────────────────────────
|
| 165 |
+
|
| 166 |
+
# ── Automated canary promotion ──────────────────────────
|
| 167 |
+
if _RUST_ENFORCER_AVAILABLE and os.getenv("ARF_RUST_CANARY", "false").lower() == "true":
|
| 168 |
+
try:
|
| 169 |
+
from prometheus_client import REGISTRY
|
| 170 |
+
lower = REGISTRY.get_sample_value("arf_rust_agreement_lower_bound", {})
|
| 171 |
+
if lower is not None and lower > 0.9999:
|
| 172 |
+
policy_violations = rust_violations
|
| 173 |
+
if span:
|
| 174 |
+
span.set_attribute("rust_enforcer_active", True)
|
| 175 |
+
except Exception:
|
| 176 |
+
pass
|
| 177 |
+
try:
|
| 178 |
+
score, explanation, contributions = engine.calculate_risk(
|
| 179 |
+
intent=intent,
|
| 180 |
+
cost_estimate=cost_estimate,
|
| 181 |
+
policy_violations=policy_violations
|
| 182 |
+
)
|
| 183 |
+
engine_label = "python"
|
| 184 |
+
status = "success"
|
| 185 |
+
except Exception:
|
| 186 |
+
_EVAL_COUNTER.labels(engine="python", status="error").inc()
|
| 187 |
+
_EVAL_DURATION.labels(engine="python").observe(time.monotonic() - t0)
|
| 188 |
+
raise
|
| 189 |
+
|
| 190 |
+
_EVAL_COUNTER.labels(engine=engine_label, status=status).inc()
|
| 191 |
+
_EVAL_DURATION.labels(engine=engine_label).observe(time.monotonic() - t0)
|
| 192 |
+
|
| 193 |
+
if span:
|
| 194 |
+
span.set_attribute("risk_score", score)
|
| 195 |
+
if _RUST_ENFORCER_AVAILABLE:
|
| 196 |
+
span.set_attribute("rust_enforcer_available", True)
|
| 197 |
+
span.end()
|
| 198 |
+
|
| 199 |
return {
|
| 200 |
+
"risk_score": score,
|
| 201 |
+
"explanation": explanation,
|
| 202 |
+
"contributions": contributions
|
| 203 |
}
|
| 204 |
|
| 205 |
|
| 206 |
def evaluate_healing_decision(
|
| 207 |
+
event: ReliabilityEvent,
|
| 208 |
policy_engine: PolicyEngine,
|
| 209 |
decision_engine: Optional[DecisionEngine] = None,
|
| 210 |
rag_graph: Optional[RAGGraphMemory] = None,
|
| 211 |
model=None,
|
| 212 |
tokenizer=None,
|
| 213 |
) -> Dict[str, Any]:
|
| 214 |
+
"""
|
| 215 |
+
Evaluate healing actions for a given reliability event using decision‑theoretic selection.
|
| 216 |
+
Includes epistemic risk signals from the eclipse probe.
|
| 217 |
+
|
| 218 |
+
Parameters
|
| 219 |
+
----------
|
| 220 |
+
event : ReliabilityEvent
|
| 221 |
+
The incident event containing latency, error rate, etc.
|
| 222 |
+
policy_engine : PolicyEngine
|
| 223 |
+
The ARF healing policy engine with configured policies.
|
| 224 |
+
decision_engine : DecisionEngine, optional
|
| 225 |
+
If omitted, a default instance is created.
|
| 226 |
+
rag_graph : RAGGraphMemory, optional
|
| 227 |
+
Semantic memory for similar incident retrieval.
|
| 228 |
+
model, tokenizer : optional
|
| 229 |
+
HuggingFace model and tokenizer for epistemic risk computation.
|
| 230 |
+
|
| 231 |
+
Returns
|
| 232 |
+
-------
|
| 233 |
+
dict
|
| 234 |
+
Keys: risk_score, selected_action, expected_utility, alternatives,
|
| 235 |
+
explanation, epistemic_signals.
|
| 236 |
+
"""
|
| 237 |
+
t0 = time.monotonic()
|
| 238 |
+
span = None
|
| 239 |
+
if OTEL_AVAILABLE and _tracer:
|
| 240 |
+
span = _tracer.start_span("risk_service.evaluate_healing")
|
| 241 |
+
span.set_attribute("component", event.component)
|
| 242 |
+
|
| 243 |
+
# If decision_engine not provided, try to get from policy_engine
|
| 244 |
+
if decision_engine is None and hasattr(policy_engine, 'decision_engine'):
|
| 245 |
+
decision_engine = policy_engine.decision_engine
|
| 246 |
+
|
| 247 |
+
# If still None, create a minimal one (global stats only)
|
| 248 |
+
if decision_engine is None:
|
| 249 |
+
logger.debug("No DecisionEngine provided; creating default instance")
|
| 250 |
+
decision_engine = DecisionEngine(rag_graph=rag_graph)
|
| 251 |
+
|
| 252 |
+
# Get raw candidate actions (by temporarily disabling decision engine)
|
| 253 |
+
orig_use = policy_engine.use_decision_engine
|
| 254 |
+
try:
|
| 255 |
+
policy_engine.use_decision_engine = False
|
| 256 |
+
raw_actions = policy_engine.evaluate_policies(event)
|
| 257 |
+
finally:
|
| 258 |
+
policy_engine.use_decision_engine = orig_use
|
| 259 |
+
|
| 260 |
+
# If no actions, return NO_ACTION
|
| 261 |
+
if not raw_actions or raw_actions == [HealingAction.NO_ACTION]:
|
| 262 |
+
if span:
|
| 263 |
+
span.set_attribute("selected_action", HealingAction.NO_ACTION.value)
|
| 264 |
+
span.end()
|
| 265 |
+
_EVAL_COUNTER.labels(engine="python", status="success").inc()
|
| 266 |
+
_EVAL_DURATION.labels(engine="python").observe(time.monotonic() - t0)
|
| 267 |
+
return {
|
| 268 |
+
"risk_score": 0.0,
|
| 269 |
+
"selected_action": HealingAction.NO_ACTION.value,
|
| 270 |
+
"expected_utility": 0.0,
|
| 271 |
+
"alternatives": [],
|
| 272 |
+
"explanation": "No candidate actions triggered.",
|
| 273 |
+
"epistemic_signals": None,
|
| 274 |
+
}
|
| 275 |
+
|
| 276 |
+
# Build reasoning text from policies that triggered the actions
|
| 277 |
+
reasoning_parts = []
|
| 278 |
+
for policy in policy_engine.policies:
|
| 279 |
+
if any(a in policy.actions for a in raw_actions):
|
| 280 |
+
conditions_str = ", ".join(
|
| 281 |
+
f"{c.metric} {c.operator} {c.threshold}" for c in policy.conditions
|
| 282 |
+
)
|
| 283 |
+
reasoning_parts.append(
|
| 284 |
+
f"Policy {policy.name} triggered by {conditions_str} → actions {[a.value for a in policy.actions]}"
|
| 285 |
+
)
|
| 286 |
+
reasoning_text = " ".join(reasoning_parts)
|
| 287 |
+
|
| 288 |
+
# Build evidence text from the event
|
| 289 |
+
evidence_text = (
|
| 290 |
+
f"Component: {event.component}, "
|
| 291 |
+
f"latency_p99: {event.latency_p99}, "
|
| 292 |
+
f"error_rate: {event.error_rate}, "
|
| 293 |
+
f"cpu_util: {event.cpu_util}, "
|
| 294 |
+
f"memory_util: {event.memory_util}"
|
| 295 |
+
)
|
| 296 |
+
|
| 297 |
+
# Compute epistemic signals (if model/tokenizer provided)
|
| 298 |
+
epistemic_signals = None
|
| 299 |
+
if model is not None and tokenizer is not None:
|
| 300 |
+
try:
|
| 301 |
+
epistemic_signals = compute_epistemic_risk(
|
| 302 |
+
reasoning_text, evidence_text, model, tokenizer
|
| 303 |
+
)
|
| 304 |
+
except Exception as e:
|
| 305 |
+
logger.error(f"Failed to compute epistemic risk: {e}")
|
| 306 |
+
epistemic_signals = {
|
| 307 |
+
"entropy": 0.0,
|
| 308 |
+
"contradiction": 0.0,
|
| 309 |
+
"evidence_lift": 0.0,
|
| 310 |
+
"hallucination_risk": 0.0,
|
| 311 |
+
}
|
| 312 |
+
else:
|
| 313 |
+
logger.debug("Epistemic model/tokenizer not provided; using zero signals")
|
| 314 |
+
epistemic_signals = {
|
| 315 |
"entropy": 0.0,
|
| 316 |
"contradiction": 0.0,
|
| 317 |
"evidence_lift": 0.0,
|
| 318 |
"hallucination_risk": 0.0,
|
| 319 |
+
}
|
| 320 |
+
|
| 321 |
+
# Run decision engine to get best action and alternatives
|
| 322 |
+
decision = decision_engine.select_optimal_action(
|
| 323 |
+
raw_actions, event, component=event.component,
|
| 324 |
+
epistemic_signals=epistemic_signals
|
| 325 |
+
)
|
| 326 |
+
|
| 327 |
+
# Extract risk of the selected action
|
| 328 |
+
risk_score = None
|
| 329 |
+
for alt in decision.alternatives:
|
| 330 |
+
if alt.action == decision.best_action:
|
| 331 |
+
risk_score = alt.risk
|
| 332 |
+
break
|
| 333 |
+
if risk_score is None:
|
| 334 |
+
# Compute risk separately
|
| 335 |
+
risk_score = decision_engine.compute_risk(
|
| 336 |
+
decision.best_action, event, event.component)
|
| 337 |
+
|
| 338 |
+
# Format alternatives (top 3 only)
|
| 339 |
+
alt_list = []
|
| 340 |
+
for alt in decision.alternatives[:3]:
|
| 341 |
+
alt_list.append({
|
| 342 |
+
"action": alt.action.value,
|
| 343 |
+
"expected_utility": alt.utility,
|
| 344 |
+
"risk": alt.risk,
|
| 345 |
+
})
|
| 346 |
+
|
| 347 |
+
# ── Metrics & span finalisation ───────────────────────────
|
| 348 |
+
_EVAL_COUNTER.labels(engine="python", status="success").inc()
|
| 349 |
+
_EVAL_DURATION.labels(engine="python").observe(time.monotonic() - t0)
|
| 350 |
+
|
| 351 |
+
if span:
|
| 352 |
+
span.set_attribute("risk_score", risk_score)
|
| 353 |
+
span.set_attribute("selected_action", decision.best_action.value)
|
| 354 |
+
span.set_attribute("expected_utility", decision.expected_utility)
|
| 355 |
+
span.end()
|
| 356 |
+
|
| 357 |
+
return {
|
| 358 |
+
"risk_score": risk_score,
|
| 359 |
+
"selected_action": decision.best_action.value,
|
| 360 |
+
"expected_utility": decision.expected_utility,
|
| 361 |
+
"alternatives": alt_list,
|
| 362 |
+
"explanation": decision.explanation,
|
| 363 |
+
"raw_decision": decision.raw_data,
|
| 364 |
+
"epistemic_signals": epistemic_signals,
|
| 365 |
}
|
| 366 |
|
| 367 |
|
| 368 |
def get_system_risk() -> float:
|
| 369 |
+
"""
|
| 370 |
+
Return an aggregated risk score across all monitored components.
|
| 371 |
+
This is a placeholder – the endpoint is deprecated.
|
| 372 |
+
Raises NotImplementedError to avoid random fallback.
|
| 373 |
+
"""
|
| 374 |
+
raise NotImplementedError(
|
| 375 |
+
"get_system_risk is deprecated. Use component‑level risk evaluation instead."
|
| 376 |
+
)
|
app/services/wilson_monitor.py
ADDED
|
@@ -0,0 +1,56 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Wilson confidence interval monitor for Rust enforcer agreement
|
| 2 |
+
from prometheus_client import Gauge
|
| 3 |
+
import math
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
LOWER_BOUND = Gauge(
|
| 7 |
+
"arf_rust_agreement_lower_bound",
|
| 8 |
+
"Lower 99.9% Wilson bound on agreement rate",
|
| 9 |
+
)
|
| 10 |
+
|
| 11 |
+
|
| 12 |
+
def wilson_lower(success, total, z=3.291):
|
| 13 |
+
"""
|
| 14 |
+
Compute the lower bound of the Wilson confidence interval
|
| 15 |
+
for a binomial proportion.
|
| 16 |
+
|
| 17 |
+
Parameters
|
| 18 |
+
----------
|
| 19 |
+
success : int
|
| 20 |
+
Number of agreed evaluations.
|
| 21 |
+
total : int
|
| 22 |
+
Total number of shadow evaluations (agreed + diverged).
|
| 23 |
+
z : float
|
| 24 |
+
Z‑score for the desired confidence level (default 3.291 for 99.9%).
|
| 25 |
+
|
| 26 |
+
Returns
|
| 27 |
+
-------
|
| 28 |
+
float
|
| 29 |
+
Lower bound of the Wilson interval, clamped to [0, 1].
|
| 30 |
+
"""
|
| 31 |
+
if total == 0:
|
| 32 |
+
return 0.0
|
| 33 |
+
p = success / total
|
| 34 |
+
n = total
|
| 35 |
+
denom = 1 + z**2 / n
|
| 36 |
+
center = (p + z**2 / (2 * n)) / denom
|
| 37 |
+
margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
|
| 38 |
+
return max(0.0, center - margin)
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
def update(agreed, diverged):
|
| 42 |
+
"""
|
| 43 |
+
Query the Prometheus agreement counters and set the lower‑bound gauge.
|
| 44 |
+
|
| 45 |
+
This function is called periodically by the background thread started
|
| 46 |
+
in the API lifespan (see `app/main.py`).
|
| 47 |
+
|
| 48 |
+
Parameters
|
| 49 |
+
----------
|
| 50 |
+
agreed : int
|
| 51 |
+
Current value of `arf_rust_agreement_total{result="agreed"}`.
|
| 52 |
+
diverged : int
|
| 53 |
+
Current value of `arf_rust_agreement_total{result="diverged"}`.
|
| 54 |
+
"""
|
| 55 |
+
lower = wilson_lower(agreed, agreed + diverged)
|
| 56 |
+
LOWER_BOUND.set(lower)
|
docker-compose.test.yml
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version: '3.8'
|
| 2 |
+
|
| 3 |
+
services:
|
| 4 |
+
postgres:
|
| 5 |
+
image: postgres:15-alpine
|
| 6 |
+
environment:
|
| 7 |
+
POSTGRES_USER: testuser
|
| 8 |
+
POSTGRES_PASSWORD: testpass
|
| 9 |
+
POSTGRES_DB: testdb
|
| 10 |
+
ports:
|
| 11 |
+
- "5432:5432"
|
| 12 |
+
tmpfs: /var/lib/postgresql/data
|
docs/authentication.md
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Authentication
|
| 2 |
+
|
| 3 |
+
This page describes how to authenticate with the ARF API.
|
| 4 |
+
|
| 5 |
+
Current status
|
| 6 |
+
|
| 7 |
+
- There is no route-level or global authentication enforced by the API code in this repository. The API routes (including governance endpoints) do not validate API keys, tokens, or other credentials.
|
| 8 |
+
|
| 9 |
+
What the code provides
|
| 10 |
+
|
| 11 |
+
- The configuration model (app/core/config.py) exposes an optional `api_key` setting. This can be provided via environment variables or a `.env` file (the BaseSettings `env_file` is configured to read `.env`).
|
| 12 |
+
|
| 13 |
+
What this means for you
|
| 14 |
+
|
| 15 |
+
- Setting `API_KEY` in a `.env` file or environment variable will populate the `settings.api_key`, but the current route implementations do not check this value.
|
| 16 |
+
- If you require authentication, add a FastAPI dependency or middleware that checks `settings.api_key` (or another auth mechanism) and then apply it to routes or include it in a dependency override.
|
| 17 |
+
|
| 18 |
+
Suggested minimal approach to enable API key checking
|
| 19 |
+
|
| 20 |
+
- Implement a dependency in `app.api.deps` (e.g., `get_api_key`) that compares a header value to `settings.api_key` and raise `HTTPException(401)` when missing/invalid.
|
| 21 |
+
- Add that dependency to routers or individual endpoints where auth is required.
|
| 22 |
+
|
| 23 |
+
Notes
|
| 24 |
+
|
| 25 |
+
- Tests and example code in this repo currently run without auth.
|
docs/development.md
ADDED
|
@@ -0,0 +1,55 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Development
|
| 2 |
+
|
| 3 |
+
This page explains how to set up the ARF API for local development.
|
| 4 |
+
|
| 5 |
+
Requirements
|
| 6 |
+
|
| 7 |
+
- Python 3.10+ (match your environment)
|
| 8 |
+
- A virtual environment
|
| 9 |
+
- The project's Python dependencies (see `requirements.txt`). Note: `agentic-reliability-framework` is installed from a Git URL in `requirements.txt`.
|
| 10 |
+
|
| 11 |
+
Quick start
|
| 12 |
+
|
| 13 |
+
1. Clone the repository:
|
| 14 |
+
|
| 15 |
+
git clone https://github.com/petter2025us/arf-api.git
|
| 16 |
+
cd arf-api
|
| 17 |
+
|
| 18 |
+
2. Create and activate a virtualenv, then install dependencies:
|
| 19 |
+
|
| 20 |
+
python -m venv .venv
|
| 21 |
+
source .venv/bin/activate # or .\.venv\Scripts\activate on Windows
|
| 22 |
+
pip install -r requirements.txt
|
| 23 |
+
|
| 24 |
+
3. Configure environment variables (optional):
|
| 25 |
+
|
| 26 |
+
- The project uses pydantic-settings with `env_file = ".env"` (see `app/core/config.py`). Create a `.env` file to set values locally.
|
| 27 |
+
|
| 28 |
+
Relevant environment variables used by the code:
|
| 29 |
+
- ARF_HMC_MODEL (default: `models/hmc_model.json`) — path to HMC model JSON used by RiskEngine.
|
| 30 |
+
- ARF_USE_HYPERPRIORS (default: `false`) — set to `true` to enable hyperprior behavior.
|
| 31 |
+
- API_KEY (optional) — will populate `settings.api_key` but note that routes currently do not enforce authentication.
|
| 32 |
+
- DATABASE_URL (optional) — configuration option in settings; tests use a local SQLite DB by default.
|
| 33 |
+
|
| 34 |
+
4. Run the app with Uvicorn for development:
|
| 35 |
+
|
| 36 |
+
uvicorn app.main:app --reload --port 8000
|
| 37 |
+
|
| 38 |
+
- The application mounts routes under the `/api/v1` prefix and exposes a health endpoint at `/health`.
|
| 39 |
+
|
| 40 |
+
Running tests
|
| 41 |
+
|
| 42 |
+
- Tests use an on-disk SQLite test database (`sqlite:///./test.db`) created by the test fixtures (`tests/conftest.py`).
|
| 43 |
+
- To run tests:
|
| 44 |
+
|
| 45 |
+
pytest
|
| 46 |
+
|
| 47 |
+
- The test fixtures override the dependency that provides DB sessions so tests run against the test database.
|
| 48 |
+
|
| 49 |
+
Notes on the RiskEngine
|
| 50 |
+
|
| 51 |
+
- The app initializes a `RiskEngine` instance at startup (in `app.main`) using environment variables noted above. The engine instance is stored in `app.state.risk_engine` and is used by the governance endpoints.
|
| 52 |
+
|
| 53 |
+
Further development
|
| 54 |
+
|
| 55 |
+
- If you add persistent intent storage or authentication, update tests and dependency overrides accordingly.
|
docs/docs_endpoints.md
ADDED
|
@@ -0,0 +1,314 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# API Endpoints
|
| 2 |
+
|
| 3 |
+
This document describes the main ARF API endpoints and the request/response contracts used by the control plane.
|
| 4 |
+
|
| 5 |
+
## POST `/api/v1/v1/incidents/evaluate`
|
| 6 |
+
|
| 7 |
+
Evaluates a reported incident and returns a heuristic healing recommendation, a counterfactual causal explanation, and a simplified utility decision.
|
| 8 |
+
|
| 9 |
+
This endpoint is **advisory only**. It does not apply remediation, mutate infrastructure, or execute any healing action.
|
| 10 |
+
|
| 11 |
+
### Purpose
|
| 12 |
+
|
| 13 |
+
The endpoint takes a current incident snapshot, estimates risk, chooses a deterministic action, and explains the expected effect of that action on latency using a heuristic counterfactual model.
|
| 14 |
+
|
| 15 |
+
The implementation is intentionally simple:
|
| 16 |
+
|
| 17 |
+
- no fitted Structural Causal Model is used
|
| 18 |
+
- no machine learning model is required
|
| 19 |
+
- no historical training step is performed
|
| 20 |
+
- no action execution is triggered
|
| 21 |
+
|
| 22 |
+
### Request schema
|
| 23 |
+
|
| 24 |
+
The request body must match the `ReliabilityEvent` model.
|
| 25 |
+
|
| 26 |
+
```json
|
| 27 |
+
{
|
| 28 |
+
"component": "string",
|
| 29 |
+
"latency_p99": "number",
|
| 30 |
+
"error_rate": "number",
|
| 31 |
+
"service_mesh": "string",
|
| 32 |
+
"cpu_util": "number | null",
|
| 33 |
+
"memory_util": "number | null"
|
| 34 |
+
}
|
| 35 |
+
```
|
| 36 |
+
|
| 37 |
+
#### Fields
|
| 38 |
+
|
| 39 |
+
`component`
|
| 40 |
+
: Name of the service or component being evaluated.
|
| 41 |
+
|
| 42 |
+
`latency_p99`
|
| 43 |
+
: The current 99th percentile latency value. The endpoint uses this value both for risk scoring and for the causal explanation.
|
| 44 |
+
|
| 45 |
+
`error_rate`
|
| 46 |
+
: The current error rate. The endpoint uses this value both for risk scoring and for the deterministic action threshold.
|
| 47 |
+
|
| 48 |
+
`service_mesh`
|
| 49 |
+
: Optional service mesh name. Defaults to `"default"`.
|
| 50 |
+
|
| 51 |
+
`cpu_util`
|
| 52 |
+
: Optional CPU utilization value. Present in the request model, but not used by the current decision logic.
|
| 53 |
+
|
| 54 |
+
`memory_util`
|
| 55 |
+
: Optional memory utilization value. Present in the request model, but not used by the current decision logic.
|
| 56 |
+
|
| 57 |
+
### Response schema
|
| 58 |
+
|
| 59 |
+
The endpoint returns a JSON object with three top-level sections.
|
| 60 |
+
|
| 61 |
+
```json
|
| 62 |
+
{
|
| 63 |
+
"healing_intent": {
|
| 64 |
+
"action": "string",
|
| 65 |
+
"component": "string",
|
| 66 |
+
"parameters": {},
|
| 67 |
+
"justification": "string",
|
| 68 |
+
"confidence": 0.85,
|
| 69 |
+
"risk_score": 0.0,
|
| 70 |
+
"status": "oss_advisory_only"
|
| 71 |
+
},
|
| 72 |
+
"causal_explanation": {
|
| 73 |
+
"factual_outcome": 0.0,
|
| 74 |
+
"counterfactual_outcome": 0.0,
|
| 75 |
+
"effect": 0.0,
|
| 76 |
+
"explanation_text": "string",
|
| 77 |
+
"is_model_based": false,
|
| 78 |
+
"warnings": ["string"]
|
| 79 |
+
},
|
| 80 |
+
"utility_decision": {
|
| 81 |
+
"best_action": "string",
|
| 82 |
+
"expected_utility": 0.5,
|
| 83 |
+
"explanation": "string"
|
| 84 |
+
}
|
| 85 |
+
}
|
| 86 |
+
```
|
| 87 |
+
|
| 88 |
+
#### `healing_intent`
|
| 89 |
+
|
| 90 |
+
`action`
|
| 91 |
+
: The selected action. In the current implementation this is either `restart_container` or `no_action`.
|
| 92 |
+
|
| 93 |
+
`component`
|
| 94 |
+
: The input component name.
|
| 95 |
+
|
| 96 |
+
`parameters`
|
| 97 |
+
: Action parameters. The current implementation returns an empty object.
|
| 98 |
+
|
| 99 |
+
`justification`
|
| 100 |
+
: Human-readable explanation built from the causal explanation.
|
| 101 |
+
|
| 102 |
+
`confidence`
|
| 103 |
+
: Fixed confidence value returned by the endpoint. The current implementation uses `0.85`.
|
| 104 |
+
|
| 105 |
+
`risk_score`
|
| 106 |
+
: Heuristic risk score computed from latency and error rate.
|
| 107 |
+
|
| 108 |
+
`status`
|
| 109 |
+
: Always `oss_advisory_only`, indicating that the response is informational and not executable.
|
| 110 |
+
|
| 111 |
+
#### `causal_explanation`
|
| 112 |
+
|
| 113 |
+
`factual_outcome`
|
| 114 |
+
: The observed outcome value from the request context. The endpoint uses `latency_p99` as the explained metric.
|
| 115 |
+
|
| 116 |
+
`counterfactual_outcome`
|
| 117 |
+
: The estimated value under the proposed alternative action.
|
| 118 |
+
|
| 119 |
+
`effect`
|
| 120 |
+
: The difference between counterfactual and factual outcomes.
|
| 121 |
+
|
| 122 |
+
`explanation_text`
|
| 123 |
+
: Natural-language explanation of the counterfactual effect.
|
| 124 |
+
|
| 125 |
+
`is_model_based`
|
| 126 |
+
: Always `false` in the current implementation.
|
| 127 |
+
|
| 128 |
+
`warnings`
|
| 129 |
+
: A list of warning strings. The current implementation includes a warning that the causal model is heuristic and not SCM-based.
|
| 130 |
+
|
| 131 |
+
#### `utility_decision`
|
| 132 |
+
|
| 133 |
+
`best_action`
|
| 134 |
+
: The selected action, repeated for convenience.
|
| 135 |
+
|
| 136 |
+
`expected_utility`
|
| 137 |
+
: Fixed utility value returned by the current implementation. The endpoint uses `0.5`.
|
| 138 |
+
|
| 139 |
+
`explanation`
|
| 140 |
+
: Brief explanation that the choice came from heuristic latency and error thresholds.
|
| 141 |
+
|
| 142 |
+
### Deterministic decision logic
|
| 143 |
+
|
| 144 |
+
The endpoint uses the following rule to choose the action:
|
| 145 |
+
|
| 146 |
+
```text
|
| 147 |
+
optimal_action = RESTART_CONTAINER
|
| 148 |
+
if latency_p99 > 500 OR error_rate > 0.15
|
| 149 |
+
else NO_ACTION
|
| 150 |
+
```
|
| 151 |
+
|
| 152 |
+
In the implementation, this is encoded as:
|
| 153 |
+
|
| 154 |
+
- `restart_container` when `latency_p99 > 500` or `error_rate > 0.15`
|
| 155 |
+
- `no_action` otherwise
|
| 156 |
+
|
| 157 |
+
No probabilistic policy or learned policy is involved.
|
| 158 |
+
|
| 159 |
+
### Heuristic risk score
|
| 160 |
+
|
| 161 |
+
The risk score is computed as:
|
| 162 |
+
|
| 163 |
+
```text
|
| 164 |
+
risk = min(1.0, (latency_p99 / 1000) * 0.7 + error_rate * 0.3)
|
| 165 |
+
```
|
| 166 |
+
|
| 167 |
+
Properties of this score:
|
| 168 |
+
|
| 169 |
+
- normalized to the interval `[0, 1]`
|
| 170 |
+
- weighted more heavily toward latency than error rate
|
| 171 |
+
- clipped at `1.0`
|
| 172 |
+
|
| 173 |
+
### Counterfactual model
|
| 174 |
+
|
| 175 |
+
The causal explainer uses a deterministic multiplicative heuristic:
|
| 176 |
+
|
| 177 |
+
```text
|
| 178 |
+
counterfactual_outcome = factual_outcome * (1 + effect_frac)
|
| 179 |
+
```
|
| 180 |
+
|
| 181 |
+
Where:
|
| 182 |
+
|
| 183 |
+
- `factual_outcome` is the observed metric value
|
| 184 |
+
- `effect_frac` is read from a fixed internal action-impact table
|
| 185 |
+
- the effect is multiplicative, not additive
|
| 186 |
+
|
| 187 |
+
For latency, the current action-impact mapping includes the following examples:
|
| 188 |
+
|
| 189 |
+
- `restart_container` → `latency_effect = -0.15`
|
| 190 |
+
- `scale_out` → `latency_effect = -0.20`
|
| 191 |
+
- `rollback` → `latency_effect = -0.25`
|
| 192 |
+
- `circuit_breaker` → `latency_effect = -0.05`
|
| 193 |
+
- `traffic_shift` → `latency_effect = -0.10`
|
| 194 |
+
- `alert_team` → `latency_effect = 0.0`
|
| 195 |
+
- `no_action` → `latency_effect = 0.0`
|
| 196 |
+
|
| 197 |
+
For error rate, the table includes a separate `error_rate_effect` per action, but the current endpoint calls the explainer with `outcome_metric="latency"`, so the returned counterfactual explanation is latency-based.
|
| 198 |
+
|
| 199 |
+
### Uncertainty interval
|
| 200 |
+
|
| 201 |
+
The explainer applies a fixed uncertainty margin of ±10% around the estimated effect.
|
| 202 |
+
|
| 203 |
+
Let:
|
| 204 |
+
|
| 205 |
+
```text
|
| 206 |
+
effect = counterfactual_outcome - factual_outcome
|
| 207 |
+
ci_half = abs(effect) * 0.1
|
| 208 |
+
confidence_interval = (counterfactual_outcome - ci_half, counterfactual_outcome + ci_half)
|
| 209 |
+
```
|
| 210 |
+
|
| 211 |
+
This interval is heuristic only. It is not a calibrated statistical confidence interval.
|
| 212 |
+
|
| 213 |
+
### How the endpoint uses the explainer
|
| 214 |
+
|
| 215 |
+
The endpoint constructs a local state object and passes it to the explainer:
|
| 216 |
+
|
| 217 |
+
- `current_state["latency"] = event.latency_p99`
|
| 218 |
+
- `current_state["error_rate"] = event.error_rate`
|
| 219 |
+
- `current_state["last_action"] = {"action_type": "no_action"}`
|
| 220 |
+
|
| 221 |
+
It then creates:
|
| 222 |
+
|
| 223 |
+
- `proposed_action = {"action_type": optimal_action.value, "params": {}}`
|
| 224 |
+
|
| 225 |
+
and calls:
|
| 226 |
+
|
| 227 |
+
```text
|
| 228 |
+
CausalExplainer().explain_healing_intent(proposed_action, current_state, "latency")
|
| 229 |
+
```
|
| 230 |
+
|
| 231 |
+
The resulting explanation is embedded into the `healing_intent` response.
|
| 232 |
+
|
| 233 |
+
### Validation and error behavior
|
| 234 |
+
|
| 235 |
+
The endpoint uses Pydantic validation through the `ReliabilityEvent` model.
|
| 236 |
+
|
| 237 |
+
Expected behavior:
|
| 238 |
+
|
| 239 |
+
- valid requests return HTTP 200
|
| 240 |
+
- invalid request bodies are rejected by FastAPI/Pydantic before the handler logic runs
|
| 241 |
+
|
| 242 |
+
The current implementation does not define a custom error schema for validation failures.
|
| 243 |
+
|
| 244 |
+
### Advisory-only behavior
|
| 245 |
+
|
| 246 |
+
The response includes:
|
| 247 |
+
|
| 248 |
+
```json
|
| 249 |
+
"status": "oss_advisory_only"
|
| 250 |
+
```
|
| 251 |
+
|
| 252 |
+
This means:
|
| 253 |
+
|
| 254 |
+
- the endpoint recommends an action
|
| 255 |
+
- it does not perform the action
|
| 256 |
+
- it does not mutate incident state
|
| 257 |
+
- it does not trigger remediation workflows by itself
|
| 258 |
+
|
| 259 |
+
### Notes on implementation scope
|
| 260 |
+
|
| 261 |
+
The current endpoint is intentionally narrow:
|
| 262 |
+
|
| 263 |
+
- it bases the action choice on only two fields: `latency_p99` and `error_rate`
|
| 264 |
+
- it ignores `cpu_util`, `memory_util`, and `service_mesh` in the decision logic
|
| 265 |
+
- it always uses the latency metric in the causal explainer call
|
| 266 |
+
- it returns a fixed `expected_utility` value of `0.5`
|
| 267 |
+
|
| 268 |
+
### Example request
|
| 269 |
+
|
| 270 |
+
```bash
|
| 271 |
+
curl -X POST "http://localhost:8000/api/v1/v1/incidents/evaluate" -H "Content-Type: application/json" -d '{
|
| 272 |
+
"component": "payment-service",
|
| 273 |
+
"latency_p99": 450,
|
| 274 |
+
"error_rate": 0.25,
|
| 275 |
+
"service_mesh": "default",
|
| 276 |
+
"cpu_util": 0.85,
|
| 277 |
+
"memory_util": 0.90
|
| 278 |
+
}'
|
| 279 |
+
```
|
| 280 |
+
|
| 281 |
+
### Example response shape
|
| 282 |
+
|
| 283 |
+
```json
|
| 284 |
+
{
|
| 285 |
+
"healing_intent": {
|
| 286 |
+
"action": "restart_container",
|
| 287 |
+
"component": "payment-service",
|
| 288 |
+
"parameters": {},
|
| 289 |
+
"justification": "Causal: If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
|
| 290 |
+
"confidence": 0.85,
|
| 291 |
+
"risk_score": 0.4575,
|
| 292 |
+
"status": "oss_advisory_only"
|
| 293 |
+
},
|
| 294 |
+
"causal_explanation": {
|
| 295 |
+
"factual_outcome": 450,
|
| 296 |
+
"counterfactual_outcome": 382.5,
|
| 297 |
+
"effect": -67.5,
|
| 298 |
+
"explanation_text": "If we apply restart_container instead of no_action, latency would change from 450.00 to 382.50 (Δ = -67.50). Based on heuristic causal model.",
|
| 299 |
+
"is_model_based": false,
|
| 300 |
+
"warnings": [
|
| 301 |
+
"Using heuristic causal model (no fitted SCM)."
|
| 302 |
+
]
|
| 303 |
+
},
|
| 304 |
+
"utility_decision": {
|
| 305 |
+
"best_action": "restart_container",
|
| 306 |
+
"expected_utility": 0.5,
|
| 307 |
+
"explanation": "Heuristic decision based on latency/error thresholds"
|
| 308 |
+
}
|
| 309 |
+
}
|
| 310 |
+
```
|
| 311 |
+
|
| 312 |
+
### Cross-reference
|
| 313 |
+
|
| 314 |
+
See `docs/examples.md` for a worked numerical example and `README.md` for a shorter overview.
|
docs/endpoints.md
ADDED
|
@@ -0,0 +1,34 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# API Endpoints
|
| 2 |
+
|
| 3 |
+
This page lists all available API endpoints.
|
| 4 |
+
|
| 5 |
+
General
|
| 6 |
+
|
| 7 |
+
- All API routers are mounted under the `/api/v1` prefix (see `app.main`).
|
| 8 |
+
- Health endpoint is available at `/health`.
|
| 9 |
+
|
| 10 |
+
Health
|
| 11 |
+
|
| 12 |
+
- GET /health
|
| 13 |
+
- Returns: `{ "status": "ok" }`
|
| 14 |
+
- Purpose: basic liveness/health check.
|
| 15 |
+
|
| 16 |
+
Governance (risk/intent evaluation)
|
| 17 |
+
|
| 18 |
+
- POST /api/v1/intents/evaluate
|
| 19 |
+
- Description: Evaluate an infrastructure intent and return a risk score and explanation.
|
| 20 |
+
- Body: an InfrastructureIntentRequest JSON object (see the model in `app.models.infrastructure_intents`).
|
| 21 |
+
- Behaviour: The endpoint converts the incoming intent to an OSS intent and calls into the locally initialized RiskEngine (`app.state.risk_engine`).
|
| 22 |
+
- Errors: May return 500 if evaluation fails.
|
| 23 |
+
|
| 24 |
+
- POST /api/v1/intents/outcome
|
| 25 |
+
- Description: Record the observed outcome of an executed intent to update priors.
|
| 26 |
+
- Behaviour: Not implemented in this repository; the handler raises HTTP `501 Not Implemented` to indicate that outcome recording is not yet available.
|
| 27 |
+
|
| 28 |
+
Other routers
|
| 29 |
+
|
| 30 |
+
- The application also registers routers for incidents, risk, intents, and history at `/api/v1` (see `app.main`). Consult the respective modules in `app.api` for their exact endpoints and payloads.
|
| 31 |
+
|
| 32 |
+
Notes
|
| 33 |
+
|
| 34 |
+
- The governance evaluation relies on a `RiskEngine` instance initialized at app startup (see `app.main`) which reads `ARF_HMC_MODEL` and `ARF_USE_HYPERPRIORS` environment variables.
|
docs/examples.md
ADDED
|
@@ -0,0 +1,54 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Examples
|
| 2 |
+
|
| 3 |
+
This page provides usage examples for the ARF API.
|
| 4 |
+
|
| 5 |
+
Check health
|
| 6 |
+
|
| 7 |
+
curl example:
|
| 8 |
+
|
| 9 |
+
curl http://localhost:8000/health
|
| 10 |
+
|
| 11 |
+
Response:
|
| 12 |
+
|
| 13 |
+
{
|
| 14 |
+
"status": "ok"
|
| 15 |
+
}
|
| 16 |
+
|
| 17 |
+
Evaluate an intent (governance)
|
| 18 |
+
|
| 19 |
+
- Endpoint: POST /api/v1/intents/evaluate
|
| 20 |
+
- Content-Type: application/json
|
| 21 |
+
|
| 22 |
+
Example payload (minimal illustrative example — adapt to the `InfrastructureIntentRequest` model used by the project):
|
| 23 |
+
|
| 24 |
+
{
|
| 25 |
+
"id": "intent-123",
|
| 26 |
+
"description": "Example infrastructure change",
|
| 27 |
+
"estimated_cost": 100.0,
|
| 28 |
+
"policy_violations": []
|
| 29 |
+
}
|
| 30 |
+
|
| 31 |
+
Curl example:
|
| 32 |
+
|
| 33 |
+
curl -X POST http://localhost:8000/api/v1/intents/evaluate \
|
| 34 |
+
-H "Content-Type: application/json" \
|
| 35 |
+
-d '{"id":"intent-123","description":"Example","estimated_cost":100.0,"policy_violations":[]}'
|
| 36 |
+
|
| 37 |
+
Python (requests) example:
|
| 38 |
+
|
| 39 |
+
import requests
|
| 40 |
+
|
| 41 |
+
payload = {
|
| 42 |
+
"id": "intent-123",
|
| 43 |
+
"description": "Example infrastructure change",
|
| 44 |
+
"estimated_cost": 100.0,
|
| 45 |
+
"policy_violations": []
|
| 46 |
+
}
|
| 47 |
+
|
| 48 |
+
resp = requests.post("http://localhost:8000/api/v1/intents/evaluate", json=payload)
|
| 49 |
+
print(resp.status_code, resp.text)
|
| 50 |
+
|
| 51 |
+
Notes
|
| 52 |
+
|
| 53 |
+
- The evaluate endpoint uses an in-process `RiskEngine` (initialized in `app.main`) to compute risk and explanations.
|
| 54 |
+
- The `/api/v1/intents/outcome` endpoint exists but currently returns 501 Not Implemented — outcome recording/storage is incomplete in this repo.
|
docs/index.md
ADDED
|
@@ -0,0 +1,16 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# ARF API Control Plane
|
| 2 |
+
|
| 3 |
+
Welcome to the ARF API documentation.
|
| 4 |
+
|
| 5 |
+
Overview
|
| 6 |
+
|
| 7 |
+
- This repository implements the ARF API Control Plane (FastAPI) — the application mounts a number of routers under `/api/v1` and exposes a health endpoint at `/health`.
|
| 8 |
+
- App version (from app.main): 0.2.0
|
| 9 |
+
|
| 10 |
+
Important notes
|
| 11 |
+
|
| 12 |
+
- A `RiskEngine` is initialized at app startup and stored at `app.state.risk_engine`. The engine reads `ARF_HMC_MODEL` and `ARF_USE_HYPERPRIORS` environment variables.
|
| 13 |
+
- Authentication: there is an optional `api_key` in configuration, but request handlers do not currently enforce authentication.
|
| 14 |
+
- The `/api/v1/intents/outcome` endpoint exists but returns 501 Not Implemented; intent outcome recording/storage is not yet implemented.
|
| 15 |
+
|
| 16 |
+
See the other documentation pages for development instructions, endpoints, and examples.
|
monitor.sh
ADDED
|
@@ -0,0 +1,18 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/bin/bash
# Health-check the most recently recorded tunnel URL; restart the stack via
# start.sh when the /health probe fails. Appends a timestamped line to the
# log either way.

URL_FILE="/workspaces/arf-api/current_url.txt"
LOG_FILE="/workspaces/arf-api/monitor.log"

# Guard clause: without a recorded URL there is nothing to monitor.
[ -f "$URL_FILE" ] || { echo "$(date): No URL file found. Exiting." >> "$LOG_FILE"; exit 1; }

CURRENT_URL=$(cat "$URL_FILE")

# -f makes curl treat HTTP errors as failures; -s keeps output quiet.
if curl -s -f "$CURRENT_URL/health" > /dev/null; then
    echo "$(date): Tunnel OK." >> "$LOG_FILE"
else
    echo "$(date): Tunnel down. Restarting..." >> "$LOG_FILE"
    /workspaces/arf-api/start.sh
fi
|
render.yaml
ADDED
|
@@ -0,0 +1,19 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
services:
|
| 2 |
+
- type: web
|
| 3 |
+
name: arf-api
|
| 4 |
+
runtime: python
|
| 5 |
+
buildCommand: pip install -r requirements.txt
|
| 6 |
+
startCommand: uvicorn app.main:app --host 0.0.0.0 --port $PORT
|
| 7 |
+
envVars:
|
| 8 |
+
- key: DATABASE_URL
|
| 9 |
+
fromDatabase:
|
| 10 |
+
name: arf-db
|
| 11 |
+
property: connectionString
|
| 12 |
+
- key: API_KEY
|
| 13 |
+
sync: false
|
| 14 |
+
- key: ENVIRONMENT
|
| 15 |
+
value: production
|
| 16 |
+
databases:
|
| 17 |
+
- name: arf-db
|
| 18 |
+
databaseName: arf
|
| 19 |
+
user: arf_user
|
requirements-dev.txt
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
pytest-cov>=7.0.0
|
| 2 |
+
jsonschema>=4.0.0
|
| 3 |
+
pytest-asyncio>=0.24.0
|
requirements.txt
CHANGED
|
@@ -1,8 +1,10 @@
|
|
| 1 |
fastapi==0.115.12
|
| 2 |
uvicorn[standard]==0.34.0
|
| 3 |
-
pydantic=
|
|
|
|
|
|
|
|
|
|
| 4 |
pytest==8.3.5
|
| 5 |
-
pytest-cov>=6.0.0
|
| 6 |
httpx==0.28.1
|
| 7 |
alembic
|
| 8 |
pydantic-settings
|
|
@@ -11,9 +13,11 @@ psycopg2-binary==2.9.10
|
|
| 11 |
slowapi==0.1.9
|
| 12 |
prometheus-fastapi-instrumentator==7.1.0
|
| 13 |
flake8==7.2.0
|
| 14 |
-
cryptography
|
| 15 |
sentence-transformers>=2.2.0
|
| 16 |
scikit-learn
|
| 17 |
-
redis>=4.0.0
|
| 18 |
stripe>=9.0.0
|
| 19 |
-
|
|
|
|
|
|
|
|
|
| 1 |
fastapi==0.115.12
|
| 2 |
uvicorn[standard]==0.34.0
|
| 3 |
+
pydantic>=2.13.2
|
| 4 |
+
agentic-reliability-framework @ git+https://github.com/arf-foundation/agentic-reliability-framework@main
|
| 5 |
+
arf-pricing-calculator @ git+https://github.com/arf-foundation/ARF-Bayesian-Pricing-Calculator@main
|
| 6 |
+
pytest==8.3.5
|
| 7 |
# duplicate pin removed: pytest==8.3.5 is already listed above
|
|
|
|
| 8 |
httpx==0.28.1
|
| 9 |
alembic
|
| 10 |
pydantic-settings
|
|
|
|
| 13 |
slowapi==0.1.9
|
| 14 |
prometheus-fastapi-instrumentator==7.1.0
|
| 15 |
flake8==7.2.0
|
| 16 |
+
cryptography==47.0.0
|
| 17 |
sentence-transformers>=2.2.0
|
| 18 |
scikit-learn
|
| 19 |
+
redis>=4.0.0 # optional, for faster counters
|
| 20 |
stripe>=9.0.0
|
| 21 |
+
opentelemetry-api>=1.20.0
|
| 22 |
+
opentelemetry-sdk>=1.20.0
|
| 23 |
+
opentelemetry-instrumentation-fastapi>=0.50b0
|
runtime.txt
ADDED
|
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
|
|
|
| 1 |
+
python-3.12.3
|
| 2 |
+
# force fresh build
|
seed_rag_data.py
ADDED
|
@@ -0,0 +1,67 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Seed RAG graph with historical healing action success rates.
|
| 3 |
+
Run once before starting the API server.
|
| 4 |
+
"""
|
| 5 |
+
import sys
|
| 6 |
+
import os
|
| 7 |
+
sys.path.append(os.path.dirname(__file__))
|
| 8 |
+
|
| 9 |
+
from app.core.deps import get_rag_graph
|
| 10 |
+
from agentic_reliability_framework.core.models.event import HealingAction
|
| 11 |
+
|
| 12 |
+
def seed_historical_data():
    """Seed the RAG graph with a fixed catalogue of historical healing outcomes.

    Each seed entry records one past incident (action taken, whether it
    succeeded, and how long resolution took) against a placeholder
    ReliabilityEvent, so the graph has non-empty effectiveness statistics
    before the API server starts. Prints per-action stats at the end.
    """
    # Hoisted out of the per-item loop: importing once is sufficient
    # (the original re-ran this import on every iteration).
    from agentic_reliability_framework.core.models.event import ReliabilityEvent

    rag = get_rag_graph()

    # Define seed incidents (each with an outcome)
    seed_data = [
        # restart_container outcomes
        {"incident_id": "seed_restart_1", "component": "test", "action": HealingAction.RESTART_CONTAINER.value, "success": True, "resolution_time_minutes": 2},
        {"incident_id": "seed_restart_2", "component": "test", "action": HealingAction.RESTART_CONTAINER.value, "success": True, "resolution_time_minutes": 3},
        {"incident_id": "seed_restart_3", "component": "test", "action": HealingAction.RESTART_CONTAINER.value, "success": False, "resolution_time_minutes": 10},

        # rollback outcomes
        {"incident_id": "seed_rollback_1", "component": "test", "action": HealingAction.ROLLBACK.value, "success": True, "resolution_time_minutes": 1},
        {"incident_id": "seed_rollback_2", "component": "test", "action": HealingAction.ROLLBACK.value, "success": True, "resolution_time_minutes": 2},
        {"incident_id": "seed_rollback_3", "component": "test", "action": HealingAction.ROLLBACK.value, "success": False, "resolution_time_minutes": 5},

        # scale_out outcomes
        {"incident_id": "seed_scale_1", "component": "test", "action": HealingAction.SCALE_OUT.value, "success": True, "resolution_time_minutes": 5},
        {"incident_id": "seed_scale_2", "component": "test", "action": HealingAction.SCALE_OUT.value, "success": False, "resolution_time_minutes": 15},

        # circuit_breaker outcomes
        {"incident_id": "seed_cb_1", "component": "test", "action": HealingAction.CIRCUIT_BREAKER.value, "success": True, "resolution_time_minutes": 1},
        {"incident_id": "seed_cb_2", "component": "test", "action": HealingAction.CIRCUIT_BREAKER.value, "success": True, "resolution_time_minutes": 2},

        # traffic_shift outcomes
        {"incident_id": "seed_ts_1", "component": "test", "action": HealingAction.TRAFFIC_SHIFT.value, "success": True, "resolution_time_minutes": 4},
        {"incident_id": "seed_ts_2", "component": "test", "action": HealingAction.TRAFFIC_SHIFT.value, "success": False, "resolution_time_minutes": 8},
    ]

    # Add each outcome to the RAG graph
    for item in seed_data:
        # Placeholder event: metric values are representative only; the
        # graph keys its effectiveness stats on action/outcome, not metrics.
        event = ReliabilityEvent(
            component=item["component"],
            latency_p99=500,  # placeholder
            error_rate=0.1,
            service_mesh="default",
        )
        # Record the outcome
        rag.record_outcome(
            incident_id=item["incident_id"],
            event=event,
            action_taken=item["action"],
            success=item["success"],
            resolution_time_minutes=item["resolution_time_minutes"],
        )
        print(f"Seeded: {item['action']} -> success={item['success']}")

    print(f"Seeded {len(seed_data)} historical outcomes.")
    # Plain string: the original used an f-string with no placeholders.
    print("Stats per action:")
    for action in HealingAction:
        stats = rag.get_historical_effectiveness(action.value, component_filter="test")
        print(f"  {action.value}: uses={stats['total_uses']}, success_rate={stats['success_rate']:.2f}, avg_time={stats['avg_resolution_time_minutes']:.1f} min")


if __name__ == "__main__":
    seed_historical_data()
|
start.sh
ADDED
|
@@ -0,0 +1,68 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/bin/bash
# Restart the local dev stack: uvicorn backend + cloudflared quick tunnel,
# then point the Vercel frontend env var at the new tunnel URL.

# Set paths
BACKEND_DIR="/workspaces/arf-api"
FRONTEND_DIR="/workspaces/arf-frontend"
VENV_ACTIVATE="$BACKEND_DIR/venv/bin/activate"
CLOUDFLARED=$(which cloudflared 2>/dev/null || echo "/usr/local/bin/cloudflared")

# Kill any existing processes
echo "🛑 Stopping existing uvicorn and cloudflared..."
pkill -f uvicorn
pkill -f cloudflared
sleep 2

# Start uvicorn
echo "🚀 Starting uvicorn..."
# SC2164: abort if the directory is missing rather than continuing in the
# wrong working directory.
cd "$BACKEND_DIR" || exit 1
source "$VENV_ACTIVATE"
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload &
sleep 3

# Verify uvicorn is running
if ! curl -s http://localhost:8000/health >/dev/null; then
    echo "❌ uvicorn failed to start. Exiting."
    exit 1
fi
echo "✅ uvicorn is running."

# Start cloudflared and capture URL
echo "🌐 Starting cloudflared tunnel..."
TEMP_FILE=$(mktemp)
$CLOUDFLARED tunnel --url http://localhost:8000 2>&1 | tee "$TEMP_FILE" &

# Wait for URL to appear (cloudflared prints the assigned trycloudflare URL
# to its output, which we mirror into TEMP_FILE above).
echo "⏳ Waiting for tunnel URL..."
URL=""
for i in {1..30}; do
    URL=$(grep -oP 'https://[a-z0-9-]+\.trycloudflare\.com' "$TEMP_FILE" | head -1)
    if [ -n "$URL" ]; then
        break
    fi
    sleep 1
done

if [ -z "$URL" ]; then
    echo "❌ Failed to get tunnel URL."
    exit 1
fi
echo "✅ Tunnel URL: $URL"

# Save URL for monitoring (used by monitor.sh)
echo "$URL" > /workspaces/arf-api/current_url.txt

# Update Vercel environment variable
echo "🔧 Updating Vercel environment variable..."
# SC2164 again: never run vercel commands from an unexpected directory.
cd "$FRONTEND_DIR" || exit 1
if command -v vercel &>/dev/null; then
    vercel env rm NEXT_PUBLIC_API_URL production -y
    echo "$URL" | vercel env add NEXT_PUBLIC_API_URL production
    echo "🔄 Redeploying frontend..."
    vercel --prod
else
    echo "⚠️ Vercel CLI not installed. Please install it with: npm i -g vercel"
    echo "Then manually update the env var to: $URL"
fi

echo "🎉 All done! Your new URL is: $URL"
echo "Frontend will be updated shortly. Check https://arf-frontend-sandy.vercel.app"
|
tests/conftest.py
ADDED
|
@@ -0,0 +1,128 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
pytest configuration and fixtures for ARF API tests.
|
| 3 |
+
"""
|
| 4 |
+
|
| 5 |
+
from app.core.usage_tracker import enforce_quota, Tier
|
| 6 |
+
from app.api.deps import get_db
|
| 7 |
+
from app.database.base import Base
|
| 8 |
+
from app.main import app as fastapi_app
|
| 9 |
+
from sqlalchemy.orm import sessionmaker
|
| 10 |
+
from sqlalchemy import create_engine
|
| 11 |
+
from fastapi.testclient import TestClient
|
| 12 |
+
import app.core.usage_tracker
|
| 13 |
+
import os
|
| 14 |
+
import pytest
|
| 15 |
+
|
| 16 |
+
# ===== STEP 1: Set environment variables BEFORE any app imports =====
|
| 17 |
+
os.environ["ARF_USAGE_TRACKING"] = "false"
|
| 18 |
+
|
| 19 |
+
# Force the correct database URL for tests
|
| 20 |
+
os.environ["DATABASE_URL"] = "postgresql://postgres:postgres@localhost:5432/testdb"
|
| 21 |
+
os.environ["TEST_DATABASE_URL"] = "postgresql://postgres:postgres@localhost:5432/testdb"
|
| 22 |
+
|
| 23 |
+
# Additional PostgreSQL environment variables to prevent fallback to
|
| 24 |
+
# system user
|
| 25 |
+
os.environ["PGUSER"] = "postgres"
|
| 26 |
+
os.environ["PGPASSWORD"] = "postgres"
|
| 27 |
+
os.environ["PGHOST"] = "localhost"
|
| 28 |
+
os.environ["PGPORT"] = "5432"
|
| 29 |
+
os.environ["PGDATABASE"] = "testdb"
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
# ===== STEP 2: Mock the tracker module BEFORE importing app =====
|
| 33 |
+
class MockTracker:
    """Stand-in for the real usage tracker used in tests.

    Every call reports success, the tier is always PRO, and quota never
    runs out, so endpoints under test are never throttled.
    """

    def get_tier(self, api_key):
        from app.core.usage_tracker import Tier

        return Tier.PRO

    def get_remaining_quota(self, api_key, tier):
        return 1000

    def consume_quota_and_log(self, record, idempotency_key=None):
        return (True, None)

    def increment_usage_sync(self, record, idempotency_key=None):
        return True

    def get_or_create_api_key(self, key, tier):
        return True

    def update_api_key_tier(self, key, tier):
        return True

    def _insert_audit_log(self, record):
        # Auditing is a no-op in tests.
        return None
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
# Replace the tracker at the module level
|
| 62 |
+
app.core.usage_tracker.tracker = MockTracker()
|
| 63 |
+
|
| 64 |
+
# ===== STEP 3: Import app and database modules =====
|
| 65 |
+
|
| 66 |
+
# Force model registration (prevents "no such table" errors)
|
| 67 |
+
|
| 68 |
+
# Use the environment variable for the database URL (already set)
|
| 69 |
+
TEST_DATABASE_URL = os.getenv(
|
| 70 |
+
"TEST_DATABASE_URL",
|
| 71 |
+
"postgresql://postgres:postgres@localhost:5432/testdb")
|
| 72 |
+
|
| 73 |
+
if TEST_DATABASE_URL.startswith("postgresql"):
|
| 74 |
+
engine = create_engine(TEST_DATABASE_URL)
|
| 75 |
+
else:
|
| 76 |
+
engine = create_engine(
|
| 77 |
+
TEST_DATABASE_URL, connect_args={
|
| 78 |
+
"check_same_thread": False})
|
| 79 |
+
|
| 80 |
+
TestingSessionLocal = sessionmaker(
|
| 81 |
+
autocommit=False,
|
| 82 |
+
autoflush=False,
|
| 83 |
+
bind=engine)
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
def override_get_db():
    """Dependency override: hand out a session bound to the test engine."""
    session = TestingSessionLocal()
    try:
        yield session
    finally:
        # Always release the session, even if the request handler raised.
        session.close()


fastapi_app.dependency_overrides[get_db] = override_get_db
|
| 97 |
+
|
| 98 |
+
# Override enforce_quota dependency


async def mock_enforce_quota(request, api_key=None):
    """Bypass real quota enforcement: always grant a PRO key with room left."""
    granted = {"api_key": "test_key", "tier": Tier.PRO, "remaining": 1000}
    return granted


fastapi_app.dependency_overrides[enforce_quota] = mock_enforce_quota
|
| 104 |
+
|
| 105 |
+
|
| 106 |
+
@pytest.fixture(scope="session", autouse=True)
def setup_database():
    """Create all tables once before the test session; drop them after."""
    Base.metadata.create_all(engine)
    yield
    Base.metadata.drop_all(engine)
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
@pytest.fixture(scope="session")
def client():
    """Session-wide HTTP client; used as a context manager so app
    startup/shutdown hooks run."""
    with TestClient(fastapi_app) as c:
        yield c
|
| 118 |
+
|
| 119 |
+
|
| 120 |
+
@pytest.fixture(scope="function")
def db_session():
    """Provide a clean database session for each test.

    Tables are (re)created before the test and dropped afterwards so no
    state leaks between test functions.
    """
    Base.metadata.create_all(bind=engine)
    db = TestingSessionLocal()
    yield db
    db.rollback()
    db.close()
    Base.metadata.drop_all(bind=engine)
|
tests/test_deps.py
ADDED
|
@@ -0,0 +1,15 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import pytest
|
| 2 |
+
from unittest.mock import patch, MagicMock
|
| 3 |
+
from app.api.deps import get_db
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
def test_get_db_closes_session():
    """get_db must close its session even when the request raises."""
    fake_session = MagicMock()
    with patch('app.api.deps.SessionLocal', return_value=fake_session):
        gen = get_db()
        assert next(gen) == fake_session
        # Simulate an exception during request handling
        with pytest.raises(Exception):
            gen.throw(Exception("test error"))
        fake_session.close.assert_called_once()
|
tests/test_governance.py
ADDED
|
@@ -0,0 +1,71 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Tests for governance endpoints: /api/v1/intents/evaluate
|
| 3 |
+
"""
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
def test_evaluate_provision_intent(client):
    """A provision_resource intent should evaluate to a risk score."""
    intent = {
        "intent_type": "provision_resource",
        "environment": "prod",
        "resource_type": "database",
        "region": "eastus",
        "size": "Standard",
        "estimated_cost": 1200,
        "policy_violations": [],
        "requester": "alice",
        "provenance": {},
        "configuration": {}
    }
    resp = client.post("/api/v1/intents/evaluate", json=intent)
    assert resp.status_code == 200, resp.text
    assert "risk_score" in resp.json()
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
def test_evaluate_grant_access(client):
    """A grant_access intent should evaluate to a risk score."""
    intent = {
        "intent_type": "grant_access",
        "environment": "dev",
        "principal": "bob",
        "permission_level": "read",
        "resource_scope": "/subscriptions/123",
        "estimated_cost": None,
        "policy_violations": [],
        "requester": "alice",
        "provenance": {},
        "justification": "test"
    }
    resp = client.post("/api/v1/intents/evaluate", json=intent)
    assert resp.status_code == 200, resp.text
    assert "risk_score" in resp.json()
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def test_evaluate_deploy_config(client):
    """A deploy_config intent should evaluate to a risk score."""
    intent = {
        "intent_type": "deploy_config",
        "environment": "staging",
        "service_name": "payments-api",
        "change_scope": "canary",
        "deployment_target": "staging",
        "estimated_cost": 20,
        "policy_violations": [],
        "requester": "alice",
        "provenance": {},
        "configuration": {}
    }
    resp = client.post("/api/v1/intents/evaluate", json=intent)
    assert resp.status_code == 200, resp.text
    assert "risk_score" in resp.json()
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
def test_invalid_intent_type(client):
    """An unrecognised intent_type must be rejected with a 422 validation error."""
    bad_intent = {
        "intent_type": "UnknownIntent",
        "environment": "prod",
        "requester": "alice",
        "provenance": {}
    }
    resp = client.post("/api/v1/intents/evaluate", json=bad_intent)
    assert resp.status_code == 422
|
tests/test_healing_endpoint.py
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from fastapi.testclient import TestClient
|
| 2 |
+
from app.main import app
|
| 3 |
+
|
| 4 |
+
client = TestClient(app)


def test_healing_evaluate_endpoint():
    """Happy-path smoke test: a valid event yields a 200 from /healing/evaluate."""
    event = {
        "component": "my-service",
        "latency_p99": 450.0,
        "error_rate": 0.25,
        "service_mesh": "default",
        "cpu_util": 0.85,
        "memory_util": 0.90
    }
    resp = client.post("/api/v1/healing/evaluate", json={"event": event})
    assert resp.status_code == 200, (
        f"Expected 200, got {resp.status_code}: {resp.text}"
    )
|