SPARKNET Cloud Architecture

This document outlines the cloud-ready architecture for deploying SPARKNET on AWS.

Overview

SPARKNET is designed with a modular architecture that supports both local development and cloud deployment. The system can scale from a single developer machine to enterprise-grade cloud infrastructure.

Local Development Stack

┌─────────────────────────────────────────────────────┐
│                    Local Machine                    │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │   Ollama    │  │  ChromaDB   │  │  File I/O   │  │
│  │   (LLM)     │  │  (Vector)   │  │  (Storage)  │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
│         │                │                │         │
│         └────────────────┼────────────────┘         │
│                          │                          │
│                 ┌────────┴────────┐                 │
│                 │    SPARKNET     │                 │
│                 │   Application   │                 │
│                 └─────────────────┘                 │
└─────────────────────────────────────────────────────┘

AWS Cloud Architecture

Target Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                              AWS Cloud                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐  │
│  │   API GW    │──────│   Lambda    │──────│    Step Functions   │  │
│  │  (REST)     │      │  (Compute)  │      │   (Orchestration)   │  │
│  └─────────────┘      └─────────────┘      └─────────────────────┘  │
│         │                    │                       │              │
│         │                    │                       │              │
│         ▼                    ▼                       ▼              │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐  │
│  │     S3      │      │   Bedrock   │      │     OpenSearch      │  │
│  │  (Storage)  │      │   (LLM)     │      │   (Vector Store)    │  │
│  └─────────────┘      └─────────────┘      └─────────────────────┘  │
│                                                                     │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐  │
│  │  Textract   │      │    Titan    │      │      DynamoDB       │  │
│  │   (OCR)     │      │ (Embeddings)│      │     (Metadata)      │  │
│  └─────────────┘      └─────────────┘      └─────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Component Mapping

| Local Component     | AWS Service            | Purpose          |
|---------------------|------------------------|------------------|
| File I/O            | S3                     | Document storage |
| PaddleOCR/Tesseract | Textract               | OCR extraction   |
| Ollama LLM          | Bedrock (Claude/Titan) | Text generation  |
| Ollama Embeddings   | Titan Embeddings       | Vector embeddings|
| ChromaDB            | OpenSearch Serverless  | Vector search    |
| SQLite (optional)   | DynamoDB               | Metadata storage |
| Python Process      | Lambda                 | Compute          |
| CLI                 | API Gateway            | HTTP interface   |
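Because every local component has a one-to-one cloud counterpart, the backend can be chosen at a single switch point. A minimal sketch of that idea, assuming a hypothetical `SPARKNET_BACKEND` environment variable (the variable name and registry are illustrative, not part of the current codebase):

```python
import os

# Hypothetical registry mapping a deployment target to adapter names.
# The class names mirror the adapters sketched in the migration phases below.
BACKENDS = {
    "local": {"storage": "LocalStorageAdapter", "ocr": "PaddleOCREngine",
              "llm": "OllamaAdapter", "vectors": "ChromaVectorStore"},
    "aws":   {"storage": "S3StorageAdapter", "ocr": "TextractEngine",
              "llm": "BedrockAdapter", "vectors": "OpenSearchVectorStore"},
}

def select_backend() -> dict:
    """Pick the adapter set from SPARKNET_BACKEND (default: local)."""
    target = os.environ.get("SPARKNET_BACKEND", "local")
    return BACKENDS[target]
```

With this shape, cloud migration becomes configuration rather than code change.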

Migration Strategy

Phase 1: Storage Migration

# Shared imports for the adapters below
from pathlib import Path

import boto3

# Abstract storage interface
class StorageAdapter:
    def put(self, key: str, data: bytes) -> str: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> bool: ...

# Local implementation
class LocalStorageAdapter(StorageAdapter):
    def __init__(self, base_path: str):
        self.base_path = Path(base_path)

# S3 implementation
class S3StorageAdapter(StorageAdapter):
    def __init__(self, bucket: str):
        self.client = boto3.client('s3')
        self.bucket = bucket
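The local adapter can be exercised end-to-end without any AWS dependency, which is useful as a sanity check before wiring in S3. A self-contained sketch (it re-declares a minimal adapter with filled-in method bodies so it runs standalone):

```python
import tempfile
from pathlib import Path

class LocalStorageAdapter:
    """Minimal standalone version of the local adapter, for demonstration."""
    def __init__(self, base_path: str):
        self.base_path = Path(base_path)

    def put(self, key: str, data: bytes) -> str:
        path = self.base_path / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return str(path)

    def get(self, key: str) -> bytes:
        return (self.base_path / key).read_bytes()

    def delete(self, key: str) -> bool:
        path = self.base_path / key
        if path.exists():
            path.unlink()
            return True
        return False

# Round trip: put, get, delete
store = LocalStorageAdapter(tempfile.mkdtemp())
store.put("raw/doc-1/document.pdf", b"%PDF-1.4")
assert store.get("raw/doc-1/document.pdf") == b"%PDF-1.4"
assert store.delete("raw/doc-1/document.pdf") is True
```

The S3 adapter would implement the same three methods with `put_object`, `get_object`, and `delete_object`, so callers never notice which backend is active.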

Phase 2: OCR Migration

# Shared imports (OCRResult is SPARKNET's result type, defined elsewhere)
import boto3
import cv2
import numpy as np

# Abstract OCR interface
class OCREngine:
    def recognize(self, image: np.ndarray) -> OCRResult: ...

# Local: PaddleOCR
class PaddleOCREngine(OCREngine): ...

# Cloud: Textract
class TextractEngine(OCREngine):
    def __init__(self):
        self.client = boto3.client('textract')

    def recognize(self, image: np.ndarray) -> OCRResult:
        # Textract expects encoded image bytes, not a raw array
        image_bytes = cv2.imencode('.png', image)[1].tobytes()
        response = self.client.detect_document_text(
            Document={'Bytes': image_bytes}
        )
        return self._convert_response(response)
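Textract responses are a flat list of `Blocks`, so a `_convert_response` helper mostly filters and regroups them. A sketch of the core of that conversion, returning plain lines rather than SPARKNET's actual `OCRResult` type (which is a stand-in here):

```python
def textract_lines(response: dict) -> list[str]:
    """Collect the text of LINE blocks from a Textract
    detect_document_text response, in reading order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]

# Example with a trimmed-down response payload
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "INVOICE #42"},
    {"BlockType": "WORD", "Text": "INVOICE"},
    {"BlockType": "LINE", "Text": "Total: $10.00"},
]}
print(textract_lines(sample))  # ['INVOICE #42', 'Total: $10.00']
```

The real converter would also carry over each block's `Geometry` and `Confidence` fields so layout information survives the migration.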

Phase 3: LLM Migration

# Shared imports
import json

import boto3

# Abstract LLM interface
class LLMAdapter:
    def generate(self, prompt: str) -> str: ...

# Local: Ollama
class OllamaAdapter(LLMAdapter): ...

# Cloud: Bedrock
class BedrockAdapter(LLMAdapter):
    def __init__(self, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        # Claude models on Bedrock use the Messages API request format
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            })
        )
        # The response body is a stream; read and parse it
        payload = json.loads(response['body'].read())
        return payload['content'][0]['text']

Phase 4: Vector Store Migration

from opensearchpy import OpenSearch

# Abstract vector store interface (already implemented)
class VectorStore:
    def add_chunks(self, chunks, embeddings): ...
    def search(self, query_embedding, top_k): ...

# Local: ChromaDB (already implemented)
class ChromaVectorStore(VectorStore): ...

# Cloud: OpenSearch
class OpenSearchVectorStore(VectorStore):
    def __init__(self, endpoint: str, index: str):
        self.client = OpenSearch(hosts=[endpoint])
        self.index = index

    def search(self, query_embedding, top_k):
        # k-NN clauses must be nested under "query" in the search body
        response = self.client.search(
            index=self.index,
            body={
                "size": top_k,
                "query": {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k
                        }
                    }
                }
            }
        )
        return self._convert_results(response)

AWS Services Deep Dive

Amazon S3

  • Purpose: Document storage and processed results
  • Structure:
    s3://sparknet-documents/
    ├── raw/                    # Original documents
    │   └── {doc_id}/
    │       └── document.pdf
    ├── processed/              # Processed results
    │   └── {doc_id}/
    │       ├── metadata.json
    │       ├── chunks.json
    │       └── pages/
    │           ├── page_0.png
    │           └── page_1.png
    └── cache/                  # Processing cache
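This layout can be captured in a few key helpers so adapters and Lambda functions never hard-code paths. A sketch (the function names are illustrative, not existing SPARKNET APIs):

```python
def raw_key(doc_id: str, filename: str = "document.pdf") -> str:
    """S3 key for an original uploaded document."""
    return f"raw/{doc_id}/{filename}"

def processed_key(doc_id: str, artifact: str) -> str:
    """S3 key for a processing artifact (metadata.json, chunks.json, ...)."""
    return f"processed/{doc_id}/{artifact}"

def page_key(doc_id: str, page: int) -> str:
    """S3 key for a rendered page image."""
    return f"processed/{doc_id}/pages/page_{page}.png"
```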
    

Amazon Textract

  • Purpose: OCR extraction with layout analysis
  • Features:
    • Document text detection
    • Table extraction
    • Form extraction
    • Handwriting recognition

Amazon Bedrock

  • Purpose: LLM inference
  • Models:
    • Claude 3.5 Sonnet (primary)
    • Titan Text (cost-effective)
    • Titan Embeddings (vectors)

Amazon OpenSearch Serverless

  • Purpose: Vector search and retrieval
  • Configuration:
    {
      "index": "sparknet-vectors",
      "settings": {
        "index.knn": true,
        "index.knn.space_type": "cosinesimil"
      },
      "mappings": {
        "properties": {
          "embedding": {
            "type": "knn_vector",
            "dimension": 1024
          }
        }
      }
    }
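Creating that index from Python is straightforward once the body is built. A sketch that only constructs the settings/mappings dict; actually creating the index would go through `opensearch-py`'s `client.indices.create`:

```python
def vector_index_body(dimension: int = 1024) -> dict:
    """Index settings and mappings for a k-NN vector field named 'embedding'."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": dimension}
            }
        },
    }

# With an opensearch-py client (not executed here):
# client.indices.create(index="sparknet-vectors", body=vector_index_body())
```

The dimension must match the embedding model exactly (1024 for Titan Embeddings V2), or indexing will fail at write time.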
    

AWS Lambda

  • Purpose: Serverless compute
  • Functions:
    • process-document: Document processing pipeline
    • extract-fields: Field extraction
    • rag-query: RAG query handling
    • index-document: Vector indexing

AWS Step Functions

  • Purpose: Workflow orchestration
  • Workflow:
    {
      "StartAt": "ProcessDocument",
      "States": {
        "ProcessDocument": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:<region>:<account-id>:function:process-document",
          "Next": "IndexChunks"
        },
        "IndexChunks": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:<region>:<account-id>:function:index-document",
          "End": true
        }
      }
    }
    

Cost Optimization

Tiered Processing

| Tier     | Use Case         | Services          | Cost |
|----------|------------------|-------------------|------|
| Basic    | Simple OCR       | Textract + Titan  | $    |
| Standard | Full pipeline    | + Claude Haiku    | $$   |
| Premium  | Complex analysis | + Claude Sonnet   | $$$  |

Caching Strategy

  1. Document Cache: S3 with lifecycle policies
  2. Embedding Cache: ElastiCache (Redis)
  3. Query Cache: Lambda@Edge
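The embedding cache hinges on a stable key per text chunk. A sketch using a content hash, with a plain dict standing in for Redis; a real deployment would swap in `redis-py`'s `get`/`set` with a TTL:

```python
import hashlib

def embedding_cache_key(text: str, model: str) -> str:
    """Stable cache key: model name plus SHA-256 of the chunk text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{model}:{digest}"

class DictEmbeddingCache:
    """In-memory stand-in for ElastiCache/Redis."""
    def __init__(self):
        self._store = {}

    def get_or_compute(self, text: str, model: str, compute):
        key = embedding_cache_key(text, model)
        if key not in self._store:
            self._store[key] = compute(text)
        return self._store[key]
```

Keying on a hash of the text means re-uploading an unchanged document costs no new embedding calls.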

Security

IAM Policies

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::sparknet-documents/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:*::foundation-model/*"
    }
  ]
}

Data Encryption

  • S3: Server-side encryption (SSE-S3 or SSE-KMS)
  • OpenSearch: Encryption at rest
  • Lambda: Environment variable encryption

Deployment

Infrastructure as Code (Terraform)

# S3 Bucket
resource "aws_s3_bucket" "documents" {
  bucket = "sparknet-documents"
}

# Lambda Function
resource "aws_lambda_function" "processor" {
  function_name = "sparknet-processor"
  runtime       = "python3.11"
  handler       = "handler.process"
  memory_size   = 1024
  timeout       = 300
  role          = aws_iam_role.lambda_exec.arn  # execution role, defined separately
  filename      = "package.zip"                 # deployment package
}

# OpenSearch Serverless
resource "aws_opensearchserverless_collection" "vectors" {
  name = "sparknet-vectors"
  type = "VECTORSEARCH"
}

CI/CD Pipeline

# GitHub Actions
name: Deploy SPARKNET

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-region: us-east-1
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}  # OIDC deploy role
      - name: Deploy Lambda
        run: |
          aws lambda update-function-code \
            --function-name sparknet-processor \
            --zip-file fileb://package.zip

Monitoring

CloudWatch Metrics

  • Lambda invocations and duration
  • S3 request counts
  • OpenSearch query latency
  • Bedrock token usage

Dashboards

  • Processing throughput
  • Error rates
  • Cost tracking
  • Vector store statistics

Next Steps

  1. Implement Storage Abstraction: Create S3 adapter
  2. Add Textract Engine: Implement AWS OCR
  3. Create Bedrock Adapter: LLM migration
  4. Deploy OpenSearch: Vector store setup
  5. Build Lambda Functions: Serverless compute
  6. Set up Step Functions: Workflow orchestration
  7. Configure CI/CD: Automated deployment