SPARKNET Cloud Architecture

This document outlines the cloud-ready architecture for deploying SPARKNET on AWS.

Overview

SPARKNET is designed with a modular architecture that supports both local development and cloud deployment. The system can scale from a single developer machine to enterprise-grade cloud infrastructure.

Local Development Stack

┌─────────────────────────────────────────────────────┐
│                    Local Machine                    │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │   Ollama    │  │  ChromaDB   │  │  File I/O   │  │
│  │   (LLM)     │  │  (Vector)   │  │  (Storage)  │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
│         │                │                │         │
│         └────────────────┼────────────────┘         │
│                          │                          │
│                 ┌────────┴────────┐                 │
│                 │    SPARKNET     │                 │
│                 │   Application   │                 │
│                 └─────────────────┘                 │
└─────────────────────────────────────────────────────┘

AWS Cloud Architecture

Target Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                              AWS Cloud                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐  │
│  │   API GW    │──────│   Lambda    │──────│    Step Functions   │  │
│  │  (REST)     │      │  (Compute)  │      │   (Orchestration)   │  │
│  └─────────────┘      └─────────────┘      └─────────────────────┘  │
│         │                    │                       │              │
│         │                    │                       │              │
│         ▼                    ▼                       ▼              │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐  │
│  │     S3      │      │   Bedrock   │      │     OpenSearch      │  │
│  │  (Storage)  │      │   (LLM)     │      │   (Vector Store)    │  │
│  └─────────────┘      └─────────────┘      └─────────────────────┘  │
│                                                                     │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐  │
│  │  Textract   │      │    Titan    │      │      DynamoDB       │  │
│  │   (OCR)     │      │ (Embeddings)│      │     (Metadata)      │  │
│  └─────────────┘      └─────────────┘      └─────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Component Mapping

| Local Component     | AWS Service            | Purpose          |
|---------------------|------------------------|------------------|
| File I/O            | S3                     | Document storage |
| PaddleOCR/Tesseract | Textract               | OCR extraction   |
| Ollama LLM          | Bedrock (Claude/Titan) | Text generation  |
| Ollama Embeddings   | Titan Embeddings       | Vector embeddings|
| ChromaDB            | OpenSearch Serverless  | Vector search    |
| SQLite (optional)   | DynamoDB               | Metadata storage |
| Python Process      | Lambda                 | Compute          |
| CLI                 | API Gateway            | HTTP interface   |
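Because every local component has a one-to-one cloud counterpart, the backend can be chosen at a single switch point. A minimal sketch of that idea, assuming a hypothetical `SPARKNET_BACKEND` environment variable (the variable name and registry are illustrative, not part of the current codebase):

```python
import os

# Hypothetical registry mapping a deployment target to adapter names.
# The class names mirror the adapters sketched in the migration phases below.
BACKENDS = {
    "local": {"storage": "LocalStorageAdapter", "ocr": "PaddleOCREngine",
              "llm": "OllamaAdapter", "vectors": "ChromaVectorStore"},
    "aws":   {"storage": "S3StorageAdapter", "ocr": "TextractEngine",
              "llm": "BedrockAdapter", "vectors": "OpenSearchVectorStore"},
}

def select_backend() -> dict:
    """Pick the adapter set from SPARKNET_BACKEND (default: local)."""
    target = os.environ.get("SPARKNET_BACKEND", "local")
    return BACKENDS[target]
```

With this shape, cloud migration becomes configuration rather than code change.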

Migration Strategy

Phase 1: Storage Migration

# Shared imports for the adapters below
from pathlib import Path

import boto3

# Abstract storage interface
class StorageAdapter:
    def put(self, key: str, data: bytes) -> str: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> bool: ...

# Local implementation
class LocalStorageAdapter(StorageAdapter):
    def __init__(self, base_path: str):
        self.base_path = Path(base_path)

# S3 implementation
class S3StorageAdapter(StorageAdapter):
    def __init__(self, bucket: str):
        self.client = boto3.client('s3')
        self.bucket = bucket
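The local adapter can be exercised end-to-end without any AWS dependency, which is useful as a sanity check before wiring in S3. A self-contained sketch (it re-declares a minimal adapter with filled-in method bodies so it runs standalone):

```python
import tempfile
from pathlib import Path

class LocalStorageAdapter:
    """Minimal standalone version of the local adapter, for demonstration."""
    def __init__(self, base_path: str):
        self.base_path = Path(base_path)

    def put(self, key: str, data: bytes) -> str:
        path = self.base_path / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return str(path)

    def get(self, key: str) -> bytes:
        return (self.base_path / key).read_bytes()

    def delete(self, key: str) -> bool:
        path = self.base_path / key
        if path.exists():
            path.unlink()
            return True
        return False

# Round trip: put, get, delete
store = LocalStorageAdapter(tempfile.mkdtemp())
store.put("raw/doc-1/document.pdf", b"%PDF-1.4")
assert store.get("raw/doc-1/document.pdf") == b"%PDF-1.4"
assert store.delete("raw/doc-1/document.pdf") is True
```

The S3 adapter would implement the same three methods with `put_object`, `get_object`, and `delete_object`, so callers never notice which backend is active.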

Phase 2: OCR Migration

# Shared imports (OCRResult is SPARKNET's result type, defined elsewhere)
import boto3
import cv2
import numpy as np

# Abstract OCR interface
class OCREngine:
    def recognize(self, image: np.ndarray) -> OCRResult: ...

# Local: PaddleOCR
class PaddleOCREngine(OCREngine): ...

# Cloud: Textract
class TextractEngine(OCREngine):
    def __init__(self):
        self.client = boto3.client('textract')

    def recognize(self, image: np.ndarray) -> OCRResult:
        # Textract expects encoded image bytes, not a raw array
        image_bytes = cv2.imencode('.png', image)[1].tobytes()
        response = self.client.detect_document_text(
            Document={'Bytes': image_bytes}
        )
        return self._convert_response(response)
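Textract responses are a flat list of `Blocks`, so a `_convert_response` helper mostly filters and regroups them. A sketch of the core of that conversion, returning plain lines rather than SPARKNET's actual `OCRResult` type (which is a stand-in here):

```python
def textract_lines(response: dict) -> list[str]:
    """Collect the text of LINE blocks from a Textract
    detect_document_text response, in reading order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]

# Example with a trimmed-down response payload
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "INVOICE #42"},
    {"BlockType": "WORD", "Text": "INVOICE"},
    {"BlockType": "LINE", "Text": "Total: $10.00"},
]}
print(textract_lines(sample))  # ['INVOICE #42', 'Total: $10.00']
```

The real converter would also carry over each block's `Geometry` and `Confidence` fields so layout information survives the migration.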

Phase 3: LLM Migration

# Shared imports
import json

import boto3

# Abstract LLM interface
class LLMAdapter:
    def generate(self, prompt: str) -> str: ...

# Local: Ollama
class OllamaAdapter(LLMAdapter): ...

# Cloud: Bedrock
class BedrockAdapter(LLMAdapter):
    def __init__(self, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        # Claude models on Bedrock use the Messages API request format
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            })
        )
        # The response body is a stream; read and parse it
        payload = json.loads(response['body'].read())
        return payload['content'][0]['text']

Phase 4: Vector Store Migration

from opensearchpy import OpenSearch

# Abstract vector store interface (already implemented)
class VectorStore:
    def add_chunks(self, chunks, embeddings): ...
    def search(self, query_embedding, top_k): ...

# Local: ChromaDB (already implemented)
class ChromaVectorStore(VectorStore): ...

# Cloud: OpenSearch
class OpenSearchVectorStore(VectorStore):
    def __init__(self, endpoint: str, index: str):
        self.client = OpenSearch(hosts=[endpoint])
        self.index = index

    def search(self, query_embedding, top_k):
        # k-NN clauses must be nested under "query" in the search body
        response = self.client.search(
            index=self.index,
            body={
                "size": top_k,
                "query": {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k
                        }
                    }
                }
            }
        )
        return self._convert_results(response)

AWS Services Deep Dive

Amazon S3

  • Purpose: Document storage and processed results
  • Structure:
    s3://sparknet-documents/
    ├── raw/                    # Original documents
    │   └── {doc_id}/
    │       └── document.pdf
    ├── processed/              # Processed results
    │   └── {doc_id}/
    │       ├── metadata.json
    │       ├── chunks.json
    │       └── pages/
    │           ├── page_0.png
    │           └── page_1.png
    └── cache/                  # Processing cache
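This layout can be captured in a few key helpers so adapters and Lambda functions never hard-code paths. A sketch (the function names are illustrative, not existing SPARKNET APIs):

```python
def raw_key(doc_id: str, filename: str = "document.pdf") -> str:
    """S3 key for an original uploaded document."""
    return f"raw/{doc_id}/{filename}"

def processed_key(doc_id: str, artifact: str) -> str:
    """S3 key for a processing artifact (metadata.json, chunks.json, ...)."""
    return f"processed/{doc_id}/{artifact}"

def page_key(doc_id: str, page: int) -> str:
    """S3 key for a rendered page image."""
    return f"processed/{doc_id}/pages/page_{page}.png"
```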
    

Amazon Textract

  • Purpose: OCR extraction with layout analysis
  • Features:
    • Document text detection
    • Table extraction
    • Form extraction
    • Handwriting recognition

Amazon Bedrock

  • Purpose: LLM inference
  • Models:
    • Claude 3.5 Sonnet (primary)
    • Titan Text (cost-effective)
    • Titan Embeddings (vectors)

Amazon OpenSearch Serverless

  • Purpose: Vector search and retrieval
  • Configuration:
    {
      "index": "sparknet-vectors",
      "settings": {
        "index.knn": true,
        "index.knn.space_type": "cosinesimil"
      },
      "mappings": {
        "properties": {
          "embedding": {
            "type": "knn_vector",
            "dimension": 1024
          }
        }
      }
    }
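Creating that index from Python is straightforward once the body is built. A sketch that only constructs the settings/mappings dict; actually creating the index would go through `opensearch-py`'s `client.indices.create`:

```python
def vector_index_body(dimension: int = 1024) -> dict:
    """Index settings and mappings for a k-NN vector field named 'embedding'."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": dimension}
            }
        },
    }

# With an opensearch-py client (not executed here):
# client.indices.create(index="sparknet-vectors", body=vector_index_body())
```

The dimension must match the embedding model exactly (1024 for Titan Embeddings V2), or indexing will fail at write time.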
    

AWS Lambda

  • Purpose: Serverless compute
  • Functions:
    • process-document: Document processing pipeline
    • extract-fields: Field extraction
    • rag-query: RAG query handling
    • index-document: Vector indexing

AWS Step Functions

  • Purpose: Workflow orchestration
  • Workflow:
    {
      "StartAt": "ProcessDocument",
      "States": {
        "ProcessDocument": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:<region>:<account-id>:function:process-document",
          "Next": "IndexChunks"
        },
        "IndexChunks": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:<region>:<account-id>:function:index-document",
          "End": true
        }
      }
    }
    

Cost Optimization

Tiered Processing

| Tier     | Use Case         | Services          | Cost |
|----------|------------------|-------------------|------|
| Basic    | Simple OCR       | Textract + Titan  | $    |
| Standard | Full pipeline    | + Claude Haiku    | $$   |
| Premium  | Complex analysis | + Claude Sonnet   | $$$  |

Caching Strategy

  1. Document Cache: S3 with lifecycle policies
  2. Embedding Cache: ElastiCache (Redis)
  3. Query Cache: Lambda@Edge
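The embedding cache hinges on a stable key per text chunk. A sketch using a content hash, with a plain dict standing in for Redis; a real deployment would swap in `redis-py`'s `get`/`set` with a TTL:

```python
import hashlib

def embedding_cache_key(text: str, model: str) -> str:
    """Stable cache key: model name plus SHA-256 of the chunk text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{model}:{digest}"

class DictEmbeddingCache:
    """In-memory stand-in for ElastiCache/Redis."""
    def __init__(self):
        self._store = {}

    def get_or_compute(self, text: str, model: str, compute):
        key = embedding_cache_key(text, model)
        if key not in self._store:
            self._store[key] = compute(text)
        return self._store[key]
```

Keying on a hash of the text means re-uploading an unchanged document costs no new embedding calls.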

Security

IAM Policies

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::sparknet-documents/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:*::foundation-model/*"
    }
  ]
}

Data Encryption

  • S3: Server-side encryption (SSE-S3 or SSE-KMS)
  • OpenSearch: Encryption at rest
  • Lambda: Environment variable encryption

Deployment

Infrastructure as Code (Terraform)

# S3 Bucket
resource "aws_s3_bucket" "documents" {
  bucket = "sparknet-documents"
}

# Lambda Function
resource "aws_lambda_function" "processor" {
  function_name = "sparknet-processor"
  runtime       = "python3.11"
  handler       = "handler.process"
  memory_size   = 1024
  timeout       = 300
  role          = aws_iam_role.lambda_exec.arn  # execution role, defined separately
  filename      = "package.zip"                 # deployment package
}

# OpenSearch Serverless
resource "aws_opensearchserverless_collection" "vectors" {
  name = "sparknet-vectors"
  type = "VECTORSEARCH"
}

CI/CD Pipeline

# GitHub Actions
name: Deploy SPARKNET

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-region: us-east-1
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}  # OIDC deploy role
      - name: Deploy Lambda
        run: |
          aws lambda update-function-code \
            --function-name sparknet-processor \
            --zip-file fileb://package.zip

Monitoring

CloudWatch Metrics

  • Lambda invocations and duration
  • S3 request counts
  • OpenSearch query latency
  • Bedrock token usage

Dashboards

  • Processing throughput
  • Error rates
  • Cost tracking
  • Vector store statistics

Next Steps

  1. Implement Storage Abstraction: Create S3 adapter
  2. Add Textract Engine: Implement AWS OCR
  3. Create Bedrock Adapter: LLM migration
  4. Deploy OpenSearch: Vector store setup
  5. Build Lambda Functions: Serverless compute
  6. Set up Step Functions: Workflow orchestration
  7. Configure CI/CD: Automated deployment