# SPARKNET Cloud Architecture

This document outlines the cloud-ready architecture for deploying SPARKNET on AWS.

## Overview

SPARKNET is designed with a modular architecture that supports both local development and cloud deployment. The system can scale from a single developer machine to enterprise-grade cloud infrastructure.

## Local Development Stack

```
┌─────────────────────────────────────────────────────┐
│                    Local Machine                    │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │   Ollama    │  │  ChromaDB   │  │  File I/O   │  │
│  │   (LLM)     │  │  (Vector)   │  │  (Storage)  │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
│         │                │                │         │
│         └────────────────┼────────────────┘         │
│                          │                          │
│                 ┌────────┴────────┐                 │
│                 │    SPARKNET     │                 │
│                 │   Application   │                 │
│                 └─────────────────┘                 │
└─────────────────────────────────────────────────────┘
```

## AWS Cloud Architecture

### Target Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                              AWS Cloud                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐  │
│  │   API GW    │──────│   Lambda    │──────│   Step Functions    │  │
│  │   (REST)    │      │  (Compute)  │      │   (Orchestration)   │  │
│  └─────────────┘      └─────────────┘      └─────────────────────┘  │
│         │                     │                        │            │
│         ▼                     ▼                        ▼            │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐  │
│  │     S3      │      │   Bedrock   │      │     OpenSearch      │  │
│  │  (Storage)  │      │    (LLM)    │      │   (Vector Store)    │  │
│  └─────────────┘      └─────────────┘      └─────────────────────┘  │
│                                                                     │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐  │
│  │  Textract   │      │    Titan    │      │      DynamoDB       │  │
│  │   (OCR)     │      │ (Embeddings)│      │     (Metadata)      │  │
│  └─────────────┘      └─────────────┘      └─────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Component Mapping

| Local Component | AWS Service | Purpose |
|----------------|-------------|---------|
| File I/O | S3 | Document storage |
| PaddleOCR/Tesseract | Textract | OCR extraction |
| Ollama LLM | Bedrock (Claude/Titan) | Text generation |
| Ollama Embeddings | Titan Embeddings | Vector embeddings |
| ChromaDB | OpenSearch Serverless | Vector search |
| SQLite (optional) | DynamoDB | Metadata storage |
| Python Process | Lambda | Compute |
| CLI | API Gateway | HTTP interface |

## Migration Strategy

### Phase 1: Storage Migration

```python
from pathlib import Path

import boto3


# Abstract storage interface
class StorageAdapter:
    def put(self, key: str, data: bytes) -> str: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> bool: ...


# Local implementation: keys map to paths under a base directory
class LocalStorageAdapter(StorageAdapter):
    def __init__(self, base_path: str):
        self.base_path = Path(base_path)


# S3 implementation: keys map to object keys in a bucket
class S3StorageAdapter(StorageAdapter):
    def __init__(self, bucket: str):
        self.client = boto3.client('s3')
        self.bucket = bucket
```

### Phase 2: OCR Migration

```python
import boto3
import cv2
import numpy as np


# Abstract OCR interface
class OCREngine:
    def recognize(self, image: np.ndarray) -> OCRResult: ...


# Local: PaddleOCR
class PaddleOCREngine(OCREngine):
    ...


# Cloud: Textract
class TextractEngine(OCREngine):
    def __init__(self):
        self.client = boto3.client('textract')

    def recognize(self, image: np.ndarray) -> OCRResult:
        # Textract expects encoded image bytes, not a raw pixel array
        _, encoded = cv2.imencode('.png', image)
        response = self.client.detect_document_text(
            Document={'Bytes': encoded.tobytes()}
        )
        return self._convert_response(response)
```
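Phases 1 and 2 follow the same adapter pattern, so the concrete backend can be selected once at startup. The sketch below is a minimal, hypothetical example of that wiring: the `SPARKNET_BACKEND` environment variable and the `build_storage` factory are illustrative names, not part of the current codebase, and the returned classes are the adapters defined in Phase 1.

```python
import os

# Hypothetical factory: choose the storage backend once at startup so
# the rest of the pipeline only ever sees the StorageAdapter interface.
def build_storage() -> StorageAdapter:
    if os.environ.get("SPARKNET_BACKEND") == "aws":
        return S3StorageAdapter(bucket="sparknet-documents")
    return LocalStorageAdapter(base_path="./data")
```

The same pattern extends naturally to the OCR, LLM, and vector store adapters in the remaining phases.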
### Phase 3: LLM Migration

```python
import json

import boto3


# Abstract LLM interface
class LLMAdapter:
    def generate(self, prompt: str) -> str: ...


# Local: Ollama
class OllamaAdapter(LLMAdapter):
    ...


# Cloud: Bedrock
class BedrockAdapter(LLMAdapter):
    def __init__(self, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        # Claude 3 models on Bedrock use the Messages API request format
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            })
        )
        # The response body is a stream; decode it before extracting text
        payload = json.loads(response['body'].read())
        return payload['content'][0]['text']
```

### Phase 4: Vector Store Migration

```python
from opensearchpy import OpenSearch


# Abstract vector store interface (already implemented)
class VectorStore:
    def add_chunks(self, chunks, embeddings): ...
    def search(self, query_embedding, top_k): ...


# Local: ChromaDB (already implemented)
class ChromaVectorStore(VectorStore):
    ...


# Cloud: OpenSearch
class OpenSearchVectorStore(VectorStore):
    def __init__(self, endpoint: str, index: str):
        self.client = OpenSearch(hosts=[endpoint])
        self.index = index

    def search(self, query_embedding, top_k):
        # k-NN clauses sit under the top-level "query" key in the DSL
        response = self.client.search(
            index=self.index,
            body={
                "size": top_k,
                "query": {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k
                        }
                    }
                }
            }
        )
        return self._convert_results(response)
```

## AWS Services Deep Dive

### Amazon S3

- **Purpose**: Document storage and processed results
- **Structure**:

```
s3://sparknet-documents/
├── raw/                     # Original documents
│   └── {doc_id}/
│       └── document.pdf
├── processed/               # Processed results
│   └── {doc_id}/
│       ├── metadata.json
│       ├── chunks.json
│       └── pages/
│           ├── page_0.png
│           └── page_1.png
└── cache/                   # Processing cache
```

### Amazon Textract

- **Purpose**: OCR extraction with layout analysis
- **Features**:
  - Document text detection
  - Table extraction
  - Form extraction
  - Handwriting recognition

### Amazon Bedrock

- **Purpose**: LLM inference
- **Models**:
  - Claude 3.5 Sonnet (primary)
  - Titan Text (cost-effective)
  - Titan Embeddings (vectors)

### Amazon OpenSearch Serverless

- **Purpose**: Vector search and retrieval
- **Configuration**:

```json
{
  "index": "sparknet-vectors",
  "settings": {
    "index.knn": true,
    "index.knn.space_type": "cosinesimil"
  },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024
      }
    }
  }
}
```

### AWS Lambda

- **Purpose**: Serverless compute
- **Functions**:
  - `process-document`: Document processing pipeline
  - `extract-fields`: Field extraction
  - `rag-query`: RAG query handling
  - `index-document`: Vector indexing

### AWS Step Functions

- **Purpose**: Workflow orchestration
- **Workflow**:

```json
{
  "StartAt": "ProcessDocument",
  "States": {
    "ProcessDocument": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:process-document",
      "Next": "IndexChunks"
    },
    "IndexChunks": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:index-document",
      "End": true
    }
  }
}
```

## Cost Optimization

### Tiered Processing

| Tier | Use Case | Services | Cost |
|------|----------|----------|------|
| Basic | Simple OCR | Textract + Titan | $ |
| Standard | Full pipeline | + Claude Haiku | $$ |
| Premium | Complex analysis | + Claude Sonnet | $$$ |

### Caching Strategy

1. **Document Cache**: S3 with lifecycle policies
2. **Embedding Cache**: ElastiCache (Redis); see the sketch below
3. **Query Cache**: Lambda@Edge
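As one illustration of the embedding cache, the sketch below wraps an embedding call in a Redis get-or-compute step. The client calls are standard redis-py, but the key scheme, TTL, and `embed` callable are assumptions; in AWS the host would be the ElastiCache endpoint rather than localhost.

```python
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)  # ElastiCache endpoint in AWS


def cached_embedding(text: str, embed) -> list[float]:
    # Key on a content hash so identical chunks reuse the same vector
    key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = embed(text)  # the expensive model call (e.g. Titan Embeddings)
    cache.set(key, json.dumps(vector), ex=86400)  # expire after 24 hours
    return vector
```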
## Security

### IAM Policies

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::sparknet-documents/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:*::foundation-model/*"
    }
  ]
}
```

### Data Encryption

- S3: Server-side encryption (SSE-S3 or SSE-KMS)
- OpenSearch: Encryption at rest
- Lambda: Environment variable encryption

## Deployment

### Infrastructure as Code (Terraform)

```hcl
# S3 bucket for documents
resource "aws_s3_bucket" "documents" {
  bucket = "sparknet-documents"
}

# Lambda function for the processing pipeline
resource "aws_lambda_function" "processor" {
  function_name = "sparknet-processor"
  runtime       = "python3.11"
  handler       = "handler.process"
  memory_size   = 1024
  timeout       = 300
}

# OpenSearch Serverless collection for vectors
resource "aws_opensearchserverless_collection" "vectors" {
  name = "sparknet-vectors"
  type = "VECTORSEARCH"
}
```

### CI/CD Pipeline

```yaml
# GitHub Actions
name: Deploy SPARKNET
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Deploy Lambda
        run: |
          aws lambda update-function-code \
            --function-name sparknet-processor \
            --zip-file fileb://package.zip
```

## Monitoring

### CloudWatch Metrics

- Lambda invocations and duration
- S3 request counts
- OpenSearch query latency
- Bedrock token usage

### Dashboards

- Processing throughput
- Error rates
- Cost tracking
- Vector store statistics

## Next Steps

1. **Implement Storage Abstraction**: Create S3 adapter
2. **Add Textract Engine**: Implement AWS OCR
3. **Create Bedrock Adapter**: LLM migration
4. **Deploy OpenSearch**: Vector store setup
5. **Build Lambda Functions**: Serverless compute
6. **Setup Step Functions**: Workflow orchestration (see the invocation sketch below)
7. **Configure CI/CD**: Automated deployment
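For step 6, once the Step Functions workflow from the orchestration section is deployed, a client can start it with `start_execution`. This is a minimal sketch: the state machine ARN and input fields are placeholders for whatever the deployed workflow actually expects.

```python
import json

import boto3

sfn = boto3.client('stepfunctions')

# Placeholder ARN and input: substitute the deployed state machine's values
response = sfn.start_execution(
    stateMachineArn='arn:aws:states:us-east-1:123456789012:stateMachine:sparknet-pipeline',
    input=json.dumps({"doc_id": "example-doc", "bucket": "sparknet-documents"}),
)
print(response['executionArn'])
```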