| # SPARKNET Cloud Architecture | |
| This document outlines the cloud-ready architecture for deploying SPARKNET on AWS. | |
| ## Overview | |
| SPARKNET is designed with a modular architecture that supports both local development and cloud deployment. The system can scale from a single developer machine to enterprise-grade cloud infrastructure. | |
| ## Local Development Stack | |
| ``` | |
| ┌─────────────────────────────────────────────────────┐ | |
| │ Local Machine │ | |
| ├─────────────────────────────────────────────────────┤ | |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ | |
| │ │ Ollama │ │ ChromaDB │ │ File I/O │ │ | |
| │ │ (LLM) │ │ (Vector) │ │ (Storage) │ │ | |
| │ └─────────────┘ └─────────────┘ └─────────────┘ │ | |
| │ │ │ │ │ | |
| │ └───────────────┼───────────────┘ │ | |
| │ │ │ | |
| │ ┌────────┴────────┐ │ | |
| │ │ SPARKNET │ │ | |
| │ │ Application │ │ | |
| │ └─────────────────┘ │ | |
| └─────────────────────────────────────────────────────┘ | |
| ``` | |
| ## AWS Cloud Architecture | |
| ### Target Architecture | |
| ``` | |
| ┌────────────────────────────────────────────────────────────────────┐ | |
| │ AWS Cloud │ | |
| ├────────────────────────────────────────────────────────────────────┤ | |
| │ │ | |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ | |
| │ │ API GW │──────│ Lambda │──────│ Step Functions │ │ | |
| │ │ (REST) │ │ (Compute) │ │ (Orchestration) │ │ | |
| │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ | |
| │ │ │ │ │ | |
| │ │ │ │ │ | |
| │ ▼ ▼ ▼ │ | |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ | |
| │ │ S3 │ │ Bedrock │ │ OpenSearch │ │ | |
| │ │ (Storage) │ │ (LLM) │ │ (Vector Store) │ │ | |
| │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ | |
| │ │ | |
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ | |
| │ │ Textract │ │ Titan │ │ DynamoDB │ │ | |
| │ │ (OCR) │ │ (Embeddings)│ │ (Metadata) │ │ | |
| │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ | |
| │ │ | |
| └────────────────────────────────────────────────────────────────────┘ | |
| ``` | |
| ### Component Mapping | |
| | Local Component | AWS Service | Purpose | | |
| |----------------|-------------|---------| | |
| | File I/O | S3 | Document storage | | |
| | PaddleOCR/Tesseract | Textract | OCR extraction | | |
| | Ollama LLM | Bedrock (Claude/Titan) | Text generation | | |
| | Ollama Embeddings | Titan Embeddings | Vector embeddings | | |
| | ChromaDB | OpenSearch Serverless | Vector search | | |
| | SQLite (optional) | DynamoDB | Metadata storage | | |
| | Python Process | Lambda | Compute | | |
| | CLI | API Gateway | HTTP interface | | |
| ## Migration Strategy | |
| ### Phase 1: Storage Migration | |
| ```python | |
| # Abstract storage interface | |
| class StorageAdapter: | |
| def put(self, key: str, data: bytes) -> str: ... | |
| def get(self, key: str) -> bytes: ... | |
| def delete(self, key: str) -> bool: ... | |
| # Local implementation | |
| class LocalStorageAdapter(StorageAdapter): | |
| def __init__(self, base_path: str): | |
| self.base_path = Path(base_path) | |
| # S3 implementation | |
| class S3StorageAdapter(StorageAdapter): | |
| def __init__(self, bucket: str): | |
| self.client = boto3.client('s3') | |
| self.bucket = bucket | |
| ``` | |
| ### Phase 2: OCR Migration | |
| ```python | |
| # Abstract OCR interface | |
| class OCREngine: | |
| def recognize(self, image: np.ndarray) -> OCRResult: ... | |
| # Local: PaddleOCR | |
| class PaddleOCREngine(OCREngine): ... | |
| # Cloud: Textract | |
| class TextractEngine(OCREngine): | |
| def __init__(self): | |
| self.client = boto3.client('textract') | |
| def recognize(self, image: np.ndarray) -> OCRResult: | |
| response = self.client.detect_document_text( | |
| Document={'Bytes': image_bytes} | |
| ) | |
| return self._convert_response(response) | |
| ``` | |
| ### Phase 3: LLM Migration | |
| ```python | |
| # Abstract LLM interface | |
| class LLMAdapter: | |
| def generate(self, prompt: str) -> str: ... | |
| # Local: Ollama | |
| class OllamaAdapter(LLMAdapter): ... | |
| # Cloud: Bedrock | |
| class BedrockAdapter(LLMAdapter): | |
| def __init__(self, model_id: str = "anthropic.claude-3-sonnet"): | |
| self.client = boto3.client('bedrock-runtime') | |
| self.model_id = model_id | |
| def generate(self, prompt: str) -> str: | |
| response = self.client.invoke_model( | |
| modelId=self.model_id, | |
| body=json.dumps({"prompt": prompt}) | |
| ) | |
| return response['body'] | |
| ``` | |
| ### Phase 4: Vector Store Migration | |
| ```python | |
| # Abstract vector store interface (already implemented) | |
| class VectorStore: | |
| def add_chunks(self, chunks, embeddings): ... | |
| def search(self, query_embedding, top_k): ... | |
| # Local: ChromaDB (already implemented) | |
| class ChromaVectorStore(VectorStore): ... | |
| # Cloud: OpenSearch | |
| class OpenSearchVectorStore(VectorStore): | |
| def __init__(self, endpoint: str, index: str): | |
| self.client = OpenSearch(hosts=[endpoint]) | |
| self.index = index | |
| def search(self, query_embedding, top_k): | |
| response = self.client.search( | |
| index=self.index, | |
| body={ | |
| "knn": { | |
| "embedding": { | |
| "vector": query_embedding, | |
| "k": top_k | |
| } | |
| } | |
| } | |
| ) | |
| return self._convert_results(response) | |
| ``` | |
| ## AWS Services Deep Dive | |
| ### Amazon S3 | |
| - **Purpose**: Document storage and processed results | |
| - **Structure**: | |
| ``` | |
| s3://sparknet-documents/ | |
| ├── raw/ # Original documents | |
| │ └── {doc_id}/ | |
| │ └── document.pdf | |
| ├── processed/ # Processed results | |
| │ └── {doc_id}/ | |
| │ ├── metadata.json | |
| │ ├── chunks.json | |
| │ └── pages/ | |
| │ ├── page_0.png | |
| │ └── page_1.png | |
| └── cache/ # Processing cache | |
| ``` | |
| ### Amazon Textract | |
| - **Purpose**: OCR extraction with layout analysis | |
| - **Features**: | |
| - Document text detection | |
| - Table extraction | |
| - Form extraction | |
| - Handwriting recognition | |
| ### Amazon Bedrock | |
| - **Purpose**: LLM inference | |
| - **Models**: | |
| - Claude 3.5 Sonnet (primary) | |
| - Titan Text (cost-effective) | |
| - Titan Embeddings (vectors) | |
| ### Amazon OpenSearch Serverless | |
| - **Purpose**: Vector search and retrieval | |
| - **Configuration**: | |
| ```json | |
| { | |
| "index": "sparknet-vectors", | |
| "settings": { | |
| "index.knn": true, | |
| "index.knn.space_type": "cosinesimil" | |
| }, | |
| "mappings": { | |
| "properties": { | |
| "embedding": { | |
| "type": "knn_vector", | |
| "dimension": 1024 | |
| } | |
| } | |
| } | |
| } | |
| ``` | |
| ### AWS Lambda | |
| - **Purpose**: Serverless compute | |
| - **Functions**: | |
| - `process-document`: Document processing pipeline | |
| - `extract-fields`: Field extraction | |
| - `rag-query`: RAG query handling | |
| - `index-document`: Vector indexing | |
| ### AWS Step Functions | |
| - **Purpose**: Workflow orchestration | |
| - **Workflow**: | |
| ```json | |
| { | |
| "StartAt": "ProcessDocument", | |
| "States": { | |
| "ProcessDocument": { | |
| "Type": "Task", | |
| "Resource": "arn:aws:lambda:process-document", | |
| "Next": "IndexChunks" | |
| }, | |
| "IndexChunks": { | |
| "Type": "Task", | |
| "Resource": "arn:aws:lambda:index-document", | |
| "End": true | |
| } | |
| } | |
| } | |
| ``` | |
| ## Cost Optimization | |
| ### Tiered Processing | |
| | Tier | Use Case | Services | Cost | | |
| |------|----------|----------|------| | |
| | Basic | Simple OCR | Textract + Titan | $ | | |
| | Standard | Full pipeline | + Claude Haiku | $$ | | |
| | Premium | Complex analysis | + Claude Sonnet | $$$ | | |
| ### Caching Strategy | |
| 1. **Document Cache**: S3 with lifecycle policies | |
| 2. **Embedding Cache**: ElastiCache (Redis) | |
| 3. **Query Cache**: Lambda@Edge | |
| ## Security | |
| ### IAM Policies | |
| ```json | |
| { | |
| "Version": "2012-10-17", | |
| "Statement": [ | |
| { | |
| "Effect": "Allow", | |
| "Action": [ | |
| "s3:GetObject", | |
| "s3:PutObject" | |
| ], | |
| "Resource": "arn:aws:s3:::sparknet-documents/*" | |
| }, | |
| { | |
| "Effect": "Allow", | |
| "Action": [ | |
| "textract:DetectDocumentText", | |
| "textract:AnalyzeDocument" | |
| ], | |
| "Resource": "*" | |
| }, | |
| { | |
| "Effect": "Allow", | |
| "Action": [ | |
| "bedrock:InvokeModel" | |
| ], | |
| "Resource": "arn:aws:bedrock:*::foundation-model/*" | |
| } | |
| ] | |
| } | |
| ``` | |
| ### Data Encryption | |
| - S3: Server-side encryption (SSE-S3 or SSE-KMS) | |
| - OpenSearch: Encryption at rest | |
| - Lambda: Environment variable encryption | |
| ## Deployment | |
| ### Infrastructure as Code (Terraform) | |
| ```hcl | |
| # S3 Bucket | |
| resource "aws_s3_bucket" "documents" { | |
| bucket = "sparknet-documents" | |
| } | |
| # Lambda Function | |
| resource "aws_lambda_function" "processor" { | |
| function_name = "sparknet-processor" | |
| runtime = "python3.11" | |
| handler = "handler.process" | |
| memory_size = 1024 | |
| timeout = 300 | |
| } | |
| # OpenSearch Serverless | |
| resource "aws_opensearchserverless_collection" "vectors" { | |
| name = "sparknet-vectors" | |
| type = "VECTORSEARCH" | |
| } | |
| ``` | |
| ### CI/CD Pipeline | |
| ```yaml | |
| # GitHub Actions | |
| name: Deploy SPARKNET | |
| on: | |
| push: | |
| branches: [main] | |
| jobs: | |
| deploy: | |
| runs-on: ubuntu-latest | |
| steps: | |
| - uses: actions/checkout@v3 | |
| - name: Deploy Lambda | |
| run: | | |
| aws lambda update-function-code \ | |
| --function-name sparknet-processor \ | |
| --zip-file fileb://package.zip | |
| ``` | |
| ## Monitoring | |
| ### CloudWatch Metrics | |
| - Lambda invocations and duration | |
| - S3 request counts | |
| - OpenSearch query latency | |
| - Bedrock token usage | |
| ### Dashboards | |
| - Processing throughput | |
| - Error rates | |
| - Cost tracking | |
| - Vector store statistics | |
| ## Next Steps | |
| 1. **Implement Storage Abstraction**: Create S3 adapter | |
| 2. **Add Textract Engine**: Implement AWS OCR | |
| 3. **Create Bedrock Adapter**: LLM migration | |
| 4. **Deploy OpenSearch**: Vector store setup | |
| 5. **Build Lambda Functions**: Serverless compute | |
| 6. **Setup Step Functions**: Workflow orchestration | |
| 7. **Configure CI/CD**: Automated deployment | |