# SPARKNET Cloud Architecture
This document outlines the cloud-ready architecture for deploying SPARKNET on AWS.
## Overview
SPARKNET is designed with a modular architecture that supports both local development and cloud deployment. The system can scale from a single developer machine to enterprise-grade cloud infrastructure.
## Local Development Stack
```
┌─────────────────────────────────────────────────────┐
│                    Local Machine                    │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │   Ollama    │  │  ChromaDB   │  │  File I/O   │  │
│  │   (LLM)     │  │  (Vector)   │  │  (Storage)  │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
│         │                │                │         │
│         └────────────────┼────────────────┘         │
│                          │                          │
│                 ┌────────┴────────┐                 │
│                 │    SPARKNET     │                 │
│                 │   Application   │                 │
│                 └─────────────────┘                 │
└─────────────────────────────────────────────────────┘
```
## AWS Cloud Architecture
### Target Architecture
```
┌──────────────────────────────────────────────────────────────────────┐
│                              AWS Cloud                               │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐   │
│  │   API GW    │──────│   Lambda    │──────│   Step Functions    │   │
│  │   (REST)    │      │  (Compute)  │      │   (Orchestration)   │   │
│  └─────────────┘      └─────────────┘      └─────────────────────┘   │
│         │                    │                        │              │
│         │                    │                        │              │
│         ▼                    ▼                        ▼              │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐   │
│  │     S3      │      │   Bedrock   │      │     OpenSearch      │   │
│  │  (Storage)  │      │    (LLM)    │      │   (Vector Store)    │   │
│  └─────────────┘      └─────────────┘      └─────────────────────┘   │
│                                                                      │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐   │
│  │  Textract   │      │    Titan    │      │      DynamoDB       │   │
│  │    (OCR)    │      │ (Embeddings)│      │     (Metadata)      │   │
│  └─────────────┘      └─────────────┘      └─────────────────────┘   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```
### Component Mapping
| Local Component | AWS Service | Purpose |
|----------------|-------------|---------|
| File I/O | S3 | Document storage |
| PaddleOCR/Tesseract | Textract | OCR extraction |
| Ollama LLM | Bedrock (Claude/Titan) | Text generation |
| Ollama Embeddings | Titan Embeddings | Vector embeddings |
| ChromaDB | OpenSearch Serverless | Vector search |
| SQLite (optional) | DynamoDB | Metadata storage |
| Python Process | Lambda | Compute |
| CLI | API Gateway | HTTP interface |
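
Keeping both columns of this table behind a single code path comes down to choosing the adapter at startup. A minimal sketch, assuming the adapter classes introduced in the migration phases below and a hypothetical `SPARKNET_ENV` environment variable:

```python
import os

def build_storage() -> "StorageAdapter":
    # Hypothetical switch: "aws" selects the cloud adapter, anything else
    # falls back to the local development stack.
    if os.getenv("SPARKNET_ENV") == "aws":
        return S3StorageAdapter(bucket="sparknet-documents")
    return LocalStorageAdapter(base_path="./data")
```

The same pattern applies to the OCR, LLM, and vector store adapters.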
## Migration Strategy
### Phase 1: Storage Migration
```python
from pathlib import Path

import boto3

# Abstract storage interface
class StorageAdapter:
    def put(self, key: str, data: bytes) -> str: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> bool: ...

# Local implementation: keys map to files under a base directory
class LocalStorageAdapter(StorageAdapter):
    def __init__(self, base_path: str):
        self.base_path = Path(base_path)

# S3 implementation: keys map to object keys in a bucket
class S3StorageAdapter(StorageAdapter):
    def __init__(self, bucket: str):
        self.client = boto3.client('s3')
        self.bucket = bucket
```
### Phase 2: OCR Migration
```python
import boto3
import cv2
import numpy as np

# Abstract OCR interface
class OCREngine:
    def recognize(self, image: np.ndarray) -> OCRResult: ...

# Local: PaddleOCR
class PaddleOCREngine(OCREngine): ...

# Cloud: Textract
class TextractEngine(OCREngine):
    def __init__(self):
        self.client = boto3.client('textract')

    def recognize(self, image: np.ndarray) -> OCRResult:
        # Textract expects encoded image bytes, not a raw array
        _, encoded = cv2.imencode('.png', image)
        response = self.client.detect_document_text(
            Document={'Bytes': encoded.tobytes()}
        )
        return self._convert_response(response)
```
### Phase 3: LLM Migration
```python
import json

import boto3

# Abstract LLM interface
class LLMAdapter:
    def generate(self, prompt: str) -> str: ...

# Local: Ollama
class OllamaAdapter(LLMAdapter): ...

# Cloud: Bedrock
class BedrockAdapter(LLMAdapter):
    def __init__(self, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        # Claude 3 models on Bedrock use the Anthropic Messages request format
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            })
        )
        body = json.loads(response['body'].read())
        return body['content'][0]['text']
```
### Phase 4: Vector Store Migration
```python
from opensearchpy import OpenSearch

# Abstract vector store interface (already implemented)
class VectorStore:
    def add_chunks(self, chunks, embeddings): ...
    def search(self, query_embedding, top_k): ...

# Local: ChromaDB (already implemented)
class ChromaVectorStore(VectorStore): ...

# Cloud: OpenSearch
class OpenSearchVectorStore(VectorStore):
    def __init__(self, endpoint: str, index: str):
        self.client = OpenSearch(hosts=[endpoint])
        self.index = index

    def search(self, query_embedding, top_k):
        # k-NN query against the knn_vector field defined in the index mapping
        response = self.client.search(
            index=self.index,
            body={
                "size": top_k,
                "query": {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k
                        }
                    }
                }
            }
        )
        return self._convert_results(response)
```
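
The OpenSearch `add_chunks` side is not shown above; a minimal indexing sketch using the opensearch-py bulk helper (the `chunk_id` and `text` attributes are assumptions about the chunk objects, and the `text` field would need to be added to the index mapping shown later):

```python
from opensearchpy import helpers

def add_chunks(self, chunks, embeddings):
    # One document per chunk; `embedding` matches the knn_vector field,
    # the `text` and chunk attributes are illustrative.
    actions = [
        {
            "_index": self.index,
            "_id": chunk.chunk_id,
            "_source": {"text": chunk.text, "embedding": embedding},
        }
        for chunk, embedding in zip(chunks, embeddings)
    ]
    helpers.bulk(self.client, actions)
```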
## AWS Services Deep Dive
### Amazon S3
- **Purpose**: Stores source documents and processed results
- **Structure**:
```
s3://sparknet-documents/
├── raw/                      # Original documents
│   └── {doc_id}/
│       └── document.pdf
├── processed/                # Processed results
│   └── {doc_id}/
│       ├── metadata.json
│       ├── chunks.json
│       └── pages/
│           ├── page_0.png
│           └── page_1.png
└── cache/                    # Processing cache
```
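
A minimal sketch of writing processed output into this layout with boto3 (the `doc_id` and metadata payload are placeholders):

```python
import json
import boto3

s3 = boto3.client("s3")

def save_metadata(doc_id: str, metadata: dict) -> None:
    # Processed results live under processed/{doc_id}/ per the layout above
    s3.put_object(
        Bucket="sparknet-documents",
        Key=f"processed/{doc_id}/metadata.json",
        Body=json.dumps(metadata).encode("utf-8"),
        ContentType="application/json",
    )
```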
### Amazon Textract
- **Purpose**: OCR extraction with layout analysis
- **Features**:
- Document text detection
- Table extraction
- Form extraction
- Handwriting recognition
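
Table and form extraction use `analyze_document` rather than `detect_document_text`. A minimal sketch for a single-page image already stored in S3 (the bucket and key are placeholders):

```python
import boto3

textract = boto3.client("textract")

# TABLES and FORMS enable table and key-value blocks in the response
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "sparknet-documents", "Name": "raw/doc-123/page_0.png"}},
    FeatureTypes=["TABLES", "FORMS"],
)
blocks = response["Blocks"]
```

Multi-page PDFs require the asynchronous `start_document_analysis` / `get_document_analysis` pair instead.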
### Amazon Bedrock
- **Purpose**: LLM inference
- **Models**:
- Claude 3.5 Sonnet (primary)
- Titan Text (cost-effective)
- Titan Embeddings (vectors)
### Amazon OpenSearch Serverless
- **Purpose**: Vector search and retrieval
- **Configuration**:
```json
{
  "index": "sparknet-vectors",
  "settings": {
    "index.knn": true,
    "index.knn.space_type": "cosinesimil"
  },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024
      }
    }
  }
}
```
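
A minimal sketch of creating this index with the opensearch-py client (the collection endpoint is a placeholder and authentication is omitted):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["<collection-endpoint>"])

# Mirrors the settings and mappings shown above
client.indices.create(
    index="sparknet-vectors",
    body={
        "settings": {"index.knn": True, "index.knn.space_type": "cosinesimil"},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 1024}
            }
        },
    },
)
```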
### AWS Lambda
- **Purpose**: Serverless compute
- **Functions**:
- `process-document`: Document processing pipeline
- `extract-fields`: Field extraction
- `rag-query`: RAG query handling
- `index-document`: Vector indexing
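
A minimal sketch of the `process-document` handler (the event contract with `bucket` and `key` fields is an assumption, not defined elsewhere in this document):

```python
import json
import boto3

s3 = boto3.client("s3")

def process(event, context):
    # Assumed event shape: {"bucket": "...", "key": "raw/{doc_id}/document.pdf"}
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    document_bytes = obj["Body"].read()

    # ... OCR, chunking, and field extraction would run here ...

    return {"statusCode": 200, "body": json.dumps({"key": event["key"], "status": "processed"})}
```

The handler name matches the `handler.process` entry point used in the Terraform example below.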
### AWS Step Functions
- **Purpose**: Workflow orchestration
- **Workflow**:
```json
{
  "StartAt": "ProcessDocument",
  "States": {
    "ProcessDocument": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:process-document",
      "Next": "IndexChunks"
    },
    "IndexChunks": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:index-document",
      "End": true
    }
  }
}
```
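
Starting a run of this workflow from code is a single API call; a minimal sketch (the state machine ARN is a placeholder):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Kicks off one execution of the document-processing workflow defined above
sfn.start_execution(
    stateMachineArn="arn:aws:states:<region>:<account-id>:stateMachine:sparknet-pipeline",
    input=json.dumps({"bucket": "sparknet-documents", "key": "raw/doc-123/document.pdf"}),
)
```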
## Cost Optimization
### Tiered Processing
| Tier | Use Case | Services | Relative Cost |
|------|----------|----------|------|
| Basic | Simple OCR | Textract + Titan | $ |
| Standard | Full pipeline | + Claude Haiku | $$ |
| Premium | Complex analysis | + Claude Sonnet | $$$ |
### Caching Strategy
1. **Document Cache**: S3 with lifecycle policies
2. **Embedding Cache**: ElastiCache (Redis)
3. **Query Cache**: Lambda@Edge
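
A minimal sketch of the embedding cache, assuming an ElastiCache Redis endpoint and a key derived from a hash of the chunk text (the key scheme and TTL are assumptions):

```python
import hashlib
import json
import redis

cache = redis.Redis(host="<elasticache-endpoint>", port=6379)

def cached_embedding(text: str, embed_fn):
    # Identical chunks hash to the same key and reuse the stored vector
    key = "emb:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = embed_fn(text)
    cache.set(key, json.dumps(vector), ex=86400)  # 24-hour TTL
    return vector
```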
## Security
### IAM Policies
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::sparknet-documents/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:*::foundation-model/*"
    }
  ]
}
```
### Data Encryption
- S3: Server-side encryption (SSE-S3 or SSE-KMS)
- OpenSearch: Encryption at rest
- Lambda: Environment variable encryption
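
With SSE-KMS, encryption can also be requested explicitly on upload; a minimal sketch (the KMS key alias is a placeholder):

```python
import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="sparknet-documents",
    Key="raw/doc-123/document.pdf",
    Body=b"...",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/sparknet-documents",
)
```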
## Deployment
### Infrastructure as Code (Terraform)
```hcl
# S3 Bucket
resource "aws_s3_bucket" "documents" {
  bucket = "sparknet-documents"
}

# Lambda Function (the execution role and deployment package referenced
# below are assumed to be defined elsewhere in the configuration)
resource "aws_lambda_function" "processor" {
  function_name = "sparknet-processor"
  runtime       = "python3.11"
  handler       = "handler.process"
  memory_size   = 1024
  timeout       = 300
  filename      = "package.zip"
  role          = aws_iam_role.lambda_exec.arn
}

# OpenSearch Serverless
resource "aws_opensearchserverless_collection" "vectors" {
  name = "sparknet-vectors"
  type = "VECTORSEARCH"
}
```
### CI/CD Pipeline
```yaml
# GitHub Actions
name: Deploy SPARKNET

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          # Assumes deploy credentials are stored as repository secrets;
          # the region is a placeholder
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Deploy Lambda
        run: |
          aws lambda update-function-code \
            --function-name sparknet-processor \
            --zip-file fileb://package.zip
```
## Monitoring
### CloudWatch Metrics
- Lambda invocations and duration
- S3 request counts
- OpenSearch query latency
- Bedrock token usage
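
Per-workload token usage can be published as a custom metric alongside the built-in service metrics; a minimal sketch (the namespace and dimension names are assumptions):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_token_usage(tokens: int, model_id: str) -> None:
    # Custom metric; namespace and dimension names are illustrative
    cloudwatch.put_metric_data(
        Namespace="SPARKNET",
        MetricData=[{
            "MetricName": "BedrockTokens",
            "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            "Value": tokens,
            "Unit": "Count",
        }],
    )
```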
### Dashboards
- Processing throughput
- Error rates
- Cost tracking
- Vector store statistics
## Next Steps
1. **Implement Storage Abstraction**: Create S3 adapter
2. **Add Textract Engine**: Implement AWS OCR
3. **Create Bedrock Adapter**: LLM migration
4. **Deploy OpenSearch**: Vector store setup
5. **Build Lambda Functions**: Serverless compute
6. **Set Up Step Functions**: Workflow orchestration
7. **Configure CI/CD**: Automated deployment