# SPARKNET Cloud Architecture
This document outlines the cloud-ready architecture for deploying SPARKNET on AWS.
## Overview
SPARKNET is designed with a modular architecture that supports both local development and cloud deployment. The system can scale from a single developer machine to enterprise-grade cloud infrastructure.
## Local Development Stack

```
┌──────────────────────────────────────────────────────┐
│                    Local Machine                     │
├──────────────────────────────────────────────────────┤
│  ┌───────────┐    ┌───────────┐    ┌───────────┐     │
│  │  Ollama   │    │ ChromaDB  │    │ File I/O  │     │
│  │   (LLM)   │    │ (Vector)  │    │ (Storage) │     │
│  └─────┬─────┘    └─────┬─────┘    └─────┬─────┘     │
│        │                │                │           │
│        └────────────────┼────────────────┘           │
│                         │                            │
│                ┌────────┴───────┐                    │
│                │    SPARKNET    │                    │
│                │  Application   │                    │
│                └────────────────┘                    │
└──────────────────────────────────────────────────────┘
```
## AWS Cloud Architecture

### Target Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                          AWS Cloud                           │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌────────────┐     ┌────────────┐     ┌────────────────┐    │
│  │   API GW   │────►│   Lambda   │────►│ Step Functions │    │
│  │   (REST)   │     │ (Compute)  │     │(Orchestration) │    │
│  └──────┬─────┘     └──────┬─────┘     └───────┬────────┘    │
│         │                  │                   │             │
│         ▼                  ▼                   ▼             │
│  ┌────────────┐     ┌────────────┐     ┌────────────────┐    │
│  │     S3     │     │  Bedrock   │     │   OpenSearch   │    │
│  │ (Storage)  │     │   (LLM)    │     │ (Vector Store) │    │
│  └────────────┘     └────────────┘     └────────────────┘    │
│                                                              │
│  ┌────────────┐     ┌────────────┐     ┌────────────────┐    │
│  │  Textract  │     │   Titan    │     │    DynamoDB    │    │
│  │   (OCR)    │     │(Embeddings)│     │   (Metadata)   │    │
│  └────────────┘     └────────────┘     └────────────────┘    │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
### Component Mapping
| Local Component | AWS Service | Purpose |
|---|---|---|
| File I/O | S3 | Document storage |
| PaddleOCR/Tesseract | Textract | OCR extraction |
| Ollama LLM | Bedrock (Claude/Titan) | Text generation |
| Ollama Embeddings | Titan Embeddings | Vector embeddings |
| ChromaDB | OpenSearch Serverless | Vector search |
| SQLite (optional) | DynamoDB | Metadata storage |
| Python Process | Lambda | Compute |
| CLI | API Gateway | HTTP interface |
## Migration Strategy

### Phase 1: Storage Migration
```python
from pathlib import Path

import boto3

# Abstract storage interface
class StorageAdapter:
    def put(self, key: str, data: bytes) -> str: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> bool: ...

# Local implementation
class LocalStorageAdapter(StorageAdapter):
    def __init__(self, base_path: str):
        self.base_path = Path(base_path)

# S3 implementation
class S3StorageAdapter(StorageAdapter):
    def __init__(self, bucket: str):
        self.client = boto3.client('s3')
        self.bucket = bucket
```
### Phase 2: OCR Migration

```python
import boto3
import cv2
import numpy as np

# Abstract OCR interface
class OCREngine:
    def recognize(self, image: np.ndarray) -> OCRResult: ...

# Local: PaddleOCR
class PaddleOCREngine(OCREngine): ...

# Cloud: Textract
class TextractEngine(OCREngine):
    def __init__(self):
        self.client = boto3.client('textract')

    def recognize(self, image: np.ndarray) -> OCRResult:
        # Textract expects encoded image bytes, not a raw array
        _, buffer = cv2.imencode('.png', image)
        response = self.client.detect_document_text(
            Document={'Bytes': buffer.tobytes()}
        )
        return self._convert_response(response)
```
### Phase 3: LLM Migration

```python
import json

import boto3

# Abstract LLM interface
class LLMAdapter:
    def generate(self, prompt: str) -> str: ...

# Local: Ollama
class OllamaAdapter(LLMAdapter): ...

# Cloud: Bedrock
class BedrockAdapter(LLMAdapter):
    # Model IDs carry a version suffix; check the Bedrock console for
    # the exact IDs enabled in your account.
    def __init__(self, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        # Claude 3 models use the Messages API request shape
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            })
        )
        # invoke_model returns a streaming body that must be read and parsed
        payload = json.loads(response['body'].read())
        return payload['content'][0]['text']
```
### Phase 4: Vector Store Migration

```python
from opensearchpy import OpenSearch

# Abstract vector store interface (already implemented)
class VectorStore:
    def add_chunks(self, chunks, embeddings): ...
    def search(self, query_embedding, top_k): ...

# Local: ChromaDB (already implemented)
class ChromaVectorStore(VectorStore): ...

# Cloud: OpenSearch
class OpenSearchVectorStore(VectorStore):
    def __init__(self, endpoint: str, index: str):
        self.client = OpenSearch(hosts=[endpoint])
        self.index = index

    def search(self, query_embedding, top_k):
        # k-NN clauses are nested under "query" in the request body
        response = self.client.search(
            index=self.index,
            body={
                "size": top_k,
                "query": {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k
                        }
                    }
                }
            }
        )
        return self._convert_results(response)
```
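The `add_chunks` side of the adapter would write through OpenSearch's `_bulk` API, which interleaves an action line with each document. A sketch of the payload builder (the `bulk_index_body` helper and the `text`/`embedding` field names are illustrative; the client would then send it with `client.bulk(body=...)`):

```python
import json

def bulk_index_body(index: str, chunks: list, embeddings: list) -> str:
    """Build an OpenSearch _bulk payload pairing each chunk with its vector."""
    lines = []
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
        # Action line, then the document itself, one JSON object per line
        lines.append(json.dumps({"index": {"_index": index, "_id": str(i)}}))
        lines.append(json.dumps({"text": chunk, "embedding": emb}))
    return "\n".join(lines) + "\n"
```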
## AWS Services Deep Dive

### Amazon S3

- Purpose: Document storage and processed results
- Structure:

```
s3://sparknet-documents/
├── raw/                      # Original documents
│   └── {doc_id}/
│       └── document.pdf
├── processed/                # Processed results
│   └── {doc_id}/
│       ├── metadata.json
│       ├── chunks.json
│       └── pages/
│           ├── page_0.png
│           └── page_1.png
└── cache/                    # Processing cache
```
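Object keys for a document can be derived mechanically from this layout. A small sketch (the `processed_keys` helper is hypothetical); uploads would then pass each key to boto3's `s3.put_object(Bucket=..., Key=..., Body=...)`:

```python
def processed_keys(doc_id: str, n_pages: int) -> dict:
    """Build S3 object keys matching the processed/ layout above."""
    prefix = f"processed/{doc_id}/"
    return {
        "metadata": prefix + "metadata.json",
        "chunks": prefix + "chunks.json",
        "pages": [f"{prefix}pages/page_{i}.png" for i in range(n_pages)],
    }
```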
### Amazon Textract
- Purpose: OCR extraction with layout analysis
- Features:
- Document text detection
- Table extraction
- Form extraction
- Handwriting recognition
### Amazon Bedrock
- Purpose: LLM inference
- Models:
- Claude 3.5 Sonnet (primary)
- Titan Text (cost-effective)
- Titan Embeddings (vectors)
### Amazon OpenSearch Serverless

- Purpose: Vector search and retrieval
- Configuration:

```json
{
  "index": "sparknet-vectors",
  "settings": {
    "index.knn": true,
    "index.knn.space_type": "cosinesimil"
  },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024
      }
    }
  }
}
```
### AWS Lambda

- Purpose: Serverless compute
- Functions:
  - `process-document`: Document processing pipeline
  - `extract-fields`: Field extraction
  - `rag-query`: RAG query handling
  - `index-document`: Vector indexing
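A minimal sketch of what the `process-document` handler could look like. The event shape (`{"bucket": ..., "key": ...}`) and the `raw/{doc_id}/document.pdf` key convention are assumptions carried over from the S3 layout, not a fixed contract:

```python
import json

def process(event, context):
    """Hypothetical process-document Lambda entry point."""
    bucket = event["bucket"]
    key = event["key"]
    # Keys follow raw/{doc_id}/document.pdf, so the doc id is the second segment
    doc_id = key.split("/")[1] if "/" in key else key
    # ... download from S3, run Textract, chunk and embed the text ...
    return {
        "statusCode": 200,
        "body": json.dumps({"doc_id": doc_id, "status": "processed"}),
    }
```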
### AWS Step Functions

- Purpose: Workflow orchestration
- Workflow:

```json
{
  "StartAt": "ProcessDocument",
  "States": {
    "ProcessDocument": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:process-document",
      "Next": "IndexChunks"
    },
    "IndexChunks": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:index-document",
      "End": true
    }
  }
}
```
## Cost Optimization

### Tiered Processing
| Tier | Use Case | Services | Cost |
|---|---|---|---|
| Basic | Simple OCR | Textract + Titan | $ |
| Standard | Full pipeline | + Claude Haiku | $$ |
| Premium | Complex analysis | + Claude Sonnet | $$$ |
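Tier selection can be a single lookup at request time. A sketch, assuming Bedrock model IDs (the exact version suffixes vary by account and region, so treat the values below as illustrative):

```python
# Map each tier to a Bedrock model ID (illustrative; verify the exact
# versions enabled in your account via the Bedrock console).
TIER_MODELS = {
    "basic": "amazon.titan-text-express-v1",
    "standard": "anthropic.claude-3-haiku-20240307-v1:0",
    "premium": "anthropic.claude-3-sonnet-20240229-v1:0",
}

def model_for_tier(tier: str) -> str:
    try:
        return TIER_MODELS[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier}") from None
```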
### Caching Strategy
- Document Cache: S3 with lifecycle policies
- Embedding Cache: ElastiCache (Redis)
- Query Cache: Lambda@Edge
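The embedding cache can be keyed by a content hash, so re-ingesting an unchanged chunk never re-calls Titan. A minimal sketch; a plain dict stands in for Redis here, and swapping `self.store` for a `redis.Redis` client with the same `get`/`set` pattern is the intended ElastiCache deployment:

```python
import hashlib

class EmbeddingCache:
    """Content-hash keyed embedding cache (in-memory stand-in for Redis)."""
    def __init__(self):
        self.store = {}

    @staticmethod
    def key(text: str) -> str:
        # Identical text always maps to the same cache key
        return "emb:" + hashlib.sha256(text.encode()).hexdigest()

    def get(self, text: str):
        return self.store.get(self.key(text))

    def put(self, text: str, embedding: list) -> None:
        self.store[self.key(text)] = embedding
```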
## Security

### IAM Policies
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::sparknet-documents/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:*::foundation-model/*"
    }
  ]
}
```
### Data Encryption
- S3: Server-side encryption (SSE-S3 or SSE-KMS)
- OpenSearch: Encryption at rest
- Lambda: Environment variable encryption
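For S3, encryption is requested per `put_object` call. A sketch of a kwargs builder (the `encrypted_put_kwargs` helper is hypothetical; the `ServerSideEncryption` and `SSEKMSKeyId` parameters are standard boto3 `put_object` arguments):

```python
from typing import Optional

def encrypted_put_kwargs(bucket: str, key: str, body: bytes,
                         kms_key_id: Optional[str] = None) -> dict:
    """Build boto3 put_object kwargs with server-side encryption.

    Without a KMS key ID this requests SSE-S3 (AES256); with one, SSE-KMS.
    """
    kwargs = {"Bucket": bucket, "Key": key, "Body": body,
              "ServerSideEncryption": "AES256"}
    if kms_key_id:
        kwargs["ServerSideEncryption"] = "aws:kms"
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs
```

The result is unpacked directly into the client call: `s3.put_object(**encrypted_put_kwargs(...))`.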
## Deployment

### Infrastructure as Code (Terraform)
```hcl
# S3 Bucket
resource "aws_s3_bucket" "documents" {
  bucket = "sparknet-documents"
}

# Lambda Function
resource "aws_lambda_function" "processor" {
  function_name = "sparknet-processor"
  runtime       = "python3.11"
  handler       = "handler.process"
  memory_size   = 1024
  timeout       = 300
  filename      = "package.zip"                 # deployment package
  role          = aws_iam_role.lambda_exec.arn  # execution role (defined elsewhere)
}

# OpenSearch Serverless
resource "aws_opensearchserverless_collection" "vectors" {
  name = "sparknet-vectors"
  type = "VECTORSEARCH"
}
```
### CI/CD Pipeline

```yaml
# GitHub Actions
name: Deploy SPARKNET

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Deploy Lambda
        run: |
          aws lambda update-function-code \
            --function-name sparknet-processor \
            --zip-file fileb://package.zip
```
## Monitoring

### CloudWatch Metrics
- Lambda invocations and duration
- S3 request counts
- OpenSearch query latency
- Bedrock token usage
### Dashboards
- Processing throughput
- Error rates
- Cost tracking
- Vector store statistics
## Next Steps
- Implement Storage Abstraction: Create S3 adapter
- Add Textract Engine: Implement AWS OCR
- Create Bedrock Adapter: LLM migration
- Deploy OpenSearch: Vector store setup
- Build Lambda Functions: Serverless compute
- Setup Step Functions: Workflow orchestration
- Configure CI/CD: Automated deployment