Spaces:

MHamdan
/

SPARKNET

Sleeping

File size: 12,556 Bytes

d520909

# SPARKNET Cloud Architecture

This document outlines the cloud-ready architecture for deploying SPARKNET on AWS.

## Overview

SPARKNET is designed with a modular architecture that supports both local development and cloud deployment. The system can scale from a single developer machine to enterprise-grade cloud infrastructure.

## Local Development Stack

```
┌─────────────────────────────────────────────────────┐
│                    Local Machine                     │
├─────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │   Ollama    │  │  ChromaDB   │  │  File I/O   │ │
│  │   (LLM)     │  │  (Vector)   │  │  (Storage)  │ │
│  └─────────────┘  └─────────────┘  └─────────────┘ │
│          │               │               │          │
│          └───────────────┼───────────────┘          │
│                          │                          │
│                 ┌────────┴────────┐                 │
│                 │    SPARKNET     │                 │
│                 │   Application   │                 │
│                 └─────────────────┘                 │
└─────────────────────────────────────────────────────┘
```

## AWS Cloud Architecture

### Target Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                           AWS Cloud                                 │
├────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐ │
│  │   API GW    │──────│   Lambda    │──────│    Step Functions   │ │
│  │  (REST)     │      │  (Compute)  │      │   (Orchestration)   │ │
│  └─────────────┘      └─────────────┘      └─────────────────────┘ │
│         │                    │                       │              │
│         │                    │                       │              │
│         ▼                    ▼                       ▼              │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐ │
│  │     S3      │      │   Bedrock   │      │   OpenSearch        │ │
│  │  (Storage)  │      │   (LLM)     │      │   (Vector Store)    │ │
│  └─────────────┘      └─────────────┘      └─────────────────────┘ │
│                                                                     │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────┐ │
│  │  Textract   │      │   Titan     │      │     DynamoDB        │ │
│  │   (OCR)     │      │ (Embeddings)│      │    (Metadata)       │ │
│  └─────────────┘      └─────────────┘      └─────────────────────┘ │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘
```

### Component Mapping

| Local Component | AWS Service | Purpose |
|----------------|-------------|---------|
| File I/O | S3 | Document storage |
| PaddleOCR/Tesseract | Textract | OCR extraction |
| Ollama LLM | Bedrock (Claude/Titan) | Text generation |
| Ollama Embeddings | Titan Embeddings | Vector embeddings |
| ChromaDB | OpenSearch Serverless | Vector search |
| SQLite (optional) | DynamoDB | Metadata storage |
| Python Process | Lambda | Compute |
| CLI | API Gateway | HTTP interface |

## Migration Strategy

### Phase 1: Storage Migration

```python
# Abstract storage interface
class StorageAdapter:
    def put(self, key: str, data: bytes) -> str: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> bool: ...

# Local implementation
class LocalStorageAdapter(StorageAdapter):
    def __init__(self, base_path: str):
        self.base_path = Path(base_path)

# S3 implementation
class S3StorageAdapter(StorageAdapter):
    def __init__(self, bucket: str):
        self.client = boto3.client('s3')
        self.bucket = bucket
```

### Phase 2: OCR Migration

```python
# Abstract OCR interface
class OCREngine:
    def recognize(self, image: np.ndarray) -> OCRResult: ...

# Local: PaddleOCR
class PaddleOCREngine(OCREngine): ...

# Cloud: Textract
class TextractEngine(OCREngine):
    def __init__(self):
        self.client = boto3.client('textract')

    def recognize(self, image: np.ndarray) -> OCRResult:
        response = self.client.detect_document_text(
            Document={'Bytes': image_bytes}
        )
        return self._convert_response(response)
```

### Phase 3: LLM Migration

```python
# Abstract LLM interface
class LLMAdapter:
    def generate(self, prompt: str) -> str: ...

# Local: Ollama
class OllamaAdapter(LLMAdapter): ...

# Cloud: Bedrock
class BedrockAdapter(LLMAdapter):
    def __init__(self, model_id: str = "anthropic.claude-3-sonnet"):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = model_id

    def generate(self, prompt: str) -> str:
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({"prompt": prompt})
        )
        return response['body']
```

### Phase 4: Vector Store Migration

```python
# Abstract vector store interface (already implemented)
class VectorStore:
    def add_chunks(self, chunks, embeddings): ...
    def search(self, query_embedding, top_k): ...

# Local: ChromaDB (already implemented)
class ChromaVectorStore(VectorStore): ...

# Cloud: OpenSearch
class OpenSearchVectorStore(VectorStore):
    def __init__(self, endpoint: str, index: str):
        self.client = OpenSearch(hosts=[endpoint])
        self.index = index

    def search(self, query_embedding, top_k):
        response = self.client.search(
            index=self.index,
            body={
                "knn": {
                    "embedding": {
                        "vector": query_embedding,
                        "k": top_k
                    }
                }
            }
        )
        return self._convert_results(response)
```

## AWS Services Deep Dive

### Amazon S3

- **Purpose**: Document storage and processed results
- **Structure**:
  ```
  s3://sparknet-documents/
  ├── raw/                    # Original documents
  │   └── {doc_id}/
  │       └── document.pdf
  ├── processed/              # Processed results
  │   └── {doc_id}/
  │       ├── metadata.json
  │       ├── chunks.json
  │       └── pages/
  │           ├── page_0.png
  │           └── page_1.png
  └── cache/                  # Processing cache
  ```

### Amazon Textract

- **Purpose**: OCR extraction with layout analysis
- **Features**:
  - Document text detection
  - Table extraction
  - Form extraction
  - Handwriting recognition

### Amazon Bedrock

- **Purpose**: LLM inference
- **Models**:
  - Claude 3.5 Sonnet (primary)
  - Titan Text (cost-effective)
  - Titan Embeddings (vectors)

### Amazon OpenSearch Serverless

- **Purpose**: Vector search and retrieval
- **Configuration**:
  ```json
  {
    "index": "sparknet-vectors",
    "settings": {
      "index.knn": true,
      "index.knn.space_type": "cosinesimil"
    },
    "mappings": {
      "properties": {
        "embedding": {
          "type": "knn_vector",
          "dimension": 1024
        }
      }
    }
  }
  ```

### AWS Lambda

- **Purpose**: Serverless compute
- **Functions**:
  - `process-document`: Document processing pipeline
  - `extract-fields`: Field extraction
  - `rag-query`: RAG query handling
  - `index-document`: Vector indexing

### AWS Step Functions

- **Purpose**: Workflow orchestration
- **Workflow**:
  ```json
  {
    "StartAt": "ProcessDocument",
    "States": {
      "ProcessDocument": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:process-document",
        "Next": "IndexChunks"
      },
      "IndexChunks": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:index-document",
        "End": true
      }
    }
  }
  ```

## Cost Optimization

### Tiered Processing

| Tier | Use Case | Services | Cost |
|------|----------|----------|------|
| Basic | Simple OCR | Textract + Titan | $ |
| Standard | Full pipeline | + Claude Haiku | $$ |
| Premium | Complex analysis | + Claude Sonnet | $$$ |

### Caching Strategy

1. **Document Cache**: S3 with lifecycle policies
2. **Embedding Cache**: ElastiCache (Redis)
3. **Query Cache**: Lambda@Edge

## Security

### IAM Policies

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::sparknet-documents/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:*::foundation-model/*"
    }
  ]
}
```

### Data Encryption

- S3: Server-side encryption (SSE-S3 or SSE-KMS)
- OpenSearch: Encryption at rest
- Lambda: Environment variable encryption

## Deployment

### Infrastructure as Code (Terraform)

```hcl
# S3 Bucket
resource "aws_s3_bucket" "documents" {
  bucket = "sparknet-documents"
}

# Lambda Function
resource "aws_lambda_function" "processor" {
  function_name = "sparknet-processor"
  runtime       = "python3.11"
  handler       = "handler.process"
  memory_size   = 1024
  timeout       = 300
}

# OpenSearch Serverless
resource "aws_opensearchserverless_collection" "vectors" {
  name = "sparknet-vectors"
  type = "VECTORSEARCH"
}
```

### CI/CD Pipeline

```yaml
# GitHub Actions
name: Deploy SPARKNET

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Deploy Lambda
        run: |
          aws lambda update-function-code \
            --function-name sparknet-processor \
            --zip-file fileb://package.zip
```

## Monitoring

### CloudWatch Metrics

- Lambda invocations and duration
- S3 request counts
- OpenSearch query latency
- Bedrock token usage

### Dashboards

- Processing throughput
- Error rates
- Cost tracking
- Vector store statistics

## Next Steps

1. **Implement Storage Abstraction**: Create S3 adapter
2. **Add Textract Engine**: Implement AWS OCR
3. **Create Bedrock Adapter**: LLM migration
4. **Deploy OpenSearch**: Vector store setup
5. **Build Lambda Functions**: Serverless compute
6. **Setup Step Functions**: Workflow orchestration
7. **Configure CI/CD**: Automated deployment