---
title: OptiTransfer Data
emoji: ◼️
sdk: static
pinned: false
colorFrom: gray
colorTo: blue
---

# OptiTransfer Data

Premium web corpora for LLM pre-training, fine-tuning, RAG, and multilingual NLP.

---

## About

OptiTransfer Data is the data division of [OptiTransfer AG](https://optitransfer.ch), a Swiss-registered technology company. We produce compliance-ready, quality-scored web datasets for AI teams building in regulated markets.

Every dataset ships with:

- Full data provenance and SHA256 verification
- PII detection and redaction
- Multi-dimensional quality scoring (0-100 per document)
- EU AI Act and Swiss FADP compliance documentation
- Croissant metadata for ML interoperability
- Multiple export formats (Parquet, JSONL, language splits, RAG chunks)

---

## Available Datasets

### *.ch Swiss Web Premium (A+)

110,491 documents | 78 fields | 554 MB total | Quality score: 62.3 mean

The flagship Swiss web corpus, extracted and quality-scored from the .ch ccTLD. Multilingual coverage across German (61.2%), French (19.0%), English (10.5%), Italian (4.7%), and additional languages. Nine-component quality model with full provenance chain.

**Best suited for:** LLM Pre-Training, Supervised Fine-Tuning (SFT), Retrieval-Augmented Generation (RAG), Multilingual NLP, German Language Models, French Language Models, Swiss Market AI, Regulatory Compliance (EU AI Act), Domain-Specific Training, Web Corpus Research, Text Classification, Summarisation, Question Answering, Translation

**Formats:** Parquet (7 shards) | JSONL (7 shards) | Language Splits (DE, FR, EN, IT) | RAG Chunks (4 files)

| Repository | Description | Access |
|---|---|---|
| [swiss-web-premium-ch](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch) | 10,000-record stratified sample with full documentation and QA report | Gated (evaluation) |
| [swiss-web-premium-ch-full](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch-full) | Complete 110,491-record production dataset | Gated (licensed) |

---

## Data Pipeline

All datasets are processed through the OptiTransfer pipeline:

1. **Source Selection** -- Common Crawl filtered by ccTLD and domain trust scoring
2. **Extraction** -- Text extraction with deduplication, language detection, and structural analysis
3. **Quality Scoring** -- Nine-component quality model producing a composite 0-100 score per document
4. **Enrichment** -- Content categorisation, trust tier assignment, academic/news detection, skill tagging
5. **Compliance** -- PII scanning, redaction, and regulatory documentation
6. **Verification** -- SHA256 checksums, QA reporting, and independent audit readiness

---

## Quality Assurance

Each dataset is accompanied by a full QA report covering:

- Pipeline configuration and processing parameters
- Quality score distributions and statistical analysis
- Language detection accuracy and coverage
- Content categorisation breakdown
- PII detection results
- Domain trust tier analysis
- SHA256 integrity verification

QA reports are available in both the sample and full product repositories.
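
The SHA256 integrity check mentioned above can be reproduced locally after download. The following is a minimal sketch, not the official verification tool: it assumes the checksums are published as a `sha256sum`-style manifest, and the file name `SHA256SUMS` and the directory layout are hypothetical placeholders. Check the repository's documentation for the actual checksum location and format.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text: str) -> dict[str, str]:
    """Parse sha256sum-style lines of the form '<hex digest>  <file name>'."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*'
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_directory(data_dir: Path, manifest_name: str = "SHA256SUMS") -> dict[str, bool]:
    """Map each file listed in the manifest to True if its digest matches.

    `manifest_name` is an assumed placeholder, not a documented file name.
    Files that are listed but missing on disk are reported as False.
    """
    manifest_text = (data_dir / manifest_name).read_text(encoding="utf-8")
    results = {}
    for name, digest in parse_manifest(manifest_text).items():
        target = data_dir / name
        results[name] = target.is_file() and sha256_of_file(target) == digest
    return results
```

Calling `verify_directory(Path("path/to/download"))` returns a dictionary keyed by file name; any `False` value indicates a missing or altered shard that should be downloaded again before use.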

---

## Licensing

All datasets are available under the OptiTransfer Commercial License. Sample repositories provide gated evaluation access. Full datasets require a commercial license agreement.

**Payment methods:** Bank Transfer (SEPA/SWIFT) | TWINT | Cryptocurrency (BTC / ETH / SOL)

For pricing, volume licensing, or custom extraction requests, contact [data@optitransfer.ch](mailto:data@optitransfer.ch).

---

## Contact

- **Email:** [data@optitransfer.ch](mailto:data@optitransfer.ch)
- **Web:** [optitransfer.ch](https://optitransfer.ch)
- **Location:** Switzerland