---
title: OptiTransfer Data
emoji: ◼️
sdk: static
pinned: false
colorFrom: gray
colorTo: blue
---

OptiTransfer Data

Premium web corpora for LLM pre-training, fine-tuning, RAG, and multilingual NLP.


About

OptiTransfer Data is the data division of OptiTransfer AG, a Swiss-registered technology company. We produce compliance-ready, quality-scored web datasets for AI teams building in regulated markets.

Every dataset ships with:

  • Full data provenance and SHA256 verification
  • PII detection and redaction
  • Multi-dimensional quality scoring (0-100 per document)
  • EU AI Act and Swiss FADP compliance documentation
  • Croissant metadata for ML interoperability
  • Multiple export formats (Parquet, JSONL, language splits, RAG chunks)
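As an illustration of the SHA256 verification mentioned above, the following minimal sketch checks a downloaded file against an expected digest. This README does not describe how the checksums are published, so the sketch assumes you already have the expected hex digest for a given shard; the function names are illustrative and not part of any OptiTransfer tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large shards do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Compare the computed digest with an expected one (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `verify_file("shard.parquet", expected_digest)` returns `True` only if the file is byte-for-byte identical to the one the digest was computed from; the file name here is a placeholder.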

Available Datasets

*.ch Swiss Web Premium (A+)

110,491 documents | 78 fields | 554 MB total | Mean quality score: 62.3 (0-100 scale)

The flagship Swiss web corpus, extracted and quality-scored from the .ch ccTLD. Multilingual coverage across German (61.2%), French (19.0%), English (10.5%), Italian (4.7%), and additional languages. Nine-component quality model with full provenance chain.
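For planning purposes, the language shares can be turned into rough document counts. The short calculation below uses only the figures stated above (110,491 documents and the four named shares); the resulting counts are estimates derived from rounded percentages, not published per-language totals.

```python
TOTAL_DOCUMENTS = 110_491

# Shares in percent, as stated in the dataset description.
named_shares = {"German": 61.2, "French": 19.0, "English": 10.5, "Italian": 4.7}

# Whatever is not covered by the four named languages.
other_share = round(100.0 - sum(named_shares.values()), 1)

shares = dict(named_shares, Other=other_share)
estimated_counts = {
    language: round(TOTAL_DOCUMENTS * percent / 100.0)
    for language, percent in shares.items()
}

for language, count in estimated_counts.items():
    print(f"{language}: ~{count:,} documents ({shares[language]}%)")
```

The four named languages account for 95.4% of the corpus, which leaves about 4.6% for the additional languages.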

Best suited for: LLM Pre-Training, Supervised Fine-Tuning (SFT), Retrieval-Augmented Generation (RAG), Multilingual NLP, German Language Models, French Language Models, Swiss Market AI, Regulatory Compliance (EU AI Act), Domain-Specific Training, Web Corpus Research, Text Classification, Summarisation, Question Answering, Translation

Formats: Parquet (7 shards) | JSONL (7 shards) | Language Splits (DE, FR, EN, IT) | RAG Chunks (4 files)

| Repository | Description | Access |
|---|---|---|
| swiss-web-premium-ch | 10,000-record stratified sample with full documentation and QA report | Gated (evaluation) |
| swiss-web-premium-ch-full | Complete 110,491-record production dataset | Gated (licensed) |
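Once access to a JSONL shard has been granted, a typical first step is to select documents by quality score and language. The sketch below shows one way to do that with the standard library. The 78-field schema is not listed in this README, so the field names `quality_score` and `language`, the language codes, and the threshold are all assumptions to be replaced with the names from the dataset documentation.

```python
import json


def filter_records(lines, min_score=70.0, languages=("de", "fr")):
    """Yield parsed JSONL records that pass a score and language filter."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Hypothetical field names; check the actual schema before use.
        score = record.get("quality_score")
        if score is None or score < min_score:
            continue
        if record.get("language") in languages:
            yield record
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, for example `with open("shard.jsonl", encoding="utf-8") as f: kept = list(filter_records(f))`, where the file name is a placeholder.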

Data Pipeline

All datasets are processed through the OptiTransfer pipeline:

  1. Source Selection -- Common Crawl filtered by ccTLD and domain trust scoring
  2. Extraction -- Text extraction with deduplication, language detection, and structural analysis
  3. Quality Scoring -- Nine-component quality model producing a composite 0-100 score per document
  4. Enrichment -- Content categorisation, trust tier assignment, academic/news detection, skill tagging
  5. Compliance -- PII scanning, redaction, and regulatory documentation
  6. Verification -- SHA256 checksums, QA reporting, and independent audit readiness
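To make two of these steps more concrete, here is a minimal sketch of ccTLD filtering (step 1) and of combining component scores into a single 0-100 value (step 3). It is not the OptiTransfer implementation: the README does not name the nine components or say how they are weighted, so the weighted mean, the component names, and the assumption that each component lies in [0, 1] are purely illustrative.

```python
from urllib.parse import urlsplit


def is_cctld_url(url, tld="ch"):
    """Return True if the URL's hostname ends in the given country-code TLD."""
    host = urlsplit(url).hostname  # None when the URL has no scheme or host
    if host is None:
        return False
    return host.rstrip(".").endswith("." + tld.lower())


def composite_score(components, weights=None):
    """Combine component scores in [0, 1] into a 0-100 score (weighted mean).

    Assumption: a weighted mean is used here only as an example of a
    composite; the real quality model may combine components differently.
    """
    if not components:
        raise ValueError("at least one component score is required")
    if weights is None:
        weights = {name: 1.0 for name in components}  # equal weights
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        min(max(value, 0.0), 1.0) * weights[name]  # clamp to [0, 1]
        for name, value in components.items()
    )
    return round(100.0 * weighted_sum / total_weight, 1)
```

Note that `is_cctld_url` checks the parsed hostname rather than the raw string, so a URL such as `https://example.ch.evil.com/` is rejected, and URLs without a scheme are rejected as well.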

Quality Assurance

Each dataset is accompanied by a full QA report covering:

  • Pipeline configuration and processing parameters
  • Quality score distributions and statistical analysis
  • Language detection accuracy and coverage
  • Content categorisation breakdown
  • PII detection results
  • Domain trust tier analysis
  • SHA256 integrity verification

QA reports are available in both the sample and full product repositories.
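If you want to cross-check the score distribution in a QA report against the data you received, a summary along the following lines can be computed locally. This is a sketch under the assumption that you have already extracted the per-document scores into a list; the 10-point bucket width is an arbitrary choice and not necessarily the binning used in the reports.

```python
from collections import Counter
from statistics import mean, median


def summarize_scores(scores, bucket_width=10):
    """Return count, mean, median, and a histogram for 0-100 scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    histogram = Counter()
    for score in scores:
        # Lower edge of the bucket; a score of exactly 100 goes in the top bucket.
        start = min(int(score // bucket_width) * bucket_width, 100 - bucket_width)
        histogram[start] += 1
    return {
        "count": len(scores),
        "mean": round(mean(scores), 1),
        "median": median(scores),
        "histogram": dict(sorted(histogram.items())),
    }
```

Each histogram key is the lower edge of a bucket, so with the default width the key `60` counts scores from 60 up to, but not including, 70.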


Licensing

All datasets are available under the OptiTransfer Commercial License. Sample repositories provide gated evaluation access. Full datasets require a commercial license agreement.

Payment methods: Bank Transfer (SEPA/SWIFT) | TWINT | Cryptocurrency (BTC / ETH / SOL)

For pricing, volume licensing, or custom extraction requests, contact data@optitransfer.ch.


Contact