ChineseFileTranslator
Translate Chinese text (Simplified, Traditional, Cantonese, Classical) inside .txt and .md files
to English. Preserves full Markdown syntax. Supports Google Translate, Microsoft Translator, and a
fully offline Helsinki-NLP MarianMT backend with vectorized batching.
Key Features
- 'Never Miss' Global Surgical Translation: Scans the whole document for Chinese text, at any nesting depth, while leaving the surrounding structure untouched.
- Inclusive CJK Detection: Covers Unicode ranges in both the Basic Multilingual Plane and the supplementary planes (Basic, Ext A-E, Symbols, Punctuation).
- Proactive Markdown Protection: Frontmatter, code blocks, links, and HTML are safely tokenized.
- Robust Placeholder Restoration: Space-lenient, case-insensitive restoration handles engine mangling.
- Backend Resilience: Explicit failure detection with automatic retries and non-crashing fallbacks.
- Offline First Option: Fully local Helsinki-NLP MarianMT backend with vectorized batching.
- Bilingual Mode: Optional side-by-side Chinese and English output.
- Batch Processing: Translate entire directories with recursive discovery and persistent configuration.
Project Structure
ChineseFileTranslator/
├── chinese_file_translator.py   # Main script (single-file, no extra modules)
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── .gitattributes               # Git line-ending and LFS rules
├── .gitignore                   # Ignored paths
└── LICENSE                      # MIT License
Quickstart
1. Clone the repository
git clone https://github.com/algorembrant/ChineseFileTranslator.git
cd ChineseFileTranslator
2. Create and activate a virtual environment (recommended)
python -m venv venv
# Windows
venv\Scripts\activate
# Linux / macOS
source venv/bin/activate
3. Install core dependencies
pip install -r requirements.txt
4. (Optional) Install offline translation backend
Choose the correct PyTorch build for your system:
# CPU only
pip install torch --index-url https://download.pytorch.org/whl/cpu
# CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Then install Transformers stack
pip install transformers sentencepiece sacremoses
The Helsinki-NLP/opus-mt-zh-en model (~300 MB) downloads automatically on first use.
Usage
Command Reference
| Command | Description |
|---|---|
| python chinese_file_translator.py input.txt | Translate a plain-text file (Google backend) |
| python chinese_file_translator.py input.md | Translate a Markdown file, preserve structure |
| python chinese_file_translator.py input.txt -o out.txt | Set explicit output path |
| python chinese_file_translator.py input.txt --backend offline | Use offline MarianMT model |
| python chinese_file_translator.py input.txt --backend microsoft | Use Microsoft Translator |
| python chinese_file_translator.py input.txt --offline --gpu | Offline + GPU (CUDA) |
| python chinese_file_translator.py input.txt --lang simplified | Force Simplified Chinese |
| python chinese_file_translator.py input.txt --lang traditional | Force Traditional Chinese |
| python chinese_file_translator.py input.txt --bilingual | Keep Chinese + show English |
| python chinese_file_translator.py input.txt --extract-only | Extract Chinese lines only |
| python chinese_file_translator.py input.txt --stdout | Print output to terminal |
| python chinese_file_translator.py --batch ./docs/ | Batch translate a directory |
| python chinese_file_translator.py --batch ./in/ --batch-out ./out/ | Batch with output dir |
| python chinese_file_translator.py input.txt --chunk-size 2000 | Custom chunk size |
| python chinese_file_translator.py input.txt --export-history h.json | Export history |
| python chinese_file_translator.py input.txt --verbose | Debug logging |
| python chinese_file_translator.py --version | Print version |
| python chinese_file_translator.py --help | Full help |
Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| input | positional | - | Path to .txt or .md file |
| -o / --output | string | <name>_translated.<ext> | Output file path |
| --batch DIR | string | - | Directory to batch translate |
| --batch-out DIR | string | same as --batch | Output directory for batch |
| --backend | choice | google | google, microsoft, offline |
| --offline | flag | false | Shorthand for --backend offline |
| --lang | choice | auto | auto, simplified, traditional |
| --gpu | flag | false | Use CUDA for offline model |
| --confidence | float | 0.05 | Min Chinese character ratio for detection |
| --chunk-size | int | 4000 | Max chars per translation request |
| --bilingual | flag | false | Output both Chinese and English |
| --extract-only | flag | false | Save only the detected Chinese lines |
| --stdout | flag | false | Print result to stdout |
| --export-history | string | - | Save session history to JSON |
| --verbose | flag | false | Enable DEBUG logging |
| --version | flag | - | Show version and exit |
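The --confidence value is described as a minimum Chinese character ratio. The sketch below shows one way such a ratio could be computed. The function names, the choice to ignore whitespace, and the exact Unicode ranges are assumptions made for illustration and are not taken from the script.

```python
import re

# Assumed ranges: CJK Unified Ideographs, Extension A, and Extensions B-E.
_HAN = re.compile("[\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0002ceaf]")


def chinese_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are Han ideographs."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    han = sum(1 for ch in visible if _HAN.match(ch))
    return han / len(visible)


def looks_chinese(text: str, confidence: float = 0.05) -> bool:
    """Hypothetical check: treat text as Chinese once the ratio reaches the threshold."""
    return chinese_ratio(text) >= confidence
```

With the default of 0.05, a mostly English paragraph containing a few Chinese words would still be picked up under this sketch.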
Configuration
The tool writes a JSON config file on first run:
~/.chinese_file_translator/config.json
Example config.json:
{
  "backend": "google",
  "lang": "auto",
  "use_gpu": false,
  "chunk_size": 4000,
  "batch_size": 10,
  "bilingual": false,
  "microsoft_api_key": "YOUR_KEY_HERE",
  "microsoft_region": "eastus",
  "offline_model_dir": "~/.chinese_file_translator/models",
  "output_suffix": "_translated",
  "retry_attempts": 3,
  "retry_delay_seconds": 1.5,
  "max_history": 1000
}
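A persistent config like this is commonly read by overlaying the file's values on built-in defaults. The following is a minimal sketch of that pattern, not the script's actual loader: `load_config` and `DEFAULTS` are hypothetical names, and only a subset of the keys above is shown.

```python
import json
from pathlib import Path

# Hypothetical defaults, copied from a subset of the example config above.
DEFAULTS = {
    "backend": "google",
    "lang": "auto",
    "chunk_size": 4000,
    "retry_attempts": 3,
    "retry_delay_seconds": 1.5,
}


def load_config(path: Path) -> dict:
    """Return the defaults, overridden by any keys found in the JSON file."""
    config = dict(DEFAULTS)
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            config.update(json.load(fh))
    return config
```

It could be called as `load_config(Path("~/.chinese_file_translator/config.json").expanduser())`. Keys missing from the file keep their default values.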
Supported Chinese Variants
| Variant | Notes |
|---|---|
| Simplified Chinese | Mandarin, mainland China standard |
| Traditional Chinese | Taiwan, Hong Kong, Macau standard |
| Cantonese / Yue | Detected via CJK Unicode ranges |
| Classical Chinese | Treated as Traditional for translation |
| Mixed Chinese-English | Code-switching text handled transparently |
Translation Backends
| Backend | Requires | Speed | Quality | Internet |
|---|---|---|---|---|
| Google Translate | deep-translator | Fast | High | Yes |
| Microsoft Translator | Azure API key + deep-translator | Fast | High | Yes |
| Helsinki-NLP MarianMT | transformers, torch | Medium | Good | No (after download) |
Google Translate is the default. If it fails, the tool falls back to the offline model automatically.
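The retry and fallback behaviour can be pictured as a loop over backends in priority order. This is a sketch under stated assumptions: `translate_with_fallback` is a hypothetical name, each backend is modelled as a plain callable, and an empty result is treated as a failure. The defaults mirror retry_attempts and retry_delay_seconds from the example config.

```python
import time
from typing import Callable, Sequence


def translate_with_fallback(
    text: str,
    backends: Sequence[Callable[[str], str]],
    retry_attempts: int = 3,
    retry_delay_seconds: float = 1.5,
) -> str:
    """Try each backend in order, retrying on errors or empty results."""
    for backend in backends:
        for attempt in range(retry_attempts):
            try:
                result = backend(text)
                if result and result.strip():
                    return result
            except Exception:
                pass  # Treat any backend error as a failed attempt.
            if attempt < retry_attempts - 1:
                time.sleep(retry_delay_seconds)
    # Non-crashing fallback: return the source text unchanged.
    return text
```

Returning the untranslated text when every backend fails keeps a batch run going, at the cost of leaving that segment in Chinese.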
Technical Strategy: 'Never Miss' Logic
The tool takes a "Global Surgical" approach so that Chinese fragments are found wherever they appear, including inside JSON, HTML, or nested Markdown.
1. Surgical Block Extraction
Instead of line-by-line translation, the script identifies every continuous block of CJK characters (including ideographic symbols and punctuation) across the entire document. This ensures that contextually related characters are translated together for better accuracy.
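Block extraction of this kind can be approximated with a single regular expression over the whole document. The character ranges and the names `CJK_BLOCK` and `extract_blocks` below are assumptions for illustration. The script's real ranges may differ.

```python
import re

# Assumed ranges; each run of these characters counts as one block.
CJK_BLOCK = re.compile(
    "["
    "\u3000-\u303f"          # CJK symbols and punctuation
    "\u3400-\u4dbf"          # Extension A
    "\u4e00-\u9fff"          # Basic CJK Unified Ideographs
    "\uff00-\uffef"          # Halfwidth and fullwidth forms (includes fullwidth Latin)
    "\U00020000-\U0002ceaf"  # Extensions B-E
    "]+"
)


def extract_blocks(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, block) for every continuous CJK run in the text."""
    return [(m.start(), m.end(), m.group()) for m in CJK_BLOCK.finditer(text)]
```

Because the fullwidth comma falls inside the matched ranges, a phrase such as 你好，世界 inside a JSON string comes back as one block, not two.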
2. Structural Protection
Markdown and metadata structures are tokenized using unique, collision-resistant placeholders (___MY_PROTECT_PH_{idx}___).
- YAML/TOML: Frontmatter is protected globally.
- Code Fences: Backticks and language identifiers are protected; Chinese content inside comments or strings remains translatable.
- Links & HTML: URLs and tag names are guarded, while display text is surgically translated.
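The tokenization step can be sketched as a substitution that stashes each protected span and leaves a numbered placeholder behind. Only inline code spans and link URLs are covered here. Frontmatter, code fences, and HTML are omitted, and the patterns and the `protect` function are hypothetical, not the script's own.

```python
import re

PLACEHOLDER = "___MY_PROTECT_PH_{idx}___"

# Assumed, simplified patterns for two of the protected element types.
PATTERNS = [
    re.compile(r"`[^`\n]+`"),               # whole inline code span
    re.compile(r"(?<=\]\()[^)\s]+(?=\))"),  # URL part of [text](url)
]


def protect(text: str) -> tuple[str, list[str]]:
    """Replace protected spans with placeholders and return the saved originals."""
    saved: list[str] = []

    def stash(match: re.Match) -> str:
        saved.append(match.group())
        return PLACEHOLDER.format(idx=len(saved) - 1)

    for pattern in PATTERNS:
        text = pattern.sub(stash, text)
    return text, saved
```

Note that the link text stays outside the placeholder, so it remains available for translation while the URL is shielded.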
3. Verification & Restoration
- Longest-First Replacement: Translated segments are restored starting from the longest strings to prevent partial match overwrites.
- Fuzzy Restoration: The restoration logic is space-lenient and case-insensitive to handle cases where online translation engines mangle the placeholder tokens.
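Both restoration ideas can be illustrated in a few lines. This is a sketch, assuming translations are held in a dict keyed by the source segment and that the placeholders follow the ___MY_PROTECT_PH_{idx}___ shape. The function names and the tolerance rules in the regular expression are illustrative guesses.

```python
import re

# Tolerates lowercase letters, stray spaces, and dropped underscores.
FUZZY_PLACEHOLDER = re.compile(
    r"_{1,3}\s*MY\s*_?\s*PROTECT\s*_?\s*PH\s*_?\s*(\d+)\s*_{1,3}",
    re.IGNORECASE,
)


def apply_translations(text: str, translations: dict[str, str]) -> str:
    """Replace source segments longest first so short keys cannot split long ones."""
    for source in sorted(translations, key=len, reverse=True):
        text = text.replace(source, translations[source])
    return text


def restore_placeholders(text: str, saved: list[str]) -> str:
    """Put protected spans back, even if an engine altered the placeholder."""

    def put_back(match: re.Match) -> str:
        idx = int(match.group(1))
        return saved[idx] if idx < len(saved) else match.group()

    return FUZZY_PLACEHOLDER.sub(put_back, text)
```

Capping the underscore runs at three keeps two adjacent placeholders from being read as one, which an unbounded match would risk.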
Markdown Preservation
The following elements are meticulously protected:
| Element | Example | Protection Method |
|---|---|---|
| Front Matter | ---\ntitle: ...\n--- | Full Tokenization |
| Fenced Code | ```python ... ``` | Boundary Tokenization |
| Inline Code | `code` | Full Tokenization |
| Links / Images | [text](url) | URL Tokenization |
| HTML Tags | <div class="..."> | Tag Tokenization |
| Symbols | &copy;, &#x...; | Entity Tokenization |
Microsoft Translator Setup
- Go to Azure Cognitive Services
- Create a Translator resource (Free tier: 2M chars/month)
- Copy your API key and region
- Add them to ~/.chinese_file_translator/config.json:
{
  "microsoft_api_key": "abc123...",
  "microsoft_region": "eastus"
}
Then run:
python chinese_file_translator.py input.txt --backend microsoft
Files Generated
| Path | Description |
|---|---|
| ~/.chinese_file_translator/config.json | Persistent settings |
| ~/.chinese_file_translator/history.json | Session history log |
| ~/.chinese_file_translator/app.log | Application log file |
| ~/.chinese_file_translator/models/ | Offline model cache (if used) |
Author
algorembrant
License
MIT License. See LICENSE for details.