---
license: mit
---

# Entropy Coding with Equiprobable Partitioning

Implementation of the enumerative entropy coding algorithm using equiprobable partitioning from Han et al. (2008), benchmarked against Huffman coding and theoretical limits.

## Overview

This project implements two compression algorithms:

1. **Equiprobable Partitioning (EP)** - The main algorithm from the paper
2. **Huffman Coding** - Classical entropy coding for comparison

## Algorithm Description

### Enumerative Entropy Coding

The algorithm from Han et al. (2008) is actually an **enumerative entropy coding** method that works in three steps:

1. **Encode alphabet size M** using exp-Golomb codes
2. **Encode symbol counts** N(s_0), N(s_1), ..., N(s_{M-2}) using exp-Golomb codes (the last count, N(s_{M-1}), is implied by the sequence length)
3. **Encode sequence position** among all permutations with the same symbol counts using combinatorial enumeration
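
To illustrate the integer code used in steps 1 and 2, here is a minimal order-0 exp-Golomb encoder (a standalone sketch, not the repository's `ExpGolombCoder`):

```python
def exp_golomb_encode(x: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer.

    Writes x + 1 in binary, preceded by one leading zero per bit
    after the first, so a decoder can recover the code length.
    """
    b = bin(x + 1)[2:]
    return "0" * (len(b) - 1) + b

# exp_golomb_encode(0) -> "1"
# exp_golomb_encode(1) -> "010"
# exp_golomb_encode(3) -> "00100"
```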

#### How It Works

- **Step 1**: Use exp-Golomb to encode how many distinct symbols appear
- **Step 2**: Use exp-Golomb to encode how many times each symbol appears
- **Step 3**: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram

This is fundamentally different from simple partitioning - it's a form of **combinatorial compression** that leverages the mathematical structure of permutations.
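
The lexicographic indexing of step 3 can be sketched for a single symbol: given the sorted positions of its k occurrences within a length-n window, compute the rank of that k-subset among all C(n, k) possibilities. This is a hypothetical standalone helper, not the repository's `_rank`:

```python
from math import comb

def rank_positions(n: int, positions: list[int]) -> int:
    """Lexicographic rank of a sorted k-subset of {0, ..., n-1}."""
    k = len(positions)
    rank = 0
    prev = -1
    for i, p in enumerate(positions):
        # Count every subset whose i-th element is smaller than p
        for q in range(prev + 1, p):
            rank += comb(n - q - 1, k - i - 1)
        prev = p
    return rank

# For n = 4, k = 2 the subsets in lexicographic order are
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
rank_positions(4, [1, 3])  # 4
```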

### Performance Optimizations

Key optimizations enable practical performance for datasets up to ~10,000 symbols:

1. **Cached Binomial Coefficients**: Uses `math.comb()` with caching to avoid recomputation
2. **Binary Search**: O(log n) position reconstruction instead of linear search
3. **Complement Encoding**: For frequent symbols (>50% of the sequence), encode the positions of the other symbols instead
4. **Arbitrary Precision**: Python integers avoid overflow for large combinatorial values

These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.

### Why Enumerative Coding?

The algorithm aims to achieve compression by:
- Separating structure (symbol counts) from content (permutation)
- Using optimal exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Achieving theoretical compression bounds for certain distributions

## Installation

```bash
# Clone or navigate to the repository
cd entropy-coding-equiprobable

# Install dependencies
just setup
# or manually: uv sync
```

## Usage

### Quick Start

```bash
# List all available commands
just

# Run quick tests
just test

# Run paper examples
just test-paper

# Run full compression benchmark
just run

# Run benchmark and generate plots (recommended)
just analyze

# Generate plots and analysis
just plot
```

### Visualization

The plotting functionality generates comprehensive analysis:

1. **Compression Comparison**: Side-by-side comparison of Huffman vs Enumerative methods
2. **Compression Time Analysis**: Performance timing comparison between algorithms
3. **Distribution Analysis**: Performance across uniform, Zipf, and geometric data
4. **Efficiency Analysis**: How close each method gets to theoretical limits
5. **Enumerative Timeout Analysis**: Computational complexity limitations and scaling behavior

Plots are saved to the `plots/` directory as high-resolution PNG files.

## Results

### Compression Performance Comparison
![Compression Comparison](plots/compression_comparison.png)

Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.

### Compression Time Analysis
![Compression Time Analysis](plots/compression_time_analysis.png)

Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.

### Distribution Analysis
![Distribution Analysis](plots/distribution_analysis.png)

Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.

### Computational Complexity Analysis
![Enumerative Timeout Analysis](plots/enumerative_timeout_analysis.png)

Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.

### Command Reference

- `just` - List available commands
- `just setup` - Install dependencies
- `just test` - Quick test with small datasets + paper examples
- `just test-paper` - Test examples from the paper
- `just run` - Full compression benchmark
- `just analyze` - Run full benchmark and generate plots
- `just plot` - Generate comparison plots
- `just clean` - Remove generated files
- `just check` - Run code quality checks
- `just format` - Format code

## Test Datasets

The benchmark includes:

### I.I.D. Datasets
- **Small** (1K symbols): Quick testing
- **Medium** (10K symbols): Moderate datasets
- **Large** (100K symbols): Performance at scale

### Distributions
- **Uniform**: All symbols equally likely
- **Zipf**: Power-law distribution (realistic for text)
- **Geometric**: Exponentially decreasing probabilities
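
The synthetic distributions can be sampled with the standard library alone; a hypothetical sketch (the benchmark's actual generator may differ):

```python
import random

def zipf_weights(vocab_size: int, s: float = 1.0) -> list[float]:
    # Unnormalized power-law weights: P(rank r) proportional to 1 / r**s
    return [1.0 / (r ** s) for r in range(1, vocab_size + 1)]

random.seed(0)
uniform_data = [random.randrange(64) for _ in range(1000)]
zipf_data = random.choices(range(64), weights=zipf_weights(64), k=1000)
```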

### Vocabulary Sizes
- **10 symbols**: Small alphabet
- **64 symbols**: Medium alphabet
- **256 symbols**: Full byte range

### Real Data
- **English text**: Downloaded from WikiText-2 via Hugging Face

## Results Analysis

### Performance Patterns

1. **Uniform Distributions**: EP performs poorly because there's no probability imbalance to exploit
2. **Skewed Distributions**: EP performs better but still trails Huffman
3. **Large Vocabularies**: EP overhead becomes significant with many symbols

### Computational Complexity

The optimized enumerative entropy coding implementation achieves **polynomial time complexity** through careful algorithmic design:

#### Time Complexity Analysis
- **Encoding**: O(M × n) where M = alphabet size, n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol with memoization
- **Decoding**: O(M × k × log n) where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized
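
Decoding inverts the combinatorial indexing of the encoder. A sketch of position reconstruction, shown with a linear scan of binomial counts for clarity (the optimized implementation replaces the inner loop with binary search), using a hypothetical standalone helper:

```python
from math import comb

def unrank_positions(n: int, k: int, rank: int) -> list[int]:
    """Recover the sorted k-subset of {0, ..., n-1} at the given lexicographic rank."""
    positions = []
    q = 0
    for i in range(k):
        # Advance past candidates that account for 'rank' many smaller subsets
        while comb(n - q - 1, k - i - 1) <= rank:
            rank -= comb(n - q - 1, k - i - 1)
            q += 1
        positions.append(q)
        q += 1
    return positions

unrank_positions(4, 2, 4)  # [1, 3]
```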

#### Space Complexity
- **Memory Usage**: O(unique binomial lookups) for the coefficient cache
- **Typical Cache Size**: < 1000 entries for most realistic datasets
- **No Upfront Cost**: Zero initialization time, grows only as needed

#### Performance Characteristics
- **Small Datasets** (< 5,000 symbols): 0.045s - 1.7s encoding time
- **Medium Datasets** (5,000-10,000 symbols): 0.3s - 15s encoding time
- **Large Datasets** (> 100,000 symbols): May time out (> 30s)
- **Performance vs Huffman**: ~259x slower on average

#### Timeout Mechanism
- **Timeout Duration**: 30 seconds by default for enumerative coding
- **Graceful Handling**: Timeouts are logged and marked as "TIMEOUT" in results
- **When Timeouts Occur**: Very large sequences (> 100k symbols) with high vocabulary diversity

The optimizations successfully transform the algorithm from exponential (naive multinomial enumeration) to polynomial complexity, making it practical for realistic data sizes.

### Performance Results

From the benchmark results comparing Huffman vs Enumerative coding:

| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|--------------|-------------------|------------------------|-------------|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |

### Why Enumerative Underperforms

1. **Computational Complexity**: Combinatorial calculations become expensive for large datasets
2. **Fixed Algorithm Structure**: Cannot adapt to data characteristics like Huffman's variable-length codes
3. **Overhead**: The algorithm encodes structure information (alphabet, counts, positions) separately
4. **Scaling Issues**: Although polynomial, the cost of large-integer combinatorial arithmetic grows rapidly with dataset size and vocabulary complexity

## File Structure

```
entropy-coding-equiprobable/
├── enumerative_coding.py    # Core enumerative entropy coding implementation
├── entropy_coding.py        # Legacy compatibility and Huffman implementation
├── test_compression.py      # Main benchmark script with timing analysis
├── test_paper_examples.py   # Paper example verification
├── test_enumerative.py      # Basic functionality tests
├── plot_results.py          # Comprehensive visualization and analysis
├── quick_test.py            # Quick functionality test
├── justfile                 # Command runner
├── pyproject.toml           # Python dependencies
├── CLAUDE.md                # Project-specific AI instructions
└── README.md                # This file
```

## Implementation Details

### Enumerative Entropy Coding

The implementation follows the Han et al. (2008) algorithm with four main steps:

```python
def encode(self, data: List[int]) -> bytes:
    n = len(data)
    symbol_counts = Counter(data)
    sorted_symbols = sorted(symbol_counts)
    K = len(sorted_symbols)
    bits = ""

    # Step 1: Encode sequence length
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K - 1 counts; the last is implied)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing
    remaining_data = data
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
        # Drop the encoded symbol before ranking the next one
        remaining_data = [s for s in remaining_data if s != symbol]
```

### Key Optimizations

```python
# Complement encoding for frequent symbols: when a symbol fills more than
# half the window, rank the positions of the OTHER symbols instead
use_complement = k > current_n / 2
if use_complement:
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def __init__(self):
        self._cache: dict[tuple[int, int], int] = {}

    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # Uses Python's arbitrary-precision integers
        self._cache[(n, k)] = result
        return result
```

## Theoretical Analysis

### Compression Bounds

- **Shannon Entropy**: H(X) = -Σ p(x) log₂ p(x) - the theoretical minimum
- **Huffman**: Achieves H(X) ≤ L_Huffman < H(X) + 1 (typically ~99% efficiency)
- **Enumerative**: L_Enum ≥ H(X) + overhead (typically ~49% efficiency due to structural encoding)
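
The Shannon bound is easy to check numerically; a quick sketch:

```python
import math

def shannon_entropy(probs: list[float]) -> float:
    # H(X) = -sum p(x) * log2 p(x): the minimum bits/symbol for an i.i.d. source
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A skewed source needs fewer bits per symbol than log2(alphabet size)
shannon_entropy([0.5, 0.25, 0.25])  # 1.5 bits/symbol
shannon_entropy([0.25] * 4)         # 2.0 bits/symbol (uniform)
```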

### When Enumerative Coding Works

1. **Research/theoretical applications**: When exact mathematical properties are needed
2. **Educational purposes**: Understanding combinatorial compression principles
3. **Small datasets**: Where computational cost is not a concern

### When Enumerative Struggles

1. **All practical applications**: ~259x slower than Huffman with worse compression
2. **Large datasets**: Rapidly growing combinatorial arithmetic makes it computationally prohibitive
3. **Real-time systems**: Unpredictable and potentially very long encoding times

## Future Optimization Opportunities

While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several optimization strategies could further improve performance:

### 1. Just-In-Time (JIT) Compilation
- **Target**: Critical loops in combinatorial indexing and position reconstruction
- **Options**:
  - **Numba** (requires Python 3.11 due to llvmlite compatibility issues)
  - **JAX** (better Python 3.12 support, NumPy-compatible)
  - **PyPy** (alternative Python interpreter with JIT)
- **Expected Benefit**: 10-100x speedup for computational bottlenecks

### 2. Algorithmic Improvements
- **Incremental Encoding**: Reuse computations when processing similar sequences
- **Approximate Methods**: Trade slight accuracy for major performance gains on very large datasets
- **Parallel Processing**: Distribute symbol processing across multiple cores

### 3. Specialized Data Structures
- **Sparse Binomial Tables**: Only compute coefficients actually needed
- **Compressed Position Indices**: More efficient representation for position lists
- **Fast Integer Arithmetic**: Specialized libraries for large integer operations

### 4. Memory Hierarchy Optimizations
- **Cache-Friendly Algorithms**: Reorganize computations to minimize cache misses
- **Memory Pooling**: Reduce allocation overhead for temporary arrays
- **Streaming Encoding**: Process very large datasets without loading them entirely into memory

### 5. Domain-Specific Optimizations
- **Text-Specific**: Leverage byte patterns and common character frequencies
- **Statistical Precomputation**: Pre-build tables for common distributions (Zipf, geometric)
- **Adaptive Thresholds**: Dynamically adjust complement encoding and timeout parameters

The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.

## References

- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
- Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
- Huffman, D. A. (1952). "A Method for the Construction of Minimum-Redundancy Codes"

## Contributing

This is a research implementation. To contribute:

1. Fork the repository
2. Make changes following the existing code style
3. Run `just check` to verify code quality
4. Submit a pull request

## License

This project is for educational and research purposes.