---
license: mit
---

# Entropy Coding with Equiprobable Partitioning

Implementation of the enumerative entropy coding algorithm using equiprobable partitioning from Han et al. (2008), benchmarked against Huffman coding and theoretical limits.

## Overview

This project implements two compression algorithms:

1. **Equiprobable Partitioning (EP)** - The main algorithm from the paper
2. **Huffman Coding** - Classical entropy coding for comparison

## Algorithm Description

### Enumerative Entropy Coding

The algorithm from Han et al. (2008) is actually an **enumerative entropy coding** method that works in three steps:

1. **Encode alphabet size M** using exp-Golomb codes
2. **Encode symbol counts** N(s_0), N(s_1), ..., N(s_{M-2}) using exp-Golomb codes (the last count, N(s_{M-1}), is implied by the sequence length)
3. **Encode sequence position** among all permutations with the same symbol counts using combinatorial enumeration
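
To illustrate the integer code used in steps 1 and 2, here is a minimal order-0 exp-Golomb encoder (a standalone sketch, not the repository's `ExpGolombCoder`):

```python
def exp_golomb_encode(x: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer.

    Writes x + 1 in binary, preceded by one leading zero per bit
    after the first, so a decoder can recover the code length.
    """
    b = bin(x + 1)[2:]
    return "0" * (len(b) - 1) + b

# exp_golomb_encode(0) -> "1"
# exp_golomb_encode(1) -> "010"
# exp_golomb_encode(3) -> "00100"
```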

#### How It Works

- **Step 1**: Use exp-Golomb to encode how many distinct symbols appear
- **Step 2**: Use exp-Golomb to encode how many times each symbol appears
- **Step 3**: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram

This is fundamentally different from simple partitioning - it's a form of **combinatorial compression** that leverages the mathematical structure of permutations.
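
The lexicographic indexing of step 3 can be sketched for a single symbol: given the sorted positions of its k occurrences within a length-n window, compute the rank of that k-subset among all C(n, k) possibilities. This is a hypothetical standalone helper, not the repository's `_rank`:

```python
from math import comb

def rank_positions(n: int, positions: list[int]) -> int:
    """Lexicographic rank of a sorted k-subset of {0, ..., n-1}."""
    k = len(positions)
    rank = 0
    prev = -1
    for i, p in enumerate(positions):
        # Count every subset whose i-th element is smaller than p
        for q in range(prev + 1, p):
            rank += comb(n - q - 1, k - i - 1)
        prev = p
    return rank

# For n = 4, k = 2 the subsets in lexicographic order are
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
rank_positions(4, [1, 3])  # 4
```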

### Performance Optimizations

Key optimizations enable practical performance for datasets up to ~10,000 symbols:

1. **Cached Binomial Coefficients**: Uses `math.comb()` with caching to avoid recomputation
2. **Binary Search**: O(log n) position reconstruction instead of linear search
3. **Complement Encoding**: For frequent symbols (>50% of the sequence), encode the positions of the other symbols instead
4. **Arbitrary Precision**: Python integers avoid overflow for large combinatorial values

These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.

### Why Enumerative Coding?

The algorithm aims to achieve compression by:
- Separating structure (symbol counts) from content (permutation)
- Using optimal exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Achieving theoretical compression bounds for certain distributions

## Installation

```bash
# Clone or navigate to the repository
cd entropy-coding-equiprobable

# Install dependencies
just setup
# or manually: uv sync
```

## Usage

### Quick Start

```bash
# List all available commands
just

# Run quick tests
just test

# Run paper examples
just test-paper

# Run full compression benchmark
just run

# Run benchmark and generate plots (recommended)
just analyze

# Generate plots and analysis
just plot
```

### Visualization

The plotting functionality generates comprehensive analysis:

1. **Compression Comparison**: Side-by-side comparison of Huffman vs Enumerative methods
2. **Compression Time Analysis**: Performance timing comparison between algorithms
3. **Distribution Analysis**: Performance across uniform, Zipf, and geometric data
4. **Efficiency Analysis**: How close each method gets to theoretical limits
5. **Enumerative Timeout Analysis**: Computational complexity limitations and scaling behavior

Plots are saved to the `plots/` directory as high-resolution PNG files.

## Results

### Compression Performance Comparison
![Compression Comparison](plots/compression_comparison.png)

Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.

### Compression Time Analysis
![Compression Time Analysis](plots/compression_time_analysis.png)

Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.

### Distribution Analysis
![Distribution Analysis](plots/distribution_analysis.png)

Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.

### Computational Complexity Analysis
![Enumerative Timeout Analysis](plots/enumerative_timeout_analysis.png)

Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.

### Command Reference

- `just` - List available commands
- `just setup` - Install dependencies
- `just test` - Quick test with small datasets + paper examples
- `just test-paper` - Test examples from the paper
- `just run` - Full compression benchmark
- `just analyze` - Run full benchmark and generate plots
- `just plot` - Generate comparison plots
- `just clean` - Remove generated files
- `just check` - Run code quality checks
- `just format` - Format code

## Test Datasets

The benchmark includes:

### I.I.D. Datasets
- **Small** (1K symbols): Quick testing
- **Medium** (10K symbols): Moderate datasets
- **Large** (100K symbols): Performance at scale

### Distributions
- **Uniform**: All symbols equally likely
- **Zipf**: Power-law distribution (realistic for text)
- **Geometric**: Exponentially decreasing probabilities
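
The synthetic distributions can be sampled with the standard library alone; a hypothetical sketch (the benchmark's actual generator may differ):

```python
import random

def zipf_weights(vocab_size: int, s: float = 1.0) -> list[float]:
    # Unnormalized power-law weights: P(rank r) proportional to 1 / r**s
    return [1.0 / (r ** s) for r in range(1, vocab_size + 1)]

random.seed(0)
uniform_data = [random.randrange(64) for _ in range(1000)]
zipf_data = random.choices(range(64), weights=zipf_weights(64), k=1000)
```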

### Vocabulary Sizes
- **10 symbols**: Small alphabet
- **64 symbols**: Medium alphabet
- **256 symbols**: Full byte range

### Real Data
- **English text**: Downloaded from WikiText-2 via Hugging Face

## Results Analysis

### Performance Patterns

1. **Uniform Distributions**: EP performs poorly because there's no probability imbalance to exploit
2. **Skewed Distributions**: EP performs better but still trails Huffman
3. **Large Vocabularies**: EP overhead becomes significant with many symbols

### Computational Complexity

The optimized enumerative entropy coding implementation achieves **polynomial time complexity** through careful algorithmic design:

#### Time Complexity Analysis
- **Encoding**: O(M × n) where M = alphabet size, n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol with memoization
- **Decoding**: O(M × k × log n) where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized
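
Decoding inverts the combinatorial indexing of the encoder. A sketch of position reconstruction, shown with a linear scan of binomial counts for clarity (the optimized implementation replaces the inner loop with binary search), using a hypothetical standalone helper:

```python
from math import comb

def unrank_positions(n: int, k: int, rank: int) -> list[int]:
    """Recover the sorted k-subset of {0, ..., n-1} at the given lexicographic rank."""
    positions = []
    q = 0
    for i in range(k):
        # Advance past candidates that account for 'rank' many smaller subsets
        while comb(n - q - 1, k - i - 1) <= rank:
            rank -= comb(n - q - 1, k - i - 1)
            q += 1
        positions.append(q)
        q += 1
    return positions

unrank_positions(4, 2, 4)  # [1, 3]
```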

#### Space Complexity
- **Memory Usage**: O(unique binomial lookups) for the coefficient cache
- **Typical Cache Size**: < 1000 entries for most realistic datasets
- **No Upfront Cost**: Zero initialization time, grows only as needed

#### Performance Characteristics
- **Small Datasets** (< 5,000 symbols): 0.045s - 1.7s encoding time
- **Medium Datasets** (5,000-10,000 symbols): 0.3s - 15s encoding time
- **Large Datasets** (> 100,000 symbols): May time out (> 30s)
- **Performance vs Huffman**: ~259x slower on average

#### Timeout Mechanism
- **Timeout Duration**: 30 seconds by default for enumerative coding
- **Graceful Handling**: Timeouts are logged and marked as "TIMEOUT" in results
- **When Timeouts Occur**: Very large sequences (> 100k symbols) with high vocabulary diversity

The optimizations successfully transform the algorithm from exponential (naive multinomial enumeration) to polynomial complexity, making it practical for realistic data sizes.

### Performance Results

From the benchmark results comparing Huffman vs Enumerative coding:

| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|--------------|-------------------|------------------------|-------------|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |

### Why Enumerative Underperforms

1. **Computational Complexity**: Combinatorial calculations become expensive for large datasets
2. **Fixed Algorithm Structure**: Cannot adapt to data characteristics like Huffman's variable-length codes
3. **Overhead**: The algorithm encodes structure information (alphabet, counts, positions) separately
4. **Scaling Issues**: Although polynomial, the cost of large-integer combinatorial arithmetic grows rapidly with dataset size and vocabulary complexity

## File Structure

```
entropy-coding-equiprobable/
├── enumerative_coding.py    # Core enumerative entropy coding implementation
├── entropy_coding.py        # Legacy compatibility and Huffman implementation
├── test_compression.py      # Main benchmark script with timing analysis
├── test_paper_examples.py   # Paper example verification
├── test_enumerative.py      # Basic functionality tests
├── plot_results.py          # Comprehensive visualization and analysis
├── quick_test.py            # Quick functionality test
├── justfile                 # Command runner
├── pyproject.toml           # Python dependencies
├── CLAUDE.md                # Project-specific AI instructions
└── README.md                # This file
```

## Implementation Details

### Enumerative Entropy Coding

The implementation follows the Han et al. (2008) algorithm with four main steps:

```python
def encode(self, data: List[int]) -> bytes:
    n = len(data)
    symbol_counts = Counter(data)
    sorted_symbols = sorted(symbol_counts)
    K = len(sorted_symbols)
    bits = ""

    # Step 1: Encode sequence length
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K - 1 counts; the last is implied)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing
    remaining_data = data
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
        # Drop the encoded symbol before ranking the next one
        remaining_data = [s for s in remaining_data if s != symbol]
```

### Key Optimizations

```python
# Complement encoding for frequent symbols: when a symbol fills more than
# half the window, rank the positions of the OTHER symbols instead
use_complement = k > current_n / 2
if use_complement:
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def __init__(self):
        self._cache: dict[tuple[int, int], int] = {}

    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # Uses Python's arbitrary-precision integers
        self._cache[(n, k)] = result
        return result
```

## Theoretical Analysis

### Compression Bounds

- **Shannon Entropy**: H(X) = -Σ p(x) log₂ p(x) - the theoretical minimum
- **Huffman**: Achieves H(X) ≤ L_Huffman < H(X) + 1 (typically ~99% efficiency)
- **Enumerative**: L_Enum ≥ H(X) + overhead (typically ~49% efficiency due to structural encoding)
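
The Shannon bound is easy to check numerically; a quick sketch:

```python
import math

def shannon_entropy(probs: list[float]) -> float:
    # H(X) = -sum p(x) * log2 p(x): the minimum bits/symbol for an i.i.d. source
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A skewed source needs fewer bits per symbol than log2(alphabet size)
shannon_entropy([0.5, 0.25, 0.25])  # 1.5 bits/symbol
shannon_entropy([0.25] * 4)         # 2.0 bits/symbol (uniform)
```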

### When Enumerative Coding Works

1. **Research/theoretical applications**: When exact mathematical properties are needed
2. **Educational purposes**: Understanding combinatorial compression principles
3. **Small datasets**: Where computational cost is not a concern

### When Enumerative Struggles

1. **All practical applications**: ~259x slower than Huffman with worse compression
2. **Large datasets**: Rapidly growing combinatorial arithmetic makes it computationally prohibitive
3. **Real-time systems**: Unpredictable and potentially very long encoding times

## Future Optimization Opportunities

While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several optimization strategies could further improve performance:

### 1. Just-In-Time (JIT) Compilation
- **Target**: Critical loops in combinatorial indexing and position reconstruction
- **Options**:
  - **Numba** (requires Python 3.11 due to llvmlite compatibility issues)
  - **JAX** (better Python 3.12 support, NumPy-compatible)
  - **PyPy** (alternative Python interpreter with JIT)
- **Expected Benefit**: 10-100x speedup for computational bottlenecks

### 2. Algorithmic Improvements
- **Incremental Encoding**: Reuse computations when processing similar sequences
- **Approximate Methods**: Trade slight accuracy for major performance gains on very large datasets
- **Parallel Processing**: Distribute symbol processing across multiple cores

### 3. Specialized Data Structures
- **Sparse Binomial Tables**: Only compute coefficients actually needed
- **Compressed Position Indices**: More efficient representation for position lists
- **Fast Integer Arithmetic**: Specialized libraries for large integer operations

### 4. Memory Hierarchy Optimizations
- **Cache-Friendly Algorithms**: Reorganize computations to minimize cache misses
- **Memory Pooling**: Reduce allocation overhead for temporary arrays
- **Streaming Encoding**: Process very large datasets without loading them entirely into memory

### 5. Domain-Specific Optimizations
- **Text-Specific**: Leverage byte patterns and common character frequencies
- **Statistical Precomputation**: Pre-build tables for common distributions (Zipf, geometric)
- **Adaptive Thresholds**: Dynamically adjust complement encoding and timeout parameters

The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.

## References

- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
- Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
- Huffman, D. A. (1952). "A Method for the Construction of Minimum-Redundancy Codes"

## Contributing

This is a research implementation. To contribute:

1. Fork the repository
2. Make changes following the existing code style
3. Run `just check` to verify code quality
4. Submit a pull request

## License

This project is for educational and research purposes.