	---
	license: apache-2.0
	base_model: FireRedTeam/FireRedVAD
	tags:
	- voice-activity-detection
	- vad
	- coreml
	- apple
	- ios
	- macos
	- streaming
	- real-time
	- dfsmn
	- firered
	pipeline_tag: voice-activity-detection
	library_name: coremltools
	language:
	- multilingual
	---

	# FireRedVAD-CoreML

	Core ML conversion of the Stream-VAD variant of [FireRedVAD](https://huggingface.co/FireRedTeam/FireRedVAD), for real-time voice activity detection on Apple platforms (iOS 16+ / macOS 13+). The weights come from the original PyTorch model released by FireRedTeam.

	## Model Description

	- Original model: FireRedVAD by Xiaohongshu (小红书) FireRedTeam
	- Architecture: DFSMN (Deep Feedforward Sequential Memory Network) with 8 DFSMN blocks and 1 DNN layer
	- Variant: Stream-VAD (causal, lookahead=0), suitable for real-time streaming
	- Parameters: ~568K (extremely lightweight)
	- Model size: 2.2 MB (FP32)
	- Input: 80-dim log-Mel filterbank features (16 kHz audio, 25 ms frame length, 10 ms frame shift)
	- Output: speech probability in [0, 1] for each frame
	- Language support: 100+ languages, 20+ Chinese dialects
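
	The framing parameters above determine how many feature frames a clip yields. A minimal sketch of that arithmetic, assuming framing without edge padding (the original feature extractor may treat clip edges differently); `frameCount` is a hypothetical helper, not part of the model:

	```swift
	import Foundation

	// Framing constants taken from the model description.
	let sampleRate = 16_000                     // Hz
	let frameLength = sampleRate * 25 / 1000    // 25 ms -> 400 samples
	let frameShift = sampleRate * 10 / 1000     // 10 ms -> 160 samples

	/// Number of complete frames in `sampleCount` samples.
	/// Assumption: no padding at the clip edges, so partial frames are dropped.
	func frameCount(sampleCount: Int) -> Int {
	    guard sampleCount >= frameLength else { return 0 }
	    return 1 + (sampleCount - frameLength) / frameShift
	}

	let oneSecond = frameCount(sampleCount: sampleRate)
	print(oneSecond) // 98 frames for one second of audio
	```

	Under these assumptions the model emits roughly 100 probabilities per second of audio, one every 10 ms.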

	## Performance

	Results from the FLEURS-VAD-102 benchmark (102 languages, 9,443 audio clips):

	| Metric | FireRedVAD | Silero-VAD | TEN-VAD | FunASR-VAD | WebRTC-VAD |
	|--------|-----------|-----------|---------|-----------|-----------|
	| AUC-ROC (%) | 99.60 | 97.99 | 97.81 | - | - |
	| F1 score (%) | 97.57 | 95.95 | 95.19 | 90.91 | 52.30 |
	| False alarm rate (%) | 2.69 | 9.41 | 15.47 | 44.03 | 2.83 |
	| Miss rate (%) | 3.62 | 3.95 | 2.95 | 0.42 | 64.15 |
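
	For readers unfamiliar with these metrics, the sketch below uses the common frame-level definitions, with speech as the positive class. This is an assumption: the benchmark's exact scoring protocol is not described here, and the `Confusion` type and its counts are purely illustrative.

	```swift
	import Foundation

	/// Frame-level confusion counts with speech as the positive class (hypothetical type).
	struct Confusion {
	    var tp: Double, fp: Double, tn: Double, fn: Double

	    var falseAlarmRate: Double { fp / (fp + tn) }   // non-speech flagged as speech
	    var missRate: Double { fn / (fn + tp) }         // speech flagged as non-speech
	    var precision: Double { tp / (tp + fp) }
	    var recall: Double { tp / (tp + fn) }
	    var f1: Double { 2 * precision * recall / (precision + recall) }
	}

	// Made-up counts for illustration only, not benchmark data.
	let c = Confusion(tp: 90, fp: 5, tn: 95, fn: 10)
	print(c.falseAlarmRate, c.missRate, c.f1)
	```

	With these definitions, a low miss rate bought with a high false alarm rate still lowers F1, because false alarms reduce precision.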

	## Core ML Model Specification

	### Inputs

	| Name | Shape | Type | Description |
	|------|-------|------|-------------|
	| `feat` | `[1, 1..512, 80]` | Float32 | Log-Mel filterbank features (dynamic time axis) |
	| `cache_0` to `cache_7` | `[1, 128, 19]` | Float32 | FSMN lookback cache for each of the 8 layers |
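
	Because the time axis of `feat` accepts 1 to 512 frames, longer feature sequences have to be split before inference. A minimal sketch of that split; `chunkRanges` is a hypothetical helper, and the chunk size is a free choice within the limit:

	```swift
	import Foundation

	/// Upper bound of the dynamic time axis, from the input table.
	let maxFramesPerCall = 512

	/// Splits `totalFrames` into consecutive ranges of at most `chunkSize` frames.
	func chunkRanges(totalFrames: Int, chunkSize: Int = maxFramesPerCall) -> [Range<Int>] {
	    precondition(chunkSize >= 1 && chunkSize <= maxFramesPerCall,
	                 "chunk size must be in 1...512")
	    return stride(from: 0, to: totalFrames, by: chunkSize).map { start in
	        start..<min(start + chunkSize, totalFrames)
	    }
	}

	// Three ranges: 0..<512, 512..<1024, 1024..<1200
	let ranges = chunkRanges(totalFrames: 1200)
	print(ranges.count)
	```

	Each chunk is then passed to the model together with the caches returned by the previous call, so the lookback state carries across chunk boundaries.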

	### Outputs

	| Name | Type | Description |
	|------|------|-------------|
	| `probs` | Float32 | Speech probability, shape `[1, T, 1]` |
	| `new_cache_0` to `new_cache_7` | Float32 | Updated lookback cache, passed back as `cache_0` to `cache_7` on the next call |

	- Minimum deployment target: iOS 16 / macOS 13
	- Compute units: CPU + Neural Engine
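
	The compute-unit preference can be requested when loading the model. A minimal sketch using `MLModelConfiguration`; Core ML decides at runtime where each layer actually runs, so Neural Engine execution is not guaranteed:

	```swift
	import CoreML

	// Request CPU + Neural Engine execution.
	// The `.cpuAndNeuralEngine` option requires iOS 16 / macOS 13 or later.
	let configuration = MLModelConfiguration()
	configuration.computeUnits = .cpuAndNeuralEngine

	// Pass `configuration` to the generated model initializer,
	// e.g. `FireRedVAD(configuration: configuration)` as in the Usage section.
	```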

	## Conversion

	Converted from PyTorch using [coremltools](https://github.com/apple/coremltools) via the export script in [FireRedASR2S](https://github.com/FireRedTeam/FireRedASR2S). The Stream-VAD variant was selected for its causal (no lookahead) property, making it suitable for real-time streaming applications.

	## Usage

	```swift
	import CoreML

	// Load the model (`FireRedVAD` is assumed to be the Xcode-generated class)
	let model = try FireRedVAD(configuration: .init())

	// Initialize one lookback cache per DFSMN block (8 x [1, 128, 19]).
	// Zero-filled explicitly, assuming an all-zero cache is the intended start state.
	var caches: [MLMultiArray] = try (0..<8).map { _ -> MLMultiArray in
	    let cache = try MLMultiArray(shape: [1, 128, 19], dataType: .float32)
	    for i in 0..<cache.count { cache[i] = 0 }
	    return cache
	}

	// Process audio chunk by chunk; `fbankFeatures` is an MLMultiArray of
	// shape [1, T, 80] (1 <= T <= 512) prepared by the caller.
	let input = FireRedVADInput(
	    feat: fbankFeatures,
	    cache_0: caches[0], cache_1: caches[1],
	    cache_2: caches[2], cache_3: caches[3],
	    cache_4: caches[4], cache_5: caches[5],
	    cache_6: caches[6], cache_7: caches[7]
	)
	let output = try model.prediction(input: input)
	let speechProb = output.probs // [1, T, 1]

	// Carry the updated caches into the next call
	caches = [
	    output.new_cache_0, output.new_cache_1,
	    output.new_cache_2, output.new_cache_3,
	    output.new_cache_4, output.new_cache_5,
	    output.new_cache_6, output.new_cache_7
	]
	```

	For a complete implementation with feature extraction, CMVN, and a speech state machine, see [FireRedASRKit](https://github.com/leaker/firered_asr).
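
	As an illustration of that last step, a speech state machine can be as simple as a threshold with a hangover counter. The sketch below is not the FireRedASRKit implementation; `SpeechGate`, the 0.5 threshold, and the frame counts are hypothetical values chosen for the example:

	```swift
	import Foundation

	/// Minimal speech/non-speech gate over per-frame probabilities.
	struct SpeechGate {
	    var threshold: Float = 0.5      // assumed decision threshold
	    var minSpeechFrames = 3         // consecutive speech frames needed to enter speech
	    var hangoverFrames = 5          // consecutive silent frames needed to leave speech
	    private(set) var inSpeech = false
	    private var run = 0             // length of the current run contradicting the state

	    /// Feeds one probability and returns the state after this frame.
	    mutating func update(_ prob: Float) -> Bool {
	        let isSpeechFrame = prob >= threshold
	        if isSpeechFrame != inSpeech {
	            run += 1
	            let needed = inSpeech ? hangoverFrames : minSpeechFrames
	            if run >= needed {
	                inSpeech.toggle()
	                run = 0
	            }
	        } else {
	            run = 0
	        }
	        return inSpeech
	    }
	}

	var gate = SpeechGate()
	let probs: [Float] = [0.1, 0.9, 0.9, 0.9, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
	let states = probs.map { gate.update($0) }
	print(states)
	```

	The gate enters speech on the third consecutive high-probability frame, ignores the single dip to 0.2, and leaves speech only after five consecutive low-probability frames. Because a state change is confirmed late, segment boundaries lag the true onset and offset by the respective frame counts.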

	## References

	- [FireRedVAD (Original Model)](https://huggingface.co/FireRedTeam/FireRedVAD)
	- [FireRedASR2S GitHub](https://github.com/FireRedTeam/FireRedASR2S)
	- [FireRedASR Paper (arXiv:2501.14350)](https://arxiv.org/abs/2501.14350)
	- [DFSMN Paper (arXiv:1803.05030)](https://arxiv.org/abs/1803.05030)

	## License

	Apache 2.0, following the original [FireRedVAD](https://huggingface.co/FireRedTeam/FireRedVAD) license.