Source and computation of w2vbert2_mean_var_stats_emilia.pt
I am using DualCodec and noticed that the released checkpoints require the file:
w2vbert2_mean_var_stats_emilia.pt
After inspecting it, it contains:
{
"mean": tensor([1024]),
"var": tensor([1024])
}
I would like to understand exactly how these statistics were computed.
Are these the mean and variance of W2V-BERT hidden representations computed on the Emilia dataset?
If so, which hidden layer was used (e.g., Layer 16)?
Were the statistics computed directly on the hidden states, or on some post-processed representation?
If I train on a different dataset, such as LibriSpeech, should the mean/variance statistics be recomputed separately?
Thank you.
A similar issue on GitHub may help: https://github.com/jiaqili3/DualCodec/issues/5
The statistics are calculated on the same hidden layer that is used to extract the semantic features, with no post-processing. If you train on a different dataset, I don't think separate statistics are needed.
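For readers who want to recompute such statistics, here is a minimal sketch of one way to accumulate a per-channel mean and variance over hidden-state frames. It is not the authors' script: the helper name `accumulate_stats`, the padding mask, the use of population variance, and the random stand-in tensors are all assumptions made for illustration.

```python
import torch


def accumulate_stats(batches):
    """Streaming per-channel mean/variance over all valid frames.

    `batches` yields (feat, mask) pairs: feat is (B, T, C) hidden states,
    mask is (B, T) with 1 for real frames and 0 for padding.
    (Hypothetical helper; not part of DualCodec.)
    """
    count = 0
    total = None      # running sum of features, shape (C,)
    total_sq = None   # running sum of squared features, shape (C,)
    for feat, mask in batches:
        valid = feat[mask.bool()].double()  # (N_valid, C); padded frames dropped
        if total is None:
            total = torch.zeros(valid.shape[-1], dtype=torch.float64)
            total_sq = torch.zeros_like(total)
        count += valid.shape[0]
        total += valid.sum(dim=0)
        total_sq += (valid ** 2).sum(dim=0)
    mean = total / count
    var = total_sq / count - mean ** 2      # population variance: E[x^2] - E[x]^2
    return {"mean": mean.float(), "var": var.float()}


# Stand-in data: random tensors in place of real w2v-BERT-2.0 hidden states.
torch.manual_seed(0)
batches = [(torch.randn(2, 50, 1024), torch.ones(2, 50)) for _ in range(3)]
stats = accumulate_stats(batches)
print(stats["mean"].shape, stats["var"].shape)
```

The returned dictionary has the same keys and shapes as the released file, so it could be written out with `torch.save`. Sums are kept in float64 because the E[x^2] - E[x]^2 form loses precision in float32 on large corpora.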
Thank you for the reply. If we use a different w2v-BERT-2.0 layer, do we need to calculate new statistics?
Normally, yes.
The paper says: "We use normalized 16th layer w2v-BERT-2.0".
But the code in dualcodec/dataset/processor.py appears to extract hidden state 17:
layer_idx = 15
output_idx = layer_idx + 2
So the output index is 17, which is later used in dualcodec/model_codec/trainer.py:
feat = vq_emb.hidden_states[self.cfg.semantic_model["output_idx"]] # (B, T, C)
I checked the w2v-BERT-2.0 hidden states: there are 25 of them. The first is the output of the feature projection layer, and the remaining 24 are the outputs of the w2v-BERT-2.0 encoder layers.
Can you clarify this?
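To make the indexing question concrete, here is a small sketch of the index-to-layer mapping, assuming the layout described above (entry 0 is the feature projection output, entries 1 to 24 are the encoder layer outputs in order). `describe_hidden_state` is a hypothetical helper written for this illustration, not a DualCodec or transformers function.

```python
NUM_ENCODER_LAYERS = 24  # from the observation above: 25 hidden states = 1 + 24


def describe_hidden_state(index, num_layers=NUM_ENCODER_LAYERS):
    """Name the source of hidden_states[index] under the assumed layout."""
    if not 0 <= index <= num_layers:
        raise IndexError(f"index {index} outside 0..{num_layers}")
    if index == 0:
        return "feature projection output (input to encoder layer 1)"
    return f"output of encoder layer {index} (1-based)"


layer_idx = 15
output_idx = layer_idx + 2  # same arithmetic as in _build_semantic_model
print(output_idx, "->", describe_hidden_state(output_idx))
print(16, "->", describe_hidden_state(16))
```

Under this assumption, index 17 would be the output of the 17th encoder layer, and the 16th layer's output would sit at index 16. Which of the two the paper's "16th layer" refers to is exactly the open question.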
Function from dualcodec/dataset/processor.py:
def _build_semantic_model(semantic_model, mean_var_path, repcodec_model, repcodec_path):
    """Build the w2v semantic model and load pretrained weights."""
    import safetensors

    semantic_model = semantic_model.eval()
    layer_idx = 15
    output_idx = layer_idx + 2
    stat_mean_var = torch.load(mean_var_path)
    semantic_mean = stat_mean_var["mean"]
    semantic_std = torch.sqrt(stat_mean_var["var"])
    semantic_mean = semantic_mean
    semantic_std = semantic_std
    if repcodec_model is not None:
        safetensors.torch.load_model(repcodec_model, repcodec_path)
        repcodec_model = repcodec_model.eval()
    # print("semantic mean: ", semantic_mean.cpu(), "semantic std: ", semantic_std.cpu())
    return {
        "model": semantic_model,
        "layer_idx": layer_idx,
        "output_idx": output_idx,
        "mean": semantic_mean,
        "std": semantic_std,
        "repcodec_model": repcodec_model,
    }
Function from dualcodec/dataset/processor.py:
def _extract_semantic_code(self, input_features, attention_mask):
    vq_emb = self.cfg.semantic_model["model"](
        input_features=input_features,
        attention_mask=attention_mask,
        output_hidden_states=True,
    )
    feat = vq_emb.hidden_states[self.cfg.semantic_model["output_idx"]]  # (B, T, C)
    if (
        hasattr(self.cfg, "skip_semantic_normalize")
        and self.cfg.skip_semantic_normalize
    ):
        pass
    else:
        feat = (feat - self.cfg.semantic_model["mean"]) / self.cfg.semantic_model[
            "std"
        ]
    return feat
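The normalization step in the function above can be tried in isolation. The following is a minimal sketch with stand-in tensors; `normalize_features` and the random statistics are hypothetical and only mirror the shapes quoted in this thread ((B, T, C) features, (C,) mean and variance).

```python
import torch


def normalize_features(feat, mean, std, skip=False):
    """Per-channel standardization; (C,) mean/std broadcast over (B, T, C)."""
    if skip:
        return feat
    return (feat - mean) / std


torch.manual_seed(0)
# Stand-in for the contents of the stats file: "mean" and "var", each of shape (1024,).
stats = {"mean": torch.randn(1024), "var": torch.rand(1024) + 0.5}
mean, std = stats["mean"], torch.sqrt(stats["var"])

feat = torch.randn(2, 50, 1024) * std + mean  # fake hidden states, (B, T, C)
out = normalize_features(feat, mean, std)
print(out.shape)
```

Because the statistics are per channel, they depend on which hidden state is extracted. In real use, `mean` and `std` would also need to be on the same device as `feat`.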