Source and computation of w2vbert2_mean_var_stats_emilia.pt

by Vartul27 - opened 15 days ago

Discussion

Vartul27

15 days ago

edited 15 days ago

I am using DualCodec and noticed that the released checkpoints require the file:

w2vbert2_mean_var_stats_emilia.pt

After inspecting it, it contains:

{
    "mean": tensor of shape [1024],
    "var": tensor of shape [1024]
}

I would like to understand exactly how these statistics were computed.

Are these the mean and variance of W2V-BERT hidden representations computed on the Emilia dataset?
If so, which hidden layer was used (e.g., Layer 16)?
Were the statistics computed directly on the hidden states, or on some post-processed representation?
If a different training dataset is used, such as LibriSpeech, should separate mean/variance statistics be recomputed?

Thank you.

jiaqili3

Amphion org 14 days ago

I will link you to a similar issue on GitHub: https://github.com/jiaqili3/DualCodec/issues/5
The statistics are calculated on the same hidden layer that was used to extract the semantic features, and they were not post-processed. If you use a different dataset, I think separate statistics will not be needed.
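
For readers who want to recompute such statistics on their own features, here is a minimal sketch of a per-dimension streaming mean/variance computation. It is an assumption about the general procedure, not the authors' actual script: the function name `accumulate_mean_var` is hypothetical, random tensors stand in for the un-normalized hidden states, and whether the released file uses population or unbiased variance is not stated in this thread. Only the output keys `mean` and `var` mirror the released file.

```python
import torch


def accumulate_mean_var(feature_batches):
    """Per-dimension mean and variance over all frames of all batches.

    feature_batches: iterable of (T, C) tensors. These are hypothetical
    stand-ins for un-normalized hidden states of the chosen layer,
    with padded frames already removed.
    """
    count = 0
    total = None      # running sum of x, shape (C,)
    total_sq = None   # running sum of x**2, shape (C,)
    for feats in feature_batches:
        feats = feats.to(torch.float64)  # float64 limits accumulation error
        if total is None:
            total = torch.zeros(feats.shape[-1], dtype=torch.float64)
            total_sq = torch.zeros_like(total)
        total += feats.sum(dim=0)
        total_sq += (feats ** 2).sum(dim=0)
        count += feats.shape[0]
    mean = total / count
    var = total_sq / count - mean ** 2  # population variance: E[x^2] - E[x]^2
    return {"mean": mean.float(), "var": var.float()}


# Random data in place of real features; 1024 matches the released file.
torch.manual_seed(0)
batches = [torch.randn(50, 1024) * 2.0 + 3.0 for _ in range(4)]
stats = accumulate_mean_var(batches)
# torch.save(stats, "my_mean_var_stats.pt")  # hypothetical output path
```

Accumulating sums rather than concatenating all features keeps memory use constant, which matters for a corpus-sized dataset.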

Vartul27

13 days ago

Thank you for the reply. If we use a different w2v2 layer, do we need to calculate new statistics?

jiaqili3

Amphion org 13 days ago

Normally, yes.

Vartul27

about 8 hours ago

In the paper it is written that "We use normalized 16th layer w2v-BERT-2.0".

However, the code in dualcodec/dataset/processor.py extracts layer 17:
layer_idx = 15
output_idx = layer_idx + 2

The output index is 17, which is later used in dualcodec/model_codec/trainer.py:
feat = vq_emb.hidden_states[self.cfg.semantic_model["output_idx"]] # (B, T, C)

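I checked the w2v-BERT-2.0 hidden representations: there are 25 entries, where the first is the output of the feature projection layer and the remaining 24 are the outputs of the w2v-BERT-2.0 encoder layers. So index 17 would correspond to the 17th encoder layer, not the 16th.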
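
The indexing described above can be made explicit with a small sketch. It assumes the usual Hugging Face convention that `hidden_states` has one entry for the encoder input plus one per encoder layer; the helper `describe_hidden_state` is hypothetical and only illustrates the mapping, it is not part of DualCodec or transformers.

```python
NUM_ENCODER_LAYERS = 24  # w2v-BERT 2.0 encoder depth, as reported in this post


def describe_hidden_state(index, num_layers=NUM_ENCODER_LAYERS):
    """Describe what hidden_states[index] holds under the assumed convention."""
    if not 0 <= index <= num_layers:
        raise IndexError(f"index {index} is outside 0..{num_layers}")
    if index == 0:
        return "feature projection output (encoder input)"
    return f"output of encoder layer {index} (1-based)"


layer_idx = 15
output_idx = layer_idx + 2  # same arithmetic as in processor.py

print(output_idx, "->", describe_hidden_state(output_idx))
print(16, "->", describe_hidden_state(16))
```

Under this assumption, `output_idx = 17` selects the output of the 17th encoder layer, while the output of the 16th encoder layer would sit at index 16.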

Can you clarify this?

Function from dualcodec/dataset/processor.py:
def _build_semantic_model(semantic_model, mean_var_path, repcodec_model, repcodec_path):
    """Build the w2v semantic model and load pretrained weights."""
    import safetensors

    semantic_model = semantic_model.eval()
    layer_idx = 15
    output_idx = layer_idx + 2
    stat_mean_var = torch.load(mean_var_path)
    semantic_mean = stat_mean_var["mean"]
    semantic_std = torch.sqrt(stat_mean_var["var"])
    semantic_mean = semantic_mean
    semantic_std = semantic_std

    if repcodec_model is not None:
        safetensors.torch.load_model(repcodec_model, repcodec_path)
        repcodec_model = repcodec_model.eval()
        # print("semantic mean: ", semantic_mean.cpu(), "semantic std: ", semantic_std.cpu())
    return {
        "model": semantic_model,
        "layer_idx": layer_idx,
        "output_idx": output_idx,
        "mean": semantic_mean,
        "std": semantic_std,
        "repcodec_model": repcodec_model,
    }

Function from dualcodec/model_codec/trainer.py:

def _extract_semantic_code(self, input_features, attention_mask):
    vq_emb = self.cfg.semantic_model["model"](
        input_features=input_features,
        attention_mask=attention_mask,
        output_hidden_states=True,
    )
    feat = vq_emb.hidden_states[self.cfg.semantic_model["output_idx"]]  # (B, T, C)

    if (
        hasattr(self.cfg, "skip_semantic_normalize")
        and self.cfg.skip_semantic_normalize
    ):
        pass
    else:
        feat = (feat - self.cfg.semantic_model["mean"]) / self.cfg.semantic_model[
            "std"
        ]
    return feat
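
As a sanity check on the normalization step in this function, the following sketch reproduces the same arithmetic on synthetic data. The tensors are random stand-ins rather than real w2v-BERT features, and the statistics are computed from that same synthetic data purely for illustration; the dictionary keys follow the released stats file.

```python
import torch

torch.manual_seed(0)
B, T, C = 2, 30, 1024
feat = torch.randn(B, T, C) * 2.0 + 3.0  # stand-in for hidden states (B, T, C)

# Stand-in for the loaded stats file: per-channel mean and variance.
flat = feat.reshape(-1, C)
stat_mean_var = {
    "mean": flat.mean(dim=0),
    "var": flat.var(dim=0, unbiased=False),
}

semantic_mean = stat_mean_var["mean"]            # shape (C,)
semantic_std = torch.sqrt(stat_mean_var["var"])  # shape (C,)

# (B, T, C) - (C,) broadcasts over the batch and time axes.
normalized = (feat - semantic_mean) / semantic_std
```

If the statistics match the layer and data the features come from, each channel of `normalized` should have roughly zero mean and unit variance; a large deviation would suggest a mismatch in layer or statistics. On GPU, the statistics also need to be on the same device as `feat`.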
