Zero-Shot ClipBench Evaluation

Download the supported datasets directly from the datasets host and update the paths in clip_benchmark/datasets/builder.py. Then run:

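# $CHECKPOINT is not set here; define it as the value to pass to --pretrained.
model='PE-Core-G14-448'
DATASETS=./clip_benchmark/tasks/wds_benchmarks.txt
DATA_ROOT=DATA_ROOT/  # placeholder: replace with your dataset root directory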

python -m clip_benchmark.cli eval \
    --model $model \
    --pretrained $CHECKPOINT \
    --dataset "$DATASETS" \
    --dataset_root $DATA_ROOT \
    --output "./benchmark_{pretrained}_{dataset}_{num_frames}_{model}_{language}_{task}.json" \
    --force-preprocess-cfg resize_mode=squash

This script runs the zero-shot classification and retrieval benchmarks defined in clip_benchmark/tasks/wds_benchmarks.txt. The example above includes the following tasks:

ImageNet 1K classification
ImageNet v2 classification
ImageNet Adversarial classification
MS-COCO retrieval
Flickr30K retrieval
Kinetics 400 video classification
MSR-VTT video retrieval
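
The eval command above reads three shell variables: $CHECKPOINT, $DATASETS (the task list file), and $DATA_ROOT (the dataset directory). A full run over these tasks can take a long time, so it may help to check them first. The following is a minimal bash sketch under those assumptions; the function name check_eval_inputs is hypothetical and is not part of clip_benchmark.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of clip_benchmark): verify the inputs used by
# the eval command before launching a long run.
check_eval_inputs() {
    local checkpoint=$1 datasets_file=$2 data_root=$3

    # --pretrained receives $CHECKPOINT, so it must not be empty.
    if [ -z "$checkpoint" ]; then
        echo "CHECKPOINT is empty; set it to the value passed to --pretrained" >&2
        return 1
    fi
    # --dataset receives the path of the task list file.
    if [ ! -f "$datasets_file" ]; then
        echo "task list not found: $datasets_file" >&2
        return 1
    fi
    # --dataset_root must point at an existing directory.
    if [ ! -d "$data_root" ]; then
        echo "dataset root is not a directory: $data_root" >&2
        return 1
    fi
    echo "inputs look OK"
}
```

One way to use it is to call check_eval_inputs "$CHECKPOINT" "$DATASETS" "$DATA_ROOT" and start the eval command only if it returns success.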

Zero-Shot Retrieval for PE-AudioVisual

python -m clip_benchmark.cli eval \
    --model pe-av-large \
    --reweight-scale 10 \
    --dataset audiocaps-audio-video audiocaps-audio-text audiocaps-video-text clotho-v2 \
    --dataset_root $DATASETS \
    --output "./benchmark_{pretrained}_{dataset}_{num_frames}_{model}_{language}_{task}.json" \
    --batch_size 4 --no_amp

This will run zero-shot retrieval for the following tasks:

Audiocaps Audio-Video
Audiocaps Audio-Text
Audiocaps Video-Text
Clotho-V2 Audio-Text

Clotho-V2 is downloaded from its original source and unpacked automatically. Because Audiocaps is a YouTube dataset, you need to provide its audio and video files yourself, under $DATASETS/audiocaps/audio and $DATASETS/audiocaps/video respectively. Note that in this command $DATASETS is the directory passed to --dataset_root.
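
Since the Audiocaps files have to be supplied manually, a quick layout check before running the retrieval command can catch a missing or empty folder. This is a minimal bash sketch that assumes only the two directories named above; the function name check_audiocaps_layout is hypothetical and not part of clip_benchmark, and the sketch does not validate file names or formats.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of clip_benchmark): confirm that the Audiocaps
# audio and video folders exist and are non-empty under the dataset root.
check_audiocaps_layout() {
    local root=$1 sub dir status=0
    for sub in audio video; do
        dir="$root/audiocaps/$sub"
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir" >&2
            status=1
        elif [ -z "$(ls -A "$dir")" ]; then
            # ls -A prints nothing for a directory with no entries.
            echo "directory is empty: $dir" >&2
            status=1
        fi
    done
    # Non-zero if any folder was missing or empty.
    return "$status"
}
```

For example, check_audiocaps_layout "$DATASETS" returns a non-zero status and names the offending folder on stderr when either directory is missing or empty.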