# InfoSeek Data Download
This document collects ready-to-run scripts for downloading the InfoSeek dataset into:
`/workspace/xiaobin/RL_dataset/data`
It covers:
- InfoSeek annotations
- InfoSeek KB mapping files
- InfoSeek human set
- Wiki6M text files
- OVEN image snapshot on Hugging Face
- OVEN original-source image download workflow
InfoSeek images are derived from OVEN, so image download is handled through the OVEN release pipeline.
## 1. Recommended Directory Layout
```bash
mkdir -p /workspace/xiaobin/RL_dataset/data/infoseek
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_hf
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_source
```
Suggested usage:
- `/workspace/xiaobin/RL_dataset/data/infoseek`: InfoSeek jsonl files
- `/workspace/xiaobin/RL_dataset/data/oven_hf`: Hugging Face image snapshot files
- `/workspace/xiaobin/RL_dataset/data/oven_source`: upstream OVEN repo for original-source image download
## 2. Proxy Workaround
If your shell is configured with an invalid local proxy such as `127.0.0.1:7890`, use one of these patterns.
Temporarily disable proxy for a single command:
```bash
env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY wget -c URL
```
Or disable proxy for the current shell session:
```bash
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
```
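Before unsetting anything, it can help to see which proxy variables the shell actually exports. The following is a minimal sketch; the function name `show_proxy_vars` is hypothetical and is not part of any tool mentioned in this document.

```bash
# Hypothetical helper: print the exported proxy variables, one per line.
show_proxy_vars() {
  # env lists only exported variables, which are the ones wget and curl inherit.
  # grep exits non-zero when nothing matches, so "|| true" keeps "set -e" scripts alive.
  env | grep -i -E '^(http|https)_proxy=' || true
}

# Usage:
#   show_proxy_vars
```

An empty result suggests that the four variables handled above are not the cause of a connection problem.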
## 3. Download All InfoSeek Text Data With `wget`
This is the simplest full download for the released InfoSeek jsonl files.
```bash
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
ls -lh "${TARGET_DIR}"
```
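The same file list is repeated in several sections of this document. As a sketch of one way to keep it in a single place, the helper below prints the eight URLs used above, one per line, so that either `wget` or `curl` can consume them. The names `BASE_URL` and `infoseek_urls` are hypothetical; the URLs themselves are the ones listed in this section.

```bash
# Base URL taken from the download commands in this document.
BASE_URL="http://storage.googleapis.com/gresearch/open-vision-language"

# Hypothetical helper: print one download URL per line.
infoseek_urls() {
  for name in infoseek_train infoseek_val infoseek_test \
              infoseek_train_withkb infoseek_val_withkb infoseek_human; do
    printf '%s/infoseek/%s.jsonl\n' "${BASE_URL}" "${name}"
  done
  # The Wiki6M files sit one level above the infoseek/ prefix.
  printf '%s/%s\n' "${BASE_URL}" "Wiki6M_ver_1_0.jsonl.gz"
  printf '%s/%s\n' "${BASE_URL}" "Wiki6M_ver_1_0_title_only.jsonl"
}

# Usage (run from the target directory):
#   infoseek_urls | while read -r url; do wget -c "${url}"; done
```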
## 4. Download All InfoSeek Text Data With `curl`
Use this if `wget` is not available.
```bash
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
# -f makes curl exit non-zero on HTTP errors instead of saving the error page as the file.
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
ls -lh "${TARGET_DIR}"
```
## 5. Download Only Core InfoSeek Splits
If you only need the standard train/val/test annotations:
```bash
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
```
## 6. Download Only KB Mapping Files
```bash
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
```
## 7. Download Only Human Eval Set
```bash
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
```
## 8. Download Only Wiki6M Files
```bash
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
```
Optional decompression:
```bash
gunzip -k /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
```
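An interrupted download can leave a truncated `.gz` file behind, so it may be worth testing the archive before decompressing it. The following is a minimal sketch; `verify_gz` is a hypothetical helper name.

```bash
# Hypothetical helper: report whether a .gz archive passes gzip's integrity test.
verify_gz() {
  # gzip -t reads the whole archive and fails on truncation or corruption.
  if gzip -t "$1" 2>/dev/null; then
    echo "ok: $1"
  else
    echo "corrupt or incomplete: $1" >&2
    return 1
  fi
}

# Usage:
#   verify_gz /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
```

If the check fails, rerunning the `wget -c` command from this section should resume the partial file.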
## 9. Download OVEN Image Snapshot From Hugging Face
Upstream OVEN now points image snapshot downloads to the gated dataset `ychenNLP/oven` on Hugging Face. Before downloading:
1. Open `https://huggingface.co/datasets/ychenNLP/oven`
2. Accept the dataset access conditions
3. Log in with the Hugging Face CLI
Install the CLI if needed:
```bash
python -m pip install -U "huggingface_hub[cli]"
```
Login:
```bash
hf auth login
```
Download the image snapshot and mapping file into `/workspace/xiaobin/RL_dataset/data/oven_hf`:
```bash
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"
hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "shard*.tar" \
  --include "all_wikipedia_images.tar" \
  --include "ovenid2impath.csv"
```
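Before extracting, it may be worth confirming that every expected file arrived. The sketch below assumes the file names given in this section (`shard01.tar` through `shard08.tar`, `all_wikipedia_images.tar`, and `ovenid2impath.csv`); if the dataset listing changes, the list needs updating. `missing_oven_files` is a hypothetical helper name.

```bash
# Hypothetical helper: list expected snapshot files that are missing or empty in a directory.
missing_oven_files() {
  hf_dir="$1"
  for name in shard01.tar shard02.tar shard03.tar shard04.tar \
              shard05.tar shard06.tar shard07.tar shard08.tar \
              all_wikipedia_images.tar ovenid2impath.csv; do
    # -s is true only for files that exist and are non-empty.
    [ -s "${hf_dir}/${name}" ] || echo "${name}"
  done
}

# Usage:
#   missing_oven_files /workspace/xiaobin/RL_dataset/data/oven_hf
```

No output means all ten files are present and non-empty. It does not prove that the archives are complete.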
Extract the snapshot tar files:
```bash
#!/usr/bin/env bash
set -euo pipefail
HF_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
IMG_DIR="/workspace/xiaobin/RL_dataset/data/infoseek/images"
mkdir -p "${IMG_DIR}"
for f in "${HF_DIR}"/shard*.tar; do
  tar -xf "${f}" -C "${IMG_DIR}"
done
tar -xf "${HF_DIR}/all_wikipedia_images.tar" -C "${IMG_DIR}"
```
Notes:
- The Hugging Face file listing shows `shard01.tar` through `shard08.tar` plus `all_wikipedia_images.tar`
- The compressed download is very large, roughly 293 GB based on the published file sizes
- Extraction needs additional free space on top of that, because the extraction script leaves the tar files in place
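
Given these sizes, a quick free-space check before starting can save a failed run. The following is a minimal sketch; `has_free_gb` is a hypothetical helper, and the 600 GB threshold in the usage line is an assumption (about twice the compressed size), not a published figure.

```bash
# Hypothetical helper: succeed only if the filesystem holding DIR has at least NEED_GB free.
has_free_gb() {
  dir="$1"
  need_gb="$2"
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is the available space.
  avail_kb="$(df -Pk "${dir}" | awk 'NR==2 {print $4}')"
  [ "${avail_kb}" -ge $((need_gb * 1024 * 1024)) ]
}

# Usage (the directory must already exist):
#   has_free_gb /workspace/xiaobin/RL_dataset/data 600 || echo "not enough free space"
```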
## 10. Download OVEN Images From Original Sources
This follows the upstream `oven_eval/image_downloads` workflow.
### 10.1 Clone the Upstream Repo
```bash
git clone https://github.com/edchengg/oven_eval /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval
```
### 10.2 Run All Source Download Scripts
The upstream image download directory contains these scripts:
- `download_aircraft.sh`
- `download_car196.sh`
- `download_coco.sh`
- `download_food101.sh`
- `download_gldv2.sh`
- `download_imagenet.sh`
- `download_inat.sh`
- `download_oxfordflower.sh`
- `download_sports100.sh`
- `download_sun397.sh`
- `download_textvqa.sh`
- `download_v7w.sh`
- `download_vg.sh`
Run them one by one:
```bash
#!/usr/bin/env bash
set -euo pipefail
cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
bash download_aircraft.sh
bash download_car196.sh
bash download_coco.sh
bash download_food101.sh
bash download_gldv2.sh
bash download_imagenet.sh
bash download_inat.sh
bash download_oxfordflower.sh
bash download_sports100.sh
bash download_sun397.sh
bash download_textvqa.sh
bash download_v7w.sh
bash download_vg.sh
```
Or run them in a loop:
```bash
#!/usr/bin/env bash
set -euo pipefail
cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
for script in download_*.sh; do
  bash "${script}"
done
```
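With `set -e`, the loop above stops at the first script that fails. If you would rather attempt every source and get a summary at the end, a variant along these lines may be useful. It is a sketch only; `run_download_scripts` is a hypothetical helper and is not part of the upstream repo.

```bash
# Hypothetical helper: run every download_*.sh in DIR and report the ones that fail.
run_download_scripts() {
  failed=0
  for script in "$1"/download_*.sh; do
    # Skip the unexpanded pattern when the directory has no matching scripts.
    [ -e "${script}" ] || continue
    # Run in a subshell so each script starts from the scripts directory.
    if (cd "$1" && bash "$(basename "${script}")"); then
      echo "ok: $(basename "${script}")"
    else
      echo "FAILED: $(basename "${script}")"
      failed=1
    fi
  done
  # Non-zero if at least one script failed.
  return "${failed}"
}

# Usage:
#   run_download_scripts /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
```

Any script reported as `FAILED` can then be rerun on its own before the merge step.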
### 10.3 Download `ovenid2impath.csv`
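You need `ovenid2impath.csv` for the merge step. If you already ran the full download in section 9, the file is already in `/workspace/xiaobin/RL_dataset/data/oven_hf`. Otherwise, the current recommended source is the Hugging Face dataset: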
```bash
#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"
hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "ovenid2impath.csv"
```
### 10.4 Merge Into the Final OVEN Image Layout
Run the upstream merge script after all downloads finish:
```bash
cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
python merge_oven_images.py
```
The upstream documentation states that `merge_oven_images.py` should be run after all image download scripts complete and after `ovenid2impath.csv` is available.
## 11. Verify the Downloaded Files
Check text files:
```bash
ls -lh /workspace/xiaobin/RL_dataset/data/infoseek
```
Check Hugging Face snapshot files:
```bash
ls -lh /workspace/xiaobin/RL_dataset/data/oven_hf
```
Check extracted images:
```bash
find /workspace/xiaobin/RL_dataset/data/infoseek/images -type f | wc -l
```
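File sizes alone do not show whether a jsonl file is empty or cut short, so a per-file line count can be a useful extra check. The following is a minimal sketch; `count_jsonl_records` is a hypothetical helper, and this document does not state the expected counts, so compare the output against the upstream release page.

```bash
# Hypothetical helper: print "<line count><TAB><file name>" for each jsonl file in a directory.
count_jsonl_records() {
  for f in "$1"/*.jsonl; do
    # Skip the unexpanded pattern when no jsonl file is present.
    [ -e "${f}" ] || continue
    printf '%s\t%s\n' "$(wc -l < "${f}" | tr -d ' ')" "$(basename "${f}")"
  done
}

# Usage:
#   count_jsonl_records /workspace/xiaobin/RL_dataset/data/infoseek
```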
## 12. Upstream References
- InfoSeek release page: `https://github.com/open-vision-language/infoseek`
- OVEN image download page: `https://github.com/edchengg/oven_eval/tree/main/image_downloads`
- Hugging Face OVEN dataset: `https://huggingface.co/datasets/ychenNLP/oven`
- Hugging Face CLI download docs: `https://huggingface.co/docs/huggingface_hub/guides/cli`