	# InfoSeek Data Download

	This document collects ready-to-run scripts for downloading the InfoSeek dataset into:

	`/workspace/xiaobin/RL_dataset/data`

	It covers:

	- InfoSeek annotations
	- InfoSeek KB mapping files
	- InfoSeek human set
	- Wiki6M text files
	- OVEN image snapshot on Hugging Face
	- OVEN original-source image download workflow

	InfoSeek images are derived from OVEN, so image download is handled through the OVEN release pipeline.

	## 1. Recommended Directory Layout

	```bash
	mkdir -p /workspace/xiaobin/RL_dataset/data/infoseek
	mkdir -p /workspace/xiaobin/RL_dataset/data/oven_hf
	mkdir -p /workspace/xiaobin/RL_dataset/data/oven_source
	```

	Suggested usage:

	- `/workspace/xiaobin/RL_dataset/data/infoseek`: InfoSeek jsonl files
	- `/workspace/xiaobin/RL_dataset/data/oven_hf`: Hugging Face image snapshot files
	- `/workspace/xiaobin/RL_dataset/data/oven_source`: upstream OVEN repo for original-source image download

	## 2. Proxy Workaround

	If your shell is configured with an invalid local proxy such as `127.0.0.1:7890`, use one of these patterns.

	Temporarily disable proxy for a single command:

	```bash
	env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY wget -c URL
	```

	Or disable proxy for the current shell session:

	```bash
	unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
	```
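Before unsetting anything, it can help to see which proxy variables are actually set. The following is a minimal sketch, assuming bash; the function name `show_proxy_vars` is hypothetical and not part of any tool mentioned here.

```bash
# Print every proxy-related variable that is currently set and non-empty.
# Prints nothing when none of them is set.
show_proxy_vars() {
    local name
    for name in http_proxy https_proxy HTTP_PROXY HTTPS_PROXY; do
        # ${!name:-} is bash indirect expansion: the value of the variable
        # whose name is stored in $name, or empty if it is unset.
        if [ -n "${!name:-}" ]; then
            printf '%s=%s\n' "${name}" "${!name}"
        fi
    done
}
```

Calling `show_proxy_vars` with no output suggests the shell is not routing downloads through a proxy.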

	## 3. Download All InfoSeek Text Data With `wget`

	This is the simplest full download for the released InfoSeek jsonl files.

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
	mkdir -p "${TARGET_DIR}"
	cd "${TARGET_DIR}"

	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

	ls -lh "${TARGET_DIR}"
	```
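The same file list can also be generated from one base URL, which makes it easier to reuse in the subset downloads. This is only a sketch: the base URL and file names are copied from the commands above, while the `BASE_URL` variable and the `infoseek_urls` function are hypothetical names introduced for illustration.

```bash
# Base URL shared by all released text files (taken from the wget commands above).
BASE_URL="http://storage.googleapis.com/gresearch/open-vision-language"

# Print one download URL per line.
infoseek_urls() {
    local name
    for name in \
        infoseek/infoseek_train.jsonl \
        infoseek/infoseek_val.jsonl \
        infoseek/infoseek_test.jsonl \
        infoseek/infoseek_train_withkb.jsonl \
        infoseek/infoseek_val_withkb.jsonl \
        infoseek/infoseek_human.jsonl \
        Wiki6M_ver_1_0.jsonl.gz \
        Wiki6M_ver_1_0_title_only.jsonl; do
        printf '%s/%s\n' "${BASE_URL}" "${name}"
    done
}
```

From inside the target directory, `infoseek_urls | wget -c -i -` feeds the list to `wget`, which reads URLs from standard input when `-i -` is given.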

	## 4. Download All InfoSeek Text Data With `curl`

	Use this if `wget` is not available.

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
	mkdir -p "${TARGET_DIR}"
	cd "${TARGET_DIR}"

	# -f makes curl exit non-zero on HTTP errors instead of saving the error page as the output file.
	curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
	curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
	curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
	curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
	curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
	curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
	curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
	curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

	ls -lh "${TARGET_DIR}"
	```

	## 5. Download Only Core InfoSeek Splits

	If you only need the standard train/val/test annotations:

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
	mkdir -p "${TARGET_DIR}"
	cd "${TARGET_DIR}"

	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
	```

	## 6. Download Only KB Mapping Files

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
	mkdir -p "${TARGET_DIR}"
	cd "${TARGET_DIR}"

	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
	```

	## 7. Download Only Human Eval Set

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
	mkdir -p "${TARGET_DIR}"
	cd "${TARGET_DIR}"

	wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
	```

	## 8. Download Only Wiki6M Files

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
	mkdir -p "${TARGET_DIR}"
	cd "${TARGET_DIR}"

	wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
	wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
	```

	Optional decompression:

	```bash
	gunzip -k /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
	```
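Because the archive is large, an interrupted transfer is a realistic failure mode. A possible sanity check, sketched below, tests the gzip stream and counts records without writing the decompressed file to disk. The function names `gz_is_intact` and `gz_line_count` are hypothetical, and the line count equals the record count only under the assumption of one JSON object per line.

```bash
# Return 0 if the gzip archive passes its integrity check, non-zero otherwise.
gz_is_intact() {
    gzip -t "$1" 2>/dev/null
}

# Print the number of lines in the archive by streaming the decompressed
# data through wc, so no decompressed copy is written to disk.
gz_line_count() {
    gzip -dc "$1" | wc -l | tr -d ' '
}
```

For example, `gz_is_intact /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz && echo OK` prints `OK` only when the archive decompresses cleanly.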

	## 9. Download OVEN Image Snapshot From Hugging Face

	Upstream OVEN now points image snapshot downloads to the gated dataset `ychenNLP/oven` on Hugging Face. Before downloading:

	1. Open `https://huggingface.co/datasets/ychenNLP/oven`
	2. Accept the dataset access conditions
	3. Log in with the Hugging Face CLI

	Install the CLI if needed:

	```bash
	python -m pip install -U "huggingface_hub[cli]"
	```

	Login:

	```bash
	hf auth login
	```

	Download the image snapshot and mapping file into `/workspace/xiaobin/RL_dataset/data/oven_hf`:

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
	mkdir -p "${TARGET_DIR}"

	hf download ychenNLP/oven \
	  --repo-type dataset \
	  --local-dir "${TARGET_DIR}" \
	  --include "shard*.tar" \
	  --include "all_wikipedia_images.tar" \
	  --include "ovenid2impath.csv"
	```

	Extract the snapshot tar files:

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	HF_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
	IMG_DIR="/workspace/xiaobin/RL_dataset/data/infoseek/images"
	mkdir -p "${IMG_DIR}"

	for f in "${HF_DIR}"/shard*.tar; do
	  tar -xf "${f}" -C "${IMG_DIR}"
	done

	tar -xf "${HF_DIR}/all_wikipedia_images.tar" -C "${IMG_DIR}"
	```

	Notes:

	- Hugging Face file listing shows `shard01.tar` to `shard08.tar` plus `all_wikipedia_images.tar`
	- The compressed download is very large, roughly 293 GB based on the published file sizes
	- You need additional free space for extraction
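Given the size noted above, it may be worth checking free space before downloading or extracting. The sketch below uses POSIX `df` output; the function name `has_free_space` is hypothetical, and the threshold you pass is your own estimate rather than a figure published upstream.

```bash
# Succeed if the filesystem holding DIR has at least REQUIRED_GB GiB available.
has_free_space() {
    local dir="$1" required_gb="$2" avail_kb
    # df -P prints one unwrapped line per filesystem; with -k, column 4 of
    # the second line is the available space in 1024-byte blocks.
    avail_kb="$(df -Pk "${dir}" | awk 'NR==2 {print $4}')"
    [ "${avail_kb}" -ge $(( required_gb * 1024 * 1024 )) ]
}
```

A call such as `has_free_space /workspace/xiaobin/RL_dataset/data 600 || echo "not enough space"` uses 600 purely as an illustrative threshold covering the tar files plus their extracted contents.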

	## 10. Download OVEN Images From Original Sources

	This follows the upstream `oven_eval/image_downloads` workflow.

	### 10.1 Clone the Upstream Repo

	```bash
	git clone https://github.com/edchengg/oven_eval /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval
	```

	### 10.2 Run All Source Download Scripts

	The upstream image download directory contains these scripts:

	- `download_aircraft.sh`
	- `download_car196.sh`
	- `download_coco.sh`
	- `download_food101.sh`
	- `download_gldv2.sh`
	- `download_imagenet.sh`
	- `download_inat.sh`
	- `download_oxfordflower.sh`
	- `download_sports100.sh`
	- `download_sun397.sh`
	- `download_textvqa.sh`
	- `download_v7w.sh`
	- `download_vg.sh`

	Run them one by one:

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

	bash download_aircraft.sh
	bash download_car196.sh
	bash download_coco.sh
	bash download_food101.sh
	bash download_gldv2.sh
	bash download_imagenet.sh
	bash download_inat.sh
	bash download_oxfordflower.sh
	bash download_sports100.sh
	bash download_sun397.sh
	bash download_textvqa.sh
	bash download_v7w.sh
	bash download_vg.sh
	```

	Or run them in a loop:

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

	for script in download_*.sh; do
	  bash "${script}"
	done
	```
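With `set -e`, the loop above stops at the first failing script. If you would rather attempt every source and collect the failures for a later retry, something like the following sketch could work. The function name `run_download_scripts` is hypothetical, and it assumes each upstream script signals failure through a non-zero exit status.

```bash
# Run every download_*.sh in DIR, continuing after failures.
# Prints "FAILED: <name>" for each script that exits non-zero and
# returns non-zero if at least one script failed.
run_download_scripts() {
    local dir="$1" script failed=0
    for script in "${dir}"/download_*.sh; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "${script}" ] || continue
        # Run in a subshell so the cd does not leak into the caller.
        if ! ( cd "${dir}" && bash "$(basename "${script}")" ); then
            echo "FAILED: $(basename "${script}")"
            failed=1
        fi
    done
    return "${failed}"
}
```

Usage: `run_download_scripts /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads`, then rerun the scripts it reports before moving on to the merge step.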

	### 10.3 Download `ovenid2impath.csv`

	You need `ovenid2impath.csv` for the merge step. The current recommended source is the Hugging Face dataset:

	```bash
	#!/usr/bin/env bash
	set -euo pipefail

	TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
	mkdir -p "${TARGET_DIR}"

	hf download ychenNLP/oven \
	  --repo-type dataset \
	  --local-dir "${TARGET_DIR}" \
	  --include "ovenid2impath.csv"
	```

	### 10.4 Merge Into the Final OVEN Image Layout

	Run the upstream merge script after all downloads finish:

	```bash
	cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
	python merge_oven_images.py
	```

	The upstream documentation states that `merge_oven_images.py` should be run after all image download scripts complete and after `ovenid2impath.csv` is available.

	## 11. Verify the Downloaded Files

	Check text files:

	```bash
	ls -lh /workspace/xiaobin/RL_dataset/data/infoseek
	```

	Check Hugging Face snapshot files:

	```bash
	ls -lh /workspace/xiaobin/RL_dataset/data/oven_hf
	```

	Check extracted images:

	```bash
	find /workspace/xiaobin/RL_dataset/data/infoseek/images -type f | wc -l
	```
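Listing a directory shows what is there but not what is absent. As a complement, the sketch below reports any of the eight text files from the full download that is missing or empty. The function name `missing_infoseek_files` is hypothetical; the file names are the ones used in the download commands in this document.

```bash
# Print the name of every expected text file that is missing or empty in DIR.
# Returns non-zero if at least one file is missing or empty.
missing_infoseek_files() {
    local dir="$1" name missing=0
    for name in \
        infoseek_train.jsonl \
        infoseek_val.jsonl \
        infoseek_test.jsonl \
        infoseek_train_withkb.jsonl \
        infoseek_val_withkb.jsonl \
        infoseek_human.jsonl \
        Wiki6M_ver_1_0.jsonl.gz \
        Wiki6M_ver_1_0_title_only.jsonl; do
        # -s is true only for files that exist and have a size greater than zero.
        if [ ! -s "${dir}/${name}" ]; then
            echo "${name}"
            missing=1
        fi
    done
    return "${missing}"
}
```

Usage: `missing_infoseek_files /workspace/xiaobin/RL_dataset/data/infoseek && echo "all text files present"`. If you downloaded only a subset of the files, the others are expected to be reported.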

	## 12. Upstream References

	- InfoSeek release page: `https://github.com/open-vision-language/infoseek`
	- OVEN image download page: `https://github.com/edchengg/oven_eval/tree/main/image_downloads`
	- Hugging Face OVEN dataset: `https://huggingface.co/datasets/ychenNLP/oven`
	- Hugging Face CLI download docs: `https://huggingface.co/docs/huggingface_hub/guides/cli`