---
license: mit
inference: false
pipeline_tag: image-to-text
tags:
- image-captioning
---
# FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

A framework designed to generate semantically rich image captions.
## Resources

- 💻 **Project Page**: For more details, visit the official [project page](https://rotsteinnoam.github.io/FuseCap/).
- 📄 **Read the Paper**: You can find the paper [here](https://arxiv.org/abs/2305.17718).
- 🚀 **Demo**: Try out our BLIP-based model, trained using FuseCap, in this [demo](https://huggingface.co/spaces/noamrot/FuseCap).
- 📂 **Code Repository**: The code for FuseCap can be found in the [GitHub repository](https://github.com/RotsteinNoam/FuseCap).
- 🗂️ **Datasets**: The fused-caption datasets can be accessed [here](https://github.com/RotsteinNoam/FuseCap#datasets).
#### Running the model

Our BLIP-based model can be run using the following code:
```python
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the FuseCap-trained BLIP captioning model and its processor
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = BlipProcessor.from_pretrained("noamrot/FuseCap")
model = BlipForConditionalGeneration.from_pretrained("noamrot/FuseCap").to(device)

# Fetch an example image
img_url = 'https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Condition generation on a short text prompt
text = "a picture of "
inputs = processor(raw_image, text, return_tensors="pt").to(device)

# Generate the caption with beam search and decode it
out = model.generate(**inputs, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))
```
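
Several images can also be captioned in one call by batching inputs through the processor. This is a minimal sketch, not part of the official example: it assumes the `processor`, `model`, and `device` objects from the snippet above, and the `images` list is a placeholder for your own PIL images.

```python
# Minimal batching sketch (assumes `processor`, `model`, and `device`
# from the snippet above; the `images` list is a placeholder)
images = [raw_image, raw_image]  # replace with your own PIL images
prompts = ["a picture of "] * len(images)

# The processor pads the tokenized prompts so they can share one batch
inputs = processor(images=images, text=prompts,
                   return_tensors="pt", padding=True).to(device)

out = model.generate(**inputs, num_beams=3)
for seq in out:
    print(processor.decode(seq, skip_special_tokens=True))
```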

## Upcoming Updates

The official codebase, datasets, and trained models for this project will be released soon.
## BibTeX

```bibtex
@inproceedings{rotstein2024fusecap,
  title={{FuseCap}: Leveraging Large Language Models for Enriched Fused Image Captions},
  author={Rotstein, Noam and Bensa{\"\i}d, David and Brody, Shaked and Ganz, Roy and Kimmel, Ron},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={5689--5700},
  year={2024}
}
```