Real-Time Open-Vocabulary Object Detection

Tianheng Cheng^2,3,*, Lin Song^1,*,📧, Yixiao Ge^1,2,⭐, Wenyu Liu^3, Xinggang Wang^3,📧, Ying Shan^1,2

^1 Tencent AI Lab, ^2 ARC Lab, Tencent PCG, ^3 Huazhong University of Science and Technology

^*Equal Contribution ^📧 Corresponding Author ^⭐ Project Lead

CVPR 2024

arXiv Code 🤗 HuggingFace 🤗 YOLO-World-EfficientSAM

For business licensing and other related inquiries, don't hesitate to contact us.

🔥 What's New

[2024-7-8]: YOLO-World has now been integrated into ComfyUI! Try adding YOLO-World to your workflow. You can access it at StevenGrove/ComfyUI-YOLOWorld.
[2024-5-18]: YOLO-World models have been integrated with the FiftyOne computer vision toolkit for streamlined open-vocabulary inference across image and video datasets.
[2024-5-16]: Hey guys! Long time no see! This update contains (1) a fine-tuning guide and (2) TFLite export with INT8 quantization.
[2024-5-9]: This update contains the real reparameterization 🪄, which is better for fine-tuning on custom datasets and improves training/inference efficiency 🚀!
[2024-3-18]: We are excited to announce that YOLO-World has been accepted by CVPR 2024. Hope to see you in Seattle! YOLO-World now supports prompt tuning, image prompts, high-resolution images (1280x1280), and ONNX export.
[2024-2-18]: We thank @SkalskiP for developing the wonderful segmentation demo by connecting YOLO-World and EfficientSAM. You can try it now at the 🤗 HuggingFace Spaces.
[2024-2-17]: We have released the code & models for YOLO-World-Seg! YOLO-World now supports open-vocabulary / zero-shot object segmentation!
[2024-2-10]: We provide the fine-tuning and data details for fine-tuning YOLO-World on the COCO dataset or custom datasets!
[2024-2-3]: We now support the Gradio demo in the repo, and you can build the YOLO-World demo on your own device!
[2024-2-1]: We have released the code and models of YOLO-World.
[2024-1-31]: The technical report of YOLO-World is available now!

🤗 Demo

Video Guide

Introduction of YOLO-World!

Thanks to @SkalskiP for contributing the video guide about YOLO-World!

📖 Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.

YOLO-World

🌟 Highlights

YOLO-World is the next generation of YOLO detectors, aiming for real-time open-vocabulary object detection.
YOLO-World is pre-trained on large-scale vision-language datasets, including Objects365, GQA, Flickr30K, and CC3M, which empowers YOLO-World with strong zero-shot open-vocabulary capability and grounding ability in images.
YOLO-World achieves fast inference speeds, and we present re-parameterization techniques for faster inference and deployment given users' vocabularies.
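
The "offline vocabulary" idea in the last highlight can be illustrated with a minimal sketch: text embeddings for a fixed vocabulary are computed once and cached, so inference only needs a similarity lookup against the stored vectors. This is not the actual re-parameterization used in YOLO-World, which folds the embeddings into network weights. `toy_text_encoder` is a hypothetical stand-in for a real text encoder, and `OfflineVocabulary` is an illustrative name that does not come from the YOLO-World code base.

```python
import math
from typing import Callable, List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def toy_text_encoder(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for a real text encoder: a character-hash embedding."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    return l2_normalize(vec)


class OfflineVocabulary:
    """Encodes a fixed vocabulary once and reuses the embeddings at inference."""

    def __init__(self, names: List[str], encoder: Callable[[str], Vector]):
        self.names = list(names)
        self.encoder_calls = 0  # counts how often the text encoder was invoked
        self.embeddings: List[Vector] = []
        for name in self.names:
            self.embeddings.append(encoder(name))
            self.encoder_calls += 1

    def classify(self, region_embedding: Vector) -> str:
        """Return the vocabulary entry with the highest cosine similarity."""
        region = l2_normalize(region_embedding)
        scores = [sum(r * t for r, t in zip(region, emb)) for emb in self.embeddings]
        return self.names[scores.index(max(scores))]


vocab = OfflineVocabulary(["dog", "bicycle", "traffic light"], toy_text_encoder)
# Pretend region feature: the toy encoder is reused so the match is exact.
fake_region = toy_text_encoder("bicycle")
prediction = vocab.classify(fake_region)
print(prediction, vocab.encoder_calls)
```

However many regions are classified afterwards, the text encoder is called only once per vocabulary entry, which is the property that makes a fixed vocabulary cheap at inference time.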

⚙️ Framework

YOLO-World builds on the YOLO detector with a frozen CLIP-based text encoder that extracts text embeddings from the input texts, e.g., object categories or noun phrases.
YOLO-World contains a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) to facilitate the interaction between multi-scale image features and text embeddings. RepVL-PAN can re-parameterize the user's offline vocabularies into the model parameters for fast inference and deployment.
YOLO-World is pre-trained on large-scale region-text datasets with the region-text contrastive loss to learn the region-level alignment between vision and language. For ordinary image-text datasets, e.g., CC3M, we adopt an automatic labeling approach to generate pseudo region-text pairs.
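
As a rough illustration of a region-text contrastive objective like the one mentioned above, the sketch below scores each region embedding against every text embedding with cosine similarity and applies a softmax cross-entropy against the index of the matched text. The temperature value, the toy embeddings, and the function names are assumptions made for illustration; the loss in the technical report may differ in details such as similarity scaling and label assignment.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def region_text_contrastive_loss(
    region_embs: List[Vector],
    text_embs: List[Vector],
    targets: List[int],
    temperature: float = 0.1,  # assumed value, chosen only for this example
) -> float:
    """Mean cross-entropy over regions; targets[i] is the matched text index."""
    total = 0.0
    for region, target in zip(region_embs, targets):
        logits = [cosine(region, text) / temperature for text in text_embs]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[target]
    return total / len(targets)


# Toy data: three "text" embeddings and two region embeddings.
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
regions = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]

loss_aligned = region_text_contrastive_loss(regions, texts, targets=[0, 1])
loss_misaligned = region_text_contrastive_loss(regions, texts, targets=[2, 0])
print(loss_aligned, loss_misaligned)
```

The loss is low when each region is closest to its assigned text and high otherwise, which is the signal that pushes region features and text embeddings into alignment.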

Please see our technical report for more details.

📊 Performance

1. Zero-Shot Evaluation on LVIS

We compare the zero-shot performance on LVIS (minival) of recent open-vocabulary detectors:

Method	Backbone	Pre-trained Data	FPS(V100)	AP	AP_r
GLIP-T	Swin-T	O365,GoldG	0.12	24.9	17.7
GLIP-T	Swin-T	O365,GoldG,Cap4M	0.12	26.0	20.8
GLIPv2-T	Swin-T	O365,GoldG	0.12	26.9	-
GLIPv2-T	Swin-T	O365,GoldG,Cap4M	0.12	29.0	-
GroundingDINO-T	Swin-T	O365,GoldG	1.5	25.6	14.4
GroundingDINO-T	Swin-T	O365,GoldG,Cap4M	1.5	27.4	18.1
DetCLIP-T	Swin-T	O365,GoldG	2.3	34.4	26.9
YOLO-World-S	YOLOv8-S	O365,GoldG	74.1	26.2	19.1
YOLO-World-M	YOLOv8-M	O365,GoldG	58.1	31.0	23.8
YOLO-World-L	YOLOv8-L	O365,GoldG	52.0	35.0	27.1
YOLO-World-L	YOLOv8-L	O365,GoldG,CC-250K	52.0	35.4	27.6

Zero-shot Evaluation on LVIS minival
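
To make the speed gap in the table concrete, the snippet below computes how many times faster YOLO-World-L runs than the other detectors pre-trained on O365 and GoldG. The FPS and AP values are copied from the table; the variable and function names are illustrative.

```python
from typing import Dict, List, Tuple

# (method, FPS on V100, AP) for models pre-trained on O365 + GoldG,
# copied from the zero-shot LVIS minival table.
rows: List[Tuple[str, float, float]] = [
    ("GLIP-T", 0.12, 24.9),
    ("GLIPv2-T", 0.12, 26.9),
    ("GroundingDINO-T", 1.5, 25.6),
    ("DetCLIP-T", 2.3, 34.4),
    ("YOLO-World-L", 52.0, 35.0),
]


def speedup_over(
    table: List[Tuple[str, float, float]], reference: str
) -> Dict[str, float]:
    """Return the reference model's FPS divided by each other model's FPS."""
    fps = {name: value for name, value, _ in table}
    ref_fps = fps[reference]
    return {name: ref_fps / value for name, value in fps.items() if name != reference}


speedups = speedup_over(rows, "YOLO-World-L")
for name, ratio in speedups.items():
    print(f"YOLO-World-L vs {name}: {ratio:.1f}x faster")
```

In this subset, YOLO-World-L is more than 20 times faster than the next-fastest detector (DetCLIP-T) while reporting a slightly higher AP (35.0 vs. 34.4).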

2. Speed and Accuracy Curve

We compare the speed-accuracy curves of the pre-trained YOLO-World versus recent open-vocabulary detectors on zero-shot LVIS evaluation:

3. Visualizations

We provide some visualization results generated by the pre-trained YOLO-World-L:

(a) Visualization Results on Zero-shot Inference on LVIS

(b) Visualization Results on User’s Vocabulary

BibTeX

If you find YOLO-World useful in your research or applications, please consider giving us a citation.


        @article{cheng2024yolow,
          title={YOLO-World: Real-Time Open-Vocabulary Object Detection},
          author={Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying},
          journal={arXiv preprint arXiv:},
          year={2024}
        }

Acknowledgement

This website is adapted from Nerfies, LLaVA, and ShareGPT4V, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.