How to Build Semantic Image Search with OpenAI CLIP

This guide walks you through building a semantic image search engine using OpenAI CLIP and Flask. By combining CLIP's visual-language embeddings with fast cosine similarity search powered by NumPy, you can build a web interface that retrieves relevant images from natural language queries, no labels or categories required.

Watch: How Similarity Search Works | Visual Search Using OpenAI CLIP and the Ultralytics Package 🎉

Flask webpage with semantic search results overview

The Ultralytics Python package wraps this entire pipeline behind two classes, so you can launch a working search app or run queries programmatically in a few lines. This guide covers why semantic search is useful, how it works, running the web app, searching programmatically, and configuring parameters.

Why Use Semantic Image Search?

Building your own semantic image search system with CLIP provides several compelling advantages:

Zero-shot capabilities: You don't need to train on your dataset. CLIP's zero-shot learning lets you query any image collection with free-form natural language, saving time and resources.
Human-like understanding: Unlike keyword search, CLIP understands semantic context and retrieves images from abstract, emotional, or relational queries like "a happy child in nature" or "a futuristic city skyline at night."
No labels or metadata: This approach needs only raw images. CLIP generates embeddings without any manual annotation.
Lightweight and exact search: A single normalized matrix multiplication in NumPy ranks every image by cosine similarity, giving exact results with real-time response across thousands of embeddings and no extra search dependency to install or manage.
Cross-domain applications: Whether you're building a personal photo archive, a creative inspiration tool, a product search engine, or an art recommendation system, the same stack adapts with minimal tweaking.

How Semantic Image Search Works

The pipeline combines three components, each handling one stage of turning images and text into ranked results:

CLIP uses a vision encoder (e.g., ResNet or ViT) for images and a text encoder (Transformer-based) for language to project both into the same multimodal embedding space. This allows direct comparison between text and images using cosine similarity.
NumPy stores the image embeddings as a single array and ranks them against a query embedding with one matrix multiplication, returning the closest vectors by cosine similarity with no extra indexing dependency.
Flask provides a simple web interface to submit natural language queries and display semantically matched images from the index.

OpenAI CLIP image retrieval workflow

Because both images and text land in the same vector space, retrieval is zero-shot: you don't need labels or categories, just image data and a good prompt.
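The comparison described above can be written out explicitly. The notation is introduced here for illustration only and does not come from the package: $q$ is the query embedding, $x_i$ is the embedding of the $i$-th image, and $X$ stacks all $N$ image embeddings as rows.

```latex
\[
\operatorname{sim}(q, x_i) = \frac{q \cdot x_i}{\lVert q \rVert \, \lVert x_i \rVert}
\]
% If every embedding is L2-normalized, so that \lVert q \rVert = \lVert x_i \rVert = 1,
% the denominator equals 1 and all N scores come from a single matrix-vector product:
\[
s = X q, \qquad X \in \mathbb{R}^{N \times d}, \quad q \in \mathbb{R}^{d}, \quad s_i = x_i \cdot q
\]
```

Sorting the entries of $s$ in descending order then gives the ranked list of images.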

Run the Semantic Search Web App

The SearchApp class launches the full Flask interface. On first run it downloads a sample image set, builds the embedding index, and serves a page where you can type a query and view ranked results.

Image Path Warning

If you're using your own images, make sure to provide an absolute path to the image directory. Otherwise, the images may not appear on the webpage due to Flask's file serving limitations.

from ultralytics import solutions

app = solutions.SearchApp(
    # data = "path/to/img/directory" # Optional, build search engine with your own images
    device="cpu"  # configure the device for processing, e.g., "cpu" or "cuda"
)

app.run(debug=False)  # You can also use `debug=True` argument for testing
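Following the image path warning above, one way to guarantee an absolute path is to resolve it with `pathlib` before handing it to `SearchApp`. This is a minimal sketch: `resolve_image_dir` is a hypothetical helper, not part of the Ultralytics package, and `my_photos` is a placeholder directory name.

```python
from pathlib import Path


def resolve_image_dir(path):
    """Return an absolute version of `path` (hypothetical helper, not an Ultralytics API)."""
    # expanduser() handles "~"; resolve() makes the path absolute and collapses ".." segments.
    return str(Path(path).expanduser().resolve())


image_dir = resolve_image_dir("my_photos")  # placeholder directory name
print(image_dir)

# The result could then be passed as the `data` argument, for example:
# app = solutions.SearchApp(data=image_dir, device="cpu")
```

Resolving the path once at startup means the app does not depend on the working directory the script happens to be launched from.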

Search Images Programmatically

The VisualAISearch class performs all the backend operations without the web layer:

Loads or builds an embedding index from local images.
Extracts image and text embeddings using CLIP.
Performs similarity search using cosine similarity.

Call the searcher with a natural language query to get back a list of matching image filenames ranked by similarity:

from ultralytics import solutions

searcher = solutions.VisualAISearch(
    # data = "path/to/img/directory" # Optional, build search engine with your own images
    device="cpu"  # configure the device for processing, e.g., "cpu" or "cuda"
)

results = searcher("a dog sitting on a bench")

# Ranked Results:
#     - 000000546829.jpg | Similarity: 0.3269
#     - 000000549220.jpg | Similarity: 0.2899
#     - 000000517069.jpg | Similarity: 0.2761
#     - 000000029393.jpg | Similarity: 0.2742
#     - 000000534270.jpg | Similarity: 0.2680
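Once you have the ranked list, a common next step is to turn the best matches into file paths for display or copying. The sketch below assumes the returned filenames are relative to the image directory (here the documented default, `images`); `top_paths` is a hypothetical helper, and the `results` list is a stand-in copied from the sample output above so the snippet runs without loading a model.

```python
from pathlib import Path

# Stand-in for the list returned by `searcher(...)`; names copied from the sample output.
results = [
    "000000546829.jpg",
    "000000549220.jpg",
    "000000517069.jpg",
    "000000029393.jpg",
    "000000534270.jpg",
]


def top_paths(filenames, image_dir, k=3):
    """Join the k best-ranked filenames onto the image directory (hypothetical helper)."""
    return [Path(image_dir) / name for name in filenames[:k]]


for rank, path in enumerate(top_paths(results, "images"), start=1):
    print(rank, path)
```

Because the list is already sorted by similarity, slicing the first `k` entries is enough to get the top matches.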

Configure VisualAISearch Parameters

The table below outlines the available parameters for VisualAISearch:

Argument	Type	Default	Description
`data`	`str`	`'images'`	Path to image directory used for similarity search.
`device`	`str`	`None`	Specifies the device for inference (e.g., `cpu`, `cuda:0` or `0`). Allows users to select between CPU, a specific GPU, or other compute devices for model execution.

Manage your data in the cloud

To search image collections at production scale without managing local files, you can organize and version your images in the Ultralytics Platform before indexing them with CLIP.

Conclusion

With CLIP and the Ultralytics Python package, you can stand up a zero-shot semantic image search engine in just a few lines, either as a Flask web app or as a programmatic search backend. From here, set the `data` argument to your own image directory to index it, then explore other Ultralytics Solutions to build on top of your computer vision workflows.

FAQ

How does CLIP understand both images and text?

CLIP (Contrastive Language-Image Pre-training) is a model developed by OpenAI that learns to connect visual and linguistic information. It's trained on a massive dataset of images paired with natural language captions. This training allows it to map both images and text into a shared embedding space, so you can compare them directly using vector similarity.

Why is CLIP considered so powerful for AI tasks?

What makes CLIP stand out is its ability to generalize. Instead of being trained just for specific labels or tasks, it learns from natural language itself. This allows it to handle flexible queries like "a man riding a jet ski" or "a surreal dreamscape," making it useful for everything from classification to creative semantic search, without retraining.

How are images ranked against a text query?

Once CLIP turns your images into embeddings, the Ultralytics package L2-normalizes them and stores them in a single NumPy array. A query is ranked with one matrix multiplication that computes the cosine similarity between the query embedding and every image embedding, then sorts the scores. This brute-force search is exact and fast for typical image collections, with no extra vector-database dependency to install or manage.
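The ranking step described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the package's actual implementation: `l2_normalize` and `rank_images` are hypothetical names, and the toy 3-dimensional vectors stand in for real CLIP embeddings.

```python
import numpy as np


def l2_normalize(x):
    """Scale each vector to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rank_images(image_embeddings, query_embedding, k=3):
    """Return (indices, scores) of the k most similar rows, best first."""
    images = l2_normalize(np.asarray(image_embeddings, dtype=np.float32))
    query = l2_normalize(np.asarray(query_embedding, dtype=np.float32))
    scores = images @ query  # one matrix-vector product yields every cosine similarity
    order = np.argsort(-scores)[:k]  # sort descending, keep the top k
    return order, scores[order]


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [1.0, 0.2, 0.0]

indices, scores = rank_images(image_embeddings, query_embedding)
for i, s in zip(indices, scores):
    print(f"image {i}: similarity {s:.4f}")
```

The sketch does not guard against zero-length vectors, which would cause a division by zero during normalization; real embeddings from CLIP are not expected to hit that case.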

Why use the Ultralytics Python package if CLIP is from OpenAI?

While CLIP is developed by OpenAI, the Ultralytics Python package wraps embedding generation, indexing, and cosine-similarity search into a complete semantic image search pipeline that runs in a few lines of code:

from ultralytics import solutions

searcher = solutions.VisualAISearch(
    # data = "path/to/img/directory" # Optional, build search engine with your own images
    device="cpu"  # configure the device for processing, e.g., "cpu" or "cuda"
)

results = searcher("a dog sitting on a bench")

This high-level implementation handles:

CLIP-based image and text embedding generation.
Embedding index creation and management.
Efficient semantic search with cosine similarity.
Directory-based image loading and visualization.

Can I customize the frontend of this app?

Yes. The current setup uses Flask with a basic HTML frontend, but you can replace it with your own HTML or build a more dynamic UI with React, Vue, or another frontend framework. Flask can serve as the backend API for your custom interface.
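As a rough sketch of the backend-API idea, the snippet below exposes a search callable through a JSON endpoint that a React or Vue frontend could call. Everything specific here is an assumption for illustration: the `create_app` factory, the `/api/search` route, and the `q` parameter are hypothetical names, and `fake_searcher` is a stub standing in for a `VisualAISearch` instance so the example runs without downloading a model.

```python
from flask import Flask, jsonify, request


def create_app(searcher):
    """Build a Flask app that wraps any callable mapping a query string to filenames."""
    app = Flask(__name__)

    @app.route("/api/search")  # hypothetical route name
    def search():
        query = request.args.get("q", "").strip()
        if not query:
            return jsonify(error="missing query parameter 'q'"), 400
        return jsonify(query=query, results=list(searcher(query)))

    return app


def fake_searcher(query):
    """Stub standing in for a VisualAISearch instance; returns fixed filenames."""
    return ["000000546829.jpg", "000000549220.jpg"]


app = create_app(fake_searcher)
client = app.test_client()  # exercises the endpoint without starting a server
response = client.get("/api/search", query_string={"q": "a dog sitting on a bench"})
print(response.get_json())
```

Passing the searcher into the factory keeps the web layer independent of the model, so the same endpoint can be tested with a stub and served with the real searcher.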

Is it possible to search through videos instead of static images?

Not directly. A simple workaround is to extract individual frames from your videos (e.g., one every second), treat them as standalone images, and feed those into the system. This way, the search engine can semantically index visual moments from your videos.
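The sampling part of that workaround can be sketched without any video library. The hypothetical helper below only computes which frame indices to keep for a given frame rate and interval; decoding those frames and saving them as images would be done separately with a video library such as OpenCV, which is not shown here.

```python
def frame_indices(total_frames, fps, every_seconds=1.0):
    """Return indices of frames to keep when sampling one frame every `every_seconds`."""
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    step = max(1, round(fps * every_seconds))  # frames between consecutive samples
    return list(range(0, total_frames, step))


# Example: a 10-second clip at 30 fps, sampled once per second.
indices = frame_indices(total_frames=300, fps=30)
print(len(indices), indices[:3])
```

Saving each sampled frame with its index or timestamp in the filename makes it possible to map a search hit back to the moment in the video it came from.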

Contributors

glenn-jocher, RizwanMunawar, raimbekovm, pderrenger

Created May 4, 2025. Updated 3 weeks ago.