Cross-Modal Compatibility: Evaluating Embeddings for Efficient Data Retrieval in Film Projects

This study contributes to the understanding of cross-modal data compatibility and presents optimized configurations for improving retrieval performance in film production environments. The insights gained can be applied to broader domains requiring multi-modal data management, such as media archiving, automated indexing, and AI-driven content recommendation.

Text-Only RAG vs Multi-Modal RAG

Initial situation

Multi-modal (omni) models, which work with different data modalities such as text, images, audio, video, tables, and network graphs, have gained significant attention recently. However, a key question arises: How can these diverse data modalities be represented meaningfully? This issue is crucial in film projects, where massive datasets must be accessed and searched during quality control.

Techniques like Retrieval Augmented Generation (RAG) are widely used for text data. RAG relies on embeddings, i.e. converting text into vector form, to perform efficient retrieval. Nonetheless, adapting this to a multi-modal setting presents several challenges. To illustrate, how do we determine the optimal size of text chunks for embeddings, or the right image patch size for vision models?
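The sketch below illustrates this text-only setting: a transcript is split into fixed-size chunks, the chunks are embedded, and a query is ranked against them by cosine similarity. The sentence-transformers model name and the chunk size of 54 words are illustrative assumptions, not the configuration used in this project.

```python
# Minimal text-only RAG retrieval sketch: chunk a transcript, embed the
# chunks, and rank them against a query by cosine similarity.
# Model name and chunk size are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, chunk_size: int = 54) -> list[str]:
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

document = "A long scene description or subtitle transcript goes here ..."
chunks = chunk_text(document)

chunk_emb = model.encode(chunks, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode("actor walks into a dark corridor",
                         convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(query_emb, chunk_emb)[0]      # similarity to every chunk
best = scores.topk(min(3, len(chunks)))             # top-3 chunks for the query
for score, idx in zip(best.values, best.indices):
    print(f"{float(score):.3f}  {chunks[int(idx)][:60]}")
```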

One way to tackle this issue is to explore different embeddings in a cross-retrieval setting — like using text-based captions of a movie scene to find matching images. The objective is to determine whether an adapter model could facilitate better translation and representation between different modalities.
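As a hedged illustration of such cross-retrieval, the snippet below scores one caption against a handful of candidate frames with an off-the-shelf CLIP checkpoint from Hugging Face; the checkpoint and the dummy frames are placeholders, not the project's actual data or model choice.

```python
# Caption-to-image cross-retrieval sketch with CLIP (transformers wrapper).
# The checkpoint and the synthetic frames are placeholders for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# dummy frames standing in for real movie stills
images = [Image.new("RGB", (224, 224), color=c) for c in ("black", "white")]
caption = "two characters argue in a rainy alley"

inputs = processor(text=[caption], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text: similarity of the caption to every candidate image
ranking = out.logits_per_text.softmax(dim=-1)
print(ranking)   # higher value = better matching frame
```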

Problem statement / Project goal

Efficient retrieval of multimodal embeddings, particularly for text and image data, presents several challenges in ensuring accurate cross-modal alignment. This project aims to explore and optimize techniques for embedding text and images to enhance retrieval performance. Specifically, it will address the following key questions:

Embedding Techniques

What are the most effective methods for generating text and image embeddings for cross-modal retrieval? How well do current models align textual descriptions (e.g., captions) with visual content?

Optimal Configurations

What are the ideal configurations for image patch sizes and text chunk sizes to maximize retrieval accuracy?
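As a rough orientation for this question, the snippet below shows how strongly the patch size alone drives the number of patch tokens per image, assuming the standard 224×224 CLIP input resolution (an assumption, not a project-specific setting).

```python
# Back-of-the-envelope check of how patch size drives sequence length
# for a vision transformer. 224x224 input resolution is an assumption.
image_size = 224
for patch in (14, 16, 32):
    n_patches = (image_size // patch) ** 2
    print(f"patch {patch}x{patch}: {n_patches} patches per image")
# patch 14x14: 256 patches per image
# patch 16x16: 196 patches per image
# patch 32x32: 49 patches per image
```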

Performance Measurement

How can we effectively evaluate retrieval performance using metrics such as precision-recall and ranking measures? What datasets are most suitable for training and benchmarking models for text-to-image and image-to-text retrieval?
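To make the first part of this question concrete, the toy example below computes precision@k and recall@k for a single ranked retrieval result; the ranked list and relevance labels are made up for illustration.

```python
# Toy precision@k / recall@k computation for one ranked retrieval run.
def precision_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for i in ranked_ids[:k] if i in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for i in ranked_ids[:k] if i in relevant_ids)
    return hits / len(relevant_ids)

ranked = ["img_7", "img_2", "img_9", "img_4", "img_1"]   # system output, best first
relevant = {"img_2", "img_4"}                            # ground-truth matches

print(precision_at_k(ranked, relevant, k=3))   # 1/3 ≈ 0.33
print(recall_at_k(ranked, relevant, k=3))      # 1/2 = 0.50
```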

The goal of this work is to refine embedding strategies and evaluation frameworks to improve the accuracy and efficiency of multimodal retrieval systems.

Impression 1


We visualize the CLIP image and text embedding spaces to examine how the vectors are distributed. PCA is used to project the high-dimensional embeddings down to 2D.
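A minimal sketch of this projection, assuming the CLIP embeddings have already been computed (random placeholders stand in for them here), could look as follows:

```python
# Project CLIP image and text embeddings to 2D with PCA and plot them
# in a shared scatter plot. Random arrays stand in for real embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 512))   # placeholder for CLIP image embeddings
text_embs = rng.normal(size=(100, 512))    # placeholder for CLIP text embeddings

pca = PCA(n_components=2)
points = pca.fit_transform(np.vstack([image_embs, text_embs]))

plt.scatter(points[:100, 0], points[:100, 1], label="image embeddings", s=8)
plt.scatter(points[100:, 0], points[100:, 1], label="text embeddings", s=8)
plt.legend()
plt.title("CLIP embedding space (PCA, 2D)")
plt.show()
```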

Impression 2


Exploring which words occur most frequently in our MS COCO captions to inform further steps.
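A small sketch of this caption analysis, using a placeholder caption list instead of the full COCO annotations file, might look like this:

```python
# Count the most frequent words across caption texts.
# `captions` is a placeholder; in practice it would be loaded
# from the MS COCO annotations file.
from collections import Counter
import re

captions = [
    "a man riding a wave on top of a surfboard",
    "a group of people standing around a kitchen",
]

counts = Counter()
for cap in captions:
    counts.update(re.findall(r"[a-z]+", cap.lower()))

print(counts.most_common(10))
```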

Impression 3


Exploring how the model processes an image by splitting it into patches. Here, we observe what happens when the patch size is 16×16.
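The following sketch reproduces that patching step on a dummy 224×224 image tensor; the input resolution is an illustrative assumption.

```python
# Cut an image tensor into non-overlapping 16x16 patches, as a vision
# transformer does before embedding them. Input size is an assumption.
import torch

image = torch.randn(3, 224, 224)   # (channels, height, width)
patch = 16

# unfold height and width into 16x16 tiles, then flatten to a patch sequence
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)   # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)

print(patches.shape)   # torch.Size([196, 768]) -> 196 patch tokens of dim 768
```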

Results

This study evaluated various multimodal embedding models for text-to-image and image-to-text retrieval tasks, focusing on optimizing retrieval performance through embedding selection, parameter tuning, and preprocessing techniques. The laion/CLIP-ViT-H-14-laion2B-s32B-b79K model emerged as the most effective, achieving the highest retrieval accuracy across both retrieval tasks.
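One plausible way to load this checkpoint and embed an image-caption pair is via the open_clip package, as sketched below; the dummy image and the exact API details are assumptions, not the project's actual code.

```python
# Sketch: load the best-performing checkpoint via open_clip and compare
# one caption with one frame. A blank image stands in for a real frame.
import torch
import open_clip
from PIL import Image

model_id = "hf-hub:laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model, _, preprocess = open_clip.create_model_and_transforms(model_id)
tokenizer = open_clip.get_tokenizer(model_id)

image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0)   # placeholder frame
text = tokenizer(["a close-up of an actor's face"])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb /= img_emb.norm(dim=-1, keepdim=True)
    txt_emb /= txt_emb.norm(dim=-1, keepdim=True)

print((img_emb @ txt_emb.T).item())   # cosine similarity between frame and caption
```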

  • Embedding Selection: The Laion_CLIP model demonstrated superior performance, outperforming OpenAI CLIP models.
  • Parameter Optimization: The image patch size (14×14, 16×16, or 32×32) and the text chunk size (8, 11, 17, or 54) were critical to achieving optimal performance.
  • Metadata Utilization: Preprocessing captions to extract frequent words and incorporating them as metadata improved retrieval performance, particularly for T2I retrieval.
  • Evaluation Metrics: The LRAP (label ranking average precision) metric proved more reliable than traditional precision-recall metrics for cross-modal retrieval; see the sketch after this list.
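A hedged LRAP illustration with scikit-learn, using toy scores and ground-truth labels, follows:

```python
# Label ranking average precision (LRAP) on toy cross-modal scores.
import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

# one row per text query, one column per candidate image
y_true = np.array([[1, 0, 0, 1],     # query 0 matches images 0 and 3
                   [0, 1, 0, 0]])    # query 1 matches image 1
y_score = np.array([[0.9, 0.2, 0.1, 0.4],
                    [0.3, 0.5, 0.7, 0.1]])

print(label_ranking_average_precision_score(y_true, y_score))   # 0.75 here
```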

The findings from this study contribute to the growing body of research on multimodal retrieval, emphasizing the importance of embedding quality, dataset preprocessing, and retrieval metric selection. As media archives, AI-driven content recommendation systems, and film production workflows increasingly rely on multimodal retrieval, the need for robust cross-modal embedding techniques becomes more pressing. The improvements achieved through preprocessing metadata highlight a potential path toward further refining retrieval accuracy.

  • Machine Learning: The field of artificial intelligence that enables systems to learn patterns from data and make predictions or decisions without explicit programming.
  • Multi-Modal Retrieval: A retrieval technique that processes and matches information across multiple data types, such as text, images, and audio.
  • Cross-Modal Compatibility: The ability of models to understand and align different modalities (e.g., text-to-image) to improve retrieval or generation tasks.
  • Retrieval Pipeline: A structured sequence of processes designed to efficiently retrieve relevant data from a large corpus, often used in search engines and recommendation systems.
  • Evaluation Retrieval Pipeline: A systematic framework for assessing the performance of a retrieval system, ensuring relevance, accuracy, and efficiency in fetching relevant data.

Team

Rehan Agarwal
Robin Schneider

Advisor
Fernando Benites

Jonas Grüter