
Introduction Multidimensional

Jorge Quenta edited this page Jul 17, 2025 · 3 revisions

🧠 Introduction

This project focuses on the development of a content-based retrieval system that works across three data modalities: text, images, and audio.

🗂️ Data Domains and Datasets

  • Text
    We use the MPST (Movie Plot Synopses with Tags) dataset from Kaggle, which contains thousands of movie synopses annotated with genres and tags. It is well suited to testing content-based retrieval in the textual domain because it lets us evaluate semantic similarity between plot descriptions.

  • Images
    We use the Fashion Product Images Dataset (~25 GB), which contains labeled images of clothing products. Each image is described with SIFT local descriptors, which are quantized into bag-of-visual-words (BoVW) histograms to perform visual similarity search.

  • Audio
    We use a curated subset of Spotify songs in .wav format, covering various artists and genres. We extract MFCC features to characterize each track and build histograms of acoustic words using k-means clustering.
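The image and audio pipelines above share the same bag-of-words idea: local descriptors (SIFT keypoints for images, MFCC frames for audio) are quantized against a learned codebook, and each item is represented as a normalized word histogram. Below is a minimal NumPy sketch of that quantization step, using a toy k-means in place of a full clustering library; the function names are illustrative, not the project's actual code:

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Tiny k-means: learn a codebook of 'visual/acoustic words' from local features."""
    rng = np.random.default_rng(seed)
    # initialize centroids with k distinct feature vectors
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # assign every feature to its nearest centroid
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned features
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return centroids

def bow_histogram(features, codebook):
    """Quantize an item's local descriptors and build its normalized word histogram."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

The resulting histograms can be compared with any vector distance (e.g. cosine or Euclidean), which is what makes SIFT-based image search and MFCC-based audio search structurally identical at query time.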

🌐 Why a Multimodal Database?

Traditional retrieval systems often rely on metadata or keyword matching. However, content-based retrieval requires understanding the actual content (visual, auditory, or textual). Each data domain has unique properties:

  • Text requires natural language processing and inverted indexes.
  • Images require local feature extraction and visual vocabularies.
  • Audio requires signal processing and acoustic descriptors.
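The text-side requirement above, an inverted index, can be sketched in a few lines of standard-library Python. This is a hedged illustration of the data structure, not the project's implementation; `build_inverted_index` and `search` are placeholder names:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```

A real text pipeline would add tokenization, stop-word removal, and TF-IDF weighting on top of this skeleton, but the core lookup stays the same: intersect the posting lists of the query terms.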

To unify these under a single framework, we need a multimodal database that allows:

  • Indexing and retrieving content from different modalities
  • Using a specialized pipeline for each modality
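One simple way to put the per-modality pipelines behind a single interface is a dispatch table keyed by modality. The sketch below is an architectural illustration under stated assumptions, not the project's actual design; the commented image and audio pipeline names are hypothetical placeholders:

```python
from collections import Counter

def text_features(text):
    """Toy text pipeline: term-frequency features (stand-in for a real NLP pipeline)."""
    return Counter(text.lower().split())

# Each modality registers its specialized pipeline under a common interface.
PIPELINES = {
    "text": text_features,
    # "image": sift_bovw_features,  # hypothetical: SIFT -> BoVW histogram
    # "audio": mfcc_bow_features,   # hypothetical: MFCC -> acoustic-word histogram
}

def extract(modality, item):
    """Dispatch an item to the feature pipeline registered for its modality."""
    return PIPELINES[modality](item)
```

With this shape, the indexing and retrieval layers only ever see feature vectors or histograms, so the same storage and similarity machinery serves all three modalities.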
