Most existing semantic instance segmentation systems, such as Detectron2 and YOLO, work with a closed set of object categories, typically those found in datasets such as MS-COCO or Pascal VOC. While their accuracy is high within those limits, they struggle in real-world environments where the variety of objects goes far beyond the training data. This is especially problematic in fields like mobile robotics, where systems often encounter unfamiliar objects.
TALOS (TAgging–LOcation–Segmentation) is an open-vocabulary instance segmentation pipeline designed to overcome these limitations. It works in three main stages: Tagging extracts object labels from the image, Location locates those objects, and Segmentation generates binary masks for each detected instance. Each stage is modular and independent, making it easy to extend or replace components with newer models as needed.
Compared to traditional closed-vocabulary detectors, TALOS brings several key advantages. It correctly segments individual object instances, works automatically from just an RGB image, supports easy model integration thanks to its modular architecture, and allows for natural-language customization using large language models.
Detectron2 (left, pink) is limited to COCO categories and often mislabels or misses unknown objects. TALOS (right, green) correctly identifies and segments previously unseen categories like "curtain", "piano" or "avocado".
TALOS has been integrated as a ROS 2 node and connected to Voxeland, a 3D semantic mapping platform that previously relied on closed-vocabulary systems. Results on a variety of input images show improved semantic detail, which translates into richer and more informative maps—an essential step toward more capable and aware robotic systems.
📄 Read the TALOS paper (Spanish PDF)
This project is currently under development. Please check the develop branch for the latest updates.
TALOS takes an arbitrary number of RGB images as input and produces instance-level segmentations for each image, including binary masks for every object instance along with their corresponding bounding boxes and semantic labels.
The TALOS pipeline consists of three main stages: Tagging, Location, and Segmentation. The pipeline is designed to be modular, so each stage can easily be extended or replaced with new models and components as needed. The three main stages are as follows:
- Tagging
  - Description: Extracts object category labels using large-scale models (LVLMs and/or LLMs).
  - Input: RGB image.
  - Output: List of semantic object categories (textual labels).
  - Tagging methods:
    - Direct Tagging: Uses a Large Vision-Language Model (LVLM) to extract labels directly. A smaller, task-specific model such as RAM++ can also be used for this method, although it is less flexible than the LVLM approach.
    - Tagging via LVLM Image Description and LLM Keyword Extraction: Uses an LVLM to generate a description of the image, which is then processed by an LLM to extract keywords for the object categories present in that description.
- Location
  - Description: Locates the objects described by the category tags using a visual grounding model.
  - Inputs:
    - RGB image.
    - List of object labels (output from the Tagging stage).
  - Output: List of bounding boxes with label and confidence.
- Segmentation
  - Description: Produces accurate instance segmentation masks using category-agnostic segmentation models.
  - Inputs:
    - RGB image.
    - Bounding boxes for each detected object (output from the Location stage).
  - Outputs:
    - Binary masks, one per object instance.
    - A detections JSON file that includes, for each instance, its semantic label, bounding box coordinates, location confidence score, and mask ID (a hypothetical example of this structure is sketched below).
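To make the Segmentation outputs concrete, the snippet below shows a hypothetical example of the per-image detections structure described above. The field names and values are illustrative assumptions only; the actual schema is defined by the JSON files that TALOS writes to its output directory (see the usage section below).

```python
# Hypothetical per-image detections structure (illustrative field names only;
# the real schema is defined by the JSON files TALOS writes to src/pipeline/output/).
import json

detections = {
    "image": "desk.jpg",
    "instances": [
        {
            "label": "keyboard",                   # semantic label from the Tagging stage
            "bbox": [412.0, 310.5, 780.3, 455.8],  # bounding box from the Location stage
            "confidence": 0.87,                    # location confidence score
            "mask_id": 0,                          # index of the binary mask from the Segmentation stage
        }
    ],
}

print(json.dumps(detections, indent=2))
```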
TALOS integrates a variety of advanced models and technologies to achieve its open-vocabulary instance segmentation capabilities: LVLMs, LLMs, and tagging models for the Tagging stage (LLaVA, Qwen, MiniCPM, Gemma, DeepSeek, Llama, and RAM++, served through Ollama and Hugging Face), Grounding DINO for the Location stage, SAM2 for the Segmentation stage, and ROS 2 for robotics integration.
- Recommended Python version: Python 3.10.12 or higher.
- Recommended operating system: Ubuntu 22.04 or higher.
To install TALOS, follow these steps:
- Clone the repository and its submodules

  ```bash
  git clone --recurse-submodules https://github.com/macorisd/TALOS.git
  ```

  If you plan to use TALOS for robotics tasks with ROS 2, clone it into the `src/` directory of your ROS 2 workspace.

- Create virtual environment and install dependencies

  You can manage dependencies however you like, but a helper script is provided to automate virtual-environment creation and dependency installation:

  ```bash
  chmod +x venvs/update_venv.bash
  ./venvs/update_venv.bash
  ```

  This will create (or update) a Python 3 virtual environment named `talos_env` in `venvs/` and install all required packages. Alternatively, install manually:

  ```bash
  python3 -m venv venvs/talos_env
  source venvs/talos_env/bin/activate
  pip install -r requirements.txt
  ```
- Install models and technologies

  - Supported Ollama models: To use the LLaVA, DeepSeek, and MiniCPM models in the TALOS Tagging stage, install Ollama and pull the models (an optional standalone sketch using these models appears right after these installation steps):

    ```bash
    curl -fsSL https://ollama.com/install.sh | sh
    ollama pull llava:34b
    ollama pull deepseek-r1:14b
    ollama pull minicpm-v:8b
    ```

  - Recognize Anything Plus Model (RAM++): To use RAM++ in the TALOS Tagging stage, download the `.pth` file (the recommended checkpoint is `ram_plus_swin_large_14m.pth`) into `src/pipeline/talos/tagging/direct_tagging/ram_plus/models/`. The file is available at: https://huggingface.co/xinyu1205/recognize-anything-plus-model/blob/main/ram_plus_swin_large_14m.pth

  - Gemma 3 (Hugging Face): To use Gemma in the TALOS Tagging stage, set your Hugging Face token as an environment variable. For example, add the following to `pipeline/.env`:

    ```
    HUGGINGFACE_TOKEN=<your_token_here>
    ```

  - Other supported Hugging Face models: Other Hugging Face models used by TALOS do not require additional installation steps. They are downloaded automatically the first time the pipeline runs with each model.
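Once the Ollama models are pulled, you can optionally smoke-test them with the standalone sketch below, which mirrors the idea of the LVLM Image Description and LLM Keyword Extraction Tagging method. This is not the TALOS implementation (TALOS drives these models through its own Strategy classes and prompt files); it only assumes a locally running Ollama server, and the prompts and the `desk.jpg` path are placeholders.

```python
# Standalone sketch (not the TALOS implementation) of the
# "LVLM Image Description + LLM Keyword Extraction" Tagging idea,
# using Ollama's local REST API. Assumes the Ollama server is running
# and the models below have been pulled; prompts and image path are placeholders.
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model: str, prompt: str, images=None) -> str:
    """Send one non-streaming generation request to the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images:
        payload["images"] = images  # list of base64-encoded images
    response = requests.post(OLLAMA_URL, json=payload, timeout=600)
    response.raise_for_status()
    return response.json()["response"]


# 1) LVLM image description (e.g., LLaVA).
with open("desk.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

description = ollama_generate(
    "llava:34b",
    "Describe every object visible in this image.",
    images=[image_b64],
)

# 2) LLM keyword extraction (e.g., DeepSeek) over that description.
tags = ollama_generate(
    "deepseek-r1:14b",
    "List the object categories mentioned in the following description "
    "as a comma-separated list of singular nouns:\n\n" + description,
)
print(tags)
```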
These usage instructions demonstrate running TALOS in “user” mode. To launch the ROS 2 node for robotic applications (e.g., building 3D semantic maps), please refer to the TALOS ROS 2 README.
- Add input images

  Place your input images (recommended formats: png, jpg, jpeg) into `src/pipeline/input_images/` before running TALOS.
- Activate the virtual environment

  Before running the pipeline, ensure your Python virtual environment is active:

  ```bash
  source venvs/talos_env/bin/activate
  ```
- Run the pipeline

  Navigate to the pipeline directory and launch the main script:

  ```bash
  cd src/pipeline
  python pipeline_main.py [OPTIONS]
  ```

  The outputs for each stage will be saved in the `src/pipeline/output/` directory.

  Available command-line arguments:

  - `-img`, `--input_images`
    - Zero or more image filenames (e.g., `image1.png image2.jpg`).
    - Defaults to `['desk.jpg']` (provided example image) if not specified.
    - Images must be located in the `src/pipeline/input_images/` folder.
  - `-iters`, `--iterations`
    - Integer number of times to run the pipeline per image.
    - Defaults to `1`.
  - `-cfg`, `--config_file`
    - Configuration filename (e.g., `config.json`, `config2.json`).
    - Defaults to `config.json`.
    - File must be located in the `src/pipeline/config/` directory.
  An execution example:

  ```bash
  cd src/pipeline
  python pipeline_main.py -img input_image1.jpg input_image2.jpg -iters 2 -cfg config2.json
  ```

  Running without any options will process the example image `desk.jpg` once using `config.json`. You can also consult the help message for more information:

  ```bash
  cd src/pipeline
  python pipeline_main.py --help
  ```

Each stage of the pipeline can also be executed independently by running the script of the corresponding Strategy implementation for that stage. It is recommended to open these scripts in an IDE to customize the main function (selecting the input image, etc.). These scripts are located in `src/pipeline/talos/` subdirectories and are named as follows:

- `qwen_tagging.py`: Runs the Tagging stage using the Qwen model.
- `minicpm_tagging.py`: Runs the Tagging stage using the MiniCPM model.
- `gemma_tagging.py`: Runs the Tagging stage using the Gemma model.
- `ram_plus_tagging.py`: Runs the Tagging stage using the RAM++ model.
- `lvlm_llm_tagging.py`: Runs the Tagging stage using the LVLM Image Description and LLM Keyword Extraction method (the models are defined in the main function).
- `llava_image_description.py`: Runs the LVLM Image Description method of the Tagging stage using the LLaVA model.
- `qwen_image_description.py`: Runs the LVLM Image Description method of the Tagging stage using the Qwen model.
- `minicpm_image_description.py`: Runs the LVLM Image Description method of the Tagging stage using the MiniCPM model.
- `deepseek_keyword_extraction.py`: Runs the LLM Keyword Extraction method of the Tagging stage using the DeepSeek model.
- `qwen_keyword_extraction.py`: Runs the LLM Keyword Extraction method of the Tagging stage using the Qwen model.
- `minicpm_keyword_extraction.py`: Runs the LLM Keyword Extraction method of the Tagging stage using the MiniCPM model.
- `llama_keyword_extraction.py`: Runs the LLM Keyword Extraction method of the Tagging stage using the Llama model.
- `grounding_dino_location.py`: Runs the Location stage using the Grounding DINO model.
- `sam2_segmentation.py`: Runs the Segmentation stage using the SAM2 model.
These scripts can be run directly from the command line or from an IDE:

```bash
python qwen_tagging.py
```
- Configuration file: The pipeline configuration is defined in a JSON file located in `src/pipeline/config/`. The default configuration file is `config.json`, but you can create and use your own configuration files. This file specifies the models and parameters for each stage of the pipeline (a small sketch for creating a custom configuration appears after this list).
- Large Vision-Language Models (LVLMs) and Large Language Models (LLMs) prompt customization: TALOS allows you to customize the prompts used by the LVLMs and LLMs in the Tagging stage. You can modify the prompt `txt` files, located in:
  - Direct LVLM Tagging: `src/pipeline/talos/tagging/direct_lvlm_tagging/prompts/prompt.txt`
  - Tagging with LVLM Image Description and LLM Keyword Extraction:
    - LVLM Image Description: `src/pipeline/talos/tagging/lvlm_llm_tagging/lvlm_image_description/base_image_description.py` (this prompt is defined in the code because it is much shorter than the other prompts)
    - LLM Keyword Extraction: `src/pipeline/talos/tagging/lvlm_llm_tagging/llm_keyword_extraction/prompts/prompt1.txt` (main prompt) and `.../prompt2.txt` (output enhancement prompt)
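As mentioned in the configuration item above, a custom configuration can be created by copying the default file and editing it before passing it via `-cfg`. The sketch below is a convenience snippet, not part of TALOS; it only relies on the paths already listed in this README, and the keys it prints depend entirely on the schema shipped in `src/pipeline/config/`.

```python
# Convenience sketch (not part of TALOS): create a custom configuration by
# copying the default config.json, review its contents, then edit the copy
# and pass it to the pipeline with -cfg.
import json
import shutil
from pathlib import Path

config_dir = Path("src/pipeline/config")
custom_config = config_dir / "config2.json"

shutil.copy(config_dir / "config.json", custom_config)               # start from the defaults
print(json.dumps(json.loads(custom_config.read_text()), indent=2))   # review stage/model settings

# After editing config2.json, run:
#   python pipeline_main.py -cfg config2.json
```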
TALOS includes unit tests to ensure the functionality of its components. Please refer to the tests README for more information on how to run the tests.
Please check the evaluation README for detailed information about the evaluation metrics and results of the TALOS pipeline.
Please note that only the Tagging models are shown in the evaluation results, since in every case the Location stage was performed with Grounding DINO and the Segmentation stage with SAM2 (Segment Anything Model 2). The evaluation results are based on a random subset of 1,000 images from the LVIS dataset.
| TAGGING MODEL(S) | Detection count | Label precision | Label recall | BBox sim. | Mask sim. | Avg final score | Avg exec. time (s) |
|---|---|---|---|---|---|---|---|
| MiniCPM | 68.88 | 40.88 | 41.92 | 75.26 | 71.09 | 59.61 | 1.63 |
| Gemma | 58.80 | 30.79 | 58.73 | 75.43 | 71.99 | 59.15 | 8.11 |
| Qwen | 58.01 | 29.78 | 59.73 | 74.90 | 72.02 | 58.89 | 4.18 |
| MiniCPM + MiniCPM | 65.34 | 27.65 | 47.08 | 74.68 | 72.28 | 57.41 | 2.68 |
| LLaVA + MiniCPM | 64.92 | 28.60 | 43.58 | 74.78 | 70.88 | 56.55 | 5.63 |
| RAM Plus | 66.22 | 26.32 | 49.50 | 71.15 | 68.86 | 56.41 | 0.63 |
Contributions are welcome!
- Fork the repository.
- Create a new branch: git checkout -b feature/my-feature.
- Make your changes.
- Push to your fork: git push origin feature/my-feature.
- Submit a pull request with a clear description of your changes.
All pull requests will be reviewed and require approval before being merged into the main branch.
If you use TALOS in your research, please cite the TALOS paper as follows:

```bibtex
@article{decena2025talos,
  author  = {Decena-Gimenez, M. and Moncada-Ramirez, J. and Ruiz-Sarmiento, J.R. and Gonzalez-Jimenez, J.},
  title   = {Instance semantic segmentation using an open vocabulary},
  journal = {Simposio CEA de Robótica, Bioingeniería, Visión Artificial y Automática Marina 2025},
  volume  = {1},
  number  = {1},
  year    = {2025},
  url     = {https://ingmec.ual.es/ojs/index.php/RBVM25/article/view/38}
}
```

This project is licensed under the GPL-3.0 License. See the LICENSE file for details.
Hi! I'm Maco 👋
This project is being developed as part of my Bachelor's Thesis in Software Engineering, as a member of the MAPIR research group at the University of Málaga, Spain. It would not have been possible without the trust and support of my professors Javier and Raúl. Thank you for believing in me!
If you have any questions, suggestions, or feedback, please reach out to me!
- LinkedIn: macorisd
- Email: macorisd@gmail.com