From 6dad96c14a606db0110780a274d53246edd4d94e Mon Sep 17 00:00:00 2001 From: Neha Sharma Date: Wed, 24 Sep 2025 15:49:59 +0530 Subject: [PATCH 1/2] Update image captioning guide --- .../how_image_captioning_works.ipynb | 187 +++++++++++++++++- 1 file changed, 186 insertions(+), 1 deletion(-) diff --git a/guide/14-deep-learning/how_image_captioning_works.ipynb b/guide/14-deep-learning/how_image_captioning_works.ipynb index d75e5966fd..c890647203 100644 --- a/guide/14-deep-learning/how_image_captioning_works.ipynb +++ b/guide/14-deep-learning/how_image_captioning_works.ipynb @@ -1 +1,186 @@ -{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Image Captioning"]}, {"cell_type": "markdown", "metadata": {"toc": true}, "source": ["

Table of Contents

\n", "
"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Deep learning has been achieving superhuman level performance in computer vision that can do object classification, object detection, or semantic segmentation of different features. On the other hand, natural language processing models perform well on tasks such as named entity recognition, text classification, etc. This guide explains a new model which is a combination of both image and text. Image captioning model, as the name suggests, generates textual captions of an image. Like all supervised learning scenarios, this model also requires labelled training data in order to train a model. Image captioning technique is mostly done on images taken from handheld camera, however, research continues to explore captioning for remote sensing images. These could help describe the features on the map for accessibility purposes.\n", "Figure 1 shows an example of a few images from the RSICD dataset [1]. This dataset contains up to 5 unique captions for ~11k images."]}, {"cell_type": "markdown", "metadata": {}, "source": ["
\n", " \n", "
\n", "
\n", "
Figure 1. Training data for remote sensing image captioning. [1]
\n", "
\n", "
"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Architecture"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The image captioning model consists of an encoder and a decoder. The encoder extracts out important features from the image. The decoder takes those features as inputs and uses them to generate the caption. Typically, we use [ImageNet](http://www.image-net.org/) Pretrained networks like [VGGNet](https://arxiv.org/abs/1409.1556) or [ResNet](https://arxiv.org/abs/1512.03385) as the encoder for the image. The decoder is a language model that takes in current word and image features as an input and outputs the next word. The words are generated sequentially to complete the caption. The neural networks that are quite apt for this case are Recurrent Neural Networks (RNNs), specifically [Long Short Term Memory (LSTMs)](https://www.bioinf.jku.at/publications/older/2604.pdf) or [Gated Recurrent Units (GRUs)](https://arxiv.org/abs/1412.3555). Some research has also been done to incorporate recently [transformers](https://arxiv.org/abs/1706.03762) as the decoder [4].\n", "\n", "\n", "**Attention Mechanism:** The attention mechanism in image captioning attends to a portion of the image before generating the next word. So, the model can decide where to look in the image to generate the next word. This technique had been proposed in **Show, Attend, and Tell** paper [2]. A figure of this mechanism, along with complete architecture, is shown below. "]}, {"cell_type": "markdown", "metadata": {}, "source": ["
\n", " \n", "
\n", "
\n", "
Figure 2. Image captioning architecture with attention [2]
\n", "
\n", "
"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Implementation in `arcgis.learn`"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In `arcgis.learn`, we have used the architecture shown in Figure 2. It currently supports only the **RSICD dataset** [1] for image captioning due to the lack of remote sensing captioning data. Other datasets are available, i.e., **UC Merced Captions** [3] and **Sydney Captions** [3], but they are not readily accessible. The RSICD dataset size is pretty decent in size as well as diverse, allowing the image captioning model to learn reasonable captions.\n", "\n", "\n", "\n", "We need to put the RSICD dataset in a specific format, i.e., a root folder containing a folder named \"*images*\" and the JSON file containing the annotations named \"*annotations.json*\". The specific format of the json can be seen [here](https://github.com/201528014227051/RSICD_optimal/blob/master/dataset_rsicd.json). "]}, {"cell_type": "markdown", "metadata": {}, "source": ["
\n", " \n", "
\n", "
\n", "
Figure 3. Folder structure for RSICD dataset. A root folder containing \"images\" folder and \"annotations.json\" file.
\n", "
\n", "
"]}, {"cell_type": "markdown", "metadata": {}, "source": ["When we have data in this specific format, we can call the `prepare_data` function with `dataset_type='ImageCaptioning'`, we can then use the data object and pass that into the `ImageCaptioner` class to train the model using the ArcGIS workflow. The `ImageCaptioner` class can be initialized as follows."]}, {"cell_type": "markdown", "metadata": {}, "source": ["```python\n", "from arcgis.learn import ImageCaptioner\n", "ic = ImageCaptioner(data)\n", "```\n", "\n", "Advanced users can play with the internal architecture of the model by passing a keyword argument `decoder_params` as show below.\n", "\n", "```python\n", "ic = ImageCaptioner(data, \n", " decoder_params={\n", " 'embed_size':100, # Size of word embedding to be used during training.\n", " 'hidden_size':100, # Size of hidden activations in the LSTM\n", " 'attention_size':100, # Size of attention vectors\n", " 'teacher_forcing':1, # Probability of using teacher forcing during training.\n", " 'dropout':0.1, # Dropout probability for regularization.\n", " 'pretrained_emb':False #If true, it will use pretrained fasttext embeddings.\n", " }\n", " )\n", "```\n", "\n", "Once the model object is created we can use that to train the model using the `fit` method."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### References\n", "\n", "\n", "* [1] Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, Xuelong Li: \u201cExploring Models and Data for Remote Sensing Image Caption Generation\u201d, 2017; arXiv:1712.07835. DOI: 10.1109/TGRS.2017.2776321.\n", "\n", "* [2] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R. and Bengio, Y., 2015, June. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048-2057).\n", "\n", "* [3] B. Qu, X. Li, D. Tao, and X. Lu, \u201cDeep semantic understanding of high resolution remote sensing image,\u201d International Conference onComputer, Information and Telecommunication Systems, pp. 124\u2013128,2016.\n", "\n", "* [4] Desai, K., & Johnson, J. (2020). VirTex: Learning Visual Representations from Textual Annotations. arXiv preprint arXiv:2006.06666."]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2"}, "toc": {"base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": true}}, "nbformat": 4, "nbformat_minor": 4} \ No newline at end of file +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Image Captioning" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "toc": true + }, + "source": [ + "

Table of Contents

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Deep learning has been achieving superhuman level performance in computer vision that can do object classification, object detection, or semantic segmentation of different features. On the other hand, natural language processing models perform well on tasks such as named entity recognition, text classification, etc. This guide explains a new model which is a combination of both image and text. Image captioning model, as the name suggests, generates textual captions of an image. Like all supervised learning scenarios, this model also requires labelled training data in order to train a model. Image captioning technique is mostly done on images taken from handheld camera, however, research continues to explore captioning for remote sensing images. These could help describe the features on the map for accessibility purposes.\n", + "Figure 1 shows an example of a few images from the RSICD dataset [1]. This dataset contains up to 5 unique captions for ~11k images." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + "
\n", + "
\n", + "
Figure 1. Training data for remote sensing image captioning. [1]
\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Architecture" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The image captioning model consists of an encoder and a decoder. The encoder extracts out important features from the image. The decoder takes those features as inputs and uses them to generate the caption. Typically, we use [ImageNet](http://www.image-net.org/) Pretrained networks like [VGGNet](https://arxiv.org/abs/1409.1556) or [ResNet](https://arxiv.org/abs/1512.03385) as the encoder for the image. The decoder is a language model that takes in current word and image features as an input and outputs the next word. The words are generated sequentially to complete the caption. The neural networks that are quite apt for this case are Recurrent Neural Networks (RNNs), specifically [Long Short Term Memory (LSTMs)](https://www.bioinf.jku.at/publications/older/2604.pdf) or [Gated Recurrent Units (GRUs)](https://arxiv.org/abs/1412.3555). Some research has also been done to incorporate recently [transformers](https://arxiv.org/abs/1706.03762) as the decoder [4].\n", + "\n", + "\n", + "**Attention Mechanism:** The attention mechanism in image captioning attends to a portion of the image before generating the next word. So, the model can decide where to look in the image to generate the next word. This technique had been proposed in **Show, Attend, and Tell** paper [2]. A figure of this mechanism, along with complete architecture, is shown below. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + "
\n", + "
\n", + "
Figure 2. Image captioning architecture with attention [2]
\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Implementation in `arcgis.learn`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In `arcgis.learn`, we have used the architecture shown in Figure 2. It currently supports only the **RSICD dataset** [1] for image captioning due to the lack of remote sensing captioning data. Other datasets are available, i.e., **UC Merced Captions** [3] and **Sydney Captions** [3], but they are not readily accessible. The RSICD dataset size is pretty decent in size as well as diverse, allowing the image captioning model to learn reasonable captions.\n", + "\n", + "\n", + "\n", + "We need to put the RSICD dataset in a specific format, i.e., a root folder containing a folder named \"*images*\" and the JSON file containing the annotations named \"*annotations.json*\". The specific format of the json can be seen [here](https://github.com/201528014227051/RSICD_optimal/blob/master/dataset_rsicd.json). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + "
\n", + "
\n", + "
Figure 3. Folder structure for RSICD dataset. A root folder containing \"images\" folder and \"annotations.json\" file.
\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When we have data in this specific format, we can call the `prepare_data` function with `dataset_type='ImageCaptioning'`, we can then use the data object and pass that into the `ImageCaptioner` class to train the model using the ArcGIS workflow. The `ImageCaptioner` class can be initialized as follows." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "```python\n", + "from arcgis.learn import ImageCaptioner\n", + "ic = ImageCaptioner(data)\n", + "```\n", + "\n", + "Advanced users can play with the internal architecture of the model by passing a keyword argument `decoder_params` as show below.\n", + "\n", + "```python\n", + "ic = ImageCaptioner(data, \n", + " decoder_params={\n", + " 'embed_size':100, # Size of word embedding to be used during training.\n", + " 'hidden_size':100, # Size of hidden activations in the LSTM\n", + " 'attention_size':100, # Size of attention vectors\n", + " 'teacher_forcing':1, # Probability of using teacher forcing during training.\n", + " 'dropout':0.1, # Dropout probability for regularization.\n", + " }\n", + " )\n", + "```\n", + "\n", + "Once the model object is created we can use that to train the model using the `fit` method." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### References\n", + "\n", + "\n", + "* [1] Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, Xuelong Li: “Exploring Models and Data for Remote Sensing Image Caption Generation”, 2017; arXiv:1712.07835. DOI: 10.1109/TGRS.2017.2776321.\n", + "\n", + "* [2] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R. and Bengio, Y., 2015, June. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048-2057).\n", + "\n", + "* [3] B. Qu, X. Li, D. Tao, and X. Lu, “Deep semantic understanding of high resolution remote sensing image,” International Conference onComputer, Information and Telecommunication Systems, pp. 124–128,2016.\n", + "\n", + "* [4] Desai, K., & Johnson, J. (2020). VirTex: Learning Visual Representations from Textual Annotations. arXiv preprint arXiv:2006.06666." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:conda-dl] *", + "language": "python", + "name": "conda-env-conda-dl-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.7" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": true, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From afcb839915f25dda8a3ed8283df6e056eb4761c4 Mon Sep 17 00:00:00 2001 From: Neha Sharma Date: Wed, 24 Sep 2025 15:51:47 +0530 Subject: [PATCH 2/2] Update image captioning guide --- guide/14-deep-learning/how_image_captioning_works.ipynb | 1 - 1 file changed, 1 deletion(-) diff --git a/guide/14-deep-learning/how_image_captioning_works.ipynb b/guide/14-deep-learning/how_image_captioning_works.ipynb index c890647203..30525b30af 100644 --- a/guide/14-deep-learning/how_image_captioning_works.ipynb +++ b/guide/14-deep-learning/how_image_captioning_works.ipynb @@ -114,7 +114,6 @@ "from arcgis.learn import ImageCaptioner\n", "ic = ImageCaptioner(data)\n", "```\n", - "\n", "Advanced users can play with the internal architecture of the model by passing a keyword argument `decoder_params` as show below.\n", "\n", "```python\n",