Visual-Question-Answering

A visual question answering model with a simple GUI that handles both images and videos.

Dataset

The dataset used is the publicly available VQA Dataset.
VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer.

265,016 images (COCO and abstract scenes)
At least 3 questions (5.4 questions on average) per image
10 ground truth answers per question
3 plausible (but likely incorrect) answers per question
Automatic evaluation metric

We used version 2 of the dataset.
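
The automatic evaluation metric can be sketched as follows. This is a minimal version of the commonly cited VQA accuracy rule, under which a predicted answer counts as fully correct if at least 3 of the 10 annotators gave it. The function name `vqa_accuracy` and the simple lowercase/strip normalization are assumptions for illustration; the official evaluation script applies more thorough answer normalization and averages over annotator subsets.

```python
from collections import Counter


def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(#humans who gave this answer / 3, 1)."""
    # Assumption: light normalization only (the official script does more).
    def normalize(text):
        return text.strip().lower()

    counts = Counter(normalize(answer) for answer in human_answers)
    return min(counts[normalize(predicted)] / 3.0, 1.0)


# Toy example: 10 hypothetical ground-truth answers for one question.
answers = ["yes"] * 6 + ["no"] * 2 + ["maybe", "Yes "]

print(vqa_accuracy("yes", answers))    # 7 annotators agree, capped at 1.0
print(vqa_accuracy("no", answers))     # 2 annotators agree, partial credit
print(vqa_accuracy("maybe", answers))  # 1 annotator agrees, partial credit
```

Partial credit for answers given by only one or two annotators is what lets the metric tolerate disagreement among the 10 ground-truth answers.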

Architecture

Encoder

We used OpenAI's CLIP encoder to encode the images and the questions into a shared embedding space.
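
To illustrate what a shared embedding space makes possible, the sketch below compares an image vector with question vectors using cosine similarity, the comparison CLIP is trained around. The vectors are tiny made-up stand-ins, not real CLIP outputs (actual embeddings have hundreds of dimensions), and the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for an image embedding and two question embeddings.
image_emb = [0.9, 0.1, 0.3, 0.0]
question_embs = {
    "What color is the dog?": [0.8, 0.2, 0.4, 0.1],
    "How many planes are there?": [0.0, 0.9, 0.1, 0.7],
}

# Because both modalities live in one space, they can be compared directly.
scores = {q: cosine_similarity(image_emb, emb) for q, emb in question_embs.items()}
best_question = max(scores, key=scores.get)
print(best_question, round(scores[best_question], 3))
```

Because images and questions share one space, their vectors have the same dimensionality and can be compared or combined without an extra alignment step.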

Decoder

GPT-2 is used as the decoder: it converts the encoded embeddings back into a text sequence, which is compared against the ground-truth answers for the images.
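
One common way to compare a decoder's output with a ground-truth answer is token-level cross-entropy, sketched below in plain Python. Whether this repository trains GPT-2 with exactly this loss is an assumption; the logits, vocabulary size, and token ids are toy values, and `sequence_cross_entropy` is a hypothetical helper.

```python
import math


def sequence_cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of the target tokens under the logits."""
    total = 0.0
    for step_logits, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_norm - step_logits[target]
    return total / len(target_ids)


# Toy example: 2 decoding steps over a 4-token vocabulary.
logits = [
    [4.0, 0.0, 0.0, 0.0],  # step 1 strongly favors token 0
    [0.0, 0.0, 3.0, 0.0],  # step 2 strongly favors token 2
]

good = sequence_cross_entropy(logits, [0, 2])  # targets match the favored tokens
bad = sequence_cross_entropy(logits, [1, 3])   # targets contradict them
print(good < bad)
```

A lower loss means the decoder assigns higher probability to the ground-truth answer tokens, which is the signal a training loop would minimize.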

Folders and files

Samples
templates
FlaskApp.ipynb
README.md
inference.py
main.py
model.py
training.ipynb
