OpenVQA Documentation


OpenVQA is a general platform for visual question answering (VQA) research, implementing state-of-the-art approaches on different benchmark datasets. Support for more methods and datasets will be added continuously.

Installation

This page provides basic prerequisites to run OpenVQA, including the setups of hardware, software, and datasets.

Hardware & Software Setup

A machine with at least one GPU (>= 8 GB of GPU memory), 20 GB of RAM and 50 GB of free disk space is required. We strongly recommend using an SSD drive to guarantee high-speed I/O.

The following packages are required to build the project correctly.

$ pip install -r requirements.txt
$ wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
$ pip install en_vectors_web_lg-2.1.0.tar.gz
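
To verify that the word-vector model installed correctly, you can run a quick check (a minimal sketch; spaCy 2.1 is assumed, matching the model version above):

import spacy

# Load the installed en_vectors_web_lg package and confirm that the 300-D
# GloVe vectors are available.
nlp = spacy.load('en_vectors_web_lg')
print(nlp.vocab.vectors.shape)   # (number_of_vectors, 300)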

Dataset Setup

The following datasets should be prepared before running the experiments.

Note that if you only want to run experiments on one specific dataset, you can focus on the setup for that and skip the rest.

VQA-v2

  • Image Features

The image features are extracted using the bottom-up-attention strategy, with each image being represented as a dynamic number (from 10 to 100) of 2048-D features. We store the features for each image in a .npz file. You can prepare the visual features yourself or download the extracted features from OneDrive or BaiduYun. The download contains three files: train2014.tar.gz, val2014.tar.gz, and test2015.tar.gz, corresponding to the features of the train/val/test images for VQA-v2, respectively (a quick loading check follows the directory tree below).

All the image feature files are unzipped and placed in the data/vqa/feats folder to form the following tree structure:

|-- data
	|-- vqa
	|  |-- feats
	|  |  |-- train2014
	|  |  |  |-- COCO_train2014_...jpg.npz
	|  |  |  |-- ...
	|  |  |-- val2014
	|  |  |  |-- COCO_val2014_...jpg.npz
	|  |  |  |-- ...
	|  |  |-- test2015
	|  |  |  |-- COCO_test2015_...jpg.npz
	|  |  |  |-- ...
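
As a quick sanity check after unpacking, each .npz archive can be inspected as follows (a minimal sketch; the key names stored inside the archives depend on how the features were packed, so we list them instead of assuming a specific key):

import glob
import numpy as np

# Open the first unpacked feature file and print every stored array with its
# shape; one of them should be a [num_bbox, 2048] bottom-up feature matrix.
path = sorted(glob.glob('data/vqa/feats/train2014/*.npz'))[0]
with np.load(path) as feat_file:
    for key in feat_file.files:
        print(key, feat_file[key].shape)
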
  • QA Annotations

Download all the annotation json files for VQA-v2, including the train questions, val questions, test questions, train answers, and val answers.

In addition, we use the VQA samples from Visual Genome to augment the training samples. We pre-processed these samples using two rules (a small illustrative sketch follows the list):

  1. Select the QA pairs whose corresponding images appear in the MS-COCO train and val splits;
  2. Select the QA pairs whose answers appear in the processed answer list (i.e., answers occurring more than 8 times among all VQA-v2 answers).
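
The released VG files below are already filtered, but the two rules amount to roughly the following sketch (the field names and inputs are illustrative assumptions, not the actual preprocessing script):

# Hypothetical illustration of the two filtering rules above.
def filter_vg_pairs(vg_qa_pairs, coco_image_ids, answer_list):
    """Keep VG QA pairs whose image appears in MS-COCO train/val and whose
    answer appears in the processed VQA-v2 answer list."""
    images = set(coco_image_ids)
    answers = set(answer_list)
    return [qa for qa in vg_qa_pairs
            if qa['image_id'] in images and qa['answer'] in answers]

# Toy usage:
pairs = [{'image_id': 1, 'answer': 'red'}, {'image_id': 99, 'answer': 'quux'}]
print(filter_vg_pairs(pairs, coco_image_ids=[1], answer_list=['red']))
# -> [{'image_id': 1, 'answer': 'red'}]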

We provide our processed VG question and annotation files; you can download them from OneDrive or BaiduYun.

All the QA annotation files are unzipped and placed in the data/vqa/raw folder to form the following tree structure:

|-- data
	|-- vqa
	|  |-- raw
	|  |  |-- v2_OpenEnded_mscoco_train2014_questions.json
	|  |  |-- v2_OpenEnded_mscoco_val2014_questions.json
	|  |  |-- v2_OpenEnded_mscoco_test2015_questions.json
	|  |  |-- v2_OpenEnded_mscoco_test-dev2015_questions.json
	|  |  |-- v2_mscoco_train2014_annotations.json
	|  |  |-- v2_mscoco_val2014_annotations.json
	|  |  |-- VG_questions.json
	|  |  |-- VG_annotations.json

GQA

  • Image Features

Download the spatial features and object features for GQA from its official website. The spatial feature files include gqa_spatial_*.h5 and gqa_spatial_info.json; the object feature files include gqa_objects_*.h5 and gqa_objects_info.json. To make the input features consistent with those for VQA-v2, we provide a script to transform the .h5 feature files into multiple .npz files, with each file corresponding to one image.

$ cd data/gqa

$ unzip spatialFeatures.zip
$ python gqa_feat_preproc.py --mode=spatial --spatial_dir=./spatialFeatures --out_dir=./feats/gqa-grid
$ rm -r spatialFeatures.zip ./spatialFeatures

$ unzip objectFeatures.zip
$ python gqa_feat_preproc.py --mode=object --object_dir=./objectFeatures --out_dir=./feats/gqa-frcn
$ rm -r objectFeatures.zip ./objectFeatures

All the processed feature files are placed in the data/gqa/feats folder to form the following tree structure:

|-- data
	|-- gqa
	|  |-- feats
	|  |  |-- gqa-frcn
	|  |  |  |-- 1.npz
	|  |  |  |-- ...
	|  |  |-- gqa-grid
	|  |  |  |-- 1.npz
	|  |  |  |-- ...
  • Questions and Scene Graphs

Download all the GQA QA files from the official site, including all the splits needed for training, validation and testing. Download the scene graph files for the train and val splits from the official site. Download the supporting files from the official site, including the train and val choices files used for evaluation.

All the question files and scene graph files are unzipped and placed in the data/gqa/raw folder to form the following tree structure:

|-- data
	|-- gqa
	|  |-- raw
	|  |  |-- questions1.2
	|  |  |  |-- train_all_questions
	|  |  |  |  |-- train_all_questions_0.json
	|  |  |  |  |-- ...
	|  |  |  |  |-- train_all_questions_9.json
	|  |  |  |-- train_balanced_questions.json
	|  |  |  |-- val_all_questions.json
	|  |  |  |-- val_balanced_questions.json
	|  |  |  |-- testdev_all_questions.json
	|  |  |  |-- testdev_balanced_questions.json
	|  |  |  |-- test_all_questions.json
	|  |  |  |-- test_balanced_questions.json
	|  |  |  |-- challenge_all_questions.json
	|  |  |  |-- challenge_balanced_questions.json
	|  |  |  |-- submission_all_questions.json
	|  |  |-- eval
	|  |  |  |-- train_choices
	|  |  |  |  |-- train_all_questions_0.json
	|  |  |  |  |-- ...
	|  |  |  |  |-- train_all_questions_9.json
	|  |  |  |-- val_choices.json
	|  |  |-- sceneGraphs
	|  |  |  |-- train_sceneGraphs.json
	|  |  |  |-- val_sceneGraphs.json

CLEVR

  • Images, Questions and Scene Graphs

Download the CLEVR v1.0 dataset from the official site, including all the splits needed for training, validation and testing.

All the image files, question files and scene graph files are unzipped and placed in the data/clevr/raw folder to form the following tree structure:

|-- data
	|-- clevr
	|  |-- raw
	|  |  |-- images
	|  |  |  |-- train
	|  |  |  |  |-- CLEVR_train_000000.png
	|  |  |  |  |-- ...
	|  |  |  |  |-- CLEVR_train_069999.png
	|  |  |  |-- val
	|  |  |  |  |-- CLEVR_val_000000.png
	|  |  |  |  |-- ...
	|  |  |  |  |-- CLEVR_val_014999.png
	|  |  |  |-- test
	|  |  |  |  |-- CLEVR_test_000000.png
	|  |  |  |  |-- ...
	|  |  |  |  |-- CLEVR_test_014999.png
	|  |  |-- questions
	|  |  |  |-- CLEVR_train_questions.json
	|  |  |  |-- CLEVR_val_questions.json
	|  |  |  |-- CLEVR_test_questions.json
	|  |  |-- scenes
	|  |  |  |-- CLEVR_train_scenes.json
	|  |  |  |-- CLEVR_val_scenes.json
  • Image Features

To make the input features consistent with those for VQA-v2, we provide a script to extract image features using a pre-trained ResNet-101 model, as most previous works did, and store them as .npz files, with each file corresponding to one image.

$ cd data/clevr

$ python clevr_extract_feat.py --mode=all --gpu=0
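
Roughly speaking, the extraction corresponds to taking the ResNet-101 layer3 output as a grid of 1024-D features, which matches the 1024-D GRID_FEAT expected for CLEVR later in this documentation. Below is a hedged torchvision sketch (not the provided clevr_extract_feat.py; the input size is an assumption):

import torch
import torchvision

# Keep the ResNet-101 backbone up to layer3 (drops layer4, avgpool and fc);
# its output for a 224x224 input is a 14x14 map of 1024-D features.
resnet = torchvision.models.resnet101(pretrained=True).eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-3])

with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed CLEVR image
    fmap = backbone(img)                # [1, 1024, 14, 14]
    grid_feat = fmap.view(fmap.size(0), fmap.size(1), -1).permute(0, 2, 1)  # [1, 196, 1024]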

All the processed feature files are placed in the data/clevr/feats folder to form the following tree structure:

|-- data
	|-- clevr
	|  |-- feats
	|  |  |-- train
	|  |  |  |-- 1.npz
	|  |  |  |-- ...
	|  |  |-- val
	|  |  |  |-- 1.npz
	|  |  |  |-- ...
	|  |  |-- test
	|  |  |  |-- 1.npz
	|  |  |  |-- ...

Getting Started

This page provides basic tutorials about the usage of OpenVQA. For installation instructions, please see Installation.

Training

The following script will start training a mcan_small model on the VQA-v2 dataset:

$ python3 run.py --RUN='train' --MODEL='mcan_small' --DATASET='vqa'
  • --RUN={'train','val','test'} to set the mode to be executed.
  • --MODEL=str, e.g., --MODEL='mcan_small', to assign the model to be executed.
  • --DATASET={'vqa','gqa','clevr'} to choose the dataset to be executed.

All checkpoint files will be saved to:

ckpts/ckpt_<VERSION>/epoch<EPOCH_NUMBER>.pkl

and the training log file will be placed at:

results/log/log_run_<VERSION>.txt

Optional arguments to add (a combined example follows this list):

  • --VERSION=str, e.g., --VERSION='v1' to assign a name for this model.
  • --GPU=str, e.g., --GPU='2' to train the model on the specified GPU device.
  • --SEED=int, e.g., --SEED=123 to use a fixed seed to initialize the model, which makes the trained model exactly reproducible. Leaving it unset results in a random seed.
  • --NW=int, e.g., --NW=8 to set the number of data-loading workers and accelerate I/O.
  • --SPLIT=str to set the training splits as you want. Setting --SPLIT='train' will trigger the evaluation script to compute the validation score after every epoch automatically.
  • --RESUME=True to resume training from saved checkpoint parameters. In this case, you should also assign the checkpoint version --CKPT_V=str and the resumed epoch number --CKPT_E=int.
  • --MAX_EPOCH=int to stop training at a specified epoch number.
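
For example, a training run combining several of these flags (the version name and GPU id here are arbitrary) could look like:

$ python3 run.py --RUN='train' --MODEL='mcan_small' --DATASET='vqa' --VERSION='mcan_small_baseline' --GPU='0' --SEED=123 --NW=8 --SPLIT='train'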

If you want to resume training from an existing checkpoint, you can use the following script:

$ python3 run.py --RUN='train' --MODEL='mcan_small' --DATASET='vqa' --RESUME=True --CKPT_V=str --CKPT_E=int

where the args CKPT_V and CKPT_E must be specified, corresponding to the version and epoch number of the loaded model.

Multi-GPU Training and Gradient Accumulation

We recommend using a GPU with at least 8 GB of memory, but if you don’t have such a device, we provide two solutions to deal with it:

  • Multi-GPU Training:

    If you want to accelerate training or train the model on a device with limited GPU memory, you can use more than one GPU:

    Add --GPU='0, 1, 2, 3...'

    The batch size on each GPU will be adjusted to BATCH_SIZE/#GPUs automatically.

  • Gradient Accumulation:

    If you only have one GPU with less than 8 GB of memory, an alternative strategy is to use gradient accumulation during training:

    Add --ACCU=n

    This makes the optimizer accumulate gradients for n small batches and update the model weights once afterwards. Note that BATCH_SIZE must be divisible by n to run this mode correctly (see the conceptual sketch below).
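
Conceptually, gradient accumulation behaves like the following PyTorch sketch (an illustration only, not OpenVQA's actual training loop):

import torch

# An effective batch of BATCH_SIZE samples is split into n sub-batches;
# gradients are summed over the n backward passes and the optimizer steps
# once per effective batch.
BATCH_SIZE, n = 64, 2
model = torch.nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss(reduction='sum')

x = torch.randn(BATCH_SIZE, 10)
y = torch.randint(0, 3, (BATCH_SIZE,), dtype=torch.long)

optimizer.zero_grad()
for sub_x, sub_y in zip(x.chunk(n), y.chunk(n)):
    loss = loss_fn(model(sub_x), sub_y)   # sum reduction keeps the accumulated gradients consistent
    loss.backward()                       # gradients accumulate across sub-batches
optimizer.step()                          # a single weight update per effective batch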

Validation and Testing

Warning: The args --MODEL and --DATASET should be set to the same values as those in the training stage.

Validation on Local Machine

Offline evaluation on a local machine only supports evaluation on the val split. If you want to evaluate on the test split, please see Testing on Online Server below.

There are two ways to start:

(Recommended)

$ python3 run.py --RUN='val' --MODEL=str --DATASET='{vqa,gqa,clevr}' --CKPT_V=str --CKPT_E=int

or use the absolute path instead:

$ python3 run.py --RUN='val' --MODEL=str --DATASET='{vqa,gqa,clevr}' --CKPT_PATH=str
  • For VQA-v2, the accuracy scores on the val split are computed and reported locally.

Testing on Online Server

All the evaluations on the test split of VQA-v2, GQA and CLEVR benchmarks can be achieved by using

$ python3 run.py --RUN='test' --MODEL=str --DATASET='{vqa,gqa,clevr}' --CKPT_V=str --CKPT_E=int

The result file is saved at: results/result_test/result_run_<CKPT_V>_<CKPT_E>.json

  • For VQA-v2, the result file is uploaded to the VQA Challenge website to evaluate the scores on the test-dev or test-std split.
  • For GQA, the result file is uploaded to the GQA Challenge website to evaluate the scores on the test or test-dev split.
  • For CLEVR, the result file can be evaluated by sending an email to the author Justin Johnson with the file attached; he will reply with the scores via email.

Benchmark and Model Zoo

Environment

We use the following environment to run all the experiments on this page.

  • Python 3.6
  • PyTorch 0.4.1
  • CUDA 9.0.176
  • CUDNN 7.0.4

VQA-v2

We provide three groups of results (including the accuracies of Overall, Yes/No, Number and Other) for each model on VQA-v2 using different training schemes as follows. We provide pre-trained models for the latter two schemes.

  • Train -> Val: trained on the train split and evaluated on the val split.
  • Train+val -> Test-dev: trained on the train+val splits and evaluated on the test-dev split.
  • Train+val+vg -> Test-dev: trained on the train+val+vg splits and evaluated on the test-dev split.

Note that for a given model, the base learning rate used in different schemes may differ; you should modify this setting in the config file to reproduce the results.

Train -> Val

| Model | Base lr | Overall (%) | Yes/No (%) | Number (%) | Other (%) |
|---|---|---|---|---|---|
| BUTD | 2e-3 | 63.84 | 81.40 | 43.81 | 55.78 |
| MFB | 7e-4 | 65.35 | 83.23 | 45.31 | 57.05 |
| MFH | 7e-4 | 66.18 | 84.07 | 46.55 | 57.78 |
| BAN-4 | 2e-3 | 65.86 | 83.53 | 46.36 | 57.56 |
| BAN-8 | 2e-3 | 66.00 | 83.61 | 47.04 | 57.62 |
| MCAN-small | 1e-4 | 67.17 | 84.82 | 49.31 | 58.48 |
| MCAN-large | 7e-5 | 67.50 | 85.14 | 49.66 | 58.80 |
| MMNasNet-small | 1.2e-4 | 67.79 | 85.02 | 52.25 | 58.80 |
| MMNasNet-large | 7e-5 | 67.98 | 85.22 | 52.04 | 59.09 |

Train+val -> Test-dev

| Model | Base lr | Overall (%) | Yes/No (%) | Number (%) | Other (%) | Download |
|---|---|---|---|---|---|---|
| BUTD | 2e-3 | 66.98 | 83.28 | 46.19 | 57.85 | model |
| MFB | 7e-4 | 68.29 | 84.64 | 48.29 | 58.89 | model |
| MFH | 7e-4 | 69.11 | 85.56 | 48.81 | 59.69 | model |
| BAN-4 | 1.4e-3 | 68.9 | 85.0 | 49.5 | 59.56 | model |
| BAN-8 | 1.4e-3 | 69.07 | 85.2 | 49.63 | 59.71 | model |
| MCAN-small | 1e-4 | 70.33 | 86.77 | 52.14 | 60.40 | model |
| MCAN-large | 5e-5 | 70.48 | 86.90 | 52.11 | 60.63 | model |

Train+val+vg -> Test-dev

| Model | Base lr | Overall (%) | Yes/No (%) | Number (%) | Other (%) | Download |
|---|---|---|---|---|---|---|
| BUTD | 2e-3 | 67.54 | 83.48 | 46.97 | 58.62 | model |
| MFB | 7e-4 | 68.25 | 84.79 | 48.24 | 58.68 | model |
| MFH | 7e-4 | 68.86 | 85.38 | 49.27 | 59.21 | model |
| BAN-4 | 1.4e-3 | 69.31 | 85.42 | 50.15 | 59.91 | model |
| BAN-8 | 1.4e-3 | 69.48 | 85.40 | 50.82 | 60.14 | model |
| MCAN-small | 1e-4 | 70.69 | 87.08 | 53.16 | 60.66 | model |
| MCAN-large | 5e-5 | 70.82 | 87.19 | 52.56 | 60.98 | model |
| MMNasNet-small | 1e-4 | 71.24 | 87.11 | 56.15 | 61.08 | model |
| MMNasNet-large | 5e-5 | 71.45 | 87.29 | 55.71 | 61.45 | model |

GQA

We provide a group of results (including Accuracy, Binary, Open, Validity, Plausibility, Consistency, Distribution) for each model on GQA as follows.

  • Train+val -> Test-dev: trained on the train (balanced) + val (balanced) splits and evaluated on the test-dev (balanced) split.

The results shown below are obtained from the online server. Note that the offline test-dev result is evaluated by the provided official script, which yields slightly different numbers compared to the online result for reasons we have not identified.

Train+val -> Test-dev

| Model | Base lr | Accuracy (%) | Binary (%) | Open (%) | Validity (%) | Plausibility (%) | Consistency (%) | Distribution | Download |
|---|---|---|---|---|---|---|---|---|---|
| BUTD (frcn+bbox) | 2e-3 | 53.38 | 67.78 | 40.72 | 96.62 | 84.81 | 77.62 | 1.26 | model |
| BAN-4 (frcn+bbox) | 2e-3 | 55.01 | 72.02 | 40.06 | 96.94 | 85.67 | 81.85 | 1.04 | model |
| BAN-8 (frcn+bbox) | 1e-3 | 56.19 | 73.31 | 41.13 | 96.77 | 85.58 | 84.64 | 1.09 | model |
| MCAN-small (frcn) | 1e-4 | 53.41 | 70.29 | 38.56 | 96.77 | 85.32 | 82.29 | 1.40 | model |
| MCAN-small (frcn+grid) | 1e-4 | 54.28 | 71.68 | 38.97 | 96.79 | 85.11 | 84.49 | 1.20 | model |
| MCAN-small (frcn+bbox) | 1e-4 | 58.20 | 75.87 | 42.66 | 97.01 | 85.41 | 87.99 | 1.25 | model |
| MCAN-small (frcn+bbox+grid) | 1e-4 | 58.38 | 76.49 | 42.45 | 96.98 | 84.47 | 87.36 | 1.29 | model |
| MCAN-large (frcn+bbox+grid) | 5e-5 | 58.10 | 76.98 | 41.50 | 97.01 | 85.43 | 87.34 | 1.20 | model |

CLEVR

We provide a group of results (including Overall, Count, Exist, Compare Numbers, Query Attribute, Compare Attribute) for each model on CLEVR as follows.

  • Train -> Val: trained on the train split and evaluated on the val split.

Train -> Val

| Model | Base lr | Overall (%) | Count (%) | Exist (%) | Compare Numbers (%) | Query Attribute (%) | Compare Attribute (%) | Download |
|---|---|---|---|---|---|---|---|---|
| MCAN-small | 4e-5 | 98.74 | 96.81 | 99.27 | 98.89 | 99.53 | 99.19 | model |

Adding a custom VQA model

This is a tutorial on how to add a custom VQA model to OpenVQA. By following the steps below, you will obtain a model that can run across the VQA/GQA/CLEVR datasets.

1. Preliminary

All implemented models are placed at <openvqa>/openvqa/models/, so the first thing to do is to create a folder there for your VQA model named <YOUR_MODEL_NAME>. After that, all your model-related files will be placed in the folder <openvqa>/openvqa/models/<YOUR_MODEL_NAME>/.

2. Dataset Adapter

Create a python file <openvqa>/openvqa/models/<YOUR_MODEL_NAME>/adapter.py to bridge your model and the different datasets. Different datasets have different input features, thus requiring different operators to handle the features.

Input

Input features (packed as feat_dict) for different datasets.

Output

Customized pre-processed features to be fed into the model.

Adapter Template

from openvqa.core.base_dataset import BaseAdapter

class Adapter(BaseAdapter):
    def __init__(self, __C):
        super(Adapter, self).__init__(__C)
        self.__C = __C

    def vqa_init(self, __C):
        # Your Implementation
        pass

    def gqa_init(self, __C):
        # Your Implementation
        pass

    def clevr_init(self, __C):
        # Your Implementation
        pass

    def vqa_forward(self, feat_dict):
        # Your Implementation
        pass

    def gqa_forward(self, feat_dict):
        # Your Implementation
        pass

    def clevr_forward(self, feat_dict):
        # Your Implementation
        pass

Each dataset-specific initialization function def <dataset>_init(self, __C) corresponds to one feed-forward function def <dataset>_forward(self, feat_dict); your implementations should follow the conventions of torch.nn.Module.__init__() and torch.nn.Module.forward(), respectively.

The variable feat_dict consists of the input feature names for the datasets, which correspond to the definitions in <openvqa>/openvqa/core/base_cfgs.py:

vqa:{
	'FRCN_FEAT': bottom-up features -> [batchsize, num_bbox, 2048],
	'BBOX_FEAT': bbox coordinates -> [batchsize, num_bbox, 5],
}
gqa:{
	'FRCN_FEAT': official bottom-up features -> [batchsize, num_bbox, 2048],
	'BBOX_FEAT': official bbox coordinates -> [batchsize, num_bbox, 5],
	'GRID_FEAT': official resnet grid features -> [batchsize, num_grid, 2048],
}
clevr:{
	'GRID_FEAT': resnet grid features -> [batchsize, num_grid, 1024],
}

A more detailed example can be found in the adapter for the MCAN model; a minimal sketch is also given below.
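
As a lighter-weight illustration, a minimal VQA-only branch of an adapter might look like the following sketch (HIDDEN_SIZE is an assumed entry from your model_cfgs.py, and a complete adapter must also implement the GQA and CLEVR branches):

import torch.nn as nn
from openvqa.core.base_dataset import BaseAdapter

class Adapter(BaseAdapter):
    def __init__(self, __C):
        super(Adapter, self).__init__(__C)
        self.__C = __C

    def vqa_init(self, __C):
        # project the 2048-D bottom-up features to the model's hidden size
        # (HIDDEN_SIZE is assumed to be defined in model_cfgs.py)
        self.frcn_linear = nn.Linear(2048, __C.HIDDEN_SIZE)

    def vqa_forward(self, feat_dict):
        frcn_feat = feat_dict['FRCN_FEAT']        # [batchsize, num_bbox, 2048]
        img_feat = self.frcn_linear(frcn_feat)    # [batchsize, num_bbox, HIDDEN_SIZE]
        return img_feat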

3. Definition of model hyper-parameters

Create a python file named <openvqa>/openvqa/models/<YOUR_MODEL_NAME>/model_cfgs.py.

Configuration Template

from openvqa.core.base_cfgs import BaseCfgs
class Cfgs(BaseCfgs):
    def __init__(self):
        super(Cfgs, self).__init__()
        # Your Implementation

Only the variables you define here can be used in the network. Their values can be overridden in the running configuration file described later.

Example

# model_cfgs.py
from openvqa.core.base_cfgs import BaseCfgs

class Cfgs(BaseCfgs):
    def __init__(self):
        super(Cfgs, self).__init__()
        self.LAYER = 6

# net.py
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net, self).__init__()
        self.__C = __C

        print(__C.LAYER)

Output: 6

4. Main body

Create a python file for the main body of the model as <openvqa>/openvqa/models/<YOUR_MODEL_NAME>/net.py. Note that the filename must be net.py, since this filename is invoked by the running script. Apart from this file, other auxiliary model files invoked by net.py can be named arbitrarily.

When implementing the model, pay attention to the following restrictions:

  • The main module should be named Net, i.e., class Net(nn.Module):
  • Besides the config object __C, the init function has three input variables: pretrained_emb corresponds to the GloVe embedding features for the question words; token_size corresponds to the number of words in the dataset vocabulary; answer_size corresponds to the number of answer classes for prediction.
  • The forward function has four input variables: frcn_feat, grid_feat, bbox_feat, ques_ix.
  • In the init function, you should initialize the Adapter which you’ve already defined above. In the forward function, you should feed frcn_feat, grid_feat, bbox_feat into the Adapter to obtain the processed image features.
  • Return a prediction tensor of size [batch_size, answer_size]. Note that no activation function like sigmoid or softmax should be appended to the prediction; the activation is applied to the prediction inside the loss function externally.

Model Template

import torch.nn as nn
from openvqa.models.mcan.adapter import Adapter

class Net(nn.Module):
    def __init__(self, __C, pretrained_emb, token_size, answer_size):
        super(Net, self).__init__()
        self.__C = __C
        self.adapter = Adapter(__C)

    def forward(self, frcn_feat, grid_feat, bbox_feat, ques_ix):
        img_feat = self.adapter(frcn_feat, grid_feat, bbox_feat)
        # model implementation
        ...
        return pred

5. Declaration of running configurations

Create a yml file at <openvqa>/configs/<dataset>/<YOUR_CONFIG_NAME>.yml and define your hyper-parameters there. We suggest that <YOUR_CONFIG_NAME> = <YOUR_MODEL_NAME>. If you need one base model to support the running scripts for different variants (e.g., MFB and MFH), you can have different yml files (e.g., mfb.yml and mfh.yml) and use the MODEL_USE param in the yml file to specify the actual model used (i.e., mfb).

Example:

MODEL_USE: <YOUR_MODEL_NAME>  # Must be defined
LAYER: 6
LOSS_FUNC: bce
LOSS_REDUCTION: sum

Finally, to register the added model with the running script, modify <openvqa>/run.py by adding <YOUR_CONFIG_NAME> to the model arguments there.

After completing all the steps above, you can use --MODEL=<YOUR_CONFIG_NAME> to train/val/test your model like the other provided models. For more information about the usage of the running script, please refer to the Getting Started page.
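
For example, assuming your config file is named my_model.yml (a hypothetical name) under <openvqa>/configs/vqa/, training would be started with:

$ python3 run.py --RUN='train' --MODEL='my_model' --DATASET='vqa'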

Contributing to OpenVQA

All kinds of contributions are welcome, including but not limited to the following.

  • Fixes (typo, bugs)
  • New features and components

Workflow

  1. fork and pull the latest version of OpenVQA
  2. checkout a new branch (do not use master branch for PRs)
  3. commit your changes
  4. create a PR

Code style

Python

We adopt PEP8 as the preferred code style. We use flake8 as the linter and yapf as the formatter. Please upgrade to the latest yapf (>=0.27.0) and refer to the configuration.

Before you create a PR, make sure that your code lints and is formatted by yapf.

C++ and CUDA

We follow the Google C++ Style Guide.


This repo is currently maintained by Zhou Yu (@yuzcccc) and Yuhao Cui (@cuiyuhao1996).

This version of the documentation was built on Sep 03, 2021.