VQA: Visual Question Answering

VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language, and commonsense knowledge to answer.

- 265,016 images (COCO and abstract scenes)
- At least 3 questions (5.4 questions on average) per image
- 10 ground truth answers per question
- 3 plausible (but likely incorrect) answers per question
- Automatic evaluation metric
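
The text above does not define the automatic evaluation metric. As a minimal sketch, the commonly cited VQA accuracy counts a predicted answer as fully correct when at least 3 of the 10 human annotators gave that same answer, i.e. `min(matches / 3, 1)`. The function name `vqa_accuracy` and the simple lowercase/strip normalization are assumptions made for illustration; official evaluation tooling may normalize answers more thoroughly and average over annotator subsets.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score one predicted answer against the human ground truth answers.

    Sketch of the commonly cited rule: accuracy = min(#matching humans / 3, 1).
    The normalization here (strip + lowercase) is a simplifying assumption.
    """
    def normalize(text: str) -> str:
        return text.strip().lower()

    target = normalize(predicted)
    # Count how many annotators gave exactly this (normalized) answer.
    matches = sum(1 for answer in human_answers if normalize(answer) == target)
    # Three or more agreeing annotators yields full credit.
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    # Hypothetical annotations for a single question (10 ground truth answers).
    answers = ["yes", "Yes", "no", "no", "no", "no", "no", "no", "no", "no"]
    for guess in ("yes", "no", "maybe"):
        print(guess, vqa_accuracy(guess, answers))
```

Under this rule, an answer given by only one or two annotators earns partial credit, which softens the penalty for questions where humans themselves disagree.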