CLEVR: Formal evaluation of compositional reasoning in visual question answering

Does an AI model actually understand what it sees, or is it just exploiting statistical shortcuts? The CLEVR project provides a controlled computational framework to distinguish genuine reasoning from pattern matching in visual question-answering (VQA) systems.

A synthetic environment with diagnostic precision

CLEVR automatically generates 3D scenes whose contents are fully known: every object and its attributes are recorded as ground truth, which limits the biases that real-world datasets carry. Each question is paired with an executable functional program, so the reasoning steps needed to answer it are explicit and can be mapped precisely.

This formal structure makes it possible to isolate and test specific capacities: attribute recognition, object comparison, logical reasoning, short-term memory, and spatial reasoning. During generation, CLEVR filters out “degenerate questions” that could be answered by guessing or by exploiting dataset bias, so a model has to reason to answer correctly.

Dataset structure

Scenes: Rendered with Blender. Composed of simple 3D objects, each with discrete attributes (shape, color, size, material) and a position in the scene.
Programs: Functional representations (e.g. filter, relate, count, query_attribute) over structured scene graphs.
Answers: Computed automatically by executing the program over the scene structure, without human labeling.
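
The three pieces above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's actual file format: the SceneObject fields, the step dictionaries ("function", "inputs", "values"), and the toy scene are all assumptions made for the example. It shows how an answer falls out of running a program over a scene structure, with no human label involved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SceneObject:
    # Hypothetical record for one object; field names are illustrative.
    shape: str
    color: str
    size: str
    material: str
    position: tuple  # (x, y) on the ground plane, toy units

# Hypothetical three-object scene.
SCENE = [
    SceneObject("cube", "red", "large", "metal", (0.0, 0.0)),
    SceneObject("cylinder", "green", "small", "rubber", (-2.0, 1.0)),
    SceneObject("sphere", "blue", "small", "metal", (1.5, -1.0)),
]

# "What color is the sphere?" as a flat list of steps. Each step names a
# function, the indices of earlier steps it consumes, and literal arguments.
PROGRAM = [
    {"function": "scene", "inputs": [], "values": []},
    {"function": "filter_shape", "inputs": [0], "values": ["sphere"]},
    {"function": "unique", "inputs": [1], "values": []},
    {"function": "query_color", "inputs": [2], "values": []},
]

def unique(objs):
    # A question is ill-posed if it refers to "the" object but 0 or 2+ match.
    if len(objs) != 1:
        raise ValueError(f"expected exactly one object, got {len(objs)}")
    return objs[0]

HANDLERS = {
    "scene": lambda scene, ins, vals: list(scene),
    "filter_shape": lambda scene, ins, vals: [o for o in ins[0] if o.shape == vals[0]],
    "unique": lambda scene, ins, vals: unique(ins[0]),
    "query_color": lambda scene, ins, vals: ins[0].color,
}

def execute(program, scene):
    """Run each step in order and return the output of the last one."""
    outputs = []
    for step in program:
        ins = [outputs[i] for i in step["inputs"]]
        outputs.append(HANDLERS[step["function"]](scene, ins, step["values"]))
    return outputs[-1]

print(execute(PROGRAM, SCENE))  # blue
```

Storing the program as a list of steps rather than a nested expression makes it easy to inspect intermediate outputs, which is what allows errors to be attributed to a specific reasoning step.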

Example 1: Counting with spatial relation

Question: How many cylinders are to the left of the red cube?
Program: count(filter_shape(cylinder, relate(left, unique(filter_color(red, filter_shape(cube, scene()))))))
Required abilities: Spatial reasoning, attribute filtering, counting.
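
As a rough sketch, the program for Example 1 can be run directly as nested Python calls. The scene below is made up, and the rule that "left" means a smaller x coordinate is an assumption for illustration (in a rendered scene the direction depends on the camera). The unique step, which checks that exactly one red cube exists, is written out explicitly here.

```python
# Hypothetical scene: one red cube and three cylinders at different x positions.
SCENE = [
    {"shape": "cube", "color": "red", "x": 0.0},
    {"shape": "cylinder", "color": "gray", "x": -2.0},
    {"shape": "cylinder", "color": "blue", "x": -1.0},
    {"shape": "cylinder", "color": "green", "x": 3.0},
    {"shape": "cube", "color": "blue", "x": 1.0},
]

def filter_shape(shape, objs):
    return [o for o in objs if o["shape"] == shape]

def filter_color(color, objs):
    return [o for o in objs if o["color"] == color]

def unique(objs):
    # "the red cube" only makes sense if exactly one object matches.
    if len(objs) != 1:
        raise ValueError(f"expected exactly one object, got {len(objs)}")
    return objs[0]

def relate(direction, anchor, scene):
    # Assumption: "left" means a strictly smaller x coordinate than the anchor.
    if direction == "left":
        return [o for o in scene if o["x"] < anchor["x"]]
    if direction == "right":
        return [o for o in scene if o["x"] > anchor["x"]]
    raise ValueError(f"unsupported direction: {direction}")

def count(objs):
    return len(objs)

def cylinders_left_of_red_cube(scene):
    red_cube = unique(filter_color("red", filter_shape("cube", scene)))
    return count(filter_shape("cylinder", relate("left", red_cube, scene)))

print(cylinders_left_of_red_cube(SCENE))  # 2
```

Each function corresponds to one of the required abilities: the filters handle attributes, relate handles the spatial step, and count produces the answer.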

Example 2: Attribute comparison

Question: Do the large cube and the green cylinder have the same material?
Program: equal_material(query_material(unique(filter_size(large, filter_shape(cube, scene())))), query_material(unique(filter_color(green, filter_shape(cylinder, scene())))))
Required abilities: Filtering, attribute extraction, short-term memory, logical comparison.
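
Example 2 has two independent branches whose results must both be held until the final comparison. The sketch below makes that explicit. The Obj type, the generic filter_by and query helpers, and the scene are assumptions for illustration, standing in for the per-attribute functions named in the program.

```python
from typing import NamedTuple

class Obj(NamedTuple):
    # Hypothetical object record; field names are illustrative.
    shape: str
    color: str
    size: str
    material: str

SCENE = [
    Obj("cube", "red", "large", "metal"),
    Obj("cube", "blue", "small", "rubber"),
    Obj("cylinder", "green", "small", "metal"),
    Obj("sphere", "green", "large", "rubber"),
]

def filter_by(attr, value, objs):
    # Generic stand-in for filter_shape, filter_color, filter_size.
    return [o for o in objs if getattr(o, attr) == value]

def unique(objs):
    if len(objs) != 1:
        raise ValueError(f"expected exactly one object, got {len(objs)}")
    return objs[0]

def query(attr, obj):
    # Generic stand-in for query_material and the other query_* functions.
    return getattr(obj, attr)

def same_material(scene):
    # Branch 1: the large cube.
    cube = unique(filter_by("size", "large", filter_by("shape", "cube", scene)))
    first = query("material", cube)  # must be remembered during branch 2
    # Branch 2: the green cylinder.
    cylinder = unique(filter_by("color", "green", filter_by("shape", "cylinder", scene)))
    second = query("material", cylinder)
    # Final step: compare the two retrieved values.
    return "yes" if first == second else "no"

print(same_material(SCENE))  # yes
```

A model that handles each branch correctly but loses the first result before the comparison would fail this question while still passing simpler query questions, which is the kind of distinction the program structure exposes.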

Cognitive taxonomy of reasoning

CLEVR categorizes each question by the type of cognitive task it demands. This enables fine-grained evaluation of neural architectures. Example categories:

Exist: Check for object presence under specific filters.
Count: Count the number of filtered objects.
Compare Integer: Compare the sizes of two object sets (equal, less than, or greater than).
Compare Attribute: Check whether two objects share the same value for a given attribute.
Query Attribute: Identify an object and retrieve a specific property.
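
One plausible way to use these categories is to read the category off the last function in a question's program and report accuracy per category. The sketch below assumes a program is given as a list of function names in execution order; the function names follow the examples in this article, the category labels are illustrative, and the records are invented for the demonstration.

```python
from collections import defaultdict

def question_category(program):
    """Map a program (list of function names) to a category via its final step."""
    final = program[-1]
    if final == "exist":
        return "Exist"
    if final == "count":
        return "Count"
    # Checked before the generic "equal_" prefix below.
    if final in {"equal_integer", "less_than", "greater_than"}:
        return "Compare Integer"
    if final.startswith("equal_"):  # e.g. equal_material: attribute comparison
        return "Compare Attribute"
    if final.startswith("query_"):  # e.g. query_color
        return "Query Attribute"
    raise ValueError(f"unrecognized final function: {final}")

def accuracy_by_category(records):
    """records: iterable of (program, predicted_answer, true_answer)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for program, predicted, truth in records:
        category = question_category(program)
        totals[category] += 1
        correct[category] += int(predicted == truth)
    return {category: correct[category] / totals[category] for category in totals}

# Invented records, for illustration only.
RECORDS = [
    (["scene", "filter_shape", "count"], "2", "2"),
    (["scene", "filter_shape", "count"], "3", "1"),
    (["scene", "filter_shape", "unique", "query_material", "equal_material"], "yes", "yes"),
    (["scene", "filter_shape", "unique", "query_color"], "red", "blue"),
]

print(accuracy_by_category(RECORDS))
# {'Count': 0.5, 'Compare Attribute': 1.0, 'Query Attribute': 0.0}
```

Reporting accuracy this way shows where a model fails, which a single aggregate score would hide.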

CLEVR is not about realism; it is about isolating and testing reasoning in artificial systems under controlled symbolic conditions.
