Can artificial intelligence truly see and understand at the same time? TAVATA 🐝 is a diagnostic testbed that combines computer vision 🖼️ and natural language 🗣️ to evaluate reasoning in controlled, isolated scenarios.


What is TAVATA 🐝?

TAVATA 🐝 stands for Automated Testing of Vision and Textual Analysis. It is a synthetic environment designed to assess whether an AI model can reason over visual scenes and respond to structured questions in natural language with semantic precision.

Controlled vision

Instead of using real-world images (with their noise and biases), TAVATA 🐝 generates synthetic 3D scenes. These scenes contain simple objects 🟥🟦🟨 whose properties (shape, color, size) are precisely defined.
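To make this concrete, here is a minimal sketch of how such a fully specified scene might be represented in code. The class and field names are illustrative assumptions, not TAVATA's actual scene format:

```python
from dataclasses import dataclass

# Illustrative sketch only; TAVATA's real scene schema may differ.
@dataclass(frozen=True)
class SceneObject:
    shape: str             # e.g. "cube", "sphere", "cylinder"
    color: str             # e.g. "red", "blue", "yellow"
    size: str              # e.g. "small", "large"
    position: tuple        # (x, y, z) coordinates in the synthetic scene

scene = [
    SceneObject("cube", "red", "large", (0.0, 1.0, 0.0)),
    SceneObject("sphere", "blue", "small", (2.0, -1.0, 0.0)),
    SceneObject("cube", "yellow", "small", (-1.5, 0.5, 0.0)),
]
```

Because every attribute is set programmatically, the ground-truth answer to any question about the scene can be computed exactly, with no labeling noise.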

Programmatic language

Questions are generated via symbolic functional programs, which eliminate grammatical ambiguity and allow us to precisely track the logical operations required to answer 🧠.
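A toy illustration of what such a functional program might look like: the question "How many small cubes are there?" becomes a composition of filter and count primitives. The primitives below (`filter_attr`, `count`) and the dictionary scene encoding are hypothetical stand-ins, not TAVATA's actual DSL:

```python
# Hypothetical primitives for a symbolic question program.
def filter_attr(objects, attr, value):
    """Keep only objects whose attribute equals the given value."""
    return [o for o in objects if o[attr] == value]

def count(objects):
    """Terminal operation: how many objects survived the filters."""
    return len(objects)

scene = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "blue", "size": "small"},
    {"shape": "cube", "color": "yellow", "size": "small"},
]

# "How many small cubes are there?"
# as the program: count(filter(size=small, filter(shape=cube, scene)))
answer = count(filter_attr(filter_attr(scene, "shape", "cube"), "size", "small"))
# answer == 1
```

Since the question is built from the program rather than the other way around, we know exactly which logical steps (two filters, one count) a model must perform to answer it.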

Why combine vision and language?

Much of human intelligence relies on linking what we see with what we understand through language. TAVATA 🐝 follows the same principle to test whether an AI can:

  • πŸ” Locate objects in structured scenes
  • 🧠 Compare attributes like size or color
  • πŸ—£οΈ Interpret spatial relationships described in natural language
  • πŸ“ Reason about quantities (β€œAre there more cubes than spheres?”)

A testbed designed to isolate reasoning

TAVATA 🐝 is not about learning the real world. Its goal is to diagnose reasoning processes in AI by providing a clean, structured environment for cognitive analysis.

By controlling each element (position, color, question), we can evaluate:

  • ✅ What kind of logic is required to answer a given question
  • 🧪 Whether the model fails due to lack of memory, attention, or understanding
  • 📊 How performance evolves under increasing complexity
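One way to study the last point is to bucket results by a complexity proxy, such as the length of the question's functional program. The records and numbers below are purely hypothetical, just to show the shape of such an analysis:

```python
from collections import defaultdict

# Hypothetical evaluation records: (program_length, model_was_correct).
# Program length is used here as a proxy for question complexity.
results = [
    (2, True), (2, True), (2, False),
    (5, True), (5, False), (5, False),
]

by_length = defaultdict(list)
for length, correct in results:
    by_length[length].append(correct)

# Accuracy per complexity bucket.
accuracy = {length: sum(v) / len(v) for length, v in by_length.items()}
# e.g. accuracy drops as program length grows
```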

TAVATA 🐝 is like a computational microscope: it doesn't look at the world; it looks into the artificial mind trying to understand it.