PERSPECTIVES
Visual Intelligence: From Pixels to Reasoning Systems
Last week we hosted an AI breakfast roundtable in Munich with the team from Black Forest Labs (BFL). No demos, no theatrics, just curious AI researchers and builders debating what visual intelligence actually is, and why most of the current discourse is still off by a layer. This piece is a synthesis of that discussion. It’s opinionated by design, written from the perspective of our investor, Laurin Class.
Jan 26, 2026
6 Min Read
Earlybird News
Image and video models already look impressive, and compared to early iterations the progress is undeniably real. Systems like BFL’s FLUX.2 have largely solved several first-order problems that mattered enormously only a year ago: image fidelity, photorealism, color consistency, composition, and increasingly reliable spatial reasoning within a single frame. These are not incremental wins; they represent genuine boundary-pushing and should be acknowledged as such.
But solving frame-level quality is not yet the same as solving intelligence. What remains unsolved is reasoning over time. Most models today are still renderers. They interpolate patterns extremely well, but they don’t maintain persistent state or understanding. Objects drift across frames, identities mutate, physics breaks, and scenes lack continuity. Durable memory, a notion of intent, and a grounded model of why elements exist in a scene are all still in the works. Calling this visual intelligence today would be generous. We generate pixels, not understanding. That’s not a critique; it’s a status update.
The root cause is structural. Text models train on symbols that already encode meaning. Language comes pre-compressed, with semantics, intent, and causality baked in. Vision models train on raw sensory data. They must learn structure, causality, and compression from scratch. That gap explains why time, consistency, and cost become the core bottlenecks in vision, and why progress here feels slower despite massive investment.
Human language transmits surprisingly little information. As BFL founder Andreas Blattmann illustrated with a concrete example, across languages the effective bandwidth is roughly forty bits per second, measured in entropy, i.e. how much uncertainty language actually removes. Nature, by contrast, is brutally high-bandwidth and low-compression. Images, video, and audio carry orders of magnitude more raw signal, and none of it is annotated with meaning. Models have to discover the abstractions themselves. That inevitably means more data, more compute, and far better representations. Moving from text to images added a spatial dimension. Video adds time, and time easily breaks things.
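For intuition, here is a back-of-the-envelope sketch of that bandwidth gap. The per-language figures are rough published estimates rather than numbers from the roundtable, and the video side assumes uncompressed 1080p RGB at 24 fps:

```python
# Illustrative figures only: approximate speech rate and per-syllable
# information for a few languages (rough published estimates).
languages = {            # (syllables per second, bits per syllable)
    "Japanese": (7.8, 5.0),
    "English": (6.2, 7.0),
    "German": (5.9, 6.1),
}
for name, (syll_per_s, bits_per_syll) in languages.items():
    print(f"{name}: ~{syll_per_s * bits_per_syll:.0f} bits/s")
# Faster-spoken languages pack less information per syllable, so the product
# lands in the same ~35-45 bits/s band across languages.

# Raw, uncompressed 1080p RGB video at 24 fps, by contrast:
raw_bits_per_s = 1920 * 1080 * 3 * 8 * 24
print(f"raw 1080p video: ~{raw_bits_per_s / 1e9:.1f} Gbit/s, none of it annotated")
```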
This is why the hardest problems in visual intelligence remain largely unsolved. Long-horizon temporal consistency is fragile. Grounding is weak, with models unable to explain why elements exist or how they relate causally. Control and editability remain crude, limiting real creative or industrial use. Cost and latency still make many deployments economically irrational. These are not UX gaps; they are still intelligence gaps.
The real battleground right now is representation learning. Not prompts, not interfaces, not clever orchestration layers. Internal representations determine everything that follows. If models learn the right abstractions, physics can emerge without hard-coded priors, long-horizon video becomes tractable, and real-time generation stops being a fantasy. This aligns with the bitter lesson of machine learning: hand-crafted inductive biases eventually lose to scale and data. Vision models will not learn the world the way humans do, and expecting them to is likely the wrong frame.
The next step, then, is not prettier pixels. It is decision-making in perceptual space. When models move from diffusion-style rendering to systems that can maintain state, reason causally, and simulate outcomes, generation becomes a side effect rather than the objective. At that point, prompted worlds, natural-language-driven simulations, and perceptual agents that act start to make sense.
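As a toy illustration of that shift, the sketch below keeps a latent state, simulates candidate actions, and picks one without rendering a single pixel. Every function and number in it is a made-up stand-in, not a description of any existing system:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame):
    """Stand-in encoder: compress pixels into a small latent state."""
    return frame.reshape(-1)[:16].astype(np.float32) / 255.0

def simulate(state, action):
    """Stand-in dynamics model: predict the next latent state for an action."""
    return state + 0.1 * action

def score(state, goal):
    """Evaluate a simulated outcome against a goal, entirely in latent space."""
    return -float(np.linalg.norm(state - goal))

frame = rng.integers(0, 256, size=(8, 8, 3))   # toy observation
state = encode(frame)                          # persistent state, not pixels
goal = np.full_like(state, 0.5)                # arbitrary target state

# Decision-making happens over simulated latent futures; nothing is rendered.
actions = [-1.0, 0.0, 1.0]
best = max(actions, key=lambda a: score(simulate(state, a), goal))
print("chosen action:", best)
```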
Aside from representation, real-time capability is the quiet force multiplier in all of this. It is not a marginal latency improvement; it fundamentally changes what systems can be. Interactive loops replace batch jobs, software shifts from static outputs to ongoing interaction, and feedback replaces one-off generation. Achieving this requires brutal efficiency, which again collapses back to representation quality. Hardware remains a real bottleneck. It will not magically disappear, and models that assume infinite compute will lose.
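One crude way to see what brutal efficiency means in practice is the per-frame budget. The stage timings below are hypothetical placeholders; only the arithmetic matters:

```python
# Rough per-frame latency budget for interactive, real-time generation.
TARGET_FPS = 24
frame_budget_ms = 1000 / TARGET_FPS   # ~41.7 ms for everything, every frame

stages_ms = {                          # hypothetical pipeline stages
    "encode input / update state": 3,
    "generate (denoise / decode)": 30,
    "postprocess and display": 5,
}
spent = sum(stages_ms.values())
print(f"budget {frame_budget_ms:.1f} ms, spent {spent} ms, "
      f"headroom {frame_budget_ms - spent:.1f} ms")
# A batch pipeline can take seconds per clip; an interactive loop that overruns
# this budget collapses back into batch generation.
```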
Evaluation is another weak link. Video evals today are still largely vibes-based, with humans watching clips and declaring them good or bad. That does not scale, and compared to the tooling ecosystem around language models, the gap is striking. Automated, task-based, non-anthropocentric evaluation remains underexplored and represents a meaningful compounding advantage that could be a category of its own.
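As one hypothetical example of such a check, the sketch below scores temporal consistency as embedding drift between consecutive frames; the random-projection encoder is a stand-in for whatever image features one would actually use:

```python
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.normal(size=(8 * 8 * 3, 64))   # stand-in feature extractor

def embed(frame):
    """Project centered pixels into a feature vector and normalize it."""
    v = (frame.reshape(-1).astype(np.float32) - 127.5) @ PROJ
    return v / (np.linalg.norm(v) + 1e-8)

def temporal_consistency(frames):
    """Mean cosine similarity of consecutive frame embeddings (1.0 = perfectly stable)."""
    embs = [embed(f) for f in frames]
    return float(np.mean([np.dot(a, b) for a, b in zip(embs, embs[1:])]))

base = rng.integers(0, 256, size=(8, 8, 3))
smooth = [base + i for i in range(8)]                              # slow drift
jumpy = [rng.integers(0, 256, size=(8, 8, 3)) for _ in range(8)]   # unrelated frames
print("smooth clip:", round(temporal_consistency(smooth), 3))      # close to 1.0
print("jumpy clip:", round(temporal_consistency(jumpy), 3))        # much lower
```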
Where value accrues follows directly from these dynamics. It will not accrue to undifferentiated image or video APIs, and it will not accrue to thin wrappers around foundation models. Value will accrue to teams building reasoning-capable perceptual systems, to control layers that turn models into usable tools, to infrastructure that enables real-time operation, and to organizations that combine models, product thinking, and deep domain context into vertical-specific workflows.

One quiet undercurrent in the discussion was geography. Europe is better positioned here than many assume. Theoretical talent density is high, research culture is strong, and fewer legacy business constraints make focus easier. Increasingly, top researchers are choosing depth over sprawl. In early technological shifts, focus beats resources more often than not, as the BFL team, based in Freiburg, Germany, knows all too well.
Visual intelligence is still early. That is the opportunity. Pixels were the demo. Reasoning systems are the product.
Subscribe to the Gradient Descending calendar to stay up to date on upcoming events, and to Earlybird investor Laurin Class’ Substack for more of his writing.