Pointing, not describing

I keep seeing AI systems get vision capabilities, and the pattern is always the same: you drop an image in and the model describes what it sees. This has been working for a while now, even for open models and even for video. So when I read about a new technique from DeepSeek that adds vision to an AI system, my first reaction was "so what?", because vision is not the bottleneck anymore.

But then I actually read the paper, and it turns out they are not just adding vision. They are changing how the AI sees, and that is the part that matters. The old approach to visual reasoning asks the model to describe what it sees with words. If you want to count the strawberries in a bowl, the model has to say something like: there are strawberries clustered on the left, a few scattered near the rim, some partially hidden behind others, and some with their stems visible. At the end you cross your fingers that it got the number right. You cannot really check its work, because the whole process happened in a black box of language.

The new approach does something different, and it is so obvious in retrospect. When you or I want to count things in an image, we do not write a paragraph about it. We point at each thing with a finger, count one, two, three, and we are done. Instead of describing the image like a poet, the AI points like a human. It gets to use visual primitives like coordinates, bounding boxes, and trajectories as part of its thinking process, which makes it more accurate and also faster, because describing things with words is a wasteful intermediate step.
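
To make the difference concrete, here is a minimal sketch of what "counting by pointing" buys you. The `<point>x,y</point>` trace format and the function names are my own invention for illustration, not the paper's actual syntax. The point is only that a trace made of coordinates can be parsed and checked, while a paragraph of description cannot.

```python
import re

# Hypothetical trace format, assumed for illustration only.
POINT_PATTERN = re.compile(r"<point>\s*(\d+)\s*,\s*(\d+)\s*</point>")


def parse_points(trace: str) -> list[tuple[int, int]]:
    """Pull every (x, y) pair out of a reasoning trace."""
    return [(int(x), int(y)) for x, y in POINT_PATTERN.findall(trace)]


def count_by_pointing(trace: str, width: int, height: int):
    """Count the distinct in-image points a trace refers to.

    Returns the count and the points themselves, so a human (or another
    program) can overlay them on the image and verify each one.
    """
    points = parse_points(trace)
    in_bounds = [(x, y) for x, y in points if 0 <= x < width and 0 <= y < height]
    unique = list(dict.fromkeys(in_bounds))  # drop exact repeats, keep order
    return len(unique), unique


trace = (
    "one <point>10, 20</point> two <point>30,40</point> "
    "again <point>10,20</point> stray <point>999,5</point>"
)
count, points = count_by_pointing(trace, width=100, height=100)
print(count, points)
```

The repeated point and the out-of-bounds point are both caught mechanically, which is exactly the kind of checking you cannot do on a prose description.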

The numbers are genuinely surprising. This free system uses about ninety percent fewer visual tokens than most frontier models, and it still matches or beats them across seven benchmarks. The paper also explicitly excluded the authors' own in-house benchmarks. That matters, because the easiest way to win a benchmark game is to invent a new benchmark that you happen to be good at. They did not do that. They let the existing benchmarks speak for themselves, and the results held up.

This is free and open research, which means the technique can be added to existing models. It is described as a blueprint rather than a finished product, and I think that is the right framing: the real contribution is not a specific model checkpoint but a way of thinking about visual reasoning that makes the model both cheaper and better. That is the kind of combination you do not see very often.
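
To put the ninety percent figure in perspective, here is a back-of-the-envelope sketch. The baseline of 1000 visual tokens is a number I made up, and the function name is hypothetical. The only outside fact it leans on is that standard self-attention compares every token with every other token, so the number of token pairs grows with the square of the token count. It counts visual tokens only and ignores text tokens.

```python
def visual_token_budget(baseline_tokens: int, percent_fewer: int) -> dict:
    """Compare token counts and pairwise attention work before and after a cut.

    baseline_tokens is an assumed figure, not one taken from the paper.
    """
    reduced = baseline_tokens * (100 - percent_fewer) // 100
    return {
        "baseline_tokens": baseline_tokens,
        "reduced_tokens": reduced,
        # Self-attention looks at every (token, token) pair.
        "baseline_pairs": baseline_tokens ** 2,
        "reduced_pairs": reduced ** 2,
    }


budget = visual_token_budget(baseline_tokens=1000, percent_fewer=90)
print(budget)
```

Under these assumptions, cutting tokens to a tenth cuts the visual-to-visual pairs to a hundredth, which is why a token reduction like this matters more than it first sounds.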

The mechanism behind it is called policy distillation. The idea is that you have a bunch of expert AI models, where one is great at drawing bounding boxes and another is great at tracing paths through a maze, and you train a student model that learns from all of these teachers at once. When the student says "here is what I would try", the teachers say "here is what I would have done instead". After enough iterations, the student internalizes all of these different visual skills into a single model.

This is why they call it distillation: you are condensing the knowledge of multiple specialists into one generalist. The result is a model that can not only answer questions about images but can also trace its reasoning visually. It can show you where the crown connects to the octopus, so you can see the path it took instead of just trusting the final answer. When something goes wrong, you can find the mistake and fix it, which is a huge step toward AI systems that we can actually understand.
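
Here is a toy version of that student-and-teachers loop, just to show the shape of the idea. Everything in it is an assumption on my part: the two teachers are fixed probability tables over three made-up actions, the student is a table of logits per task rather than a shared network, and the update is plain cross-entropy toward the teacher, which is not necessarily the objective the paper uses.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(teacher, student):
    """KL(teacher || student): how far the student is from the teacher."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)


def distill(teachers, steps=300, lr=0.5):
    """Train one student against every teacher in turn."""
    num_actions = len(next(iter(teachers.values())))
    student = {task: [0.0] * num_actions for task in teachers}
    for _ in range(steps):
        for task, target in teachers.items():
            guess = softmax(student[task])  # "here is what I would try"
            # Cross-entropy gradient w.r.t. logits is (guess - target),
            # i.e. the gap to "here is what I would have done instead".
            student[task] = [
                z - lr * (g - t) for z, g, t in zip(student[task], guess, target)
            ]
    return student


# Hypothetical teachers: each is an expert on one kind of visual task.
TEACHERS = {"boxes": [0.7, 0.2, 0.1], "maze": [0.1, 0.1, 0.8]}

student = distill(TEACHERS)
gaps = {task: kl(TEACHERS[task], softmax(student[task])) for task in TEACHERS}
print(gaps)
```

After training, the one student sits close to both teachers at once, which is the "many specialists into one generalist" part in miniature.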

There are limitations, though. The model does not automatically use this kind of pointed thinking; it needs a word as a cue to start reasoning visually. Bounding boxes work well for things like people, but if you are trying to count blades of grass or strands of hair, the low resolution becomes a problem. Thin structures are always the thing that breaks these visual systems. The topological reasoning also does not generalize as reliably as you would want when you show it something completely new.

But I feel like this is one of those papers that shifts how people think about the problem, and that is happening a lot lately in AI research. We always assumed that making models smarter meant giving them higher resolution images, more pixels, and more data. It turns out that sometimes the right move is to give them better tools for thinking about what they already see. Less can be more when the less is designed well.
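
One way to picture the thin-structure limitation is to imagine the model can only place points on a coarse grid. This is my own simplification, not how the paper describes the failure, and the cell sizes and coordinates below are made up. Objects that are far apart land in different cells, while thin objects a few pixels apart collapse into the same one.

```python
def quantize(point, cell):
    """Snap a pixel coordinate to the grid cell that contains it."""
    x, y = point
    return (x // cell, y // cell)


def distinguishable(points, cell):
    """How many of these points are still separate after quantization?"""
    return len({quantize(p, cell) for p in points})


# Hypothetical pixel coordinates for illustration.
strawberries = [(40, 40), (200, 60), (120, 180)]          # well separated
grass_blades = [(100, 50), (103, 50), (106, 50), (109, 50)]  # 3 px apart

print(distinguishable(strawberries, cell=32))
print(distinguishable(grass_blades, cell=32))
print(distinguishable(grass_blades, cell=2))
```

With a 32-pixel cell the three strawberries stay separate but the four blades merge into one, and they only come apart again once the grid is fine enough.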

This is also part of a broader shift that I care about. The large AI companies are going to IPO, and they will become ventures that need to maximize profits every quarter. Open models with free weights become more important every time a company closes off another piece of the stack. This paper makes those open models better for free, and it describes the method in enough detail that other researchers can build on it. That is the kind of research I want to see more of.

Find me on Twitter if any of this connects. I am @troysk704.