The hand cannot think

Earlier this week I was hanging out in one of the engineering chats I am part of and the moderator dropped a take that lit the room up, which was that frontend is cooked and only backend engineers are going to be required from now on. The thread took off, and I said what I actually thought, which was that the take is wrong because frontend is actually the harder half to build with LLMs, not the easier one, and the reason is the one nobody wanted to engage with. The LLM cannot see the jitters, the flickers, or the moment a layout shifts two pixels off the baseline and the whole page starts to feel wrong.

Multiple people jumped in to say I was comparing frontend with backend, and I said back that the comparison I was making is frontend engineering using LLMs versus backend engineering using LLMs, and in that comparison frontend is the harder half by a lot. But the room had already decided frontend was cooked and was not in the mood to hear otherwise.

So here are the arguments.

A backend in Python or Rust is built out of the same medium the model is built out of. Tokens, functions, libraries, types, all of it is text and all of it has rules the model can pattern-match. When you ask an LLM to write a CRUD endpoint, you are asking it to do the one thing it was trained to do, which is predict the next plausible token in a stream of tokens that look like the streams it has seen a million times before. It is not thinking, and it is not even close to thinking. It is autocompleting at a very high resolution, and the autocomplete is so good that it feels like thinking, which is the part that fools people.
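To make "predict the next plausible token" concrete, here is a deliberately tiny sketch: a bigram counter that only remembers which token tended to follow which. It is a toy and not how a real LLM is built (those are neural networks conditioning on long contexts), and the names `train`, `predictNext`, and the sample corpus are made up for illustration. The point it shows is only that the prediction comes from streams seen before, with no notion of what a token means.

```typescript
// Toy bigram model (illustrative only, far simpler than a real LLM):
// count which token follows which, then predict the most frequent follower.
type Counts = Map<string, Map<string, number>>;

function train(corpus: string[][]): Counts {
  const counts: Counts = new Map();
  for (const tokens of corpus) {
    let prev: string | undefined;
    for (const tok of tokens) {
      if (prev !== undefined) {
        const followers = counts.get(prev) ?? new Map<string, number>();
        followers.set(tok, (followers.get(tok) ?? 0) + 1);
        counts.set(prev, followers);
      }
      prev = tok;
    }
  }
  return counts;
}

// Returns the follower seen most often after `token`, or undefined if the
// token never appeared with a follower in the training data.
function predictNext(counts: Counts, token: string): string | undefined {
  const followers = counts.get(token);
  if (!followers) return undefined;
  let best: string | undefined;
  let bestCount = 0;
  followers.forEach((n, next) => {
    if (n > bestCount) {
      best = next;
      bestCount = n;
    }
  });
  return best;
}

// Hypothetical "backend-looking" token streams.
const sampleCorpus: string[][] = [
  ["def", "get_user", "(", "id", ")"],
  ["def", "get_order", "(", "id", ")"],
  ["def", "get_user", "(", "name", ")"],
];
const sampleCounts = train(sampleCorpus);

console.log(predictNext(sampleCounts, "("));   // "id" (seen twice, "name" once)
console.log(predictNext(sampleCounts, "def")); // "get_user"
```

Nothing in `sampleCounts` knows what a user or an id is. It only knows what came next last time, which is enough when the target is itself text.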

A frontend is something else. A frontend is a place where half a dozen constraints have to be true at the same time, and the constraints live in physics and in the browser, not in the language. A button has to be 12 pixels from the edge and the icon has to land on the baseline. The text has to wrap when the viewport shrinks, the keyboard focus ring has to be visible, the animation has to run at 60 frames per second, the screen reader has to announce the right role. None of these constraints are linguistic. They are physical and geometric and environmental, and they all change every time the user resizes the window or switches inputs. An LLM has no model of any of that. It has tokens that look like the tokens another developer used in another project, and when the geometry is wrong it does not know the geometry is wrong because it cannot see the geometry. It is doing language about a thing that is not language.
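Here is a rough sketch of what checking just two of those constraints looks like once you have numbers to check. The `Rect` shape, the helper names, and the half-pixel tolerance are my own assumptions for illustration. In a real page the rectangles would come from something like `getBoundingClientRect()` after the browser has done layout, which is exactly the step the model never gets to see.

```typescript
// Hypothetical shapes and helpers for illustration; not from any library.
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Distance from the button's right edge to the container's right edge.
function rightInset(container: Rect, button: Rect): number {
  return container.left + container.width - (button.left + button.width);
}

// True when the icon's bottom edge sits on the baseline, within a tolerance.
function onBaseline(icon: Rect, baselineY: number, tolerancePx: number = 0.5): boolean {
  return Math.abs(icon.top + icon.height - baselineY) <= tolerancePx;
}

// Collects a human-readable problem for every constraint that fails.
function checkLayout(container: Rect, button: Rect, icon: Rect, baselineY: number): string[] {
  const problems: string[] = [];
  const inset = rightInset(container, button);
  if (Math.abs(inset - 12) > 0.5) {
    problems.push(`button inset is ${inset}px, expected 12px`);
  }
  if (!onBaseline(icon, baselineY)) {
    problems.push("icon is off the baseline");
  }
  return problems;
}

// Made-up measurements: an 800px-wide toolbar, a button, and an icon.
const toolbar: Rect = { left: 0, top: 0, width: 800, height: 48 };
const saveButton: Rect = { left: 708, top: 8, width: 80, height: 32 };
const saveIcon: Rect = { left: 716, top: 14, width: 16, height: 16 };

console.log(checkLayout(toolbar, saveButton, saveIcon, 30)); // []

// Same markup, but the icon has drifted two pixels down.
const driftedIcon: Rect = { ...saveIcon, top: 16 };
console.log(checkLayout(toolbar, saveButton, driftedIcon, 30)); // ["icon is off the baseline"]
```

The code that produced both layouts could be token-for-token identical. The difference only exists in the measured boxes, and those change with every viewport, font, and input method.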

The framing I keep coming back to is the limb. The LLM is a limb, like an arm or a leg, and like a hand it is very good at the things hands are good at, which are gripping and lifting and swinging, and it is very bad at the things hands are not designed for, which are seeing in three dimensions and understanding the physics of the world. You would not ask your hand to do your taxes. You would not ask your leg to write a novel. The LLM is the same. It is a great text hand and it is not a world brain.

That last line is the cleanest summary I have found for what is going wrong with the way developers deploy these models. They treat the LLM as if it were a general intelligence that happens to express itself in code. It is not. It is a very sophisticated next-token predictor that happens to operate on English, and English is not a model of the world. English is a description of the world, written by people who were already inside the world, and the description is lossy, and the LLM only has the description, not the world. It does not know what “left” is in the way you know what “left” is. It has tokens for “left” and tokens for “right”, but those tokens do not point to any spatial coordinate. They point to other tokens. That is the whole game. It is a very good game and it is worth a lot, but it is not the game people think they are playing.

I learned this the slow way through self-driving cars. Around 2017 I was training CNNs end to end on driving data, and the working assumption in the field was that if you threw enough hours of driving video at a deep enough network, the network would learn the way humans learn to drive, which is to say, by watching and intuiting the rules of the road from observation. It is a beautiful hypothesis and it is also wrong, and the wrongness took me about two years and a lot of broken steering corrections to figure out. A CNN does not actually know the world. It knows the way pixels tend to co-occur with labels, and it can recognize a stop sign because the textures of stop signs have a particular statistical fingerprint, but it cannot tell you that a stop sign is a piece of metal bolted to a post that exists in a three-dimensional space, and that the metal has mass, and that a car has mass, and that two masses cannot occupy the same space at the same time. It cannot do physics, and it cannot do the thing a fourteen-year-old can do, which is imagine what would happen if the car in front braked hard and the car behind did not. The CNN is solving a different problem, and the gap between the problem it solves and the problem you actually need solved is the gap that takes two years and a lot of broken parts to see.

This is the same gap I see in frontend. The LLM is solving a language problem and the browser is a physics problem, and the LLM is not going to bridge the gap by getting better at language. It is going to bridge the gap by getting better at physics. That is the part Yann LeCun has been saying for a long time, and most of the industry has been ignoring it because the language models were making so much money that nobody wanted to hear the architect of convolutional networks say that language models were a dead end. He left Meta in November to start AMI Labs, which raised a billion dollars in March to build what he calls world models, which are systems that learn the actual physics of the world instead of trying to autocomplete a description of it. The thing LeCun is saying, in slightly more technical language, is exactly what I learned the hard way in the hills with a CNN. You cannot drive a car on patterns of pixels and you cannot lay out a page on patterns of tokens. Both of those jobs need a model of the world, not a model of the description of the world.

I am hopeful that world models come fast, and the hope is grounded in the fact that the work LeCun and a handful of others are doing is the missing piece for self-driving, where the path from CNN to something safer and more general has been stuck for a decade. The same fix that is going to close the frontend gap is going to close the self-driving gap, because the world is the same world and the gap is the same gap and the fix is the same fix. A model that knows what a button is in three dimensions will not need you to explain ARIA roles. A model that knows what a car is in three dimensions will not need you to hand-label every stop sign in the dataset. I think the fix is closer than the hype cycle makes it look.

In the meantime, use the hand for the things the hand is good at, and do not mistake the hand for the brain, because the day you forget the difference is the day the hand starts confidently producing things that look right and behave wrong. The hand will keep getting better. The temptation to mistake it for the brain will keep getting stronger. The only thing that protects you is keeping the brain awake and remembering what the hand is actually good at.

Find me on Twitter. I am @troysk704.