
It called me a "NASAwannabe," defending that joke as "peak wordplay" and insulting my "Honda Civic."

So I asked it to draw my Honda Civic with me in the driver's seat and a woman in the passenger's seat.

It got it backwards, putting the woman in the driver's seat.

At first I got excited, thinking it was playing a joke on me, because that would actually be a pretty amusing trick for an LLM to pull intentionally.

But then I experimented a bit more and it became clear that it didn't understand the mistake and wasn't capable of fixing it. LLMs just don't have any intelligence.

https://chatgpt.com/share/68a0d27c-fdd4-800e-9f22-ece644ae87...



After using various LLMs for creative project rubber-ducking, I've found that the most common thing for them to mix up while seeming otherwise 'intelligent' is reversing the relationships between two or more things - left and right, taller and shorter, older and younger, etc. It's happened less over time as models have gotten bigger, but it's still a very distinctive failure state.


Left and right are considered opposites, but semantically they’re extremely similar. They both refer to directions that are relative to some particular point and orientation. Compared to, say, the meaning of “backpack,” their meanings are nearly identical. And in the training data, “A left X” and “B right Y” will tend to have very similar As and Bs, and Xs and Ys. No surprise LLMs struggle.

I imagine this is also why it’s so hard to get an LLM to not do something by specifically telling it not to do that thing. “X” and “not X” are very similar.
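The distributional point can be illustrated with toy co-occurrence vectors. The counts below are invented purely for illustration, not real corpus data: words that appear in near-identical contexts end up with near-identical vectors, so "left" and "right" look far more alike than either does to "backpack".

```python
from math import sqrt

# Toy co-occurrence counts over a tiny hand-picked context vocabulary:
# [turn, seat, side, door, strap, zipper, carry]
# "left" and "right" co-occur with the same words; "backpack" doesn't.
vectors = {
    "left":     [9, 7, 8, 6, 0, 0, 0],
    "right":    [8, 7, 9, 5, 0, 0, 0],
    "backpack": [0, 1, 0, 0, 8, 7, 9],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

print(cosine(vectors["left"], vectors["right"]))     # close to 1.0
print(cosine(vectors["left"], vectors["backpack"]))  # close to 0.0
```

On this measure the "opposites" are nearly synonyms, which is roughly how an embedding-based model sees them.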


The image encodings often don’t capture positional information very well.


A lot of pictures on the web are flipped horizontally because of cameras, mirrors, you name it. It's usually trivial for humans to infer the directions involved; I wonder if LLMs could do it as well.


Recently I scanned thousands of family photos, but I didn't have a good way to get them oriented correctly before scanning. I figured I could "fix it in post".

If you upload an incorrectly oriented image to Google Photos, it will automatically figure that out and suggest the right way up, even with no EXIF data. So I set about trying to find an open-source way to do that, since I'm self-hosting the family photos server.

So far, I haven't managed it. I found a project doing it using pytorch or something, but it didn't work well.
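For what it's worth, the usual trick for this (and presumably what the PyTorch project attempted) is self-supervised: rotate known-upright images by 0/90/180/270 degrees, train a classifier to predict the rotation, then undo the predicted rotation at inference time. A minimal numpy sketch of the labeling and undo steps — function names are mine, and the classifier itself is omitted:

```python
import numpy as np

def make_rotation_examples(upright):
    # Self-supervised labels: rotate a known-upright image by k * 90 degrees;
    # the label is simply k. A classifier trained on these pairs learns to
    # recognize "how rotated" an arbitrary photo is.
    return [(np.rot90(upright, k), k) for k in range(4)]

def fix_orientation(img, predicted_k):
    # Undo the predicted rotation to restore the upright image.
    return np.rot90(img, -predicted_k)

# Sanity check with a tiny stand-in "image":
upright = np.arange(12).reshape(3, 4)
for rotated, k in make_rotation_examples(upright):
    assert np.array_equal(fix_orientation(rotated, k), upright)
```

The hard part, of course, is the classifier in the middle, which is where the off-the-shelf project apparently fell short.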


My favorite is asking it to label images with words that contain n and m. A cursive n looks like a non-cursive m. And so if you ask it to label something “drumming” it will use fragments of a cursive n to make a non-cursive n or even use an m instead. Stupid robots.


Off by one MOD one errors. Classic TRUE|FALSE confusion.


Or they simply don’t have that information. OpenAI models have traditionally done badly on placement because the encoding of the image doesn’t capture it very well. Gemini does better, as it seems to be passed pre-segmented images with bounding-box info.

It’s similar to the counting letters problem - they’re not seeing the same thing you are.

On a simple practical level, it’s irrelevant whether your problem goes unsolved because the model can’t understand or because the image encoding is useless. But for understanding what the models could be capable of, it’s a poor test - like asking how well I can play chess, then saying I’m bad at it after watching me play by feel in thick gloves.


How does that apply in any way to this example?


Imagine being asked to draw what the OP said, but you couldn’t see what you’d drawn - only a description that said “a man and a woman in a Honda.”

Asked to draw a new picture with the history of :

Draw a picture of a man in the driver seat and a woman in the passenger seat.

(Picture of a man and a woman in a car)

No, the man in the driver's seat!

——

How well do you think even a very intelligent model could draw the next picture? It failed the first time, and the descriptions mean it has no idea what it even drew before.


Coding agents have had good success with this. Feeding the errors back lets the model figure out how to fix them; it's able to do more with this iterative approach than without.
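A minimal sketch of that loop, with stand-ins for both the model call and the checks - none of this is a real agent API, and the "fix" here just patches a known typo to keep the example self-contained:

```python
def run_checks(code):
    # Stand-in for a real test/lint step: here, just try to compile.
    try:
        compile(code, "<candidate>", "exec")
        return None
    except SyntaxError as e:
        return str(e)

def ask_model_to_fix(code, error):
    # Stand-in for a real LLM call; a real agent would send the code
    # and the error text back to the model and get a revised version.
    return code.replace("retun", "return")

def repair_loop(code, max_rounds=3):
    for _ in range(max_rounds):
        error = run_checks(code)
        if error is None:
            return code                       # checks pass: done
        code = ask_model_to_fix(code, error)  # feed the error back
    return code

fixed = repair_loop("def f():\n    retun 1\n")
assert run_checks(fixed) is None
```

The point is that the check provides the "sight" the model itself lacks: without an external signal, it can only guess whether its last attempt worked.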


But fundamentally it requires that it can actually see the thing it’s trying to fix. Lots of these models can essentially barely see.


I think it applies. Presumably training data is enough to put humans in the front seats in a car, but lacks info on which seat is the driver's seat, or which person was the driver. Maybe I should have tried "steering wheel".


> LLMs just don't have any intelligence.

The believers will go to any lengths of contorted “reasoning” to tell you that this is clearly wrong. Just take this comment thread for one representative of countless examples: https://news.ycombinator.com/item?id=44912646


I noticed it explicitly requested an image of you to add to the generated Civic image, but when provided one it ran up against its guardrails and refused. When provoked into explaining the sudden refusal, it produced an explanation I couldn't make it all the way through.

Full of sound and fury, signifying nothing. When taking a step back and looking at the conversation leading up to that, it looks just as empty.

Maybe my bullshit detector is especially sensitive, but I can't stand any of these LLM chat conversations.


I'll confess, though... I chuckled at "Queen of Neptune" and "Professor Rockdust". But then again I think Mad Libs is hilarious.


Yes. It's disturbing to interact with such a confident bullshit generator, especially when the very concept of truth seems to be under attack from all sides today.


Grab a classroom of children and ask them all to draw a nine-pointed star. EVERY SINGLE child, irrespective of their artistic proficiency, will have zero issues.

Those children also didn't need millions of training samples of stars with nine points. They didn't need to run in a REPL, look at the picture, say, "Oh darn the luck, it seems I've drawn a star with 8 points. I apologize, you're absolutely right, let me try again!", and lock themselves in a continuous feedback loop until they got it right, either. That, incidentally, is a script I put together to help improve the prompt adherence of even SOTA models like Imagen4 and gpt-image-1. (Painfully slow and expensive.)
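For reference, the target such a checker verifies is trivial to specify programmatically - an n-pointed star is just 2n vertices alternating between an outer and an inner radius. A small sketch (the function name and radii are mine):

```python
from math import cos, sin, pi

def star_points(n=9, r_outer=1.0, r_inner=0.4):
    # Vertices of an n-pointed star: walk around the circle in steps of
    # pi/n, alternating between the outer radius (the points) and the
    # inner radius (the notches). 2*n vertices total, n of them points.
    pts = []
    for i in range(2 * n):
        r = r_outer if i % 2 == 0 else r_inner
        angle = pi * i / n - pi / 2   # start with a point facing up
        pts.append((r * cos(angle), r * sin(angle)))
    return pts

pts = star_points(9)
assert len(pts) == 18                  # 9 outer + 9 inner vertices
outer = pts[::2]
assert len(outer) == 9                 # exactly nine points
```

Which is exactly why the failure is striking: the spec fits in a dozen lines, yet the image models need an external counting loop to hit it reliably.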


Lots of kids will get this wrong, I don’t know what age you’re thinking of here. They need years of direct coaching to get to words, what stars are, how to hold and move a pen, how to count…

Comparing physical drawing to these models is frankly daft for an intelligence test. This is a “count the letters” in image form.


As a parent of a 4 year old in preschool, this is obviously wrong.


I appreciate the sentiment, but I don’t know if this is the best example. I’ve seen adults struggle with drawing stars.



