This is a real problem, and AI is a new vector for it, but the root cause is the lack of reliable trust and security around packages in general.
I really wonder what the solution is.
Has there been any work on limiting the permissions of modules? E.g. by default a third-party module can't access disk or network or various system calls or shell functions or use tools like Python's "inspect" to access data outside what is passed to them? Unless you explicitly pass permissions in your import statement or something?
Components can't do any IO or interfere with any other components in an application except through interfaces explicitly given to them. So you could, e.g., have a semi-untrusted image compression component composed with the rest of your app, and not have to worry that it's going to exfiltrate user data.
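To make that concrete, here's a rough sketch of the interface-passing idea in plain Python (all names invented). Python can't actually enforce it, since the component could still import socket or open files on its own, which is exactly why people reach for WASM components or capability-safe languages, but it shows the shape:

    # Rough illustration only: the component is handed narrow capabilities
    # instead of importing I/O facilities itself.
    from typing import Protocol

    class ByteSink(Protocol):
        def write(self, data: bytes) -> None: ...

    def compress_image(raw: bytes, sink: ByteSink) -> None:
        # Semi-untrusted component: the only thing it can do to the outside
        # world is call sink.write().
        compressed = raw[: len(raw) // 2]  # stand-in for a real codec
        sink.write(compressed)

    class FileSink:
        def __init__(self, path: str) -> None:
            self._path = path
        def write(self, data: bytes) -> None:
            with open(self._path, "wb") as f:
                f.write(data)

    # The application decides which capabilities exist at all.
    compress_image(b"...image bytes...", FileSink("compressed.bin"))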
So you refuse to learn from history, because that's basically the UNIX model: you string together simple text-processing programs, and any misbehaving program gets a SIGSEGV without endangering anything else, so you don't have to worry. But it transpired that:
1. splitting functionality in such way is not always possible or effective/performant, not to mention operators in practice tend to find fine grained access control super annoying
2. and more importantly, even if the architecture is working, hostile garbage in your pipeline WILL cause problems with the rest of your app.
It doesn't seem like a stretch that an LLM will very soon be able to configure your dependent web assembly components to permit the dangerous access. It feels like this model of security, while definitely a step in the right direction, won't make a novice vibe coder any more secure.
Java used to have the Java Security Manager, which basically made it possible to set permissions for what a jar/dependency could do. But it's deprecated and there's no real good alternative anymore.
Java could have really nice security if it provided access to the OS API via interfaces, with the main function receiving the interface to the real implementation. It would then have been possible to implement really tight sandboxes. But that ship sailed 30 years ago…
My crank opinion is that we should invest in capability-based security, or an effects system, for code in general, both internal and external. Your external package can't pwn you if you have to explicitly grant it permissions it shouldn't have.
I wonder how you could retrofit something like that onto Go for instance. I've always thought a buried package init function could be devastating. Allow/deny listing syscalls, sockets, files, etc for packages could be interesting.
Most languages have that early init problem. C++ allows global constructors, Java has class statics, Rust can also initialize things globally.
Even C allows library initializers running arbitrary code. That was used to implement the attack against ssh via the malicious xz library.
Disabling globals that are not compile-time constants, or at least ensuring they are never initialized unless the application explicitly calls for it, would nicely address that issue. But language designers think that running arbitrary code before main is a must.
Rust doesn't have static initialisers for complex objects; it has lazy initialisers in the standard library that run when they're first requested, but there's no way to statically initialise any object more complex than a primitive: https://doc.rust-lang.org/reference/items/static-items.html#...
> This is a real problem, and AI is a new vector for it, but the root cause is the lack of reliable trust and security around packages in general.
I agree. And the problem has intensified due to the explosion of dependencies.
> Has there been any work on limiting the permissions of modules?
With respect to PyPI, npm, and the like, and as far as I know: no. But regarding C and generally things you can control relatively easily yourself, see for instance:
I don't think it's a bad idea, but currently packages aren't written with adversarial packages in mind. E.g. requests in Python should have network access, but probably not when it's called from a sandboxed package; and you might be able to trick certain packages into calling functions for you without having your package in the call stack (e.g. via the asyncio event loop or a Thread). I think any serious attempt would get pushback from library authors.
Also it's hard to argue against hard process isolation. Spectre et al. are much easier to defend against at process boundaries. It's probably higher value to make it easier to put submodules into their own sandboxed processes.
> It would be useful to have different levels of restrictions for various modules within a single process, which I don’t think pledge can do.
Sure: the idea could be improved a lot. And then there is the maintenance burden. Here, perhaps a step forward would be if every package author would provide a "pledge" (or whatever you want to call the idea) instead of others trying to figure out what capabilities are needed. Then you could also audit whether a "pledge" holds in reality.
We do have tools, but adoption is sparse. It's still too much hassle.
You can do SLSA, SBOM and package attestation with confirmed provenance.
But as mentioned, it's still some work, though more tools keep popping up.
The downside is when you have a signed, attested package that still turns malicious, just like malware creators were signing stuff with the help of Microsoft.
To build tokenizers that use hashed identifiers rather than identifiers as plain English?
e.g, "NullPointerException" can be a single kanji. Current LLM processes it like "N, "ull", "P", "oint", er", "Excep", "tion". This lets them make up "PullDrawerException", which is only useful outside code.
That kind of creativity is not useful in code, in which identifiers are just friendly names for pointer addresses.
I guess the real question is how much business sense such a solution would make. "S in $buzzword stands for security" kind of thing.
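For what it's worth, here's a toy sketch of the identifier-hashing idea as a pre/post-processing pass around the model rather than a change to the tokenizer itself (everything here is invented for illustration; no real tokenizer works this way):

    import hashlib
    import re

    def encode_identifiers(code: str) -> tuple[str, dict[str, str]]:
        # Replace each identifier with an opaque token like <ID_3f2a1b>.
        # A real version would skip keywords and string contents.
        mapping: dict[str, str] = {}
        def repl(m: re.Match) -> str:
            name = m.group(0)
            token = "<ID_" + hashlib.sha1(name.encode()).hexdigest()[:6] + ">"
            mapping[token] = name
            return token
        return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", repl, code), mapping

    def decode_identifiers(text: str, mapping: dict[str, str]) -> str:
        for token, name in mapping.items():
            text = text.replace(token, name)
        return text

    encoded, table = encode_identifiers("raise NullPointerException(msg)")
    print(encoded)                               # opaque tokens instead of names
    print(decode_identifiers(encoded, table))    # round-trips back to the original

Whether that would actually help is an open question: you'd be trading away the model's (usually useful) intuition about what a name means in exchange for it not being able to invent plausible-sounding near-misses.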
You could have two different packages in a build doing similar things -- one uses less memory but is slower to compute than the other -- so they're used selectively by scenario, based on previous experience in production
If someone unfamiliar with the build makes a change and the assistant swaps the package used in the change -- which goes unnoticed as the package itself is already visible and the naming is only slightly different, it's easy to see how surprises can happen
(I've seen o3 do this every time the prompt was re-run in this situation)
In Smolagents you can provide which packages are permitted. Maybe that's a shortcut to enforce this? I can't imagine that in a professional development house it's truly an n x m over all possible libraries.
IME it is rarely productive to ask an LLM to fix code it has just given you as part of the same session context. It can work but I find that the second version often introduces at least as many errors as it fixes, or at least changes unrelated bits of code for no apparent reason.
Therefore I tend to work on a one-shot prompt, and restart the session entirely each time, making tweaks to the prompt based on each output hoping to get a better result (I've found it helpful to point out the AI's past errors as "common mistakes to be avoided").
Doing the prompting in this way also vastly reduces the context size sent with individual requests (asking it to fix something it just made in conversation tends to resubmit a huge chunk of context and use up allowance quotas). Then, if there are bits the AI never quite got correct, I'll go in bit by bit and ask it to fix an individual function or two, with a new session and heavily pruned context.
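Mechanically, that workflow can be as simple as the sketch below: every attempt is a brand-new conversation, and lessons from earlier attempts get folded into the prompt rather than into a growing chat history. (This assumes the OpenAI Python client; the model name and prompt text are just placeholders.)

    from openai import OpenAI

    client = OpenAI()

    base_prompt = "Write a Python function that parses ISO 8601 durations."
    common_mistakes: list[str] = []   # filled in by hand after reviewing each attempt

    def attempt() -> str:
        prompt = base_prompt
        if common_mistakes:
            prompt += "\n\nCommon mistakes to avoid:\n" + "\n".join(
                f"- {m}" for m in common_mistakes
            )
        # Fresh, single-message session every time: no accumulated context.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    code = attempt()
    # If the output is wrong, append a note to common_mistakes and call
    # attempt() again -- a new session, not a follow-up message.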
I agree with this; you will almost always get better results by simply undoing and rewording your prompt vs trying to coerce it to fix something it already did.
Most of the time when I do use it, I almost always use just a couple prompts before starting a completely new one because it just falls off a cliff in terms of reliability after the first couple messages. At that point you're better off fixing it yourself than trying to get it to do it a way you'll accept.
this is also what I started doing, sometimes it will give an actual correct answer but it's usually easier to just start a new session. I can even ask the exact same question and get a correct answer with a new session
I find it's only really useful in terms of writing entire features if you're building something fairly simple, on top of using the most well known frameworks and libraries.
If you happen to like using less popular frameworks, libraries, packages etc it's like fighting an uphill battle because it will constantly try to inject what it interprets as the most common way to do things.
I do find it useful for smaller parts of features or writing things like small utilities or things at a scale where it's easy to manage/track where it's going and intervene
But full on vibe coding auto accept everything is madness whenever I see it.
Same thing happens to me. The LLM will make up some reasonable-sounding answer, I correct it three, four, five times, and then it circles back to the original answer... which is still just as wrong.
Either they don't retain previous information, or they are so desperate to give you any answer that they'd prefer the wrong answer. Why is it that an LLM can't go: Yeah, I don't know.
as ever, any task that has any sort of safety or security critical risks should never be left to a “magic black box”.
human input/review/verification/validation is always required. verify the untrusted output of these systems. don’t believe the hype and don’t blindly trust them.
—
i did find the fact that google search’s assistant just parroted the crafted/fake READMEs thing particularly concerning - propagating false confidence/misplaced trust - although it’s not at all surprising given the current state of things.
genuinely feel like “classic search” and “new-fangled LLM queries” need to be split out and separated for low-level/power user vs high-level/casual questions.
at least with classic search i’m usually finding a github repo fairly quickly that i can start reading through, as an example.
at the same time, i could totally see myself scanning through a README and going “yep, sounds like what i need” and making the same mistake (i need other people checking my work too).
> any task that has any sort of safety or security critical risks should never be left to a “magic black box”.
> human input/review/verification/validation is always required.
but, are humans not also a magic black box? We don't know what's going on in other people's heads, and while you can communicate with a human and tell them to do something, they are prone to misunderstanding, not listening, or lying. (which is quite similar to how LLMs behave!)
> at the same time, i could totally see myself scanning through a README and going “yep, sounds like what i need” and making the same mistake (i need other people checking my work too).
yes, us humans have similar issues to the magic black box. i’m not arguing humans are perfect.
this is why we have human code review, tests, staging environments etc. in the release cycle. especially so in safety/security critical contexts. plus warnings from things like register articles/CVEs to keep track of.
like i said. don’t blindly trust the untrusted output (code) of these things — always verify it. like making sure your dependencies aren’t actually crypto miners. we should be doing that normally. but some people still seem to believe the hype about these “magic black box oracles”.
the whole “agentic”/mcp/vibe-coding pattern sounds completely fucking nightmare-ish to me as it reeks of “blindly trust everything LLM throws at you despite what we’ve learned in the last 20 years of software development”.
Sounds like we just need to treat LLMs and humans similarly: accept they are fallible, put review processes in place when it matters if they fail, increase stringency of review as stakes increase.
Vibe coding is all about deciding it doesn’t matter if the implementation is perfect. And that’s true for some things!
> Vibe coding is all about deciding it doesn’t matter if the implementation is perfect. And that’s true for some things!
i was going to say, sure yeah i’m currently building a portfolio/personal website for myself in react/ts, purely for interview showing off etc. probably a good candidate for “vibe coding”, right? here’s the problem - which is explicitly discussed in the article - vibe coding this thing can bring in a bunch of horrible dependencies that do nefarious things.
so i’d be sitting in an interview showing off a few bits and pieces and suddenly their CPU usage spikes at 100% util over all cores because my vibe-coded personal site has a crypto miner package installed and i never noticed. maybe it does some data exfiltration as well just for shits and giggles. or maybe it does <insert some really dark thing here>.
“safety and security critical” applies in way more situations than people think it does within software engineering. so many mundane/boring/vibe-it-out-the-way things we do as software engineers have implicit security considerations to bear in mind (do i install package A or package B?). which is why i find the entire concept of “vibe-coding” to be nightmarish - it treats everything as a secondary consideration to convenience and laziness, including basic and boring security practices like “don’t just randomly install shit”.
It's true, hallucinations in LLM's can be so consistent that I warn the LLM up front about stuff like "do not use NSCacheDefault, it does not exist, and there is no default value" and then keep my fingers crossed it doesn't find a roundabout way to introduce it anyway.
Can't really remember what it was exactly anymore, something in Apple's Vision libraries that just kept popping up if I didn't explicitly say not to use it.
This was my chief critique when my company forced us to use their AI tooling. I was trying to stitch together our CMDB, two different VMware products, and the corporate technology directory into a form of product tenancy for our customers. At one point I was trying to move and transform data from our CRM into mongoDB, and figured "eh, let's knock these mandatory agent queries out of the way by asking the chatbot to help." I wrote a few prompts to try and explain what I was trying to accomplish (context), and how I'd like it done (instruction).
The bot hallucinated a non-existent mongoDB Powershell cmdlet, complete with documentation on how it works, and then spat out a "solution" to the problem I asked. Every time I reworked the prompt, cut it up into smaller chunks, narrowed the scope of the problem, whatever I tried, the chatbot kept flatly hallucinating non-existent cmdlets, Python packages, or CLI commands, sometimes even providing (non-working) "solutions" in languages I didn't explicitly ask for (such as bash scripting instead of Powershell).
This was at a large technology company, no less, one that's "all-in" on AI.
If you're staying in a very narrow line with a singular language throughout and not calling custom packages, cmdlets, or libraries, then I suspect these things look and feel quite magical. Once you start doing actual work, they're complete jokes in my experience.
If the LLM is "making up" APIs that don't exists, I'm guessing they've been introduced as the model tried to generalize from the training set, as that's the basic idea? These invented APIs might represent patterns the model identified across many similar libraries, or other texts people have written on the internet, wouldn't that actually be a sort of good library to have available if it wasn't already? Maybe we could use these "hallucinations" in a different way, if we could sort of know better what parts are "hallucination" vs not. Maybe just starting points for ideas if nothing else.
In my experience, what's being made up is an incorrect name for an API that already exists elsewhere. They're especially bad at recommending deprecated methods on APIs.
Back in GPT3 days I put together a toy app that let you ask for a python program, and it hooked __getattr__ so if the LLM generated code called a non-existent function it could use GPT3 to define it dynamically. Ended up with some pretty wild alternate reality python implementations. Nothing useful though.
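For anyone curious, the mechanism for that kind of toy is roughly module-level __getattr__ (PEP 562). A minimal sketch with the LLM call stubbed out (the real version would ask the model to write the missing function):

    # magic_module.py -- any attribute that doesn't exist gets "defined" on demand.

    def _ask_llm_for(name: str) -> str:
        # Stub standing in for the GPT call: pretend the model returned source code.
        return f"def {name}(*args, **kwargs):\n    return ('made up', {name!r}, args)"

    def __getattr__(name: str):
        namespace: dict = {}
        exec(_ask_llm_for(name), namespace)   # define the function on the fly
        globals()[name] = namespace[name]     # cache it so it's only generated once
        return namespace[name]

    # Elsewhere:
    #   import magic_module
    #   magic_module.frobnicate_the_widgets(3)   # springs into existence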
Can’t we just move to having package managers point to a curated list of packages by default, with the option to enable an uncurated one if you know what you’re doing, à la Ubuntu source lists?
At least having good integrated support in the package manager for an allow-list of packages would be good. Then one could maintain such lists in a company or project. And we could have community efforts to develop shared curated lists that could be starting points.
If that really catches on, one could consider designating one of them as a default.
Might also want to support multiple allow lists, so one can add to a standard list in a project (after review). And also deny, so one can remove a few without exiting completely from common lists.
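In the absence of package-manager support, an allow-list can already be enforced as a CI step. A minimal sketch (file names and format are just examples):

    import sys

    def read_names(path: str) -> set[str]:
        # One package per line; strip version pins and comments.
        with open(path) as f:
            return {
                line.split("==")[0].split(">=")[0].strip().lower()
                for line in f
                if line.strip() and not line.lstrip().startswith("#")
            }

    allowed = read_names("allowed-packages.txt")   # the curated/reviewed list
    requested = read_names("requirements.txt")     # what the project asks for
    unapproved = sorted(requested - allowed)

    if unapproved:
        print("Packages not on the allow list:", ", ".join(unapproved))
        sys.exit(1)   # fail the build until someone reviews the new dependency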
Getting a new package from zero into any Linux distribution is close to impossible.
Debian sucks as no one gets on top of reviewing and testing.
"Can we just" is not so simple: there is loads of work to be done to curate packages, and no one is willing to pay for it.
There is so far no model that works where you can have up to date cutting edge stuff reviewed. So you are stuck with 5 year old crap because it was reviewed.
My favorite is when the LLM hallucinates some function or an entire library and you call it out for the mistake. A likely response is "Oh, I'm sorry, You're right. Here's how you would implement function_that_does_not_exist()" and proceeds to write the library it hallucinated in the first place.
It's quirks like these that prove LLMs are a long long way from AGI.
More recent models are producing much higher quality code than models from 6/12/18 months ago. I believe a lot of this is because the AI labs have figured out how to feed them better examples in the training - filtering for higher quality open source code libraries, or loading up on code that passes automated tests.
A lot of model training these days uses synthetic data. Generating good synthetic data for code is a whole lot easier than for any other category, as you can at least ensure the code you're generating is grammatically valid and executes without syntax errors.
The only real solution I see is lint and CI tooling that prevents non-approved packages from getting into your repo. Even with this there is potential for theft on localhost. There are a dozen new YC startups visible in just those two sentences.
Could the AI providers themselves monitor any code snippets and look for non-existent dependencies? They could then ask the LLM to create that package with the necessary interface and implant an exploit in the code. Languages that allow build scripts would be perfect as then the malicious repo only needs to have the interface (so that the IDE doesn't complain) and the build script can download a separate malicious payload to run.
The AI providers already write the code, on the whole crazy promise that humans need not care about or read it. I'm not sure it changes anything at that point to add one weak level of indirection. You are already compromised.
Hi — I’m the security firm CEO mentioned, though I wear a few other hats too: I’ve been maintaining open source projects for over a decade (some with 100s of millions of npm downloads), and I taught Stanford’s web security course (https://cs253.stanford.edu).
Totally understand the skepticism. It’s easy to assume commercial motives are always front and center. But in this case, the company actually came after the problem. I’ve been deep in this space for a long time, and eventually it felt like the best way to make progress was to build something focused on it full-time.
That's not the technical report; it's also just a blog article which links to someone else's paper, and finishes off by promoting something:
"Socket addresses this exact problem. Our platform scans every package in your dependency tree, flags high-risk behaviors like install scripts, obfuscated code, or hidden payloads, and alerts you before damage is done. Even if a hallucinated package gets published and spreads, Socket can stop it from making it into production environments."
I’m not measuring it, but it seems like Copilot suggests fewer imports than it used to. It could be that it has more context to see that I rarely import external packages and follows suit. Or maybe I’m using it subtly differently than I used to.
Because doing so is computationally expensive and would be making false promises.
False positives where it incorrectly flagged a safe package would result in the need for a human review step, which is even more expensive.
False negatives where malware patterns didn't match anything previously would happen all the time, so if people learned to "trust" the scanning they would get caught out - at which point what value is the scanning adding?
I don't know if there are legal liability issues here too, but that would be worth digging into.
As it stands, there are already third parties that are running scans against packages uploaded to npm and PyPI and helping flag malware. Leaving this to third parties feels like a better option to me, personally.
When using AI, you are still the one responsible for the code. If the AI writes code and you don't read every line, why did it make its way into a commit? If you don't understand every line it wrote, what are you doing? If you don't actually love every line it wrote, why didn't you make it rewrite it with some guidance or rewrite it yourself?
The situation described in the article is similar to having junior developers we don't trust committing code and us releasing it to production and blaming the failure on them.
If a junior on the team does something dumb and causes a big failure, I wonder where the senior engineers and managers were during that situation. We closely supervise and direct the work of those people until they've built the skills and ways of thinking needed to be ready for that kind of autonomy. There are reasons we have multiple developers of varying levels of seniority: trust.
We build relationships with people, and that is why we extend them the trust. We don't extend trust to people until they have demonstrated they are worthy of that trust over a period of time. At the heart of relationships is that we talk to each other and listen to each other, grow and learn about each other, are coachable, get onto the same page with each other. Although there are ways to coach llm's and fine tune them, LLM's don't do nearly as good of a job at this kind of growth and trust building as humans do. LLM's are super useful and absolutely should be worked into the engineering workflow, but they don't deserve the kind of trust that some people erroneously give to them.
You still have to care deeply about your software. If this story talked about inexperienced junior engineers messing up codebases, I'd be wondering where the senior engineers and leadership were in allowing that to mess things up. A huge part of engineering is all about building reliable systems out of unreliable components and always has been. To me this story points to process improvement gaps and ways of thinking people need to change more than it points to the weak points of AI.
When I suspect that it will make stuff up, I tell it to cite the docs that contain the functions it used. It causes more global warming, but it works fine.
I have no other gear than polemic on the topic of AI-for-code-generation so ignore this comment if you don’t like that.
I think people in software envy real engineering too much. Software development is what it is. If it does not live up to that bar then so be it. But AI-for-code-generation ("AI" for short now) really drops any kind of pretense. I got into software because it was supposed to be analytic, even kind of a priori. And deterministic. What even is AI right now? It melds the very high tech and probabilistic (AI tech) with the low tech of code generation (which is deterministic by itself but not with AI). That's a regression both in terms of craftsmanship (code generation) and so-called engineering (determinism).

I was looking forward to higher-level software development: more declarative (better programming languages and other things), more tool-assisted (tests, verification), more deterministic and controlled (Nix?), and less process redundancy (e.g. less redundancy across manual/automated testing, verification, review, auditing). Instead we are mining the hard work of the past three decades and spitting out things that carry the mandatory label "this might be anything, verify it yourself". We aren't making higher-level tools; we[1] are making a taller tower with fewer support beams, until the tower reaches so high that the wind can topple it at any moment.
The above was just for AI-for-code-generation. AI could perhaps be used to create genuinely higher level processes. A solid structure with better support. But that’s not the current trajectory/hype.
Usually, when the model hallucinates a dependency, the subject of the hallucination really should exist. I've often thought that was kind of interesting in itself. It can feel like a genuine glimpse of emergent creativity.
Children may invent the world as they do not know it well yet. Adults know that reality is not what you may expect. We need to deal with reality, so...
I'm waiting for the AI apologists to swarm on this post explaining how these are just the results of poorly written prompts, because AI could not make mistakes with proper prompts. Been seeing an increase of this recently on AI-critical content, and it's exhausting.
Sure, with well written prompts you can have some success using AI assistants for things, but also with well-written non-ambiguous prompts you can inexplicably end up with absolute garbage.
Until things become consistent, this sort of generative AI is more akin to a party trick than being able to replace or even supplement junior engineers.
As an "AI apologist", sorry to disappoint but the answer here isn't better prompting: it's code review.
If an LLM spits out code that uses a dependency you aren't familiar with, it's your job to review that dependency before you install it. My lowest effort version of this is to check that it's got a credible commit and release history and evidence that many other people are using it already.
Same as if some stranger opens a PR against your project introducing a new-to-you dependency.
If you don't have the discipline to do good code review, you shouldn't be using AI-assisted programming outside of safe sandbox environments.
(Understanding "safe sandbox environment" is a separate big challenge!)
Haha. That sounds like something Sonnet 3.6 would do, it learned to cheat that way and it's an absolute pain in the ass to make it produce longer outputs.
This is just another reason why dependencies are an anti-pattern. If you do nothing, your software shouldn't change.
I suspect that this style of development became popular in the first place because the LGPL has different copyright implications based on whether code is statically or dynamically linked. Corporations don't want to be forced to GPL their code so a system that outsources libraries to random web sites solves a legal problem for them.
But it creates many worse problems because it involves linking your code to code that you didn't write and don't control. This upstream code can be changed in a breaking way or even turned into malware at any time but using these dependencies means you are trusting that such things won't happen.
Modern dependency based software will never "just work" decades from now like all of that COBOL code from the 1960s that infamously still runs government and bank computer systems on the backend. Which is probably a major reason why they won't just rewrite the COBOL code.
You could say as a counterargument that operating systems often include breaking changes as well. Which is true but you don't update your operating system on a regular basis. And the most popular operating system (Windows) is probably the most popular because Microsoft historically has prioritized backward compatibility even to the extreme point of including special code in Windows 95 to make sure it didn't break popular games like SimCity that relied on OS bugs from Windows 3.1 and MS-DOS[0].
IME when AI "hallucinates" API endpoints or library functions that just aren't there it's almost always the case that they should be. In other words the AI has based it's understanding on the combined knoweledge of hundreds(?) of other APIs and libraries and is geenrating an obvious analogy.
Turning this around: a great use case is to ask AI to review documents, APIs, etc. AI is really great for teasing out your blindspots.
If the training data contains useless endpoints the AI will also hallucinate those useless endpoints.
The wisdom of the crowd only works for the aggregate result, not if you consider every given answer individually; then you get more wrong answers because you fall to the average.
People have made the point many times before, that “hallucination” is mainly what generative AI does and not the exception, but most of the time it’s a useful hallucination.
I describe it as more of a “mashup”, like an interpolation of statistically related output that was in the training data.
The thinking was in the minds of the people that created the tons of content used for training, and from the view of information theory there is enough redundancy in the content to recover much of the intent statistically. But some intent is harder to extract just from example content.
So when generating statistically similar output, the statistical model can miss the hidden rules that were part of the thinking that went into the content that was used for training.
"useful hallucination" so much AI glazing its crazy
"useful hallucination" so much AI glazing its crazy
I'm still a fan of the standard term "lying." Intent, or a lack thereof, doesn't matter. It's still a lie.
Intent does matter if you want to classify things as lies.
If someone told you it's Thursday when it's really Wednesday, we would not necessarily say they lied. We would say they were mistaken, if the intent was to tell you the correct day of the week. If they intended to mislead you, then we would say they lied.
So intent does matter. AI isn't lying, it intends to provide you with accurate information.
The AI doesn't intend anything. It produces, without intent, something that would be called lies if it came from a human. It produces the industrial-scale mass-produced equivalent of lies – it's effectively an automated lying machine.
Maybe we should call the output "synthetic lies" to distinguish it from the natural lies produced by humans?
There is actually an acknowledged term of art for this: "bullshit".
Summary from Wikipedia: https://en.m.wikipedia.org/wiki/Bullshit
> statements produced without particular concern for truth, clarity, or meaning, distinguishing "bullshit" from a deliberate, manipulative lie intended to subvert the truth
It's a perfect fit for how LLMs treat "truth": they don't know, so they can't care.
I’m imagining your comment read by George Carlin … if only he were still here to play with this. You know he would.
> So intent does matter. AI isn't lying, it intends to provide you with accurate information.
Why are we making excuses for machines?
AI doesn’t have “intent” at all.
If intent doesn't matter, is it still lying when the reality happens to coincide with what the machine says?
Because the OP's name seems way more descriptive and easier to generalize.
So you're saying deliberate deception, mistaken statements and negligent falsehoods should all be considered the same thing, regardless?
Personally, I'd be scared if LLMs were proven to be deliberately deceptive, but I think they currently fall in the two later camps, if we're doing human analogies.
Have you asked your LLMs if they're capable of lying?
Did the answers strike you as deceptive?
“Glazing” is such performative rhetoric, it's hilarious
> the statistical model can miss the hidden rules that were a part of the thinking that went into the content that was used for training.
Makes sense. Hidden rules such as, "recommending a package works only if I know the package actually exists and I’m at least somewhat familiar with it."
Now that I think about it, this is pretty similar to cargo-culting.
LLMs don’t really “know” though. If you look at the recent Anthropic findings, they show that large language models can do math like addition, but they do it in a weird way, and when you ask the model how it arrived at the solution, it provides a method that is completely different from how it actually does it.
That's the point. It's one of the implicit, real-world rules that were underlying the training set.
And cargo-culting is in fact exactly what happens when people act as LLM's.
Our brains killed us if we figured things out wrong. We’d get eaten. We learned to get things right enough, and how to be pretty right, fast, even when we didn’t know the new context (plants, animals, snow, lava).
LLM’s are just so happy to generate enough tokens that look right ish. They need so many examples driven into them during training.
The map is not the territory, and we’re training them on the map of our codified outputs. They don’t actually have to survive. They’re pretty amazing, but of course they’re absolutely not doing what we do, because success for us and for them looks so different. We need to survive.
(Please can we not have one that really wants to survive.)
There is an interesting phenomenon with polynomial interpolation called Runge Spikes. I think "Runge Spikes" offers a better metaphor than "hallucination" and argue the point: https://news.ycombinator.com/item?id=43612517
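For anyone who hasn't run into it: Runge's phenomenon is what happens when you fit a high-degree polynomial through equally spaced samples of a smooth function; the fit looks great in the middle and swings wildly near the edges. A quick numerical illustration (requires numpy):

    import numpy as np

    # The classic example: f(x) = 1 / (1 + 25 x^2) on [-1, 1],
    # interpolated by a degree-12 polynomial through 13 equally spaced points.
    f = lambda x: 1.0 / (1.0 + 25.0 * x**2)
    xs = np.linspace(-1, 1, 13)
    coeffs = np.polyfit(xs, f(xs), deg=12)

    dense = np.linspace(-1, 1, 1001)
    errors = np.abs(np.polyval(coeffs, dense) - f(dense))

    print("error at x = 0.0:", abs(np.polyval(coeffs, 0.0) - f(0.0)))  # ~0, a sample point
    print("worst error on [-1, 1]:", errors.max())                     # large, near the edges

Confidently interpolating where the data pins things down, and going badly wrong just outside that region, is not a bad mental model for a made-up API name.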
if it catches on, everyone will start applying it on humans. "he's got runge spikes". you can't win against anthropomorphization
That's just overfitting. Using too flexible a model without regularization.
That is interesting indeed, thanks!
> People have made the point many times before, that “hallucination” is mainly what generative AI does and not the exception, but most of the time it’s a useful hallucination.
Oh yeah that's exactly what I want from a machine intelligence, a "best friend who knows everything about me," is that they just make shit up that they think I'd like to hear. I'd really love a personal assistant that gets me and my date a reservation at a restaurant that doesn't exist. That'll really spice up the evening.
The mental gymnastics involved in the AI community are truly pushing the boundaries of parody at this point. If your machines mainly generate bullshit, they cannot be serious products. If on the other hand they're intelligent, why do they make up so much shit? You just can't have this both ways and expect to be taken seriously.
One of the main reasons LLMs are unintuitive and difficult to use is that you have to learn how to get useful results out of fundamentally unreliable technology.
Once you figure out how to do that they're absurdly useful.
Maybe a good analogy here is working with animals? Guide dogs, sniffer dogs, falconry... all cases where you can get great results but you have to learn how to work with a very unpredictable partner.
> Once you figure out how to do that they're absurdly useful
I have read some posts of yours advancing that but I never met those with the details: do you mean more "prompt engineering", or "application selection", or "system integration"...?
Typing code faster. Building quick illustrative prototypes. Researching options for libraries (that are old and stable enough to be in the training data). Porting code from one language to another (surprisingly) [1]. Using as a thesaurus. Answering questions about code (like piping in a whole codebase and asking about it) [2]. Writing an initial set of unit tests. Finding the most interesting new ideas in a paper or online discussion thread without reading the whole thing. Building one-off tools for converting data. Writing complex SQL queries. Finding potential causes of difficult bugs. [3]
[1] I built https://tools.simonwillison.net/hacker-news-thread-export this morning from my phone using that trick: https://claude.ai/share/7d0de887-5ff8-4b8c-90b1-b5d4d4ca9b84
[2] Examples of that here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#b...
[3] https://simonwillison.net/2024/Sep/25/o1-preview-llm/ is an early example of using a "reasoning" model for that
Or if you meant "what do you have to figure out to use them effectively despite their flaws?", that's a huge topic. It's mostly about building a deep intuition for what they can and cannot help with, then figuring out how to prompt them (including managing their context of inputs) to get good results. The most I've written about that is probably this piece: https://simonwillison.net/2025/Mar/11/using-llms-for-code/
All of that is very interesting. Side note: don't you agree that "answering about documentation with 100% reliability" would be a more than desirable further feature? (Think of those options in the shell commands which can be so confusing they made it to xkcd material.) But that would mean achieving production-level RAG; and that in turn would be a revolution in LLMs, which would revise your list above...
LLMs can never provide 100% reliability - there's a random number generator in the mix after all (reflected in the "temperature" setting).
For documentation answering the newer long context models are wildly effective in my experience. You can dump a million tokens (easily a full codebase or two for most projects) into Gemini 2.5 Pro and get great answers to almost anything.
There are some new anonymous preview models with 1m token limits floating around right now which I suspect may be upcoming OpenAI models. https://openrouter.ai/openrouter/optimus-alpha
I actually use LLMs for command line arguments for tools like ffmpeg all the time, I built a plugin for that: https://simonwillison.net/2024/Mar/26/llm-cmd/
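On the temperature point above: a minimal, standard-library-only sketch of what sampling with a temperature does (the candidate tokens and their scores are invented). It's one concrete reason the same prompt can produce different output on different runs:

    import math
    import random

    def sample(logits: dict[str, float], temperature: float) -> str:
        # Softmax over (score / temperature), then draw one token at random.
        top = max(logits.values())
        weights = [math.exp((s - top) / temperature) for s in logits.values()]
        return random.choices(list(logits), weights=weights, k=1)[0]

    # Made-up scores for four candidate "next tokens".
    logits = {"requests": 2.1, "httpx": 1.9, "urllib3": 1.2, "requestz": 0.4}
    print([sample(logits, temperature=0.8) for _ in range(10)])

    # Greedy decoding (temperature -> 0) would always pick "requests";
    # with sampling, lower-probability candidates show up some of the time.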
> One of the main reason LLMs are unintuitive and difficult to use is that you have to learn how to get useful results out of fundamentally unreliable technology.
Name literally any other technology that works this way.
> Guide dogs, sniffer dogs, falconry...
Guide dogs are an imperfect solution to an actual problem: the inability for people to see. And dogs respond to training far more reliably than LLMs respond to prompts.
Sniffer dogs are at least in part bullshit and have been shown in many studies to respond to the subtle cues of their handlers far more reliably than anything they actually smell. And the best part of them is they also (completely outside their own control, mind you) ruin lives when falsely detecting drugs on cars that look the way the officer handling them thinks means they have drugs inside.
And falconry is a hobby.
"Name literally any other technology that works this way"
Since you don't like my animal examples, how about power tools? Chainsaws, table saws, lathes... all examples of tools where you have to learn how to use them before they'll be useful to you.
(My inability to come up with an analogy you find convincing shouldn't invalidate my claim that "LLMs are unreliable technology that is still useful if you learn how to work with it" - maybe this is the first time that's ever been true for an unreliable technology, though I find that doubtful.)
The correct name for unreliable power tools is "trash".
which happens to be the correct name for A"I" too
> Name literally any other technology that works this way.
The internet for one.
Not the internet itself (although it certainly can be unreliable), but rather the information on it.
Which I think is more relevant to the argument anyway, as LLM’s do in fact reliably function exactly the way they were built to.
Information on the internet is inherently unreliable. It’s only when you consider externalities (like reputation of source) that its information can then be made “reliable”.
Information that comes out of LLM’s is inherently unreliable. It’s only through externalities (such as online research) that its information can be made reliable.
Unless you can invent a truth machine that somehow can tell truth from fiction, I don’t see either of these things becoming reliable, stand-alone sources of information.
> Name literally any other technology that works this way.
How about people? They make mistakes all the time, disobey instructions, don’t show up to work, occasionally attempt to embezzle or sabotage their employers. Yet we manage to build huge successful companies out of them.
People who solely code and are not good software architects will try and fail to delegate coding to LLM.
What we are doing in practice when delegating coding to LLMs is climbing up the abstraction level ladder.
We can compensate for bad software architecture when we understand the code details deeply and make indirect couplings in the code. When we don't understand the code deeply, we need to compensate with good architecture.
That means thinking about code in terms of interfaces, stores, procedures, behaviours, actors, permissions and competences (what the actors should do, how they should behave and the scope of action they should be limited to).
Then these details should reflect directly in the prompts. See how hard it is to make this process agentic, because you need user input in the agent's inner workings.
And after running these prompts and, with luck, successfully extracting functioning components, you are the one who has to put these components together to make a working system.
"What we are doing in practice when delegating coding to LLMs is climbing up the abstraction level ladder."
Except that ladder is built on hallucinated rungs. Coding can be delegated to humans. Coding cannot be delegated to AI, LLM or ML because they are not real nor are they reliable.
I still think the main cause of hallucination is bad AI wrapper tools. The AI should have every available public API, with documentation, preloaded in the context. And explicit instructions to avoid using any API not mentioned in the context.
An LLM is like a developer with no internet or docs access who has to write code on paper. Every developer would hallucinate in that environment. It's a miracle that an LLM does so much in such a limited environment.
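A rough sketch of what a less hallucination-prone wrapper could do. The doc dict here is hypothetical; in practice you'd build it by introspecting the packages that are actually installed:

    def build_context(api_docs: dict[str, str], task: str) -> str:
        """Assemble a prompt that preloads the allowed APIs and forbids anything else.

        api_docs maps fully-qualified function names to short doc strings,
        e.g. extracted ahead of time from the project's installed packages.
        """
        doc_block = "\n".join(f"- {name}: {doc}" for name, doc in api_docs.items())
        return (
            "You may ONLY call the functions listed below. "
            "If the task cannot be done with them, say so instead of inventing an API.\n\n"
            f"Available APIs:\n{doc_block}\n\nTask: {task}"
        )

    # Usage: the docs come from real, installed packages, not from the model's memory.
    prompt = build_context(
        {"requests.get": "requests.get(url, **kwargs) -> Response"},
        "Download a JSON document and print its keys",
    )
    print(prompt)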
It’s not a miracle, it’s statistics. Once you understand it’s a clever lossy text compression technique, you can see why it appears to do well with boilerplate (CRUD)/common interview coding questions. Any code request requiring any kind of nuance will return the equivalent of the first answer to a Stack Overflow question. Aka kinda maybe in the ballpark, but incorrect.
I was using an LLM to help me with a PoC. I wanted to access an API that required OTP via email. I asked (I believe) Claude for an initial implementation of the Gmail interfacing, and it worked the first time. That showcases how you can use LLMs in day-to-day activities, in prototyping and synthesizing first versions of small components.
That's way more advanced than coding interview questions whose solutions could simply have been added to the dataset.
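For what it's worth, the working code was roughly in this spirit (a minimal sketch assuming IMAP access with an app password; the sender address and the 6-digit OTP pattern are placeholders):

    import email
    import imaplib
    import re

    def fetch_latest_otp(user: str, app_password: str, sender: str) -> str | None:
        """Pull the most recent 6-digit code from mail sent by `sender`."""
        mail = imaplib.IMAP4_SSL("imap.gmail.com")
        mail.login(user, app_password)
        mail.select("INBOX")
        _, data = mail.search(None, f'(FROM "{sender}")')
        ids = data[0].split()
        if not ids:
            return None
        _, msg_data = mail.fetch(ids[-1], "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        # Take the first part of a multipart message, or the message body itself.
        part = msg.get_payload(0) if msg.is_multipart() else msg
        body = part.get_payload(decode=True).decode(errors="ignore")
        mail.logout()
        match = re.search(r"\b(\d{6})\b", body)
        return match.group(1) if match else None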
You need first to believe there is value in adding AI to your workflow. Then you need to search and find ways to have it add value to you. But you are ultimately the one that understands what value really is and who has to put effort into making AI valuable.
Vim won't make you a better developer just as much as LLMs won't code for you. But they can both be invaluable if you know how to wield them.
“You need to believe” pretty much says it all. Your example isn’t convincing because there will be only one correct answer with little variation (the API in question).
I’m sure you’re finding some use for it.
I can’t wait for when the LLM providers start including ads in the answers to help pay back all that VC money currently being burned.
Both Facebook and Google won by being patient before including ads. MySpace and Yahoo both were riddled with ads early and lost. It will be interesting to see who blinks first. My money is on Microsoft, who added ads to Solitaire of all things.
If you don't believe computers have value, you will default to writing on paper. That's what I meant by it. You first need to believe there is something of value to be had before exploring; otherwise you are just aimlessly shooting and seeing what sticks. Maybe that gives you a better understanding of what I meant.
Having LLMs replace developers for lower-level code can be a goal, but it isn't the only one.
You can use AI to assist you with lower level coding, maybe coming up with multiple prototypes for a given component, maybe quickly refactoring some interfaces and see if they fit your mental model better.
But if you want AI to make your life easier I think you will have a hard time. AI should be just another tool in your toolbelt to make you more productive when implementing stuff.
So my question is, why do you expect LLMs to be 100% accurate to have any value? Shouldn't developers do their work and integrate LLMs to speed up some steps in coding process, but still taking ownership of the process?
Remember, there is no free lunch.
> What we are doing in practice when delegating coding to LLMs is climbing up the abstraction level ladder.
100%. I like to say that we went from building a Millennium Falcon out of individual LEGO pieces, to instead building an entire LEGO planet made of Falcon-like objects. We’re still building, the pieces are just larger :)
> What we are doing in practice when delegating coding to LLMs is climbing up the abstraction level ladder.
You’re not abstracting if you are generating code that you have to verify/fret about. You’re at exactly the same level as before.
Garbage collection is an abstraction. AI-generated C code that uses manual memory management isn’t.
This is a real problem, and AI is a new vector for it, but the root cause is the lack of reliable trust and security around packages in general.
I really wonder what the solution is.
Has there been any work on limiting the permissions of modules? E.g. by default a third-party module can't access disk or network or various system calls or shell functions or use tools like Python's "inspect" to access data outside what is passed to them? Unless you explicitly pass permissions in your import statement or something?
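There's nothing like that built into pip or npm today, but Python's audit hooks can fake a weak version of the idea. A sketch (sys.addaudithook is real; the blocklist and the stack-walking glue are hypothetical, and this is nowhere near a hardened sandbox):

    import sys

    # Packages we never want opening sockets or spawning processes (hypothetical names).
    UNTRUSTED = {"leftpadx", "totally_legit_utils"}

    def _calling_packages() -> set[str]:
        """Collect the top-level package name of every frame on the current stack."""
        pkgs = set()
        frame = sys._getframe()
        while frame is not None:
            mod = frame.f_globals.get("__name__", "")
            pkgs.add(mod.split(".")[0])
            frame = frame.f_back
        return pkgs

    def audit(event: str, args) -> None:
        # Abort network and subprocess operations if an untrusted package is on the stack.
        if event in ("socket.connect", "subprocess.Popen", "os.system"):
            if UNTRUSTED & _calling_packages():
                raise RuntimeError(f"{event} blocked for untrusted package")

    sys.addaudithook(audit)

It's easy to defeat, which is rather the point the replies below make, but it shows the rough shape of per-module permissions.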
You may be interested in WebAssembly Components: https://component-model.bytecodealliance.org/.
Components can't do any IO or interfere with any other components in an application except through interfaces explicitly given to them. So you could, e.g., have a semi-untrusted image compression component composed with the rest of your app, and not have to worry that it's going to exfiltrate user data.
So you refuse to learn from history, because that's basically the UNIX model: you string together simple text-processing programs, and any misbehaving program gets a SIGSEGV without endangering anything else, so you don't have to worry. But it transpired that:
1. splitting functionality in such a way is not always possible or effective/performant, not to mention operators in practice tend to find fine-grained access control super annoying
2. and more importantly, even if the architecture is working, hostile garbage in your pipeline WILL cause problems with the rest of your app.
It doesn't seem like a stretch that an LLM will very soon be able to configure your dependent web assembly components to permit the dangerous access. It feels like this model of security, while definitely a step in the right direction, won't make a novice vibe coder any more secure.
It seems like it would be rare, though.
An LLM might hallucinate the wrong permissions, but they're going to be plausible guesses.
It's extremely unlikely to hallucinate full network access for a module that has nothing to do with networking.
Java used to have the Java Security Manager, which basically made it possible to set permissions for what a jar/dependency could do. But it's deprecated, and there's no real good alternative anymore.
Java could have really nice security if it provided access to the OS API via interfaces, with the main function receiving the interface to the real implementation. It would be possible then to implement really tight sandboxes. But that ship sailed 30 years ago…
My crank opinion is that we should invest in capability-based security, or an effects system, for code in general, both internal and external. Your external package can't pwn you if you have to explicitly grant it permissions it shouldn't have.
I wonder how you could retrofit something like that onto Go for instance. I've always thought a buried package init function could be devastating. Allow/deny listing syscalls, sockets, files, etc for packages could be interesting.
Most languages have that early-init problem. C++ allows global constructors, Java has class statics, Rust can also initialize things globally.
Even C allows library initializers running arbitrary code. It was used to implement that attack against ssh via malicious xz library.
Disallowing globals that are not compile-time constants, or at least never initializing them unless the application explicitly asks for it, would nicely address that issue. But language designers think that running arbitrary code before main is a must.
Rust doesn't have static initialisers for complex objects; it has lazy initialisers in the standard library that run when they're first requested, but there's no way to statically initialise any object more complex than a primitive: https://doc.rust-lang.org/reference/items/static-items.html#...
Thanks, I stand corrected. Rust does not allow globals to be initialized with arbitrary code running before main, even with unsafe.
One more point to consider Rust over C++.
> This is a real problem, and AI is a new vector for it, but the root cause is the lack of reliable trust and security around packages in general.
I agree. And the problem has intensified due to the explosion of dependencies.
> Has there been any work on limiting the permissions of modules?
With respect to PyPI, npm, and the like, and as far as I know: no. But regarding C and generally things you can control relatively easily yourself, see for instance:
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
It would be useful to have different levels of restrictions for various modules within a single process, which I don’t think pledge can do.
I don't think it's a bad idea, but currently packages aren't written with adversarial packages in mind. E.g. requests in Python should have network access, but probably not if it's called from a sandboxed package, and you might be able to trick certain packages into calling functions for you without having your package in the call stack (e.g. via an asyncio event loop or a Thread). I think any serious attempt would get pushback from library authors.
Also, it's hard to argue against hard process isolation. Spectre and friends are much easier to defend against at process boundaries. It's probably higher value to make it easier to put submodules into their own sandboxed processes.
> It would be useful to have different levels of restrictions for various modules within a single process, which I don’t think pledge can do.
Sure: the idea could be improved a lot. And then there is the maintenance burden. Here, perhaps a step forward would be if every package author would provide a "pledge" (or whatever you want to call the idea) instead of others trying to figure out what capabilities are needed. Then you could also audit whether a "pledge" holds in reality.
We do have tools, but adoption is sparse. It's still too much hassle.
You can do SLSA, SBOM and package attestation with confirmed provenance.
But as mentioned it still is some work but more tools pop up.
The downside is when you have a signed, attested package that still turns malicious, just like malware creators who got their stuff signed with Microsoft's help.
To build tokenizers that use hashed identifiers rather than identifiers as plain English?
e.g, "NullPointerException" can be a single kanji. Current LLM processes it like "N, "ull", "P", "oint", er", "Excep", "tion". This lets them make up "PullDrawerException", which is only useful outside code.
That kind of creativity is not useful in code, in which identifiers are just friendly names for pointer addresses.
I guess the real question is how much business sense such a solution would make. "The S in $buzzword stands for security" kind of thing.
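A toy sketch of the idea as a pre/post-processing pass rather than an actual tokenizer change (the identifier list and the private-use-codepoint mapping are made up for illustration):

    import re

    # Hypothetical mapping: each known identifier becomes one opaque symbol
    # from a private-use Unicode range, so the model can't recombine its pieces.
    IDENTIFIERS = ["NullPointerException", "IllegalArgumentException"]
    TO_SYMBOL = {name: chr(0xE000 + i) for i, name in enumerate(IDENTIFIERS)}
    FROM_SYMBOL = {v: k for k, v in TO_SYMBOL.items()}

    def encode_identifiers(text: str) -> str:
        pattern = "|".join(map(re.escape, IDENTIFIERS))
        return re.sub(pattern, lambda m: TO_SYMBOL[m.group(0)], text)

    def decode_identifiers(text: str) -> str:
        return "".join(FROM_SYMBOL.get(ch, ch) for ch in text)

    original = "catch (NullPointerException e) { ... }"
    opaque = encode_identifiers(original)
    assert decode_identifiers(opaque) == original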
Why not train an LAM, a Large AST Model?
That will miss comments and documentation.
If that were true of AST implementations then "prettier"-esque tooling wouldn't exist. https://github.com/prettier/prettier/blob/3.5.3/src/main/com...
It's deeper than the security issue
You could have two different packages in a build doing similar things -- one uses less memory but is slower to compute than the other -- so they're used selectively by scenario, based on previous experience in production.
If someone unfamiliar with the build makes a change and the assistant swaps the package used in the change -- which goes unnoticed as the package itself is already visible and the naming is only slightly different, it's easy to see how surprises can happen
(I've seen o3 do this every time the prompt was re-run in this situation)
In Smolagents you can provide which packages are permitted. Maybe that's a shortcut to enforce this? I can't imagine that in a professional development house it's truly an n x m over all possible libraries.
I am constantly correcting the AI code it gives me, and all I get for it is "oh, you're right! here is the corrected code"
then it gives me more hallucinations
correcting the latest hallucination results in it going back to the first hallucination
IME it is rarely productive to ask an LLM to fix code it has just given you as part of the same session context. It can work but I find that the second version often introduces at least as many errors as it fixes, or at least changes unrelated bits of code for no apparent reason.
Therefore I tend to work on a one-shot prompt, and restart the session entirely each time, making tweaks to the prompt based on each output hoping to get a better result (I've found it helpful to point out the AI's past errors as "common mistakes to be avoided").
Doing the prompting in this way also vastly reduces the context size sent with individual requests (asking it to fix something it just made in conversation tends to resubmit a huge chunk of context and use up allowance quotas). Then, if there are bits the AI never quite got correct, I'll go in bit by bit and ask it to fix an individual function or two, with a new session and heavily pruned context.
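In code form, that workflow looks roughly like this; call_llm and review are placeholders for whatever client and checks you actually use, and the point is that each attempt is a fresh session rather than a growing conversation:

    def call_llm(prompt: str) -> str:
        """Placeholder for your actual LLM client call."""
        raise NotImplementedError

    def review(code: str) -> list[str]:
        """Placeholder: run tests, linters, or read the output yourself."""
        return []

    def one_shot_loop(base_prompt: str, attempts: int = 5) -> str:
        known_mistakes: list[str] = []   # accumulated "common mistakes to avoid"
        output = ""
        for _ in range(attempts):
            prompt = base_prompt
            if known_mistakes:
                prompt += "\n\nCommon mistakes to avoid:\n" + "\n".join(
                    f"- {m}" for m in known_mistakes
                )
            output = call_llm(prompt)        # fresh session, small context
            problems = review(output)
            if not problems:
                break
            known_mistakes.extend(problems)  # tweak the prompt, not the session
        return output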
I agree with this; you will almost always get better results by simply undoing and rewording your prompt vs. trying to coerce it to fix something it already did.
Most of the time when I do use it, I almost always use just a couple prompts before starting a completely new one because it just falls off a cliff in terms of reliability after the first couple messages. At that point you're better off fixing it yourself than trying to get it to do it a way you'll accept.
This is also what I started doing. Sometimes it will give an actual correct answer, but it's usually easier to just start a new session. I can even ask the exact same question and get a correct answer with a new session.
I find it's only really useful in terms of writing entire features if you're building something fairly simple, on top of using the most well known frameworks and libraries.
If you happen to like using less popular frameworks, libraries, packages etc it's like fighting an uphill battle because it will constantly try to inject what it interprets as the most common way to do things.
I do find it useful for smaller parts of features or writing things like small utilities or things at a scale where it's easy to manage/track where it's going and intervene
But full-on vibe coding with auto-accept-everything is madness whenever I see it.
Same thing happens to me. The LLM will make up some reasonable-sounding answer, I correct it three, four, five times, and then it circles back to the original answer... which is still just as wrong.
Either they don't retain previous information, or they are so desperate to give you any answer that they'd prefer the wrong answer. Why is it that an LLM can't go: Yeah, I don't know.
I have this same experience. Vibe coding is literally hell.
as ever, any task that has any sort of safety or security critical risks should never be left to a “magic black box”.
human input/review/verification/validation is always required. verify the untrusted output of these systems. don’t believe the hype and don’t blindly trust them.
—
i did find the fact that google search’s assistant just parroted the crafted/fake READMEs thing particularly concerning - propagating false confidence/misplaced trust - although it’s not at all surprising given the current state of things.
genuinely feel like “classic search” and “new-fangled LLM queries” need to be split out and separated for low-level/power user vs high-level/casual questions.
at least with classic search i’m usually finding a github repo fairly quickly that i can start reading through, as an example.
at the same time, i could totally see myself scanning through a README and going “yep, sounds like what i need” and making the same mistake (i need other people checking my work too).
> any task that has any sort of safety or security critical risks should never be left to a “magic black box”.
> human input/review/verification/validation is always required.
but, are humans not also a magic black box? We don't know what's going on in other people's heads, and while you can communicate with a human and tell them to do something, they are prone to misunderstanding, not listening, or lying. (which is quite similar to how LLMs behave!)
Well, if a human consistently hallucinated as much as an LLM, you definitely would not want them employed and would probably recommend they go to rehab.
from my comment
> at the same time, i could totally see myself scanning through a README and going “yep, sounds like what i need” and making the same mistake (i need other people checking my work too).
yes, us humans have similar issues to the magic black box. i’m not arguing humans are perfect.
this is why we have human code review, tests, staging environments etc. in the release cycle. especially so in safety/security critical contexts. plus warnings from things like register articles/CVEs to keep track of.
like i said. don’t blindly trust the untrusted output (code) of these things — always verify it. like making sure your dependencies aren’t actually crypto miners. we should be doing that normally. but some people still seem to believe the hype about these “magic black box oracles”.
the whole “agentic”/mcp/vibe-coding pattern sounds completely fucking nightmare-ish to me as it reeks of “blindly trust everything LLM throws at you despite what we’ve learned in the last 20 years of software development”.
Sounds like we just need to treat LLMs and humans similarly: accept they are fallible, put review processes in place when it matters if they fail, increase stringency of review as stakes increase.
Vibe coding is all about deciding it doesn’t matter if the implementation is perfect. And that’s true for some things!
> Vibe coding is all about deciding it doesn’t matter if the implementation is perfect. And that’s true for some things!
i was going to say, sure yeah i’m currently building a portfolio/personal website for myself in react/ts, purely for interview showing off etc. probably a good candidate for “vibe coding”, right? here’s the problem - which is explicitly discussed in the article - vibe coding this thing can bring in a bunch of horrible dependencies that do nefarious things.
so i’d be sitting in an interview showing off a few bits and pieces and suddenly their CPU usage spikes at 100% util over all cores because my vibe-coded personal site has a crypto miner package installed and i never noticed. maybe it does some data exfiltration as well just for shits and giggles. or maybe it does <insert some really dark thing here>.
“safety and security critical” applies in way more situations than people think it does within software engineering. so many mundane/boring/vibe-it-out-the-way things we do as software engineers have implicit security considerations to bear in mind (do i install package A or package B?). which is why i find the entire concept of “vibe-coding” to be nightmarish - it treats everything as a secondary consideration to convenience and laziness, including basic and boring security practices like “don’t just randomly install shit”.
> We don't know what's going on in other people's heads
I don't know about you, but for most people theory of mind develops around age 2...
It's true, hallucinations in LLMs can be so consistent that I warn the LLM up front about stuff like "do not use NSCacheDefault, it does not exist, and there is no default value" and then keep my fingers crossed it doesn't find a roundabout way to introduce it anyway.
Can't really remember what it was exactly anymore, something in Apple's Vision libraries that just kept popping up if I didn't explicitly say not to use it.
This was my chief critique when my company forced us to use their AI tooling. I was trying to stitch together our CMDB, two different VMware products, and the corporate technology directory into a form of product tenancy for our customers. At one point I was trying to move and transform data from our CRM into mongoDB, and figured "eh, let's knock these mandatory agent queries out of the way by asking the chatbot to help." I wrote a few prompts to try and explain what I was trying to accomplish (context), and how I'd like it done (instruction).
The bot hallucinated a non-existent mongoDB Powershell cmdlet, complete with documentation on how it works, and then spat out a "solution" to the problem I asked. Every time I reworked the prompt, cut it up into smaller chunks, narrowed the scope of the problem, whatever I tried, the chatbot kept flatly hallucinating non-existent cmdlets, Python packages, or CLI commands, sometimes even providing (non-working) "solutions" in languages I didn't explicitly ask for (such as bash scripting instead of Powershell).
This was at a large technology company, no less, one that's "all-in" on AI.
If you're staying in a very narrow line with a singular language throughout and not calling custom packages, cmdlets, or libraries, then I suspect these things look and feel quite magical. Once you start doing actual work, they're complete jokes in my experience.
If the LLM is "making up" APIs that don't exists, I'm guessing they've been introduced as the model tried to generalize from the training set, as that's the basic idea? These invented APIs might represent patterns the model identified across many similar libraries, or other texts people have written on the internet, wouldn't that actually be a sort of good library to have available if it wasn't already? Maybe we could use these "hallucinations" in a different way, if we could sort of know better what parts are "hallucination" vs not. Maybe just starting points for ideas if nothing else.
In my experience, what's being made up is an incorrect name for an API that already exists elsewhere. They're especially bad at recommending deprecated methods on APIs.
It’s not that the imports don’t exist; they did in the original codebase the LLM creator stole from by ignoring the project’s license terms.
Back in GPT3 days I put together a toy app that let you ask for a python program, and it hooked __getattr__ so if the LLM generated code called a non-existent function it could use GPT3 to define it dynamically. Ended up with some pretty wild alternate reality python implementations. Nothing useful though.
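For the curious, the trick was roughly in the spirit of a module-level __getattr__ (PEP 562); generate_with_llm below is a placeholder standing in for the old GPT-3 completion call:

    # magic_module.py -- any attribute you ask for gets invented on the fly.

    _defined = {}

    def generate_with_llm(name: str) -> str:
        """Placeholder: ask the model for 'a Python function named {name}'."""
        return f"def {name}(*args, **kwargs):\n    return 'stub for {name}'\n"

    def __getattr__(name: str):
        # PEP 562: called when `magic_module.<name>` isn't found normally.
        if name not in _defined:
            source = generate_with_llm(name)
            namespace = {}
            exec(source, namespace)       # define the hallucinated function
            _defined[name] = namespace[name]
        return _defined[name]

Importing magic_module and calling magic_module.frobnicate_widgets() would then define frobnicate_widgets on the fly.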
> wouldn't that actually be a sort of good library to have available if it wasn't already
I for one do not want my libraries' APIs defined by the median person commenting about code or asking questions on Stack Overflow.
Also, every time I've seen people use LLM output as a starting point for software architecture, the results became completely useless.
The average of the internet is heavily skewed towards the mediocre side.
Can’t we just move to have package managers point to a curated list of packages by default, with the option to enable an uncurated one if you know what you’re doing, à la Ubuntu source lists?
At least having good integrated support in the package manager for an allow-list of packages would be good. Then one could maintain such lists in a company or project. And we could have community efforts to develop shared curated lists that could be starting points. If that really catches on, one could consider designating one of them as a default.
Might also want to support multiple allow lists, so one can add to a standard list in a project (after review). And also deny lists, so one can remove a few packages without opting out of the common lists entirely.
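Even without package-manager support, a project can approximate this in CI today. A sketch (the file names and the allow-list format are just conventions I made up):

    """Fail the build if requirements.txt names a package outside allowlist.txt."""
    import re
    import sys
    from pathlib import Path

    def parse_names(path: str) -> set[str]:
        names = set()
        for line in Path(path).read_text().splitlines():
            line = line.split("#")[0].strip()
            if line:
                # Keep just the distribution name, dropping version specifiers/extras.
                names.add(re.split(r"[<>=!~\[ ]", line, maxsplit=1)[0].lower())
        return names

    allowed = parse_names("allowlist.txt")       # curated by the team or community
    requested = parse_names("requirements.txt")  # whatever a dev or an LLM added

    unapproved = requested - allowed
    if unapproved:
        print("Unapproved packages:", ", ".join(sorted(unapproved)))
        sys.exit(1)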
Then you are stuck on whatever passes the gates.
It is shitloads of work to maintain.
Getting a new package from zero into any Linux distribution is close to impossible.
Debian sucks as no one gets on top of reviewing and testing.
"Can we just" is not just a matter of asking: there is a load of work to be done to curate packages, and no one is willing to pay for it.
There is so far no model that works where you can have up-to-date, cutting-edge stuff reviewed. So you are stuck with 5-year-old crap because it was reviewed.
So many good packages made it into Debian relatively recently! Eg: fzf, fd-find, ripgrep, jq, exa, nvim, ...
Yes. But that would mean someone needs to work harder.
> "What a world we live in: AI hallucinated packages are validated and rubber-stamped by another AI that is too eager to be helpful."
That's actually hilarious.
My favorite is when the LLM hallucinates some function or an entire library and you call it out for the mistake. A likely response is "Oh, I'm sorry, You're right. Here's how you would implement function_that_does_not_exist()" and proceeds to write the library it hallucinated in the first place.
It's quirks like these that prove LLMs are a long long way from AGI.
Seems to also especially love making up options and settings for command line tools.
Thank you, AI, for exposing the idiocy of package-driven programming, where everything is a mess of churning external dependencies.
What’s the alternative? Statically linked binaries?
It's so bad, people can't even think of obvious alternatives.
Do you mean "shipping the right libs with your software" (in a self contained package)?
That may mean a lot of redundancy.
Most of the code is badly written. Models are doing what most of their dataset is doing.
I remember, fresh out of college, being shocked by the amount of bugs in open source.
More recent models are producing much higher quality code than models from 6/12/18 months ago. I believe a lot of this is because the AI labs have figured out how to feed them better examples in the training - filtering for higher quality open source code libraries, or loading up on code that passes automated tests.
A lot of model training these days uses synthetic data. Generating good synthetic data for code is a whole lot easier than for any other category, as you can at least ensure the code you're generating is grammatically valid and executes without syntax errors.
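The cheapest version of that filter is the standard library's parser; a sketch (real pipelines would also execute the code and its tests):

    import ast

    def syntactically_valid(source: str) -> bool:
        """Keep a synthetic training sample only if it parses as Python."""
        try:
            ast.parse(source)
            return True
        except SyntaxError:
            return False

    samples = [
        "def add(a, b):\n    return a + b\n",   # kept
        "def add(a, b)\n    return a + b\n",    # dropped: missing colon
    ]
    clean = [s for s in samples if syntactically_valid(s)]
    assert len(clean) == 1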
The dataset isn't making up fake dependencies.
A few days ago: https://news.ycombinator.com/item?id=43644880
The only real solution I see is lint and CI tooling that prevents non-approved packages from getting into your repo. Even with this there is potential for theft on localhost. There are a dozen new YC startups visible in just those two sentences.
Who do you think is going to be writing those linting rules after the first person that cared about it the most finishes?
Could the AI providers themselves monitor any code snippets and look for non-existent dependencies? They could then ask the LLM to create that package with the necessary interface and implant an exploit in the code. Languages that allow build scripts would be perfect as then the malicious repo only needs to have the interface (so that the IDE doesn't complain) and the build script can download a separate malicious payload to run.
The AI providers already write the code, on the whole crazy promise that humans need not to care/read about it. I'm not sure that it changes anything at that point to add one weak level of indirection. You are already compromised.
The article contains nothing new. Just opinions including a security firm CEO selling his security offerings.
Read this instead, it's the technical report that is only linked to and barely mentioned in the article: https://socket.dev/blog/slopsquatting-how-ai-hallucinations-...
Hi — I’m the security firm CEO mentioned, though I wear a few other hats too: I’ve been maintaining open source projects for over a decade (some with 100s of millions of npm downloads), and I taught Stanford’s web security course (https://cs253.stanford.edu).
Totally understand the skepticism. It’s easy to assume commercial motives are always front and center. But in this case, the company actually came after the problem. I’ve been deep in this space for a long time, and eventually it felt like the best way to make progress was to build something focused on it full-time.
socket article seems to mostly be a review of this arXiv preprint paper: https://arxiv.org/pdf/2406.10279
there’s also some info from Python software foundation folks in the register article, so it’s not just a socket pitch article.
That's not the technical report; it's also just a blog article which links to someone else's paper, and finishes off by promoting something:
"Socket addresses this exact problem. Our platform scans every package in your dependency tree, flags high-risk behaviors like install scripts, obfuscated code, or hidden payloads, and alerts you before damage is done. Even if a hallucinated package gets published and spreads, Socket can stop it from making it into production environments."
I’m not measuring it, but it seems like Copilot suggests fewer imports than it used to. It could be that it has more context to see that I rarely import external packages and follows suit. Or maybe I’m using it subtly differently than I used to.
Why can’t PyPI / npm / etc. just scan all newly uploaded modules for typical malware patterns before the package gets approved for distribution?
> Why can’t [X] just [Y] first?
The word "just" here always presumes magic that does not actually exist.
Because doing so is computationally expensive and would be making false promises.
False positives where it incorrectly flagged a safe package would result in the need for a human review step, which is even more expensive.
False negatives where malware patterns didn't match anything previously would happen all the time, so if people learned to "trust" the scanning they would get caught out - at which point what value is the scanning adding?
I don't know if there are legal liability issues here too, but that would be worth digging into.
As it stands, there are already third parties that are running scans against packages uploaded to npm and PyPI and helping flag malware. Leaving this to third parties feels like a better option to me, personally.
When using AI, you are still the one responsible for the code. If the AI writes code and you don't read every line, why did it make its way into a commit? If you don't understand every line it wrote, what are you doing? If you don't actually love every line it wrote, why didn't you make it rewrite it with some guidance or rewrite it yourself?
The situation described in the article is similar to having junior developers we don't trust committing code and us releasing it to production and blaming the failure on them.
If a junior on the team does something dumb and causes a big failure, I wonder where the senior engineers and managers were during that situation. We closely supervise and direct the work of those people until they've built the skills and ways of thinking needed to be ready for that kind of autonomy. There are reasons we have multiple developers of varying levels of seniority: trust.
We build relationships with people, and that is why we extend them the trust. We don't extend trust to people until they have demonstrated they are worthy of that trust over a period of time. At the heart of relationships is that we talk to each other and listen to each other, grow and learn about each other, are coachable, get onto the same page with each other. Although there are ways to coach llm's and fine tune them, LLM's don't do nearly as good of a job at this kind of growth and trust building as humans do. LLM's are super useful and absolutely should be worked into the engineering workflow, but they don't deserve the kind of trust that some people erroneously give to them.
You still have to care deeply about your software. If this story talked about inexperienced junior engineers messing up codebases, I'd be wondering where the senior engineers and leadership were in allowing that to mess things up. A huge part of engineering is all about building reliable systems out of unreliable components and always has been. To me this story points to process improvement gaps and ways of thinking people need to change more than it points to the weak points of AI.
If you get pwned by some AI code hallucination you deserve it honestly. They're code assistants not code developers.
If you get pwned by external dependencies in any way, you deserve it.
This idea of programs fetching reams of needed stuff from the cloud somewhere is a real scourge in programming.
When I suspect that it will make stuff up, I tell it to cite the docs that contain the functions it used. It causes more global warming, but it works fine.
Slop in slop out
I have no other gear than polemic on the topic of AI-for-code-generation so ignore this comment if you don’t like that.
I think people in software envy real-engineering too much. Software development is what it is. If it does not live up to that bar then so be it. But AI-for-code-generation (“AI” for short now) really drops any kind of pretense. I got into software because it was supposed to be analytic, even kind of a priori. And deterministic. What even is AI right now? It melds the very high tech and probabilistic (AI tech) with the low tech of code generation (which is deterministic by itself but not with AI). That’s a regression both in terms of craftsmanship (code generation) and so-called engineering (deterministic). I was looking forward to higher-level software development: more declarative (better programming languages and other things), more tool-assisted (tests, verification), more deterministic and controlled (Nix?), and fewer process redundancies (e.g. less redundancy across manual/automated testing, verification, review, auditing). Instead we are mining the hard work of the past three decades and spitting out things that have the mandatory label “this might be anything, verify it yourself”. We aren’t making higher-level tools—we[1] are making a taller tower with fewer support beams, until the tower reaches so high that the wind can topple it at any moment.
The above was just for AI-for-code-generation. AI could perhaps be used to create genuinely higher level processes. A solid structure with better support. But that’s not the current trajectory/hype.
[1] ChatGPT em-dash alert. https://news.ycombinator.com/item?id=43498204
[dead]
Usually, when the model hallucinates a dependency, the subject of the hallucination really should exist. I've often thought that was kind of interesting in itself. It can feel like a genuine glimpse of emergent creativity.
Children may invent the world as they do not know it well yet. Adults know that reality is not what you may expect. We need to deal with reality, so...
I'm waiting for the AI apologists to swarm on this post explaining how these are just the results of poorly written prompts, because AI could not make mistakes with proper prompts. Been seeing an increase of this recently on AI-critical content, and it's exhausting.
Sure, with well written prompts you can have some success using AI assistants for things, but also with well-written non-ambiguous prompts you can inexplicably end up with absolute garbage.
Until things become consistent, this sort of generative AI is more akin to a party trick than being able to replace or even supplement junior engineers.
As an "AI apologist", sorry to disappoint but the answer here isn't better prompting: it's code review.
If an LLM spits out code that uses a dependency you aren't familiar with, it's your job to review that dependency before you install it. My lowest effort version of this is to check that it's got a credible commit and release history and evidence that many other people are using it already.
Same as if some stranger opens a PR against your project introducing a new-to-you dependency.
If you don't have the discipline to do good code review, you shouldn't be using AI-assisted programming outside of safe sandbox environments.
(Understanding "safe sandbox environment" is a separate big challenge!)
Yep. The issue is most people I've seen who lean most on these tools do not have that discipline.
Being good at reading and reviewing code is quite a rare skill!
One time some of our internal LLM tooling decided to delete a bunch of configuration and replace it with: “[EXISTING CONFIGURATION HERE]”
Lmfaooo
Haha. That sounds like something Sonnet 3.6 would do, it learned to cheat that way and it's an absolute pain in the ass to make it produce longer outputs.
Hahahaha. That's actually amazing.
You are getting replaced, man. Burying your head in the sand won’t help.
This is just another reason why dependencies are an anti-pattern. If you do nothing, your software shouldn't change.
I suspect that this style of development became popular in the first place because the LGPL has different copyright implications based on whether code is statically or dynamically linked. Corporations don't want to be forced to GPL their code so a system that outsources libraries to random web sites solves a legal problem for them.
But it creates many worse problems because it involves linking your code to code that you didn't write and don't control. This upstream code can be changed in a breaking way or even turned into malware at any time but using these dependencies means you are trusting that such things won't happen.
Modern dependency based software will never "just work" decades from now like all of that COBOL code from the 1960s that infamously still runs government and bank computer systems on the backend. Which is probably a major reason why they won't just rewrite the COBOL code.
You could say as a counterargument that operating systems often include breaking changes as well. Which is true but you don't update your operating system on a regular basis. And the most popular operating system (Windows) is probably the most popular because Microsoft historically has prioritized backward compatibility even to the extreme point of including special code in Windows 95 to make sure it didn't break popular games like SimCity that relied on OS bugs from Windows 3.1 and MS-DOS[0].
[0]: https://www.joelonsoftware.com/2000/05/24/strategy-letter-ii...
IME when AI "hallucinates" API endpoints or library functions that just aren't there it's almost always the case that they should be. In other words the AI has based it's understanding on the combined knoweledge of hundreds(?) of other APIs and libraries and is geenrating an obvious analogy.
Turning this around: a great use case is to ask AI to review documents, APIs, etc. AI is really great for teasing out your blindspots.
If the training data contains useless endpoints the AI will also hallucinate those useless endpoints.
The wisdom of the crowd only works for the aggregated result, not if you consider every individual answer; then you get more wrong answers because you fall toward the average.
The next step could be to ask it to generate the missing function.