For me, a team of junior developers that refuse to learn from their mistakes is the fuel of nightmares. I'm stuck in a loop where every day I need to explain to a new hire why they made the exact same beginner's mistake as the last person on the last day. Eventually, I'd rather spend half an hour of my own time than to explain the problem once more...
Why anyone thinks having 3 different PRs for each jira ticket might boost productivity, is beyond me.
Related anime: I May Be a Guild Receptionist, But I'll Solo Any Boss to Clock Out on Time
One of the (many) differences between junior developers and LLM assistance is that humans can learn from their mistakes, whereas with LLMs it's up to you as the prompter to learn from their mistakes.
If an LLM screws something up you can often adjust their prompt to avoid that particular problem in the future.
> One of the (many) differences between junior developers and LLM assistance is that humans can learn from their mistakes
One would think so, but I've had some developers repeat the same mistake a hundred times, where eventually they admit they just keep forgetting it.
The frustration you feel when telling a human for the Xth time that we do not allow yoda-conditions in our codebase is incredibly similar to when an AI does something wrong.
Randomising LLM outputs (temperature) results is outputs that will always have some degree of hallucination.
That’s just math. You can’t mix a random factor in and magically expect it to not exist. There will always be p(generates random crap) > 0.
However, in any probabilistic system, you can run a function k times and you’ll get an output distribution that is meaningful if k is high enough.
3 is not high enough.
At 3, this is stupid; all you’re observing is random variance.
…but, in general, running the same prompt multiple times and taking some kind of general solution from the distribution isn’t totally meaningless, I guess.
The thing with LLMs is they scale in a way that actually allows this to be possible, in a way that scaling with humans can’t.
… like the monkeys and Shakespeare, there probably a limit to the value it can offer; but it’s not totally meaningless to try it.
I think this is an interesting idea, but I also somewhat suspect you've replaced a tedious problem with a harder, more tedious problem.
Take your idea further. Now I've got 100 agents, and 100 PRs, and some small percentage of them are decent. The task went from "implement a feature" to "review 100 PRs and select the best one".
Even assuming you can ditch 50 percent right off the bat as trash... Reviewing 50 potentially buggy implementations of a feature and selecting the best genuinely sounds worse than just writing the solution.
Worse... If you haven't solved the problem before anyways, you're woefully unqualified as a reviewer.
Theory of constraints is clear: speeding up something that isn't a bottleneck worsens system performance.
The idea that too little code is the problem is the problem. Code is liability. Making more of it faster (and probabilistic) is a fantastically bad idea.
There should be test cases ran, coverage ensured. This is trivially automated. LLMs should also review the PRs, at least initially, using the test results as part of the input.
Who tests the tests? How do you know that the LLM-generated tests are actually asserting anything meaningful and cover the relevant edge cases?
The tests are part of the code that needs to be reviewed in the PR by a human. They don't solve the problem, they just add more lines to the reviewer's job.
So now either the agent is writing the tests, in which case you're right back to the same issue (which tests are actually worth running?) or your job is now just writing tests (bleh...).
And for the llm review of the pr... Why do you assume it'll be worth any more then the original implementation? Or are we just recursing down a level again (if 100 llms review each of the 100 PRs... To infinity and beyond!)
The LLMs can help with the writing of the tests, but you should verify that they're testing critical aspects and known edge cases are covered. A single review-promoted LLM can then utilize those across the PRs and provide a summary for acceptance the the best. Or discard all and do manually; that initial process should only have taken a few minutes, so minimal wastage in the grand scheme of things, given over time there are a decent amount of acceptances, compared to the alternative 100% manual effort and associated time sunk.
The linked article from Steve Yegge (https://sourcegraph.com/blog/revenge-of-the-junior-developer) provides a 'solution', which he thinks is also imminent - supervisor AI agents, where you might have 100+ coding agents creating PRs, but then a layer of supervisors that are specialized on evaluating quality, and the only PRs that a human being would see would be the 'best', as determined by the supervisor agent layer.
From my experience with AI agents, this feels intuitively possible - current agents seem to be ok (thought not yet 'great') at critiquing solutions, and such supervisor agents could help keep the broader system in alignment.
>but then a layer of supervisors that are specialized on evaluating quality
Why would supervisor agents be any better than the original LLMs? Aren't they still prone to hallucinations and subject to the same limitations imposed by training data and model architecture?
It feels like it just adds another layer of complexity and says, "TODO: make this new supervisor layer magically solve the issue." But how, exactly? If we already know the secret sauce, why not bake it into the first layer from the start?
Similar to how human brains behave, it is easier to train a model to select a better solution between many choices than to check an individual solution for correctness [1], which is in turn an easier task to learn than writing a correct single solution in the first place.
[1] the diffs in logic can suggest good ideas that may have been missed in subsets of solutions.
Just add a CxO layer that monitors the supervisors! And the board of directors watches the CEO and the shareholders monitor the board of directors. It's agents all the way up!
LLMs are smarter in hindsight than going forward, sort of like humans! only they don't have such flexible self reflection loops so they tend to fall into local minima more easily.
This reads like it could result in "the blind, leading the blind". Unless the Supervisor AI agents are deterministic, it can still be a crapshoot. Given the resources that SourceGraph has, I'm still surprised they missed the most obvious thing, which is "context is king" and we need tooling that can make adding context to LLMs dead simple. Basically, we should be optimizing for the humans in the loop.
Agents have their place for trivial and non-critical fixes/features, but the reality is, unless the agents can act in a deterministic manner across LLMs, you really are coding with a loaded gun. The worst is, agents can really dull your senses over time.
I do believe in a future where we can trust agents 99% of the time, but the reality is, we are not training on the thought process, for this to become a reality. That is, we are not focused on the conversation to code training data. I would say 98% of my code is AI generated, and it is certainly not vibe coding. I don't have a term for it, but I am literally dictating to the LLM what I want done and have it fill in the pieces. Sometimes it misses the mark, sometimes it aligns and sometimes it introduces whole new ideas that I have never thought of, which will lead to a better solution. The instructions that I provide is based on my domain knowledge and I think people are missing the mark when they talk about vibe coding, in a professional context.
Full Disclosure: I'm working on improving the "conversation to code" process, so my opinions are obviously biased, but I strongly believe we need to first focus on better capturing our thought process.
> However, in any probabilistic system, you can run a function k times and you’ll get an output distribution that is meaningful if k is high enough.
This is the underlying flaw in this approach. Attempting to use probabilistic algorithms to produce a singular verifiably correct result requires an external agent to select what is correct in the output of "k times" invocations. This is a person capable of making said determination.
> The thing with LLMs is they scale in a way that actually allows this to be possible, in a way that scaling with humans can’t.
For the "k times" generation of text part, sure. Not for the determination of which one within k, if any, are acceptable for the problem at hand.
EDIT: clarified "produce a verifiably correct result" to be "produce a singular verifiably correct result"
Three is high enough, in my eyes. Two might be. Remember that we don't care about any but the best solution. With one sample you've only got one 50/50 shot to get above the median. With three, the odds of the best of the three being above the median is 87.5%.
Of course picking the median as the random crap boundary is entirely arbitrary, but it'll do until there's a justification for a better number.
Even with human junior devs, ideally you'd maintain some documentation about common mistakes/gotchas so that when you onboard new people to the team they can read that instead of you having to hold their hand manually.
You can do the same thing for LLMs by keeping a file with those details available and included in their context.
You can even set up evaluation loops so that entries can be made by other agents.
> For me, a team of junior developers that refuse to learn from their mistakes is the fuel of nightmares. I'm stuck in a loop where every day I need to explain to a new hire why they made the exact same
This is a huge opportunity, maybe the next big breakthrough in AI when someone figures out how to solve it
Instead of having a model that knows everything, have a model that can learn on the go from the feedback it gets from the user
Ideally a local model too. So something that runs on my computer that I train with my own feedback so that it gets better at the tasks I need it to perform
You could also have one at team level, a model that learns from the whole team to perform the tasks the team needs it to perform
Continual feedback means continual training. No way around it. So you’d have to scope down the functional unit to a fairly small lora in order to get reasonable re-training costs here.
That's not quite true. The system prompt is state that you can use for "training" in a way that fits the problem here. It's not differentiable so you're in slightly alien territory, but it's also more comprehensible than gradient-descending a bunch of weights.
> This is a huge opportunity, maybe the next big breakthrough in AI when someone figures out how to solve it
I am not saying I solved it, but I believe we are going to experience a paradigm shift in how we program and teach and for some, they are really going to hate it. With AI, we can now easily capture the thought process for how we solve problems, but there is a catch. For this to work, senior developers will need to come to terms that their value is not in writing code, but solving problems.
I would say 98% of my code is now AI generated and I have 0% fear that it will make me dumber. I will 100% become less proficient in writing code, but my problem solving skills will not go away and will only get sharper. In the example below, 100% of the code/documentation was AI generated, but I still needed to guide Gemini 2.5 Pro
After reviewing the code, it was clear what the problem was and since I didn't want to waste token and time, I literally suggested the implementation and told it to not generate any code, but asks it to explain the problem and the solution, as shown below.
> The bug is still there. Why does it not use states to to track the start @@ and end @@? If you encounter @@ , you can do an if else on the line by asking if the line ends with @@. If so, you can change the state to expect replacement start delimiter. If it does not end with @@ you can set the state to expect line to end with @@ and not start with @@. Do you understand? Do not generate any code yet.
How I see things evolving over time is, senior developers will start to code less and less and the role for junior developers will not only be to code but to review conversations. As we add new features and fix bugs, we will start to link to conversations that Junior developers can learn from. The Dooms day scenario is obviously, with enough conversations, we may reach the point where AI can solve most problems one shot.
> For this to work, senior developers will need to come to terms that their value is not in writing code, but solving problems.
This is the key reason behind authoring https://ghuntley.com/ngmi - developers that come to terms with the new norm will flourish yet the developers who don't will struggle in corporate...
Seems more likely that Orange would just be producing 16x the mediocre C-tier work they were doing before. This scenario assumes that the problem from the lower-tier programmers is their volume of work, but if you aren't able to properly understand the problems or solutions in the first place, LLMs are not going to fix it for you.
Chances are Grape and Apple will eventually adopt LLMs because they need to in order to fix the mistakes Orange is now producing at scale.
Nice write up. I don't necessary think it is "if you adopt it, you will flourish" as much as, if you have this type of "personality you will easily become 10x, if you have this type of personality you will become 2x and if you have this type, you will become .5x".
I'm obviously biased, but I believe developers with a technical entrepreneur mindset, will see the most benefit. This paradigm shift requires the ability to properly articulate your thoughts and be able to create problem statements for every action. And honestly, not everybody can do this.
Obviously, a lot depends on the problems being solved and how well trained the LLM is in that person's domain. I had Claude and a bunch of other models write my GitSense Chat Bridge code which makes it possible to bring Git's history into my chat app and it is slow as hell. It works most of the time, but it was obvious that the design pattern was based on simple CRUD apps. And this is where LLMs will literally slow you down and I know this because I already solved this problem. The LLM generated chat bridge code will be free and open sourced but I will charge for my optimized indexing engine.
Darn I wonder if systems could be modified so that common mistakes become less common or if documentation could be written once and read multiple times by different people.
We feed it conventions that are automatically loaded for every LLM task, do that the LLM adheres to coding style, comment style, common project tooling and architecture etc.
These systems don't do online learning, but that doesn't mean you can spoon feed them what they should know and mutate that knowledge over time.
The current challenge is not to create a patch, but to verify it.
Testing a fix in a big application is a very complex task. First of all, you have to reproduce the issue, to verify steps (or create them, because many issues don't contain clear description). Then you should switch to the fixed version and make sure that the issue doesn't exists. Finally, you should apply little exploratory testing to make sure that the fix doesn't corrupted neighbour logic (deep application knowledge required to perform it).
To perform these steps you have to deploy staging with the original/fixed versions or run everything locally and do pre-setup (create users, entities, etc. to achieve the corrupted state).
This is very challenging area for the current agents. Now they just can't do these steps - their mental models just not ready for a such level of integration into the app and infra. And creation of 3/5/10/100 unverified pull requests just slow down software development process.
Are you sure about "all"? Because I mentioned not only env deployment, but also functional issue reproduction using UI/API, which is also require necessary pre-setup.
Automated tests partially solve the case, but in real world no one writes tests blindly. It's always manual work, and when the failing trajectory is clear - the test is written.
Theoretically agent can interact with UI or API. But it requires deep project understanding, gathered from code, documentation, git history, tickets, slack. And obtaining this context, building an easily accessible knowledge base and puring only necessary parts into the agent context - is still a not solved task.
If your CI/CD process was able to fully verify a fix then it would have stopped the bug from making it to production the first time around and the Jira ticket which was handed to multiple LLMs never would have existed.
There is no fundamental blocker to agents doing all those things. Mostly a matter of constructing the right tools and grounding, which can be fair amount of up-front work. Arming LLMs with the right tools and documentation got us this far. There’s no reason to believe that path is exhausted.
Author proposed one-line solution, but the following discussion includes analysis of RFC, potential negative outcomes, different ways to fix it.
And without deep understanding of the project - it's not clear how to fix it properly, without damage to backward compatibility and neighbor functionality.
Also such a fix must be properly tested manually, because even well designed autotests are not 100% match the actual flow.
You can explore other open and closed issues and corresponding discussions. And this is the complexity level of real software, not pet projects or simple apps.
I guess that existing attention mechanism is the fundamental blocker, because it barely able to process all the context required for a fix.
Have you tried building agents? They will go from PhD level smart to making mistakes a middle schooler would find obvious, even on models like gemini-2.5 and o1-pro. It's almost like building a sandcastle where once you get a prompt working you become afraid to make any changes because something else will break.
I think the issue right now is so many people want to believe in the moonshot and are investing heavily in it, when the reality is we should be focusing on the home runs. LLMs are a game changer, but there is still A LOT of tooling that can be created to make it easier to integrate humans in the loop.
Correct! Over at https://ghuntley.com/mcp I propose that each company develops their own tools for their particular codebase that shapes LLM actions on how to work with their codebase.
I’ve been using Cursor and Code regularly for a few months now and the idea of letting three of them run free on the codebase seems insane. The reason for the chat interface is that the agent goes off the rails on a regular basis. At least 25% of the time I have to hit the stop button and go back to a checkpoint because the automatic lawnmower has started driving through the flowerbed again. And paradoxically, the more capable the model gets, the more likely it seems to get random ideas of how to fix things that aren’t broken.
Had a similar experience with Claude Code lately. I got a notice some credits were expiring, so I opened up Claude Code and asked it to fix all the credo errors in an elixir project (style guide enforcement).
I gave it incredibly clear steps of what to run in what process, maybe 6 steps, 4 of which were individual severity levels.
Within a few minutes it would as to commit code, create branches, run tests, start servers — always something new, none of which were in my instructions. It would also often run mix credo, get a list of warnings, deem them unimportant, then try to go do its own thing.
It was really cool, I basically worked through 1000 formatting errors in 2 hours with $40 of credits (that I would have had no use for otherwise).
But man, I can’t imagine letting this thing run a single command without checking the output
So... I know that people frame these sorts of things as if it's some kind of quantization conspiracy, but as someone who started using Claude Code the _moment_ that it came out, it felt particularly strong. Then, it feels like they... tweaked something, whether in CC or Sonnet 3.7 and it went a little downhill. It's still very impressive, but something was lost.
I've found Gemini 2.5 Pro to be extremely impressive and much more able to run in an extended fashion by itself, although I've found very high variability in how well 'agent mode' works between different editors. Cursor has been very very weak in this regard for me, with Windsurf working a little better. Claude Code is excellent, but at the moment does feel let down by the model.
I've been using Aider with Gemini 2.5 Pro and found that it's very much able to 'just go' by itself. I shipped a mode for Aider that lets it do so (sibling comment here) and I've had it do some huge things that run for an hour or more, but assuredly it does get stuck and act stupidly on other tasks as well.
My point, more than anything, is that... I'd try different editors and different (stronger) models and see - and that small tweaks to prompt and tooling are making a big difference to these tools' effectiveness right now. Also, different models seem to excel at different problems, so switching models is often a good choice.
Eh I am happy waiting many years before any of that. If it only work right with the right model for the right job, and it’s very fuzzy which models work for which tasks, and the models change all the time (often times silently)… at some point it’s just easier to do the easy task I’m trying to offload then juggle all off this.
If and when I go about trying these tools in the future, I’ll probably looks for and open source TUI, so keep up the great work on aider!
> letting three of them run free on the codebase seems insane
That seems like an unfair characterization of the process they described here.
They only allowed the agents to create pull requests for a specific bug. Both the bug report and the decision of which, if any, PR to accept is done by a human being.
Right, but it seems like that would just generate three PRs I don’t want to review, given the likelihood the agent went into the weeds without someone supervising it.
Edit: In case anyone wants to try it, I uploaded it to PyPI as `navigator-mode`, until (and if!) the PR is accepted. By I, I mean that it uploaded itself. You can see the session where it did that here: https://asciinema.org/a/9JtT7DKIRrtpylhUts0lr3EfY
and, because Aider's already an amazing platform without the autonomy, it's very easy to use the rest of Aider's options, like using `/ask` first, using `/code` or `/architect` for specific tasks [1], but if you start in `/navigator` mode (which I built, here), you can just... ask for a particular task to be done and... wait and it'll often 'just get done'.
It's... decidedly expensive to run an LLM this way right now (Gemini 2.5 Pro is your best bet), but if it's $N today, I don't doubt that it'll be $0.N by next year.
I don't mean to speak in meaningless hype, but I think that a lot of folks who are speaking to LLMs' 'inability' to do things are also spending relatively cautiously on them, when tomorrow's capabilities are often here, just pricey.
I'm definitely still intervening as it goes (as in the Devin demos, say), but I'm also having LLMs relatively autonomously build out large swathes of functionality, the kind that I would put off or avoid without them. I wouldn't call it a programmer-replacement any time soon (it feels far from that), but I'm solo finishing architectures now that I know how to build, but where delegating them to a team of senior devs would've resulted in chaos.
[1]: also for anyone who hasn't tried it and doesn't like TUI, do note that Aider has a web mode and a 'watch mode', where you can use your normal editor and if you leave a comment like '# make this darker ai!', Aider will step in and apply the change. This is even fancier with navigator/autonomy.
It does for me, yes -- models seem to be pretty capable of adhering to the tool call format, which is really all that they 'need' in order to do a good job.
I'm still tweaking the prompts (and I've introduced a new, tool-call based edit format as a primary replacement to Aider's usual SEARCH/REPLACE, which is both easier and harder for LLMs to use - but it allows them to better express e.g. 'change the name of this function').
So... if you have any trouble with it, I would adjust the prompts (in `navigator_prompts.py` and `navigator_legacy_prompts.py` for non-tool-based editing). In particular when I adopted more 'terseness and proactively stop' prompting, weaker LLMs started stopping prematurely more often. It's helpful for powerful thinking models (like Sonnet and Gemini 2.5 Pro), but for smaller models I might need to provide an extra set of prompts that let them roam more.
One thing I've had in the back of my brain for a few days is the idea of LLM-as-a-judge over a multi-armed bandit, testing out local models. Locally, if you aren't too fussy about how long things take, you can spend all the tokens you want. Running head-to-head comparisons is slow, but with a MAB you're not doing so for every request. Nine times out of ten it's the normal request cycle. You could imagine having new models get mixed in as and when they become available, able to take over if they're genuinely better, entirely behind the scenes. You don't need to manually evaluate them at that point.
I don't know how well that gels with aider's modes; it feels like you want to be able to specify a judge model but then have it control the other models itself. I don't know if that's better within aider itself (so it's got access to the added files to judge a candidate solution against, and can directly see the evaluation) or as an API layer between aider and the vllm/ollama/llama-server/whatever service, with the complication of needing to feed scores out of aider to stoke the MAB.
You could extend the idea to generating and comparing system prompts. That might be worthwhile but it feels more like tinkering at the edges.
It's funny you say this! I was adding a tool just earlier (that I haven't yet pushed) that allows the model to... switch model.
Aider can also have multiple models active at any time (the architect, editor and weak model is the standard set) and use them for different aspects. I could definitely imagine switching one model whilst leaving another active.
I think it did a fairly good job! It took just a couple of minutes and it effectively just switches the main model based on recent input, but I don’t doubt that this could become really robust if I had poked or prompted it further with preferences, ideas, beliefs and pushback! I imagine that you could very quickly get it there if you wished.
It's definitely not showing off the most here, because it's almost all direct-coding, very similar to ordinary Aider. :)
The trend with LLMs so far has been: if you have an issue with the AI, wait 6 months for a more advanced model. Cobbling together workarounds for their deficiencies is basically a waste of effort.
I've noticed that large models from different vendors often end up converging on more or less the same ideas (probably because they're trained on more or less the same data). A few days ago, I asked both Grok and ChatGPT to produce several stories with an absurd twist, and they consistently generated the same twists, differing only in minor details. Often, they even used identical wording!
Is there any research into this phenomenon? Is code generation any different? Isn't there a chance that several "independent" models might produce the same (say, faulty) result?
Plandex[1] uses a similar “wasteful” approach for file edits (note: I’m the creator). It orchestrates a race between diff-style replacements plus validation, writing the whole file with edits incorporated, and (on the cloud service) a specialized model plus validation.
While it sounds wasteful, the calls are all very cheap since most of the input tokens are cached, and once a valid result is achieved, other in-flight requests are cancelled. It’s working quite well, allowing for quick results on easy edits with fallbacks for more complex changes/large files that don’t feel incredibly slow.
This is a very interesting idea and I really should consider Aider in the "scriptable" sense more, I only use interactively.
I might add another step after each PR is created where another agent(s?) review and compare the results (maybe have the other 2 agents review the first agents code?).
Thanks, and having another step for reviewing each other's code is a really cool extension to this, I'll give it a shot :) Whether it works or it doesn't it could be really interesting for a future post!
We're going to have no traditional programming in 2 years? Riiight.
It would also be nice to see a demo where the task was something that I couldn't have done myself in essentially no time. Like, what happens if you say "tasks should support tags, and you should be able to filter/group tasks by tag"?
Gave it a shot real quick, looks like I need to fix something up about automatically running the migrations either in the CI script or locally...
But if you're curious, task was this:
----
Title: Bug: Users should be able to add tags to a task to categorize them
Description:
Users should be able to add multiple tags to a task but aren't currently able to.
Given I am a user with multiple tasks
When I select one
Then I should be able to add one or many tags to it
Given I am a user with multiple tasks each with multiple tags
When I view the list of tasks
Then I should be able to see the tags associated with each task
One thing to note that I've found - I know you had the "...and you should be able to filter/group tasks by tag" on the request - usually when you have a request that is "feature A AND feature B" you get better results when you break it down into smaller pieces and apply them one by one. I'm pretty confident that if I spent time to get the migrations running, we'd be able to build that request out story-by-story as long as we break it out into bite-sized pieces.
You can have a larger model split things out into more manageable steps and create new tickets - marked as blocked or not on each other, then have the whole thing run.
Wouldnt AI be perfect for those easy tasks? They still take time if you wanna do it "properly" with a new branch etc. I get lots of "can you change the padding for that component". And that is all. Is it easy? Sure. But still takes time to open the project, create a new branch, make the change, push the change, create a merge request, etc. That probably takes me 10 min.
If I could just let the AI do all of them and just go in and check the merge requests and approve them it would save me time.
In 2 years the entire industry will be in cybersecurity threat prevention. You think C is bad because of memory safety - wait until you see the future where every line was written by AI.
I wouldn't be surprised if someone tries to leverage this with their customer feature request tool.
Imagine having your customers write feature requests for your saas, that immediately triggers code generation and a PR. A virtual environment with that PR is spun up and served to that customer for feedback and refinement. Loop until customer has implemented the feature they would like to see in your product.
It's cute but I don't see the benefit. In my experience, if one LLM fails to solve a problem, the other ones won't be too different.
If you picked a problem where LLMs are good, now you have to review 3 PRs instead of just 1. If you picked a problem where they're bad, now you have 3 failures.
I think there are not many cases where throwing more attempts at the problem is useful.
Insincere answer that will probably be attempted sincerely nonetheless: throw even more agents at the problem by having them do code review as well. The solution to problems caused by AI is always more AI.
Technically that's known as "LLM-as-judge" and it's all over the literature. The intuition would be that the capability to choose between two candidates doesn't exactly overlap with the ability to generate either one of them from scratch. It's a bit like how (half of) generative adversarial networks work.
Simple, just ask an(other) AI! But seriously, different models are better/worse at different tasks, so if you can figure out which model is best at evaluating changes, use that for the review.
I wonder if using thinking models would work better here. They generally have less variance and consider more options, which could achieve the same goal.
I love this! I have a similar automation for moving a feature through ideation/requirements/technical design, but I usually dump the result into Cursor for last mile and to save on inference. Seeing the cost analysis is eye opening.
There’s probably also some upside to running the same model multiple times. I find Sonnet will sometimes fail, I’ll roll back and try again with same prompt but clean context, and it will succeed.
There's something cooked about Windsurf/Cursors' go-to-market pricing - there's no way they are turning a profit at $50/month. $50/month gets you a happy meal experience. If you want more power, you gotta ditch snacking at McDonald’s.
In the future, companies should budget $100 USD to $500 USD per day, per dev, on tokens as the new normal for business, which is circa $25k USD (low end) to $50k USD (likely) to $127k USD (highest) per year.
I've been lucky enough to have a few conversations with Scott a month or so ago and he is doing some really compelling work around the AISDLC and creating a factory line approach to building software. Seriously folks, I recommend following this guy closely.
There's another guy in this space I know who's doing similar incredible things but he doesn't really speak about it publicly so don't want to discuss w/o his permission. I'm happy to make an introduction for those interested just hmu (check my profile for how).
love to see "Why It Matters" turn into the heading equivalent to "delve" in body text (although different in that the latter is a legitimate word while the former is a "we need to talk about…"–level turn of phrase)
I'm authoring a self-compiling compiler with custom lexical tokens via LLM. I'm almost at stage 2, and approximately 50 "stdlib" concerns have specifications authored for them.
The idea of doing them individually in the IDE is very unappealing. Now that the object system, ast, lexer, parser, and garbage collection have stabilized, the codebase is at a point where fanning out agents makes sense.
As stage 3 nears, it won't make sense to fan out until the fundamentals are ready again/stabilised, but at that point, I'll need to fan out again.
The fleet approach can work well particularly because: 1) different models are trained differently, even though using mostly same data (think someone who studied SWE at MIT, vs one who studied at Harvard), 2) different agents can be given different prompts, which specializes their focus (think coder vs reviewer), and 3) the context window content influences the result (think someone who's seen the history of implementation attempts, vs one seeing a problem for the first time). Put those traits in various combinations and the results will be very different from a single agent.
The 10 cents is BS. It was only that because it was a trivial bug. A non-trivial bug requires context and the more context something requires, the more expensive it gets. Also once you are working with larger apps you have to pick the context, especially with LLMs that have smaller windows.
For me, a team of junior developers that refuse to learn from their mistakes is the fuel of nightmares. I'm stuck in a loop where every day I need to explain to a new hire why they made the exact same beginner's mistake as the last person on the last day. Eventually, I'd rather spend half an hour of my own time than to explain the problem once more...
Why anyone thinks having 3 different PRs for each jira ticket might boost productivity, is beyond me.
Related anime: I May Be a Guild Receptionist, But I'll Solo Any Boss to Clock Out on Time
One of the (many) differences between junior developers and LLM assistance is that humans can learn from their mistakes, whereas with LLMs it's up to you as the prompter to learn from their mistakes.
If an LLM screws something up you can often adjust their prompt to avoid that particular problem in the future.
> One of the (many) differences between junior developers and LLM assistance is that humans can learn from their mistakes
One would think so, but I've had some developers repeat the same mistake a hundred times, where eventually they admit they just keep forgetting it.
The frustration you feel when telling a human for the Xth time that we do not allow yoda-conditions in our codebase is incredibly similar to when an AI does something wrong.
> often
Often being about 30% of the time in my experience
It may not be as stupid as it sounds.
Randomising LLM outputs (temperature) results is outputs that will always have some degree of hallucination.
That’s just math. You can’t mix a random factor in and magically expect it to not exist. There will always be p(generates random crap) > 0.
However, in any probabilistic system, you can run a function k times and you’ll get an output distribution that is meaningful if k is high enough.
3 is not high enough.
At 3, this is stupid; all you’re observing is random variance.
…but, in general, running the same prompt multiple times and taking some kind of general solution from the distribution isn’t totally meaningless, I guess.
The thing with LLMs is they scale in a way that actually allows this to be possible, in a way that scaling with humans can’t.
… like the monkeys and Shakespeare, there probably a limit to the value it can offer; but it’s not totally meaningless to try it.
I think this is an interesting idea, but I also somewhat suspect you've replaced a tedious problem with a harder, more tedious problem.
Take your idea further. Now I've got 100 agents, and 100 PRs, and some small percentage of them are decent. The task went from "implement a feature" to "review 100 PRs and select the best one".
Even assuming you can ditch 50 percent right off the bat as trash... Reviewing 50 potentially buggy implementations of a feature and selecting the best genuinely sounds worse than just writing the solution.
Worse... If you haven't solved the problem before anyways, you're woefully unqualified as a reviewer.
Theory of constraints is clear: speeding up something that isn't a bottleneck worsens system performance.
The idea that too little code is the problem is the problem. Code is liability. Making more of it faster (and probabilistic) is a fantastically bad idea.
There should be test cases ran, coverage ensured. This is trivially automated. LLMs should also review the PRs, at least initially, using the test results as part of the input.
Who tests the tests? How do you know that the LLM-generated tests are actually asserting anything meaningful and cover the relevant edge cases?
The tests are part of the code that needs to be reviewed in the PR by a human. They don't solve the problem, they just add more lines to the reviewer's job.
So now either the agent is writing the tests, in which case you're right back to the same issue (which tests are actually worth running?) or your job is now just writing tests (bleh...).
And for the llm review of the pr... Why do you assume it'll be worth any more then the original implementation? Or are we just recursing down a level again (if 100 llms review each of the 100 PRs... To infinity and beyond!)
This by definition is not trivially automated.
The LLMs can help with the writing of the tests, but you should verify that they're testing critical aspects and known edge cases are covered. A single review-promoted LLM can then utilize those across the PRs and provide a summary for acceptance the the best. Or discard all and do manually; that initial process should only have taken a few minutes, so minimal wastage in the grand scheme of things, given over time there are a decent amount of acceptances, compared to the alternative 100% manual effort and associated time sunk.
The linked article from Steve Yegge (https://sourcegraph.com/blog/revenge-of-the-junior-developer) provides a 'solution', which he thinks is also imminent - supervisor AI agents, where you might have 100+ coding agents creating PRs, but then a layer of supervisors that are specialized on evaluating quality, and the only PRs that a human being would see would be the 'best', as determined by the supervisor agent layer.
From my experience with AI agents, this feels intuitively possible - current agents seem to be ok (thought not yet 'great') at critiquing solutions, and such supervisor agents could help keep the broader system in alignment.
>but then a layer of supervisors that are specialized on evaluating quality
Why would supervisor agents be any better than the original LLMs? Aren't they still prone to hallucinations and subject to the same limitations imposed by training data and model architecture?
It feels like it just adds another layer of complexity and says, "TODO: make this new supervisor layer magically solve the issue." But how, exactly? If we already know the secret sauce, why not bake it into the first layer from the start?
Similar to how human brains behave, it is easier to train a model to select a better solution between many choices than to check an individual solution for correctness [1], which is in turn an easier task to learn than writing a correct single solution in the first place.
[1] the diffs in logic can suggest good ideas that may have been missed in subsets of solutions.
Just add a CxO layer that monitors the supervisors! And the board of directors watches the CEO and the shareholders monitor the board of directors. It's agents all the way up!
LLMs are smarter in hindsight than going forward, sort of like humans! only they don't have such flexible self reflection loops so they tend to fall into local minima more easily.
This reads like it could result in "the blind, leading the blind". Unless the Supervisor AI agents are deterministic, it can still be a crapshoot. Given the resources that SourceGraph has, I'm still surprised they missed the most obvious thing, which is "context is king" and we need tooling that can make adding context to LLMs dead simple. Basically, we should be optimizing for the humans in the loop.
Agents have their place for trivial and non-critical fixes/features, but the reality is, unless the agents can act in a deterministic manner across LLMs, you really are coding with a loaded gun. The worst is, agents can really dull your senses over time.
I do believe in a future where we can trust agents 99% of the time, but the reality is, we are not training on the thought process, for this to become a reality. That is, we are not focused on the conversation to code training data. I would say 98% of my code is AI generated, and it is certainly not vibe coding. I don't have a term for it, but I am literally dictating to the LLM what I want done and have it fill in the pieces. Sometimes it misses the mark, sometimes it aligns and sometimes it introduces whole new ideas that I have never thought of, which will lead to a better solution. The instructions that I provide is based on my domain knowledge and I think people are missing the mark when they talk about vibe coding, in a professional context.
Full Disclosure: I'm working on improving the "conversation to code" process, so my opinions are obviously biased, but I strongly believe we need to first focus on better capturing our thought process.
> It may not be as stupid as it sounds.
It is.
> However, in any probabilistic system, you can run a function k times and you’ll get an output distribution that is meaningful if k is high enough.
This is the underlying flaw in this approach. Attempting to use probabilistic algorithms to produce a singular verifiably correct result requires an external agent to select what is correct in the output of "k times" invocations. This is a person capable of making said determination.
> The thing with LLMs is they scale in a way that actually allows this to be possible, in a way that scaling with humans can’t.
For the "k times" generation of text part, sure. Not for the determination of which one within k, if any, are acceptable for the problem at hand.
EDIT: clarified "produce a verifiably correct result" to be "produce a singular verifiably correct result"
> like the monkeys and Shakespeare, there probably a limit to the value it can offer; but it’s not totally meaningless to try it.
Whenever someone uses this analogy, a question never addressed is:
> taking some kind of general solution from the distribution
My instinct is that this should be the temperature 0K response (no randomness).
Three is high enough, in my eyes. Two might be. Remember that we don't care about any but the best solution. With one sample you've only got one 50/50 shot to get above the median. With three, the odds of the best of the three being above the median is 87.5%.
Of course picking the median as the random crap boundary is entirely arbitrary, but it'll do until there's a justification for a better number.
Even with human junior devs, ideally you'd maintain some documentation about common mistakes/gotchas so that when you onboard new people to the team they can read that instead of you having to hold their hand manually.
You can do the same thing for LLMs by keeping a file with those details available and included in their context.
You can even set up evaluation loops so that entries can be made by other agents.
correct.
> For me, a team of junior developers that refuse to learn from their mistakes is the fuel of nightmares. I'm stuck in a loop where every day I need to explain to a new hire why they made the exact same
This is a huge opportunity, maybe the next big breakthrough in AI when someone figures out how to solve it
Instead of having a model that knows everything, have a model that can learn on the go from the feedback it gets from the user
Ideally a local model too. So something that runs on my computer that I train with my own feedback so that it gets better at the tasks I need it to perform
You could also have one at team level, a model that learns from the whole team to perform the tasks the team needs it to perform
Continual feedback means continual training. No way around it. So you’d have to scope down the functional unit to a fairly small lora in order to get reasonable re-training costs here.
That's not quite true. The system prompt is state that you can use for "training" in a way that fits the problem here. It's not differentiable so you're in slightly alien territory, but it's also more comprehensible than gradient-descending a bunch of weights.
If you treat it as vectors instead of words, it might be differentiable.
Or maybe figure out a different architecture
Either way, the end user experience would be vastly improved
> This is a huge opportunity, maybe the next big breakthrough in AI when someone figures out how to solve it
I am not saying I solved it, but I believe we are going to experience a paradigm shift in how we program and teach and for some, they are really going to hate it. With AI, we can now easily capture the thought process for how we solve problems, but there is a catch. For this to work, senior developers will need to come to terms that their value is not in writing code, but solving problems.
I would say 98% of my code is now AI generated and I have 0% fear that it will make me dumber. I will 100% become less proficient in writing code, but my problem solving skills will not go away and will only get sharper. In the example below, 100% of the code/documentation was AI generated, but I still needed to guide Gemini 2.5 Pro
https://app.gitsense.com/?chat=c35f87c5-5b61-4cab-873b-a3988...
After reviewing the code, it was clear what the problem was and since I didn't want to waste token and time, I literally suggested the implementation and told it to not generate any code, but asks it to explain the problem and the solution, as shown below.
> The bug is still there. Why does it not use states to to track the start @@ and end @@? If you encounter @@ , you can do an if else on the line by asking if the line ends with @@. If so, you can change the state to expect replacement start delimiter. If it does not end with @@ you can set the state to expect line to end with @@ and not start with @@. Do you understand? Do not generate any code yet.
How I see things evolving over time is, senior developers will start to code less and less and the role for junior developers will not only be to code but to review conversations. As we add new features and fix bugs, we will start to link to conversations that Junior developers can learn from. The Dooms day scenario is obviously, with enough conversations, we may reach the point where AI can solve most problems one shot.
Full Disclosure: This is my tool
> For this to work, senior developers will need to come to terms that their value is not in writing code, but solving problems.
This is the key reason behind authoring https://ghuntley.com/ngmi - developers that come to terms with the new norm will flourish yet the developers who don't will struggle in corporate...
Seems more likely that Orange would just be producing 16x the mediocre C-tier work they were doing before. This scenario assumes that the problem from the lower-tier programmers is their volume of work, but if you aren't able to properly understand the problems or solutions in the first place, LLMs are not going to fix it for you.
Chances are Grape and Apple will eventually adopt LLMs because they need to in order to fix the mistakes Orange is now producing at scale.
Nice write up. I don't necessary think it is "if you adopt it, you will flourish" as much as, if you have this type of "personality you will easily become 10x, if you have this type of personality you will become 2x and if you have this type, you will become .5x".
I'm obviously biased, but I believe developers with a technical entrepreneur mindset, will see the most benefit. This paradigm shift requires the ability to properly articulate your thoughts and be able to create problem statements for every action. And honestly, not everybody can do this.
Obviously, a lot depends on the problems being solved and how well trained the LLM is in that person's domain. I had Claude and a bunch of other models write my GitSense Chat Bridge code which makes it possible to bring Git's history into my chat app and it is slow as hell. It works most of the time, but it was obvious that the design pattern was based on simple CRUD apps. And this is where LLMs will literally slow you down and I know this because I already solved this problem. The LLM generated chat bridge code will be free and open sourced but I will charge for my optimized indexing engine.
Darn I wonder if systems could be modified so that common mistakes become less common or if documentation could be written once and read multiple times by different people.
We feed it conventions that are automatically loaded for every LLM task, do that the LLM adheres to coding style, comment style, common project tooling and architecture etc.
These systems don't do online learning, but that doesn't mean you can spoon feed them what they should know and mutate that knowledge over time.
The current challenge is not to create a patch, but to verify it.
Testing a fix in a big application is a very complex task. First of all, you have to reproduce the issue, to verify steps (or create them, because many issues don't contain clear description). Then you should switch to the fixed version and make sure that the issue doesn't exists. Finally, you should apply little exploratory testing to make sure that the fix doesn't corrupted neighbour logic (deep application knowledge required to perform it).
To perform these steps you have to deploy staging with the original/fixed versions or run everything locally and do pre-setup (create users, entities, etc. to achieve the corrupted state).
This is very challenging area for the current agents. Now they just can't do these steps - their mental models just not ready for a such level of integration into the app and infra. And creation of 3/5/10/100 unverified pull requests just slow down software development process.
All the things you describe are already being done by any team with a modern CI/CD workflow, and none of it requires AI.
At my last job, all of those steps were automated and required exactly zero human input.
Are you sure about "all"? Because I mentioned not only env deployment, but also functional issue reproduction using UI/API, which is also require necessary pre-setup.
Automated tests partially solve the case, but in real world no one writes tests blindly. It's always manual work, and when the failing trajectory is clear - the test is written.
Theoretically agent can interact with UI or API. But it requires deep project understanding, gathered from code, documentation, git history, tickets, slack. And obtaining this context, building an easily accessible knowledge base and puring only necessary parts into the agent context - is still a not solved task.
If your CI/CD process was able to fully verify a fix then it would have stopped the bug from making it to production the first time around and the Jira ticket which was handed to multiple LLMs never would have existed.
There is no fundamental blocker to agents doing all those things. Mostly a matter of constructing the right tools and grounding, which can be fair amount of up-front work. Arming LLMs with the right tools and documentation got us this far. There’s no reason to believe that path is exhausted.
Look at this 18 years old Django ticket: https://code.djangoproject.com/ticket/4140
It was impossible to fix, but it required some experiments and deep research about very specific behaviors.
Or this ticket: https://code.djangoproject.com/ticket/35289
Author proposed one-line solution, but the following discussion includes analysis of RFC, potential negative outcomes, different ways to fix it.
And without deep understanding of the project - it's not clear how to fix it properly, without damage to backward compatibility and neighbor functionality.
Also such a fix must be properly tested manually, because even well designed autotests are not 100% match the actual flow.
You can explore other open and closed issues and corresponding discussions. And this is the complexity level of real software, not pet projects or simple apps.
I guess that existing attention mechanism is the fundamental blocker, because it barely able to process all the context required for a fix.
And feature requests a much, much more complex.
Have you tried building agents? They will go from PhD level smart to making mistakes a middle schooler would find obvious, even on models like gemini-2.5 and o1-pro. It's almost like building a sandcastle where once you get a prompt working you become afraid to make any changes because something else will break.
> Have you tried building agents?
I think the issue right now is so many people want to believe in the moonshot and are investing heavily in it, when the reality is we should be focusing on the home runs. LLMs are a game changer, but there is still A LOT of tooling that can be created to make it easier to integrate humans in the loop.
The fundamental blocker: Context
you can just even tell cursor to use any cli tools you use normally in your development, like git, gh, railway, vercel, node debugging, etc etc
Tools is not the problem. Knowledge is.
> Tools is[sic] not the problem. Knowledge is.
This is the most difficult concept to convey, expressed in a succinct manner rarely found.
Correct! Over at https://ghuntley.com/mcp I propose that each company develops their own tools for their particular codebase that shapes LLM actions on how to work with their codebase.
I’ve been using Cursor and Code regularly for a few months now and the idea of letting three of them run free on the codebase seems insane. The reason for the chat interface is that the agent goes off the rails on a regular basis. At least 25% of the time I have to hit the stop button and go back to a checkpoint because the automatic lawnmower has started driving through the flowerbed again. And paradoxically, the more capable the model gets, the more likely it seems to get random ideas of how to fix things that aren’t broken.
Had a similar experience with Claude Code lately. I got a notice some credits were expiring, so I opened up Claude Code and asked it to fix all the credo errors in an elixir project (style guide enforcement).
I gave it incredibly clear steps of what to run in what process, maybe 6 steps, 4 of which were individual severity levels.
Within a few minutes it would ask to commit code, create branches, run tests, start servers — always something new, none of which were in my instructions. It would also often run mix credo, get a list of warnings, deem them unimportant, then try to go do its own thing.
It was really cool, I basically worked through 1000 formatting errors in 2 hours with $40 of credits (that I would have had no use for otherwise).
But man, I can’t imagine letting this thing run a single command without checking the output
So... I know that people frame these sorts of things as if it's some kind of quantization conspiracy, but as someone who started using Claude Code the _moment_ that it came out, it felt particularly strong. Then, it feels like they... tweaked something, whether in CC or Sonnet 3.7 and it went a little downhill. It's still very impressive, but something was lost.
I've found Gemini 2.5 Pro to be extremely impressive and much more able to run in an extended fashion by itself, although I've found very high variability in how well 'agent mode' works between different editors. Cursor has been very very weak in this regard for me, with Windsurf working a little better. Claude Code is excellent, but at the moment does feel let down by the model.
I've been using Aider with Gemini 2.5 Pro and found that it's very much able to 'just go' by itself. I shipped a mode for Aider that lets it do so (sibling comment here) and I've had it do some huge things that run for an hour or more, but assuredly it does get stuck and act stupidly on other tasks as well.
My point, more than anything, is that... I'd try different editors and different (stronger) models and see - and that small tweaks to prompt and tooling are making a big difference to these tools' effectiveness right now. Also, different models seem to excel at different problems, so switching models is often a good choice.
> I've had it do some huge things that run for an hour or more,
Can you clarify this? If I am reading this right, you let the llm think/generate output for an hour? This seems bonkers to me.
Eh, I am happy waiting many years before any of that. If it only works right with the right model for the right job, and it's very fuzzy which models work for which tasks, and the models change all the time (oftentimes silently)… at some point it's just easier to do the easy task I'm trying to offload than to juggle all of this.
If and when I go about trying these tools in the future, I'll probably look for an open-source TUI, so keep up the great work on Aider!
> letting three of them run free on the codebase seems insane
That seems like an unfair characterization of the process they described here.
They only allowed the agents to create pull requests for a specific bug. Both the bug report and the decision of which, if any, PR to accept is done by a human being.
Right, but it seems like that would just generate three PRs I don’t want to review, given the likelihood the agent went into the weeds without someone supervising it.
Over the last two days, I've built out support for autonomy in Aider (a lot like Claude Code) that hybridizes with the rest of the app:
https://github.com/Aider-AI/aider/pull/3781
Edit: In case anyone wants to try it, I uploaded it to PyPI as `navigator-mode`, until (and if!) the PR is accepted. By I, I mean that it uploaded itself. You can see the session where it did that here: https://asciinema.org/a/9JtT7DKIRrtpylhUts0lr3EfY
Edit 2: And as a Show HN, too: https://news.ycombinator.com/item?id=43674180
and, because Aider's already an amazing platform without the autonomy, it's very easy to use the rest of Aider's options, like using `/ask` first, using `/code` or `/architect` for specific tasks [1], but if you start in `/navigator` mode (which I built, here), you can just... ask for a particular task to be done and... wait and it'll often 'just get done'.
It's... decidedly expensive to run an LLM this way right now (Gemini 2.5 Pro is your best bet), but if it's $N today, I don't doubt that it'll be $0.N by next year.
I don't mean to speak in meaningless hype, but I think that a lot of folks who are speaking to LLMs' 'inability' to do things are also spending relatively cautiously on them, when tomorrow's capabilities are often here, just pricey.
I'm definitely still intervening as it goes (as in the Devin demos, say), but I'm also having LLMs relatively autonomously build out large swathes of functionality, the kind that I would put off or avoid without them. I wouldn't call it a programmer-replacement any time soon (it feels far from that), but I'm solo finishing architectures now that I know how to build, but where delegating them to a team of senior devs would've resulted in chaos.
[1]: also for anyone who hasn't tried it and doesn't like TUI, do note that Aider has a web mode and a 'watch mode', where you can use your normal editor and if you leave a comment like '# make this darker ai!', Aider will step in and apply the change. This is even fancier with navigator/autonomy.
> It's... decidedly expensive to run an LLM this way right now
Does it work ok with local models? Something like the quantized deepseeks, gemma3 or llamas?
It does for me, yes -- models seem to be pretty capable of adhering to the tool call format, which is really all that they 'need' in order to do a good job.
I'm still tweaking the prompts (and I've introduced a new, tool-call based edit format as a primary replacement to Aider's usual SEARCH/REPLACE, which is both easier and harder for LLMs to use - but it allows them to better express e.g. 'change the name of this function').
So... if you have any trouble with it, I would adjust the prompts (in `navigator_prompts.py` and `navigator_legacy_prompts.py` for non-tool-based editing). In particular when I adopted more 'terseness and proactively stop' prompting, weaker LLMs started stopping prematurely more often. It's helpful for powerful thinking models (like Sonnet and Gemini 2.5 Pro), but for smaller models I might need to provide an extra set of prompts that let them roam more.
Since you've got the aider hack session going...
One thing I've had in the back of my brain for a few days is the idea of LLM-as-a-judge over a multi-armed bandit, testing out local models. Locally, if you aren't too fussy about how long things take, you can spend all the tokens you want. Running head-to-head comparisons is slow, but with a MAB you're not doing so for every request. Nine times out of ten it's the normal request cycle. You could imagine having new models get mixed in as and when they become available, able to take over if they're genuinely better, entirely behind the scenes. You don't need to manually evaluate them at that point.
I don't know how well that gels with aider's modes; it feels like you want to be able to specify a judge model but then have it control the other models itself. I don't know if that's better within aider itself (so it's got access to the added files to judge a candidate solution against, and can directly see the evaluation) or as an API layer between aider and the vllm/ollama/llama-server/whatever service, with the complication of needing to feed scores out of aider to stoke the MAB.
You could extend the idea to generating and comparing system prompts. That might be worthwhile but it feels more like tinkering at the edges.
Does any of that sound feasible?
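For concreteness, roughly the loop I have in mind is a Thompson-sampling bandit where a judge model occasionally scores head-to-head runs. A toy sketch (the model names and the `generate()`/`judge()` hooks are placeholders, nothing Aider-specific):

```python
import random

# Toy Thompson-sampling bandit over local models; generate() and judge() are stand-ins.
models = ["qwen2.5-coder", "gemma3", "deepseek-r1"]  # whatever you happen to have pulled
wins = {m: 1 for m in models}                        # Beta(1, 1) priors
losses = {m: 1 for m in models}

def pick_model():
    # Draw a plausible win rate for each arm and play the best draw.
    return max(models, key=lambda m: random.betavariate(wins[m], losses[m]))

def handle_request(prompt, generate, judge, compare_rate=0.1):
    champion = pick_model()
    answer = generate(champion, prompt)
    # Nine times out of ten this is the whole request; occasionally run a head-to-head.
    if random.random() < compare_rate:
        challenger = pick_model()
        if challenger != champion:
            rival = generate(challenger, prompt)
            if judge(prompt, answer, rival) == "A":   # judge returns "A" or "B"
                wins[champion] += 1
                losses[challenger] += 1
            else:
                wins[challenger] += 1
                losses[champion] += 1
    return answer
```

New models just get appended to `models` with fresh priors and take over only if they actually win comparisons.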
It's funny you say this! I was adding a tool just earlier (that I haven't yet pushed) that allows the model to... switch model.
Aider can also have multiple models active at any time (the architect, editor and weak model are the standard set) and use them for different aspects. I could definitely imagine switching one model whilst leaving another active.
So yes, this definitely seems feasible.
Aider had a fairly coherent answer to this question, I think: https://gist.github.com/tekacs/75a0e3604bc10ea88f9df9a909b5d...
This was navigator mode + Gemini 2.5 Pro's attempt at implementing it, based only on pasting in your comment:
https://asciinema.org/a/EKhno9vQlqk9VkYizIxsY8mIr
https://github.com/tekacs/aider/commit/6b8b76375a9b43f9db785...
I think it did a fairly good job! It took just a couple of minutes and it effectively just switches the main model based on recent input, but I don’t doubt that this could become really robust if I had poked or prompted it further with preferences, ideas, beliefs and pushback! I imagine that you could very quickly get it there if you wished.
It's definitely not showing off the most here, because it's almost all direct-coding, very similar to ordinary Aider. :)
Very cool. Even cooler to see it upload itself!!
The trend with LLMs so far has been: if you have an issue with the AI, wait 6 months for a more advanced model. Cobbling together workarounds for their deficiencies is basically a waste of effort.
I've noticed that large models from different vendors often end up converging on more or less the same ideas (probably because they're trained on more or less the same data). A few days ago, I asked both Grok and ChatGPT to produce several stories with an absurd twist, and they consistently generated the same twists, differing only in minor details. Often, they even used identical wording!
Is there any research into this phenomenon? Is code generation any different? Isn't there a chance that several "independent" models might produce the same (say, faulty) result?
Plandex[1] uses a similar “wasteful” approach for file edits (note: I’m the creator). It orchestrates a race between diff-style replacements plus validation, writing the whole file with edits incorporated, and (on the cloud service) a specialized model plus validation.
While it sounds wasteful, the calls are all very cheap since most of the input tokens are cached, and once a valid result is achieved, other in-flight requests are cancelled. It’s working quite well, allowing for quick results on easy edits with fallbacks for more complex changes/large files that don’t feel incredibly slow.
1 - https://github.com/plandex-ai/plandex
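The race itself is simple to sketch. A toy version of the pattern below, with stand-in strategy coroutines rather than Plandex's actual edit paths:

```python
import asyncio

# Toy "race the edit strategies": keep the first valid result, cancel the rest.
async def diff_replace(task):         # fast path; pretend validation failed here
    await asyncio.sleep(0.5)
    return None

async def whole_file_rewrite(task):   # slower fallback that usually validates
    await asyncio.sleep(2.0)
    return f"full rewrite for {task!r}"

async def race(task):
    attempts = [asyncio.create_task(s(task)) for s in (diff_replace, whole_file_rewrite)]
    try:
        for finished in asyncio.as_completed(attempts):
            result = await finished
            if result is not None:    # first strategy to produce a valid edit wins
                return result
    finally:
        for t in attempts:            # cancel whatever is still in flight
            t.cancel()

print(asyncio.run(race("fix padding")))
```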
This is a very interesting idea and I really should consider Aider in the "scriptable" sense more; I only use it interactively.
I might add another step after each PR is created where another agent (or several?) reviews and compares the results (maybe have the other two agents review the first agent's code?).
Thanks, and having another step where they review each other's code is a really cool extension to this, I'll give it a shot :) Whether it works or not, it could be really interesting for a future post!
Wonder if you could have the reviewer characterize any mistakes and feed those back into the coding prompt: “be sure to… be sure not to…”
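That loop is cheap to try. A sketch, where `llm()` stands in for whatever chat-completion helper you already have:

```python
# Sketch: a reviewer model distills "be sure to / be sure not to" rules from each PR,
# and the rules accumulate into the next coding prompt. llm() is a placeholder.
def review_rules(diff: str, llm) -> str:
    return llm(
        "Review this diff and list concrete guidance for the next attempt, phrased as "
        "'Be sure to ...' or 'Be sure not to ...' bullets:\n\n" + diff
    )

def coding_prompt(ticket: str, lessons: list[str]) -> str:
    prompt = ticket
    if lessons:
        prompt += "\n\nLessons from earlier attempts:\n" + "\n".join(lessons)
    return prompt

# Usage: after each PR, lessons.append(review_rules(pr_diff, llm)),
# then re-run the coding agent with coding_prompt(ticket, lessons).
```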
We're going to have no traditional programming in 2 years? Riiight.
It would also be nice to see a demo where the task was something that I couldn't have done myself in essentially no time. Like, what happens if you say "tasks should support tags, and you should be able to filter/group tasks by tag"?
Gave it a shot real quick, looks like I need to fix something up about automatically running the migrations either in the CI script or locally...
But if you're curious, task was this:
----
Title: Bug: Users should be able to add tags to a task to categorize them
Description: Users should be able to add multiple tags to a task but aren't currently able to.
Given I am a user with multiple tasks When I select one Then I should be able to add one or many tags to it
Given I am a user with multiple tasks each with multiple tags When I view the list of tasks Then I should be able to see the tags associated with each task
----
And then we ended up with:
GPT-4o ($0.05): https://github.com/sublayerapp/buggy_todo_app/pull/51
Claude 3.5 Sonnet ($0.09): https://github.com/sublayerapp/buggy_todo_app/pull/52
Gemini 2.0 Flash ($0.0018): https://github.com/sublayerapp/buggy_todo_app/pull/53
One thing to note that I've found - I know you had the "...and you should be able to filter/group tasks by tag" on the request - usually when you have a request that is "feature A AND feature B" you get better results when you break it down into smaller pieces and apply them one by one. I'm pretty confident that if I spent time to get the migrations running, we'd be able to build that request out story-by-story as long as we break it out into bite-sized pieces.
You can have a larger model split things out into more manageable steps and create new tickets - marked as blocked or not on each other, then have the whole thing run.
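A sketch of that planning step, using the OpenAI Python client; the model name and prompt wording are just placeholders, and the resulting tickets would feed the same per-ticket pipeline as above:

```python
import json
from openai import OpenAI  # assumes the official openai package and OPENAI_API_KEY set

client = OpenAI()

def split_into_tickets(feature_request: str) -> list[dict]:
    """Ask a larger model to break a feature into small, dependency-ordered tickets."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any capable "planner" model
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Split this feature request into small, independently shippable tickets. "
                'Respond with JSON: {"tickets": [{"title": ..., "description": ..., '
                '"blocked_by": [titles of prerequisite tickets]}]}.\n\n' + feature_request
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)["tickets"]
```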
Wouldn't AI be perfect for those easy tasks? They still take time if you want to do them "properly" with a new branch etc. I get lots of "can you change the padding for that component" requests. And that is all. Is it easy? Sure. But it still takes time to open the project, create a new branch, make the change, push the change, create a merge request, etc. That probably takes me 10 minutes.
If I could just let the AI do all of them and just go in and check the merge requests and approve them it would save me time.
In 2 years the entire industry will be in cybersecurity threat prevention. You think C is bad because of memory safety - wait until you see the future where every line was written by AI.
I wouldn't be surprised if someone tries to leverage this with their customer feature request tool.
Imagine having your customers write feature requests for your saas, that immediately triggers code generation and a PR. A virtual environment with that PR is spun up and served to that customer for feedback and refinement. Loop until customer has implemented the feature they would like to see in your product.
Enterprise plan only, obviously.
It's cute but I don't see the benefit. In my experience, if one LLM fails to solve a problem, the other ones won't be too different.
If you picked a problem where LLMs are good, now you have to review 3 PRs instead of just 1. If you picked a problem where they're bad, now you have 3 failures.
I think there are not many cases where throwing more attempts at the problem is useful.
Sincere question: Has anyone figured out how we're going to code review the output of an agent fleet?
Insincere answer that will probably be attempted sincerely nonetheless: throw even more agents at the problem by having them do code review as well. The solution to problems caused by AI is always more AI.
Technically that's known as "LLM-as-judge" and it's all over the literature. The intuition would be that the capability to choose between two candidates doesn't exactly overlap with the ability to generate either one of them from scratch. It's a bit like how (half of) generative adversarial networks work.
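A minimal version of the pairwise setup looks something like this (`llm()` is illustrative; in practice you'd also swap the A/B order to counter position bias):

```python
# Sketch of LLM-as-judge over two candidate patches; llm() stands in for any chat call.
def judge(task: str, patch_a: str, patch_b: str, llm) -> str:
    verdict = llm(
        f"Task:\n{task}\n\nPatch A:\n{patch_a}\n\nPatch B:\n{patch_b}\n\n"
        "Which patch better solves the task with less risk? Answer with exactly 'A' or 'B'."
    )
    return "B" if verdict.strip().upper().startswith("B") else "A"
```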
s/AI/tech
Most of the people pushing this want to just sell an MVP and get a big exit before everything collapses, so code review is irrelevant.
sincere question: why would you not be able to code review it in the same way you would for humans?
Agents could generate more PRs in a weekend than my team could code review in a month.
Initially we can absolutely just review them like any other PR, but at some point code review will be the bottleneck.
Simple, just ask an(other) AI! But seriously, different models are better/worse at different tasks, so if you can figure out which model is best at evaluating changes, use that for the review.
I suspect this will indeed be part of it, but it won't work with today's AIs on today's codebases.
Models will improve, but also I predict code style and architecture will evolve towards something easier for machine review.
You just don't. Choose randomly and then try to quickly sell the company. /s
I wonder if using thinking models would work better here. They generally have less variance and consider more options, which could achieve the same goal.
I love this! I have a similar automation for moving a feature through ideation/requirements/technical design, but I usually dump the result into Cursor for last mile and to save on inference. Seeing the cost analysis is eye opening.
There’s probably also some upside to running the same model multiple times. I find Sonnet will sometimes fail, I’ll roll back and try again with same prompt but clean context, and it will succeed.
re: cost analysis
There's something cooked about Windsurf/Cursors' go-to-market pricing - there's no way they are turning a profit at $50/month. $50/month gets you a happy meal experience. If you want more power, you gotta ditch snacking at McDonald’s.
In the future, companies should budget $100 USD to $500 USD per day, per dev, on tokens as the new normal for business, which is circa $25k USD (low end) to $50k USD (likely) to $127k USD (highest) per year.
Above from https://ghuntley.com/redlining/
This napkin math is based upon my current spend bringing a self-compiled compiler to life.
I've been lucky enough to have a few conversations with Scott a month or so ago and he is doing some really compelling work around the AISDLC and creating a factory line approach to building software. Seriously folks, I recommend following this guy closely.
There's another guy in this space I know who's doing similar incredible things but he doesn't really speak about it publicly so don't want to discuss w/o his permission. I'm happy to make an introduction for those interested just hmu (check my profile for how).
Really excited to see you on the FP of HN Scott!
love to see "Why It Matters" turn into the heading equivalent to "delve" in body text (although different in that the latter is a legitimate word while the former is a "we need to talk about…"–level turn of phrase)
Feels like a way to live with a bad decision rather than getting rid of it.
I don't really think having an agent fleet is a much better solution than having a single agent.
We would like to think that having 10 agents working on the same task will improve the chances of success 10x.
But I would argue that some classes of problems are hard for LLMs and where one agent will fail, 10 agents or 100 agents will fail too.
As an easy example I suggest leetcode hard problems.
I'm authoring a self-compiling compiler with custom lexical tokens via LLM. I'm almost at stage 2, and approximately 50 "stdlib" concerns have specifications authored for them.
The idea of doing them individually in the IDE is very unappealing. Now that the object system, ast, lexer, parser, and garbage collection have stabilized, the codebase is at a point where fanning out agents makes sense.
As stage 3 nears, it won't make sense to fan out until the fundamentals are ready again/stabilised, but at that point, I'll need to fan out again.
https://x.com/GeoffreyHuntley/status/1911031587028042185
The fleet approach can work well particularly because: 1) different models are trained differently, even though using mostly same data (think someone who studied SWE at MIT, vs one who studied at Harvard), 2) different agents can be given different prompts, which specializes their focus (think coder vs reviewer), and 3) the context window content influences the result (think someone who's seen the history of implementation attempts, vs one seeing a problem for the first time). Put those traits in various combinations and the results will be very different from a single agent.
Nit: it doesn't 10x the chance of success; it takes the chance of failure to (chance of failure)^10, assuming independent attempts. For example, a 90% per-attempt failure rate still leaves a 0.9^10 ≈ 35% chance that all ten attempts fail.
neither, probably
We need The Mythical Man-Month: LLM version book.
Makes me think of The Sorcerer's Apprentice.
The 10 cents is BS. It was only that cheap because it was a trivial bug. A non-trivial bug requires context, and the more context something requires, the more expensive it gets. Also, once you are working with larger apps you have to pick the context, especially with LLMs that have smaller windows.
I see 'Waste Inferences' as a form of abductive reasoning.
I see LLMs as a form of inductive reasoning, and so I can see how WI could extend LLMs.
Also, I have no doubt that there are problems that can't be solved with just an LLM but would need abductive extensions.
Same comments apply to deductive (logical) extensions to LLMs.
> Also, I have no doubt that there are problems that can't be solved with just an LLM but would need abductive extensions.
And we're back to expert systems.