Hacker News | afro88's comments

Similar result on our Kotlin coding benchmark at work. It measures how close agents can get to a small, mergeable PR (according to my team). 20 tasks of varying difficulty, with 5 attempts each, and an LLM as judge to evaluate accuracy (same outcome and quality, but allowing for acceptable variance).

Fable 5 sits ahead of Opus 4.7, but behind Opus 4.6, Sonnet 4.6, Opus 4.8, GPT-5.4, GPT-5.5.

Fable isn't a good coding workhorse. That doesn't mean it's not good for actually complex problems and long-horizon tasks (big POCs, complex research and such). But I only have vibes and Anthropic's own benchmarks and marketing to guide me there.


I'm starting a repository of LLM reviews [1] with the goal of creating a catalog that is more task-oriented and less marketing-y than corporate blogs or benchmark leaderboards. You seem to have a lot of experience across a bunch of different models: if you have a chance and feel like sharing, you'd be one of the first.

[1] - https://model.reviews/ - all the user-submitted content is CC licensed and will be available for download in periodic dumps.


Does your team then manually decide the results by going over the PRs? I suppose you know what you're looking for now, but isn't this still quite painful?

We selected PRs (real ones we merged over the 6 months prior) and have an "LLM as judge" score how close the AI-generated code is to the PR. Same as how other benchmarks do it, but with tasks we actually do and code we have decided is up to scratch for us.
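
For anyone wanting to try something similar, here is a rough sketch of how scores from a setup like this could be rolled up into a ranking. It is not the actual harness: the score scale (0 to 1), the model and task names, and the choice to average attempts per task before averaging across tasks are all assumptions made for illustration.

```python
from statistics import mean


def rank_models(scores):
    """Rank models by mean judge score.

    `scores` maps model name -> task id -> list of judge scores, one per
    attempt. The 0-1 scale and this nesting are assumptions, not the real
    benchmark format. Attempts are averaged within each task first so that
    every task carries equal weight in the final number.
    """
    summary = {}
    for model, tasks in scores.items():
        per_task = [mean(attempts) for attempts in tasks.values()]
        summary[model] = mean(per_task)
    # Highest mean score first.
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: 2 models, 2 tasks, 5 attempts each.
example = {
    "model-a": {
        "task-1": [0.8, 0.6, 1.0, 0.8, 0.8],
        "task-2": [0.4, 0.4, 0.6, 0.2, 0.4],
    },
    "model-b": {
        "task-1": [0.6, 0.6, 0.6, 0.6, 0.6],
        "task-2": [0.2, 0.4, 0.2, 0.4, 0.3],
    },
}

ranking = rank_models(example)
for model, score in ranking:
    print(f"{model}: {score:.2f}")
```

Averaging per task first keeps one noisy task from dominating; with only 5 attempts per task, small gaps between models are probably within noise.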

I'd love to read about the predictions that have been wrong (genuinely)


He has said every month for the past three years that there is no technical progress left to be made in LLMs and that there is no more room in the market for inference spending to grow.

Here[0] is a fun selection of excerpts from his July 2024 post "How Does OpenAI Survive?"[1]

"I see no signs that the transformer-based architecture can do significantly more than it currently does."

[0]: https://xcancel.com/pathsnotchosen/status/206360940100129633...

[1]: https://www.wheresyoured.at/to-serve-altman/


He’s basically started an anti-AI cult.

That statement isn't proof of anything.

If I start a climate crisis cult, that doesn't mean my predictions are wrong.


I don’t have much time to compile his predictions. I’ll let someone else do that.

That said, I just wanted to point out the grift he is doing. He’s making money by telling some people what they want to hear.

If AI invents a cure for cancer, he will tell you it’s still useless because it didn’t invent immortality.


I wonder if there's a way to include data that's so unique you can prove it was trained on and sue later

Unique data like that is unlikely to have any impact on the learned/final weights after training. SGD, Adam and the other hillclimbing solvers abhor jagged edges from "novel" trade secrets and the like. Unless it turns out everyone had the same secret genius idea (and it became a pattern to learn).

> The dynamic of agent codes human reviews does seem like the only sane one for the foreseeable future. Even Anthropic themselves still fall back to this.

Do they? I saw some crazy stat from the guy who built Claude Code that he was pushing hundreds of PRs a day. There's no way you can human-review that much code. It's probably closer to heavily AI-assisted review and planning.


This is a branching point. One dev would find someone else and convince them to approve it. Another would redo the task (code is cheap now, right?) in a PR stack that can actually be reviewed, cleaned up etc.

I hope they were the latter.


That's an example of why it would be useful for someone to actually do it. A random commenter on HN is one thing. A direct comparison on a brand new app that isn't part of any training data is another.

I’m highly confident that prior exposure is irrelevant at this point. I work on vulnerability detection at a hyperscaler.

It's very addictive when you're working on something cool and the agents are iterating nicely. Instead of browsing reddit / HN / instagram etc during downtime, I find it much more fun to build something.

> When we first started experimenting with AI code review, we took the path that most other people probably take: we tried out a few different AI code review tools and found that a lot of these tools worked pretty well, and a lot of them even offered a good amount of customisation and configurability! Unfortunately, though, the one recurring theme that kept coming up was that they just didn’t offer enough flexibility and customisation for an organisation the size of Cloudflare.

Most people I know had the experience that the signal-to-noise ratio was way off, regardless of scale. So it was a burden rather than a help. Code review by AI ended up being a skill run before creating the PR, so the dev owning the PR addressed everything before the team got bogged down with it in review.


It's the latter. You can view it and see fine-grained progress, but you can't interact with it. I hope that's coming next, because it would be useful to steer later phases or even agents.


I tried this out yesterday - lucky enough to have access through EAP at work. The workflows that are generated are quite good - smart parallelisation and phasing. End results for larger chunks of work are also much better, which I attribute to more of the work having clean context windows (Opus 4.7 is unusable past 200k conversation length, and each subagent ends up using less than that IME). They also seem to have a validation phase hint in the workflow generator which also helps a lot. Speed is a bonus.

You can achieve a similar result by manually prompting it to use subagents, yes. But the TUI for in-flight dynamic workflows is really nice - great visibility into exactly what's happening.

Honestly, for anything larger than a one-shot PR, it's worth firing off a workflow for the better automatic context management alone (more work done in the first 20% sweet spot).

