afro88 · 2026-06-11T19:48:18 1781207298

Similar result on our Kotlin coding benchmark at work. It measures how close agents can get to a small mergeable PR (according to my team). 20 tasks of varying difficulty, with 5 attempts each, and an LLM as judge to evaluate accuracy (same outcome and quality, but allowing for acceptable variances).

Fable 5 sits ahead of Opus 4.7, but behind Opus 4.6, Sonnet 4.6, Opus 4.8, GPT-5.4, GPT-5.5.

Fable isn't a good coding workhorse. That doesn't mean it's not good for actually complex problems and long-horizon tasks (big POCs, complex research and such). But I only have vibes and Anthropic's own benchmarks and marketing to guide me there.

m-dot-reviews · 2026-06-12T00:51:10 1781225470

I'm starting a repository of LLM reviews [1] with the goal of creating a catalog that is more task-oriented and less marketing-y than corporate blogs or benchmark leaderboards. You seem to have a lot of experience across a bunch of different models: if you have a chance and feel like sharing, you'd be one of the first.

[1] - https://model.reviews/ - all the user-submitted content is CC licensed and will be available for download in periodic dumps.

munksbeer · 2026-06-12T14:57:30 1781276250

Does your team then manually decide the results by going over the PRs? I suppose you know what you're looking for now, but isn't this still quite painful?

afro88 · 2026-06-13T02:22:40 1781317360

We selected PRs (real ones we merged over the 6 months prior) and have an "LLM as judge" score how close the AI-generated code is to the PR. Same as how other benchmarks do it, but with tasks we actually do and code we have decided is actually up to scratch for us.
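
A rough sketch of how scoring like this could be aggregated, not the actual harness. Everything here is an assumption for illustration: `judge_score` is a hypothetical stand-in for the LLM judge (a simple token-overlap measure so the sketch runs offline), and `score_model` averages each task's attempts before averaging across tasks.

```python
from statistics import mean


def judge_score(candidate_diff: str, reference_diff: str) -> float:
    """Hypothetical stand-in for the LLM judge.

    Returns a closeness score in [0, 1]. A real setup would prompt a model
    with both diffs; this uses Jaccard token overlap so the sketch runs offline.
    """
    a, b = set(candidate_diff.split()), set(reference_diff.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def score_model(attempts: dict[str, list[str]], references: dict[str, str]) -> float:
    """Mean over tasks of the mean judge score across that task's attempts."""
    per_task = [
        mean(judge_score(candidate, references[task]) for candidate in candidates)
        for task, candidates in attempts.items()
    ]
    return mean(per_task)
```

Averaging per task first keeps a single task from dominating the result if attempt counts ever differ between tasks.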

afro88 · 2026-06-09T09:18:20 1780996700

I'd love to read about the predictions that have been wrong (genuinely)

JimDabell · 2026-06-09T11:07:56 1781003276

> AI's biggest critic has lost the plot

— https://www.theargumentmag.com/p/ais-biggest-critic-has-lost...

dcre · 2026-06-09T18:01:17 1781028077

He has said every month for the past three years that there is no technical progress left to be made in LLMs and that there is no more room in the market for inference spending to grow.

Here[0] is a fun selection of excerpts from his July 2024 post "How Does OpenAI Survive?"[1]

"I see no signs that the transformer-based architecture can do significantly more than it currently does."

[0]: https://xcancel.com/pathsnotchosen/status/206360940100129633...

[1]: https://www.wheresyoured.at/to-serve-altman/

aurareturn · 2026-06-09T09:28:25 1780997305

He’s basically started an anti-AI cult.

Peaches4Rent · 2026-06-09T10:12:55 1780999975

That statement isn't proof of anything.

If I start a climate crisis cult, that doesn't mean my predictions are wrong.

aurareturn · 2026-06-09T13:53:06 1781013186

I don’t have much time to compile his predictions. I’ll let someone else do that.

That said, I just wanted to point out the grift he is doing. He’s making money by telling some people what they want to hear.

If AI invents a cure for cancer, he will tell you it’s still useless because it didn’t invent immortality.

afro88 · 2026-06-07T22:56:30 1780872990

I wonder if there's a way to include data that's so unique you can prove a model was trained on it and sue later.

Chu4eeno · 2026-06-08T03:32:09 1780889529

Unique data like that is unlikely to have any impact on the learned/final weights after training. SGD, Adam and the other hillclimbing solvers abhor jagged edges from "novel" trade secrets and the like. Unless it turns out everyone had the same secret genius idea (and it became a pattern to learn).
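
For reference, the mechanical side of the idea upthread is usually called a canary string: a random marker planted in the data and searched for later. A minimal sketch, assuming a hypothetical `CANARY-` prefix and a plain substring check. As the reply above argues, a model may well never reproduce such a string, so a miss proves nothing.

```python
import secrets


def make_canary(prefix: str = "CANARY") -> str:
    """Build a marker that is vanishingly unlikely to occur by chance.

    The prefix and format are illustrative assumptions, not a standard.
    """
    return f"{prefix}-{secrets.token_hex(16)}"  # 128 bits of randomness


def contains_canary(model_output: str, canary: str) -> bool:
    """Check whether the exact marker appears verbatim in some model output."""
    return canary in model_output
```

Usage would be to embed the result of `make_canary()` in the published data, record it privately, and later run `contains_canary` over sampled model outputs.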

afro88 · 2026-06-07T22:51:42 1780872702

> The dynamic of agent codes human reviews does seem like the only sane one for the foreseeable future. Even Anthropic themselves still fall back to this.

Do they? I saw some crazy stat from the guy who built Claude Code that he was pushing hundreds of PRs a day. There's no way you can human-review that much code. It's probably closer to heavily AI-assisted review and planning.

afro88 · 2026-06-05T04:39:02 1780634342

This is a branching point. One dev would find someone else and convince them to approve it. Another would redo the task (code is cheap now, right?) in a PR stack that can actually be reviewed, cleaned up etc.

I hope they were the latter.

afro88 · 2026-06-04T02:18:01 1780539481

That's an example of why it would be useful for someone to actually do it. A random commenter on HN is one thing. A direct comparison on a brand new app that isn't part of any training is another

CaveTech · 2026-06-04T02:52:25 1780541545

I’m highly confident that prior exposure is irrelevant at this point. I work on vulnerability detection at a hyperscaler.

HDBaseT · 2026-06-04T03:45:31 1780544731

That's an example of why it would be useful for someone to actually do it. A random commenter on HN is one thing. A direct comparison on a brand new app that isn't part of any training is another

afro88 · 2026-06-03T06:57:48 1780469868

It's very addictive when you're working on something cool and the agents are iterating nicely. Instead of browsing reddit / HN / instagram etc during downtime, I find it much more fun to build something.

afro88 · 2026-05-29T20:12:10 1780085530

> When we first started experimenting with AI code review, we took the path that most other people probably take: we tried out a few different AI code review tools and found that a lot of these tools worked pretty well, and a lot of them even offered a good amount of customisation and configurability! Unfortunately, though, the one recurring theme that kept coming up was that they just didn’t offer enough flexibility and customisation for an organisation the size of Cloudflare.

Most people I know had the experience that signal to noise was way off, regardless of scale. So it was a burden rather than a help. Code review by AI ended up being a skill run before creating the PR, so the dev owning the PR addressed everything before the team got bogged down with it in review.

afro88 · 2026-05-28T18:14:58 1779992098

It's the latter. You can view it and see fine-grained progress, but you can't interact with it. I hope that's coming next, because it would be useful to steer later phases or even agents.

afro88 · 2026-05-28T18:11:45 1779991905

I tried this out yesterday - lucky enough to have access through EAP at work. The workflows that are generated are quite good - smart parallelisation and phasing. End results for larger chunks of work are also much better, which I attribute to more of the work having clean context windows (Opus 4.7 is unusable past 200k conversation length, and each subagent ends up using less than that IME). They also seem to have a validation phase hint in the workflow generator which also helps a lot. Speed is a bonus.

You can achieve a similar result by manually prompting to use subagents, yes. But the TUI for in-flight dynamic workflows is really nice - great visibility into exactly what's happening.

Honestly, for anything larger than a one-shot PR, it's worth firing off a workflow for the better automatic context management alone (more work done in the first 20% sweet spot of the context window).