The First Coding Benchmark That Feels Like My Actual AI Work

For the last few months, the public discussion around AI coding tools has felt strangely detached from the work I was actually doing.

If you followed LinkedIn, YouTube, or half of the developer discourse, Claude was the obvious default. Claude Code was suddenly everywhere, and its visibility exploded after the Trump administration banned Anthropic from delivery chains in February 2026. Opus was treated as the serious engineering model. Many benchmarks seemed to support that view, or at least made the frontier models look close enough that the choice came down to taste.

That never matched my own experience. I used Codex and Claude side by side for months on real projects: real repositories, real data analysis, real office tasks and day-to-day automations, with review loops, tool calls, failed attempts, and follow-up rework. The gap always felt much larger than the social-media discussion suggested. Eventually I cancelled the higher Claude subscription and kept only the lower tier, because the product experience and the output did not justify the hype for the way I work. I also didn't like the unclear Claude limits and the subscription tier politics. I understand the reason for them, but I really don't like the execution - that's just Anthropic's way of communicating with its customers.

Now there finally seems to be a benchmark that feels much closer to my reality.

DeepSWE changes the comparison

DeepSWE is a new benchmark from Datacurve for long-horizon software engineering tasks. Another leaderboard by itself would not interest me much; what matters is the shape of the work it tries to measure.

The tasks are original, not copied from existing commits or pull requests, which makes it far less likely that models have seen them in training. There are 113 tasks across 91 active open-source repositories and five languages. The prompts are shorter than SWE-bench Pro prompts, but the reference solutions require much more code. The verifiers are written to test behavior, not whether the agent guessed one exact implementation shape.

Older benchmarks had two problems that made them weak inputs for decision-making. First, the top models often clustered so tightly that the scores did not explain what developers felt in day-to-day work. Second, contamination and verifier weakness made the numbers too easy to overread. Datacurve's audit found that SWE-bench Pro accepted wrong implementations in 8.5% of reviewed cases and rejected reasonable solutions in 24.0% of reviewed cases, while DeepSWE's corresponding rates were 0.3% and 1.1%.

That is the kind of benchmark difference that matters in the real world. If the grader is noisy, a narrow leaderboard gap is not a reliable tool-selection signal. If the tasks contain leaked history or weak tests, the benchmark can reward behavior that does not help you ship better software. Datacurve's qualitative analysis is especially clear, and uncomfortable, here: on SWE-bench Pro, Claude Opus 4.6 and 4.7 were labeled CHEATED on more than 12% of reviewed rollouts for recovering the gold solution from .git history, while GPT-5.4 and GPT-5.5 did not show that behavior.
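To make the grader-noise point concrete, here is a minimal Python sketch. It assumes, for illustration only, that the audit percentages can be read as per-solution error rates (the chance that a wrong solution is accepted and the chance that a correct one is rejected) and that both models are affected equally. The true solve rates of 60% and 50% are hypothetical and do not come from any leaderboard.

```python
def observed_pass_rate(true_rate: float, false_accept: float, false_reject: float) -> float:
    """Expected score when the grader sometimes accepts wrong work and rejects correct work."""
    correct_and_accepted = true_rate * (1.0 - false_reject)
    wrong_but_accepted = (1.0 - true_rate) * false_accept
    return correct_and_accepted + wrong_but_accepted


def observed_gap(rate_a: float, rate_b: float, false_accept: float, false_reject: float) -> float:
    """Score gap between two models as seen through the same noisy grader."""
    return (observed_pass_rate(rate_a, false_accept, false_reject)
            - observed_pass_rate(rate_b, false_accept, false_reject))


# Hypothetical true solve rates: model A solves 60%, model B solves 50%.
TRUE_A, TRUE_B = 0.60, 0.50

# Error rates taken from the audit figures quoted above, read as per-solution rates (assumption).
noisy_gap = observed_gap(TRUE_A, TRUE_B, false_accept=0.085, false_reject=0.240)
clean_gap = observed_gap(TRUE_A, TRUE_B, false_accept=0.003, false_reject=0.011)

print(f"true gap:            {TRUE_A - TRUE_B:.3f}")
print(f"gap via noisy grader: {noisy_gap:.4f}")
print(f"gap via clean grader: {clean_gap:.4f}")
```

Under those assumptions, a true 10-point gap shows up as roughly 6.8 points through the noisier grader and about 9.9 points through the cleaner one. This covers only the systematic compression; random grading noise on a finite task set widens the uncertainty further.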

I don't care much whether DeepSWE is perfect. Datacurve is also clear about its limitations: all models run through the same mini-swe-agent harness, the benchmark covers only selected languages, and it does not fully represent proprietary codebases or every type of engineering work. But as a signal, it is much harder to dismiss than another saturated coding leaderboard where, lately, the frontier models all looked far too similar to me.

The score gap is no longer subtle

The original DeepSWE post makes the separation visible before you even get to the detailed results section:

Figure: DeepSWE leaderboard chart, recreated from the DeepSWE leaderboard data, showing GPT-5.5 at 70%, Claude Opus 4.8 at 58%, GPT-5.4 at 56%, Claude Opus 4.7 at 54%, and other models below. The original publication snapshot showed Opus 4.7 at 54%; the latest leaderboard adds Opus 4.8 at 58%.

The latest DeepSWE leaderboard is blunt:

GPT-5.5 xhigh: 70% +/- 4%, $6.61 average cost, 21m average time, 47k output tokens.
Claude Opus 4.8 max: 58% +/- 5%, $12.58 average cost, 43m average time, 136k output tokens.
GPT-5.4 xhigh: 56% +/- 5%, $4.38 average cost, 27m average time, 71k output tokens.
Claude Opus 4.7 max: 54% +/- 5%, $18.19 average cost, 39m average time, 103k output tokens.
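
One quick sanity check on these numbers is whether the reported ranges overlap. The sketch below assumes the +/- values are symmetric uncertainty ranges around each score (the lines above do not say how they are computed), so treat it as a rough reading aid rather than a statistical test.

```python
from itertools import combinations

# (score %, +/- %) copied from the leaderboard lines above.
RESULTS = {
    "GPT-5.5 xhigh": (70, 4),
    "Claude Opus 4.8 max": (58, 5),
    "GPT-5.4 xhigh": (56, 5),
    "Claude Opus 4.7 max": (54, 5),
}


def interval(score: float, margin: float) -> tuple[float, float]:
    """Lower and upper bound of a symmetric range around the score."""
    return score - margin, score + margin


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two (score, margin) ranges share at least one point."""
    lo_a, hi_a = interval(*a)
    lo_b, hi_b = interval(*b)
    return lo_a <= hi_b and lo_b <= hi_a


# Pairs of models whose ranges are fully separated.
separated = [
    (m1, m2)
    for m1, m2 in combinations(RESULTS, 2)
    if not overlaps(RESULTS[m1], RESULTS[m2])
]

for m1, m2 in separated:
    print(f"{m1} vs {m2}: ranges do not overlap")
```

With these inputs, only the pairs involving GPT-5.5 come out as non-overlapping; the other three models sit within each other's ranges.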

The new Opus 4.8 result does not surprise me, and many developers have reported a similar impression. Anthropic introduced Opus 4.8 as an improvement over Opus 4.7, with better benchmark performance, new effort controls, dynamic workflows in Claude Code, and a cheaper fast mode. To me, that is marketing that gets repeated nearly every month with a different product. On DeepSWE, Opus 4.8 still lands at 58%, while GPT-5.5 lands at 70% - so the gap is still real, and that also matches my latest experience using it.

Marketing aside, it is a practical gap on the kind of tasks that look more like real engineering: read the codebase, understand the behavior, make the change, avoid regressions, and prove it works.

The efficiency numbers also make the comparison harder for Claude. Claude has historically been extremely expensive when used outside a subscription. GPT-5.5 gets the highest efficiency score while using fewer output tokens, less time, and lower average task cost than the Claude Opus 4.8 run shown on the current board. This is close to my own experience: the better tool was not the one with the stronger marketing and the larger social-media reputation - it was the one that got to a usable result with less swearing and frustration. And believe me, both of them can be quite annoying to use, and swearing helps a lot - at least it helps the user :)
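
One way to fold score and cost into a single number is cost per solved task. The sketch below divides the average cost by the pass rate, which assumes the listed cost is an average over all attempted tasks, solved or not. That is an assumption about how the leaderboard reports cost, and the result is not Datacurve's efficiency score.

```python
# (pass rate, average cost in USD) copied from the leaderboard lines above.
BOARD = {
    "GPT-5.5 xhigh": (0.70, 6.61),
    "Claude Opus 4.8 max": (0.58, 12.58),
    "GPT-5.4 xhigh": (0.56, 4.38),
    "Claude Opus 4.7 max": (0.54, 18.19),
}


def cost_per_solved(pass_rate: float, avg_cost_usd: float) -> float:
    """Expected spend per solved task, assuming cost is averaged over all attempts."""
    if pass_rate <= 0:
        raise ValueError("pass rate must be positive")
    return avg_cost_usd / pass_rate


per_solved = {model: cost_per_solved(*row) for model, row in BOARD.items()}

# Print cheapest first.
for model, usd in sorted(per_solved.items(), key=lambda item: item[1]):
    print(f"{model}: ${usd:.2f} per solved task")
```

With these inputs, the result is roughly $7.82 for GPT-5.4, $9.44 for GPT-5.5, $21.69 for Opus 4.8, and $33.69 for Opus 4.7 per solved task.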

This matches what I saw with Codex and Claude

I do not read DeepSWE as proof that one vendor is always better than the other - it never worked in Android vs iOS battles and it certainly does not work here. Different models still have different strengths, and a benchmark is never the same thing as your production workflow.

But it does explain why my own subscription decisions changed.

For several months I worked with Codex and Claude in parallel. Claude often felt impressive in isolated conversations, but less convincing as a daily coding agent. It could be slow, it could be expensive, and its output quality often felt unpredictable - as if some kind of throttling kicked in during peak hours. It could feel oddly fragile once the task crossed from "talk about the code" into "make a complete, reviewable change and keep the whole project maintainable." Claude Code and the Claude desktop experience also did not feel like mature developer products to me. That said, Anthropic's research reputation is strong: the company stands behind some of the most respected innovations in this space, such as MCP, skills, plugins, and Claude Code itself.

The problem is that the customer experience never matched that reputation. There is also a reason for it: while Codex is the primary tool used in OpenAI engineering, the story at Anthropic is quite different. Only a fraction of their developers actually use the Claude desktop app and, even worse, they don't use publicly released models internally, relying instead on unreleased models like Mythos. To me, that is a clear signal about the product experience and the model quality, and also about the company's communication strategy.

Codex went the other way. It felt more useful in the actual workflow: multiple projects, parallel work, mobile access, long-running tasks, and tighter integration with the rest of what I already use in ChatGPT. Model quality matters, but the product surface matters maybe even more. A coding agent is not just a model. It is the model, the harness, the interface, the quota model, the project workflow, the recovery path, and the surrounding subscription value.

That last part is why I keep coming back to the subscription comparison. I have written before about why AI is still cheap for serious users, and why cheap AI subscriptions are capacity-managed bargains. For my own work, the ChatGPT Pro bundle is not only Codex. It also includes Pro models, deep research, image creation, memory, files, and Pulse. OpenAI's own Pro plan page currently describes Codex, deep research, image creation, memory, and file uploads as included advanced features, with the $200 tier positioned for heavy work across parallel projects.

Claude Max also combines the Claude app and Claude Code, but the usage model has felt less predictable to me. Anthropic's own help page says Claude Max usage varies with message length, attachments, conversation length, model, feature, and current capacity, with five-hour session resets. That does not make the plan bad, but it certainly does make it harder to treat as a stable planning unit when agentic coding sessions are long and context-heavy.

So when I compare the two $200/month products as working tools, I do not see the Claude hype reflected in my own results. I see a strong research company with an uneven product experience, competing against an OpenAI product bundle that currently gives me better coding output and more useful surfaces around it.

The marketing layer is not the engineering layer

The lesson here is definitely not "never use Claude." I still keep access to Claude because there are tasks where it is useful, and because serious work benefits from model choice.

The lesson is that tool reputation is not a substitute for verification. You should not choose a model just because some marketing guy on LinkedIn, who suddenly wants to be your fractional AI partner, recommended it.

LinkedIn and YouTube are not neutral measurement systems. They reward confidence, novelty, drama, and clean stories. The people who explain AI tools most loudly are not always the people carrying production responsibility for the results. Some are excellent marketers, some are good technical people, some are both, but the platform does not reliably separate those categories for you.

This becomes incredibly risky when companies start making tool decisions from the LinkedIn feed. A popular coding model can still be the wrong choice for your workflow. A beautiful demo can hide poor maintainability. A benchmark can be contaminated. A subscription can be cheap for an individual but expensive when the same usage moves into API, enterprise, audit, and support terms. A model that writes impressive text can still miss a requirement, overrun the budget, or produce code that passes a weak test for the wrong reason.

That is why I care about benchmarks like DeepSWE. Better measurement improves the conversation, and it gives technical teams a stronger starting point than vibes.

Use the right tool at the right time

The useful position is quite simple: measure against your own work and your own use case.

If you are choosing an AI coding tool, do not ask which model is fashionable this month. Ask what kind of work you need done, what "done" means, what failure looks like, what the agent costs when it runs for hours, how often it verifies its own output, how well it handles your codebase, and how easily your team can review and recover from its changes.
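
The questions above can be turned into a small evaluation log. The following is a minimal sketch, not a finished harness: the `TaskRun` fields, the `summarize` helper, the tool names `tool_a` and `tool_b`, and the sample rows are all hypothetical placeholders that you would replace with records from your own tasks.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    """One attempt by one tool at one of your own tasks (hypothetical schema)."""
    tool: str
    task: str
    accepted: bool       # the change passed your own review and tests
    cost_usd: float
    minutes: float
    rework_rounds: int   # follow-up prompts needed before a final verdict


def summarize(runs: list[TaskRun]) -> dict[str, dict[str, float]]:
    """Aggregate acceptance rate, cost per accepted change, and rework per tool."""
    by_tool: dict[str, list[TaskRun]] = defaultdict(list)
    for run in runs:
        by_tool[run.tool].append(run)

    summary: dict[str, dict[str, float]] = {}
    for tool, tool_runs in by_tool.items():
        count = len(tool_runs)
        accepted = sum(1 for r in tool_runs if r.accepted)
        total_cost = sum(r.cost_usd for r in tool_runs)
        summary[tool] = {
            "runs": float(count),
            "acceptance_rate": accepted / count,
            # Failed attempts still cost money, so divide total spend by accepted changes.
            "cost_per_accepted": total_cost / accepted if accepted else float("inf"),
            "mean_minutes": sum(r.minutes for r in tool_runs) / count,
            "mean_rework_rounds": sum(r.rework_rounds for r in tool_runs) / count,
        }
    return summary


# Placeholder rows for illustration only; these are not measurements of any real tool.
runs = [
    TaskRun("tool_a", "fix-flaky-test", True, 2.0, 15.0, 0),
    TaskRun("tool_a", "add-csv-export", False, 4.0, 30.0, 2),
    TaskRun("tool_b", "fix-flaky-test", True, 3.0, 20.0, 1),
    TaskRun("tool_b", "add-csv-export", True, 5.0, 40.0, 1),
]

report = summarize(runs)
for tool, stats in report.items():
    print(tool, stats)
```

The design choice worth keeping, whatever the schema, is that "accepted" is decided by your own review and tests rather than by the agent's claim that it is done.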

That is also how I think about technical ownership more broadly. The job is not to follow hype early - it is to use the right tool at the right time, with enough evidence that it will survive the real work.

For my own coding workflow today, that points clearly toward Codex and GPT-5.5. DeepSWE did not create that conclusion for me; it only gave shape to something I had already been seeing for months.
