Video Coding Benchmarks

Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are steadily becoming less useful. That's because though many LLMs have similar high ...

InfoQ

CodeClash Benchmarks LLMs through Multi-Round Coding Competitions

Dany Lepage discusses the architectural ...

Hosted on MSN

AI tools expand from coding benchmarks to classroom transparency

On April 27, multiple AI developments showcased how the technology is advancing in both professional and educational contexts. Open benchmarks revealed ChatGPT 5.5’s strengths in short, well-defined ...

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...

Bleeping Computer

Grok 4 benchmark results: Tops math, ranks second in coding

Grok 4 is a huge leap from Grok 3, but how good is it compared to other models in the market, such as Gemini 2.5 Pro? We now have answers, thanks to new independent benchmarks. LMArena.ai, which is an ...

Yahoo Finance

Endor Labs Launches Agentic Code Security Benchmark, Finds Top-Performing AI Coding Agents Pass Tests But Still Fail Security

The benchmark extends the Carnegie Mellon SusVibes framework to continuously evaluate leading AI coding agents, and updates as new agents and models are released. PALO ALTO, Calif., April 15, 2026 ...

InfoQ

Dynamic Languages Faster and Cheaper in 13-Language Claude Code Benchmark

Study finds newer LLMs introduce more severe coding bugs despite higher benchmark scores

ChatGPT 4.1 early benchmarks compared against Google Gemini