Humaneval - Search News

Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are steadily becoming less useful. That's because though many LLMs have similar high ...

GIGAZINE

Microsoft announces ``phi-1'' that hits HumanEval 50.6% exceeding GPT-3.5 with only 1.3 billion parameters

Below is a comparison of the phi-1's performance with other models. phi-1 showed high accuracy of 50.6% in HumanEval, a dataset for evaluating programming ability, and 55.5% in MBPP. This result is ...

NextBigFuture

Qwen 2.5 Coder and Qwen 3 Lead in Open Source LLM Over DeepSeek and Meta

Qwen 2.5 Coder/Max is currently the top open-source model for coding, with the highest HumanEval (~70–72%), LiveCodeBench (70.7), and Elo (2056) scores among open models. DeepSeek V3/Coder V2 remains ...

InfoQ

Mistral Introduces AI Code Generation Model Codestral

Unlock the full InfoQ experience by logging in! Stay updated with your favorite authors and topics, engage with content, and download exclusive resources. Dany Lepage discusses the architectural ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results