Large Language Models Benchmarks

Omni Calculator Publishes ORCA V3 Research Report on AI Model Performance in Quantitative Reasoning

Omni Calculator announced the publication of the third iteration of its Omni Research on Calculation in AI (ORCA) Benchmark, an independent benchmarking initiative designed to evaluate the ...

Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch

Frontier AI models corrupt 25% of document content in multi-step workflows — rewriting rather than deleting, which makes the ...

MSN

ChatGPT passes classic benchmark as AI-human distinction narrows

ChatGPT passes classic Alan Turing benchmark as AI-human distinction narrows - ...

European Medical Journal

Advanced AI Language Model Outperforms Physicians in Reasoning Tasks

Large language model outperformed physicians in diagnostic reasoning tasks, highlighting potential for AI in clinical care.

Google DeepMind Features Hirundo’s Security-Hardened Gemma 4 Model – Outperforms LLMs 170x Its Size on Security

Google DeepMind has featured Hirundo’s security-hardened variant of Gemma 4 in its Gemmaverse – the official showcase for the ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

ZDNet

With AI models clobbering every benchmark, it's time for human evaluation

Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...

Bloomberg L.P.

Introducing BloombergGPT, Bloomberg’s 50-billion parameter large language model, purpose-built from scratch for finance

NEW YORK – Bloomberg today released a research paper detailing the development of BloombergGPT™, a new large-scale generative artificial intelligence (AI) model. This large language model (LLM) has ...

Microsoft’s Phi-3 shows the surprising power of small, locally run AI language models

Microsoft’s 3.8B parameter Phi-3 may rival GPT-3.5, signaling a new era of “small language models.” ...
