Links
SGI-Bench is a benchmark designed to assess AI systems' capabilities in scientific inquiry across stages such as deliberation, conception, action, and perception. It comprises over 1,000 expert-curated samples spanning 10 disciplines, with tasks including deep research, idea generation, and experimental reasoning.
This article benchmarks GPT-5.1, Claude Opus 4.5, and Gemini 3 Pro for security operations tasks. GPT-5.1 and Opus 4.5 show improved accuracy and speed, while Gemini 3 Pro lags behind. The findings help teams choose the best AI model for automation in SecOps.
The Epoch Capabilities Index (ECI) is a composite metric that integrates scores from 39 AI benchmarks into a unified scale for evaluating and comparing model capabilities over time. Using Item Response Theory, the ECI provides a statistical framework that relates model performance to benchmark difficulty, allowing consistent scoring of models such as Claude 3.5 and GPT-5. Full details of the methodology will be published in an upcoming paper; the work is funded by Google DeepMind.