Quit Emailing Yourself

# models → benchmarking

4 links tagged with all of: models + benchmarking

Click any tag below to further narrow down your results

Links

GitHub - InternScience/SGI-Bench: Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

SGI-Bench is a benchmark designed to assess AI systems' capabilities in scientific inquiry, covering stages like deliberation, conception, action, and perception. It includes over 1,000 expert-curated samples from 10 disciplines, focusing on tasks such as deep research, idea generation, and experimental reasoning.

Saved by tldr-importer · Last saved February 14, 2026 · 4 min read

+ ai benchmarking ✓ + scientific-research + evaluation models ✓

Evaluating AI Agents in Security Operations (December 2025) - Cotool

This article benchmarks GPT-5.1, Claude Opus 4.5, and Gemini 3 Pro for security operations tasks. GPT-5.1 and Opus 4.5 show improved accuracy and speed, while Gemini 3 Pro lags behind. The findings help teams choose the best AI model for automation in SecOps.

Saved by tldr-importer · Last saved February 14, 2026 · 4 min read

+ security + ai benchmarking ✓ + automation models ✓

[no-title]

The article discusses the benchmarking of various open-source models for optical character recognition (OCR), highlighting their performance and capabilities. It provides insights into the strengths and weaknesses of different models, aiming to guide developers in selecting the best tools for their OCR needs.

Saved by tldr-importer · Last saved October 29, 2025 · 1 min read

+ ocr + open-source benchmarking ✓ models ✓ + performance

Epoch Capabilities Index | Epoch AI

The Epoch Capabilities Index (ECI) is a composite metric that integrates scores from 39 AI benchmarks into a unified scale for evaluating and comparing model capabilities over time. Utilizing Item Response Theory, the ECI provides a statistical framework to assess model performance against benchmark difficulty, allowing for consistent scoring of AI models such as Claude 3.5 and GPT-5. Future details on the methodology will be published in an upcoming paper funded by Google DeepMind.

Saved by hn_user_13 · Last saved October 28, 2025 · 3 min read

+ ai benchmarking ✓ models ✓