What Dropped Today

OpenAI released GPT-5.5 this morning to Plus, Pro, Business, and Enterprise subscribers in ChatGPT. API access is coming shortly, priced at $5 per million input tokens and $30 per million output tokens, both higher than GPT-5.4's rates. OpenAI's position on the increase is that the model uses significantly fewer tokens to complete the same tasks, so the effective cost per workflow should stay comparable for most users.
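The per-token rates make the cost argument easy to sanity-check yourself. A minimal sketch of the arithmetic, using the announced GPT-5.5 rates; the workflow token counts in the example are hypothetical placeholders, not measured usage, so plug in your own numbers:

```python
# Per-request cost at the announced GPT-5.5 API rates:
# $5 per million input tokens, $30 per million output tokens.
GPT55_INPUT_PER_M = 5.00    # USD per 1M input tokens
GPT55_OUTPUT_PER_M = 30.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float = GPT55_INPUT_PER_M,
                 output_per_m: float = GPT55_OUTPUT_PER_M) -> float:
    """Dollar cost of one API call at the given per-million-token rates."""
    return (input_tokens / 1_000_000) * input_per_m \
         + (output_tokens / 1_000_000) * output_per_m

# Hypothetical workflow: 20k input tokens, 4k output tokens.
print(round(request_cost(20_000, 4_000), 4))  # 0.22
```

OpenAI's claim, in these terms, is that the token counts you feed into this function shrink enough on GPT-5.5 that the higher rates roughly wash out per workflow.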

The model is natively omnimodal, processing text, images, audio, and video within a single system rather than separate modules bolted on after the fact. It ships with a 1 million token context window.

The headline capability claim from OpenAI is worth quoting directly. Greg Brockman described it as a model that "can look at an unclear problem and figure out what needs to happen next." Prior models required more explicit, step-by-step prompting to stay on track. GPT-5.5 is designed to take a messy brief and work out its own path forward without being guided through every step.

The Numbers

The most relevant benchmark for anyone doing agency-adjacent knowledge work is GDPval. It tests an AI agent completing real tasks across 44 occupations, from legal research to finance to product management, and measures how often the output matches or beats what an industry professional produces. GPT-5.5 hits 84.9% on that benchmark. Claude Opus 4.7 sits at 80.3% and Gemini 3.1 Pro at 67.3%.

On OSWorld-Verified, which measures whether the model can operate real computer environments on its own, navigating interfaces and moving between apps without a human in the loop, GPT-5.5 scores 78.7%.

On Terminal-Bench 2.0, which tests complex workflows requiring planning, iteration, and tool coordination, it scores 82.7%. Claude Opus 4.7 is at 69.4% and Gemini 3.1 Pro is at 68.5%.

Two more worth noting for practical work: Tau2-bench Telecom, which tests complex customer service workflows, comes in at 98.0% without any prompt tuning. OfficeQA Pro, covering document and spreadsheet tasks, scores 54.1% versus 43.6% for Claude Opus 4.7.

Where it does not lead: software engineering. Claude Opus 4.7 scores 64.3% on SWE-bench Pro versus GPT-5.5's 58.6%. If your primary use is code generation and debugging, Anthropic's model still has an edge there.