Introducing GPT-5.4 | OpenAI – automatictester.novatopic.com

In the present day, we’re releasing GPT‑5.4 in ChatGPT (as GPT‑5.4 Considering), the API, and Codex. It’s our most succesful and environment friendly frontier mannequin for skilled work. We’re additionally releasing GPT‑5.4 Professional in ChatGPT and the API, for individuals who need most efficiency on complicated duties.

GPT‑5.4 brings collectively one of the best of our latest advances in reasoning, coding, and agentic workflows right into a single frontier mannequin. It incorporates the industry-leading coding capabilities of GPT‑5.3‑Codex⁠ whereas enhancing how the mannequin works throughout instruments, software program environments, {and professional} duties involving spreadsheets, displays, and paperwork. The result’s a mannequin that will get complicated actual work executed precisely, successfully, and effectively—delivering what you requested for with much less forwards and backwards.

In ChatGPT, GPT‑5.4 Considering can now present an upfront plan of its considering, so you possibly can modify course mid-response whereas it’s working, and arrive at a ultimate output that’s extra intently aligned with what you want with out extra turns. GPT‑5.4 Considering additionally improves deep internet analysis, significantly for extremely particular queries, whereas higher sustaining context for questions that require longer considering. Collectively, these enhancements imply higher-quality solutions that arrive sooner and keep related to the duty at hand.

In Codex and the API, GPT‑5.4 is the primary general-purpose mannequin we’ve launched with native, state-of-the-art computer-use capabilities, enabling brokers to function computer systems and perform complicated workflows throughout functions. It helps as much as 1M tokens of context, permitting brokers to plan, execute, and confirm duties throughout lengthy horizons. GPT‑5.4 additionally improves how fashions work throughout giant ecosystems of instruments and connectors with device search, serving to brokers discover and use the proper instruments extra effectively with out sacrificing intelligence. Lastly, GPT‑5.4 is our most token environment friendly reasoning mannequin but, utilizing considerably fewer tokens to unravel issues when in comparison with GPT‑5.2—translating to decreased token utilization and sooner speeds.

Along with advances generally reasoning, coding, {and professional} information work, GPT‑5.4 permits extra dependable brokers, sooner developer workflows, and higher-quality outputs throughout ChatGPT, the API, and Codex.

*Beforehand reported as 64.7%. GPT‑5.3‑Codex achieves 74.0% with a newly launched API parameter that preserves the unique picture decision.

Constructing on GPT‑5.2’s basic reasoning capabilities, GPT‑5.4 delivers much more constant and polished outcomes on real-world duties that matter to professionals.

On GDPval⁠, which assessments brokers’ talents to supply well-specified information work throughout 44 occupations, GPT‑5.4 achieves a brand new cutting-edge, matching or exceeding {industry} professionals in 83.0% of comparisons, in comparison with 70.9% for GPT‑5.2.

In GDPval, fashions try well-specified information work spanning 44 occupations from the highest 9 industries contributing to U.S. GDP. Duties request actual work merchandise, corresponding to gross sales displays, accounting spreadsheets, pressing care schedules, manufacturing diagrams, or quick movies. Reasoning effort was set to xhigh for GPT‑5.4 and heavy for GPT‑5.2 (a barely decrease stage in ChatGPT).

“GPT-5.4 is one of the best mannequin we’ve ever tried. It’s now high of the leaderboard on our APEX-Brokers benchmark, which measures mannequin efficiency for skilled companies work. It excels at creating long-horizon deliverables corresponding to slide decks, monetary fashions, and authorized evaluation, delivering high efficiency whereas operating sooner and at a decrease price than aggressive frontier fashions.”

— Brendan Foody, CEO at Mercor

We put a selected concentrate on enhancing GPT‑5.4’s means to create and edit spreadsheets, displays, and paperwork. On an inner benchmark of spreadsheet modeling duties {that a} junior funding banking analyst would possibly do, GPT‑5.4 achieves a imply rating of 87.3%, in comparison with 68.4% for GPT‑5.2. On a set of presentation analysis prompts, human raters most popular displays from GPT‑5.4 68.0% of the time over these from GPT‑5.2 on account of stronger aesthetics, higher visible selection, and simpler use of picture era.

Side by side example of spreadsheet outputs from GPT-5.2 vs GPT-5.4

Paperwork have been generated with reasoning effort set to xhigh

To make GPT‑5.4 higher at real-world work, we continued our progress at driving down hallucinations and errors. GPT‑5.4 is our most factual mannequin but: on a set of de-identified prompts the place customers flagged factual errors, GPT‑5.4’s particular person claims are 33% much less prone to be false and its full responses are 18% much less prone to comprise any errors, relative to GPT‑5.2.

“GPT-5.4 units a brand new bar for document-heavy authorized work. On our BigLaw Bench eval, it scored 91%. In comparison with different fashions, GPT-5.4 is at present higher at structuring complicated transactional evaluation, sustaining accuracy throughout prolonged contracts, and delivering the excessive stage of element authorized practitioners require.”

— Niko Grupen, Head of Utilized Analysis at Harvey

GPT‑5.4 is our first general-purpose mannequin with native computer-use capabilities and marks a significant step ahead for builders and brokers alike. It’s one of the best mannequin at present accessible for builders constructing brokers that full actual duties throughout web sites and software program methods.

We’ve designed GPT‑5.4 to be performant throughout a variety of computer-use workloads. It’s glorious at writing code to function computer systems through libraries like Playwright, in addition to issuing mouse and keyboard instructions in response to screenshots. Its habits is steerable through developer messages, that means that builders can modify habits to swimsuit specific use instances. Builders may even configure the mannequin’s security habits to swimsuit totally different ranges of threat tolerance by specifying customized affirmation insurance policies.

The mannequin’s efficiency and suppleness are mirrored throughout benchmarks that take a look at laptop use throughout totally different settings. On OSWorld-Verified, which measures a mannequin’s means to navigate a desktop surroundings via screenshots and keyboard/mouse actions, GPT‑5.4 achieves a state-of-the-art 75.0% success fee, far exceeding GPT‑5.2’s 47.3%, and surpassing human efficiency at 72.4%.¹

On WebArena-Verified, which assessments browser use, GPT‑5.4 achieves a number one 67.3% success fee when utilizing each DOM- and screenshot-driven interplay, in comparison with GPT‑5.2’s 65.4%. On On-line-Mind2Web, which additionally assessments browser use, GPT‑5.4 achieves a 92.8% success fee utilizing screenshot-based observations alone, enhancing over ChatGPT Atlas’s Agent Mode, which achieves successful fee of 70.9%.

A device yield is when an assistant yields to await device responses. If 3 instruments are known as in parallel, adopted by 3 extra instruments known as in parallel, the variety of yields can be 2. Instrument yields are a greater proxy of latency than device calls as a result of they mirror the advantages of parallelization.

GPT‑5.4 interprets screenshots of a browser interface and interacts with UI components via coordinate-based clicking to ship emails and schedule a calendar occasion. Video just isn’t sped up.

GPT‑5.4’s improved laptop use is constructed on the mannequin’s improved basic visible notion capabilities. On MMMU-Professional, a take a look at of a mannequin’s visible understanding and reasoning, GPT‑5.4 achieves an 81.2% success fee with out device use, an enchancment over GPT‑5.2’s 79.5%. Improved visible notion additionally interprets into higher doc parsing capabilities. On OmniDocBench, GPT‑5.4 with out reasoning effort achieves a mean error (measured by normalized edit distance between mannequin prediction and floor reality) of 0.109, improved from GPT‑5.2’s 0.140.

MMMUPro was run with reasoning effort set to xhigh. OmniDocBench was run with reasoning effort set to none, to mirror low-cost, low-latency efficiency.

We’re additionally enhancing visible understanding for dense, high-resolution photographs the place full constancy issues. Beginning with GPT‑5.4, we’re introducing an unique picture enter element⁠(opens in a brand new window) stage which helps full-fidelity notion as much as 10.24M complete pixels or 6000-pixel most dimension, whichever is decrease; the excessive picture enter element stage now helps as much as 2.56M complete pixels or a 2048-pixel most dimension. In early testing with API customers, we noticed sturdy positive aspects in localization means, picture understanding, and click on accuracy when utilizing unique or excessive element.

GPT‑5.4 combines the coding strengths of GPT‑5.3‑Codex with main information work and computer-use capabilities, which matter most on longer-running duties the place the mannequin can use instruments, iterate, and push work additional with much less handbook intervention. It matches or outperforms GPT‑5.3‑Codex on SWE-Bench Professional whereas being decrease latency throughout reasoning efforts.

We estimate latency by trying on the manufacturing habits of our fashions, and simulating this offline. The latency estimate accounts for device name period (code execution time), sampled tokens, and enter tokens. Actual-world latency might fluctuate considerably, and is dependent upon many elements not captured in our simulation. Reasoning efforts have been swept from none to xhigh.

When toggled on, /quick mode in Codex delivers as much as 1.5x sooner token velocity with GPT‑5.4. It’s the identical mannequin and the identical intelligence, simply sooner. Which means customers can transfer via coding duties, iteration, and debugging whereas staying in stream. Builders can entry GPT‑5.4 on the identical quick speeds through the API through the use of precedence processing⁠(opens in a brand new window).

In analysis and inner testing we discovered that GPT‑5.4 excels at complicated frontend duties, with noticeably extra aesthetic and extra practical outcomes than any fashions we’ve launched beforehand.

As an illustration of the mannequin’s improved computer-use and coding capabilities working in tandem, we’re additionally releasing an experimental Codex ability known as “Playwright (Interactive)⁠(opens in a brand new window)”. This enables Codex to visually debug internet and Electron apps; it could actually even be used to check an app it’s constructing, because it’s constructing it.

Theme park simulation recreation made with GPT‑5.4 from a single flippantly specified immediate, utilizing Playwright Interactive for browser playtesting and picture era for the isometric asset set. The simulation contains tile-based path placement, trip and surroundings development, visitor pathfinding, queueing, and trip cycles, whereas park metrics like cash, visitor depend, happiness, cleanliness, and ranking rise or fall based mostly on how the format performs and the way visitors reply to it. Playwright was used to automate browser playtests by constructing and increasing the park, putting and eradicating paths and points of interest, checking digicam navigation, and verifying that visitors, queues, trip states, and UI metrics up to date accurately over a number of rounds of play.

Immediate: Use $playwright-interactive and $imagegen. Create an interactive isometric theme park simulation recreation that I can construct and navigate within the browser. Use imagegen to ascertain the general visible imaginative and prescient and generate the sport’s belongings, together with rides, paths, terrain, timber, water, meals stalls, decorations, buildings, icons, and UI illustrations. The world ought to really feel cohesive, polished, and visually wealthy, with a premium artwork course that works properly from an isometric perspective. Let me place and take away paths, add points of interest, place surroundings, and transfer across the park easily whereas monitoring visitor exercise, trip standing, and park development. Embody plausible visitor motion, easy park administration methods like cash, cleanliness, queueing, and happiness, and make the expertise really feel playful, clear, and full moderately than like a tough prototype. Prioritize appeal, readability, and robust recreation really feel over realism.

When play testing, be sure you construct and develop a park via a number of rounds of play, confirm that placement and navigation work easily, verify that visitors react to the park format and points of interest, and make sure the visuals, UI, and interactions really feel steady and cohesive.

“GPT-5.4 is at present the chief on our inner benchmarks. Our engineers discover it to be extra pure and assertive than earlier fashions. It really works via ambiguous issues with out second-guessing itself, and it is proactive about parallelizing work to maintain issues transferring.”

— Lee Robinson, VP of Developer Training at Cursor

With GPT‑5.4, we’ve considerably improved how fashions work with exterior instruments. Brokers can now function throughout bigger device ecosystems, select the proper instruments extra reliably, and full multi-step workflows with decrease price and latency.

Beforehand, when a mannequin was given instruments, all device definitions have been included within the immediate upfront. For methods with many instruments, this might add 1000’s—and even tens of 1000’s—of tokens to each request, growing price, slowing responses, and crowding the context with data the mannequin would possibly by no means use.

With device search, GPT‑5.4 as an alternative receives a light-weight record of accessible instruments together with a device search functionality. When the mannequin wants to make use of a device, it could actually search for that device’s definition and append it to the dialog at that second.

This method dramatically reduces the variety of tokens required for tool-heavy workflows and preserves the cache, making requests sooner and cheaper. It additionally permits brokers to reliably work with a lot bigger device ecosystems. For MCP servers which will comprise tens of 1000’s of tokens of device definitions, the effectivity positive aspects might be substantial.

To reveal the effectivity positive aspects, we evaluated 250 duties from Scale’s MCP Atlas⁠(opens in a brand new window) benchmark with all 36 MCP servers enabled in two modes: (1) exposing each MCP operate straight within the mannequin context, and (2) putting all MCP servers behind device search. The tool-search configuration decreased complete token utilization by 47% whereas attaining the identical accuracy.

Instance token counts come from averaging 250 duties within the MCP-Atlas public dataset.

GPT‑5.4 additionally improves device calling, making it extra correct and environment friendly when deciding when and the best way to use instruments throughout reasoning, significantly within the API. In comparison with GPT‑5.2, it achieves greater accuracy in fewer activates Toolathlon, a benchmark that assessments how properly AI brokers can use real-world instruments and APIs to finish multi-step duties. For instance, an agent must learn emails, extract project attachments, add them, grade them and file ends in a spreadsheet.

For latency-sensitive use instances the place reasoning effort None is most popular, GPT‑5.4 additional improves upon its predecessors.

In τ2-bench⁠⁠(opens in a brand new window), a mannequin should use instruments to perform a customer support process, the place there could also be a simulated person who can talk and take actions on the world state. Reasoning effort was set to None.

GPT‑5.4 is healthier at agentic internet search. On BrowseComp, a measurement of how properly AI brokers can persistently browse the online to search out hard-to-locate data, GPT‑5.4 leaps 17%_abs over GPT‑5.2, and GPT‑5.4 Professional units a brand new cutting-edge of 89.3%.

In observe, this implies GPT‑5.4 Considering is stronger at answering questions that require pulling collectively data from many sources on the net. It may well extra persistently search throughout a number of rounds to establish essentially the most related sources, significantly for “needle-in-a-haystack” questions, and synthesize them into a transparent, well-reasoned reply.

In BrowseComp, we used a search blocklist excluding web sites containing benchmark solutions from analysis to stop contamination and guarantee a good measure of efficiency. GPT‑5.4 was measured on a later date than GPT‑5.2, so scores mirror adjustments within the mannequin, our search system, and state of the web. GPT‑5.4 was examined with an extended, up to date blocklist. Fashions use the ChatGPT search device, which might have small variations from API search.

“GPT-5.4 xhigh is the brand new cutting-edge for multi-step device use. Zapier runs a number of the most rigorous device use benchmarks within the {industry}, testing fashions throughout a whole lot of superior real-world workflows. GPT-5.4 completed the job the place earlier fashions gave up – essentially the most persistent mannequin to this point.”

— Wade, CEO at Zapier

Equally to how Codex outlines its method when it begins working, GPT‑5.4 Considering in ChatGPT will now define its work with a preamble for longer, extra complicated queries. You may as well add directions or modify its course mid-response. This makes it simpler to information the mannequin towards the precise end result you need with out beginning over or requiring a number of extra turns. This function is offered now on chatgpt.com⁠(opens in a brand new window) and the Android app, coming quickly to the iOS app.

The mannequin can even suppose longer on tough duties whereas sustaining stronger consciousness of earlier steps within the dialog. This enables it to deal with longer workflows and extra complicated prompts whereas maintaining solutions coherent and related all through.

This video was sped up for illustrative functions.

Over latest months, we’ve continued enhancing the safeguards we launched with GPT‑5.3‑Codex whereas making ready GPT‑5.4 for deployment. Just like GPT‑5.3‑Codex, we’re treating GPT‑5.4 as Excessive cyber functionality below our Preparedness Framework, and we’re deploying it with the corresponding protections as documented within the system card⁠(opens in a brand new window). These embody an expanded cyber security stack, together with monitoring methods, trusted entry controls, and asynchronous blocking for higher-risk requests for purchasers on Zero Information Retention (ZDR) surfaces, alongside ongoing funding within the broader safety ecosystem.

As a result of cybersecurity capabilities are inherently dual-use, we keep a precautionary method to deployment whereas persevering with to calibrate our insurance policies and classifiers. For sure prospects on ZDR surfaces, request-level blocking stays a part of our cyber threat mitigation stack; as a result of classifiers are nonetheless enhancing, some false positives might happen as we proceed refining these safeguards. These updates intention to enhance how safeguards function in observe, together with decreasing pointless refusals and overly caveated responses, whereas preserving sturdy protections in opposition to misuse.

We’ve continued our security analysis on Chain-of-Thought (CoT) monitorability to higher perceive how fashions purpose and assist detect potential misbehavior. As a part of this work, we introduce a brand new open-source analysis, CoT controllability⁠, measuring whether or not fashions can intentionally obfuscate their reasoning to evade monitoring. We discover that GPT‑5.4 Considering’s means to regulate its CoT is low, which is a optimistic property for security, suggesting that the mannequin lacks the flexibility to cover its reasoning and that CoT monitoring stays an efficient security device.

GPT‑5.4 is rolling out steadily immediately throughout ChatGPT and Codex. Within the API, GPT‑5.4 is offered now as gpt-5.4. GPT‑5.4 Professional can be accessible within the API as gpt-5.4-pro for builders who want most efficiency on essentially the most complicated duties.

In ChatGPT, GPT‑5.4 Considering is offered beginning immediately to ChatGPT Plus, Crew, and Professional customers, changing GPT‑5.2 Considering. GPT‑5.2 Considering will stay accessible for 3 months for paid customers within the mannequin picker below the Legacy Fashions part, after which it will likely be retired on June 5, 2026. These on Enterprise and Edu plans can allow early entry through admin settings. GPT‑5.4 Professional is offered to Professional and Enterprise plans. Context home windows⁠(opens in a brand new window) in ChatGPT for GPT‑5.4 Considering stay unchanged from GPT‑5.2 Considering.

GPT‑5.4 is our first mainline reasoning mannequin that comes with the frontier coding capabilities of GPT‑5.3‑codex and that’s rolling out throughout ChatGPT, the API and Codex. We’re calling it GPT‑5.4 to mirror that leap, and to simplify the selection between fashions when utilizing Codex. Over time, you possibly can count on our On the spot fashions and Considering fashions to evolve at totally different speeds.

GPT‑5.4 in Codex contains experimental help for the 1M context window. Builders can do that by configuring model_context_window and model_auto_compact_token_limit. Requests that exceed the usual 272K context window depend in opposition to utilization limits at 2x the conventional fee.

Within the API, GPT‑5.4 is priced greater per token than GPT‑5.2 to mirror its improved capabilities, whereas its higher token effectivity helps cut back the entire variety of tokens required for a lot of duties. Batch and Flex pricing can be found at half the usual API fee, whereas Precedence processing is offered at twice the usual API fee.

Evals have been run with reasoning effort set to xhigh, besides the place specified in any other case. Benchmarks have been performed in a analysis surroundings, which can present barely totally different output from manufacturing ChatGPT in some instances.

Source link

Article Tags:

Article Categories:

Water Purifiers & Accessories

Introducing GPT-5.4 | OpenAI

Leave a Reply Cancel reply

Neo Co-Founder Proposes $461M Overhaul to Finish ‘Belief Me’ Governance

Bitcoin Liquidates $660M In Shorts As BTC Worth Rallied Previous $78K