Jensen Huang's full GTC speech: the era of inference has arrived, at least one trillion dollars in demand by 2027, and OpenClaw is the new operating system

Mar 17, 2026 09:38:43

On March 16, 2026, the NVIDIA GTC 2026 conference officially opened, and NVIDIA founder and CEO Jensen Huang delivered a keynote speech.

At this conference, regarded as the "annual pilgrimage of the AI industry," Huang elaborated on NVIDIA's transformation from a "chip company" to an "AI infrastructure and factory company." In response to market concerns about performance sustainability and growth potential, Huang detailed the underlying business logic driving future growth—"Token Factory Economics."

Extremely optimistic performance guidance: "at least $1 trillion in demand by 2027"

In the past two years, global AI computing demand has exploded. As large models evolved from "perception" and "generation" to "reasoning" and "action" (executing tasks), computing power consumption has surged sharply. Addressing intense market concern about order volumes and revenue ceilings, Huang offered a very strong outlook.

Huang stated in his speech:

Last year at this time, I mentioned that we saw a high-confidence demand of $500 billion, covering Blackwell and Rubin until 2026. Now, right here and now, I see at least $1 trillion in demand by 2027.

Huang's trillion-dollar projection briefly pushed NVIDIA's stock price up more than 4.3%.

He went on to elaborate on this figure:

Is this reasonable? That's what I'm going to talk about next. In fact, we may even face a supply shortage. I'm sure the actual computing demand will be much higher.

Huang pointed out that today's NVIDIA systems have proven to be the world's "lowest-cost infrastructure." Because NVIDIA can run AI models across almost all fields, this versatility allows the $1 trillion invested by customers to be fully utilized and maintain a long lifecycle.

Currently, 60% of NVIDIA's business comes from the top five hyperscale cloud service providers, while the remaining 40% is widely distributed across sovereign clouds, enterprises, industries, robotics, and edge computing.

Token Factory Economics: performance per watt determines the lifeblood of the business

To explain the reasonableness of this $1 trillion demand, Huang presented a new business mindset to global CEOs. He pointed out that future data centers will no longer be warehouses for storing files but "factories" for producing Tokens (the basic units generated by AI).

Huang emphasized:

Every data center, every factory, is by definition power-limited. A 1GW (gigawatt) factory will never become a 2GW one; that is the law of physics and atoms. At fixed power, whoever has the highest token throughput per watt has the lowest production costs.
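
To make the power-limited arithmetic concrete, here is a minimal back-of-envelope sketch in Python. The electricity price and tokens-per-watt figures are illustrative assumptions, not NVIDIA numbers.

```python
# Illustrative sketch of "token factory" economics: at a fixed power
# budget, cost per token is set by tokens per watt. All numbers are
# assumed for illustration only.

POWER_WATTS = 1e9            # a 1 GW facility (fixed by physics and permits)
COST_PER_KWH = 0.08          # assumed all-in electricity cost, USD

def cost_per_million_tokens(tokens_per_second_per_watt: float) -> float:
    """Electricity cost to produce one million tokens at fixed power."""
    tokens_per_second = tokens_per_second_per_watt * POWER_WATTS
    kwh_per_second = POWER_WATTS / 1000 / 3600
    usd_per_second = kwh_per_second * COST_PER_KWH
    return usd_per_second / tokens_per_second * 1e6

# Doubling tokens/watt halves the electricity cost per token:
for tpw in (0.02, 0.04, 0.08):
    print(f"{tpw} tok/s/W -> ${cost_per_million_tokens(tpw):.4f} per 1M tokens")
```

At a fixed power budget, doubling tokens per watt halves the cost of every token shipped, which is exactly the competition Huang describes.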

Huang categorized future AI services into five pricing tiers:

  • Free Tier (high throughput, low speed)
  • Mid Tier (~$3 per million tokens)
  • High Tier (~$6 per million tokens)
  • High-Speed Tier (~$45 per million tokens)
  • Ultra-High-Speed Tier (~$150 per million tokens)

He noted that as models grow larger and contexts become longer, AI will become smarter, but the token generation rate will decrease. Huang stated:

In this Token Factory, your throughput and token generation speed will directly translate into your precise revenue for next year.

Huang emphasized that NVIDIA's architecture allows customers to achieve extremely high throughput in the free tier while achieving an astonishing 35 times performance improvement at the highest value inference tier.

Vera Rubin achieves 350 times acceleration in two years, Groq fills the gap for ultra-fast inference

Under the constraints of physical limits, NVIDIA introduced its most complex AI computing system ever, Vera Rubin. Huang stated:

In the past, when mentioning Hopper, I would hold up a chip, which was cute. But when mentioning Vera Rubin, everyone thinks of the entire system. In this 100% liquid-cooled system, which completely eliminates traditional cabling, a rack that used to take two days to install now takes only two hours.

Huang pointed out that through extreme end-to-end hardware-software co-design, Vera Rubin delivers an astonishing leap within the same 1GW data center:

In just two years, we increased the token generation rate from 22 million to 700 million, achieving a 350-fold growth. Moore's Law during the same period could only bring about a 1.5-fold improvement.

To address the bandwidth bottleneck of ultra-fast inference (e.g., 1,000 tokens/second), NVIDIA presented a solution built around the newly acquired Groq: asymmetric disaggregated inference. Huang explained:

These two processors have completely different characteristics. The Groq chip has 500MB of SRAM, while a Rubin chip has 288GB of memory.

Huang noted that through the Dynamo software system, NVIDIA assigns the "pre-fill" phase, which demands massive computation and GPU memory, to Vera Rubin, while the latency-critical "decoding" phase goes to Groq. Huang also offered advice on enterprise compute configuration:

If your workload is primarily high throughput, use 100% Vera Rubin; if you have a large number of high-value programming-level token generation needs, allocate 25% of your data center capacity to Groq.

It was revealed that the Groq LP30 chip, manufactured by Samsung, has entered mass production and is expected to ship in the third quarter, while the first Vera Rubin rack is already running on Microsoft Azure.

In addition, regarding optical interconnect technology, Huang showcased the world's first mass-produced co-packaged optical (CPO) switch, Spectrum X, and quelled market concerns about the "copper-to-optical transition" route:

We need more copper cable capacity, more optical chip capacity, and more CPO capacity.

Agents end traditional SaaS; "salary + tokens" becomes the Silicon Valley standard

In addition to hardware barriers, Huang devoted a significant portion of his speech to the revolution in AI software and ecosystems, particularly the explosion of Agents.

He described the open-source project OpenClaw as "the most popular open-source project in human history," claiming it surpassed the achievements of Linux over the past 30 years in just a few weeks. Huang stated that OpenClaw is essentially the "operating system" for agent computers.

Huang asserted:

Every SaaS (Software-as-a-Service) company will become an AaaS (Agent-as-a-Service) company.

To ensure the safe deployment of these agents, which can access sensitive data and execute code, NVIDIA has launched an enterprise-grade NeMo Claw reference design that adds a policy engine and privacy router.

For ordinary workers, this transformation is also just around the corner. Huang envisioned a new workplace form in the future:

In the future, every engineer in our company will need an annual token budget. Their base salary may be hundreds of thousands of dollars, and I will allocate about half that amount as a token quota to enable a 10x efficiency improvement. This has already become a new bargaining chip in Silicon Valley hiring: how many tokens are included in your offer?

At the end of the speech, Huang also teased the next-generation computing architecture, Feynman, whose racks will for the first time support both copper-cable and CPO scale-out. More intriguingly, NVIDIA is developing Vera Rubin Space-1, a data center computer to be deployed in space, opening the imagination to AI computing power extending beyond Earth.

The full text of Jensen Huang's GTC 2026 speech is as follows (with the assistance of AI tools):

Host: Welcome NVIDIA founder and CEO Jensen Huang to the stage.

Jensen Huang, Founder and CEO:

Welcome to GTC. I want to remind everyone that this is a technology conference. I am very pleased to see so many people lining up to enter early in the morning and to see all of you here.

At GTC, we will focus on three main themes: technology, platform, and ecosystem. NVIDIA currently has three major platforms: the CUDA-X platform, the system platform, and our newly launched AI factory platform.

Before we officially begin, I want to thank our warm-up session hosts—Sarah Guo from Conviction, Alfred Lin from Sequoia Capital (NVIDIA's first venture capitalist), and Gavin Baker, NVIDIA's first major institutional investor. These three have profound insights into technology and have a wide influence in the entire technology ecosystem. Of course, I also want to thank all the distinguished guests I personally invited to attend today. Thank you to this all-star team.

I also want to thank all the companies present today. NVIDIA is a platform company; we have technology, platforms, and a rich ecosystem. The companies present today represent almost all participants in the $100 trillion industry, with 450 companies sponsoring this event, for which I am deeply grateful.

This conference features 1,000 technical forums and 2,000 speakers, covering every level of the AI "five-layer cake" architecture—from infrastructure such as land, power, and data centers to chips, platforms, models, and various applications that drive the entire industry forward.

CUDA: Twenty Years of Technological Accumulation

Everything starts here. This year marks the 20th anniversary of CUDA.

For twenty years, we have been committed to the development of this architecture. CUDA is a revolutionary invention—SIMT (Single Instruction Multiple Threads) technology allows developers to write programs in scalar code and extend them into multi-threaded applications, with programming difficulty far lower than that of previous SIMD architectures. We recently added the Tiles feature to help developers program Tensor Cores more conveniently, as well as various mathematical operation structures relied upon by today's AI. Currently, CUDA has thousands of tools, compilers, frameworks, and libraries, with hundreds of thousands of public projects in the open-source community, and has been deeply integrated into every technology ecosystem.
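
As a rough illustration of the scalar-per-thread SIMT style described here, the sketch below uses Numba's CUDA bindings, one possible Python route onto CUDA (it requires an NVIDIA GPU and the `numba` package); it is not the Tiles feature itself.

```python
# A minimal sketch of SIMT: you write scalar code for one element,
# and the hardware runs it across thousands of threads.
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)               # this thread's global index
    if i < x.size:                 # guard against out-of-range threads
        out[i] = a * x[i] + y[i]   # scalar logic, executed per thread

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads = 256
blocks = (n + threads - 1) // threads
saxpy[blocks, threads](2.0, x, y, out)   # Numba copies arrays to/from the GPU
assert np.allclose(out, 2.0 * x + y)
```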

This chart reveals the whole of NVIDIA's strategic logic; I have been showing this slide from the very beginning. The most difficult and core element is the "installed base" at the bottom of the chart. Over the past twenty years, we have accumulated hundreds of millions of GPUs and computing systems running CUDA worldwide.

Our GPUs cover all cloud platforms, serving almost all computer manufacturers and industries. The large installed base of CUDA is the fundamental reason this flywheel continues to accelerate. The installed base attracts developers, developers create new algorithms and breakthroughs, breakthroughs spawn new markets, new markets form new ecosystems and attract more companies to join, thereby expanding the installed base—this flywheel is continuously accelerating.

Downloads of NVIDIA's libraries are growing at an astonishing rate: large in scale and still accelerating. This flywheel enables our computing platform to support massive applications and continuous new breakthroughs.

More importantly, it also gives these infrastructures a very long lifespan. The reason is obvious: there is a wealth of applications that can run on NVIDIA CUDA, covering every stage of the AI lifecycle, various data processing platforms, and various scientific solvers. Therefore, once NVIDIA GPUs are installed, their actual usage value is extremely high. This is also why the cloud price of the Ampere architecture GPU we released six years ago has actually increased.

The fundamental reason for all this is: a large installed base, a strong flywheel, and a wide developer ecosystem. When these factors work together, combined with our continuous software updates, computing costs will continue to decline. Accelerated computing significantly improves application performance, and as we maintain and iterate software over the long term, users can not only achieve performance leaps in the early stages but also continue to enjoy declining computing costs. We are willing to provide long-term support for every GPU globally because they are completely compatible at the architectural level.

The reason we are willing to do this is that the installed base is so large—every time we release a new optimization, it benefits millions of users. This dynamic combination allows NVIDIA's architecture to continuously expand its coverage and accelerate its growth while continuously lowering computing costs, ultimately stimulating new growth. CUDA is at the core of all this.

From GeForce to CUDA: A 25-Year Evolution

Our journey with CUDA actually began 25 years ago.

GeForce—many of you have grown up with GeForce. GeForce is NVIDIA's most successful marketing project. We started cultivating future customers when you couldn't afford the products—your parents became NVIDIA's earliest users, buying our products year after year until one day you grew up to be excellent computer scientists, becoming true customers and developers.

This is the foundation laid by GeForce 25 years ago. Twenty-five years ago, we invented the programmable shader, an obvious yet profoundly significant invention that made accelerators programmable; the pixel shader was the world's first programmable accelerator. Five years later, we created CUDA, one of our most important investments ever. At the time, the company had limited financial resources, but we bet most of our profits on it, committed to extending CUDA from GeForce to every computer. We were so determined because we believed in its potential. Despite initial hardships, the company held this belief for 13 generations, a full twenty years, and today CUDA is everywhere.

It was the pixel shaders that drove the revolution of GeForce. About eight years ago, we launched RTX—a comprehensive overhaul of the architecture for the modern computer graphics era. GeForce brought CUDA to the world, and because of this, many scholars such as Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, and Andrew Ng discovered that GPUs could become powerful tools for accelerating deep learning, igniting the explosion of artificial intelligence a decade ago.

Ten years ago, we decided to merge programmable shading with two brand new concepts: one is hardware ray tracing, which is technically very challenging; the other is a forward-looking idea—we foresaw about ten years ago that AI would completely transform computer graphics. Just as GeForce brought AI to the world, AI is now reshaping the way computer graphics are implemented.

Today, I want to show you the future. This is our next-generation graphics technology, which we call neural rendering—deep integration of 3D graphics and artificial intelligence. This is DLSS 5, please take a look.

Neural Rendering: The Fusion of Structured Data and Generative AI

Isn't this breathtaking? Computer graphics have come to life.

What did we do? We combined controllable 3D graphics (the real foundation of the virtual world) with its structured data, then infused it with generative AI and probabilistic computing. One is completely deterministic, while the other is probabilistic yet highly realistic—we merged these two concepts into one, achieving precise control through structured data while generating in real-time. Ultimately, the content is both beautiful and stunning, yet completely controllable.

The concept of merging structured information with generative AI will continue to replicate across various industries. Structured data is the cornerstone of trustworthy AI.

Accelerated Platform for Structured and Unstructured Data

Now I want to show you a technical architecture diagram.

Structured data: the familiar SQL, Spark, Pandas, and Velox, plus important platforms like Snowflake, Databricks, Amazon EMR, Azure Fabric, and Google BigQuery, all handle data frames. These data frames are like giant spreadsheets, carrying all the information of the business world; they are the ground truth of enterprise computing.

In the AI era, we need to enable AI to use structured data and achieve extreme acceleration. In the past, accelerating structured data processing was to make enterprises operate more efficiently. In the future, AI will use these data structures at speeds far exceeding human capabilities, and AI agents will also heavily call upon structured databases.

In terms of unstructured data, vector databases, PDFs, videos, audio, etc., constitute the vast majority of data forms in the world—about 90% of the data generated each year is unstructured. In the past, this data was almost entirely unusable: we read it, stored it in file systems, and that was it. We couldn't query it, nor could we retrieve it, because unstructured data lacks simple indexing methods and must be understood in terms of meaning and context. Now, AI can do this—thanks to multimodal perception and understanding technology, AI can read PDF documents, understand their meanings, and embed them into larger structures for querying.
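
A toy sketch of what "embedding unstructured data into a queryable structure" means in practice. The hash-based embedding below is a stand-in for a real multimodal model, and at production scale a vector-search library such as the cuVS introduced just below would accelerate the search.

```python
# Toy semantic retrieval: embed each document as a vector, then
# retrieve by meaning via cosine similarity over unit vectors.
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: hashed bag-of-words (illustration only)."""
    v = np.zeros(DIM)
    for token in text.lower().split():
        v[hash(token) % DIM] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "quarterly revenue grew on strong data center demand",
    "the new GPU rack is fully liquid cooled",
    "employee onboarding checklist and HR policies",
]
index = np.stack([embed(d) for d in docs])   # the "vector store"

query = embed("liquid cooled GPU rack")
scores = index @ query                        # cosine similarity
print(docs[int(np.argmax(scores))])           # best semantic match
```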

NVIDIA has created two foundational libraries for this purpose:

  • cuDF: for accelerated processing of data frames and structured data

  • cuVS: for vector storage, semantic data, and processing of unstructured AI data

These two libraries will be among the most important foundational platforms of the future.
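
For readers who know pandas, here is a minimal sketch of what adopting cuDF looks like; it assumes a machine with an NVIDIA GPU and the RAPIDS `cudf` package installed, and the data is made up.

```python
# cuDF mirrors the pandas dataframe API while executing on the GPU.
import cudf

df = cudf.DataFrame({
    "region": ["emea", "apac", "emea", "amer"],
    "revenue": [120.0, 95.5, 80.2, 210.0],
})

# The same dataframe idioms as pandas, run on the GPU:
by_region = df.groupby("region")["revenue"].sum().sort_values(ascending=False)
print(by_region)
```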

Today, we announce partnerships with multiple companies. IBM, the inventor of SQL, will use cuDF to accelerate its WatsonX Data platform. Dell has worked with us to create the Dell AI data platform, integrating cuDF and cuVS and achieving significant performance improvements in real projects with NTT Data. On Google Cloud, we are accelerating not only Vertex AI but also BigQuery, and we have partnered with Snapchat to cut its computing costs by nearly 80%.

The benefits of accelerated computing are threefold: speed, scale, and cost. This is consistent with the logic of Moore's Law—achieving performance leaps through accelerated computing while continuously optimizing algorithms, allowing everyone to enjoy continuously declining computing costs.

NVIDIA has built an accelerated computing platform that brings together numerous libraries: RTX, cuDF, cuVS, and more. These libraries are integrated into global cloud services and OEM systems, reaching users worldwide.

Deep Collaboration with Cloud Service Providers


Google Cloud: We accelerate Vertex AI and BigQuery, deeply integrating with JAX/XLA, while performing excellently on PyTorch—NVIDIA is the only company in the world that performs well on both PyTorch and JAX/XLA accelerators. We have brought customers like Base10, CrowdStrike, Puma, and Salesforce into the Google Cloud ecosystem.

AWS: We accelerate EMR, SageMaker, and Bedrock, with deep integration with AWS. This year, I am particularly excited that we will bring OpenAI into AWS, which will significantly boost AWS cloud consumption growth and help OpenAI expand regional deployments and computing scale.

Microsoft Azure: NVIDIA's 100 PFLOPS supercomputer was the first supercomputer we built and the first deployed on Azure, laying an important foundation for the cooperation with OpenAI. We accelerate Azure cloud services and AI Foundry, collaborate to promote Azure's regional expansion, and cooperate deeply on Bing search. Notably, NVIDIA GPUs were among the first in the world to support confidential computing, which ensures that even operators cannot view user data and models, enabling confidential deployments of OpenAI and Anthropic models in cloud environments across the globe. For example, we accelerate all EDA and CAD workflows for Synopsys and deploy them on Microsoft Azure.

Oracle: We are Oracle's first AI customer, and I am proud to have been able to explain the concept of AI cloud to Oracle for the first time. Since then, they have developed rapidly, and we have introduced many partners such as Cohere, Fireworks, and OpenAI to them.

CoreWeave: The world's first AI-native cloud, born for GPU hosting and AI cloud services, has an excellent customer base and strong growth momentum.

Palantir + Dell: The three parties jointly created a new AI platform based on Palantir's ontology platform and AI platform, capable of fully localized deployment of AI in any country and any air-gapped environment—from data processing (vectorized or structured) to a complete accelerated computing stack for AI.

NVIDIA has established this special collaborative relationship with global cloud service providers—we bring customers to the cloud, creating a mutually beneficial ecosystem.

Vertical Integration, Horizontal Openness: NVIDIA's Core Strategy

NVIDIA is the world's first vertically integrated and horizontally open company.

The necessity of this model is very simple: accelerated computing is not just a chip issue or a system issue; its complete expression should be application acceleration. CPUs can make computers run faster overall, but this path has reached a bottleneck. In the future, only through application or domain-specific acceleration can we continue to achieve performance leaps and cost reductions.

This is precisely why NVIDIA must delve into one library after another, one field after another, and one vertical industry after another. We are a vertically integrated computing company, and there is no other path to take. We must understand applications, understand domains, deeply understand algorithms, and be able to deploy them in any scenario—data centers, cloud, on-premises, edge, and even robotic systems.

At the same time, NVIDIA remains horizontally open, willing to integrate technology into any partner's platform, allowing the whole world to enjoy the dividends of accelerated computing.

The structure of attendees at this GTC fully reflects this. The proportion of attendees from the financial services industry is the highest—we hope developers come, not traders. Our ecosystem covers the entire upstream and downstream supply chain. Whether a company has been established for 50, 70, or 150 years, last year marked its best year in history. We are at the starting point of something very, very significant.

CUDA-X: The Accelerated Computing Engine for Various Industries

NVIDIA has deeply laid out in various vertical fields:

  • Autonomous Driving: Wide coverage and far-reaching impact

  • Financial Services: Quantitative investing is shifting from manual feature engineering to supercomputer-driven deep learning, ushering in its "Transformer moment"

  • Healthcare: It is welcoming its own "ChatGPT moment," covering AI-assisted drug discovery, AI agent-supported diagnosis, medical customer service, and more

  • Industry: The largest construction wave in the world is unfolding, with AI factories, chip factories, and data center factories being established

  • Entertainment and Gaming: Real-time AI platforms support translation, live streaming, gaming interaction, and intelligent shopping agents

  • Robotics: With over a decade of deep cultivation, three major computing architectures (training computers, simulation computers, onboard computers) are in place, with 110 robots showcased at this exhibition

  • Telecommunications: A $2 trillion industry, base stations will evolve from single communication functions to AI infrastructure platforms, with related platforms named Aerial, and deep collaborations with companies like Nokia and T-Mobile.

The core of all these fields is our CUDA-X library—this is the fundamental essence of NVIDIA as an algorithm company. These libraries are the company's most core assets, enabling the computing platform to deliver actual value across various industries.

One of the most important libraries is cuDNN (CUDA Deep Neural Network Library), which has completely revolutionized artificial intelligence and triggered the explosion of modern AI.

(Play CUDA-X demonstration video)

Everything you just saw is simulation, including physics-based solvers, AI agent physical models, and physical AI robot models. Everything is simulated, with no hand-keyed animation or manual rigging. This is where NVIDIA's core capability lies: unlocking these opportunities through a profound understanding of algorithms, organically integrated with the computing platform.

AI-native Enterprises and the New Computing Era

You just saw industry giants defining today's society, such as Walmart, L'Oréal, JPMorgan Chase, Roche, and Toyota, as well as a large number of companies you may have never heard of—we call them AI-native enterprises. This list is vast, including OpenAI, Anthropic, and many emerging companies serving different vertical fields.

In the past two years, this industry has experienced an astonishing leap. Venture capital flowing into startups reached $150 billion, the highest in human history. More importantly, single investments have jumped for the first time from millions of dollars to hundreds of millions and even billions. The reason is simple: for the first time in history, every such company requires massive computing resources and huge volumes of tokens. These companies either generate tokens themselves or add value to tokens from institutions like Anthropic and OpenAI.

Just as the PC revolution, internet revolution, and mobile cloud revolution each birthed a batch of epoch-making companies, this generation of computing platform transformation will also give rise to a number of highly influential companies, becoming an important force in the future world.

Three Historic Breakthroughs Driving All This

What has happened in the past two years? Three major events.

First: ChatGPT, ushering in the era of generative AI (late 2022 to 2023)

It can not only perceive and understand but also generate unique content. I demonstrated the fusion of generative AI with computer graphics. Generative AI fundamentally changes the way computing works—computing has shifted from retrieval-based to generation-based, profoundly affecting computer architecture, deployment methods, and overall significance.

Second: Reasoning AI, represented by o1

Reasoning capabilities enable AI to self-reflect, plan, and decompose problems—breaking down questions it cannot directly understand into manageable steps. o1 makes generative AI trustworthy, capable of reasoning based on real information. To achieve this, the amount of input context tokens and output tokens used for thinking has significantly increased, leading to a substantial rise in computing demands.

Third: Claude Code, the first agent model

It can read files, write code, compile, test, evaluate, and iterate. Claude Code has completely revolutionized software engineering—100% of NVIDIA's engineers are using one or more of Claude Code, Codex, and Cursor; there is not a single software engineer who does not leverage AI assistance.

This is a new turning point: you no longer ask AI "what is it, where is it, how do I do it"; you let it create, execute, and build, actively using tools, reading files, decomposing problems, and taking action. AI has evolved from perception to generation, to reasoning, and now to truly getting work done.

In the past two years, the computing demand for reasoning has increased by about 10,000 times, and usage has grown by about 100 times. I have always believed that total computing demand has grown by a million times over the past two years (10,000-fold per-task computation multiplied by 100-fold usage). This is a shared feeling among everyone, including OpenAI and Anthropic. If we can obtain more computing power, we can generate more tokens, revenue will increase, and AI will become smarter. The reasoning turning point has indeed arrived.

The Era of Trillion-Dollar AI Infrastructure

A year ago, I stated here that we had high confidence in the demand and purchase orders for Blackwell and Rubin before 2026, amounting to about $500 billion. Today, one year after GTC, I stand here to tell you: looking ahead to 2027, I see a number of at least $1 trillion. And I am confident that the actual computing demand will far exceed this.

2025: The Year of Inference for NVIDIA

2025 was NVIDIA's Year of Inference. We set out to ensure excellence not only in training and post-training but in every stage of the AI lifecycle, so that the invested infrastructure keeps operating efficiently; the longer its effective lifespan, the lower the unit cost.

At the same time, Anthropic and Meta officially joined the NVIDIA platform, together representing one-third of global AI computing demand. Open-source models are approaching the cutting edge and are ubiquitous.

NVIDIA is currently the only platform in the world capable of running all AI fields—language, biology, computer graphics, computer vision, speech, protein and chemistry, robotics, etc.—all AI models, whether at the edge or in the cloud, regardless of language. The NVIDIA architecture is universal for all these scenarios, making us the lowest-cost and most reliable platform.

Currently, 60% of NVIDIA's business comes from the top five hyperscale cloud service providers, while the remaining 40% is distributed across regional clouds, sovereign clouds, enterprises, industries, robotics, and edge computing. The breadth of AI coverage itself is its resilience—this is undoubtedly a new computing platform transformation.

Grace Blackwell and NVLink 72: Bold Architectural Innovation

While the Hopper architecture was still at its peak, we decided to completely re-architect the system, expanding NVLink from 8-way to NVLink 72 and fully decomposing and reconstructing the computing system. Grace Blackwell NVLink 72 was a major technological bet that was not easy on our partners, and I sincerely thank everyone for it.

At the same time, we launched NVFP4—not just an ordinary FP4, but a brand new type of tensor core and computing unit. We have demonstrated that NVFP4 can achieve inference without any loss of precision while delivering significant performance and energy efficiency improvements, and it is also suitable for training. In addition, a series of new algorithms such as Dynamo and TensorRT-LLM have emerged, and we even invested billions of dollars to build a supercomputer specifically for optimizing kernels, called DGX Cloud.

The results show that our inference performance is remarkable. Data from Semi Analysis, the most comprehensive AI inference performance evaluation to date, shows that NVIDIA leads by a wide margin in both tokens per watt and cost per token. Moore's Law alone might have delivered a 1.5-fold improvement over H200; we achieved 35 times. Dylan Patel of Semi Analysis even said, "Jensen sandbagged; it's actually 50 times." He is right.

I quote him: "Jensen sandbagged."

NVIDIA's cost per token is the lowest in the world, currently unmatched. The reason lies in extreme co-design.

Take Fireworks as an example; before NVIDIA updated the entire suite of software and algorithms, its average token speed was about 700 tokens per second; after the update, it approached 5,000 tokens per second, an increase of about 7 times. This is the power of extreme co-design.

AI Factory: From Data Centers to Token Factories

Data centers were once places for storing files; now they are factories for producing tokens. Every cloud service provider and every AI company will use "token factory efficiency" as a core operational metric in the future.

This is my core argument:

  • Vertical Axis: Throughput—number of tokens generated per second at fixed power

  • Horizontal Axis: Interaction Speed—response speed for each inference; the faster the speed, the larger the models that can be used, the longer the context, and the smarter the AI

Tokens are the new commodity, and once mature, will be priced in tiers:

  • Free Tier (high throughput, low speed)

  • Mid Tier (~$3 per million tokens)

  • High Tier (~$6 per million tokens)

  • High-Speed Tier (~$45 per million tokens)

  • Ultra-High-Speed Tier (~$150 per million tokens)

Compared to Hopper, Grace Blackwell improves throughput by 35 times at the highest-value tier and opens up new tiers. In a simplified model that allocates 25% of power to each of four tiers, Grace Blackwell generates 5 times the revenue of Hopper.
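
The tier arithmetic can be reconstructed in a few lines. The prices below are the ones quoted in the speech; the per-tier throughputs are assumptions chosen only to show how serving higher-value tiers compounds revenue to roughly the 5x Huang cites.

```python
# Sketch of the simplified token-factory revenue model: 25% of a fixed
# power budget per tier, revenue = throughput x price summed over tiers.
# Throughput numbers are illustrative assumptions, not NVIDIA data.

PRICES = [3, 6, 45, 150]   # USD per million tokens, per tier

# Assumed throughput (millions of tokens/sec) each architecture sustains
# per tier at equal power. Hopper cannot serve the ultra tier at all.
HOPPER    = [50, 30, 1.5, 0]
BLACKWELL = [80, 60, 10, 6]

def revenue_per_second(throughputs):
    return sum(t * p for t, p in zip(throughputs, PRICES))

h, b = revenue_per_second(HOPPER), revenue_per_second(BLACKWELL)
print(f"Hopper: ${h:,.0f}/s  Blackwell: ${b:,.0f}/s  ratio: {b / h:.1f}x")
```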

Vera Rubin: The Next Generation AI Computing System

(Play Vera Rubin system introduction video)

Vera Rubin is a complete, end-to-end optimized system designed for agentic workloads:

  • Large language model computing core: NVLink 72 GPU cluster, handling pre-fill and KV Cache

  • New Vera CPU: designed for extremely high single-thread performance, using LPDDR5 memory, with excellent energy efficiency, the world's only data center CPU using LPDDR5, suitable for AI agent tool calls

  • Storage system: BlueField 4 + CX 9, a new storage platform for the AI era, with 100% participation from the global storage industry

  • CPO Spectrum X switch: the world's first co-packaged optical Ethernet switch, now in full mass production

  • Kyber rack: a new rack system supporting 144 GPUs to form a single NVLink domain, with front-end computing and back-end NVLink switching, forming a giant computer

  • Rubin Ultra: next-generation supercomputer node, vertical design, paired with Kyber rack, supporting larger-scale NVLink interconnect

Vera Rubin is 100% liquid-cooled, cutting installation time from two days to two hours, and uses 45°C warm-water cooling, significantly easing cooling pressure in data centers. Satya Nadella has confirmed that the first Vera Rubin rack is now running on Microsoft Azure, which I am very excited about.

Groq Integration: Extreme Extension of Inference Performance

We acquired the Groq team and obtained its technology license. Groq is a deterministic data flow processor, using static compilation and compiler scheduling, with a large amount of SRAM, optimized for single inference workloads, featuring extremely low latency and high token generation speed.

However, Groq's memory capacity is limited (500MB on-chip SRAM), making it difficult to independently carry the parameters and KV Cache of large models, limiting its large-scale application.

The solution is Dynamo—a set of inference scheduling software. We disaggregated the inference pipeline through Dynamo:

  • Pre-fill and attention mechanism decoding are completed on Vera Rubin (requiring massive computing power and KV Cache storage)

  • Feed-forward network decoding, i.e., token generation, is completed on Groq (requiring extremely high bandwidth and low latency)

The two are tightly coupled via Ethernet, with special modes cutting latency roughly in half. Under the unified scheduling of Dynamo, the "AI factory operating system," overall performance improves 35-fold, opening up inference performance tiers previously unreachable with NVLink 72 alone.
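
A toy sketch of the disaggregation pattern described above: prefill on a memory-rich pool, decode on a latency-optimized pool, with a router in between. The class and method names are illustrative stand-ins, not the Dynamo API.

```python
# Toy disaggregated inference: compute-heavy prefill on one processor
# pool, latency-critical token generation on another.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

class RubinPool:
    """Stands in for GPUs with large memory: builds the KV cache (prefill)."""
    def prefill(self, req: Request) -> list[str]:
        return req.prompt.split()              # pretend token/KV state

class GroqPool:
    """Stands in for low-latency SRAM chips: generates tokens (decode)."""
    def decode(self, kv_state: list[str], n: int) -> list[str]:
        return [f"<tok{i}>" for i in range(n)]

def serve(req: Request, rubin: RubinPool, groq: GroqPool) -> str:
    kv = rubin.prefill(req)                    # massive compute + memory
    out = groq.decode(kv, req.max_new_tokens)  # bandwidth/latency bound
    return " ".join(out)

print(serve(Request("explain token factories", 5), RubinPool(), GroqPool()))
```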

Recommendations for the combination of Groq and Vera Rubin:

  • If the workload is primarily high throughput, use 100% Vera Rubin

  • If a large number of workloads involve high-value token generation such as code generation, introduce Groq, with a recommended ratio of about 25% Groq + 75% Vera Rubin
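
A back-of-envelope helper for that sizing guidance; the mapping from workload mix to capacity share is an assumption for illustration, not an NVIDIA sizing rule.

```python
# Toy capacity-split helper: pick the Groq share from the fraction of
# the workload that is high-value, latency-sensitive token generation.

def groq_share(high_value_decode_fraction: float) -> float:
    """Suggested fraction of capacity for Groq-style decode."""
    if high_value_decode_fraction <= 0.05:
        return 0.0                   # throughput-dominated: all Vera Rubin
    return min(0.25, high_value_decode_fraction)  # cap at the ~25% guidance

for frac in (0.0, 0.1, 0.5):
    g = groq_share(frac)
    print(f"{frac:.0%} high-value decode -> {g:.0%} Groq / {1 - g:.0%} Vera Rubin")
```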

The Groq LP30 is being manufactured by Samsung and has entered mass production, with shipments expected to start in Q3. Thanks to Samsung for their full cooperation.

Historic Leap in Inference Performance

To quantify these advances: within two years, the token generation rate of a 1GW AI factory will rise from 22 million tokens/second to 700 million tokens/second, a 350-fold increase. This is the power of extreme co-design.

Technology Roadmap

  • Blackwell: currently in production, Oberon standard rack system, copper cable expanded to NVLink 72, with optional optical expansion to NVLink 576

  • Vera Rubin (current): Kyber rack, NVLink 144 (copper cable); Oberon rack, NVLink 72 + optical, expanded to NVLink 576; Spectrum 6, the world's first CPO switch

  • Vera Rubin Ultra (coming soon): next-generation Rubin Ultra GPU, LP35 chip (first to integrate NVFP4), further enhancing performance several times

  • Feynman (next generation): new GPU, LP40 chip (jointly developed by NVIDIA and the Groq team, integrating NVFP4); new CPU—Rosa (Rosalyn); BlueField 5; CX 10; Kyber rack supporting both copper cable and CPO expansion methods

The roadmap is clear: copper-cable scale-up, optical scale-up, and optical scale-out are being advanced in parallel, and we need all partners to keep expanding production capacity in copper cables, optical fiber, and CPO.

NVIDIA DSX: The Digital Twin Platform for AI Factories

AI factories are becoming increasingly complex, but the various technology suppliers that make them up have never collaborated during the design phase, only "meeting" in the data center—this is clearly insufficient.

To address this, we created Omniverse and the NVIDIA DSX platform based on it—a platform for all partners to collaboratively design and operate gigawatt-level AI factories in the virtual world. DSX provides:

  • Rack-level mechanical, thermal, electrical, and network simulation systems

  • Connection with the power grid for collaborative energy-saving scheduling

  • Dynamic power consumption and cooling optimization based on Max-Q within the data center

Conservatively estimated, this system can improve energy utilization efficiency by about 2 times, which is a significant benefit at the scale we are discussing. Omniverse starts from the digital earth and will carry digital twins of various scales; we are working with global partners to build the largest computer in human history.
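
As a toy illustration of the kind of power coordination described (not DSX code), the sketch below grants power to racks in order of tokens-per-watt efficiency until a facility cap is exhausted.

```python
# Toy Max-Q-style power coordination: keep the facility under its grid
# cap while favoring the racks that produce the most tokens per watt.

FACILITY_CAP_W = 1_000_000  # toy 1 MW cap

racks = {                   # rack -> (requested watts, tokens per watt)
    "rack-a": (400_000, 0.05),
    "rack-b": (400_000, 0.03),
    "rack-c": (400_000, 0.04),
}

# Grant power in order of efficiency until the cap is exhausted.
budget = FACILITY_CAP_W
for name, (req, tpw) in sorted(racks.items(), key=lambda kv: -kv[1][1]):
    grant = min(req, budget)
    budget -= grant
    print(f"{name}: {grant:,} W granted ({tpw} tok/s/W)")
```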

Additionally, NVIDIA is venturing into space. The Thor chip has passed radiation certification and is running in satellites. We are developing Vera Rubin Space-1 with partners to build space data center computers. In space, heat can be shed only by radiation, making thermal management a core challenge; we are gathering top engineers to tackle it.

OpenClaw: The Operating System for the Agent Era

Peter Steinberger developed a software called OpenClaw. This is the most popular open-source project in human history, surpassing Linux's achievements in just a few weeks.

OpenClaw is essentially an agent system (Agentic System) that can:

  • Manage resources, access tools, file systems, and large language models

  • Execute scheduling and timed tasks

  • Gradually decompose problems and invoke sub-agents

  • Support arbitrary modal input and output (voice, video, text, email, etc.)

Described in operating-system terms, it is indeed an operating system: the operating system for agent computers. Windows made the personal computer possible; OpenClaw makes the personal agent possible.
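
A toy sketch of the agentic loop this describes: a root agent decomposes a goal, dispatches sub-agents, and each sub-agent calls tools. All names are illustrative; this is not the OpenClaw API.

```python
# Toy agent loop: goal -> plan -> sub-agents -> tool calls.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"[results for '{q}']",
    "write_file": lambda text: f"[saved {len(text)} chars]",
}

def sub_agent(task: str) -> str:
    """Each sub-task picks a tool and returns its result."""
    tool = "search" if task.startswith("find") else "write_file"
    return TOOLS[tool](task)

def root_agent(goal: str) -> list[str]:
    # A real agent would plan with an LLM; here the split is hard-coded.
    plan = [f"find sources on {goal}", f"draft a summary of {goal}"]
    return [sub_agent(task) for task in plan]

for step in root_agent("token factory economics"):
    print(step)
```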

Every enterprise needs to formulate its own OpenClaw strategy, just as we all need Linux strategies, HTML strategies, and Kubernetes strategies.

Comprehensive Restructuring of Enterprise IT

The enterprise IT before OpenClaw: data and files enter the system, flow through tools and workflows, and ultimately become tools for human use. Software companies create tools, and system integrators (GSI) and consulting firms help enterprises use these tools.

The enterprise IT after OpenClaw: every SaaS company will transform into an AaaS (Agent-as-a-Service) company, no longer just providing tools but providing AI agents specialized in specific fields.

But there is a key challenge here: internal agents can access sensitive data, execute code, and communicate with the outside world. This must be strictly controlled in the enterprise environment.

To address this, we collaborated with Peter to integrate security into the enterprise version, launching:

  • NeMo Claw (reference design): an enterprise-level reference framework based on OpenClaw, integrating NVIDIA's full suite of agent AI toolkits

  • Open Shield (security layer): integrated into OpenClaw, providing policy engines, network fences, and privacy routers to ensure enterprise data security

  • NeMo Cloud: downloadable and usable, interfacing with the policy engines of all SaaS companies
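
A toy sketch of the policy-engine idea in the list above: every tool call an agent attempts is checked against declarative, default-deny rules before it runs. The rule format and names are invented for illustration, not the actual product.

```python
# Toy policy engine: agents' tool calls pass through a guard first.

POLICY = {
    "read_public_docs": "allow",
    "query_crm": "allow",
    "execute_code": "deny",         # blocked for this agent role
    "send_external_email": "deny",  # no outbound communication
}

def guarded_call(action: str, payload: str) -> str:
    verdict = POLICY.get(action, "deny")   # default-deny unknown actions
    if verdict != "allow":
        return f"BLOCKED: '{action}' violates policy"
    return f"OK: ran '{action}' with {payload!r}"

print(guarded_call("query_crm", "top accounts"))
print(guarded_call("execute_code", "rm -rf /"))
```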

This is a renaissance of enterprise IT, a $2 trillion industry poised to grow into a multi-trillion dollar scale, shifting from providing tools to providing specialized AI agent services.

I can fully foresee that in the future, every engineer in the company will have an annual token budget. Their salaries may be hundreds of thousands of dollars, and I will additionally provide them with a token quota equivalent to half their salary, allowing their output to be magnified tenfold. "How many tokens are included in the job offer" has become a new hiring topic in Silicon Valley.
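
The arithmetic behind "salary + tokens" is easy to run. The salary figure below is an assumption for illustration; the tier prices are the ones quoted earlier in the speech.

```python
# Back-of-envelope: what does a token budget worth half a salary buy?

salary_usd = 300_000
token_budget_usd = salary_usd / 2            # half of salary, per the speech

for tier, price_per_m in [("mid", 3), ("high", 6), ("ultra", 150)]:
    tokens = token_budget_usd / price_per_m * 1e6
    print(f"{tier}-tier (${price_per_m}/M): {tokens / 1e9:,.1f}B tokens/yr")
```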

Every enterprise will in the future be both a user of tokens (for engineers) and a producer of tokens (providing services to its customers). The significance of OpenClaw cannot be underestimated; it is as important as HTML and Linux.

NVIDIA Open Model Initiative

In the area of custom agents (Custom Claw), we provide NVIDIA's self-developed cutting-edge models:

  • Nemotron: Large Language Model

  • Cosmos: World Foundation Model

  • GROOT: General Humanoid Robot Model

  • Alpamayo: Autonomous Driving

  • BioNeMo: Digital Biology

  • Phys-AI: AI Physics

We are at the forefront of technology in each of these fields and are committed to continuous iteration—after Nemotron 3, there will be Nemotron 4; after Cosmos 1, there will be Cosmos 2; Groq will also iterate to the second generation.

Nemotron 3 ranks among the top three best models globally in OpenClaw, at the forefront level. Nemotron 3 Ultra will become the strongest foundational model ever, supporting countries in building sovereign AI.

Today, we announce the establishment of the Nemotron Alliance, investing billions of dollars to advance the research and development of AI foundational models. Alliance members include: BlackForest Labs, Cursor, LangChain, Mistral, Perplexity, Reflection, Sarvam (India), Thinking Machines (Mira Murati's lab), and more. One after another, enterprise software companies are joining, integrating the NeMo Claw reference design and NVIDIA's agent AI toolkit into their products.

Physical AI and Robotics

Digital agents act in the digital world—writing code, analyzing data; while physical AI refers to embodied agents, namely robots.

This GTC showcased 110 robots, almost encompassing all robot R&D companies globally. NVIDIA provides three computers (training computers, simulation computers, onboard computers) and a complete software stack and AI models.

In autonomous driving, the "ChatGPT moment" for autonomous driving has arrived. Today, we announce four new partners joining NVIDIA's RoboTaxi Ready platform: BYD, Hyundai, Nissan, and Geely, with a total annual production of 18 million vehicles. Along with previous partners like Mercedes-Benz, Toyota, and General Motors, the lineup has further strengthened. We also announced a significant collaboration with Uber to deploy and integrate RoboTaxi Ready vehicles in multiple cities.

In the field of industrial robots, many robot companies such as ABB, Universal Robots, and KUKA are collaborating with us to combine physical AI models with simulation systems, promoting the deployment of robots on global manufacturing lines.

Caterpillar and T-Mobile are also among our partners here. In the future, wireless base stations will no longer be mere communication nodes; they will become NVIDIA Aerial AI RAN platforms: intelligent edge computing nodes capable of real-time traffic perception and beamforming adjustment, saving energy and improving efficiency.

Special Segment: Olaf Robot Appears

(Play Disney Olaf robot demonstration video)

Huang: The snowman is here! Newton is running fine! Omniverse is also running fine! Olaf, how are you?

Olaf: I'm really happy to see you.

Huang: Yes, because I gave you a computer—Jetson!

Olaf: What is that?

Huang: It's inside your belly.

Olaf: That's amazing.

Huang: You learned to walk in Omniverse.

Olaf: I love walking. It's so much better than riding a reindeer and looking up at the beautiful sky.

Huang: That's because of physical simulation—Newton solver running on NVIDIA Warp, which we developed in collaboration with Disney and DeepMind, allowing you to adapt to the real physical world.

Olaf: I was just about to say that.

Huang: That's your cleverness.

Olaf: I'm a snowman, not a snowball.

Huang: Can you imagine? The future Disneyland—all these robotic characters walking freely in the park. But to be honest, I thought you would be taller. I've never seen such a short snowman.

Olaf: (noncommittal)

Huang: Can you help me wrap up today's speech?

Olaf: That would be great!

Keynote Summary

Huang: Today, we explored the following core themes together:

  1. The arrival of the reasoning turning point: reasoning has become the core workload of AI, tokens are the new commodity, and reasoning performance directly determines revenue

  2. The era of AI factories: data centers have evolved from file storage facilities to token production factories, and in the future, every company will measure its competitiveness by "AI factory efficiency"

  3. The OpenClaw agent revolution: OpenClaw has ushered in the era of agent computing, and enterprise IT is transitioning from the tool era to the agent era; every enterprise needs to formulate an OpenClaw strategy

  4. Physical AI and robotics: embodied intelligence is being scaled up, with autonomous driving, industrial robots, and humanoid robots collectively forming the next major opportunity for physical AI

Thank you all, and enjoy GTC!
