Infyqro

It’s March 10, 2026, and a curious trend is emerging in the AI space: the companies making the hardware are increasingly making the models too. On a recent Stack Overflow Podcast episode, host Ryan Donovan sat down with Kari Briski, NVIDIA’s VP of Generative AI Software for Enterprise, to peel back the layers on this development. What's clear is that for a company like NVIDIA, this isn't just about diversification; it’s a foundational strategy to push the very limits of what AI hardware can achieve. And frankly, if you’re operating anywhere in the AI stack, you need to understand why.

Article hero image — Credit: Alexandra Francis

While the podcast itself had a few typical introductory segments—shout-outs to an AI robotics competition from Intrinsic, Open Robotics, NVIDIA, and Google DeepMind with a $180,000 prize pool (register by April 17 at [intrinsic.ai/stack](https://intrinsic.ai/stack)), and even a nod to a Stack Overflow user called The4thIceman for a Populist badge on their Pygame text centering answer—the real substance began when Kari Briski laid out NVIDIA’s philosophy.

The Full-Stack Mentality and Extreme Co-Design

Briski's path into computing started early: her mother taught computer programming at night, complete with early modem chat rooms, and Briski went on to earn one of the first computer engineering degrees from the University of Pittsburgh. Throughout, she has believed in the deep connection between hardware and software. "If there was no software," she states simply, "the hardware would just be a brick." This isn’t a throwaway line; it underpins NVIDIA's entire approach. The company sees itself as a "full-stack company," not merely a GPU manufacturer.

Its venture into model development, particularly large language models (LLMs), isn't a side project. It's a direct outcome of what NVIDIA calls "extreme co-design." The premise is straightforward: to accelerate crucial workloads, you first need to understand them, and that means building models in-house. The practice dates back to CUDA’s early days and runs through high-performance computing (HPC) and deep learning to computer vision, speech synthesis, and natural language processing. NVIDIA has been working on LLMs since 2018, which gives it a significant head start.

This commitment translates into a rapid, "engineer-to-engineer" feedback loop between model developers and hardware architects. Building and running its own models puts NVIDIA's GPUs through their paces and surfaces bottlenecks and opportunities in real time. The loop isn't limited to compute; it extends to networking and storage, and it influences the design of future hardware generations during the "plan of record" (POR) process.

Innovating with Precision and Memory

The results of this intense co-design are tangible. One significant learning involves training precision. Traditionally, models were trained in a higher precision such as FP16, then "quantized" down for inference. The issue? Quantization often costs 1-2% in accuracy. NVIDIA's approach, exemplified by FP8 on Hopper GPUs and NVFP4 on the new Blackwell architecture, is to train directly in reduced precision. NVIDIA's position is that this retains full accuracy while dramatically improving memory efficiency, in some cases halving the space required. That's a win for both training and inference.

Memory is the other pressure point: models demand a lot of it to operate. The co-design process has led directly to innovations like the "context memory engine," announced at CES, which addresses how GPUs manage massive context lengths (think a million tokens for a large codebase). NVIDIA has also developed software frameworks such as Dynamo, for disaggregated serving of large models, and NIXL, for efficient inter-GPU communication, all aimed at the memory and performance challenges of scaling these systems.
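To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It is not NVIDIA's sizing method: the byte widths are nominal (real block-scaled formats such as NVFP4 also carry scaling-factor metadata, which is ignored here), and the 70-billion-parameter model and its layer and head counts are hypothetical values chosen only for illustration.

```python
# Nominal storage per value for each format (assumption: scale-factor
# metadata used by block-scaled formats is ignored).
BYTES_PER_VALUE = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate memory needed just to hold the model weights."""
    return num_params * BYTES_PER_VALUE[fmt] / 1e9


def kv_cache_gb(seq_len: int, num_layers: int, num_kv_heads: int,
                head_dim: int, fmt: str) -> float:
    """Approximate key/value cache size for one sequence.

    The factor of 2 accounts for storing both keys and values
    at every layer for every token in the context.
    """
    values = 2 * seq_len * num_layers * num_kv_heads * head_dim
    return values * BYTES_PER_VALUE[fmt] / 1e9


if __name__ == "__main__":
    params = 70e9  # hypothetical model size
    for fmt in ("FP16", "FP8", "FP4"):
        print(f"{fmt}: weights ~{weight_memory_gb(params, fmt):.1f} GB")

    # Hypothetical architecture with a one-million-token context.
    for fmt in ("FP16", "FP8"):
        size = kv_cache_gb(1_000_000, 32, 8, 128, fmt)
        print(f"{fmt}: KV cache for 1M tokens ~{size:.1f} GB")
```

Each halving of the value width halves the footprint, which is where the "half the space" figure comes from. The second function shows why million-token contexts are a memory-management problem in their own right: the cache grows linearly with context length, independent of the weights.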

Meet Nemotron: NVIDIA's Open Model Family

NVIDIA’s commitment to walking the walk has coalesced into the Nemotron family of open models. The name is an homage, blending "Megatron" (a nod to one of the largest transformers from 2018) with "Nemo" (for Neural Modules). The family isn’t limited to LLMs; it also includes vision language models, embedding models, and speech models. The LLMs come in Nano, Super, and Ultra tiers: essentially small, medium, and large.

More recently, NVIDIA introduced a hybrid model within the Nemotron family. It combines a Mamba state space model with a traditional transformer architecture and also adopts techniques like "Mixture of Experts." Why Mamba? In a dense transformer, the cost of self-attention grows quadratically with sequence length, so inference gets expensive as contexts get longer. Mamba state space models process sequences more efficiently, mitigating that quadratic growth and improving token efficiency during both training and inference. Looking beyond the architectures that dominate the current conversation, Briski pointed to diffusion models as showing particular promise for future breakthroughs.
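The scaling difference can be illustrated with a deliberately simplified cost model. This is a sketch under stated assumptions, not a description of Nemotron or of Mamba's actual kernels: it counts only the multiply-adds for forming attention scores versus updating a fixed-size recurrent state, and the model width and state size are hypothetical.

```python
def attention_score_ops(seq_len: int, d_model: int) -> int:
    """Multiply-adds to compare every token with every other token."""
    return seq_len * seq_len * d_model


def state_space_ops(seq_len: int, d_model: int, state_dim: int) -> int:
    """Multiply-adds for a recurrence that updates a fixed-size state per token."""
    return seq_len * d_model * state_dim


if __name__ == "__main__":
    d_model, state_dim = 4096, 16  # hypothetical sizes
    for seq_len in (1_000, 10_000, 100_000):
        attn = attention_score_ops(seq_len, d_model)
        ssm = state_space_ops(seq_len, d_model, state_dim)
        print(f"{seq_len:>7} tokens: attention/state-space cost ratio = {attn / ssm:.1f}")
```

In this model, doubling the sequence length quadruples the attention term but only doubles the state space term, so the gap widens as contexts grow. That is the intuition behind mixing the two layer types in a hybrid.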

The Case for General-Purpose GPUs

A critical takeaway is NVIDIA's firm stance against highly specialized chips for individual models. While others might theorize about purpose-built silicon for specific AI architectures, NVIDIA remains committed to the general-purpose GPU. The reasoning is pragmatic: the most effective "agentic systems" rarely rely on a single model or architecture. Instead, they rely on complex "systems of models," including speech, ASR, and traditional machine learning components. Model architectures are also fluid and constantly changing. Designing a chip for one specific model would be a gamble, risking obsolescence before it even leaves the fab. Instead, NVIDIA focuses on keeping its general-purpose GPUs adaptable while optimizing performance through software and sophisticated serving techniques, such as using different GPU SKUs for different parts of an inference pipeline (e.g., prefill on one, decode on another). This maintains flexibility while still pushing the boundaries of efficiency.

In essence, NVIDIA isn't just selling the picks and shovels for the AI gold rush; it's actively digging with them. Its deep engagement in model development isn't a distraction; it's a strategic necessity that lets the company refine its hardware in lockstep with the demands of cutting-edge AI. For anyone investing in or building AI solutions, that co-design feedback loop is what’s driving the next generation of performance.

You can connect with Kari Briski on [LinkedIn](https://www.linkedin.com/in/karibriski/) to follow her work, and learn more about Nemotron on its [developer page](https://developer.nvidia.com/nemotron), [Hugging Face collection](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3), or at [NVIDIA GTC](https://nvda.ws/3NVv7OT) from March 16-19.

The pace of AI development feels less like steady progress and more like a high-speed sprint. If you've been watching the industry, you'll have noticed how quickly discussions around "context windows" have given way to more complex considerations. Forget simple memory; we're now deep into the intricacies of **agentic systems**, grappling with issues like 'context rot' and the infamous 'needle in the haystack' problem. It's a far cry from a year ago, when Retrieval Augmented Generation (RAG) was the buzzword. RAG remains important (it is essentially an online recommendation system that lets models fetch and re-rank information), but it's now just one tool in a much larger agentic toolbox. Kari Briski, NVIDIA's VP of Generative AI Software for Enterprise, makes it clear: "one model does not rule them all, it's systems of models." Managing these systems, which means deciding what memory to offload to disk, when to recall it, and how often to re-index data, turns out to be a problem domain surprisingly analogous to traditional computer system design.
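The "fetch and re-rank" description of RAG can be sketched as a two-stage pipeline. This is a toy illustration only: production systems typically use learned embedding and re-ranking models, whereas the bag-of-words cosine score and the term-coverage re-ranker below are stand-ins, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Very crude text representation: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fetch(query: str, docs: list[str], k: int) -> list[str]:
    """Stage 1: cheap retrieval of the k most similar documents."""
    q = bag_of_words(query)
    return sorted(docs, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)[:k]


def rerank(query: str, candidates: list[str]) -> list[str]:
    """Stage 2: re-order candidates by the share of query terms they cover."""
    q_terms = set(query.lower().split())
    def coverage(doc: str) -> float:
        return len(q_terms & set(doc.lower().split())) / max(len(q_terms), 1)
    return sorted(candidates, key=coverage, reverse=True)


if __name__ == "__main__":
    docs = [
        "the gpu kv cache stores attention keys and values",
        "pygame centers text with get_rect",
        "storage tiers offload cold context to disk",
    ]
    query = "kv cache attention"
    print(rerank(query, fetch(query, docs, k=2)))
```

The split mirrors recommendation systems: a fast first pass narrows a large corpus to a handful of candidates, and a more careful second pass decides what the model actually sees.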

AI's Software Engineering Remix

Briski and her team find themselves revisiting fundamental concepts. Ryan Donovan, the interviewer, hit on something vital when he suggested it sounded like "developing a caching system for AI." Briski readily agrees, noting the shared principles with traditional computer system design. The way agents can spin off, autonomously think and act, then return with a result, even prompts comparisons to object-oriented programming. It’s almost like you send an object off to do its thing, and it eventually comes back, having executed its task within the larger program. Donovan took that analogy further, suggesting we're "speed running all of the networked software stuff, like we're already at the microservices part of AI." And that's exactly it. The speed is astonishing. This rapid evolution also means innovation isn't just happening at the model level; it's extending into the underlying hardware. There's significant innovation in storage solutions specifically tailored to these agentic demands, with a wide ecosystem of partners contributing to these specialized capabilities. NVIDIA can't do everything, but they're fostering an environment where a rising tide lifts all boats, seeing agents and models integrate directly into storage solutions to retrieve "real answers," not just raw data.

The Power of a "Complete Recipe"

Perhaps the most significant aspect of NVIDIA's strategy with Nemotron isn't the models themselves but the company's commitment to being **fully open source**. This goes beyond "open weights." As Briski explains, they release "model architectures, the model weights, the data that we've used to train the models, as well as all of the libraries." This "complete recipe" approach is meant to accelerate research and development.

Why does this matter so much? For enterprises, it tackles a critical hurdle: liability and trust. Many companies are wary of using models when they can't inspect the training data or understand *how* answers are derived. By opening up the data sets, NVIDIA lets customers interrogate, audit, and build upon a trusted source. This isn't just about giving away data; it's about providing a robust "bootstrap" for companies to fine-tune models on their specific domains, complete with tools to generate domain-specific data and create "reinforcement learning gym environments" for specialized verification.

This approach has spurred significant engagement. Partners such as ServiceNow have taken this foundational work and applied it to their own domains, releasing models like Apriel along with specific gym environments. Demand for domain expertise is booming across verticals, from chip design and industrial design to coding and cybersecurity. Coding is a hotbed for AI tools because code quality is easy to verify: run the unit tests or check that it compiles. Other domains, like cybersecurity, require teams to build custom environments and verifiers to catch issues such as false positives in threat detection.

Even model builders who have their own stacks appreciate NVIDIA’s contributions, particularly the data sets and the gym environments. These environments, often treated as "secret sauce" by other large model providers, are crucial for training and validation, and NVIDIA's decision to open them up has been met with eagerness.
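The reason coding is easy to verify can be shown with a toy reward function of the kind a reinforcement learning environment might use. This is a hypothetical sketch, not NVIDIA's gym environments or any released library: the function name and scoring rule are assumptions, and `exec` is used only for brevity. A real verifier would run untrusted code in a sandbox with time and resource limits.

```python
def verify_candidate(source: str, func_name: str,
                     cases: list[tuple[tuple, object]]) -> float:
    """Score generated code by the fraction of test cases it passes.

    WARNING: exec() runs arbitrary code; this is for illustration only.
    """
    namespace: dict = {}
    try:
        # The "does it compile and load" check.
        exec(compile(source, "<candidate>", "exec"), namespace)
    except Exception:
        return 0.0

    func = namespace.get(func_name)
    if not callable(func):
        return 0.0

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(cases) if cases else 0.0


if __name__ == "__main__":
    tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0), ((10, 5), 15)]
    good = "def add(a, b):\n    return a + b\n"
    buggy = "def add(a, b):\n    return a - b\n"
    broken = "def add(a, b) return a + b"
    for label, candidate in (("good", good), ("buggy", buggy), ("broken", broken)):
        print(label, verify_candidate(candidate, "add", tests))
```

A domain like cybersecurity has no equivalent of "run the tests," which is why teams there have to build custom environments and verifiers before a model can be trained or evaluated against them.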

Nemotron's Roadmap and the Future of AI as Software

NVIDIA isn't just talking about openness; it's living it. The Nemotron roadmap is publicly available. Nano V3 was released in December, Super was slated for early February, and the Ultra model is due around April, just after the company's flagship GTC event in San Jose. These releases aren't just about new models; they're about demonstrating the impact of open models, accessible libraries, data sets, and architectures.

What's truly forward-looking is NVIDIA's vision of these models as a "new type of software development platform." Briski frames Nemotron models like software libraries. Just as traditional libraries need updates, bug fixes, and refreshes, so too will these AI models. This means integrating feedback, addressing bugs, incorporating feature requests, and continuously retraining and re-releasing them: a standard software development cycle applied to AI. You can't *yet* submit pull requests (PRs) against the core model designs, but Briski confirms that capability is coming. The final leg of the open-source journey is enabling external contributions to model architecture. This worldwide R&D, with its global validation of new architectures (like the hybrid state space and transformer model) and community-driven "Red teams" providing critical feedback, will drive the next wave of AI innovation. The push for openness isn't just a philosophical stance; it's a pragmatic recognition that collaboration, inspection, and rapid iteration are the fastest ways to build reliable, specialized, and trustworthy AI.

If you're curious to dive deeper into Nemotron, you can find more information on Hugging Face or the NVIDIA developer pages. And if you want to connect with Kari Briski and the NVIDIA research and development teams directly, catch them at GTC in March in San Jose.

From Chips to Code: Semiconductor Firms Build Large Language Models

