NVIDIA NCCL 2.28 Enhances AI Workflows by Merging Communication and Computation

Infrastructure reality check

This post is informational only (not professional advice). Performance and stability depend on your hardware, topology, software stack, and operating procedures, and responsibility remains with your engineering team. Tooling and best practices can change over time, so validate any approach with your own benchmarks and reliability requirements.

NCCL (the NVIDIA Collective Communications Library) rarely shows up in glossy architecture diagrams, but it decides whether “distributed training” feels smooth or fragile. When a model is spread across many GPUs, the system spends a large share of its time synchronizing. If synchronization is slow, jittery, or poorly overlapped with compute, expensive GPUs end up waiting for each other.

NCCL 2.28 is interesting because it shifts the mental model. Instead of treating communication as something the host schedules around compute, it introduces mechanisms that let communication be integrated into compute in more direct ways. The goal is simple: less orchestration overhead, more overlap, and fewer idle cycles when jobs scale.

Key takeaways

  • Device-side communication API: communication primitives can be initiated from within CUDA kernels, reducing host synchronization overhead.
  • Copy-engine (CE) collectives: data movement can be offloaded to GPU copy engines, freeing streaming multiprocessors (SMs) to keep computing while communication progresses.
  • Practical impact: better overlap and smoother scaling for workloads that are communication-heavy or sensitive to tail latency.

Communication–Compute Fusion in NCCL 2.28

Traditional distributed training often follows a predictable rhythm: compute, then synchronize, then compute again. Even when the algorithm is sound, the implementation can leave performance on the table because communication becomes a separate phase with its own kernel launches, host coordination, and timing variability.
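To make the cost of that separate phase concrete, here is a minimal Python sketch of a toy step-time model. The function name, the `hidden_fraction` parameter, and the millisecond figures are illustrative assumptions, not NCCL APIs or measurements.

```python
def step_time_ms(compute_ms: float, comm_ms: float, hidden_fraction: float = 0.0) -> float:
    """Toy model of one training step.

    hidden_fraction is the share of communication time that runs
    concurrently with compute (0.0 = fully serialized, 1.0 = fully hidden).
    Communication cannot hide more time than compute provides.
    """
    if not 0.0 <= hidden_fraction <= 1.0:
        raise ValueError("hidden_fraction must be between 0 and 1")
    hidden_ms = min(comm_ms * hidden_fraction, compute_ms)
    return compute_ms + comm_ms - hidden_ms


# Hypothetical numbers, for illustration only.
serialized = step_time_ms(compute_ms=80.0, comm_ms=40.0, hidden_fraction=0.0)
overlapped = step_time_ms(compute_ms=80.0, comm_ms=40.0, hidden_fraction=0.75)
print(f"serialized: {serialized:.1f} ms, overlapped: {overlapped:.1f} ms")
```

In this model the compute-then-synchronize rhythm is the `hidden_fraction=0.0` case, and the best possible outcome is a step that takes as long as the slower of the two phases.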

NCCL 2.28’s direction is to narrow that gap. The new device API makes it possible to integrate communication directly into user kernels—so a kernel can compute and move data in tighter coordination, rather than relying on host-initiated collectives as a separate step. NVIDIA’s technical write-up on the release provides the clearest high-level explanation of this shift: Fusing communication and compute with the NCCL 2.28 device API and copy-engine collectives.

What changes when communication becomes “kernel-adjacent”
  • Less host choreography: fewer points where the CPU must coordinate progress and synchronization.
  • More predictable overlap: communication can happen as part of the compute schedule rather than after it.
  • New design space: developers can build custom fusion patterns instead of accepting a one-size-fits-all phase boundary.

GPU-Initiated Networking: Why It Matters Even When You Don’t “Use It Directly”

A major reason distributed systems struggle at scale is coordination overhead. The host-driven model works well—until the number of ranks grows and the cost of orchestration and synchronization becomes visible in step time. Device-initiated approaches reduce that dependency by letting kernels participate more directly in the communication timeline.

Even if most teams won’t write custom device-API kernels on day one, the direction matters because it points to a serving and training stack that can be less “stop-and-go.” As the infrastructure becomes more fabric-like (large NVLink domains, advanced networking, and in-network features), host-driven scheduling becomes the limiting factor more often than raw bandwidth.

Copy Engine Collectives: Freeing SMs for the Work You Actually Care About

In many training runs, the GPU is asked to do two very different jobs: compute and move data. When communication consumes streaming multiprocessors, it competes directly with model math. That competition often shows up as reduced utilization and uneven step timing.

Copy-engine collectives take a different approach: offload parts of collective data movement to the GPU’s dedicated copy engines (DMA engines), allowing SMs to remain focused on computation. The result is typically better overlap of communication and compute and fewer “idle valleys” where a GPU has nothing useful to do while waiting for data movement to complete.

This isn’t a silver bullet. The benefit depends on the workload’s communication pattern, the topology, and the ability of the stack to actually overlap the phases. But it meaningfully changes the ceiling for communication-heavy training where overlap is the difference between “scales” and “scales painfully.”
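The trade-off can be sketched with a deliberately simple Python model. It assumes that an SM-based collective occupies a fixed number of SMs while it runs, that compute slows in proportion to the SMs it loses, and that a copy-engine collective uses no SMs at all. It ignores memory-bandwidth contention and everything else a real GPU does, so treat the function, its parameters, and the numbers as hypothetical.

```python
def overlapped_step_ms(compute_ms: float, comm_ms: float, total_sms: int, comm_sms: int) -> float:
    """Toy step time when communication starts alongside compute.

    compute_ms: compute time with every SM available.
    comm_ms:    duration of the collective.
    comm_sms:   SMs occupied by the collective while it runs
                (0 models a copy-engine collective in this sketch).
    """
    if not 0 <= comm_sms < total_sms:
        raise ValueError("comm_sms must be in [0, total_sms)")
    # While the collective runs, compute only gets the remaining SMs.
    compute_rate = (total_sms - comm_sms) / total_sms
    work_done_during_comm = comm_ms * compute_rate
    # Whatever compute is left afterwards runs at full speed.
    remaining_compute = max(0.0, compute_ms - work_done_during_comm)
    return comm_ms + remaining_compute


# Hypothetical GPU with 100 SMs; the collective takes 20 of them or none.
sm_based = overlapped_step_ms(compute_ms=100.0, comm_ms=40.0, total_sms=100, comm_sms=20)
ce_based = overlapped_step_ms(compute_ms=100.0, comm_ms=40.0, total_sms=100, comm_sms=0)
print(f"SM-based: {sm_based:.1f} ms, copy-engine: {ce_based:.1f} ms")
```

The gap between the two results is the compute time lost to SM contention in this model. Whether a real workload sees a similar gap depends on the factors listed above.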

Rack-Scale Fabrics and the “Latency Tail” Problem

At small scale, engineers can optimize around averages. At large scale, averages become misleading because a job’s pace is set by the slowest participant. Tail latency—the worst-case delays caused by congestion, topology distance, or transient routing pressure—becomes the practical enemy.
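A short Python sketch shows why averages mislead. It assumes each rank independently finishes a step in `base_ms`, or in `base_ms + delay_ms` with probability `p_delay`, and that the step ends when the slowest rank finishes. The independence assumption and all numbers are illustrative, not measurements.

```python
def expected_step_ms(n_ranks: int, base_ms: float, delay_ms: float, p_delay: float) -> float:
    """Expected duration of a synchronous step across n_ranks.

    The step is delayed if at least one rank is delayed, which happens
    with probability 1 - (1 - p_delay) ** n_ranks.
    """
    p_any_delayed = 1.0 - (1.0 - p_delay) ** n_ranks
    return base_ms + delay_ms * p_any_delayed


# Hypothetical: a 1% chance per rank of a 50 ms hiccup on a 100 ms step.
for n in (1, 8, 64, 512):
    print(f"{n:4d} ranks -> {expected_step_ms(n, 100.0, 50.0, 0.01):.1f} ms expected")
```

Each rank's own average barely moves, yet the expected step time climbs toward `base_ms + delay_ms` as ranks are added, because some rank is almost always having a bad step.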

NCCL improvements matter in this context because they reduce the number of synchronization choke points and increase the ability to overlap. That doesn’t eliminate the tail, but it makes the system less brittle: fewer hard stops, more continuous progress, and less time spent waiting for orchestration rather than doing useful work.

Resiliency: Planning for Failure as a Normal Condition

As GPU fleets grow, the probability that at least one component fails during a long run rises quickly, because every added GPU, link, and node is one more thing that can break. The operational question stops being “how do we prevent any failure?” and becomes “how do we recover without losing the entire run?”
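As a rough illustration, the Python sketch below estimates the chance of at least one failure during a run. It assumes components fail independently at a constant rate (an exponential model). The MTBF figure is hypothetical and not taken from any vendor data.

```python
import math


def p_any_failure(n_components: int, run_hours: float, mtbf_hours: float) -> float:
    """Probability that at least one of n_components fails during the run.

    Assumes independent failures with a constant rate of 1 / mtbf_hours
    per component, so the expected failure count is n * run / mtbf.
    """
    expected_failures = n_components * run_hours / mtbf_hours
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return -math.expm1(-expected_failures)


# Hypothetical: a 2-week run, with each component failing once per 50,000 hours on average.
for n in (8, 256, 4096):
    print(f"{n:5d} components -> {p_any_failure(n, 336.0, 50_000.0):.1%} chance of a failure")
```

Under these assumptions a failure is unlikely on a handful of GPUs and close to certain on thousands, which is why recovery matters more than prevention at scale.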

NCCL’s evolution toward more flexible communication patterns pairs naturally with fault-tolerant training designs: detect failure, isolate the affected part of the communicator, and resume progress with minimal disruption. The exact recovery mechanism depends on the training framework and orchestration layer, but the key idea is consistent: resilient scaling requires communication layers that can be managed, reconfigured, and observed without turning every failure into a full restart.

Signals operators should track (beyond “it’s slow”)
  • Step time variance: spikes often indicate congestion or straggler behavior.
  • Collective timing breakdown: which collective dominates the critical path (all-reduce, all-gather, all-to-all).
  • Overlap ratio: how much communication is actually hidden under compute.
  • Failure patterns: whether faults correlate with specific links, nodes, or job phases.
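Two of these signals reduce to simple arithmetic once timings are available. The Python sketch below assumes you already have per-step wall-clock times and a split of communication time into total and exposed (not hidden under compute). Collecting those numbers, for example from a profiler trace, is outside this sketch, and the function names and sample values are hypothetical.

```python
import statistics


def step_time_stats(step_ms: list[float]) -> dict[str, float]:
    """Summarize per-step wall-clock times in milliseconds."""
    mean = statistics.fmean(step_ms)
    stdev = statistics.pstdev(step_ms)
    return {
        "mean_ms": mean,
        "stdev_ms": stdev,
        # Coefficient of variation: spread relative to the mean.
        "cv": stdev / mean,
        "max_ms": max(step_ms),
    }


def overlap_ratio(total_comm_ms: float, exposed_comm_ms: float) -> float:
    """Fraction of communication time hidden under compute.

    exposed_comm_ms is communication time during which no compute ran.
    """
    if total_comm_ms <= 0:
        raise ValueError("total_comm_ms must be positive")
    return 1.0 - exposed_comm_ms / total_comm_ms


# Hypothetical samples: mostly steady steps with one straggler spike.
samples = [100.0, 101.0, 99.0, 100.0, 180.0]
stats = step_time_stats(samples)
print(f"mean={stats['mean_ms']:.1f} ms, cv={stats['cv']:.2f}, max={stats['max_ms']:.1f} ms")
print(f"overlap ratio: {overlap_ratio(total_comm_ms=40.0, exposed_comm_ms=10.0):.2f}")
```

A rising coefficient of variation with a stable mean is the kind of pattern that "it's slow" reports tend to miss.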

Considerations Before You Adopt New NCCL Capabilities

New primitives and performance features are powerful, but they also create a new responsibility: validation. Before treating an optimization as a “win,” teams should confirm it under realistic traffic, real batch sizes, and realistic contention. A gain measured in a microbenchmark can shrink, disappear, or reverse inside a full training system with data loading, checkpointing, and mixed parallelism.
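One way to encode that discipline is an acceptance check that looks at the tail as well as the mean. The Python sketch below is an assumption about how a team might do this, not an NCCL feature. The function names, the tolerance parameter, and the sample data are hypothetical.

```python
import statistics


def p99(samples: list[float]) -> float:
    """99th percentile using the standard library's inclusive method."""
    return statistics.quantiles(samples, n=100, method="inclusive")[-1]


def is_improvement(baseline_ms: list[float], candidate_ms: list[float],
                   tail_tolerance: float = 0.0) -> bool:
    """Accept a change only if the mean improves and the p99 does not regress.

    tail_tolerance is the allowed relative p99 increase (0.05 means 5%).
    """
    mean_better = statistics.fmean(candidate_ms) < statistics.fmean(baseline_ms)
    tail_ok = p99(candidate_ms) <= p99(baseline_ms) * (1.0 + tail_tolerance)
    return mean_better and tail_ok


# Hypothetical step times in milliseconds from full training runs.
baseline = [100.0] * 100
steady_candidate = [95.0] * 100
spiky_candidate = [90.0] * 98 + [400.0] * 2  # better mean, much worse tail

print("steady candidate accepted:", is_improvement(baseline, steady_candidate))
print("spiky candidate accepted:", is_improvement(baseline, spiky_candidate))
```

The spiky candidate has a lower mean step time than the baseline, yet the check rejects it because its p99 is far worse.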

If you need to audit communication behavior in your environment, NVIDIA’s nccl-tests benchmark suite (the NVIDIA/nccl-tests repository on GitHub) is a practical starting point.
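When reading nccl-tests results, it helps to know how its two bandwidth columns relate. The Python sketch below reproduces the algorithm-bandwidth and bus-bandwidth arithmetic as I understand it from the nccl-tests performance notes. The function names and sample numbers are hypothetical, and the correction factors should be checked against the version of nccl-tests you run.

```python
# Bus-bandwidth correction factors per collective, as a function of rank count n.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def alg_bandwidth_gbps(size_bytes: float, time_s: float) -> float:
    """Algorithm bandwidth: data size divided by elapsed time, in GB/s."""
    return size_bytes / time_s / 1e9


def bus_bandwidth_gbps(collective: str, size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth: algorithm bandwidth scaled so it can be compared
    against hardware link speed regardless of the number of ranks."""
    factor = BUS_FACTORS[collective](n_ranks)
    return alg_bandwidth_gbps(size_bytes, time_s) * factor


# Hypothetical measurement: a 1 GB all-reduce across 8 ranks in 10 ms.
algbw = alg_bandwidth_gbps(1e9, 0.010)
busbw = bus_bandwidth_gbps("all_reduce", 1e9, 0.010, 8)
print(f"algbw={algbw:.1f} GB/s, busbw={busbw:.1f} GB/s")
```

Bus bandwidth is the figure to compare against link capacity, since the correction factor accounts for how much data each collective moves per rank.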

Summary of NCCL 2.28’s Impact

NCCL 2.28 advances distributed AI by tightening the relationship between communication and computation. Device-side communication APIs reduce reliance on host orchestration, while copy-engine collectives help keep SMs focused on model math. Together, these changes aim to improve throughput, reduce latency overhead, and make scaling behavior smoother—especially for workloads where tail latency and synchronization costs dominate.

FAQ

What does “communication–compute fusion” mean in practical terms?

It means reducing the hard boundary between “compute now” and “communicate later.” With device-side APIs and overlap-focused collectives, a training step can make progress more continuously, reducing host synchronization overhead and lowering time lost to stop-and-go scheduling.

Why is GPU-initiated communication useful at scale?

Because host-driven orchestration becomes increasingly expensive as the number of ranks grows. When kernels can initiate or coordinate communication more directly, the system can reduce synchronization points and improve overlap, helping step time stay more stable as deployments scale.

What are copy-engine collectives, and when do they help?

They offload parts of collective data movement to GPU copy engines so SMs can keep computing. They are most valuable when communication would otherwise compete with compute on SM resources and when the workload can benefit from improved overlap.

How should teams validate improvements from NCCL 2.28 features?

Test under realistic contention and job shapes, not only microbenchmarks. Track step time variance, collective timing breakdowns, and overlap behavior. If a change improves average throughput but increases tail latency, the end-to-end job may still get worse.

Final takeaway

At scale, the “nervous system” of your training stack is communication. NCCL 2.28 matters because it reduces orchestration overhead and improves overlap, helping distributed jobs spend more time learning and less time waiting—without pretending that topology, tail latency, and operational discipline can be ignored.
