OpenAI introduces Multipath Reliable Connection (MRC) via OCP to optimize large-scale AI training networks.
Traditional RoCEv2 struggles with load balancing and with rapid recovery from link failures at the 100k+ GPU scale required for frontier models. By open-sourcing MRC, OpenAI is pushing a crucial standard for multipath routing and sub-millisecond recovery in Ethernet-based AI fabrics. This accelerates the industry's shift away from proprietary interconnects like InfiniBand toward highly resilient, commodity Ethernet architectures.
What happened
OpenAI has introduced Multipath Reliable Connection (MRC), a new networking protocol designed to enhance the resilience and performance of massive AI training clusters. Released as an open specification through the Open Compute Project (OCP), MRC aims to solve the networking bottlenecks inherent in scaling GPU clusters to the tens or hundreds of thousands of GPUs.

Technical details
Modern AI training relies heavily on synchronous operations across thousands of GPUs, where a single dropped packet or failed link can stall the entire cluster. Today, RDMA over Converged Ethernet (RoCEv2) is the standard transport for Ethernet AI fabrics, but it typically relies on single-path ECMP routing, which is prone to hash collisions, incast congestion, and slow failure recovery.

MRC extends the transport layer to natively support multipathing. It enables dynamic per-packet or per-flowlet load balancing across multiple network paths, maximizing bisection bandwidth utilization. More importantly, MRC implements rapid, sub-millisecond path switching when a link degrades or fails. Instead of waiting for higher-level protocols or centralized SDN controllers to route around the failure, which leaves expensive GPUs idle, MRC handles it transparently at the transport level.
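To make the flowlet and failover ideas concrete, here is a minimal, hypothetical sketch in Python of how a transport might combine flowlet-based multipath selection with immediate re-hashing off a failed path. This is an illustration of the general technique, not the actual MRC specification; the class, gap threshold, and path model are all invented for the example.

```python
import time
import zlib
from dataclasses import dataclass

@dataclass
class Path:
    path_id: int
    healthy: bool = True

class FlowletMultipathBalancer:
    """Illustrative flowlet load balancer (not the MRC spec).

    Packets of a flow that arrive within `flowlet_gap` seconds stick to
    one path, preserving in-order delivery within a burst. A longer idle
    gap (or a path failure) lets the flow re-hash onto any healthy path,
    which is what allows sub-millisecond, transport-level failover.
    """

    def __init__(self, paths, flowlet_gap=0.0005):
        self.paths = paths
        self.flowlet_gap = flowlet_gap
        self._last_seen = {}  # flow_key -> (timestamp, path_id)

    def _healthy(self):
        return [p for p in self.paths if p.healthy]

    def select_path(self, flow_key, now=None):
        now = time.monotonic() if now is None else now
        healthy = self._healthy()
        if not healthy:
            raise RuntimeError("no healthy paths available")
        prev = self._last_seen.get(flow_key)
        if prev is not None:
            last_ts, path_id = prev
            path = next((p for p in healthy if p.path_id == path_id), None)
            # Stick to the same path only while the flowlet is "hot" AND
            # the path is still healthy; a failed path forces an
            # immediate re-hash instead of waiting for routing to converge.
            if path is not None and now - last_ts < self.flowlet_gap:
                self._last_seen[flow_key] = (now, path_id)
                return path
        # New flowlet (or failover): hash the flow onto a healthy path.
        idx = zlib.crc32(flow_key.encode()) % len(healthy)
        chosen = healthy[idx]
        self._last_seen[flow_key] = (now, chosen.path_id)
        return chosen

    def mark_failed(self, path_id):
        for p in self.paths:
            if p.path_id == path_id:
                p.healthy = False

# Usage: packets within the gap stick to one path; a failure re-hashes.
paths = [Path(0), Path(1), Path(2), Path(3)]
lb = FlowletMultipathBalancer(paths)
first = lb.select_path("flow-a", now=0.0)
same = lb.select_path("flow-a", now=0.0001)   # same flowlet, same path
lb.mark_failed(first.path_id)
moved = lb.select_path("flow-a", now=0.0002)  # failover off dead path
```

The key design point the sketch tries to show is that failover is a local decision made per flow at the sender, so no controller round-trip sits on the critical path.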