Ultra Ethernet Consortium wants to optimize networking for AI and HPC

Not changing the standard, but tweaking how applications work over top

A group of tech companies has kicked off a project to adapt the Ethernet standard to make it better suited for the demanding network requirements of AI and high performance computing (HPC) applications.

The Ultra Ethernet Consortium (UEC) aims to create a "complete Ethernet-based communication stack architecture" that will be as ubiquitous and cost-effective as Ethernet while offering the performance of a supercomputing interconnect.

Founding members of the consortium include companies heavily involved in HPC and networking, among them Intel, AMD, HPE, Arista, Broadcom, Cisco, Meta and Microsoft. The project itself is hosted by The Linux Foundation.

UEC chair Dr. J Metz told The Register the goal of the project is not to change Ethernet but to tune it to better accommodate the more demanding characteristics of both AI and HPC workloads.

"Ethernet is the base technology on top of which we build, since it's the industry's best example of long lasting, flexible and adaptable basic networking technology," he said.

"UEC's goal is to focus on how to best carry AI and HPC workload traffic on top of Ethernet. Of course, there have been a few attempts to do that before, but none has been designed from the ground up for highly demanding AI and HPC workloads and none has been open, easy to use and won broad adoption."

The project targets multiple layers of the networking stack with working groups tasked with developing "specifications that enhance the performance, latency and management" of both the physical layer and link layer, as well as developing specifications for the transport layer and the software layer.

According to a whitepaper [PDF], networking is becoming increasingly critical for the training of AI models, which are ballooning in size; some have trillions of parameters and need to be trained on large compute clusters, and the network needs to be as efficient as possible in order to keep those clusters busy.

While AI workloads tend to be extremely bandwidth-hungry, HPC also includes workloads that are more latency sensitive, and both requirements need to be met.

To satisfy these needs, the UEC has identified the following as desirable characteristics: flexible delivery order; modern congestion control mechanisms; multi-pathing and packet spraying; plus greater scalability and end-to-end telemetry.

According to the whitepaper, the rigid packet ordering used by older technologies limits efficiency by preventing out-of-order data from being delivered straight from the network to the application. Support for modern APIs that relax the packet ordering requirements is critical to cutting "tail latencies."
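To illustrate why relaxed ordering helps, here is a minimal Python sketch, not anything taken from the UEC whitepaper. It assumes each packet carries the byte offset of its payload within the message, so a receiver can write every arrival straight into the application's buffer without waiting for earlier packets. The function name and the packet format are hypothetical.

```python
def place_packets(packets, total_len):
    """Write payloads into the destination buffer in arrival order.

    `packets` is an iterable of (offset, payload) pairs, a hypothetical
    format chosen for this sketch. Nothing is held back for reordering:
    each payload goes directly to its final position.
    """
    buf = bytearray(total_len)
    received = 0
    for offset, payload in packets:
        buf[offset:offset + len(payload)] = payload
        received += len(payload)  # assumes no duplicate or overlapping packets
    # The message is complete once every byte has arrived, in any order.
    return bytes(buf), received == total_len


# The second half of the message arrives before the first half.
message, complete = place_packets([(6, b"world"), (0, b"hello ")], 11)
```

A transport that insists on strict ordering would instead have to hold the early-arriving packet until the missing one turned up, which is where the extra tail latency comes from.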

Multi-pathing and packet spraying involve simultaneously sending packets along all available network paths between the source and destination to achieve the best performance.
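As a rough sketch of the idea, the Python below sprays the packets of a single flow across every available path in round-robin fashion. Round-robin is an assumption made for simplicity here; the UEC has not said how its transport will pick paths, and the function and path names are hypothetical.

```python
from itertools import cycle


def spray(packet_ids, paths):
    """Assign each packet of one flow to a path, round-robin.

    Returns a dict mapping each path to the packets sent over it.
    Round-robin is an illustrative policy only, not the UEC's design.
    """
    assignment = {path: [] for path in paths}
    for pkt, path in zip(packet_ids, cycle(paths)):
        assignment[path].append(pkt)
    return assignment


# Six packets from one flow spread over three equal-cost paths.
per_path = spray(range(6), ["path_a", "path_b", "path_c"])
```

Because consecutive packets take different routes, they can arrive out of order, which is why spraying goes hand in hand with the relaxed delivery ordering described above.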

Network congestion in AI and HPC is chiefly an issue on the link between the switch and a receiving node if multiple senders are all targeting the same node. However, current algorithms to manage congestion do not meet all the needs of a network optimized for AI, the UEC claims.
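A back-of-the-envelope calculation shows why this last-hop "incast" pattern hurts. The Python sketch below uses made-up figures (eight senders, 400 Gbit/s links), none of which come from the UEC, and the function name is hypothetical.

```python
def incast_load(num_senders, sender_gbps, receiver_link_gbps):
    """Return (offered load, oversubscription factor, fair share per sender).

    All rates are in Gbit/s. The figures are purely illustrative.
    """
    offered = num_senders * sender_gbps
    oversubscription = offered / receiver_link_gbps
    fair_share = receiver_link_gbps / num_senders
    return offered, oversubscription, fair_share


# Eight senders at 400 Gbit/s each, all targeting one 400 Gbit/s receiver link.
offered, oversub, share = incast_load(8, 400, 400)
```

With these numbers the final link is offered eight times what it can carry, so each sender has to be throttled to one eighth of its line rate if the switch queue is not to overflow. Doing that quickly is the job of the congestion control algorithm.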

Chiefly, it appears that the UEC aims to replace the RDMA over Converged Ethernet (RoCE) protocol with a new transport layer protocol that delivers the required characteristics. This Ultra Ethernet Transport will support multipath, packet-spraying delivery and efficient rate control algorithms, and will expose a simple API to AI and HPC workloads – or at least that is the intention.

HPE's involvement in the UEC is notable because it already has an HPC interconnect based on Ethernet. Its Cray Slingshot technology is a "superset" of Ethernet that keeps compatibility with standard Ethernet frames, as described in detail by our colleagues over at The Next Platform. It has featured in many of the supercomputer projects that HPE has been involved with in recent years, such as the Frontier exascale system.

HPE General Manager for High Performance Interconnects Mike Vildibill told us the company's motivation in backing UEC is driven by a desire to ensure that Slingshot operates within an open ecosystem.

"We would like for UEC-compliant NICs to experience some of the performance and scalability benefits of a Slingshot fabric," he said.

Development of Slingshot by HPE will continue into the future, Vildibill confirmed, but he reckons there will always be some third party NIC or SmartNIC that may have features which are not implemented on its Slingshot NIC.

"Therefore, UEC provides a mechanism to establish a robust ecosystem of third party NICs to ensure that we can support the broad range of customer requirements, while delivering some of Slingshot's unique capabilities," he said.

The UEC is in the early stages of development, and key technical concepts are still being identified and worked on. Dr. Metz said the first ratified drafts will likely be ready by the end of 2023 or early 2024, and the first standards-based products are also expected next year. ®

 
