Friday, January 10, 2025

Understanding Ethernet and InfiniBand on OpenShift

I was recently involved in a conversation about using only InfiniBand for an OpenShift cluster installation.  That is, the customer wanted InfiniBand connectivity alone to serve both the cluster APIs and the application's high speed storage access requirements.   This interaction made me realize we probably need a refresher on the difference between InfiniBand and Ethernet, because they are not the same and cannot be swapped interchangeably.

InfiniBand and Ethernet differ greatly from a design point of view.  InfiniBand was designed to provide a highly reliable, high bandwidth, low latency interconnect between the nodes of a supercomputer cluster.  Ethernet, by contrast, was designed around the question of how to move data between multiple systems easily.  This difference becomes more apparent in how each technology moves data.

The design differences show up, for example, in how latency is handled by the two types of interconnect.  Ethernet interconnects typically use a store-and-forward model with MAC address based transport for communication between hosts.  This adds processing overhead because the switch has to take into account complex services like IP, MPLS and 802.1Q.   With InfiniBand, layer 2 processing uses a 16-bit LID (local ID), and that LID is the only thing needed to look up the forwarding path information.  Further, InfiniBand switching uses a cut-through approach, which reduces the forwarding delay and makes it significantly faster than Ethernet.
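To put rough numbers on the store-and-forward versus cut-through difference, here is a minimal Python sketch. It only models the time each switch waits before it can start forwarding (the whole frame for store-and-forward, just the header for cut-through) and ignores propagation delay and switch pipeline time. The frame size, header size, link speed and hop count are hypothetical values picked for illustration, not measurements.

```python
def serialization_delay_us(num_bytes: int, link_gbps: float) -> float:
    """Microseconds needed to clock num_bytes onto a link running at link_gbps."""
    bits = num_bytes * 8
    # 1 Gbit/s is 1000 bits per microsecond.
    return bits / (link_gbps * 1000.0)


def store_and_forward_wait_us(frame_bytes: int, link_gbps: float, switches: int) -> float:
    """Each switch receives the whole frame before it starts forwarding it."""
    return switches * serialization_delay_us(frame_bytes, link_gbps)


def cut_through_wait_us(header_bytes: int, link_gbps: float, switches: int) -> float:
    """Each switch starts forwarding once the header with the destination has arrived."""
    return switches * serialization_delay_us(header_bytes, link_gbps)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    frame_bytes, header_bytes, link_gbps, switches = 4096, 64, 100.0, 3
    saf = store_and_forward_wait_us(frame_bytes, link_gbps, switches)
    ct = cut_through_wait_us(header_bytes, link_gbps, switches)
    print(f"store-and-forward wait: {saf:.3f} us, cut-through wait: {ct:.3f} us")
```

The gap grows with frame size and with the number of switches in the path, which is why the forwarding model matters most in large fabrics moving big messages.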

Another difference shows up in network reliability.  The InfiniBand protocol is a complete network protocol with its own defined layers from layer 1 to layer 4.  Its end to end flow control is the basis for how InfiniBand sends and receives packets, and it is what allows the network to be lossless.  Ethernet, on the other hand, does not have a scheduling based flow control mechanism, so there is no guarantee that the node on the other end will not be congested when a packet arrives.  This is why Ethernet switches are built with buffer memory to absorb sudden bursts of traffic.
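The lossless behavior is easier to see with a toy model. The Python sketch below compares a sender that only transmits when the receiver has advertised free buffer slots (credits), which is the style of flow control InfiniBand links use, with a best effort sender that bursts regardless of the receiver's state. The class and function names, buffer sizes and drain rates are all made up for illustration and do not correspond to any real driver or switch API.

```python
from collections import deque


class CreditedReceiver:
    """Receiver with a fixed number of buffer slots that it advertises as credits."""

    def __init__(self, buffer_slots: int):
        self.capacity = buffer_slots
        self.buffer = deque()

    def credits(self) -> int:
        return self.capacity - len(self.buffer)

    def accept(self, packet) -> None:
        if len(self.buffer) >= self.capacity:
            raise OverflowError("sender ignored the advertised credits")
        self.buffer.append(packet)

    def drain(self, count: int) -> None:
        for _ in range(min(count, len(self.buffer))):
            self.buffer.popleft()


def send_with_credits(packets, receiver: CreditedReceiver, drain_per_round: int = 1):
    """Send only as many packets per round as the receiver has credits for."""
    if drain_per_round < 1:
        raise ValueError("receiver must drain at least one packet per round")
    pending = deque(packets)
    delivered = rounds = 0
    while pending:
        for _ in range(min(receiver.credits(), len(pending))):
            receiver.accept(pending.popleft())
            delivered += 1
        receiver.drain(drain_per_round)
        rounds += 1
    return delivered, rounds


def send_best_effort(packets, buffer_slots: int, burst: int, drain_per_round: int = 1):
    """Send bursts without asking; anything that does not fit the buffer is dropped."""
    pending = deque(packets)
    buffered = delivered = dropped = 0
    while pending:
        for _ in range(min(burst, len(pending))):
            pending.popleft()
            if buffered < buffer_slots:
                buffered += 1
                delivered += 1
            else:
                dropped += 1
        buffered = max(0, buffered - drain_per_round)
    return delivered, dropped
```

With ten packets and a four slot buffer, the credited sender delivers everything and simply takes more rounds, while the best effort sender bursting all ten at once loses six of them. The trade is time for certainty, which is the point of a lossless fabric.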

The networking model is another distinction between these two technologies.  A software defined network is built into InfiniBand by design.  There is a subnet manager present on each layer 2 InfiniBand network to configure the LIDs of the nodes.  The subnet manager also calculates the forwarding paths through the control plane and issues them to the InfiniBand switches.  Conversely, Ethernet networking is built around MAC addresses, and the IP protocol must cooperate with the ARP protocol to resolve them.  Nodes in an Ethernet network are required to send packets on a regular basis to guarantee that entries, in an ARP table for example, are updated in real time.  All of this leads to more overhead in an Ethernet network compared to InfiniBand.
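As a rough illustration of the subnet manager's job, the Python sketch below assigns a LID to every node and then computes, for each node, which neighbor to forward to for every destination LID using a breadth first search. This is a simplified model under my own assumptions: a real subnet manager programs switch tables that map LIDs to output ports and uses more sophisticated routing algorithms, and the function names and topology here are hypothetical.

```python
from collections import deque


def assign_lids(nodes, first_lid: int = 1) -> dict:
    """Give every node a unique LID, the way a subnet manager would at fabric bring-up."""
    return {node: first_lid + index for index, node in enumerate(nodes)}


def build_forwarding_tables(links: dict, lids: dict) -> dict:
    """For each node, map destination LID -> neighbor on a shortest path.

    links maps a node name to the list of its directly connected neighbors.
    """
    tables = {}
    for source in links:
        first_hop = {}
        visited = {source}
        queue = deque()
        # Seed the search with the direct neighbors; each is its own first hop.
        for neighbor in links[source]:
            if neighbor not in visited:
                visited.add(neighbor)
                first_hop[neighbor] = neighbor
                queue.append(neighbor)
        # Everything discovered later inherits the first hop of the node it was found from.
        while queue:
            current = queue.popleft()
            for neighbor in links[current]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    first_hop[neighbor] = first_hop[current]
                    queue.append(neighbor)
        tables[source] = {lids[dest]: hop for dest, hop in first_hop.items()}
    return tables


if __name__ == "__main__":
    # Hypothetical two switch fabric with one host on each switch.
    links = {
        "hostA": ["sw1"],
        "hostB": ["sw2"],
        "sw1": ["hostA", "sw2"],
        "sw2": ["hostB", "sw1"],
    }
    lids = assign_lids(links)
    print(build_forwarding_tables(links, lids)["sw1"])
```

The key point is that the tables are computed centrally and pushed out, so no node has to learn addresses by flooding or by ARP style discovery.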

We can see from the above that there are significant differences between the two technologies, which makes it impossible to swap them like for like, as in the case of the OpenShift installation request from our customer.  Take, for example, an OpenShift installation that leverages OVN/OVS for networking.  During the installation there is an expectation that a MAC address will exist on the interface marked for the cluster API.   However, in an InfiniBand network there is no Ethernet-style MAC address.  One might see an error similar to the below:

Error: failed to modify 802-3-ethernet.cloned-mac-address: '00:00:01:49:fe:80:00:00:00:00:00:00:00:11:22:33:01:32:02:00' is not a valid Ethernet MAC.
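The address in the error above has 20 colon separated octets, which matches the 20-byte hardware address used by IP over InfiniBand rather than the 6 octets of an Ethernet MAC. A small Python sketch makes the mismatch concrete. The function name and return labels are my own and are not part of NetworkManager or any OpenShift tooling.

```python
import re

_HEX_OCTET = re.compile(r"[0-9A-Fa-f]{2}")


def classify_hw_address(address: str) -> str:
    """Classify a colon separated hardware address by its octet count.

    Returns "ethernet" for 6 octets, "infiniband" for 20 octets,
    "unknown" for any other length, and "invalid" for malformed octets.
    """
    octets = address.split(":")
    if not all(_HEX_OCTET.fullmatch(octet) for octet in octets):
        return "invalid"
    if len(octets) == 6:
        return "ethernet"
    if len(octets) == 20:
        return "infiniband"
    return "unknown"


if __name__ == "__main__":
    from_error = "00:00:01:49:fe:80:00:00:00:00:00:00:00:11:22:33:01:32:02:00"
    print(classify_hw_address(from_error))
    print(classify_hw_address("52:54:00:12:34:56"))
```

Anything in the installation path that expects a 6 octet value, such as the cloned MAC address setting in the error, has nowhere to put a 20 octet one.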

Further, drivers also become an issue for networking devices like a Mellanox CX-7 or BlueField-3.  This is because the default upstream mlx drivers that ship with Red Hat CoreOS in OpenShift do not contain the RDMA component which is required for InfiniBand.  To obtain the RDMA component one needs to leverage the NVIDIA DOCA driver, which is part of the NVIDIA network operator.  However, this operator cannot be leveraged in an OpenShift day 0 installation.  Even if it could, OVS/OVN networking would still expect an Ethernet network with a MAC address to work with.

Given all these differences, we had to explore how we could meet the customer's needs while still applying the correct technology to the systems.   If the customer's goal is to ensure a high speed interconnect between the nodes in the cluster, we can still do this with OpenShift.  However, we need to approach it differently and break out the cluster APIs so they still work over an Ethernet network.   A suitable approach might look like the example below.

In the diagram we have a six node OpenShift cluster in which each node has two single port Mellanox CX-7 cards.  On each node, one card is plugged into an Ethernet switch and the other into an InfiniBand switch.   With this design we can install OpenShift using the first CX-7 card operating in Ethernet mode.   Once OpenShift is installed we can layer on the NVIDIA network operator to provide the RDMA InfiniBand driver and leverage the second CX-7 card operating in InfiniBand mode.   This design not only gets OpenShift installed but also provides our workloads with a secondary network on the high speed InfiniBand fabric.   The same design would also work with a single dual port CX-7 card, as we can use the Mellanox tools to configure one port for Ethernet and the other for InfiniBand.
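Once the RDMA drivers are loaded on a node, one way to confirm which mode each port ended up in is to read the link_layer file that the Linux RDMA stack exposes under /sys/class/infiniband. The Python sketch below is a minimal example of that check. It assumes the ports are visible as RDMA devices on the node, and the helper name is my own. Device names such as mlx5_0 vary from system to system.

```python
from pathlib import Path


def port_link_layers(sysfs_root: str = "/sys/class/infiniband") -> dict:
    """Return {(device, port): link layer} for every RDMA port found under sysfs_root.

    Each link_layer file is expected to contain a value such as
    "InfiniBand" or "Ethernet".
    """
    root = Path(sysfs_root)
    result = {}
    if not root.is_dir():
        return result
    for device_dir in sorted(root.iterdir()):
        ports_dir = device_dir / "ports"
        if not ports_dir.is_dir():
            continue
        for port_dir in sorted(ports_dir.iterdir()):
            link_layer_file = port_dir / "link_layer"
            if link_layer_file.is_file():
                key = (device_dir.name, port_dir.name)
                result[key] = link_layer_file.read_text().strip()
    return result


if __name__ == "__main__":
    for (device, port), layer in port_link_layers().items():
        print(f"{device} port {port}: {layer}")
```

For the design above, the expectation would be one port reporting Ethernet for the cluster network and one reporting InfiniBand for the secondary network.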

Hopefully this blog provided some insight into the difference between InfiniBand and Ethernet and why one cannot simply swap out Ethernet for InfiniBand on an OpenShift installation.