This Document has limited distribution rights. It may not be distributed to the general public or shared beyond the employees of the licensee company. Upon the termination of an employee, whether voluntary or involuntary, access to this document must immediately cease. If the licensee company wishes to utilize this information in outward facing communications, please contact firstname.lastname@example.org.
Several weeks ago, we provided a small glimpse into the world of Composable Infrastructure (CI). Neuralytix even went so far as to suggest it might be Datacenter 4.0. We also suggested that CI will be built around a PCIe fabric, but we did not go into the details.
PCIe is broadly accepted, widely compatible, and easy to scale. PCIe 3.0 signals at 8 GT/s per lane, so aggregating lanes gives roughly 8 Gb/s (x1), 32 Gb/s (x4), 64 Gb/s (x8), or 128 Gb/s (x16) of raw bandwidth per direction, which makes it easy to reach very high performance. PCIe also has very low latency (~150ns per hop). This combination makes it an ideal technology for scaling out to hundreds of nodes (compute, storage, or network). In many CPUs, PCIe is already directly embedded, which makes it easier to integrate and keeps the cost very low.
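The lane arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming PCIe 3.0 (8 GT/s per lane with 128b/130b line encoding); the function names `link_bandwidth` and `fabric_latency_ns` are hypothetical, and the ~150ns per-hop figure is simply the one quoted above. Packet and protocol overhead, which reduces real-world throughput further, is deliberately ignored.

```python
# Sketch only: per-direction PCIe 3.0 bandwidth and a naive hop-latency estimate.
GT_PER_LANE = 8.0        # PCIe 3.0 raw signaling rate per lane, in GT/s
ENCODING = 128 / 130     # 128b/130b line encoding efficiency


def link_bandwidth(lanes):
    """Return (raw Gb/s, usable GB/s) per direction for a PCIe 3.0 link."""
    raw_gbps = lanes * GT_PER_LANE
    usable_gbytes = raw_gbps * ENCODING / 8   # bits to bytes
    return raw_gbps, usable_gbytes


def fabric_latency_ns(hops, per_hop_ns=150):
    """Naive added latency: switch hops times the ~150ns per-hop figure."""
    return hops * per_hop_ns


for lanes in (1, 4, 8, 16):
    raw, usable = link_bandwidth(lanes)
    print(f"x{lanes:<2} raw {raw:5.0f} Gb/s, usable about {usable:5.2f} GB/s")
```

Under these assumptions, a x16 link works out to a little under 16 GB/s per direction, and a path through three switch hops would add roughly 450ns.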
One early challenge for PCIe was its lack of high-availability features; after all, it was originally designed as a point-to-point connection. However, the PCI-SIG standards committee has come up with Downstream Port Containment to isolate endpoint errors. The committee also added other features to provide SAS-like robustness (such as multi-pathing, failover and zoning). According to Avago, which manufactures PCIe switches, these new features provide up to six nines (99.9999%) of availability.
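To put the availability claim in perspective, the short sketch below converts a number of nines into the maximum downtime per year. It is plain arithmetic, not vendor data; the function name `downtime_seconds_per_year` is hypothetical, and a 365-day year is assumed.

```python
# Sketch only: yearly downtime budget implied by "N nines" of availability.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000, ignoring leap years


def downtime_seconds_per_year(nines):
    """Maximum downtime per year for an availability of N nines."""
    unavailability = 10 ** -nines        # e.g. 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


for n in (3, 4, 5, 6):
    print(f"{n} nines: about {downtime_seconds_per_year(n):,.1f} s of downtime per year")
```

On this basis, six nines leaves a budget of roughly half a minute of downtime per year.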
Avago’s PEX9700 series switches include its ExpressFabric architecture, which provides a transparent bridging function to connect two topologies together. This system can use DMA to transfer data between the bridged topologies, and it also provides failover functionality. The PEX9700 platform interconnects up to 24 nodes (plus one management host) with a single chip. With a Tunneled Window Connection, users can cascade multiple chips to connect up to 72 nodes. Each port features a NIC-DMA engine to streamline data transfers between multiple hosts’ memory, and flexible port configuration allows up to 23 endpoints to be combined with ports up to 16 lanes wide (nearly 16 GB/s of throughput per link).
In 2014, according to Larry Chisvin, vice president of strategic initiatives at PLX (now part of Avago), “It is immediately obvious that there are a lot of extra parts in a system and a rack of systems. And the thing to note is that all of these devices start off as PCI-Express. Since almost every component – CPUs, storage devices like hard disks or flash, communications devices, FPGAs, ASICs, whatever – all have PCI-Express coming out of them, and so there is really no need to translate it to something else and then translate it back. If you could keep it in PCI-Express, then you eliminate a lot of the fairly expensive and power-hungry bridging devices. A 10 Gb/sec Ethernet card is anywhere from $300 to $500 and it dissipates 5 to 10 watts, and if you multiply that out for a rack of servers, the math gets pretty easy.”
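The rack-level math Chisvin alludes to can be sketched as follows. The per-card cost and power ranges come from the quote; the 40-server rack and one card per server are purely illustrative assumptions, and the function name `rack_nic_overhead` is hypothetical. The sketch also ignores whatever PCIe switches and adapters would replace the Ethernet cards.

```python
# Sketch only: cost and power of 10 Gb/sec Ethernet cards across one rack.
CARD_COST_USD = (300, 500)   # per-card cost range, from the quote
CARD_POWER_W = (5, 10)       # per-card power range, from the quote


def rack_nic_overhead(servers, cards_per_server=1):
    """Return (low, high) totals for card cost and power in a rack."""
    cards = servers * cards_per_server
    return {
        "cost_usd": (cards * CARD_COST_USD[0], cards * CARD_COST_USD[1]),
        "power_w": (cards * CARD_POWER_W[0], cards * CARD_POWER_W[1]),
    }


# Assumed rack size of 40 servers, for illustration only.
totals = rack_nic_overhead(40)
print(totals)
```

With those assumed numbers, a single rack carries $12,000 to $20,000 of Ethernet cards drawing 200 to 400 watts.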
PCIe fabrics do not need exotic cabling: they can use the QSFP+ cables common to Ethernet and InfiniBand, or MiniSAS-HD cables. Essentially, any cable that can support the 8 GT/s per-lane signaling rate of the PCI-Express 3.0 protocol can be used, which further drives down the cost of implementation.
The biggest challenge is on the software side. While Fibre Channel and Ethernet have well-established, mature drivers, PCIe fabric drivers are fairly new and immature.
That said, Neuralytix believes that there will be a great deal of focus on the use of PCIe as a fabric over the next three to five years, particularly as CI takes off.