Scaling Inference with NVIDIA’s T4: A Supermicro Solution with 320 PCIe Lanes
by Ian Cutress on November 19, 2018 3:00 PM EST
When visiting the Supercomputing conference this year, there were plenty of big GPU systems on display for machine learning. A large number were geared towards the heavy duty cards for training, but also a number of inference-only systems cropped up. In heeding the NVIDIA CEO’s mantra of ‘the more you buy, the more you save’, Supermicro showed me one of their scalable inference systems based around the newly released NVIDIA Tesla T4 inference GPU.
The big server, based on two Xeon Scalable processors and 24 memory slots, has access to 20 PCIe slots, each running at PCIe 3.0 x16, for a total of 320 PCIe lanes. This is achieved using Broadcom 9797-series PLX chips, splitting each PCIe x16 root complex from each processor into five x16 links, each of which can take a T4 accelerator.
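As a back-of-the-envelope check of those figures, a short sketch of the lane accounting (assuming, as the layout implies, that each switch takes one x16 uplink and fans out five x16 slots):

```python
# Rough lane accounting for the 20-slot system described above.
# Assumption (inferred, not stated outright): each PLX switch takes one
# x16 uplink from a CPU root port and fans it out to five x16 slots.

LANES_PER_LINK = 16
SLOTS_PER_SWITCH = 5

slots = 20
switches = slots // SLOTS_PER_SWITCH          # 4 PLX switches
downstream_lanes = slots * LANES_PER_LINK     # 320 lanes at the slots
uplink_lanes = switches * LANES_PER_LINK      # 64 lanes back to the CPUs

print(switches, downstream_lanes, uplink_lanes)  # 4 320 64
```

So the "320 PCIe lanes" are all electrically x16 at the slots, but they funnel down to four x16 uplinks into the two Xeons — a point the comments below pick up on.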
These accelerators are full-length but half-height, so even though the chassis supports full-height cards, Supermicro told me that the whole system is designed to be modular: should a customer want a different PCIe layout or a different CPU design, it could potentially be arranged.
The reason Supermicro says this design is scalable is that they expect to deploy it to customers with varying numbers of inference cards installed. It was explained that some customers only want a development platform to start with, and may request only four cards to begin. Over time, as their needs change or as new hardware is developed, new GPUs or new accelerators can be added, especially if a mixed-card requirement develops. Each slot can run at PCIe 3.0 x16 and supply up to 75 W, which is the sweet spot for a lot of inference deployments.
With only one card in, it did look a little empty (!)
There looks to be plenty of cooling. Those are some big Delta fans, so this thing is going to be loud.
There are technically 21 PCIe slots in this configuration. The slot in the middle is for an additional non-inference card, either a low powered FPGA for offload or a custom networking card.
osteopathic1 - Monday, November 19, 2018
If I fill it with cards, will it run Crysis?
coder543 - Monday, November 19, 2018
> This is achieved using Broadcom 9797-series PLX chips, splitting each PCIe x16 root complex from each processor into five x16 links
320 / 5 = 64 real PCIe lanes total.
AMD Epyc processors offer 128 PCIe lanes, right? It just seems like poor planning to start a project like this and choose a processor platform that only offers 64 PCIe lanes as the foundation, so... half the actual throughput they could have had just by choosing Epyc.
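The host-bandwidth argument in this comment can be put in numbers. A quick sketch of the oversubscription ratio for the same 20-slot layout on each platform (the 64-lane figure is from the comment above; the Epyc configuration is hypothetical):

```python
# Oversubscription at the host for 20 x16 slots behind PCIe switches,
# comparing 64 usable uplink lanes (2S Xeon, as built) against 128
# (a hypothetical Epyc layout). Illustrative figures only.

slot_lanes = 20 * 16  # 320 electrical lanes at the slots

for platform, host_lanes in [("Xeon (as built)", 64),
                             ("Epyc (hypothetical)", 128)]:
    ratio = slot_lanes / host_lanes
    print(f"{platform}: {ratio:.1f}:1 oversubscription")
```

That is 5:1 on the Xeon build versus 2.5:1 with 128 native lanes — though, as the reply below notes, peer-to-peer traffic through the switches never touches the host links at all.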
phoenix_rizzen - Monday, November 19, 2018
If they split the mobo in half and used single-CPU setups, they'd have 256 full PCIe lanes to play with (128 on each half). Then just cluster the two systems together to create a single server image. :)
Or just go with a dual-CPU setup with 3-way PCIe switches (one 16x link split into three x16 links) instead of the 5-way of the Xeon setup.
Kevin G - Monday, November 19, 2018
Not necessarily. There are GPU-to-GPU transactions that can be performed through the PCIe bridge chip at full x16 PCIe bandwidth.
There is also room for potential multipathing using those Avago/Broadcom/PLX chips, as they support more than one root complex. This would permit all five 9797 chips to have an x8 PCIe 3.0 link to each Xeon. Furthermore, the multipathing also works to each slot: the 16 electrical lanes could be coming as two sets of 8 lanes from different 9797 chips. For GPU-to-GPU communications, they would not need to touch the root complex in the Xeon CPUs at all. Each 9797's uplink to the Xeons would be exclusively for main memory access. Those 9797 chips are very flexible in terms of how they can be deployed.
As for Epyc, it still lacks the number of PCIe lanes to do this without bridge chips. At best, this would permit some high-speed networking (dual 100 Gbit Ethernet) on the motherboard while still having the same arrangement of bridge chips. However, Epyc could do this with a single socket instead of two (though a two-socket solution would still leverage the bridge chips in a similar fashion).
The coming Rome version of Epyc may include more flexibility in terms of how the IO is handled due to centralization in the packaging (i.e. a 96 + 96 PCIe lane dual-socket configuration may be possible). Avago/Broadcom/PLX 9797 chips would still be necessary, but with 16 full lanes of bandwidth from each socket to six of those bridge chips.
Spunjji - Friday, November 23, 2018
That was a well-written comment and nothing you said was inaccurate, but it doesn't fundamentally change the mathematics of 128 PCIe lanes being better than 64. You'd benefit from either fewer bridge chips, greater bandwidth, or some combination thereof.
HStewart - Tuesday, November 20, 2018
I would think some smart person could make a system which has a Broadcom 9797-series PLX chip extending the PCI Express lanes - this technology could make the number of PCIe lanes almost meaningless. In theory one could take your system of choice and add more PCI Express lanes.
Someone could in theory take an external TB3 system and support more video cards on it.
Kevin G - Tuesday, November 20, 2018
Performance will start to dwindle if you go from PCIe switch to PCIe switch. Latency will increase and host bandwidth will become an ever-increasing bottleneck. Sure, you can leverage a 9797 chip to provide eleven x8 PCIe 3.0 slots for GPUs, but performance is going to be chronically limited by the x4 PCIe 3.0 uplink that Thunderbolt 3 provides. The only niche where this would make sense would be mining.
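The bottleneck this comment describes is easy to quantify. A rough sketch, assuming roughly 0.985 GB/s of usable bandwidth per PCIe 3.0 lane (after 128b/130b encoding) and the eleven-slot example above:

```python
# Why a TB3 uplink chokes a big PCIe fan-out: rough PCIe 3.0 numbers.
# Assumption: ~0.985 GB/s usable per PCIe 3.0 lane (128b/130b encoding).

GBPS_PER_LANE = 0.985

uplink = 4 * GBPS_PER_LANE            # TB3 carries an x4 PCIe 3.0 link
downstream = 11 * 8 * GBPS_PER_LANE   # eleven x8 slots behind the switch

print(f"uplink ~{uplink:.1f} GB/s, downstream ~{downstream:.1f} GB/s, "
      f"oversubscription ~{downstream / uplink:.0f}:1")
```

Roughly 4 GB/s shared across 88 lanes' worth of slots - a 22:1 oversubscription, which is why only host-light workloads like mining would tolerate it.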
rahvin - Tuesday, April 2, 2019
These chips don't magically create bandwidth; they share it. Something like this makes sense in a compute situation (which is what this is designed for), where you aren't bouncing tons of data back and forth, but it would fall on its face in any other situation where high-speed, high-bandwidth access to each card was needed.
This is designed for compute clusters that aren't passing a lot of data to or from the PCIe cards. Don't expect to see this type of solution used elsewhere, particularly your thunderbolt example as it would provide no benefit.
abufrejoval - Tuesday, November 20, 2018
Two comments for the price of one:
1. Looks like a crypto mining setup being repurposed
2. Sure looks like Avago's smart IP-grabbing efforts are paying off big time in this design
Does anyone know, btw, if the PLX design teams are still busy at work doing great things for PCIe 4.0, or is Avago just cashing in?
Cygni - Tuesday, November 20, 2018
Nobody would bother with pricey PLX chips to run full x16 slots for crypto.