brucethemoose - Thursday, March 5, 2020 - link
Sweet... But I hope the software side of things gets some love?
I don't know much about datacenter compute, but CUDA code is absolutely *everywhere* in academic and tinkerer AI stuff, and I've had no luck getting the ROCm PyTorch branch to compile, much less run existing projects...
mode_13h - Sunday, March 8, 2020 - link
I got bad news for you: HIP uses ROCm. So, you're still stuck with getting the ROCm stack to work. Compiling CUDA code will just add another step of using their HIP toolchain to convert it:
https://gpuopen.com/compute-product/hip-convert-cu...
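For anyone wondering what that conversion step actually produces, here's a rough sketch (my own toy saxpy example, not code from AMD's documentation) of CUDA-style code after a hipify-style pass; the catch, as above, is that it only builds and runs if the ROCm stack underneath is actually working:

```cpp
// Hedged sketch: a toy saxpy kernel written against the HIP runtime, i.e. roughly
// what hipify-style conversion of the equivalent CUDA code would produce.
// Assumes a working ROCm install; compile with hipcc.
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // device code is unchanged from CUDA
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx = nullptr, *dy = nullptr;
    hipMalloc(&dx, n * sizeof(float));                                   // was cudaMalloc
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);  // was cudaMemcpy
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // was: saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);
    hipLaunchKernelGGL(saxpy, dim3((n + 255) / 256), dim3(256), 0, 0, n, 2.0f, dx, dy);

    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);                                        // expect 4.0
    hipFree(dx);
    hipFree(dy);
    return 0;
}
```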
Personally, I hope they stay committed to OpenCL. I don't want to use CUDA, even on Nvidia GPUs.
One thing I like about Intel's oneAPI strategy is that it's built on OpenCL/SYCL.
JayNor - Monday, June 8, 2020 - link
Intel's oneAPI is open, so AMD could add an OpenCL backend if they want.
Codeplay announced a native CUDA backend for oneAPI, and oneAPI's use of SYCL doesn't depend on an OpenCL implementation.
A quote from Codeplay:
"This project enables you to target NVIDIA GPUs using SYCL code, without having to go through the OpenCL layer in the system. "
OranjeeGeneral - Sunday, March 8, 2020 - link
You've hit the nail right on the head. AMD is super bad at the software side, and no hardware in the world will sell if you can't figure out the API/software side. There is barely any serious deep learning framework, not even an inference engine, out there that supports AMD hardware. Why? Because they have no API/framework that is anywhere near as good as cuDNN/CUDA, and no engineers/manpower who would actually help those frameworks implement and support such a backend at day ONE.
If AMD does not fix this issue they will never get any traction. You cannot do this on a shoestring budget and keep saying "yeah, but we prefer open APIs" blablabla. Barely anybody in the DL world gives a shit about open compute APIs anymore. OpenCL has pretty much failed to be any serious competition.
mode_13h - Sunday, March 8, 2020 - link
AMD has maintained forks of these frameworks for years. I don't know how much of this code has been merged back into their trunks, but you incorrectly speak as if AMD hasn't been working on it. Here are two biggies. I remember seeing more, but perhaps that means they've gotten support for the other frameworks merged.
https://github.com/ROCmSoftwarePlatform/tensorflow...
https://github.com/ROCmSoftwarePlatform/pytorch
Where AMD messed up was in putting all their eggs in the OpenCL basket and hoping that would be enough. Meanwhile, Nvidia was seeding CUDA usage throughout academia and was fairly quick to offer the cuDNN library of accelerated deep learning primitives. So, it's natural that CUDA gained momentum, even as far back as when AMD hardware still had a compute power advantage.
Meanwhile, AMD decided it needed to overhaul its software architecture, so a lot of resources went into KFD and ROCm, which involved years of out-of-tree patches and a frustrating user experience for anyone trying to get the stack running with AMD hardware.
In spite of all of this, AMD does have a first-class, in-tree kernel driver (unlike Nvidia, whose NVLink no longer works with new kernels, for related reasons) and it *does* have a good amount of momentum on the software front. These are still advantages it holds over many of the AI hardware startups trying to enter this space.
So, it's *possible* that AMD could finally succeed in being an AI hardware player, but it's still not a given. AMD would need to put forth a level of effort and focus beyond what they've done till now, which was barely enough to catch up to where Nvidia had been a couple of years prior. I'm not betting on their success here, but they *have* been doing considerably more than you say.
JayNor - Monday, June 8, 2020 - link
AMD's Frontier project ENA nodes appear to modify the 64-core Rome chiplet design, cutting it back to 4 cores per chiplet, then adding another 8 chiplets of GPUs ... with 32 CUs per GPU chiplet. Their stated requirement is for each node to be within 200W and to deliver 10 TFLOPS FP64.
"Given the performance goal of 1 exaflop and a power budget of 20MW for a 100,000-node exascale machine, we need to architect the ENA node to provide 10 teraflops of performance in a 200W power envelope."
"The EHP uses eight GPU chiplets. Our initial configuration provisions 32 CUs per chiplet. "
"The EHP also employs eight CPU chiplets (four cores each), for a total of 32 cores, with greater parallelism through optional simultaneous multi-threading."
That's all excerpts from this 2017 document ... I haven't seen any other description. Has there been an update?
https://www.computermachines.org/joe/publications/...
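Doing the arithmetic on the figures quoted above, as a quick sanity check (nothing beyond what the document itself states):

\[
\frac{10^{18}\ \text{FLOPS}}{10^{5}\ \text{nodes}} = 10\ \text{TFLOPS per node},
\qquad
\frac{20\ \text{MW}}{10^{5}\ \text{nodes}} = 200\ \text{W per node}
\]

And eight GPU chiplets at 32 CUs each works out to 256 CUs per node.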
scineram - Monday, August 17, 2020 - link
Frontier is literally Arcturus, which is literally a monolithic chip.
ballsystemlord - Thursday, March 5, 2020 - link
Spelling and grammar errors:
"AMD's goals for CDNA are simple and straightforward: build family of big,..."
Missing "a":
"AMD's goals for CDNA are simple and straightforward: build a family of big,..."
"And while they have a lot of catching do to realize those ambitions,..."
"up" not "do":
"And while they have a lot of catching up to realize those ambitions,..."
ballsystemlord - Thursday, March 5, 2020 - link
@Ryan, will some of these CDNA GPUs be made available for consumers to purchase at consumer-level pricing, or are they enterprise only?
Ryan Smith - Thursday, March 5, 2020 - link
Unfortunately AMD isn't talking about specific GPUs or SKUs at this time. But considering the target market, I don't expect to see AMD sell CDNA parts for cheap if they can avoid it. They're trying to compete with high-end Teslas, after all.
Threska - Friday, March 6, 2020 - link
Guess the compute crowd will still be hanging onto their Vegas then.
mode_13h - Sunday, March 8, 2020 - link
Just get a Radeon VII, while you still can. Not long ago, I saw some deals on them for a bit over $500, new.
mode_13h - Sunday, March 8, 2020 - link
Right now, I see an XFX-branded card selling for $600. That's $100 cheaper than where it launched a year ago, and the most memory bandwidth & 64-bit floating point horsepower you're going to find at that price point for at least a couple more years. Similar gaming performance to an RTX 2080.
Gc - Thursday, March 5, 2020 - link
For what kinds of algorithms will each CDNA design's floating point operations per second be optimized? The customers of El Capitan might prioritize different algorithms than the customers of Frontier, such as different weights for dense vs. sparse matrices, or bandwidth vs. routing. Are exaflop supercomputers large enough that AMD can customize the GPU design for each supercomputer?
Maybe these contracts are the primary source of AMD's funding for ROCm development, so the algorithms for which the customers require implementations have a strong influence on what parts of the ROCm ecosystem are developed by AMD. For example, the HPE El Capitan supercomputer announcement also mentioned that LLNL will work with the National Cancer Institute through the ATOM consortium, which has been using TensorFlow, so that might explain why most ROCm ML work has been on TensorFlow ahead of winning this contract.
sing_electric - Friday, March 6, 2020 - link
I think that's an interesting question and I wonder if they've figured it out yet. I wonder how big the product stack will be, and if part of the reason they're splitting compute and graphics GPUs is so they can offer products targeting half, single and double FP performance (rather than just doing what has frequently happened with consumer GPUs, and creating a base design and offering more/fewer streaming processors/CUDA cores to meet a certain performance or price target).
mode_13h - Sunday, March 8, 2020 - link
> Are exaflop supercomputers large enough that AMD can customize the GPU design for each supercomputer?
I seriously doubt it. Those supercomputers use thousands of GPUs, not millions.
That said, the AI market is already branching out into training and inference-specific chips. AMD could do the same, pairing training horsepower with their HPC GPUs and relegating inferencing to their consumer dies. Or, maybe they'll surprise us with 2 or 3 different CDNA dies. Though, I wouldn't be surprised if AMD wants to get some traction in the AI market, before going down the path of specialization.
Gc - Sunday, March 8, 2020 - link
NextPlatform reported: "While the Frontier system that is being installed in 2021 and put into production in 2022 is based on custom Epyc CPU and custom Radeon Instinct GPU motors, the contract with Lawrence Livermore specifies that El Capitan will be built with standard Epyc CPU and standard Radeon Instinct GPU parts, according to Forrest Norrod, general manager of the Datacenter and Embedded Systems Group at AMD."
https://www.nextplatform.com/2020/03/04/lawrence-l...
mode_13h - Sunday, March 8, 2020 - link
What they mean by "custom" is worth questioning. Google uses "custom" Vega GPUs in Stadia, but the hardware specs make it pretty darn clear that it's basically a Vega 56 with faster memory. As such, I think it's a custom card, maybe even down to the GPU package level, but I highly doubt anything at the die level.
haukionkannel - Friday, March 6, 2020 - link
This also means that when there is a rumor/leak, it is only for datacenter or gaming, not for both!
It is very likely that the 80 CU version is datacenter only, and gaming GPUs get a different number of CUs.
scineram - Friday, March 6, 2020 - link
No, an 80 CU GPU is probably for rendering. CDNA1 has 128 CUs.
WaltC - Friday, March 6, 2020 - link
It's pretty obvious that for a while now AMD has enjoyed the "success they seek"...;) What they are doing now is seeking yet more success--just a slight correction. I'm wondering who it is with whom "they have a lot of catching up to do"--for Frontier and El Capitan they beat out Intel, IBM + nVidia--they don't award these things capriciously. AMD alone can do the CDNA architecture, for obvious reasons. AMD, unlike Intel, meets its execution targets--that is inarguable.
mode_13h - Sunday, March 8, 2020 - link
> It's pretty obvious that for a while now AMD has enjoyed the "success they seek"...;)
You're just looking at their overall business success. You can't be talking about AI penetration, because they've utterly failed on that front. Their HPC penetration has also been quite weak, though not quite as dismal.
> for Frontier and El Capitan they beat out Intel, IBM + nVidia--they don't award these things capriciously
I'm not sure about that. DoE seems to be rotating through vendors, giving everyone a bite. Maybe they're trying to avoid putting all their eggs in one basket, out of security concerns. Maybe they just want to help sustain more than one domestic supplier. I wouldn't assume they do it 100% on price/performance, though.
> AMD, unlike Intel, meets its execution targets--that is inarguable.
Intel's problem is manufacturing, not design. Intel meets its design targets (or, at least we have no evidence they don't). AMD doesn't do manufacturing. Solution: Intel should just use TSMC/Samsung and they can start winning again, too.
Intel still won a big US Government HPC contract (I forget which), even in spite of their current delivery track record.
JayNor - Monday, June 8, 2020 - link
Intel's design for Aurora includes PCIe 5/CXL, which would get them to full CPU and GPU cache coherency a generation before AMD's CDNA2 solution. In addition, the asymmetric coherency model of CXL seems to be acknowledged as a benefit. Perhaps AMD will adopt it in CDNA2.
Yes, Intel had delays due to 10nm fab issues, but it looks like the design people stayed busy.
ksec - Friday, March 6, 2020 - link
Literally just said this yesterday in the El Capitan comment section. Kind of nice to see this being true.
ballsystemlord - Saturday, March 7, 2020 - link
Thanks, Ryan!