66 Comments
LemmingOverlord - Thursday, July 3, 2014 - link
Quick suggestion: considering Adreno is one of the most widespread GPU architectures for mobile, could you edit the table on the last page to include Adreno 3xx/4xx GPUs? Thanks!
Anand Lal Shimpi - Tuesday, July 8, 2014 - link
Unfortunately Qualcomm refuses to disclose much detail about their GPU architectures. I completely disagree with their position and have worked on Qualcomm for years to get them to open up, but at this point it's a meaningless effort.
da_asmodai - Thursday, July 3, 2014 - link
How about adding Qualcomm's Adreno 420 to the comparison?
Anand Lal Shimpi - Thursday, July 3, 2014 - link
I wish we could - Qualcomm refuses to disclose any deeper architectural details about any modern Adreno GPU architectures.
Krysto - Thursday, July 3, 2014 - link
Their loss. Plus, neither Adreno 420 nor their upcoming CPUs look that interesting or competitive anyway. Adreno 420 should still give only about HALF the performance of Tegra K1's GPU.
ChefJeff789 - Thursday, July 3, 2014 - link
Really? That's disappointing... I'm really looking forward to a time when ARM, nVidia, and AMD all compete on an architectural level in their GPUs, if it ever comes. The one-horse race with Intel in the desktop CPU space has been pretty lackluster for the past few years in terms of performance increases. nVidia's Maxwell architecture seems pretty amazing in terms of efficiency, and I'm not yet convinced AMD will be able to compete. They have yet to impress with their APU and mobile processor efficiencies.
frostyfiredude - Thursday, July 3, 2014 - link
Important to note that NVidia's TK1 will be achieving double the GFLOPS performance of the Adreno 420 at a clock speed of around 950MHz. At that performance level the TDP is listed at <10W, so it's not exactly comparable to the S805 and Adreno 420, which target a TDP half as high. What I can see happening is the TK1 being able to stretch its legs and thus being superior in large tablets, but being too thermally crippled in phones and small tablets to reach those levels. Based on the previews I found, Adreno is more efficient in its shader resource usage, closing that gap further.
lmcd - Thursday, July 3, 2014 - link
That's actually pretty bad math there -- if the TK1 achieves double the perf at double the power, it should achieve the same perf at 1/4 the power (well, not quite, since it isn't as simple as the basic E&M I learned, but yeah).
And by your logic, why would the K1 not fit in phones and tablets even as the 420 manages?
tuxRoller - Friday, July 4, 2014 - link
Power is linear in f, but goes with the square of V. I don't know that we can say that at half the f you can halve the V. Actually, that's almost certainly not the case, as it isn't the case with any common processor that I know of.
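To put some numbers on the f-versus-V point, here is a minimal sketch of the usual dynamic power model P ~ C * V^2 * f. The effective capacitance and the two operating points are illustrative assumptions, not figures for the K1 or the Adreno 420.

# Minimal sketch of dynamic power scaling, P ~ C * V^2 * f.
# All values below are illustrative assumptions, not published chip figures.

def dynamic_power(c_eff, volts, freq_hz):
    """Switching power in watts for a given effective capacitance, voltage, clock."""
    return c_eff * volts ** 2 * freq_hz

C_EFF = 1.0e-9                   # assumed effective switching capacitance (F)
f_full, v_full = 950e6, 1.00     # hypothetical full-speed operating point
f_half, v_half = 475e6, 0.85     # half the clock, but V rarely halves with f

p_full = dynamic_power(C_EFF, v_full, f_full)
p_half = dynamic_power(C_EFF, v_half, f_half)
print(f"full: {p_full:.2f} W, half clock: {p_half:.2f} W "
      f"({p_half / p_full:.0%} of full)")

With these placeholder numbers, halving the clock lands around a third of the original power rather than a quarter, because the voltage cannot be halved along with the frequency.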
tuxRoller - Thursday, July 3, 2014 - link
The Adreno 420 provides around 220 GFLOPS. The 430 will then be over 300 GFLOPS. That's not counting changes in clock speed that could raise or lower performance.
darkich - Friday, July 4, 2014 - link
You guys are missing the fact that the Snapdragon 805 can reach a much higher memory bandwidth than the Tegra K1.
TheJian - Saturday, July 5, 2014 - link
But it still loses to the K1 in most GPU tests (all?). You're forgetting AMD/NV have had 20 years of figuring out how to get the most they can from bandwidth for gaming. The devs have had that long working with their hardware too (game devs, I mean). Everyone else has to play catch-up here for years, as they've never had to do anything game-wise until the last year or so, as Android etc. gaming picked up a bit.
That is why you see ZERO Qcom-optimized games (or did I miss one?) :) It's easier to optimize for a chip you already know inside out (AMD/NV). I even went to Qcom's gaming page just to see if there were any games on their list that REQUIRED Snapdragon to see certain effects. There were none last I checked. All the games are just on Google Play with no Snapdragon mention (unlike NV games that say THD, and those games look quite a bit better than the regular versions), as they appear to work on ALL players' chips. Google seems to be realizing the K1 is where you want to be on GPUs, at least for gaming-centric stuff/automotive, and I'd expect devs to continue to favor NV for optimizations since they don't need to learn a thing about the K1: it's KEPLER, which they've already spent 2+ years playing with (probably longer, as they get dev versions long before we get a retail card so games can be made/optimized for them by the time they hit).
At 20nm the Xbox 360/PS3 will be left behind as new games keep getting made for mobile. If you're not on Xbox One/PS4 you'll be buying some cheap 20nm console box that has cheap games ($2-20 vs. $60 for Xbox One/PS4) and as-good-or-better graphics than last-gen Xbox 360/PS3. GDC 2013 & 2014 surveys show devs are already massively making games for mobile, and as 20nm kicks in everyone has K1 power levels or more. These Android consoles/TVs etc. will have more tricks than those ancient consoles, so you should be able to get a much better gaming experience on them for $100-200. The game pricing alone is a draw for poor people. With the ports of quality PC/console games happening right and left now at super cheap prices, there is even more reason to move to mobile for people who never played them before (Half-Life 2, Trine 2, Serious Sam 3: BFE - none sold more than 11 million or so). There are a billion Android users and most have clearly played none of this stuff; even the console ports like Final Fantasy games, GTA games, etc. (on or off PC, doesn't matter) haven't been played by more than 10 million or so combined each. Lots of great stuff for poor people to pick up for under $10 in ports until the REAL new games for mobile hit this Xmas/next Xmas. All of the stuff the dev surveys show they've been working on will hit this year or next, and they are not Angry Birds games.
przemo_li - Tuesday, July 8, 2014 - link
Alternative view on Google's stance: Nvidia is just the first vendor that allowed them to show more features than are possible on Apple's A7.
(Mobile-only vendors are not interested in full OpenGL...)
TheJian - Sunday, July 6, 2014 - link
If they're worried about lawsuits (odd they'd say that without merit), they must have had their lawyers tell them they'd be sued for stealing tech, probably from AMD/NV. The DMCA takedowns and completely closing the kimono, so to speak, show they are afraid for good reason. It isn't just fear of competitors, as nobody else seems afraid of that. The same tricks are being used by almost everyone to a large degree. So it seems to me they clearly owe someone some money and don't want to pay. They will probably show their details once they remove that stuff from a future-gen SoC, or never, I guess, if they just can't remove it for some reason :)
mczak - Thursday, July 3, 2014 - link
You could add Intel HD Graphics (Bay Trail) though. Also quite interesting architecture-wise, imho.
BTW, a small correction: the wavefront size for AMD (GCN) is 64, not 16 (I think this was wrong in older AnandTech articles too). The SIMD size is indeed 16, but the same instruction is always executed for 4 clocks (on 16 different elements of the wavefront each clock).
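A quick way to see where the 64 comes from, just restating the arithmetic described above (this is not vendor documentation):

# GCN wavefront size = SIMD lanes x clocks the same instruction is repeated.
simd_lanes = 16       # physical lanes per SIMD
repeat_clocks = 4     # the same instruction executes for 4 consecutive clocks
print(simd_lanes * repeat_clocks)   # 64 work-items per wavefront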
mczak - Thursday, July 3, 2014 - link
Here's an explanation of how the wavefront size of 64 works for GCN: http://devgurus.amd.com/thread/168154
Achtung_BG - Thursday, July 3, 2014 - link
My first touch phone was a black LG Viewty in 2008, with a Mali GPU :) :) :) If you do a new article comparing the Android Extension Pack with full OpenGL, it will be very interesting.
Jedibeeftrix - Thursday, July 3, 2014 - link
Yes please.
I'd like to know:
1. how long until the AEP is rolled back into what will be OpenGL ES 4.0
1.1. whether it represents a subset of an existing OpenGL full-fat version (eg 4.4)
2. how this compares to DX 11.2 feature wise
2.1. whether AEP will be expanded in OpenGL ES 4.0 to make it broadly DX 11.2 compliant
przemo_li - Tuesday, July 8, 2014 - link
1) Never. (Though separate extensions can get into ES. AEP is just a thin bundle over many other extensions.)
1.1) Yes. OpenGL 4.x is still capable of running AEP code.
2) DX11.2 is a single-vendor endeavour currently... (And you really should compare to F(eature)L(evel) 11_2.)
2.1) WHY?
Why on earth would you need all those things?
The industry is moving in a different direction (Mantle, DX12, Metal, AZDO).
Doing stuff efficiently is the new mantra now.
Adding more stuff from DX FL11_2 (yes, if you talk about features you MUST use F(eature)L(evels)!!!) would only complicate things for OpenGL ES.
We need AZDO.
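Since AEP is just a thin bundle over existing extensions, an application can test for the bundle string and, failing that, probe a few of the individual extensions. A minimal sketch: get_extensions() is a hypothetical stand-in for however your GL binding exposes glGetString(GL_EXTENSIONS), and the member list is only a partial example of what the pack bundles.

# Sketch: detect the Android Extension Pack, or probe a few of the extensions
# it bundles. get_extensions() is a hypothetical helper returning the
# space-separated extension string from glGetString(GL_EXTENSIONS).

AEP = "GL_ANDROID_extension_pack_es31a"
SOME_AEP_MEMBERS = [
    "GL_EXT_geometry_shader",
    "GL_EXT_tessellation_shader",
    "GL_KHR_texture_compression_astc_ldr",
]

def supports_aep(get_extensions):
    exts = set(get_extensions().split())
    if AEP in exts:
        return True                                       # whole bundle advertised
    return all(ext in exts for ext in SOME_AEP_MEMBERS)   # partial probe only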
Kevin G - Thursday, July 3, 2014 - link
I can see Qualcomm's concerns about a shader arms race in mobile: it has already happened on the CPU side without much benefit to the consumer. However, with the explosion in screen resolution in tablets, a spec race here would have a more tangible benefit for consumers. It still boggles my mind that a retina iPad has 50% more pixels and a slower GPU than my desktop system with a 1080p monitor driven by a GTX 770. My sole concern would be temperatures and power consumption.
Well, if Qualcomm isn't going to disclose the information, how much can be inferred from driver information? Qualcomm purchased the mobile Radeon division from AMD back in 2009 and then came up with the anagram Adreno. If they're still using a design based upon what they got from AMD, it would be reflected in similar drivers. If they've come up with a new architecture, that too would be evident in radically different drivers. The details would be lacking, of course, but some generalities could be made.
toyotabedzrock - Friday, July 4, 2014 - link
Didn't Imagination also buy from ATI? Maybe that is why they are concerned with patents?
Kevin G - Saturday, July 5, 2014 - link
There is likely a cross-licensing agreement in place. To build any modern GPU, a company literally has to cross-license patents. It works out to be zero-sum in that money isn't exchanged, but one company can use the other's patents (and future patents) royalty-free.
This also puts a rather high barrier to entry in the market.
nolaviz - Thursday, July 3, 2014 - link
Ah, the history... I led the integration of the MALI55 into Zoran's APPROACH-5C. Sweet memories :)
skiboysteve - Thursday, July 3, 2014 - link
Very interesting. The complete, 100% opposite of the AMD architecture.
Also, the fact that they can power down shader cores and even individual ALUs makes this pure-ILP, zero-TLP architecture really shine, because the compiler only needs to optimize work at the ALU level, not at a shader or wavefront level. So then, based upon demand, they just feed more or fewer of these ALU-optimized packets of work into the GPU and power down the rest of the ALUs. It makes complete sense. Make each ALU independent, compile work on a per-ALU basis, then scale your ALU utilization at run time... and also scale your chip portfolio on ALU count. Perfect scalability in price, performance, and power, combined with straightforward driver work.
Awesome!
rootheday - Thursday, July 3, 2014 - link
Transaction Elimination might work for "render to texture", where the texture is then used later in the scene, but it doesn't make sense to me for the final render target that is shown on the screen. Typically your final render target is part of a swap chain with 2 or 3 different buffers, so the CRC computed for a tile in frame N would need to match the CRC for frame N-2 (or maybe N-3), not N-1.
EdvardS - Thursday, July 3, 2014 - link
So each buffer has its own CRC buddy attached to it - problem solved :-) Longer buffer chains just reduce your temporal coherency a bit.
hexgrid - Thursday, July 3, 2014 - link
Relying on a 32-bit hash for transaction elimination is a very bad idea. Assuming uniformly distributed hash results, the birthday paradox means you've got a better than 50% chance of a collision somewhere once roughly 77,000 items have been hashed. The compartmentalized comparisons (it looks like they only compare against the same tile) mean the collision rate will be lower, but there will be collisions, and in some cases they will look terrible.
If there's a glitch, there's an excellent chance it will persist for a while. For example, let's say I'm bringing up a "pause" menu. If the title overlay of the pause menu happens to hash to the same value as the background that it replaced, that one tile of the pause menu title won't appear. Next frame, it still won't appear, because again the hash value hasn't changed. Until something happens to actually change the title or make it disappear, the glitch will persist.
The "fix" will be to do GUI overlays in translucency and keep the background animating somehow to prevent cache misses from sticking around, but it's an ugly hardware hack that will force software workarounds.
EdvardS - Thursday, July 3, 2014 - link
What if the CRC is 64-bit for each small tile? Quite overkill and very hard to break.
mkozakewich - Sunday, July 6, 2014 - link
If the pause screen fades in, wouldn't that change the image data enough that it would generate entirely different CRCs for each frame until the animation ended?
At any rate, my rule of thumb would be that if it doesn't happen more often than a keyframe is dropped in a video, it'll be fine.
Krysto - Thursday, July 3, 2014 - link
I'd like you guys to do more 30x GFX loops, to show whether these GPUs only put up high numbers at first but then quickly get throttled. From what I've noticed, Imagination is the only one that doesn't throttle in that test. But I'm curious to see if the K1 and T760 do.
3DPowerFX - Thursday, July 3, 2014 - link
Once again, AnandTech has published a great article! Thanks ARM and AnandTech.
Just one point: there is a small mistake in the article about the Samsung Exynos 3470 GPU. It's not the Mali 450MP but the undead Mali 400MP GPU. Although it would be nice to have the latest one.
Cogman - Thursday, July 3, 2014 - link
On transaction elimination: a movie is actually a much worse candidate for elimination than anything else. The only saving grace for a movie is that its frame rate is often much lower than what the device is natively putting out (60fps is a typical display refresh rate, whereas movies typically run at 24-30fps). Beyond that, everything changes right down to the smallest detail - this is the grainy effect that you see in movies.
For games, there could be some benefit, assuming the game isn't a high-action one. The biggest win will be still images (90% of what these displays are going to be displaying).
EdvardS - Thursday, July 3, 2014 - link
Movies are not actually that bad. Remember that the videos we watch on our devices have already been compressed with lossy algorithms looking for temporal resemblance, which seems to boost the transaction elimination efficiency as well.
BMNify - Thursday, July 3, 2014 - link
gem did a write-up, but I can't find it now! Take a look here as regards transaction elimination: http://community.arm.com/groups/arm-mali-graphics/...
BTW, "the grainy effect that you see in movies" has absolutely nothing to do with frame rate.
It's put there (artificially) by post-processing, due to the fact that today everyone is using 8 bits per component, as in the Rec. 709 (HDTV) color space, which produces banding and other visible anomalies - not the new official Rec. 2020 (UHDTV/UHD-1/UHD-2) 10-bit/12-bit color space we will see soon.
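For a rough sense of why 8 bits per component bands while 10 or 12 bits generally does not, here is a trivial sketch of the quantization step sizes involved (pure arithmetic, independent of any particular codec or display):

# Code values available per component at common bit depths, and the normalized
# step between adjacent values (smaller step = less visible banding).
for bits in (8, 10, 12):
    levels = 2 ** bits
    step = 1.0 / (levels - 1)
    print(f"{bits:>2}-bit: {levels:>5} levels, step = {step:.6f}")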
tuxRoller - Thursday, July 3, 2014 - link
Consider asking Red Hat's Rob Clark. He's been reverse engineering the Adreno architecture (his driver, freedreno (https://github.com/freedreno/freedreno/wiki), however, is not a reverse-engineered Adreno driver) for a few years now and can almost certainly give you at least that much info.
His blog is at http://bloggingthemonkey.blogspot.com, and he's a super nice guy.
jwcalla - Friday, July 4, 2014 - link
Qualcomm is a really closed company. They just did a massive DMCA takedown on GitHub: https://github.com/github/dmca/blob/master/2014-07...
Their software side isn't that great either.
tuxRoller - Friday, July 4, 2014 - link
I'm not sure why this is addressed to me. Although I expect AT will ignore what I've written so as not to upset their corporate friends, what I suggested is what they should do if they are really interested in the tech.
What's strange to me is that with their analysis of Cyclone they did something similar to what I'm suggesting they do here, except that in the Qualcomm case the work has already been done by someone else.
Death666Angel - Thursday, July 3, 2014 - link
Awesome to see this here! I hope the Adreno team will follow suit soon and lay their doubts to rest.
"LG’s Viewty" - holy shit, that was my 2nd ever phone (after my first flip phone got broken when I rammed a car with my bike). That thing was pretty bad all in all, but the slow-motion camera was great for its time! :D It broke too while I was in a fight, but that was the last one. Touch Pro 2, Galaxy S2, Galaxy Nexus and LG G2 are all working fine to this day. :D
Willardjuice - Thursday, July 3, 2014 - link
"From a sales perspective this means ARM can offer the CPU and GPU designs together in a bundle, but perhaps more importantly it means they have the capability design the two in concert with each other, being in the position of the sole creator of the ARM ISA."lol, the bundle aspect is far more important for ARM gpu sales. ;)
skiboysteve - Thursday, July 3, 2014 - link
Truth. It basically means a competitor needs to show a significant performance, power, feature, or cost difference before it's worth an integrator investing in breaking apart the bundle.
seanlumly - Thursday, July 3, 2014 - link
Given the "exotic" ILP and the 128-bit VLIW SIMD, the Mali looks like an impressive performer. If a Mali T760MP10 is indeed a part fit for smartphone-level power consumption, then a quick linear scaling -- given the GFXBench scores of a T760MP4 -- would imply that such a GPU is very competitive with something like the Adreno 420, and certainly impressive if scaled up further to tablet-level power consumption. If, however, an MP10 consumes roughly as much as a K1 or GX6650, then I'm very sceptical about its competitive performance.
I find the distribution of ALUs to memory units strange given mobile bandwidth limits. The Mali T760 in an 8-core configuration clocked at 600MHz will allow for 19.2 GB/s of load/store access (256 bits/clock: source ARM). This is quite high memory bandwidth, and an increase in GPU clocks or cores will likely yield idling load/store units doing nothing but taking up valuable die area. Operations with heavy varying access, cache reads, and tile read/writes will of course make good use of these additional units, but it still seems like overkill on all but very memory-access-heavy apps. ARM would know best, though I'm suspicious that so many load/store units are needed for common workloads. I would guess that in common scenarios, bandwidth to main memory would be exhausted long before all of the memory units were fully utilized on a Mali T760 of high core count (eg. MP12-16).
Decoupling the load/stores and texture units as their own "core" may allow more appropriate scaling to fit the bandwidth of the target system. A system with an ultra-high resolution could be endowed with more load/store and/or texture units. A system with a lower resolution could use fewer and opt for more ALUs in the same space. This would be similar to big.LITTLE (different cores for different targeted workloads). In this scenario, the memory-unit cores could be scaled independently of the ALU cores, perfectly tailored to the target system.
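The 19.2 GB/s figure above is just width times clock; a quick sketch of that arithmetic, with an illustrative DRAM interface added for scale (the LPDDR3 configuration is an assumption for comparison, not something from the article):

# Load/store bandwidth quoted above: 256 bits per clock at 600 MHz.
ls_bytes_per_clock = 256 / 8
gpu_clock_hz = 600e6
print(ls_bytes_per_clock * gpu_clock_hz / 1e9)            # 19.2 GB/s

# Illustrative DRAM interface for comparison (assumed 64-bit LPDDR3-1600).
dram_bytes_per_transfer = 64 / 8
transfers_per_s = 1600e6
print(dram_bytes_per_transfer * transfers_per_s / 1e9)    # 12.8 GB/s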
EdvardS - Friday, July 4, 2014 - link
Remember that there is a cache system between those units and the SoC memory controller. Bandwidth to the cores is quite different from bandwidth to the DDR memory.
seanlumly - Friday, July 4, 2014 - link
Indeed! Bandwidth to cache, tile memory, varying data, or textures would likely be more relevant with a tile-based renderer that often exploits spatial locality when processing batches of pixels. This is especially true with modern screen-space effects that do multiple dependent reads per display pixel (eg. SSAO), but are strongly confined to buffer fragments surrounding the target pixel. Such situations would value having many LS/Tex units at little penalty.
But I do still like the idea of an independently scalable "memory" core (containing load/store/texture pipes) to complement a "math" core (containing ALUs). A high-performance system targeting a 720p display will likely consume far less bandwidth than one targeting a 4K display, and as such, it would be nice to trade LS/Tex units for more ALUs in such a case.
Such an arrangement may also give ARM more leeway when making predictions about a new architecture -- no doubt the Midgard arch was in development many years before it saw implementation in a retail product, which means that ARM would have had to guess trends (eg. resolution) far in advance to attain the right balance of on-chip units per core; independently scalable memory cores would be more forgiving if the trends turned out not to match the initial predictions.
seanlumly - Friday, July 4, 2014 - link
Actually, I am starting to understand the motivation behind the ratio of ALUs to memory units in a Midgard core. I notice in GFXBench 3.0 "Manhattan" that the Mali T760MP4 (Rockchip RK3288) performs incredibly well at 720p, but its performance drops off more than proportionately as the resolution increases. This may imply that in these higher-than-720p scenarios the 4-core variant of the GPU may not be able to keep up with the memory demands, as computation should scale very close to proportionally.
Thus the 1080p offscreen score for the Mali T760 MP4 in the GFXBench 3.0 (offline) database may be misleading, as the MP4 may be a bit small for this resolution, and thus the performance may be low relative to its competition. A T760 MP8 would likely more than double the performance for a doubling of the resolution, pushing something like the Mali T760 MP8 well beyond the competition, at what I suspect are similar levels of power and die size. I predict that a T760 MP8 would get slightly north of 16fps in GFXBench 3.0 (assuming adequate bandwidth to DDR). Even an MP6 variant of the GPU (as was the case with the T628 MP6 in the Exynos 5420) should put it more-or-less on par with the competition!
The Mali performs even better in the GFXBench 2.7 "T-Rex" test, where a small 4-core Mali T760 MP4 surpasses the competition at 720p and even sub-720p resolutions in some instances! This is incredible. In this case, it seems that the test is more computation bound, as there is a more proportional scaling between performance and resolution.
I hope that future SoCs consider using the T760 in higher-core-count configurations. I still like the idea of a Memory-core, though I have little doubt that a Mali GPU of evenly matched size can go toe-to-toe with the competition.
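The scaling argument above boils down to pixel-count arithmetic; here is a small sketch with placeholder frame rates (these are not GFXBench results):

# If a test is compute-bound, fps scales roughly with 1/pixels and roughly
# linearly with core count. The 720p score below is a placeholder.

def scaled_fps(fps, pixels_from, pixels_to, core_ratio=1.0):
    return fps * (pixels_from / pixels_to) * core_ratio

px_720p, px_1080p = 1280 * 720, 1920 * 1080
fps_mp4_720p = 20.0                                       # placeholder score

print(scaled_fps(fps_mp4_720p, px_720p, px_1080p))         # ~8.9 fps, MP4 at 1080p
print(scaled_fps(fps_mp4_720p, px_720p, px_1080p, 2.0))    # ~17.8 fps, MP8 at 1080p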
Frenetic Pony - Friday, July 4, 2014 - link
Every time I read an overview of an SoC GPU I am so, so glad I don't do anything with mobile stuff. "We support tessellation! I mean, don't actually do it. Ever. But you know, it's supported."
kkb - Friday, July 4, 2014 - link
How come there is no comparison with Intel GPUs like the ones in Bay Trail?
darkich - Friday, July 4, 2014 - link
Because there is no comparison to be made. That GPU is completely inferior compared to the latest Mali, PowerVR and Adreno architectures.
Krysto - Friday, July 4, 2014 - link
Word.
kkb - Monday, July 7, 2014 - link
Well... I don't really agree. Please look at the AT review from last week or so: http://www.anandtech.com/show/8197/samsung-galaxy-...
The MeMO Pad is a Bay Trail product and definitely performs better than Mali devices.
darkich - Monday, July 7, 2014 - link
Get your facts and reading skills in order.
Firstly, the GPU in the MeMO Pad is definitely not "definitely performing better" than even the Mali T628; in fact those very tests show it trades blows with it, mostly due to the much lower resolution screen.
Secondly, do you realize that the T760 is MUCH faster than T628?
You can see here that it is basically comparable to the Tegra K1 and even the intimidating Series 6XT doesn't trounce it.
Rest assured that any of these three, as well as the Adreno 420, is way above the ULP HD Graphics chip.
kkb - Monday, July 7, 2014 - link
I hope you do understand how to read benchmark results. 3DMark and GFXBench (offscreen) results are resolution-independent. Now go and check the results in the article.
As for the T760, I will not comment on theoretical GFLOPS numbers until there is a real product. Even the theoretical MAD GFLOPS are not so great (roughly half) compared to others. I don't think anyone is going to fall for the marketing gimmick of counting the dot product as an extra 40 GFLOPS.
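For reference, the theoretical MAD number being argued about is just lanes times two FLOPs times clock; a sketch with placeholder values (these are not vendor specifications for the T760 or anything else):

# Theoretical FP32 MAD throughput: each lane retires one multiply-add (2 FLOPs)
# per clock. Lane count and clock below are placeholders, not official figures.

def mad_gflops(fp32_lanes, clock_hz):
    return fp32_lanes * 2 * clock_hz / 1e9

print(mad_gflops(fp32_lanes=160, clock_hz=600e6))   # 192.0 GFLOPS for this example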
darkich - Monday, July 7, 2014 - link
You said it definitely performs better, yet it loses on the T-Rex HD onscreen test and in Basemark X overall. That's not definite, you know.
Regardless, even if a case can be made that it performs slightly better overall than the Mali T628, it is without doubt outperformed by the:
ULP GeForce 3
Adreno 330
SGX G6320
.. and is *definitely* far outclassed by the Adreno 420, Kepler K1, SGX G6550, Mali T760MP6-10.
Until Intel shows their next generation of ULP graphics, I don't see a point in comparing the current one.
darkich - Monday, July 7, 2014 - link
Correction: I believe the GPU in Tegra 4 is internally referred to as the ULP GeForce 4, not 3.
fithisux - Friday, July 4, 2014 - link
Could you provide an exposition of the C66x architecture, since in my opinion it is suitable for GPGPU tasks and realtime software rendering/raytracing?
jann5s - Friday, July 4, 2014 - link
lol, I thought this expression was wrong: "the proof is in the pudding", but in fact I was wrong: http://en.wiktionary.org/wiki/the_proof_is_in_the_...
toyotabedzrock - Friday, July 4, 2014 - link
I wish you would have talked more about the GPU in the Nexus 10, since that is a shipping product. It would be nice to know how it differs from the newer Midgard designs.
seanlumly - Friday, July 4, 2014 - link
Another interesting point to make about the Mali architecture (one that goes unnoticed, but is significant) is that the anti-aliasing is fully pipelined, tiled (read: zero bandwidth penalty for the op), and very fast. MSAA 4x costs 1 cycle, MSAA 8x costs 2 cycles, and MSAA 16x costs 4 cycles. This means that if you have a scene full of fragment shaders running for more than 4 cycles (which is not too complex these days) you get the benefit of ultra-high-quality MSAA 16x for FREE.
There aren't too many examples of MSAA 16x online, but even MSAA 8x performs very well, with sharp, non-blurry results, and is often the point of comparison. MSAA would produce very crisp edges devoid of aliasing and crawling during animation.
Of course, MSAA isn't perfect -- it isn't terribly helpful for deferred renderers -- but it certainly doesn't hurt them when its costs are nothing, even if you are planning to do a screen-space pass in post.
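The "free 16x MSAA" point above is an amortization argument: the per-tile cost disappears whenever the fragment shader is long enough to cover it. A tiny sketch, with the cycle costs taken from the comment and the shader length as a placeholder:

# Per-tile MSAA costs quoted above, hidden when the fragment shader runs for at
# least as many cycles as the resolve takes.
msaa_cycles = {"4x": 1, "8x": 2, "16x": 4}
shader_cycles = 6                      # placeholder fragment shader length

for mode, cost in msaa_cycles.items():
    verdict = "hidden" if cost <= shader_cycles else "adds cost"
    print(f"MSAA {mode}: {cost} cycle(s) -> {verdict}")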
toyotabedzrock - Friday, July 4, 2014 - link
Oddly, the best open-source driver is for the Adreno GPU. Perhaps you should ask that person what he knows about it?
ol1bit - Saturday, July 5, 2014 - link
This is another fantastic job, guys! Thanks to you, and thanks to ARM for sharing the Midgard details!
cwabbott - Saturday, July 5, 2014 - link
Well, if they won't cough up the information, then there's always freedreno... Rob Clark has reverse engineered basically everything you would want to know about the Adreno architecture, up to even more detail than this article. All that remains is to fill in the pieces based on the documentation he's written...
tuxRoller - Monday, July 7, 2014 - link
Yup, that is indeed what I said earlier.
I don't think Anand is interested in investigating that avenue, however. For one, he might fear that such published info would blacklist him from future Qualcomm info dumps (pretty far-fetched, imho, but AT likes the corporate relationships it's been able to cultivate). For another, AT doesn't seem to be terribly interested in OSS.
prabindh - Monday, July 7, 2014 - link
Do the GFLOPS give a measure of how much performance an OpenCL kernel can achieve on the ARM Mali architecture? Simple FP16/32 kernels doing a MAD and writing to global memory do not seem to match the GFLOPS calculated here. Are there other HW LD/ST limits in the pipeline (assuming no system memory bandwidth limitations)?
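One way to sanity-check measured against theoretical numbers for a MAD kernel is to count the FLOPs explicitly. A back-of-the-envelope sketch: the work size, MAD count, kernel time and theoretical peak are all placeholders, and in practice a kernel that only MADs and stores one value per work-item tends to be limited by the global memory write rather than by the ALUs.

# Achieved vs. theoretical GFLOPS for a kernel where each work-item performs
# K dependent MADs and writes one result. All numbers are placeholders.

work_items = 1 << 20            # global work size
mads_per_item = 256             # MADs per work-item
kernel_time_s = 8.0e-3          # measured kernel execution time (placeholder)

flops = work_items * mads_per_item * 2      # one MAD counts as 2 FLOPs
achieved = flops / kernel_time_s / 1e9

theoretical = 100.0             # from lanes * 2 * clock, for your part
print(f"achieved {achieved:.1f} GFLOPS ({achieved / theoretical:.0%} of theoretical)")

Amadiro - Friday, July 11, 2014 - link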
So how do the dFdx/dFdy operators work for such a non-wavefront-based design? Does using them imply a huge overhead/stall/context-switch of some sort on this kind of architecture?
MrSpadge - Wednesday, August 13, 2014 - link
Thanks ARM, Ryan and AnandTech for this very interesting article! This finally gives this "yet another SoC GPU irrelevant to me" a face. And I think ARM could employ this design for some very nice GP-GPU (co-)processors:
- FP64 performance rivaling the best
- not relying on TLP could open up the GPU to entirely new application ranges where GPUs just didn't make sense yet
- ARM can scale the amount of compute per core easily
I know building a massive GP-GPU chip is anything but trivial... but this seems an architecture worthy of it!
manoj1919 - Friday, September 25, 2015 - link
Hi Anand,
Nice article! I am working with Mali GPUs, and I am a researcher trying to figure out the power management capabilities of the Mali GPU. The last part of this article seems to cover a bit of that. Can you please point me to the source of ARM's slide showing the gating techniques of the Mali GPU (the last figure in the article)?
Thanks in advance,
Manoj
gregware - Monday, February 20, 2017 - link
Interesting article, thanks!