Thanks to AndreiF7's excellent work in discovering the behavior, we kicked off our investigation into Samsung's CPU/GPU optimizations on the international Galaxy S 4 in July and came away with two conclusions:

1) On the Exynos 5410, Samsung was detecting the presence of certain benchmarks and raising thermal limits (and thus max GPU frequency) in order to gain an edge on those benchmarks, and

2) On both Snapdragon 600 and Exynos 5410 SGS4 platforms, Samsung was detecting the presence of certain benchmarks and automatically driving CPU voltage/frequency to their highest state right away. On the Snapdragon platforms, all cores are also plugged in immediately upon benchmark detection.

The first point applied exclusively to the Exynos 5410-equipped version of the Galaxy S 4. We did a lot of digging to confirm that max GPU frequency (450MHz) was never exceeded on the Snapdragon 600 version. The second point, however, applied to many, many more platforms.
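
Mechanically there isn't much to it: a whitelist of benchmark package names and a governor override that kicks in when one of them comes to the foreground. The sketch below is a purely illustrative reconstruction of that behavior in Python, not vendor code; the package names are invented and the sysfs paths assume the standard Linux cpufreq/hotplug layout (root access required on a real device).

```python
# Purely illustrative sketch of the "benchmark detect -> max CPU frequency" behavior
# described above. This is NOT vendor code: the package names are invented, and the
# sysfs paths assume the standard Linux cpufreq/hotplug layout (root required).
import glob

BENCHMARK_WHITELIST = {
    "com.example.benchmark1",   # hypothetical entries; real lists name specific benchmark apps
    "com.example.benchmark2",
}

def boost_all_cores() -> None:
    """Bring every core online and pin its minimum frequency to its maximum."""
    for cpu in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
        try:
            with open(f"{cpu}/online", "w") as f:   # hotplug the core in
                f.write("1")
        except FileNotFoundError:
            pass                                    # cpu0 typically has no 'online' node
        with open(f"{cpu}/cpufreq/cpuinfo_max_freq") as f:
            max_khz = f.read().strip()
        with open(f"{cpu}/cpufreq/scaling_min_freq", "w") as f:
            f.write(max_khz)                        # min == max: no ramp-up, no idling down

def on_foreground_app_changed(package_name: str) -> None:
    if package_name in BENCHMARK_WHITELIST:
        boost_all_cores()
```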

The table below is a subset of devices we've tested, the silicon inside, and whether or not they do a benchmark detect and respond with a max CPU frequency (and all cores plugged in) right away:

I Can't Believe I Have to Make This Table (Y = cheats in that benchmark)

| Device | SoC | 3DM | AnTuTu | AndEBench | Basemark X | Geekbench 3 | GFXB 2.7 | Vellamo |
|---|---|---|---|---|---|---|---|---|
| ASUS Padfone Infinity | Qualcomm Snapdragon 800 | N | Y | N | N | N | N | Y |
| HTC One | Qualcomm Snapdragon 600 | Y | Y | N | N | N | Y | Y |
| HTC One mini | Qualcomm Snapdragon 400 | Y | Y | N | N | N | Y | Y |
| LG G2 | Qualcomm Snapdragon 800 | N | Y | N | N | N | N | Y |
| Moto RAZR i | Intel Atom Z2460 | N | N | N | N | N | N | N |
| Moto X | Qualcomm Snapdragon S4 Pro | N | N | N | N | N | N | N |
| Nexus 4 | Qualcomm APQ8064 | N | N | N | N | N | N | N |
| Nexus 7 | Qualcomm Snapdragon 600 | N | N | N | N | N | N | N |
| Samsung Galaxy S 4 | Qualcomm Snapdragon 600 | N | Y | Y | N | N | N | Y |
| Samsung Galaxy Note 3 | Qualcomm Snapdragon 800 | Y | Y | Y | Y | Y | N | Y |
| Samsung Galaxy Tab 3 10.1 | Intel Atom Z2560 | N | Y | Y | N | N | N | N |
| Samsung Galaxy Note 10.1 (2014 Edition) | Samsung Exynos 5420 | Y (1.4) | Y (1.4) | Y (1.4) | Y (1.4) | Y (1.4) | N | Y (1.9) |
| NVIDIA Shield | NVIDIA Tegra 4 | N | N | N | N | N | N | N |

We started piecing this data together back in July, and even had conversations with both silicon vendors and OEMs about getting it to stop. With the exception of Apple and Motorola, literally every single OEM we’ve worked with ships (or has shipped) at least one device that runs this silly CPU optimization. It's possible that older Motorola devices might've done the same thing, but none of the newer devices we have on hand exhibited the behavior. It’s a systemic problem that seems to have surfaced over the last two years, and one that extends far beyond Samsung.

Looking at the table above you’ll also notice weird inconsistencies in the devices/OEMs that choose to implement the cheat/hack/festivities. None of the Nexus devices do, which is understandable since the optimization isn't a part of AOSP. This also helps explain why the Nexus 4 performed so slowly when we reviewed it - this mess was going on back then and Google didn't partake. The GPe versions aren't clean either, which makes sense given that they run the OEM's software with stock Android on top.

LG’s G2 also includes some optimizations, just for a different set of benchmarks. It's interesting that LG's optimization list isn't as extensive as Samsung's - time to invest in more optimization engineers? LG originally indicated to us that its software needed some performance tuning, which helps explain away some of the G2 vs Note 3 performance gap we saw in our review.

The Exynos 5420's behavior is interesting here. Instead of switching over to the A15 cluster exclusively, it seems to alternate between running max clocks on the A7 cluster and the A15 cluster.

Note that I’d also be careful about those living in glass houses throwing stones here. Even the CloverTrail+ based Galaxy Tab 3 10.1 does it. I know internally Intel is quite opposed to the practice (as I’m assuming Qualcomm is as well), making this an OEM level decision and not something advocated by the chip makers (although none of them publicly chastise partners for engaging in the activity, more on this in a moment).

The other funny thing is the list of optimized benchmarks changes over time. On the Galaxy S 4 (including the latest updates to the AT&T model), 3DMark and Geekbench 3 aren’t targets while on the Galaxy Note 3 both apps are. Due to the nature of the optimization, the benchmark whitelist has to be maintained (now you need your network operator to deliver updates quickly both for features and benchmark optimizations!).

There’s minimal overlap between the whitelisted CPU tests and what we actually run at AnandTech. The only culprits on the CPU side are AndEBench and Vellamo. AndEBench is an implementation of CoreMark, something we added more as a way of looking at native vs. Java core performance than as an indicator of overall device performance. I’m unhappy that AndEBench is now a target for optimization, but it’s also not unexpected. So how much of a performance uplift does Samsung gain from this optimization? Luckily we've been testing AndEBench V2 for a while, which features a superset of the benchmarks used in AndEBench V1 and is of course built using a different package name. We ran the same native, multithreaded AndEBench V1 test using both apps on the Galaxy Note 3. The difference in scores is below:

Galaxy Note 3 Performance in AndEBench

| | AndEBench V1 | AndEBench V2 running V1 Workload | Perf Increase |
|---|---|---|---|
| Galaxy Note 3 | 16802 | 16093 | +4.4% |
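
For reference, the "perf increase" column is just the ratio of the two scores, i.e. how much higher the recognized app scores than the renamed one:

```python
whitelisted = 16802   # AndEBench V1, package name recognized by the device
renamed = 16093       # same V1 workload run from the V2 app (different package name)
print(f"+{(whitelisted / renamed - 1) * 100:.1f}%")   # -> +4.4%
```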

There's a 4.4% increase in performance from the CPU optimization. Some of that gap is actually due to differences in compiler optimizations (V1 is tuned by the OEMs for performance, V2 is tuned for compatibility as it's still in beta). As expected, we're not talking about any tremendous gains here (at least as far as our test suite is concerned) because Samsung isn't actually offering a higher CPU frequency to the benchmark. All that's happening here is the equivalent of a higher P-state vs. letting the benchmark ramp to that voltage/frequency pair on its own. We've already started work on making sure that all future versions of benchmarks we get will come with unique package names.

I graphed the frequency curve of a Snapdragon 600 based Galaxy S 4 while running both versions of AndEBench to illustrate what's going on here:

What we're looking at is a single loop of the core AndEBench MP test. The blue line indicates what happens naturally, while the red line shows what happens with the CPU governor optimization enabled. Note the more gradual frequency ramp up/down in the natural case. In the case of this test, all you're getting is the added performance during that slow ramp time. For benchmarks that repeat many tiny loops, these differences could definitely add up. In situations where everyone is shipping the exact same hardware, sometimes that extra few percent is enough to give the folks in marketing a win, which is why any of this happens in the first place.
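
To put a rough number on that intuition, here's a back-of-the-envelope model (my own simplification; the 50ms ramp time and the 0.3GHz/2.3GHz endpoints are assumed values, not measurements). The only extra work the pinned-to-max run gets is during the governor's ramp window, so the shorter the timed workload, the larger the apparent gain:

```python
# Back-of-the-envelope model (not measured data): compare work done when the CPU is
# pinned at f_max from t=0 against a governor that ramps linearly from f_min to
# f_max over ramp_ms and then stays there. All numbers below are illustrative.
def boost_gain(run_ms: float, ramp_ms: float, f_min: float, f_max: float) -> float:
    ramp = min(ramp_ms, run_ms)
    f_end = f_min + (f_max - f_min) * ramp / ramp_ms      # frequency reached by the end of the run/ramp
    work_ramped = (f_min + f_end) / 2 * ramp + f_max * (run_ms - ramp)
    work_pinned = f_max * run_ms
    return work_pinned / work_ramped - 1

print(f"{boost_gain(25, 50, 0.3, 2.3):.0%}")     # one isolated 25 ms micro-loop: enormous on paper
print(f"{boost_gain(5000, 50, 0.3, 2.3):.1%}")   # a sustained 5 s run: well under 1%
```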

Even when the Snapdragon 600 based SGS4 recognizes AndEBench, the optimization doesn't seem to get in the way of thermal throttling. After a few runs of the test I saw clock speeds drop down to under 1.7GHz for a relatively long period of time before ramping back up. I should note that the power/thermal profiles do look different when you let the governor work its magic vs. overriding things, which obviously also contributes to any performance deltas.
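
Reproducing these frequency curves doesn't require anything exotic; polling the cpufreq sysfs node over adb while the benchmark runs is enough. A minimal sketch, assuming adb is on your PATH, USB debugging is enabled, and cpu0's scaling_cur_freq is readable by the shell user:

```python
# Minimal CPU frequency logger of the sort used to produce the curve above.
# Assumptions: 'adb' is installed with a single device connected, and cpu0's
# scaling_cur_freq is readable by the adb shell user. Output is a simple CSV.
import subprocess
import time

NODE = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"

def sample_khz() -> int:
    out = subprocess.check_output(["adb", "shell", "cat", NODE])
    return int(out.decode().strip())

start = time.time()
with open("freq_log.csv", "w") as log:
    log.write("seconds,khz\n")
    while time.time() - start < 60.0:        # log one minute while the benchmark runs
        log.write(f"{time.time() - start:.2f},{sample_khz()}\n")
        time.sleep(0.1)                      # ~10 samples per second (adb round-trip permitting)
```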

Vellamo is interesting as all of the flagships seem to game this test, which sort of makes the point of the optimization moot:

Any wins the Galaxy Note 3 achieves in our browser tests are independent of the CPU frequency cheat/optimization discussed above. It’s also important to point out that this is why we treat our suite as a moving target. I introduced Kraken into the suite a little while ago because I was worried that SunSpider was becoming too much of a browser optimization target. The only realistic solution is to continue to evolve the suite ahead of those optimizing for it. The more attention you draw to certain benchmarks, the more likely they are to be gamed. We constantly play this game of cat and mouse on the PC side, it’s just more frustrating in mobile since there aren’t many good benchmarks to begin with. Note that pretty much every CPU test that’s been gamed at this point isn’t a good CPU test to begin with.

Don’t forget that we’re lucky to be able to catch these things so quickly. After our piece in July I figured one of two things would happen: 1) the optimizations would stop, or 2) they would become more difficult to figure out. At least in the near term, it seems to be the latter. The framework for controlling all of this has changed a bit, and I suspect it’ll grow even more obfuscated in the future. There’s no single solution here, but rather a multi-faceted approach to make sure we're ahead of the curve. We need to continue to rev our test suite to stay ahead of any aggressive OEM optimizations, petition the OEMs to stop this madness, and work with the benchmark vendors to detect and disable optimizations as they happen while avoiding benchmarks that are easily gamed. Honestly, this is the same list of things we do on the PC side, so we're doing it in mobile as well.

The Relationship Between CPU Frequency and GPU Performance

Things are trickier on the GPU front. GFXBench 2.7 (aka GLBenchmark) somehow avoids being an optimization target, at least for the CPU cheat we’re talking about here. There are always concerns about rendering accuracy, dropped frames, etc., but it doesn't look like anyone is doing that quite yet, and the next version of GFXBench should help make sure of it. Kishonti (the makers of GFX/GLBench) work closely with all of the OEMs and tend to do a reasonable job of keeping everyone honest, but they’re in a tricky spot as the OEMs also pay for the creation of the benchmark (via licensing fees to use it). Running a renamed version of GFXBench produced similar scores to what we already published on the Note 3, which ends up being a bit faster than what LG’s G2 was able to deliver. As Brian pointed out in his review, however, there are driver version differences between the platforms as well as differences in VRAM sizes (thanks to the Note 3's 3GB of total system memory):

Note 3: 04.03.00.125.077
Padfone: 04.02.02.050.116
G2: 4.02.02.050.141

Also keep in mind that both LG and Samsung will define their own governor behaviors on top of all of this. Even using the same silicon you can choose different operating temperatures you’re comfortable with. Of course this is another variable to game (e.g. increasing thermal headroom when you detect a benchmark), but as far as I can tell even in these benchmark modes thermal throttling does happen.

The two new targets are tests that we use: 3DMark and Basemark X. The latter tends to be quite GPU bound, so the impact of a higher CPU frequency is more marginalized, but with a renamed version we can tell for sure:

Galaxy Note 3 Performance in Basemark X

| | Basemark X | Basemark X - Renamed | Perf Increase |
|---|---|---|---|
| On screen | 16.036 fps | 15.525 fps | +3.3% |
| Off screen | 13.528 fps | 12.294 fps | +10% |

The on-screen differences make sense to me; it's the off-screen results that are a bit puzzling. I'm worried about what's going on with the off-screen rendering buffer. That seems to be too little of a performance increase if the optimization were dropping frames (if you're going to do that, you might as well go for the gold), but as to what is actually going on I'm not entirely sure. We'll keep digging on this one. The CPU optimization alone should net something around the 3% gain we see in the on-screen test.

3DMark is a bigger concern. As we discovered in our Moto X review, 3DMark is a much more balanced CPU/GPU test. Driving CPU frequencies higher can and will impact the overall scores here.

ASUS thankfully doesn’t do any of this mess with their Padfone Infinity in the GPU tests. Note that there are still driver and video memory differences between the Padfone Infinity and the Galaxy Note 3, but we’re seeing roughly a 10% performance advantage for the Note 3 in the overall 3DMark Extreme score (the Padfone also has a slightly lower clocked CPU - 2.2GHz vs. 2.3GHz). It's tough to say how much of this is due to the CPU optimization vs. how much is down to driver and video memory differences (we're working on a renamed version of 3DMark to quantify this exactly).

The Futuremark guys have a lot of experience with manufacturers trying to game their benchmarks so they actually call out this specific type of optimization in their public rules:

"With the exception of setting mandatory parameters specific to systems with multiple GPUs, such as AMD CrossFire or NVIDIA SLI, drivers may not detect the launch of the benchmark executable and alter, replace or override any parameters or parts of the test based on the detection. Period."

If I'm reading it correctly, both HTC and Samsung are violating this rule. What recourse Futuremark has against the companies is up in the air, but here we at least have a public example of a benchmark vendor not being ok with what's going on.

Note that GFXBench 2.7, where we don't see anyone run the CPU optimization, shows a 13% advantage for the Note 3 vs. Padfone Infinity. Just like the Exynos 5410 optimization there simply isn't a lot to be gained by doing this, making the fact that the practice is so widespread even more frustrating.

Final Words

As we mentioned back in July, all of this is wrong and really isn't worth the minimal effort the OEMs put into even playing these games. If I ran the software group at any of these companies, the cost/benefit analysis of chasing these optimizations vs. the negativity in the press would make this an easy decision (not to mention the whole morality argument). It's also worth pointing out that nearly all Android OEMs are complicit in creating this mess. We singled out Samsung for the initial investigation as they were doing something unique on the GPU front that didn't apply to everyone else, but the CPU story (as we mentioned back in July) is a widespread problem.

Ultimately the Galaxy Note 3 doesn’t change anything from what we originally reported. The GPU frequency optimizations that existed in the Exynos 5410 SGS4 don’t exist on any of the Snapdragon platforms (all applications are given equal access to the Note 3’s 450MHz max GPU frequency). The CPU frequency optimization that exists on the SGS4, LG G2, HTC One and other Android devices still exists on the Galaxy Note 3. This is something that we’re going to be tracking and reporting more frequently, but it’s honestly no surprise that Samsung hasn’t changed its policies here.

The majority of our tests aren’t impacted by the optimization. Virtually all Android vendors appear to keep their own lists of applications that matter and need optimizing. The lists grow/change over time, and they don’t all overlap. With these types of situations it’s almost impossible to get any one vendor to be the first to stop. The only hope resides in those who don’t partake today, and of course with the rest of the ecosystem.

We’ve been working with all of the benchmark vendors to try and stay one step ahead of the optimizations as much as possible. Kishonti is working on some neat stuff internally, and we’ve always had a great relationship with all of the other vendors - many of whom are up in arms about this whole thing and had been working on ways to defeat it long before now. There’s also a tremendous amount of pressure the silicon vendors can put on their partners (although not quite as much as in the PC space, yet), not to mention Google could try to flex its muscle here as well. The best we can do is continue to keep our test suite a moving target, avoid using benchmarks that are very easily gamed and mostly meaningless, continue to work with the OEMs in trying to get them to stop (though that's tough for the international ones), and work with the benchmark vendors to defeat optimizations as they are discovered. We're presently doing all of these things and we have no plans to stop. Literally all of our benchmarks have either been renamed or are in the process of being renamed to non-public names in order to ensure simple app detects don't do anything going forward.

The unfortunate reality is this is all going to get a lot worse before it gets better. We wondered what would happen with the next platform release after our report in July, and the Note 3 told us everything we needed to know (you could argue that it was too soon to incite change, perhaps SGS5 next year is a better test). Going forward I expect all of this to become more heavily occluded from end user inspection. App detects alone are pretty simple, but what I expect to happen next are code/behavior detects and switching behavior based on that. There are thankfully ways of continuing to see and understand what’s going on inside these closed platforms, so I’m not too concerned about the future.

The hilarious part of all of this is we’re still talking about small gains in performance. The impact on our CPU tests is 0 - 5%, and somewhere south of 10% on our GPU benchmarks as far as we can tell. I can't stress enough that it would be far less painful for the OEMs to just stop this nonsense and instead demand better performance/power efficiency from their silicon vendors. Whether the OEMs choose to change or not however, we’ve seen how this story ends. We’re very much in the mid-1990s PC era in terms of mobile benchmarks. What follows next are application based tests and suites. Then comes the fun part of course. Intel, Qualcomm and Samsung are all involved in their own benchmarking efforts, many of which will come to light over the coming years. The problem will then quickly shift from gaming simple micro benchmarks to arguing over which “real world” tests are unfairly optimized for which architectures. This should all sound very familiar. To borrow from Brian’s Galaxy Gear review (and BSG): “all this has happened before, and all of it will happen again.”

Comments

  • bpondo - Wednesday, October 2, 2013 - link

    Samsung astro-turfers representing the crazy tinfoil hat brigade.
  • SpacedCowboy - Sunday, October 6, 2013 - link

    Hey, at least they don't pay people to AstroTurf...
  • PeteH - Wednesday, October 2, 2013 - link

    Anand specifically called out Apple and Motorola as the only two OEMs that never cheated.
  • Wilco1 - Thursday, October 3, 2013 - link

    Failure to prove they did "cheat" is not proof they didn't "cheat". There are a million ways to recognise a benchmark without using its name. And showing the correct performance of a benchmark is not cheating.
  • lilo777 - Wednesday, October 2, 2013 - link

    I am a little disappointed that AT succumbed to fanboi pressure to declare Samsung a cheater in the Galaxy Note 3 review. It appears that the situation is not as black and white as some are trying to present it. I would expect a more nuanced approach from AT. Unlike in some "classic" cases of benchmark cheating, nobody disputes that these Android phones are actually technically capable of running benchmarks and all other applications at the speed reported by the benchmark apps. Sure, some benchmark apps get "preferential" treatment. I believe it was AT who originally discovered this issue on SGS4 and back then you reported that not just benchmarks but some other apps were executed in accelerated mode (the camera app was one example). Now, let's take a broader look. What we are dealing with here is the approach where resource allocation uses application name as one factor. Is this a new approach? It is not. Remember how Nvidia Optimus used application name lists for choosing which applications should be run on the discrete GPU and for which ones the IGP was good enough? Was it cheating too?

    Sure, Samsung could let users decide which applications should be executed at full "speed" (frequency, voltage, whatever). Would that be a good idea? I am not sure. Given the number of apps and the type of customers who use smart phones, that could be problematic. They might work with major app developers to allow third party apps to get the benefits of CPU potential where it's appropriate.

    In any case, to call it "cheating" is a disservice to people who want to know the actual facts.
  • ddriver - Wednesday, October 2, 2013 - link

    Calling this cheating is technically incorrect. While increasing the thermal limit does sound like cheating, when Intel processors do overclock (especially in single threaded heavy workloads) it is a feature? I am fairly certain that overclocking impacts performance significantly more than what your tests indicate for this "cheat".

    Sending an explicit hint to the CPU for a known performance critical workload - it does not make the CPU any faster than it already is. This is as much cheating as a programmer using an explicit prefetch to warm up data that spans across multiple lines for faster execution from the cache and not wait on the ram. It is an optimization, and if anything, it ENSURES the realistic possible maximum, which is what a benchmark is used for. If anything, every benchmark should explicitly hint the platform itself to be on its toes for its duration. Calling this "cheating" in your little "unbelievable" childish chart... somehow really doesn't suit you.

    What you are vastly overlooking is the main reason for the performance delta - the duration of the workloads. The problem of most of those benchmarks is that they consist of a series of short tests, so what happens is between each test the CPU goes to a lower state, and when the next test runs, the time it takes for the platform to analyze the workload and to boost the clocks is too long relative to the way too short test. So it ends up eating a lot of "score...

    Needless to say, this does not happen in the real world. Number crunching is 99% of the time a sustained, continuous workload. This means the impact of power management will exponentially diminish as the duration increases. CPUs are benchmarked to determine how high performance is. And how high is performance is measured in HPC conditions. Real world performance workloads are continuous, not short pulses.

    But don't take my word for it; I implore you to head over to stackoverflow and ask whether a 25 msec loop test can provide what may be considered an adequate measure of performance. The 25 msec your own graph reveals. It is really the benchmark implementers' fault for not doing that absolutely mandatory optimization in the first place. You cannot chain together a bunch of random short tests and pretend for this to be an adequate measure of real-world performance, which is what is impacted mostly by power management - stuff that takes too little time already. There isn't much of a difference if a tab opens in 5 or 10 msec, but there is a significant difference if, say, your video edit takes 5 or 10 minutes, which is where performance is really needed, and where power saving features won't impact performance. That would be an adequate measure of how high performance is in a real world scenario.

    When people run at the Olympic games, is there a discipline in which the result is measured at such a minuscule magnitude? There is a good reason those guys and gals run no less than 100 meters. This kind of CPU benchmark combined with power management is akin to making Olympic runners run 50 cm. Run 50 cm, stop, run 50 cm, stop.... does this give a good measure of an athlete's performance? Is it even a good idea? I don't think so, because this measures how fast you can start and stop, not how fast you can run. If anything, it should be a benchmark about how well power saving features work and at what grain and latency they analyze and kick in. Not about the actual maximum chip performance.

    Sorry to have to say it, but this looks like anti-PR, in which case it would be highly unethical, especially if endorsed financially or otherwise by direct Samsung and overall Android competitors. But your justification is not technically sound. It may pass for most of your readers, but certainly won't go unnoticed. Looks like you are trying to exploit poorly written benchmarks to influence the buyer’s mind, especially now, during this bonanza of new products and spikes in sales, with re-energized cash flow and an industry eager to get a bigger part of it, including “anti-competitive” competition that involves harming the competitors to be “relatively better” instead of simply besting them. As for suspects - I'd look at the few vendors that are not present in this "witch hunt" even though they are known to have plenty of dirt on their hands already, the main benefactors of this "public shaming". Which makes this whole deal kind of hypocritical in two ways - the vendors resorting to such practices considering their own track record, and the attempt to hide such a dishonest act under a veil of a plea for honesty.

    Is this a site about tech, or a site that exploits tech to drive corporate interests? I think I gave plenty of technical argumentation on the subject. I used to be a reader years ago; recently I returned to find this place significantly changed... The world doesn't need another Engadget. One is one too many...
  • watzupken - Wednesday, October 2, 2013 - link

    Quoting this,
    Calling this cheating is technically incorrect. While increasing the thermal limit does sound like cheating, when Intel processors do overclock (especially in single threaded heavy workloads) it is a feature?

    I think you are pretty confused over the difference. Intel chips boost when loaded and when there is thermal headroom, that is true. But they boost across a range of applications, and do not selectively boost only when running benchmarks. The Exynos is programmed to wake all cores and stay at full speed when it recognizes a benchmark being launched. The former boost from Intel gives you actual improvement in user experience across the board. Like I can be doing video conversion, and I am benefiting from the boost. The latter only benefits benchmark results, and will not make your day to day usage of the phone any better. The motive is clear.
  • ddriver - Wednesday, October 2, 2013 - link

    I agreed to a certain extent that increasing the thermal envelope can be considered cheating, since it robs you of battery life, which is not normal, something that doesn't matter on a desktop system that much, and probably the reason ARM chips are not providing the feature in the first place. Considering the workload is 3D gaming, it is not something you rush into completing, so it is a little unfair to boost unless it makes the difference between a choppy and a smooth experience, which is the only case that justifies the extra battery drainage, even if the CPU is not the major power consumer in a typical mobile device. But this only applies to the GPU tests; the claim that the CPU clock hint is a cheat is entirely ungrounded. Which is mostly what this whole story is all about.

    And if you want "across the board" improvements due to this hack, you should look at developers; Samsung cannot go after every one of them and do it for them, especially considering it should be done internally and in a much more finely grained way, not on the OS level and determined by the file name, which is quick and dirty, but certainly not cheating. Samsung and the rest just took care of the applications whose numbers influence people. The very fact this hack is applied indicates that Samsung did their homework and analyzed and detected a weakness in the benchmark design that can easily be improved. But even their resources don't stretch infinitely to do it for every application, considering in most cases it is not really possible without wasting battery life, but for those for which it is possible, why not do it? The battery drainage will be negligible, which is the cost of learning how fast the CPU is on its own.
  • ddriver - Wednesday, October 2, 2013 - link

    What is worse, ever since the Note 3 popped into existence, there has been an excruciating amount of effort to tarnish it as a product. While it does seem at this point that the software is still buggy, the rest of it is just dirty corporate tricks, a tad more dirty than the amount of "cheating" caught.

    The funny part is that even without the "cheats" this device destroys pretty much everything on the market; while not the best at anything in particular, it is easily the best general purpose all-rounder - feature-rich, future proof and versatile in usage. So this only goes to show how desperate the competition is, unintentionally admitting to the qualities of the product they try to tarnish in this critical time of consumerism shopping frenzy.
  • Wilco1 - Thursday, October 3, 2013 - link

    Actually when Intel CPUs boost they also increase TDP. That's perhaps not cheating per se but certainly misleading as Intel doesn't tell you that (it's hidden in their datasheets).

    You seem to imply that other applications do not benefit from the maximum clock speed on Exynos (and other ARM) devices. This is completely false. The boost is not required for real workloads. Micro benchmarks which only run for a few milliseconds are not being recognised as real loads by the OS, and deservedly so. If only these benchmarks ran for a few seconds (rather than a few milliseconds...), they would not need to be boosted in order to show the correct CPU performance.
