Memory Subsystem & Latency: Quite Different

The memory subsystem comparisons for the Snapdragon 888 and Exynos 2100 are very interesting for a couple of reasons. First of all, these new SoCs are the first to use new higher-frequency LPDDR5-6400 memory, which is 16% faster than last year’s LPDDR5-5500 DRAM used in flagship devices.

On the Snapdragon 888 side of things, Qualcomm has said that it made significant progress this generation in improving memory latency, which has generally been a weak point of the previous few generations, although they always did keep improving things gen-on-gen.

On the Exynos 2100 side, Samsung’s abandonment of their custom cores also means that the SoC is now very different to the Exynos 990. The M5 used to have a fast-path connection between the cores and the memory controllers – exactly how Samsung reimplemented this in the Exynos 2100 will be interesting.

Starting things off with the new Snapdragon 888, we are seeing some very significant changes compared to the Snapdragon 865 last year. Full random memory latency went down from 138ns to 114ns, which is a massive generational gain given that Arm always quotes that 4ns of latency equals 1% of performance.
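As a rough sanity check of what that rule of thumb implies for these two measurements, the sketch below simply applies it linearly. The function name and the linear scaling are assumptions made for illustration; the quoted figure is a rule of thumb, not a precise performance model.

```python
def estimated_perf_gain_pct(old_latency_ns, new_latency_ns, ns_per_percent=4.0):
    """Rule-of-thumb estimate: each `ns_per_percent` ns of latency saved is worth about 1%."""
    return (old_latency_ns - new_latency_ns) / ns_per_percent


# Snapdragon 865 (138ns) versus Snapdragon 888 (114ns), full random latency
print(estimated_perf_gain_pct(138, 114))  # 6.0
```

Under that assumption, the 24ns reduction would be worth roughly 6% of performance from memory latency alone.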

Samsung’s Exynos 2100 on the other hand doesn’t look as good: at around 136ns at 128MB test depth, it is considerably worse than the Snapdragon 888, and actually a regression compared to the Exynos 990 at 131ns.

Looking closer at the cache hierarchies, we’re seeing 64KB of L1 caches for both X1 designs – as expected.

What’s really weird though is the memory patterns of the X1 and A78 cores as they transition from the L2 caches to the L3 caches. Usually, you’d expect a larger latency hump into the tens of nanoseconds. However, on both the Cortex-X1 and Cortex-A78, on both the Snapdragon and the Exynos, we’re seeing L3 latencies between 4 and 6ns, which is far faster than any previous generation L3 and DSU design we’ve seen from Arm.

After experimenting a bit with my patterns, the answer to this weird behaviour is quite amazing: Arm is prefetching all these patterns, including the “full random” memory access pattern. My tests here consist of pointer-chasing loops across a given depth of memory, with the pointer-loop being closed and always repeated. Arm seems to have a new temporal prefetcher that recognizes arbitrary memory patterns and will latch onto them and prefetch them in further iterations.
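To make the “closed pointer loop” idea concrete, here is a minimal sketch of how such a chain can be constructed. It is not the test code used for these graphs: the function names are hypothetical, and Sattolo’s algorithm is just one assumed way of building a random chain that forms a single cycle. Python is also far too slow to measure nanosecond latencies, so this only illustrates the access pattern, not the timing.

```python
import random


def build_pointer_chase(num_lines, seed=0):
    """Return `nxt`, where nxt[i] is the index of the next cache line to visit.

    Sattolo's algorithm produces a permutation consisting of one single cycle,
    so following the pointers visits every line exactly once before looping.
    """
    rng = random.Random(seed)
    nxt = list(range(num_lines))
    for i in range(num_lines - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what forces a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the visit order."""
    order, idx = [], start
    for _ in range(steps):
        order.append(idx)
        idx = nxt[idx]
    return order


nxt = build_pointer_chase(1024)   # e.g. a 64KB depth with 64-byte lines
first_pass = chase(nxt, 1024)
second_pass = chase(nxt, 1024)
print(len(set(first_pass)) == 1024)  # every line visited exactly once
print(first_pass == second_pass)     # identical order on every repetition
```

Because every repetition walks the lines in exactly the same order, a prefetcher that records the sequence of misses has something to latch onto, even though the order looks random within a single pass.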

I re-added an alternative full random access pattern test (“Full random RT”) into the graph as alternative data-points. Instead of being pointer-chase based, this variant computes a random target address at runtime before accessing it, meaning the access pattern is always different on repeated loops of a given memory depth. The curves here aren’t as nice or as tight as those of the pointer-chase variant, because the test currently guarantees neither that it visits every cache line at a given depth nor that it avoids revisiting a cache line within a test depth loop. This is why some of the latencies are lower than those of the “Full random” pattern; just ignore these parts.
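The coverage caveat can be illustrated with a small sketch. It assumes each target is drawn uniformly at random from the lines in the test depth, which may differ from how the actual test computes its addresses; the function names are hypothetical.

```python
import random


def runtime_random_pass(num_lines, seed):
    """One pass of `num_lines` accesses, each target computed on the fly."""
    rng = random.Random(seed)
    return [rng.randrange(num_lines) for _ in range(num_lines)]


def coverage(accesses, num_lines):
    """Fraction of distinct cache lines touched by a pass."""
    return len(set(accesses)) / num_lines


def expected_coverage(num_lines):
    """Expected fraction of lines hit by `num_lines` uniform random accesses."""
    return 1.0 - (1.0 - 1.0 / num_lines) ** num_lines


lines = 16384  # e.g. a 1MB depth with 64-byte lines
print(coverage(runtime_random_pass(lines, seed=1), lines))
print(expected_coverage(lines))  # approaches 1 - 1/e for large depths
```

Under that uniform assumption, a pass touches only a bit under two thirds of the lines at a given depth and revisits the rest, so the effective footprint is smaller than the nominal depth and some accesses hit in a nearer cache level. Each pass also follows a different sequence, so there is no repeating pattern to learn.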

This alternative pattern also more clearly reveals the 512KB versus 1MB L2 cache difference between the Exynos’ X1 core and the Snapdragon’s X1 core. Both chips have 4MB of L3, which is pretty straightforward to identify.

What’s odd about the Exynos is the linear access latencies. Unlike the Snapdragon, whose latency grows at 4MB and then remains relatively the same, the Exynos sees a second latency hump around the 10MB depth mark. It’s hard to see in the other patterns here, but it’s actually present there as well.

This post-4MB L3 cache hierarchy is actually easier to identify from the perspective of the Cortex-A55 cores. We see a very different pattern between the Exynos 2100 and the Snapdragon 888 here, which again confirms that latencies are lower up until around the 10MB depth.

During the announcement of the Exynos 2100, Samsung had mentioned they had improved and included “better cache memory”, which in context of these results seems to be pointing out that they’ve now increased their system level cache from 2MB to 6MB. I’m not 100% sure if it’s 6 or 8MB, but 6 seems to be a safe bet for now.

In these A55 graphs, we also see that Samsung continues to use 64KB L2 caches, while Qualcomm makes use of 128KB implementations. Furthermore, it looks like the Exynos 2100 makes the full speed of the memory controllers available to the A55 cores, while the Snapdragon 888 puts a hard limit on them, hence the very bad memory latency. This is similar to what Apple does in their SoCs when just the small cores are active.

Qualcomm seems to have completely removed the CPU cluster’s access to the SoC’s system cache, as even the Cortex-A55 cores don’t look to have access to it. This might explain why the CPU memory latency has been greatly improved this generation, as memory traffic now has to make one whole hop less. In theory, this would also put less pressure on the SLC, and allow the GPU and other blocks to more effectively use its 3MB size.


123 Comments

  • Andrei Frumusanu - Tuesday, February 9, 2021 - link

    I don't have 5G coverage here so it's not feasible for me to test.
  • Edwardmcardle - Wednesday, February 10, 2021 - link

    Will you be testing reception differences e.g. 4g and wifi? Fantastic write up as always!
  • Dorkaman - Tuesday, February 9, 2021 - link

    Different s21 ultra phones can have different performance says tech chap

    https://youtu.be/yuNNmf2gIRc

    I guess this is due to binning and his tests show his Exynos 2100 is in the middle. Strange. Also, the battery life is better on the 888 and external temps are about the same.
  • serendip - Tuesday, February 9, 2021 - link

    All this really doesn't look good for Windows on ARM if we're stuck with hot and hungry Qualcomm chips on Samsung 5nm. The 8cx and SQ on TSMC 7nm were very efficient but that's with slower A76 cores. I'm hoping a quad-X1 design on TSMC 5nm will be in the next iteration of the Surface Pro X or Galaxy Book S.
  • Raqia - Tuesday, February 9, 2021 - link

    Disappointing sustained performance, however the S21 series lacks the phase change vapor chamber cooling solution of the S20's:

    https://9to5google.com/2021/01/18/samsung-galaxy-s...

    vs

    https://www.ifixit.com/News/43501/why-samsung-buil...

    Notably the Mi11 has this:

    https://gadgettendency.com/a-triple-chamber-as-a-s...

    This makes for better subsequent runs but the SoCs built on 5LPE are still disappointing.
  • iphonebestgamephone - Wednesday, February 10, 2021 - link

    Mi 11 may have the vapor chamber for better cooling, but it also allows for a higher battery temperature. If they throttled at the same temps we could see how useful that thing actually is.
  • dudedud - Wednesday, February 10, 2021 - link

    Not all S20s had that vapor chamber. Some just had a graphene layer, which in theory would give similar results. Don't know if the S21 uses graphene tho.
  • darkich - Wednesday, February 10, 2021 - link

    The battery life benchmarks are an indication of how invalid AnandTech's whole premise actually is.
    Pretty much ALL actual real usage tests have shown BIG improvements in the autonomy between the S20/S21 yet Andrei wants us to believe in the stupid benchmark test that shows "regression" between Exynos 990 and Exynos 2100.
    What a joke..
  • ChrisGX - Sunday, February 14, 2021 - link

    I would say these results are incomplete rather than invalid. The PCMark Work 2.0 - Battery Life test is a demanding mixed usage benchmark. When running that benchmark it isn't exactly a shock that the Exynos 2100 S21 Ultra should return very slightly reduced battery life than the Exynos 990 S20 Ultra. Anandtech isn't alone in noting that when processing demanding workloads the Exynos 2100 draws more power (on average) than the Exynos 990. Andrei, for his part, is explicit that the Exynos 2100 is also significantly more performant than its predecessor. He does say that the increased performance wasn’t just achieved through improved efficiency, but also through greater power usage and it is hard to dispute that looking at the numbers.

    There is a gap in the data however. The full PCMark Work 2.0 - Battery Life test involves a Work performance score that gives a more complete picture of how much work/the rate that work is being completed while executing the test. That would be very useful information to have. Still, it is undoubtedly the case that the reduction in battery life that Andrei mentions is not due to a regression but rather the increased rate that the Exynos 2100 is executing work (when processing demanding mixed usage workloads). While that information isn't provided in connection to the PCMark Work 2.0 - Battery Life test the GFXBench GPU heavy test data (arranged in Power Efficiency tables) does confirm the high power draw of the Exynos 2100 during peak performance bursts (which must bump up average power consumption as well) even as that chip roundly outperforms the Exynos 990.

    Indeed, heavy mixed usage workloads are not going to put the Exynos 2100 battery life in the best light. Still, Andrei did show the results from a Web Browsing Battery Life test that undoubtedly will be useful to a lot of phone users who don't view the results of the PCMark Work 2.0 - Battery Life test as having a lot of relevance for them. But, I, for one, am happy to have that information.

    Andrei seems to be adding to/reworking the battery life data in this review.

    https://benchmarks.ul.com/pcmark-android
    https://s3.amazonaws.com/download-aws.futuremark.c...
  • sachouba - Wednesday, February 10, 2021 - link

    Nice peak power consumption!
    It doesn't seem unlikely that we'll end up with a situation similar to Apple's battery gate on Snapdragon 888 devices, in a few years. Way to go!
