33 Comments

  • Kevin G - Tuesday, September 30, 2014 - link

    Good to hear about the coming deep dive into the X-Gene. I'm really curious how it compares to Apple's A7 and A8 (granted they are targeted at radically different markets).

    Interesting. I thought that TI had gotten out of the SoC business. Perhaps that was only in the ultra mobile front?

    The m800 is interesting due to the DSPs. It should also be noted that while there are four chips on the cartridge, they are not in a SMP topology. Rather this cartridge should logically be seen as four nodes. Max memory per node is 8 GB which allows one 32 bit application a full 4 GB with a PAE-like extension enabling the rest for other applications.
  • DanNeely - Tuesday, September 30, 2014 - link

    They only left the smartphone/tablet market. They're still in all the other embedded markets they've always supported.
  • Lord of the Bored - Friday, December 19, 2014 - link

    And now they're back in the high-end business computer space, which they left decades ago!

    If only HP had designated that model the m990... what a wasted opportunity. :(
  • hamiltenor - Tuesday, September 30, 2014 - link

    For reference, here is TI's marketing stuff on their website for the A15 CPUs
    http://www.ti.com/lsds/ti/arm/keystone/arm_cortex_...
  • Wolfpup - Tuesday, September 30, 2014 - link

    Are these things actually competitive with x86? I mean obviously when you have something that doesn't scale well they can't be, but are they even for things that do? It doesn't really seem like ARM's that competitive even on the low end when it comes to things like tablets...
  • Flunk - Tuesday, September 30, 2014 - link

    It mostly depends on how much the ARM server will cost altogether, along with power efficiency. It doesn't have to be as powerful as an Intel system of the same size if, for example, it's half the price and uses half the energy. There are really too many variables to tell until they do a full test.

    A lot of server loads are very parallel so that probably isn't a concern.
  • andrewaggb - Tuesday, September 30, 2014 - link

    I suspect the cores are nowhere near the speed of a Xeon. But it should have decent I/O performance. I'm sure the DSP version will be very fast at a few things. Not sure if they announced what those things are yet :-)
  • patrickjchase - Wednesday, October 1, 2014 - link

    The DSP in these is the TI C66, which is actually an 8-wide VLIW CPU rather than a classic DSP. I've programmed them before and they're fast across a fairly wide range of applications, though they do require a much higher level of programmer skill than most CPUs.
  • MrSpadge - Tuesday, September 30, 2014 - link

    Don't discount them yet. ARM V8 can offer nice performance boosts, as Anand has shown using Apple hardware. And with 4-wide cores the X-Gene is more beefy than your typical smartphone / tablet CPUs. I hope they're giving it 4 memory channels for good reason (other than a crappy memory controller..)!
  • iwod - Tuesday, September 30, 2014 - link

    Intel still wins on I/O and network. An Atom CPU will beat this perf-wise. So the problem really is price and market fit.
  • ddriver - Tuesday, September 30, 2014 - link

    "Atom CPU will beat this perf-wise" - when? In 2020?
  • Wilco1 - Tuesday, September 30, 2014 - link

    "Our next version will beat ARM, next year, we promise!" - we've heard that since 2008...

    Once again Intel aimed too low with Silvermont. Avoton is significantly slower than X-Gene in terms of CPU performance (4-way aggressive OoO vs 2-way partial OoO - no contest). The IO/Network performance is a fraction of the X-Gene version too, with just a dual 1GbE port per core, while X-Gene does dual 10GbE per core. X-Gene also has double the memory, so overall performance will be significantly better in any possible application.
  • fteoath64 - Wednesday, October 1, 2014 - link

    True. Atom, now Bay Trail, is still slower than the ARM64 cores in CPU terms at the same clock speed. Power consumption-wise, it is demolished by ARM. You can see that Intel is lowering the TDPs of the big-core Haswell to address the lower-power market, knowing the Bay Trail microarchitecture does not really cut it in terms of performance per watt.
  • JohanAnandtech - Thursday, October 2, 2014 - link

    Can you back that up? I have not seen any sign that Intel's Bay Trail is slower in pure performance. Perf/watt is a different matter.
  • iwod - Thursday, October 2, 2014 - link

    Exactly. Why are people doubting Intel on performance? On perf/watt at the server level (a few W to tens of W), Intel pretty much dominates the game. You want a lower-power CPU for file serving? Atom has it there, and future Atoms will do even better. You want a little more CPU power? Broadwell-U is many times faster than even a top-end ARM64 CPU.

    So yes, Intel can't win the game in the mobile mW range. On the server side and its software ecosystem, Intel has everything; it's only a matter of market and price fit.
  • Wilco1 - Friday, October 3, 2014 - link

    People doubt Atom performance as it underperforms. Nobody is saying real Xeons are slow, just Atom.
  • Wilco1 - Friday, October 3, 2014 - link

    You don't really believe a 2-way partially out-of-order core can beat a 4-way aggressive OoO one?

    If you want hard evidence, check how a 2.2GHz A15 beats a 2.6GHz Avoton on single-threaded performance by a good margin, both on integer and FP (ignore the hardware accelerated AES test): http://browser.primatelabs.com/geekbench3/compare/...

    So given that A15 has ~30-35% higher IPC than Silvermont, and A57 has > 30% better IPC than A15, and X-Gene is faster still, it is pretty obvious that X-Gene at 2.4GHz beats Avoton even if it manages to turbo all cores to 2.6GHz. The large L3 cache and much faster memory system in X-Gene help as well, of course.
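
A back-of-the-envelope version of the estimate in the comment above, as a minimal sketch. It assumes single-thread performance scales as IPC times clock, takes the quoted IPC gains at their lower bounds, and treats X-Gene as merely equal to A57 (the comment says "faster still"). All inputs are the commenter's claims, not measurements, and the function and constant names are hypothetical.

```python
# Rough single-thread estimate built only from figures quoted in the comment.

def relative_performance(ipc_ratio: float, clock_ghz: float, ref_clock_ghz: float) -> float:
    """Relative speed versus a reference core, assuming perf ~ IPC x clock."""
    return ipc_ratio * clock_ghz / ref_clock_ghz

A15_VS_SILVERMONT_IPC = 1.30  # low end of the claimed "~30-35%"
A57_VS_A15_IPC = 1.30         # claimed "> 30%", taken at its lower bound

# Assumption: X-Gene is treated as equal to A57, which is conservative
# relative to the comment's "faster still".
ipc_ratio = A15_VS_SILVERMONT_IPC * A57_VS_A15_IPC
estimate = relative_performance(ipc_ratio, clock_ghz=2.4, ref_clock_ghz=2.6)

print(f"X-Gene @ 2.4 GHz vs Avoton @ 2.6 GHz: ~{estimate:.2f}x")
```

Under these assumptions the chain works out to roughly 1.56x, so the conclusion would hold even if each claimed IPC gain were overstated by a fair margin. It says nothing about whether the claimed gains themselves are accurate.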
  • shodanshok - Friday, October 3, 2014 - link

    Hi,
    Silvermont is not "partially out-of-order" or "moderately aggressive out-of-order".
    Please read David Kanter's dissection here: http://www.realworldtech.com/silvermont/4/
    In short: Silvermont OOOE is very aggressive, and its OOO memory capabilities are top notch.

    Regarding performance, I can not 100% predict which processor (X-Gene or Silvermont) is faster at single-thread level, but I bet X-Gene would be faster. However, Bay Trail (read: silvermont) is absolutely competitive with Cortex-A15, and generally faster. See http://www.anandtech.com/show/8197/samsung-galaxy-... and http://www.anandtech.com/show/7314/intel-baytrail-...
    While they are mainly browser-based benchmarks, Bay Trail is always faster than Cortex-A15, often by a significant margin.

    Moreover, even the GeekBench3 scores you provided show Bay Trail at least on par with A15 in single-core integer performance, with 1234 points for Tegra K1 and 1321 for the Intel chip (both chips had similar clocks).

    Anyway, the single most notable thing that instantly makes X-Gene a very interesting product is the two integrated 10GbE links. This is an area where Intel is (as often happens) too conservative, relying on external (and costly) network chips.

    Regards.
  • patrickjchase - Saturday, October 4, 2014 - link

    Wilco is harping on the fact that Silvermont's FP/vector path is in-order, so in that sense he's right.

    With that said, over-obsession with size (or width) is as common in microarchitecture as in other domains. I've seen plenty of cases where cores with high theoretical or peak IPC get clobbered by designs that look much slower on paper. It often comes down to memory subsystem design and/or branch prediction.

    Without stooping to the "who's better" argument, the fact that we're having this debate at all is prima facie evidence that A15 is such an underperformer (i.e. that its real-world performance is much lower than its width and clock rate suggest).
  • shodanshok - Saturday, October 4, 2014 - link

    Are you sure Silvermont's FP is in-order? From David Kanter's article posted above:

    "The floating point schedulers also have 8 entries each, but do not hold data. Instead, the input operands are read when the instruction is dispatched to the execution units (similar to Haswell’s unified scheduler) to minimize the movement of the larger 128-bit SSE data."

    If I remember correctly, the problem with Silvermont FP performance is that many instructions are not executed with single-cycle throughput. The Cortex-A15 FPU is probably faster... ;)

    Regards.
  • Wilco1 - Saturday, October 4, 2014 - link

    Check the Intel optimization manual; it clearly states the FP issue queues are in-order. Memory operations are similar, except that there is a small retry buffer that handles cache misses.
  • patrickjchase - Saturday, October 4, 2014 - link

    It was discussed ad nauseam in the (400+ post) discussion thread from the very same RWT article you cite. Look for the posts titled "No out-of-order in FP cluster of silvermont" (and yes, I'm the same "Patrick Chase", and "Wilco1" above is the same "Wilco"). It's rather depressing how it's always the same group of people with nothing better to do with our lives... :-)

    The gcc commit comments cited in the initial post in that thread are somewhat misleading in describing the memory pipe as in-order. That's true for the initial L1 D$ probe, but ops that miss L1 are then forwarded to an OoO reissue queue. The result is an in-order memory pipeline for L1 but OoO for L2/DDR, which is basically what you want given that arguably the main benefit of having OoO in the first place is that it allows you to execute past cache misses (compilers are good at scheduling around known latencies such as L1 D$ access - it's the unknown/variable ones that require runtime "help").

    I recall David Kanter confirming the in-order nature of Silvermont FP somewhere, but can't find that post.
  • Wilco1 - Saturday, October 4, 2014 - link

    Silvermont has a small reorder buffer and a mostly in-order FP pipeline. Memory operations are also mostly in-order, with no speculation whatsoever. All loads/stores always issue in program order (ie. there is no out-of-order issue at all). Stores with unknown addresses stall the whole pipeline until resolved. Loads with unknown address and cache misses go into a small retry queue after using an issue cycle, and are retried later (so there is limited out of order retry). This is almost identical to a hit-under-miss load/store pipeline on an in-order CPU.

    So Silvermont is limited and partially OoO. If you call this "aggressive" or "top notch" then what do you call eg. Cortex-A15 with its much larger reorder buffer, full OoO FP, int, branch and memory pipelines with full speculation? You'll run out of superlatives...

    AnandTech would do well to ditch their hopeless JS rubbish - the results are not even from the same browser (let alone same version of the same browser!), and different browsers show >2x difference in performance on eg. SunSpider. Using JS as an indication of browser performance is bad enough (as it isn't at all), but using JS benchmarks to claim better CPU performance is insane.

    Did you read what I wrote about the Geekbench integer test? I said I removed the AES subtest as that used hardware acceleration on BT but not on A15 which skews the score. Without that the A15 wins the integer test too despite running at a significantly lower frequency (remember that Avoton runs at 2.6GHz turbo).

    Yes, the large number of on-board interfaces on X-Gene is definitely a differentiator, and this improves the power efficiency equation. Personally I think the 4 DDR3L channels are more interesting, providing a whopping 60GB/s or 7.5GB/s per core!
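
The bandwidth figures in the comment above can be unpacked with simple arithmetic. This is only a sketch: the 60 GB/s, 7.5 GB/s and 4-channel numbers come from the comment, while the 64-bit (8-byte) channel width is the standard DDR3 data bus width and is an assumption as far as this thread goes. Variable names are hypothetical.

```python
# Unpack the quoted memory bandwidth figures.
TOTAL_BANDWIDTH_GBS = 60.0  # quoted peak bandwidth
PER_CORE_GBS = 7.5          # quoted per-core share
CHANNELS = 4                # "4 DDR3L channels"
BYTES_PER_TRANSFER = 8      # assumption: standard 64-bit DDR3 channel

implied_cores = TOTAL_BANDWIDTH_GBS / PER_CORE_GBS
per_channel_gbs = TOTAL_BANDWIDTH_GBS / CHANNELS
# Transfer rate each channel would need to sustain, in mega-transfers/s.
implied_mts = per_channel_gbs * 1000 / BYTES_PER_TRANSFER

print(f"implied core count:  {implied_cores:.0f}")
print(f"per-channel peak:    {per_channel_gbs:.1f} GB/s")
print(f"implied data rate:   {implied_mts:.0f} MT/s per channel")
```

The quoted numbers imply 8 cores sharing the bandwidth and about 15 GB/s per channel, which would correspond to roughly 1875 MT/s. That is close to the DDR3-1866 speed grade, so the 60 GB/s figure reads as a rounded theoretical peak rather than a measured result.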
  • patrickjchase - Saturday, October 4, 2014 - link

    I call A15 a big, power-hungry design that looks great on paper but that has consistently underperformed in the real world, probably due to memory subsystem and branch prediction limitations. Yes, it's 3-wide and has a massive ROB, but it performs no better than the much less aggressive A17 (and not much better clock-for-clock than Krait, for that matter - There's a reason why Snapdragons dominated the market in that generation).

    ARM have effectively admitted as much by backtracking to a much less aggressive design in A17.
  • Wilco1 - Sunday, October 5, 2014 - link

    The main reason A17 does so well is the streamlined memory system that was designed in the 2 years since the A15. Better prefetchers and the ability to do 2 loads or 2 stores per cycle easily give 20-30% gain. You can see a 2.2GHz comparison of A15 with A17 here: http://browser.primatelabs.com/geekbench3/compare/... - almost identical integer and FP scores, but A17 scores 15% better on memory.

    If any ARM CPU could be compared with Pentium 4, Krait would fit the description perfectly - it is a design for high frequency with low IPC. So it has to be clocked extremely high to get decent performance (there is a reason Samsung uses either 1.9GHz A15 or 2.5-2.7GHz Kraits in their phones - they have nearly identical performance).

    Krait's only redeeming factor is that its power consumption is very good. However Snapdragon's dominance in the last 2 years is unlikely to continue now that other SoCs finally support built-in modems and QC got the timing of the 64-bit generation badly wrong.
  • patrickjchase - Sunday, October 5, 2014 - link

    Now that I think about it some more, A15 can be described by a single "superlative": It's the Pentium-4 of the ARM world. Really.
  • shodanshok - Sunday, October 5, 2014 - link

    Wilco, patrick,
    Thank you for giving some reference for silvermont FP cluster.

    @wilco
    "Aggressive" and "top notch" were meant relative to current common ARM cores. Silvermont performance is quite good even compared to A15 which, as patrick noted, is impressive on paper but underperforms in real-world tests and consumes a noticeable amount of power. And until A12/A17, Cortex memory performance was quite bad.

    I think that the real plus of Intel cores is their very advanced prefetcher...

    Web benchmarks, while flawed, represent a real-world scenario which is quite significant for end users.

    Anyway, the X-Gene discussed above is quite a different beast ;)

    Regards.
  • TadzioPazur - Tuesday, October 14, 2014 - link

    The benchmark that you referenced (how relevant is it to server products?) shows that an ARM-based tablet is on par with the server Avoton, and would need to catch up once the playing field is leveled.
    Intel Atom C2750 @ 2.40 GHz trades blows with the nVidia tn8 @ 2.22 GHz (in single-threaded benchmarks).
    Now, let's see what the ramifications of making it a server-grade chip are:
    1. Buffered memory with ECC (latencies go up) - the Avoton already uses them
    2. Write-through D cache (that is ARM's solution for cache coherency)
    3. Optional addition of L3 cache (which would mitigate point 2 somewhat)
    4. Further latency increase if we go to multisocket machines (optional, would hurt both lines - not needed for SAN/NAS appliances)

    So no, ARM chips are not clearly better than low-power, low-performance Intel CPUs, at least performance-wise.
  • michael2k - Wednesday, October 1, 2014 - link

    Um, what world do you live in where ARM isn't competitive in tablets? In the real world it is Intel that isn't competitive. Apple managed to ship a 6-issue dual-core OoO processor at 1.3GHz last year and 1.4GHz this year, broadly comparable to a 2GHz 4-issue dual-core Core Duo that Intel shipped in 2006.
  • AFigueira - Thursday, October 2, 2014 - link

    Can you provide some sources for those claims, please?
  • OreoCookie - Friday, October 3, 2014 - link

    If you base your performance estimate on Geekbench, Apple's A8 should be roughly as fast as a 2.9~3 GHz Core 2 Duo from 2008: the A8 reaches a score of about 2920 (http://www.phonearena.com/news/Geekbench-test-show...) while the Core 2 Duo E7500 (2.93 GHz) scores 2974 (http://browser.primatelabs.com/processor-benchmark...) in the multicore 64-bit benchmark. On the single-core benchmark, the A8 is a hair's breadth faster (1633 points vs. 1614 for the Intel).

    Both the A7 and A8 are faster than a 2006-era Core 2 Duo, though; the latter, in its T7500 incarnation, only scores 1262 and 2126.

    Roughly speaking, Apple's SoCs deliver the same level of CPU performance as a notebook CPU that's 6 years older (and I think this has been true for a while).
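
For readers who want the ratios behind the comparison above, here is a small sketch using only the Geekbench scores quoted in the comment. The dictionary layout and helper name are hypothetical, and the scores are the commenter's figures rather than independently verified results.

```python
# Geekbench scores as quoted in the comment: (single-core, multi-core).
SCORES = {
    "Apple A8": (1633, 2920),
    "Core 2 Duo E7500": (1614, 2974),
    "Core 2 Duo T7500": (1262, 2126),
}

def ratio(chip: str, reference: str, multicore: bool = False) -> float:
    """Score of `chip` divided by score of `reference`."""
    index = 1 if multicore else 0
    return SCORES[chip][index] / SCORES[reference][index]

for ref in ("Core 2 Duo E7500", "Core 2 Duo T7500"):
    single = ratio("Apple A8", ref)
    multi = ratio("Apple A8", ref, multicore=True)
    print(f"A8 vs {ref}: {single:.2f}x single-core, {multi:.2f}x multi-core")
```

On these numbers the A8 lands within about 2% of the E7500 either way (slightly ahead single-core, slightly behind multi-core) and roughly 30-37% ahead of the T7500, which matches the "roughly as fast" wording.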
  • FunBunny2 - Tuesday, September 30, 2014 - link

    The interesting bit is the Informix pre-install. RDBMS, done smart, falls in the "embarrassingly parallel" quadrant of computing. Informix is a niche DB with some experience in parallel:
    http://programmingexamples.wikidot.com/19-informix...