New Instructions and Updated Security

When a new generation of processors launches, the physical design and layout changes are usually accompanied by an opportunity to optimize instruction flow, increase throughput, and enhance security.

Core Instructions

When Intel first told us in our briefings that, aside from the caches, the new core was by and large identical to the previous generation, we were somewhat confused. Normally we see something like a common math function get sped up in the ALUs, but not this time: the only additional changes made were for security.

As part of our normal benchmark tests, we do a full instruction sweep, covering throughput and latency for all (known) supported instructions inside each of the major x86 extensions. We did find some minor enhancements within Willow Cove.

  • CLD/STD - Clearing and setting the direction flag - Latency reduced from 5 to 4 clocks
  • REP STOS* - Repeated string stores - Throughput increased from 53 to 62 bytes per clock
  • CMPXCHG16B - Compare and exchange 16 bytes - Latency reduced from 17 to 16 clocks
  • LFENCE - Serializes load instructions - Throughput up from 5/cycle to 8/cycle

There were two regressions:

  • REP MOVS* - Repeated Data String Moves - Decreased throughput from 101 to 93 bytes per clock
  • SHA256MSG1 - SHA256 message scheduling - throughput down from 5/cycle to 4/cycle
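
To put the before/after figures above on a common scale, the relative changes can be worked out with a few lines of arithmetic. This is only a sketch: the numbers are the ones quoted in the two lists, while the `higher_is_better` label and the function names are our own additions for illustration.

```python
# Before/after figures as quoted in the lists above.
# The final field is our own label: True for throughput, False for latency.
RESULTS = [
    ("CLD/STD latency (clocks)", 5, 4, False),
    ("REP STOS* throughput (bytes/clock)", 53, 62, True),
    ("CMPXCHG16B latency (clocks)", 17, 16, False),
    ("LFENCE throughput (per cycle)", 5, 8, True),
    ("REP MOVS* throughput (bytes/clock)", 101, 93, True),
    ("SHA256MSG1 throughput (per cycle)", 5, 4, True),
]


def relative_change(before, after):
    """Return the change from before to after as a percentage of before."""
    return (after - before) / before * 100.0


def is_improvement(before, after, higher_is_better):
    """Throughput improves when it rises; latency improves when it falls."""
    return after > before if higher_is_better else after < before


def summarize(results):
    """Build one human-readable line per instruction."""
    lines = []
    for name, before, after, higher_is_better in results:
        verdict = "improved" if is_improvement(before, after, higher_is_better) else "regressed"
        change = relative_change(before, after)
        lines.append(f"{name}: {before} -> {after} ({change:+.1f}%, {verdict})")
    return lines


if __name__ == "__main__":
    for line in summarize(RESULTS):
        print(line)
```

Seen this way, every change is a single-digit or low double-digit percentage shift on an individual instruction, apart from LFENCE.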

It is worth noting that Willow Cove, while supporting SHA instructions, does not have any form of hardware-based SHA acceleration. By comparison, Intel’s lower-power Tremont Atom core does have SHA acceleration, as do AMD’s Zen 2 cores, and even VIA’s cores and those of VIA’s Zhaoxin joint venture. I’ve asked Intel exactly why the Cove cores don’t have hardware-based SHA acceleration (whether because current performance is sufficient, or because of timing, power, or die area), but have yet to receive an answer.

From a pure x86 instruction performance standpoint, Intel is correct in that there aren’t many changes here. By comparison, the jump from Skylake to Cannon Lake was bigger than this.

Security and CET

On the security side, Willow Cove will now enable Control-Flow Enforcement Technology (CET) to protect against control-flow hijacking attacks, such as return-oriented programming (ROP) and jump/call-oriented programming (JOP/COP). These attacks take advantage of control transfer instructions, such as returns, calls, and jumps, to divert the instruction stream to undesired code.

CET is the combination of two technologies: Shadow Stacks (SS) and Indirect Branch Tracking (IBT).

For returns, the Shadow Stack creates a second stack elsewhere in memory, addressed through a shadow stack pointer register, that holds a list of return addresses with page tracking. If, when a return executes, the return address on the regular stack does not match the return address expected on the shadow stack, the attack will be caught. Shadow stacks are implemented without code changes; however, additional management to handle the event of an attack will need to be programmed.

New instructions are added for shadow stack page management:

  • INCSSP: increment shadow stack pointer (i.e. to unwind shadow stack)
  • RDSSP: read shadow stack pointer into general purpose register
  • SAVEPREVSSP/RSTORSSP: save/restore shadow stack (i.e. thread switching)
  • WRSS: Write to Shadow Stack
  • WRUSS: Write to User Shadow Stack
  • SETSSBSY: Set Shadow Stack Busy Flag to 1
  • CLRSSBSY: Clear Shadow Stack Busy Flag to 0
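
The return-address check that the shadow stack performs can be illustrated with a toy model. This is a minimal sketch rather than a description of the hardware: the `ToyCpu` class, its methods, and the example addresses are hypothetical, and the real mechanism lives in the processor and signals a mismatch with a control protection fault instead of a Python exception.

```python
class ControlProtectionFault(Exception):
    """Stand-in for the fault raised when the two stacks disagree."""


class ToyCpu:
    """Hypothetical model: only call/ret are simulated, nothing else."""

    def __init__(self):
        self.data_stack = []    # ordinary stack, writable by the program
        self.shadow_stack = []  # shadow copy, only touched by call/ret

    def call(self, return_address):
        # A call pushes the return address onto both stacks.
        self.data_stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        # A return pops both stacks and compares the two addresses.
        address = self.data_stack.pop()
        expected = self.shadow_stack.pop()
        if address != expected:
            raise ControlProtectionFault(
                f"return to {address:#x}, shadow stack expected {expected:#x}"
            )
        return address


if __name__ == "__main__":
    cpu = ToyCpu()
    cpu.call(0x401000)
    print(hex(cpu.ret()))           # normal return, addresses match

    cpu.call(0x401000)
    cpu.data_stack[-1] = 0xBADC0DE  # simulated overwrite of the return address
    try:
        cpu.ret()
    except ControlProtectionFault as fault:
        print("caught:", fault)
```

The point of the model is that an attacker who can overwrite the ordinary stack still cannot change the copy the return is checked against.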

Indirect Branch Tracking is added to defend against equivalent misdirected jump/call targets, but requires software to be built with new instructions:

  • ENDBR32/ENDBR64: Terminate an indirect branch in 32-bit/64-bit mode
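
The rule that Indirect Branch Tracking enforces can be sketched as a two-state tracker: after an indirect call or jump, the next instruction executed has to be an ENDBR marker. The model below is an illustration under stated assumptions. The mnemonics `CALL_INDIRECT` and `JMP_INDIRECT`, the `IbtTracker` class, and the instruction traces are invented for the example, and for simplicity the toy accepts either ENDBR32 or ENDBR64 regardless of mode.

```python
IDLE = "IDLE"
WAIT_FOR_ENDBRANCH = "WAIT_FOR_ENDBRANCH"

ENDBRANCH = ("ENDBR32", "ENDBR64")
# Hypothetical mnemonics standing in for indirect calls and jumps.
INDIRECT_BRANCHES = ("CALL_INDIRECT", "JMP_INDIRECT")


class ControlProtectionFault(Exception):
    """Stand-in for the fault raised on a bad indirect branch target."""


class IbtTracker:
    """Toy tracker fed one executed instruction mnemonic at a time."""

    def __init__(self):
        self.state = IDLE

    def step(self, mnemonic):
        # While armed, anything other than an ENDBR marker is a violation.
        if self.state == WAIT_FOR_ENDBRANCH and mnemonic not in ENDBRANCH:
            raise ControlProtectionFault(
                f"indirect branch landed on {mnemonic}, expected ENDBR"
            )
        # An indirect branch arms the tracker; anything else leaves it idle.
        self.state = WAIT_FOR_ENDBRANCH if mnemonic in INDIRECT_BRANCHES else IDLE


def execute(trace):
    """Run a sequence of executed instructions through the tracker."""
    tracker = IbtTracker()
    for mnemonic in trace:
        tracker.step(mnemonic)
    return True


if __name__ == "__main__":
    # Target function starts with ENDBR64, so the indirect call is accepted.
    print(execute(["MOV", "CALL_INDIRECT", "ENDBR64", "ADD", "RET"]))

    # Jump into the middle of a function: no ENDBR at the landing site.
    try:
        execute(["MOV", "JMP_INDIRECT", "ADD", "RET"])
    except ControlProtectionFault as fault:
        print("caught:", fault)
```

This is why software has to be rebuilt for IBT: every legitimate indirect branch target needs the marker at its first instruction.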

Full details about Intel’s CET can be found in Intel’s CET Specification.

At the time of presentation, we were under the impression that CET would be available for all of Intel’s processors. However, we have since learned that Intel’s CET will require a vPro-enabled processor as well as operating system support for Hardware-Enforced Stack Protection. This is currently available on Windows 10’s Insider Previews. I am unsure about Linux support at this time.

Update: Intel has reached out to say that their text implying that CET was vPro only was badly worded. What it was meant to say was 'All CPUs support CET, however vPro also provides additional security such as Intel Hardware Shield'.

AI Acceleration: AVX-512, Xe-LP, and GNA2.0

One of the big changes for Ice Lake last time around was the inclusion of an AVX-512 unit on every core, which enabled vector acceleration for a variety of code paths. Tiger Lake retains Intel’s AVX-512 instruction unit, with support for the VNNI instructions introduced with Ice Lake.

It is easy to argue that although AVX-512 has been around for a number of years, particularly in the server space, we haven’t yet seen it propagate into the consumer ecosystem in any large way. Most AVX-512 efforts have come from software companies working in close collaboration with Intel, taking advantage of Intel’s own vector gurus and ninja programmers. Of the 19-20 or so software tools that Intel likes to promote as being AI accelerated, only a handful focus on the AVX-512 unit, and some of those tools are within the same software title (e.g. Adobe CC).

There has been a famous ruckus recently, with Linux creator Linus Torvalds suggesting that ‘AVX-512 should die a painful death’. His argument is that AVX-512, due to the compute density it provides, reduces the frequency of the core, and takes die area and power budget away from the rest of the processor that could be spent on better things. Intel stands by its decision to migrate AVX-512 across to its mobile processors, stating that its key customers are accustomed to seeing instructions supported across its processor portfolio from Server to Mobile. Intel implied that AVX-512 has been a win in its HPC business, but it will take time for the consumer platform to leverage the benefits. Some of the biggest uses so far for consumer AVX-512 acceleration have been for specific functions in Adobe Creative Cloud, or AI image upscaling with Topaz.

Intel has enabled new AI instruction functionality in Tiger Lake, such as DP4a, which is an Xe-LP addition. Tiger Lake also sports an updated Gaussian & Neural Accelerator (GNA) 2.0, which Intel states can offer 1 Giga-OP of inference within one milliwatt of power, scaling up to 38 Giga-OPs at 38 mW. The GNA is mostly used for natural language processing, such as wake words. In order to enable AI acceleration through the AVX-512 units, the Xe-LP graphics, and the GNA, Tiger Lake supports Intel’s latest DL Boost package and the upcoming OneAPI toolkit.
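
For readers unfamiliar with DP4a, the operation is a four-element dot product of 8-bit integers accumulated into a 32-bit integer, which is the inner loop of quantized inference. The sketch below models the arithmetic only. The function names are our own, it assumes the signed 8-bit variant, and the wrap-around on overflow is an assumption about the accumulator made for illustration, not a statement about how Xe-LP behaves.

```python
def to_int32(value):
    """Wrap an integer to signed 32-bit two's complement (assumed behaviour)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def dp4a(a, b, acc):
    """Dot product of two 4-element int8 vectors, added to a 32-bit accumulator."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dp4a operates on exactly four elements per operand")
    for x in (*a, *b):
        if not -128 <= x <= 127:
            raise ValueError(f"{x} does not fit in a signed 8-bit integer")
    return to_int32(acc + sum(x * y for x, y in zip(a, b)))


def dot_int8(a, b):
    """Longer int8 dot product built from repeated dp4a steps."""
    if len(a) != len(b) or len(a) % 4 != 0:
        raise ValueError("inputs must have equal length, a multiple of four")
    acc = 0
    for i in range(0, len(a), 4):
        acc = dp4a(a[i:i + 4], b[i:i + 4], acc)
    return acc


if __name__ == "__main__":
    print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))
    print(dot_int8([1, -2, 3, -4, 5, -6, 7, -8], [1] * 8))
```

Doing four multiplies and the accumulate in one operation is what makes the instruction attractive for low-precision neural network workloads.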

253 Comments

  • huangcjz - Thursday, September 17, 2020

    No, you can say "have to hand" as in something which is available. E.g. "Do you have the presentation to hand?"
  • 29a - Thursday, September 17, 2020

    Wouldn't a non Iris chip be a fairer comparison to Renoir?
  • Kamen Rider Blade - Thursday, September 17, 2020

    AMD's 4800U has a 25 watt mode, Hardware Unboxed tested it against Intel.

    Why didn't you test it and put those results in the chart?

    Why this biased reviewing, where one side gets 15 watt and 28 watt scores?

    Yet AMD isn't allowed to show 25 watt scores?

    What are you afraid of when comparing like for like?
  • IanCutress - Thursday, September 17, 2020

    For us, the 15W to 15W results were the focal point. 28W is there to show a max Intel and look at scaling. Also, the number of 4800U devices at 25W is minimal.

    Not only that, I'm on holiday. I had to spend two days out, while in this lovely cottage in the countryside, to write 18k words, rather than spend time with my family. I had 4 days with the TGL laptop, and 8 days' notice in advance to prepare before the deadline. Just me with a couple of pages from Andrei, no-one else. Still posted the review 30 minutes late, while writing it in a pub as my family had lunch. Had to take the AMD laptop with me to test, and it turns out downloading Borderlands 3 in the middle of nowhere is a bad idea.

    Not only that, I've been finishing up other projects last week. I do what I can in the time I have. This review is 21k words and more detailed than anything else out there done by a single person currently in the middle of a vacation. If you have further complaints, our publisher's link is at the bottom of the webpage. Or roll your own. What are you afraid of? I stand by my results and my work ethic.
  • PixyMisa - Thursday, September 17, 2020

    I really appreciate the effort. The individual SPEC results are vastly more useful than (for example) a single Geekbench score.
  • Spunjji - Friday, September 18, 2020

    I can second that - I appreciate seeing a breakdown of the strengths/weaknesses of each core design.
  • Kamen Rider Blade - Friday, September 18, 2020

    We appreciate your hard work, I do watch your YT channel TechTechPotato. That being said, if you knew about this issue with not comparing like for like, then just omit the 28 W scores from the Intel machine and just focus on Intel's 15W vs AMD's 15W.

    Why even include the 28W on the chart? You know how this makes you and Anandtech look, right? The issues of bias towards or against any entity could've been easily avoided if you had "Like for like" scores across the board. That's part of what Steve from Gamers Nexus and many of us enthusiasts see as "Bias Marketing" or "Paid Shilling" to manipulate results in one way or another. Many people can easily interpret your data of not showing "like for like" in many wrong ways when they have no context for it.

    If you didn't want to test AMD's 25 watt scores, nobody would care, just don't bring up Intel's equivalent 28 watt scores. A lot of the more casual readers won't look at the details and they can easily misinterpret things. I prefer that your good name doesn't get dragged down in mud with a simple omission of certain benchmark figures. I know you wouldn't deliberately do that to show bias towards one entity or another, but will other folks know that?
  • Spunjji - Friday, September 18, 2020

    Presenting the figures he has isn't bias. Bias would be proclaiming Intel to be the winner without noting the discrepancy, or specifically choosing tests to play to the strength of one architecture.

    As it is, the Lenovo device doesn't do a 25W mode, so you're asking him to add a full extra device's worth of testing to an already long review. That's a bit much.

    If you take a look at the 65W APU results and compare them, you'll see a familiar story for Renoir - there's not actually a whole lot of extra gas in the tank to be exploited by a marginally higher TDP. It performs spectacularly well at 15W, and that's that.
  • Kamen Rider Blade - Friday, September 18, 2020

    You can literally just omit the 65W APU, it has no relevance to be on that chart.

    Ok, if that Lenovo laptop doesn't offer a 25W mode, fine. Maybe Hardware Unboxed got a different model of laptop for the 4800U. Then don't present Intel's 28W mode.

    That's how people misunderstand things when there is a deliberate omission of information or extra information that the other side doesn't have. The lack of pure like for like causes issues.
  • Spunjji - Saturday, September 19, 2020

    You're *demanding bias*. They had the Intel device with a 28W mode, 28W figures are a big part of the TGL proposition, so they tested it and labelled it all appropriately. That isn't bias.

    The "lack of pure like for like" only causes issues if you don't really pay attention to what the article says about what they had and how they tested it.
