Proj. CAR Paper Reading: XFL: Naming Functions in Binaries with Extreme Multi-label Learning

Abstract

Background:
Challenge: unlike words in natural language, most function names occur only once.
This paper:

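XFL (eXtreme Function Labeling)

Task: select appropriate labels for binary functions
Steps:
1. Split function names into tokens; each token corresponds to one informative label
2. Use DEXTER to combine semantic features obtained through static analysis with the local context from the call graph and the global context extracted from the whole binary, yielding a function embedding
3. Predict the labels and combine them into the output name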

DEXTER

Task: function embedding
Steps: combine semantic features obtained through static analysis with the local context from the call graph and the global context extracted from the whole binary to obtain a function embedding

Experiment:
Dataset: 10047 binaries from the Debian project
Results:

precision of 83.5%
we demonstrate that binary function labeling can be effectively phrased in terms of multi-label learning

Experiment 2:
Replace the function embedding produced by DEXTER with other embeddings
Results:

The DEXTER embedding always performs best
we demonstrate that binary function embeddings benefit from including explicit semantic features.

1. Intro

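P1: The role of reverse engineering

P2: Challenges of binary reverse engineering

P3: The three phases of reverse engineering, and how textual information helps reverse engineering

P4: How existing tools handle this; binary matching covers static libraries and compiler runtimes; its shortcomings

P5: Existing progress in machine learning

P6: Problems: 1. it can only generate function names that have been seen in the training set;
2. the classes are heavily imbalanced

P7: Statistics from an example that confirm these problems

P8: Solution 1: tokenize

P9: Solution 2: multi-label learning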

2. Extreme Multi-label learning

D. Propensity-based Scoring in PfastreXML

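Assumption: in large datasets, the ground-truth labels have lower recall than the true label set, while their precision is actually good. Authors and editors may simply not know every category that in principle applies to their article; assigning a label to a completely wrong data point is far less common.

Successfully predicting labels that are relevant but rare should be rewarded more during training.
Weight the discount by the propensity of the corresponding label, so that hits on rare (low-propensity) labels count more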
reranking results using classifiers for rare labels
introduce additional hyperparameters α (for re-ranking) and γ (for predicting rare labels).
PfastreXML improves the multi-label classification accuracy of FastXML
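
A minimal sketch of propensity weighting. It assumes the propensity model from the PfastreXML paper (Jain et al., 2016), which these notes do not spell out: p_l = 1 / (1 + C * exp(-A * log(N_l + B))) with C = (log N - 1) * (B + 1)^A. The defaults A = 0.55 and B = 1.5 and both function names are assumptions made for illustration.

```python
import math

def label_propensity(n_label, n_total, a=0.55, b=1.5):
    """Estimated probability that a relevant label is actually observed.

    n_label: number of training points carrying the label
    n_total: total number of training points
    a, b:    shape parameters (assumed defaults, dataset dependent)
    """
    c = (math.log(n_total) - 1.0) * (b + 1.0) ** a
    return 1.0 / (1.0 + c * math.exp(-a * math.log(n_label + b)))

def propensity_scored_precision_at_k(ranked, truth, propensity, k):
    """Precision@k where every hit is weighted by 1 / propensity of its label."""
    hits = sum(1.0 / propensity[label] for label in ranked[:k] if label in truth)
    return hits / k
```

Rare labels receive a low propensity, so a correct prediction of a rare label contributes more than a correct prediction of a frequent one.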

3. Overview

Training

Generate a Label Space:
The syntactic tokenization rules account for (combinations of) multiple naming conventions and substitute common abbreviations
The label space is limited to between 512 and 4096 labels in the paper's experiments
Training the Function Embeddings
Distance-similarity principle: similar functions should lie close together in the embedding space
Training XFL
Training the Language Model: a trigram model to predict the order of the labels in a function name

Prediction

Return all labels whose score is above the threshold
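
The prediction step can be sketched as a simple filter over per-label scores. The function name `predict_labels`, the score dictionary, and the default threshold of 0.5 are assumptions for illustration; the notes do not state the threshold XFL uses.

```python
def predict_labels(scores, threshold=0.5):
    """Return every label whose score exceeds the threshold, best first.

    scores: mapping from label to the ranker's confidence in [0, 1]
    """
    kept = [(label, score) for label, score in scores.items() if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in kept]
```

For example, `predict_labels({"get": 0.9, "line": 0.6, "free": 0.1})` keeps `get` and `line` and drops `free`.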

4. Function Embedding

A. Existing methods for obtaining function embeddings from binaries

Existing methods: Asm2Vec, SAFE, PalmTree

B. DEXTER

Assumption: making code semantics explicit will help the training process derive meaningful embeddings with less training data
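Builds on "Probabilistic naming of functions in stripped binaries" and on the binary similarity work "Revisiting binary code similarity analysis using interpretable feature engineering and lessons learned"

Represent each function in the VEX IR
Run symbolic analysis on each function
Analyze the memory addresses and register values used by each function to determine the number of input arguments
Run taint analysis on each input argument to calculate flows to the individual arguments of callee functions
Graph-based vector features:
Local Degree Profile (LDP): the function's intra-procedural CFG
BoostNE: call graph
Two hashes are used to identify common functions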
fb = [qb, cb]
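
As an illustration of the Local Degree Profile feature listed above, the sketch below computes, for every node, its degree plus the minimum, maximum, mean, and standard deviation of its neighbors' degrees, which is the usual LDP definition. Treating the CFG as an undirected adjacency dictionary is an assumption, and how DEXTER aggregates the per-node profiles into one function-level vector is not covered by these notes.

```python
import statistics

def local_degree_profile(adjacency):
    """Map each node to (deg, min, max, mean, std) over its neighbors' degrees.

    adjacency: dict mapping a node to the set of its (undirected) neighbors
    """
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    profile = {}
    for node, neighbors in adjacency.items():
        neighbor_degrees = [degree[n] for n in neighbors]
        if neighbor_degrees:
            stats = (
                min(neighbor_degrees),
                max(neighbor_degrees),
                statistics.fmean(neighbor_degrees),
                statistics.pstdev(neighbor_degrees),
            )
        else:
            # Isolated basic block: no neighbors to summarize.
            stats = (0, 0, 0.0, 0.0)
        profile[node] = (degree[node],) + stats
    return profile
```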

C. Function and Binary Context

A function context vector fC is computed as the mean of the feature vectors of its callers and callees.
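
A minimal sketch of the context vector, using plain Python lists. The names `context_vector`, `features`, `callers`, and `callees` are hypothetical; counting a function that is both caller and callee only once, and returning a zero vector for a function without neighbors, are assumptions not taken from the paper.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def context_vector(func, features, callers, callees):
    """Mean feature vector over the callers and callees of func.

    features: dict function -> feature vector (list of floats)
    callers:  dict function -> iterable of functions that call it
    callees:  dict function -> iterable of functions it calls
    """
    neighbors = sorted(set(callers.get(func, ())) | set(callees.get(func, ())))
    if not neighbors:
        return [0.0] * len(features[func])
    return mean_vector([features[n] for n in neighbors])
```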

D. Autoencoder Training

fb^ = [fb, gb, hb]

5. XML FOR FUNCTION BINARIES

A. Tokenizing Function Names

the tokens str, string, String, and xStr should all have the common denominator token string

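Steps

Strip Library Decorations: Regular expressions remove common symbol annotations added by compilers, e.g., '.*.constp$', '.ˆ.avx\d+'. Functionality already provided by Radare2, IDA Pro, and Ghidra is also used
Split Alphanumerical: split on non-alphanumeric characters
Split Camel Case
Abbreviation Expansion
Best Split of the Rod: use a dynamic programming algorithm to split the character sequence into the largest possible non-overlapping sequences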
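
The alphanumerical split, camel-case split, and abbreviation expansion can be sketched as below. This is an illustration under assumptions, not the paper's implementation: the `ABBREVIATIONS` table and the regular expressions are made up for the example, and decoration stripping is left out.

```python
import re

# Hypothetical abbreviation table; the real table is not given in these notes.
ABBREVIATIONS = {"str": "string", "buf": "buffer", "init": "initialize"}

def tokenize_name(name):
    """Split a function name into lowercase, abbreviation-expanded tokens."""
    tokens = []
    # Split Alphanumerical: break on any run of non-alphanumeric characters.
    for part in re.split(r"[^A-Za-z0-9]+", name):
        # Split Camel Case: acronym runs, (capitalized) words, and digit runs.
        for word in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part):
            word = word.lower()
            # Abbreviation Expansion: map known abbreviations to the full token.
            tokens.append(ABBREVIATIONS.get(word, word))
    return tokens
```

With this table, `str_copy`, `String`, and `xStr` all yield the shared token `string`.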

wrong splits change the meaning, as in fstarpu matrix 7→
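
The "Best Split of the Rod" step can be sketched as a rod-cutting style dynamic program. Scoring each known word by its squared length, so that longer pieces win, is an assumption chosen for this example; the scoring used by XFL is not recorded in these notes, and `vocabulary` is a hypothetical word list.

```python
def best_split(text, vocabulary):
    """Segment text so that the summed squared length of known words is maximal.

    Characters not covered by any known word are emitted as single-letter
    pieces with score 0.
    """
    n = len(text)
    # best[i] holds (score, pieces) for the prefix text[:i].
    best = [(0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            known = piece in vocabulary
            if not known and i - j > 1:
                continue  # unknown material may only be consumed one letter at a time
            score = best[j][0] + (len(piece) ** 2 if known else 0)
            if best[i] is None or score > best[i][0]:
                best[i] = (score, best[j][1] + [piece])
    return best[n][1]
```

A vocabulary that lacks the intended words will still produce some split, which is how a wrong split can change the meaning of a name.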

B. Label Space

Take the n most frequently used labels to form Ln
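
A minimal sketch of building the label space and encoding a name against it. Counting a token once per function name and breaking ties alphabetically are assumptions; the function names are hypothetical.

```python
from collections import Counter

def build_label_space(tokenized_names, n):
    """Return the n most frequent tokens across all tokenized function names."""
    counts = Counter(token for tokens in tokenized_names for token in set(tokens))
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:n]]

def to_multi_hot(tokens, label_space):
    """Encode a tokenized name as a 0/1 vector over the label space."""
    present = set(tokens)
    return [1 if label in present else 0 for label in label_space]
```

Tokens outside the top n are simply dropped from the target vector, which is why rare tokens cannot be predicted.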

C. PfastreXML for Function Labels

PfastreXML's re-ranking mechanism specifically ensures that such infrequent labels are not lost in FastXML's tree-based hierarchy.

6. FUNCTION NAME GENERATION

A. Language Model

Kneser-Ney smoothing + trigram model, used to predict the order of the labels
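
To illustrate how a trigram model can order a set of predicted labels, the sketch below scores every permutation and keeps the most likely one. To stay short it uses add-one smoothing in place of Kneser-Ney, so it is a simplification and not the model described above; all names are hypothetical, and the exhaustive search is only practical for a handful of labels.

```python
import math
from collections import Counter
from itertools import permutations

BOS, EOS = "<s>", "</s>"

def train_trigrams(names):
    """Count trigrams and their two-token contexts over tokenized names."""
    trigrams, contexts, vocab = Counter(), Counter(), {EOS}
    for tokens in names:
        seq = [BOS, BOS] + list(tokens) + [EOS]
        vocab.update(tokens)
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts, len(vocab)

def log_prob(tokens, model):
    """Add-one smoothed trigram log-probability of a token sequence."""
    trigrams, contexts, vocab_size = model
    seq = [BOS, BOS] + list(tokens) + [EOS]
    return sum(
        math.log((trigrams[(a, b, c)] + 1) / (contexts[(a, b)] + vocab_size))
        for a, b, c in zip(seq, seq[1:], seq[2:])
    )

def best_order(labels, model):
    """Return the permutation of labels with the highest log-probability."""
    return max(permutations(labels), key=lambda order: log_prob(order, model))
```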

B. Accuracy

often the model produces an order that is simply an alternative to the original one with the same meaning.
there are cases with multiple equally plausible orderings with different semantics
some labels are just too rare for the language model to be meaningfully generalized.

7. Evaluation

RQ1: Which binary function embedding is most suited for the task of ranking function labels?
RQ2: Does XFL generate more suitable function labels than state-of-the-art approaches?
We exclude pseudo functions of size zero, overlapping functions, and locally bound symbols, which do not clearly correspond to a well-defined function.
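The symbol table can be used to read function boundaries
When predicting labels in an unknown binary, the symbols have been stripped and function boundaries must be obtained through function boundary prediction

Generalization: function names such as hash, get_line, or use are used frequently by developers even though the corresponding functions can differ greatly. The experiment from Section VII-D2 is repeated with the test set restricted to function names that do not occur in training. Common labels that are correctly recovered as part of unknown function names include get, set, new, free, and the OCaml-specific tokens caml and fun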

8. Discussion

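When testing on the Nero dataset, which uses different compilers and build settings, even XFL's own performance drops.
Using machine learning to predict function names inherently requires a training dataset whose distribution resembles that of the target dataset. The entire toolchain and setup currently only works for binaries compiled from C. Beyond the distribution shift introduced by code compiled from other languages, different naming conventions, namespaces, and the resulting name mangling would also require changes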

posted @ 2023-06-22 07:56 雪溯
