XFL: Naming Functions in Binaries with Extreme Multi-label Learning


Challenge: unlike words in natural language, most function names occur only once.

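  1. XFL (extreme function labeling)
  • Task: select appropriate labels for binary functions
  • Steps:
    1. Split function names into tokens; each token corresponds to an informative label
    2. Use DEXTER to obtain a function embedding by combining semantic features from static analysis with the local context from the call graph and the global context extracted from the whole binary
    3. Predict the labels and combine them into the output name
  2. DEXTER
  • Task: function embedding
  • Steps: combine semantic features from static analysis with the local context from the call graph and the global context extracted from the whole binary to obtain a function embedding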

Dataset: 10,047 binaries from the Debian project

  1. precision of 83.5%
  2. we demonstrate that binary function labeling can be effectively phrased in terms of multi-label learning

Replacing the DEXTER function embedding with other embeddings:

  1. The embedding produced by DEXTER always performs best
  2. we demonstrate that binary function embeddings benefit from including explicit semantic features.

1. Intro

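P1: The role of reverse engineering

P2: Challenges of binary reverse engineering

P3: The three stages of reverse engineering; how textual information helps reverse engineering

P4: How existing tools handle this; binary matching covers static libraries and compiler runtimes; its shortcomings

P5: Existing progress in machine learning

P6: Problems: 1. it can only generate function names that have been seen in the training set;
2. the classes are heavily imbalanced

P7: Statistics from an example confirm these problems

P8: Solution 1: tokenize

P9: Solution 2: multi-label learning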
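
To make P6 to P9 concrete, here is a minimal sketch on a made-up toy corpus (the names below are hypothetical, not taken from the paper's dataset, and splitting on underscores stands in for the real tokenizer). It compares how many classes occur only once when whole names are classes versus when tokens are labels.

```python
from collections import Counter

# Hypothetical toy corpus of function names (not from the paper's dataset).
names = ["get_line", "get_size", "set_size", "read_line",
         "free_buffer", "set_line", "read_buffer", "free_size"]

def singleton_fraction(items):
    """Fraction of distinct items that occur exactly once."""
    counts = Counter(items)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Whole names as classes: every class here has a single example.
name_singletons = singleton_fraction(names)

# Tokens as labels: the same data yields labels that all repeat.
tokens = [tok for name in names for tok in name.split("_")]
token_singletons = singleton_fraction(tokens)

# A name never seen in training can still be described by known labels.
known_labels = set(tokens)
unseen = "get_buffer"
covered = [tok for tok in unseen.split("_") if tok in known_labels]
```

In this toy case every full name is unique while every token repeats, and the unseen name `get_buffer` is fully covered by known labels, which is the motivation for phrasing the task as multi-label learning.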

2. Extreme Multi-label learning

D. Propensity-based Scoring in PfastreXML


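  1. Correctly predicting labels that are relevant but rare should be rewarded more during training.
  2. Weight the discount by the inverse propensity of the corresponding label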
  3. reranking results using classifiers for rare labels
  4. introduce additional hyperparameters α (for re-ranking) and γ (for predicting rare labels).
  5. PfastreXML improves the multi-label classification accuracy of FastXML
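
A rough sketch of the propensity idea follows. The sigmoid-shaped propensity model and the defaults A = 0.55 and B = 1.5 are recalled from the PfastreXML paper, not from these notes, so treat them as assumptions. The scoring function is a simplified propensity-scored DCG, not the full training objective, and the label counts are invented for illustration.

```python
import math

def propensity(n_label, n_total, a=0.55, b=1.5):
    """Estimated probability that a relevant label is actually observed.

    n_label: number of training points carrying the label
    n_total: total number of training points
    a, b: shape parameters (assumed defaults, see lead-in)
    """
    c = (math.log(n_total) - 1.0) * (b + 1.0) ** a
    return 1.0 / (1.0 + c * math.exp(-a * math.log(n_label + b)))

def ps_dcg_at_k(ranked_labels, relevant, propensities, k):
    """Propensity-scored DCG@k: each hit is discounted by rank and
    divided by the label's propensity, so rare labels are rewarded more."""
    score = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label in relevant:
            score += 1.0 / (propensities[label] * math.log2(rank + 1))
    return score

# Hypothetical label counts in a training set of 10,000 functions.
counts = {"get": 5000, "caml": 20}
props = {label: propensity(n, 10_000) for label, n in counts.items()}

# The same rank-1 hit is worth more when the label is rare.
rare_hit = ps_dcg_at_k(["caml"], {"caml"}, props, k=1)
frequent_hit = ps_dcg_at_k(["get"], {"get"}, props, k=1)
```

Frequent labels get a propensity close to 1 and rare labels a much smaller one, so dividing by the propensity boosts the reward for a correct rare label.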

3. Overview


  1. Generate a Label Space:
    • The syntactic tokenization rules consider (combinations of) multiple naming conventions and substitute common abbreviations
    • The label space contains between 512 and 4096 labels in the experiments
  2. Training the Function Embeddings
    • Distance-similarity principle (similar functions should be close in the embedding space)
  3. Training XFL
  4. Training the Language Model: a trigram model to predict the order of the tokens in a function name



4. Function Embedding

A. Existing Methods for Obtaining Function Embeddings from Binaries

Existing methods: Asm2Vec, SAFE, PalmTree


Assumption: making code semantics explicit will help the training process derive meaningful embeddings with less training data
Based on "Probabilistic naming of functions in stripped binaries" and, for binary similarity, "Revisiting binary code similarity analysis using interpretable feature engineering and lessons learned"

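  1. Represent each function in the VEX IR
  2. Perform symbolic analysis on each function
  3. Analyze the memory addresses and register values used by each function to determine the number of input arguments
  4. Run taint analysis on each input argument and calculate flows to the individual arguments of callee functions
  5. Graph-based vector features:
    • Local Degree Profile (LDP): computed on the function's intra-procedural CFG
    • BoostNE: computed on the call graph
  6. Two hashes are used to identify common functions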
fb = [qb, cb]

C. Function and Binary Context

a function context vector fC as the mean of the feature vectors of its callers and callees.
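
A minimal sketch of this context vector, using plain Python lists. The function names, the feature values, and the zero-vector fallback for functions without callers or callees are assumptions made for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def function_context(func, features, callers, callees):
    """Mean of the feature vectors of func's callers and callees.

    features: dict function -> feature vector
    callers / callees: dict function -> list of neighbouring functions
    """
    neighbours = callers.get(func, []) + callees.get(func, [])
    if not neighbours:
        # Assumption: an isolated function gets an all-zero context.
        return [0.0] * len(features[func])
    return mean_vector([features[n] for n in neighbours])

# Hypothetical call graph: main -> parse -> log
features = {"main": [1.0, 0.0], "parse": [0.0, 2.0], "log": [2.0, 4.0]}
callers = {"parse": ["main"], "log": ["parse"]}
callees = {"main": ["parse"], "parse": ["log"]}

parse_context = function_context("parse", features, callers, callees)
```

For `parse`, the context is the mean of the vectors of `main` (its caller) and `log` (its callee).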

D. Autoencoder Training

fb^ = [fb, gb, hb]


A. Tokenizing Function Names

the tokens str, string, String, and xStr should all have the common denominator token string


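  1. Strip Library Decorations: Regular expressions remove common symbol annotations added by compilers, e.g., '.*.constp$', '.^.avx\d+'. Functionality already provided by Radare2, IDA Pro, and Ghidra is also used
  2. Split Alphanumerical: split on non-alphanumeric characters
  3. Split Camel Case
  4. Abbreviation Expansion
  5. Best Split of the Rod: use a dynamic-programming algorithm to split the character sequence into non-overlapping subsequences that are as large as possible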
  • wrong splits change the meaning, as in fstarpu matrix ↦
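
Steps 2 to 5 can be sketched as follows. This is an illustrative approximation, not the paper's tokenizer: the abbreviation table and vocabulary are hypothetical, the decoration-stripping step is omitted, and the dynamic program scores a split by the sum of squared lengths of known words, which is one assumed way to prefer pieces that are "as large as possible".

```python
import re

# Hypothetical abbreviation table and vocabulary (not given in these notes).
ABBREVIATIONS = {"str": "string", "buf": "buffer", "len": "length"}
VOCABULARY = {"string", "buffer", "length", "get", "set",
              "read", "line", "file", "name"}

def split_alnum(name):
    """Step 2: split on runs of non-alphanumeric characters."""
    return [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]

def split_camel(part):
    """Step 3: 'getLineHTTP2' -> ['get', 'Line', 'HTTP', '2']."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)

def best_split(text, vocabulary):
    """Step 5: rod-cutting style DP over cut positions.

    A known word of length n scores n**2; an unknown character scores 0.
    """
    n = len(text)
    score = [0] * (n + 1)
    cut = [0] * (n + 1)   # cut[i] = start of the last piece of text[:i]
    for i in range(1, n + 1):
        score[i], cut[i] = score[i - 1], i - 1   # unknown single character
        for j in range(i):
            gain = score[j] + (i - j) ** 2
            if text[j:i] in vocabulary and gain > score[i]:
                score[i], cut[i] = gain, j
    pieces, i = [], n
    while i > 0:
        pieces.append(text[cut[i]:i])
        i = cut[i]
    pieces.reverse()
    merged = []   # glue adjacent unknown characters back together
    for p in pieces:
        if p not in vocabulary and merged and merged[-1] not in vocabulary:
            merged[-1] += p
        else:
            merged.append(p)
    return merged

def tokenize(name):
    tokens = []
    for part in split_alnum(name):
        for piece in split_camel(part):
            piece = piece.lower()
            piece = ABBREVIATIONS.get(piece, piece)   # Step 4
            if piece in VOCABULARY:
                tokens.append(piece)
            else:
                tokens.extend(best_split(piece, VOCABULARY))
    return tokens
```

With these assumed tables, `tokenize("get_strLen")` yields `get`, `string`, `length`, and `tokenize("readline_buf")` yields `read`, `line`, `buffer`. Squaring the lengths makes one long known word beat several short ones covering the same characters.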

B. Label Space

Take the n most frequent labels to form the label space Ln
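
A small sketch of building Ln and encoding a function name as a multi-label target. Counting each label once per function and breaking frequency ties alphabetically are assumptions made here for determinism, and the tokenized names are invented.

```python
from collections import Counter

def build_label_space(tokenized_names, n):
    """Return the n most frequent labels, most frequent first."""
    counts = Counter(tok for toks in tokenized_names for tok in set(toks))
    # Assumption: ties are broken alphabetically to keep the result stable.
    return sorted(counts, key=lambda tok: (-counts[tok], tok))[:n]

def to_label_vector(tokens, label_space):
    """Binary indicator vector over the label space; other tokens are dropped."""
    present = set(tokens)
    return [1 if label in present else 0 for label in label_space]

# Hypothetical tokenized function names.
tokenized = [
    ["get", "line"],
    ["get", "size"],
    ["set", "size"],
    ["get", "string", "length"],
]
label_space = build_label_space(tokenized, n=3)
target = to_label_vector(["set", "size"], label_space)
```

Tokens that fall outside Ln, such as `set` in this toy case, simply have no position in the target vector.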

C. PfastreXML for Function Labels

PfastreXML's re-ranking mechanism specifically ensures that such infrequent labels are not lost in FastXML's tree-based hierarchy.


A. Language Model

Trigram model with Kneser-Ney smoothing, used to predict the order of the labels
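
The ordering step might look roughly like the sketch below. To stay short it swaps in add-one (Laplace) smoothing instead of Kneser-Ney, and it scores every permutation of the predicted labels, which is only practical because function names contain few tokens. The training sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

BOS, EOS = "<s>", "</s>"

def train_trigrams(token_sequences):
    """Count trigrams and their two-token contexts over padded sequences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for seq in token_sequences:
        padded = [BOS, BOS] + list(seq) + [EOS]
        vocab.update(padded)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx, len(vocab)

def log_prob(seq, model):
    """Log-probability of a token order under the add-one smoothed model."""
    tri, ctx, vocab_size = model
    padded = [BOS, BOS] + list(seq) + [EOS]
    return sum(
        math.log((tri[(a, b, c)] + 1) / (ctx[(a, b)] + vocab_size))
        for a, b, c in zip(padded, padded[1:], padded[2:])
    )

def best_order(labels, model):
    """Pick the permutation of the labels with the highest score."""
    return max(permutations(labels), key=lambda p: log_prob(p, model))

# Hypothetical training names, already tokenized.
model = train_trigrams([
    ["get", "line"],
    ["get", "size"],
    ["read", "line"],
    ["set", "line", "length"],
])
ordered = best_order(["line", "get"], model)
```

Given the unordered labels `line` and `get`, this toy model prefers the order `get`, `line` because that sequence was seen in training.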

B. Accuracy

  1. often the model produces an order that is simply an alternative to the original one with the same meaning.
  2. there are cases with multiple equally plausible orderings with different semantics
  3. some labels are just too rare for the language model to be meaningfully generalized.

7. Evaluation

RQ1: Which binary function embedding is most suited for the task of ranking function labels?
RQ2: Does XFL generate more suitable function labels than state-of-the-art approaches?
We exclude pseudo functions of size zero, overlapping functions, and locally bound symbols, which do not clearly correspond to a well-defined function.

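Generalization: function names such as hash, get_line, or use are chosen frequently by developers, even though the corresponding functions can differ greatly. The experiment from Section VII-D2 is repeated, but with the test set restricted to function names that do not occur in training. Common labels that are correctly recovered as part of unknown function names include get, set, new, free, and the OCaml-specific tokens caml and fun.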

8. Discussion

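When tested on the Nero dataset, which uses different compilers and build settings, even our own model's performance drops.
Using machine learning to predict function names inherently requires a training dataset whose distribution is similar to that of the target dataset. Our entire toolchain and setup currently apply only to binaries compiled from C. Beyond the distribution shift introduced by code compiled from other languages, different naming conventions, namespaces, and the resulting name mangling would also require changes.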

