Proj. CAR Paper Reading: DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Abstract

This paper:
Tool: DOBF
Task: a pre-training objective that recovers the original version of obfuscated source code
Method: exploit the structural aspects of programming languages
Experiments and results:

  1. Significantly outperforms existing work on multiple downstream tasks
  2. Unsupervised code translation: +12.2%
  3. Natural language code search: +5.3%
  4. Can deobfuscate fully obfuscated source files and suggest descriptive variable names

1. Intro

P1: Model pre-training is widely used.
P2:

  1. Masking and the MLM objective. The paper argues that many of the existing methods are essentially denoising auto-encoding tasks paired with different noising schemes.
  2. BERT itself was designed for natural language; the paper argues it is suboptimal for source code, which has a strict syntactic structure, and proposes a new objective based on deobfuscation of identifier names.
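
As an illustration of why random masking can be suboptimal on code, here is a minimal sketch (my own, not from the paper; `mlm_mask` and the token list are hypothetical) of MLM-style masking over a tokenized Python snippet:

```python
# Minimal sketch (not from the paper): random MLM masking on code often
# hides purely syntactic tokens that grammar alone can recover.
import random

def mlm_mask(tokens, p=0.15, mask_token="[MASK]", seed=0):
    """Independently replace each token with mask_token with probability p."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < p else t for t in tokens]

tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
print(mlm_mask(tokens, p=0.3))
# Masked positions often hold tokens like "def", ":", or "(" that are fully
# determined by Python's grammar, so predicting them teaches the model
# little about what the code actually does.
```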

P3: A brief introduction to code obfuscation, and other situations where such information may be lost, e.g., compilation stripping debugging information.

P4: An example: one can guess that the code in Figure 1 implements breadth-first search.
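
For intuition, here is a reconstruction in the spirit of Figure 1 (my own sketch, not the paper's actual figure): a fully obfuscated breadth-first search whose algorithm is still recognizable from its structure alone.

```python
# Sketch of a fully obfuscated BFS (my reconstruction, not the paper's
# Figure 1): every identifier is anonymized, yet the algorithm is still
# recognizable from control flow and data-structure usage.
from collections import deque

def FUNC_0(VAR_0, VAR_1):
    VAR_2 = deque([VAR_1])         # FIFO queue seeded with the start node
    VAR_3 = {VAR_1}                # set of visited nodes
    while VAR_2:
        VAR_4 = VAR_2.popleft()    # take the next frontier node
        for VAR_5 in VAR_0[VAR_4]: # VAR_0 acts as an adjacency mapping
            if VAR_5 not in VAR_3:
                VAR_3.add(VAR_5)
                VAR_2.append(VAR_5)
    return VAR_3                   # all nodes reachable from VAR_1
```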

P5: The paper's approach: a seq2seq model trained to predict the content of masked-out tokens.
Difficulty: the model needs to understand what the program does.
Weaknesses of the MLM approach:

  1. Masks on tokens tied to language syntax provide very little training signal.
     This paper's fix: only consider function and variable names.
  2. In practice, MLM also masks out variable names, but if a given variable appears multiple times in a function, it will be easy for the model to simply copy its name from one of the other occurrences.
     This paper's fix: replace every occurrence of a variable name in the input with a special VAR_i token (the query posed to the model); see the sketch below.
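
A minimal sketch of the DOBF input/output construction (my own illustration; the paper's tokenization, its split between function and variable token families, and its dictionary serialization may differ):

```python
# Minimal sketch (my illustration, not the paper's implementation) of the
# DOBF objective: replace every occurrence of each identifier with a
# special token, and train a seq2seq model to recover the mapping.
import re

def obfuscate(code, identifiers):
    """Replace every occurrence of each identifier with VAR_i."""
    mapping = {}
    for i, name in enumerate(identifiers):
        token = f"VAR_{i}"
        mapping[token] = name
        code = re.sub(rf"\b{re.escape(name)}\b", token, code)
    return code, mapping

src = "def add(a, b):\n    return a + b"
obf, target = obfuscate(src, ["add", "a", "b"])
print(obf)     # def VAR_0(VAR_1, VAR_2): / return VAR_1 + VAR_2
print(target)  # {'VAR_0': 'add', 'VAR_1': 'a', 'VAR_2': 'b'} -> seq2seq target
```

Because every occurrence of a name is replaced by the same VAR_i, the model can no longer copy the answer from another occurrence; it has to infer the name from how the variable is used.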

5. Results

5.1 Deobfuscation

Training with p_obf = 0.5 is a more difficult task that requires the model to learn and understand more about code semantics.
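
A hedged sketch of how the obfuscation probability might be applied (my reading of the setup, not the paper's code): each identifier is obfuscated independently with probability p_obf, with at least one always obfuscated so the task is never empty.

```python
# Hedged sketch (my reading, not the paper's code): sample which
# identifiers to obfuscate, each chosen with probability p_obf.
import random

def select_identifiers(identifiers, p_obf=0.5, seed=0):
    rng = random.Random(seed)
    chosen = [name for name in identifiers if rng.random() < p_obf]
    if not chosen:  # ensure at least one identifier is obfuscated
        chosen = [rng.choice(identifiers)]
    return chosen

print(select_identifiers(["bfs", "graph", "queue", "visited"], p_obf=0.5))
```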

5.2 Downstream tasks

Objectives such as MLM and DAE, which inject unstructured noise, are complementary to DOBF.
