Android Malware Classification

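I have posted a Baidu Netdisk download link for the Drebin samples in the comments below; please download them from there. This experiment was inspired by the previous one ([Microsoft Malware Classification](https://www.cnblogs.com/iloveacm/p/13813325.html)) and reuses the code from this blog post ([Detecting Android Malicious Code with Machine Learning](https://blog.csdn.net/qq_33256688/article/details/81434389?utm_medium=distribute.pc_aggpage_search_result.none-task-blog-2~all~first_rank_v2~rank_v25-1-81434389.nonecase&utm_term=%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E6%A3%80%E6%B5%8B%E5%AE%89%E5%8D%93%E6%81%B6%E6%84%8F%E8%BD%AF%E4%BB%B6&spm=1000.2123.3001.4430)). All of the code is available at the GitHub address that blogger provides.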

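Principle

The principle is the same as in the previous experiment: extract the opcodes from the decompiled files and build n-grams with n = 3. See this article for details. Unlike the previous experiment, this one targets Android apps, so the opcodes themselves are different. In addition, the benign apps in the dataset are much larger than the malicious ones, and raw counts would partly reflect app size. Each n-gram feature therefore records whether the sequence occurs at all, not how many times it occurs.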
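To make the difference between frequency and presence features concrete, here is a minimal sketch. It assumes each opcode has already been abbreviated to a single character, and the two example apps are hypothetical. It is an illustration only, not the code used in the experiment.

```python
from collections import Counter


def ngrams(sequence, n=3):
    """Return every contiguous n-gram in a string of one-letter opcode codes."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def count_features(methods, n=3):
    """Frequency features: how many times each n-gram occurs across all methods."""
    counts = Counter()
    for method in methods:
        counts.update(ngrams(method, n))
    return counts


def presence_features(methods, n=3):
    """Presence features: 1 if the n-gram occurs at least once, whatever its count."""
    return {gram: 1 for gram in count_features(methods, n)}


# Hypothetical opcode strings, one string per method
small_app = ["MVMV"]
large_app = ["MVMV", "MVMVMV", "MVMV"]

print(count_features(small_app))   # counts grow with the amount of code
print(count_features(large_app))
print(presence_features(small_app) == presence_features(large_app))
```

The two apps contain the same kinds of sequences, so their presence features are identical even though the larger app has higher counts.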

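Dataset

The malware samples come from Drebin; only the first part, 1000 malicious apps in total, is used. The benign apps come from this website, which provides a little over 1100 benign apps, of which 1000 are used. The benign set is 12.3 GB and the malware set is 1.2 GB, so the benign apps are clearly much larger than the malicious ones.

Decompilation

Decompile the benign and malicious apps into \smalis\kind and \smalis\malware respectively. The code is as follows: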

# -*- coding: utf-8 -*-
"""
Created on Tue Feb  6 14:00:51 2018
@author: 燃烧杯
"""

import os
import subprocess

def disassemble(frompath, topath, num, start=0):
    files = os.listdir(frompath)
    files = files[start:num]
        
    total = len(files)
    
    for i, file in enumerate(files):
        fullFrompath = os.path.join(frompath, file)
        fullTopath = os.path.join(topath, file)
        command = "apktool d " + fullFrompath + " -o " + fullTopath
        subprocess.call(command, shell=True)
        print("Decompiled", i + 1, "apps; progress:")
        print((i + 1) * 100 / total, "%")


# Decompile the malware samples
virus_root = "..\\bit\\virus\\VirusAndroid"
disassemble(virus_root, ".\\smalis\\malware", 600)


# Decompile the benign samples
kind_root = "..\\bit\\virus\\normalApk"
disassemble(kind_root, ".\\smalis\\kind", 600)
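A long batch run can be interrupted, and file names may contain spaces that break a command built by string concatenation. The sketch below is one possible variant under those assumptions: it passes the arguments as a list, skips apps whose output folder already exists, and collects the names that failed. The function names and the `runner` parameter are hypothetical, and only the `apktool d <apk> -o <dir>` form from the code above is used. On Windows, where apktool is usually a .bat wrapper, `shell=True` may still be needed.

```python
import os
import subprocess


def build_command(apk_path, out_dir):
    """Argument list for `apktool d <apk_path> -o <out_dir>`."""
    return ["apktool", "d", apk_path, "-o", out_dir]


def disassemble_safe(frompath, topath, runner=subprocess.call):
    """Decompile every file in frompath and return the names that failed.

    `runner` is a hypothetical hook so the loop can be exercised without apktool.
    """
    failed = []
    for name in sorted(os.listdir(frompath)):
        out_dir = os.path.join(topath, name)
        if os.path.isdir(out_dir):
            continue  # already decompiled in an earlier run
        command = build_command(os.path.join(frompath, name), out_dir)
        if runner(command) != 0:
            failed.append(name)
    return failed
```

Calling `disassemble_safe(virus_root, ".\\smalis\\malware")` would then return the list of samples that need another look.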

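When it finishes, each app gets a folder named after its file, containing the decompiled files, as shown below:

image-20201015172056064

The smali folder inside it holds the files from which we extract the opcodes. A smali file looks roughly like this:

image-20201015172738379

The opcodes we want are inside the .method blocks. They fall roughly into the categories shown in the figure below:

p1

Each opcode category is mapped to an uppercase letter to shorten the feature codes; for example, move is written as M.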
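As a rough illustration of this step, the sketch below walks through the .method blocks of a smali file and maps each instruction to a category letter. The category table is a hypothetical stand-in: only move → M comes from the text, and the real table is the one in the figure. The sample smali snippet is also made up, and the parser ignores details such as annotation blocks.

```python
# Hypothetical prefix table: only "move" -> "M" is taken from the text above.
PREFIX_CATEGORIES = [
    ("move", "M"),
    ("return", "R"),
    ("goto", "G"),
    ("if", "I"),
    ("invoke", "V"),
]


def categorize(opcode):
    """Map one smali opcode to a category letter, or '' if it is not tracked."""
    # iget/sget/aget and iput/sput/aput share a one-letter prefix
    if opcode[:1] in ("i", "s", "a") and opcode[1:4] in ("get", "put"):
        return "T" if opcode[1:4] == "get" else "P"
    for prefix, letter in PREFIX_CATEGORIES:
        if opcode.startswith(prefix):
            return letter
    return ""


def method_sequences(smali_text):
    """Return one category-letter string per .method block."""
    sequences, current = [], None
    for line in smali_text.splitlines():
        line = line.strip()
        if line.startswith(".method"):
            current = []
        elif line.startswith(".end method"):
            if current is not None:
                sequences.append("".join(current))
            current = None
        elif current is not None and line and not line.startswith((".", "#", ":")):
            # Directives, comments and labels are skipped; the first token is the opcode
            current.append(categorize(line.split()[0]))
    return sequences


SAMPLE = """
.class public Lcom/example/Demo;
.method public run()V
    .locals 1
    iget v0, p0, Lcom/example/Demo;->x:I
    if-eqz v0, :cond_0
    invoke-virtual {p0}, Lcom/example/Demo;->stop()V
    :cond_0
    return-void
.end method
"""

print("|".join(method_sequences(SAMPLE)))  # prints TIVR
```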

Opcode Extraction

The code corresponds to bytecode_extract.py in the GitHub repository mentioned above.

# -*- coding: utf-8 -*-
"""
Created on Tue Feb  6 22:41:06 2018
@author: 燃烧杯
"""

from infrastructure.ware import Ware
from infrastructure.fileutils import DataFile

virusroot = "./smalis/malware"
kindroot = "./smalis/kind"

f = DataFile("./data.csv")

import os

def collect(rootdir, isMalware):
    wares = os.listdir(rootdir)
    total = len(wares)
    for i, ware in enumerate(wares):
        warePath = os.path.join(rootdir, ware)
        ware = Ware(warePath, isMalware)
        ware.extractFeature(f)
        print("Extracted features from", i + 1, "apps; progress:")
        print((i + 1) * 100 / total, "%")
        
    
# 1 marks malware, 0 marks benign software
collect(virusroot, 1)
collect(kindroot, 0)    

f.close()

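The extracted data looks like this:

image-20201015172554235

The Feature column holds the features extracted for each app. The opcode sequences of the individual methods are separated by "|".

n-gram Features

The n-gram features are extracted from the Feature column above; each value indicates whether that opcode sequence occurs. The code is as follows: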

# -*- coding: utf-8 -*-
"""
Created on Fri Feb  9 13:26:50 2018
@author: 燃烧杯
Set-of-words model
"""

import sys

# Value of n for the n-grams
n = int(sys.argv[1])
print("n = ", n)

import pandas as pd

origin = pd.read_csv("data.csv")
#origin = pd.read_csv("test.csv")

from infrastructure.mydict import MyDict

mdict = MyDict()

feature = origin["Feature"].str.split("|")
total = len(feature)
for i, code in enumerate(feature):
    mdict.newLayer()
    if not isinstance(code, list):  # e.g. NaN when the Feature cell is empty
        continue
    for method in code:
        length = len(method)
        if length < n:
            continue
        for start in range(length - (n - 1)):
            end = start + n
            mdict.mark(method[start:end])
    print("Finished", i + 1, "apps; progress:")
    print((i + 1) * 100 / total, "%")
            
result = mdict.dict
pd.DataFrame(result, index=origin.index)\
               .to_csv("./" + str(n) + "_gram.csv", index=False)

The result is shown below:

image-20201015173708451

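This produces a 2000 × 343 feature table. There are 343 feature columns because there are 7 opcode categories and 3-grams are used, which gives 7 × 7 × 7 = 343 possible sequences.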
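The column count can be checked by enumerating every 3-gram over 7 category symbols. The letters below are hypothetical placeholders; any 7 distinct symbols give the same count. Whether the author's MyDict helper orders its columns this way is not shown, so treat `to_row` as an illustrative assumption.

```python
from itertools import product

CATEGORIES = "MRGITPV"  # hypothetical letters for the 7 opcode categories

# Every possible 3-gram over the 7 categories, in a fixed order
ALL_TRIGRAMS = ["".join(p) for p in product(CATEGORIES, repeat=3)]


def to_row(present):
    """Turn a set of observed 3-grams into a fixed-length 0/1 feature row."""
    return [1 if gram in present else 0 for gram in ALL_TRIGRAMS]


row = to_row({"MVM", "VMV"})
print(len(ALL_TRIGRAMS), sum(row))  # prints 343 2
```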

Machine Learning

Next comes training. This experiment uses the random forest algorithm with 10-fold cross-validation. The code is as follows:

from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
import pandas as pd

train_feture = pd.read_csv('D:\\android\\dataset\\smalis\\3_gram.csv')
data = pd.read_csv('D:\\android\\dataset\\smalis\\data.csv')
labels = data["isMalware"]
train_feture = train_feture.iloc[:,:].values
srf = RF(n_estimators=500, n_jobs=-1)
clf_s = cross_val_score(srf, train_feture, labels, cv=10)

The results are as follows:

array([0.965     , 0.995     , 0.995     , 0.96      , 0.89      ,
       0.945     , 0.965     , 0.95      , 0.97487437, 0.97487437])

Deep Learning

For comparison, I also tried deep learning on the same classification task, using Keras as the library. Here is the code:

from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

test_split = 0.2  # fraction of samples held out for testing
data = pd.read_csv("D:\\android\\dataset\\smalis\\data_2.csv")
fetrues = pd.read_csv("D:\\android\\dataset\\smalis\\3_gram.csv")
labels = data["isMalware"]

p1 = int(len(labels) * (1 - test_split))  # index where the test set starts
index = np.random.permutation(len(fetrues))  # shuffle features and labels together
train_data = fetrues.iloc[index].values
labels = labels.iloc[index].values
# Split the shuffled data; the original listing used these names without defining them
x_train, y_train = train_data[:p1], labels[:p1]
x_test, y_test = train_data[p1:], labels[p1:]

model = keras.Sequential()
model.add(layers.Dense(50,input_dim = 343, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))
model.compile(
    optimizer = 'adam',
    loss='binary_crossentropy',
    metrics=['acc']
)

history = model.fit(x_train, y_train, epochs=60, batch_size=256, validation_data=(x_test, y_test))

The test results are as follows:

image-20201016093540843

Accuracy over the last 10 epochs:

0.9812, 0.9819, 0.9775, 0.9781, 0.9718, 0.9812, 0.9793, 0.9618, 0.9825, 0.9756

I also ran 10-fold cross-validation, with the following code:

from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from sklearn.model_selection import StratifiedKFold
seed = 7
np.random.seed(seed)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

cvscores = []
data = pd.read_csv("D:\\android\\dataset\\smalis\\data_2.csv")
labels = data["isMalware"]
train_data = pd.read_csv("D:\\android\\dataset\\smalis\\3_gram.csv")
train_data =  train_data.iloc[:,:].values
for train, test in kfold.split(train_data, labels):
    model = keras.Sequential()
    model.add(layers.Dense(50,input_dim = 343, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(16, activation = 'relu'))
    model.add(layers.Dense(1, activation = 'sigmoid'))
    model.compile(
        optimizer = 'adam',
        loss='binary_crossentropy',
        metrics=['acc']
    )
    model.fit(train_data[train],labels[train],epochs=60, batch_size=256,verbose = 0)
    scores = model.evaluate(train_data[test], labels[test], verbose=0)
    print(scores[1])
    cvscores.append(scores[1])
print(cvscores)

The accuracies are as follows:

[0.95, 0.985, 0.95, 0.945, 0.975, 0.95, 0.955, 0.96, 0.9798995, 0.9748744]
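To compare the two sets of fold scores at a glance, they can be summarised by mean, standard deviation and worst fold. The sketch below simply copies the numbers reported in this post; it is a quick summary, not a statistical significance test.

```python
from statistics import mean, stdev

# Fold accuracies copied from the results reported in this post
rf_scores = [0.965, 0.995, 0.995, 0.96, 0.89,
             0.945, 0.965, 0.95, 0.97487437, 0.97487437]
nn_scores = [0.95, 0.985, 0.95, 0.945, 0.975,
             0.95, 0.955, 0.96, 0.9798995, 0.9748744]

for name, scores in (("random forest", rf_scores), ("neural network", nn_scores)):
    print(f"{name}: mean={mean(scores):.4f} "
          f"std={stdev(scores):.4f} min={min(scores):.3f}")
```

Both models average roughly 0.96 across the 10 folds, and the random forest has the single lowest fold at 0.89.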

Comparison with the random forest:

image-20201016102523772

posted @ 2020-10-15 17:45  iloveacm