Identifying Encrypted Malware Traffic with Contextual Flow Data

1. Dataset

0x1：MCFP DATASET

0x2：CTU-13 Dataset - A Labeled Dataset with Botnet, Normal and Background traffic

The CTU-13 dataset consists of thirteen captures (called scenarios) of different botnet samples. In each scenario we executed a specific malware, which used several protocols and performed different actions.

Table 2 below shows the characteristics of the botnet scenarios.

Each scenario was captured in a pcap file that contains all the packets of the three types of traffic (normal, botnet, background).

Taking one scenario as an example, let's look at what the dataset contains.

1. INFO

•    Binary used: Neris.exe
    •    Md5: bf08e6b02e00d2bc6dd493e93e69872f
    •    Probable Name: Neris
    •    Capture duration: 6.15 hours
    •    Complete Pcap size: 52GB
    •    Botnet Pcap size: 56MB
    •    NetFlow size: 369MB
    •    Infected Virtual Environment
    •         Windows XP named 'SARUMAN'
    •         IP address: 147.32.84.165
    •         Label of this IP in the NetFlows files: 'Botnet'

2. TIMELINE

Wed ago 10 15:58:00 CEST 2011

We captured the Neris bot along with the packets of the whole CTU department. The first hour of the capture consisted only of Background traffic, and later we ran the malware. The malware was stopped 5 minutes before the capture ended. We limited the outbound bandwidth of the bot to 20 kbps.

3. TRAFFIC ANALYSIS

This dataset corresponds to a Neris botnet that ran for 6.15 hours in a university network. The botnet used an HTTP-based C&C channel, not an IRC C&C channel as was erroneously reported before. The botnet communicated over several C&C channels, then tried to send spam, actually sent spam, and performed click fraud using some advertisement services.

The following connection is an example of a real C&C channel that sent few flows and is not periodic. It is not a good representative model for C&C connections. An example of the commands sent:

POST /?c799959d9582d499959791949482d19995939782d2999790969182c699959c949c92
959c82c0999582d79995969c959d9d9482c199e79ef8f3edeae0ebf3f7f8f0e1e9f4f893ccd
dddcccad3c28ac1dcc182c399cdcacdd0a4 HTTP/1.1
HTTP/1.1 200 OK
Date: Wed, 10 Aug 2011 09:41:53 GMT
Server: Apache/2.2.8 (Fedora) DAV/2 PHP/5.2.6 mod_ssl/2.2.8 OpenSSL/0.9.8g
X-Powered-By: PHP/5.2.6
Content-Length: 26
Connection: close
Content-Type: text/html; charset=UTF-8
CB2=212.117.171.138:65500

The following connection is an unencrypted C&C channel where we can see the commands, and it is a good representative of the C&C connections.

POST /snapbn/gate.php HTTP/1.0
Host: finalcortex.com
Keep-Alive: 300
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 56
id=SARUMAN_610d402662842e9f&version=1337&os=2600&s5=6906


HTTP/1.1 200 OK
Date: Wed, 10 Aug 2011 09:08:48 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.1.6
Content-Length: 3
Connection: close
Content-Type: text/plain; charset=UTF-8

120

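4. NetFlow five-tuple aggregation and features extracted from the pcap by CTU-13

Note that CTU-13 extracts and aggregates HTTP, NetFlow, and DNS information from the pcap, but in a real project we can run our own custom aggregation and feature engineering directly on the pcap.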

StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes,SrcBytes,Label
2011/08/10 09:46:53.047277,3550.182373,udp,212.50.71.179,39678,  <->,147.32.84.229,13363,CON,0,0,12,875,413,flow=Background-UDP-Established
2011/08/10 09:46:53.048843,0.000883,udp,84.13.246.132,28431,  <->,147.32.84.229,13363,CON,0,0,2,135,75,flow=Background-UDP-Established
2011/08/10 09:46:53.049895,0.000326,tcp,217.163.21.35,80,  <?>,147.32.86.194,2063,FA_A,0,0,2,120,60,flow=Background
2011/08/10 09:46:53.053771,0.056966,tcp,83.3.77.74,32882,  <?>,147.32.85.5,21857,FA_FA,0,0,3,180,120,flow=Background
2011/08/10 09:46:53.053937,3427.768066,udp,74.89.223.204,21278,  <->,147.32.84.229,13363,CON,0,0,42,2856,1596,flow=Background-
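
As a minimal sketch of working with these records, the snippet below reads binetflow rows with Python's csv module and reduces the detailed Label column to a coarse class. The helper names (coarse_label, load_flows) are hypothetical, and the assumption that the class can be read from the substrings "Botnet" and "Normal" in the label should be checked against the labels of the scenario actually used.

```python
import csv
import io
from collections import Counter

# Two rows copied from the sample above; a real file would be opened with open().
SAMPLE = """StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes,SrcBytes,Label
2011/08/10 09:46:53.047277,3550.182373,udp,212.50.71.179,39678,  <->,147.32.84.229,13363,CON,0,0,12,875,413,flow=Background-UDP-Established
2011/08/10 09:46:53.049895,0.000326,tcp,217.163.21.35,80,  <?>,147.32.86.194,2063,FA_A,0,0,2,120,60,flow=Background
"""

def coarse_label(label):
    """Map a detailed label to Botnet / Normal / Background (substring match is an assumption)."""
    for name in ("Botnet", "Normal"):
        if name in label:
            return name
    return "Background"

def load_flows(fileobj):
    """Parse binetflow CSV rows into small dicts with typed numeric fields."""
    flows = []
    for row in csv.DictReader(fileobj):
        flows.append({
            "src": row["SrcAddr"],
            "dst": row["DstAddr"],
            "proto": row["Proto"],
            "dur": float(row["Dur"]),
            "pkts": int(row["TotPkts"]),
            "bytes": int(row["TotBytes"]),
            "label": coarse_label(row["Label"]),
        })
    return flows

flows = load_flows(io.StringIO(SAMPLE))
label_counts = Counter(flow["label"] for flow in flows)
```

Counting the coarse labels first is a cheap way to see how imbalanced a scenario is before any feature engineering.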

Relevant Link:

https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow
https://cwf.dw.alibaba-inc.com/nodeCode.html?nodeId=3897914&env=prod&currentAppId=17523#/
https://mcfp.weebly.com/ctu-malware-capture-botnet-42.html
https://mcfp.felk.cvut.cz/publicDatasets/CTU-13-Dataset/
https://mcfp.weebly.com/analysis
https://www.stratosphereips.org/datasets-overview/
https://mcfp.weebly.com/mcfp-dataset.html
https://mcfp.weebly.com/the-ctu-13-dataset-a-labeled-dataset-with-botnet-normal-and-background-traffic.html

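2. NetFlow aggregation and feature engineering from pcap files

ML-based intrusion detection on network traffic requires sensible aggregation and transformation of the raw logs so that the patterns hidden in the data can be extracted. For network-side detection, aggregating by five-tuple session flow is the most common and effective approach, and it makes both core features and contextual features relatively easy to extract.

This post focuses on the open-source tool joy, on its approach to feature engineering, and on using the resulting features to identify malicious network communication with ML.

joy ships with options for flow feature engineering, which must be specified explicitly on the command line: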

Data feature options
  bpf="expression"           only process packets matching BPF "expression"
  zeros=1                    include zero-length data (e.g. ACKs) in packet list
  retrans=1                  include TCP retransmissions in packet list
  bidir=1                    merge unidirectional flows into bidirectional ones
  dist=1                     include byte distribution array
  cdist=F                    include compact byte distribution array using the mapping file, F
  entropy=1                  include byte entropy
  http=1                     include HTTP data
  exe=1                      include information about host process associated with flow
  classify=1                 include results of post-collection classification
  num_pkts=N                 report on at most N packets per flow (0 <= N < 200)
  type=T                     select message type: 1=SPLT, 2=SALT
  idp=N                      report N bytes of the initial data packet of each flow
  label=L:F                  add label L to addresses that match the subnets in file F
  URLmodel=URL               URL to be used to retrieve classisifer updates
  model=F1:F2                change classifier parameters, SPLT in file F1 and SPLT+BD in file F2
  hd=1                       include header description
  URLlabel=URL               Full URL including filename to be used to retrieve label updates
  wht=1                      include walsh-hadamard transform
  example=1                  include example feature
  dns=1                      report DNS response information
  ssh=1                      report ssh information
  tls=1                      report TLS data (ciphersuites, record lengths and times, ...)
  dhcp=1                     report dhcp information
  http=1                     report http information
  ike=1                      report IKE information
  payload=N                  include N bytes of payload
  salt=1                     include salt feature
  ppi=1                      include per-packet info (ppi)

A complete data feature record extracted for one flow:

{
    "sa": "30.43.132.45",
    "da": "140.205.35.18",
    "pr": 6,
    "sp": 65376,
    "dp": 443,
    "bytes_out": 0,
    "num_pkts_out": 1,
    "bytes_in": 0,
    "num_pkts_in": 1,
    "time_start": 1529927333.626875,
    "time_end": 1529927333.637219,
    "packets": [],
    "byte_dist": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "p_malware": 0.000076,
    "ip": {
        "out": {
            "ttl": 64,
            "id": [9329]
        },
        "in": {
            "ttl": 50,
            "id": [30923]
        }
    },
    "oseq": [1275303583],
    "oack": [1237297150],
    "iseq": [1237297150],
    "iack": [1275303584],
    "ppi": [{
        "seq": 1275303583,
        "ack": 1237297150,
        "rseq": 0,
        "rack": 0,
        "b": 0,
        "olen": 0,
        "dir": ">",
        "t": 0,
        "flags": "A",
        "opts": []
    }, {
        "seq": 1237297150,
        "ack": 1275303584,
        "rseq": 0,
        "rack": 1,
        "b": 0,
        "olen": 0,
        "dir": "<",
        "t": 10,
        "flags": "A",
        "opts": []
    }]
}

0x1：Bidirectional flows

Normally a netflow is directional, i.e. a directed five-tuple. Merging pairs of directional five-tuples into bidirectional flows yields richer information.

A bidirectional flow consists of a pair of unidirectional flows whose source and destination addresses and ports are reversed, and whose time spans overlap. That is, if the flows out and in are unidirectional, then their combination is a bidirectional flow when

in.sa = out.da,
in.da = out.sa,
in.sp = out.dp,
in.dp = out.sp,
in.pr = out.pr,
in.start_time > out.start_time, and
in.start_time < out.stop_time.

A bidirectional flow is sometimes called a biflow.
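
The matching conditions above can be sketched directly in code. This is an illustrative pairing routine, not joy's implementation: the Flow dataclass, is_biflow, and pair_biflows are hypothetical names, and the quadratic search is only suitable for small examples.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    sa: str       # source address
    da: str       # destination address
    sp: int       # source port
    dp: int       # destination port
    pr: int       # IP protocol number
    start: float  # start time in seconds
    stop: float   # stop time in seconds

def is_biflow(out, inn):
    """True when inn is the reverse direction of out and starts inside out's time span."""
    return (inn.sa == out.da and inn.da == out.sa
            and inn.sp == out.dp and inn.dp == out.sp
            and inn.pr == out.pr
            and out.start < inn.start < out.stop)

def pair_biflows(flows):
    """Greedily pair unidirectional flows into (out, in) tuples."""
    used = set()
    pairs = []
    for i, out in enumerate(flows):
        if i in used:
            continue
        for j, inn in enumerate(flows):
            if j != i and j not in used and is_biflow(out, inn):
                used.update((i, j))
                pairs.append((out, inn))
                break
    return pairs

client = Flow("30.43.132.45", "140.205.35.18", 65376, 443, 6, 0.0, 0.5)
server = Flow("140.205.35.18", "30.43.132.45", 443, 65376, 6, 0.01, 0.4)
other = Flow("30.43.132.45", "140.205.35.18", 65377, 443, 6, 0.0, 0.5)
pairs = pair_biflows([server, client, other])
```

Because the inbound flow must start strictly after the outbound one, the pair always comes out ordered as (initiator, responder) regardless of input order.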

0x2：Flow expiration - timeout windows for flow aggregation

At the pcap or network-monitor level, a five-tuple session has no explicit boundary: nothing marks when a flow ends. This is fundamentally because a flow is a virtual, logical concept.

We therefore need to define expiration timeouts. joy defines two:

1. Inactive timeout: a flow expires whenever it has been inactive for the inactive timeout period (10 s).
| --> inactive timeout (no packets arrive during this window; when it elapses, counting stops and the packets seen so far are emitted as one flow) <-- |

2. Active timeout: a flow expires whenever it is active and its duration exceeds the active timeout period (30 s).
| --> active timeout (the maximum window for a continuously active flow; only the packets within this window are counted as one flow) <-- |

The flow record indicates which of these conditions happened, in the expire_type element, which appears in the top level of each flow object:

• "expire_type": "i" denotes an inactive expiration, and
• "expire_type": "a" denotes an active expiration.
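
To make the two timeouts concrete, here is a simplified sketch that splits the packet timestamps of one five-tuple into flow records. It is not joy's actual logic: split_flow is a hypothetical helper, the key is written as expire_type (assumed spelling), and labeling the final record at end of capture as inactive is an assumption made only for this example.

```python
INACTIVE_TIMEOUT = 10.0  # seconds without packets before a flow expires
ACTIVE_TIMEOUT = 30.0    # maximum duration of a continuously active flow

def split_flow(timestamps):
    """Split sorted packet timestamps (seconds) of one five-tuple into flow records."""
    records = []
    current = []
    for t in timestamps:
        if current:
            if t - current[-1] > INACTIVE_TIMEOUT:
                # Silence lasted longer than the inactive timeout.
                records.append({"expire_type": "i", "packets": current})
                current = []
            elif t - current[0] > ACTIVE_TIMEOUT:
                # The flow kept going past the active timeout.
                records.append({"expire_type": "a", "packets": current})
                current = []
        current.append(t)
    if current:
        # Assumption: whatever is left at end of capture is flushed as inactive.
        records.append({"expire_type": "i", "packets": current})
    return records

records = split_flow([0, 1, 2, 20, 28, 36, 44, 52])
expire_types = [r["expire_type"] for r in records]
```

In this example the 18-second gap after t=2 triggers an inactive expiration, and the steady 8-second cadence that follows is cut by the active timeout once the record would exceed 30 seconds.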

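0x3：zeros=1：whether to include packets with zero-length Data fields

Include packets with zero-length Data fields (e.g. TCP ACKs) in the packets array.

Bare TCP ACKs occur in every connection, starting with the handshake, and probably contribute little to classification, so dropping them is worth considering.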

0x4：Packet length sequence of a flow (flow meta feature)

The packet lengths were taken to be the sizes of the UDP, TCP, or ICMP packet payloads. If the packet was not one of those three types, then the length was set to the size of the IP packet.

To avoid the difficulty of fitting raw scalar values, we discretize the feature space [0, 1500] into bins of 150 bytes each, which gives 10 discrete values.

The discretized sequence is fed into a Markov chain to compute the sequence probability, and the result is used as a vectorized feature for ML.

This yields a 1-dimensional feature.
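
The binning and Markov-chain scoring described above can be sketched as follows. This is one possible reading of the approach, under assumptions: the function names are hypothetical, the transition matrix is estimated from a set of training sequences with add-one smoothing, and the feature is the log-probability of a sequence (the first state's prior is ignored for brevity).

```python
import math

N_BINS = 10     # number of discrete states
BIN_SIZE = 150  # bytes per bin, covering [0, 1500]

def to_state(length):
    """Map a packet length in bytes to one of the 10 bins; 1500 and above fall in the last bin."""
    return min(max(length, 0) // BIN_SIZE, N_BINS - 1)

def fit_transitions(sequences, alpha=1.0):
    """Estimate a row-stochastic transition matrix with add-alpha smoothing."""
    counts = [[alpha] * N_BINS for _ in range(N_BINS)]
    for seq in sequences:
        states = [to_state(x) for x in seq]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]

def log_prob(seq, trans):
    """Log-probability of the state transitions in seq under the fitted chain."""
    states = [to_state(x) for x in seq]
    return sum(math.log(trans[a][b]) for a, b in zip(states, states[1:]))

train = [[100, 1400, 100, 1400], [120, 1450, 90, 1300]]
trans = fit_transitions(train)
seen_score = log_prob([100, 1400], trans)    # small -> large, common in train
unseen_score = log_prob([100, 100], trans)   # small -> small, never seen in train
```

A flow whose length pattern resembles the training sequences gets a higher score than one with transitions the chain has never seen, which is what makes the single number usable as a feature.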

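0x5：Inter-arrival time sequence of a flow (flow meta feature)

The inter-arrival times had a millisecond resolution.

The inter-arrival (inter-packet) time is the interval between two consecutive packets in the same flow. The resulting sequence captures the frequency pattern (behavioral pattern) of the session.

As before, we discretize [0, 500] milliseconds into bins of 50 milliseconds each, which gives 10 discrete values.

The inter-arrival time sequence feature is computed in the same way as the packet length sequence feature.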
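
A short sketch of the discretization step for inter-arrival times, assuming the packet timestamps are already available in milliseconds relative to the start of the flow (inter_arrival_states is a hypothetical helper). The resulting state sequence can then be scored with the same kind of Markov chain as the packet lengths.

```python
def inter_arrival_states(timestamps_ms, bin_ms=50, n_bins=10):
    """Turn packet timestamps (ms) into binned inter-arrival states; 500 ms and above share the last bin."""
    gaps = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return [min(int(gap // bin_ms), n_bins - 1) for gap in gaps]

states = inter_arrival_states([0, 10, 70, 700])
```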

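0x6：Whether to keep TCP retransmissions

TCP retransmission is the resend mechanism triggered by packet loss. It carries no pattern of its own and is not reproducible, so we exclude retransmissions during flow feature engineering: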

retrans=0

0x7：byte distribution：byte-value frequency table (flow meta feature)

A length-256 array that keeps a count for each byte value encountered in the payloads of the packets of the flow being analyzed. The counts are accumulated over the whole flow session.

"byte_dist": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],

0x8：bytes_out, num_pkts_out, bytes_in, num_pkts_in (flow meta feature)

Basic statistical features.

0x9：entropy (flow meta feature)

The entropy of the Data fields of the flow.

• entropy is the entropy in bits per byte; this is the empirical entropy computed from the empirical probability distribution over the bytes. This number ranges from zero to 8.

• total_entropy is the total entropy, in bits, over all of the bytes in the flow. This number ranges from zero to 8n, where n is bytes_out for unidirectional flows, and is bytes_out plus bytes_in for bidirectional flows.

{
    "entropy": 7.224162,
    "total_entropy": 463054.342398
}
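
Both values can be derived from the byte distribution array. The sketch below is a plain-Python illustration with hypothetical helper names (byte_dist, byte_entropy); it computes the empirical Shannon entropy in bits per byte and multiplies it by the number of bytes to get the total.

```python
import math

def byte_dist(payload):
    """Count occurrences of each byte value (0..255) in a payload."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    return counts

def byte_entropy(dist):
    """Return (entropy in bits per byte, total entropy over all bytes)."""
    total = sum(dist)
    if total == 0:
        return 0.0, 0.0
    h = -sum(c / total * math.log2(c / total) for c in dist if c)
    return h, h * total

uniform_h, uniform_total = byte_entropy(byte_dist(bytes(range(256))))
constant_h, constant_total = byte_entropy(byte_dist(b"AAAA"))
```

Encrypted or compressed payloads approach 8 bits per byte, while plain text sits noticeably lower, which is why the value is informative even when the content cannot be read.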

0x10：Labeling sa/da addresses in flows and measuring the blacklist hit ratio (flow meta feature)

label=L:F

This facility can be used to label flows that contact known-bad servers, for instance,

label=malware:badips.net

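We can label the sa/da of each flow against a list of blacklisted IPs and then count the hits. This gives a feature that describes how many of the requests are legitimate and what fraction of the total they make up.

Empirically, malware traffic and normal traffic should differ quite clearly in the proportion of flows that hit malicious IPs.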
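
A minimal sketch of the hit-ratio idea using Python's standard ipaddress module. The helper names and the one-subnet-per-line blacklist format are assumptions made for illustration; joy's own label file format may differ.

```python
import ipaddress

def load_subnets(lines):
    """Parse one IPv4/IPv6 address or CIDR subnet per line, skipping blanks and # comments."""
    subnets = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            subnets.append(ipaddress.ip_network(line))
    return subnets

def is_listed(addr, subnets):
    """True when addr falls inside any of the blacklisted subnets."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in subnets)

def blacklist_hit_ratio(flows, subnets):
    """Fraction of flows whose sa or da matches the blacklist."""
    if not flows:
        return 0.0
    hits = sum(
        1 for f in flows
        if is_listed(f["sa"], subnets) or is_listed(f["da"], subnets)
    )
    return hits / len(flows)

subnets = load_subnets(["# example blacklist", "212.117.171.0/24"])
flows = [
    {"sa": "147.32.84.165", "da": "212.117.171.138"},
    {"sa": "30.43.132.45", "da": "140.205.35.18"},
]
ratio = blacklist_hit_ratio(flows, subnets)
```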

joy can capture DNS features from a pcap or from a live network interface: dns=1

"dns": [{
        "rn": "cldshlsrl.aliyun.com.gds.alibabadns.com",
        "rc": 8,
        "rr": [{
            "a": "140.205.60.9",
            "ttl": 18
        }]
    }],

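Note that a DNS response is joined with a TLS flow as context only when an IP address in the DNS response equals the destination IP of the TLS flow. In other words, during the session the client resolved a domain name and obtained the destination IP through DNS, i.e. the client reached the server via a DNS lookup.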
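
The join condition can be sketched as a lookup table from returned IP address to DNS response. The function names are hypothetical, the record layout follows the dns sample above, and any time-window constraint between the lookup and the TLS flow is deliberately left out of this sketch.

```python
def index_dns(dns_flows):
    """Map each A-record address to the DNS responses that returned it."""
    by_ip = {}
    for flow in dns_flows:
        for resp in flow.get("dns", []):
            for rr in resp.get("rr", []):
                if "a" in rr:
                    by_ip.setdefault(rr["a"], []).append(resp)
    return by_ip

def dns_context(tls_flow, by_ip):
    """Return the DNS responses whose address equals the TLS flow's destination."""
    return by_ip.get(tls_flow["da"], [])

dns_flows = [{
    "dns": [{
        "rn": "cldshlsrl.aliyun.com.gds.alibabadns.com",
        "rc": 8,
        "rr": [{"a": "140.205.60.9", "ttl": 18}],
    }]
}]
by_ip = index_dns(dns_flows)
matched = dns_context({"sa": "30.43.132.45", "da": "140.205.60.9", "dp": 443}, by_ip)
unmatched = dns_context({"sa": "30.43.132.45", "da": "140.205.35.18", "dp": 443}, by_ip)
```

A TLS flow with no matching DNS response is itself informative, since it suggests the client connected to a hard-coded IP address.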

0x11：the lengths of both the DNS response's domain name and the FQDN (DNS feature)

From the associated DNS response of the TLS flow, we collect the lengths of both the domain name and the FQDN.

For the domain name lengths, the benign domain names had a roughly Gaussian shape centered around 6 or 7. The malicious domain names and FQDNs had a sharp peak at 6 and 10, respectively.

The length distributions of domain names and FQDNs therefore differ quite clearly between normal and malware traffic.

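0x12：one-hot encoding of the 40 most common suffixes (length-41 vector) (DNS feature)

Based on domain knowledge, 40 common suffixes (e.g. .com) plus an "other" category are defined, and the DNS responses in the flow are one-hot encoded against them.

0x13：one-hot encoding of the 32 most common TTL values (length-33 vector) (DNS feature)

Based on domain knowledge, 32 TTL values plus an "other" category are defined, and the TTL of the DNS responses in the flow is one-hot encoded against them.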
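
Both encodings follow the same pattern: a fixed vocabulary plus one trailing "other" slot. A minimal sketch is shown below; the four-entry TTL vocabulary is a made-up placeholder, not the 32 values used in the original work.

```python
def one_hot_with_other(value, vocabulary):
    """One-hot encode value against vocabulary; unknown values set the final 'other' slot."""
    vec = [0] * (len(vocabulary) + 1)
    try:
        vec[vocabulary.index(value)] = 1
    except ValueError:
        vec[-1] = 1
    return vec

# Hypothetical vocabulary for illustration only.
COMMON_TTLS = [18, 60, 300, 3600]

known = one_hot_with_other(18, COMMON_TTLS)
unknown = one_hot_with_other(12345, COMMON_TTLS)
```

With a 32-entry vocabulary the same function yields the length-33 vector, and with 40 suffixes the length-41 vector.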

The corresponding figure shows histograms of the normal and malware samples over the TTL vector.

0x14：DNS name character-count features (DNS feature)

1. the number of numerical characters

2. the number of nonalphanumeric characters

The corresponding figure shows histograms of the normal and malware samples over these character counts.
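
The length and character-count features of a DNS name can be sketched in a few lines. Two simplifications are assumptions of this example: the "domain name" is taken naively as the second-to-last label rather than via a public-suffix list, and dots are counted as non-alphanumeric characters.

```python
def dns_name_features(fqdn):
    """Compute length and character-count features for a DNS name."""
    name = fqdn.rstrip(".")
    labels = name.split(".")
    # Naive choice: the label just before the final suffix.
    domain = labels[-2] if len(labels) >= 2 else labels[0]
    return {
        "fqdn_len": len(name),
        "domain_len": len(domain),
        "num_digits": sum(ch.isdigit() for ch in name),
        "num_nonalnum": sum(not ch.isalnum() for ch in name),
    }

features = dns_name_features("a1-b.example.com")
```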

0x15：number of IP addresses in the DNS response (DNS feature)

The number of IP addresses returned by the DNS response.

The corresponding figure shows the number of distinct IP addresses returned. The majority of DNS responses return 1 IP address for both malicious and benign responses. Beyond those cases, there is some interesting structure where we see significantly more benign responses returning 2 and 8 IP addresses and significantly more malware responses returning 4 and 11 IP addresses.

0x16：DNS name hits against the Alexa top lists (DNS feature)

Alexa ranks websites based on the number of page views and the number of unique IP addresses.

Six binary features represent whether the domain name was in the

1. top-100
2. top-1,000
3. top-10,000
4. top-100,000
5. top-1,000,000
6. or not found in the Alexa list.

Only the smallest matching Alexa tier is set: for example, if a domain is in both the top-100 and the top-1,000, only the top-100 feature is set to 1.

As expected, roughly 86% of the domain names that the malware samples looked up were not found in the Alexa top-1,000,000 list.

On the other hand, the corresponding figure shows that the normal traffic had the majority of its domain names in the top-1,000,000, which is a significant difference.
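
The six binary Alexa features, with only the smallest matching tier set, can be sketched as below. The rank_of dictionary stands in for a lookup into the Alexa list and its contents are invented for the example.

```python
TIERS = [100, 1_000, 10_000, 100_000, 1_000_000]

def alexa_features(domain, rank_of):
    """Return six binary features: five tiers plus a final 'not found' slot."""
    vec = [0] * (len(TIERS) + 1)
    rank = rank_of.get(domain)
    if rank is None or rank > TIERS[-1]:
        vec[-1] = 1
        return vec
    for i, limit in enumerate(TIERS):
        if rank <= limit:
            vec[i] = 1  # only the smallest tier containing the rank is set
            break
    return vec

# Hypothetical ranks for illustration only.
rank_of = {"popular.example": 50, "midsize.example": 5000}

top100 = alexa_features("popular.example", rank_of)
top10k = alexa_features("midsize.example", rank_of)
missing = alexa_features("unknown.example", rank_of)
```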

joy can capture TLS features from a pcap or from a live network interface: tls=1

"tls": {
        "c_version": 5,
        "s_version": 5,
        "c_key_length": 264,
        "c_key_exchange": "206f66817bde64dd09443dfb398e01ad5667fe53edc1164c2504a2d8ed1b2f6b5e",
        "c_random": "31d0341f4239ec26efe6c95e5498630f99bfca54b679dcadf0106705095e440d",
        "s_random": "5b31fb66335283a01f8a5126a0f6625a7855d607c4a35da8d7cc709ef9f81b9b",
        "c_sid": "b028000021ddf6ec1fab1ec29f40a169a0e995085ce2c1cebb9747b7a8500c39",
        "s_sid": "d8260000d08b84ff5b5f2832368b29985c8164953453c0118aa87d9f0671f0e7",
        "sni": ["mobile.pipe.aria.microsoft.com"],
        "scs": "c030",
        "cs": ["c02c", "c02b", "c024", "c023", "c00a", "c009", "cca9", "c030", "c02f", "c028", "c027", "c014", "c013", "cca8", "c008", "c012", "009d", "009c", "003d", "003c", "0035", "002f", "000a"],
        "c_extensions": [{
            "renegotiation_info": "00"
        }, {
            "server_name": "002100001e6d6f62696c652e706970652e617269612e6d6963726f736f66742e636f6d"
        }, {
            "extended_master_secret": ""
        }, {
            "signature_algorithms": "0012040308040401050308050501080606010201"
        }, {
            "status_request": "0100000000"
        }, {
            "signed_certificate_timestamp": ""
        }, {
            "ec_point_formats": "0100"
        }, {
            "supported_groups": "0008001d001700180019"
        }],
        "s_extensions": [{
            "status_request": ""
        }, {
            "extended_master_secret": ""
        }, {
            "renegotiation_info": "00"
        }],
        "s_cert": [{
            "length": 1770,
            "serial_number": "7b00004529a28f3959bbd4eb6e000000004529",
            "signature": "0c9c2ca945381c3408d9a318cbc2069046411b30f4191d95232daff0638fd2f2cf4276ed856c4c4e523684208416aabbff114db646010534a221cdb01a4475b83657ab633a02218c9c79bc8693f49a5471da64a39c4a7c9345a2e6461e3b9f4119e5ec602a13acfb5662516d3481536f8a25d9ee6e00b3d270ea6d218a7e9fddae581d3dc8e5bee5995ad456d26eb4ea581d035cf076b332f17e39a172e72054eb035a820bb08cec7bf1acc591d2e81a19fb0eb65cd26ee3334892cf8ecdf3b66024b97c959f4acd89453df7033b06709fb7d2bb3db1a67c5ed95690d18f7563745e3b9005d348d3886bbb25cedcbf48409e2e8a6a160b6649d18ceefa00d47e9861046a63c912011bcf71090e0f5b0c8e4fdb1fe0213b41eaffc8651b201d2f88355f685bd2243df9e51b401ce10c877464653676714ecae9885e79a9b84ce8f3c0b07a1f3ae5021cd391a9307ba704e9494d7e5c175c5eece35560adbbb312e761142fb1f7e106a489bf7a47a8a682c2f352483a2b0df3c64060efe3edae6a691764819c80b911ab4348cd7b3243ac81308d55f854ca27e3f6997500675d372e5b3039ae7837a5cb2b9acd1b0826d049dae171249d892e4aeb10e6f3d46afa25535132c2e50ed5e79b00968656c9c7fe788868c7a54cbd61ebf877e0aa3e596f8831a7e3836f0f39d973868cefba1c618cb88d0511aa91f7f6025ce8a078a7",
            "signature_algo": "sha256WithRSAEncryption",
            "signature_key_size": 4096,
            "issuer": [{
                "countryName": "US"
            }, {
                "stateOrProvinceName": "Washington"
            }, {
                "localityName": "Redmond"
            }, {
                "organizationName": "Microsoft Corporation"
            }, {
                "organizationalUnitName": "Microsoft IT"
            }, {
                "commonName": "Microsoft IT TLS CA 1"
            }],
            "subject": [{
                "commonName": "*.pipe.aria.microsoft.com"
            }],
            "extensions": [{
                "X509v3 Subject Key Identifier": "4E:47:63:9E:6E:FC:3D:95:06:D2:7E:10:EB:35:2F:FE:4B:DF:90:52"
            }, {
                "X509v3 Key Usage": "Digital Signature, Key Encipherment, Data Encipherment"
            }, {
                "X509v3 Authority Key Identifier": "keyid:58:88:9F:D6:DC:9C:48:22:B7:14:3E:FF:84:88:E8:E6:85:FF:FA:7D."
            }, {
                "X509v3 CRL Distribution Points": ".Full Name:.  URI:http:..mscrl.microsoft.com.pki.mscorp.crl.Microsoft IT TLS CA 1.crl.  URI:http:..crl.microsoft.com.pki.mscorp.crl.Microsoft IT TLS CA 1.crl."
            }, {
                "Authority Information Access": "CA Issuers - URI:http:..www.microsoft.com.pki.mscorp.Microsoft IT TLS CA 1.crt.OCSP - URI:http:..ocsp.msocsp.com."
            }, {
                "1.3.6.1.4.1.311.21.7": "0..'+.....7.....u...........a...`.]...B...z..d..."
            }, {
                "X509v3 Extended Key Usage": "TLS Web Client Authentication, TLS Web Server Authentication"
            }, {
                "X509v3 Certificate Policies": "Policy: 1.3.6.1.4.1.311.42.1.  CPS: http:..www.microsoft.com.pki.mscorp.cps."
            }, {
                "1.3.6.1.4.1.311.21.10": "0.0...+.......0...+......."
            }, {
                "X509v3 Subject Alternative Name": "DNS:*.pipe.aria.microsoft.com, DNS:pipe.skype.com, DNS:*.pipe.skype.com"
            }],
            "validity_not_before": "Sep  6 20:57:50 2017 GMT",
            "validity_not_after": "Sep  6 20:57:50 2019 GMT",
            "subject_public_key_algo": "rsaEncryption",
            "subject_public_key_size": 2048
        }, {
            "length": 1464,
            "serial_number": "08b87a501bbe9cda2d164d3e3951bf55",
            "signature": "309ac69d6afdef93080cbe8277f976a06d9e7b30237ba8295af46a3ec70b0c96dfb84b52e40d9c38ed7863b573c01c1f3be0a7ff7f49519532b8d09ba9e5cf96038180d54a6118fec46ac6df7f4146229c8066eb0f42a0e4f3a421a398d07a74f68ce8c3d22baa2bce11591944e75c070942ebd7fd154db96f6c44352687baa33b68b081e720c97f1302f3ccab9f1c9550cbae6480bb870a5dcea66bb27de33d36e22951b725fcd009e3b0adc4622e3e7e8526b2f6aff76d3173c61998a9729302ceca0b3d3cecd970e880f516ab786a874dc68137a80a768106a8ef17607c7010133c38d7334ce4376508fb91b3e81676612a65f55894b34501efc04f037bb8",
            "signature_algo": "sha256WithRSAEncryption",
            "signature_key_size": 2048,
            "issuer": [{
                "countryName": "IE"
            }, {
                "organizationName": "Baltimore"
            }, {
                "organizationalUnitName": "CyberTrust"
            }, {
                "commonName": "Baltimore CyberTrust Root"
            }],
            "subject": [{
                "countryName": "US"
            }, {
                "stateOrProvinceName": "Washington"
            }, {
                "localityName": "Redmond"
            }, {
                "organizationName": "Microsoft Corporation"
            }, {
                "organizationalUnitName": "Microsoft IT"
            }, {
                "commonName": "Microsoft IT TLS CA 1"
            }],
            "extensions": [{
                "X509v3 Subject Key Identifier": "58:88:9F:D6:DC:9C:48:22:B7:14:3E:FF:84:88:E8:E6:85:FF:FA:7D"
            }, {
                "X509v3 Authority Key Identifier": "keyid:E5:9D:59:30:82:47:58:CC:AC:FA:08:54:36:86:7B:3A:B5:04:4D:F0."
            }, {
                "X509v3 Basic Constraints": "CA:TRUE, pathlen:0"
            }, {
                "X509v3 Key Usage": "Digital Signature, Certificate Sign, CRL Sign"
            }, {
                "X509v3 Extended Key Usage": "TLS Web Server Authentication, TLS Web Client Authentication, OCSP Signing"
            }, {
                "Authority Information Access": "OCSP - URI:http:..ocsp.digicert.com."
            }, {
                "X509v3 CRL Distribution Points": ".Full Name:.  URI:http:..crl3.digicert.com.Omniroot2025.crl."
            }, {
                "X509v3 Certificate Policies": "Policy: X509v3 Any Policy.  CPS: https:..www.digicert.com.CPS."
            }],
            "validity_not_before": "May 20 12:51:28 2016 GMT",
            "validity_not_after": "May 20 12:51:28 2024 GMT",
            "subject_public_key_algo": "rsaEncryption",
            "subject_public_key_size": 4096
        }],

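The application data of a TLS session is encrypted, so we can only extract information from the handshake phase of the TLS flow, plus some statistical features of the flow as a whole.

0x17：the list of offered ciphersuites (TLS feature - client-based TLS-specific features)

During the handshake, the TLS/SSL client offers the server the list of ciphersuites it supports.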

"cs": ["c02c", "c02b", "c024", "c023", "c00a", "c009", "cca9", "c030", "c02f", "c028", "c027", "c014", "c013", "cca8", "c008", "c012", "009d", "009c", "003d", "003c", "0035", "002f", "000a"],

We define a length-176 one-hot style binary vector and encode the client's offered ciphersuites (c_ciphersuites) into it.
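
A sketch of the encoding: each known ciphersuite code gets a fixed position, and every offered suite sets its position to 1. The four-entry vocabulary here is a placeholder built from the sample above; the actual 176 codes are not listed in this post, so the real index would have to be built from the ciphersuites observed in the data.

```python
def build_index(known_suites):
    """Assign a stable vector position to each known ciphersuite hex code."""
    return {cs: i for i, cs in enumerate(sorted(set(known_suites)))}

def encode_offered(offered, index):
    """Binary vector with a 1 for every offered ciphersuite present in the index."""
    vec = [0] * len(index)
    for cs in offered:
        pos = index.get(cs.lower())
        if pos is not None:
            vec[pos] = 1
    return vec

# Hypothetical vocabulary for illustration only.
index = build_index(["c02c", "c02b", "c030", "000a"])
vector = encode_offered(["c030", "C02C", "ffff"], index)
```

Unknown codes such as "ffff" are simply ignored here; adding an "other" slot would be an equally reasonable choice.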

The corresponding figure illustrates differences in two client-side TLS features, the implication being that malware authors use a distinct set of TLS libraries and/or configurations.

From domain experience, malware authors and authors of normal software tend to use different TLS libraries, which shows up at the traffic level as distinguishable ciphersuite lists.

0x18：the list of advertised extensions (TLS feature - client-based TLS-specific features)

A length-21 binary vector was used to represent the TLS extensions.

Malware also seems to have comparatively little diversity in the client-supported TLS extensions.

0x19：the client's public key length (TLS feature - client-based TLS-specific features)

"c_key_length": 264,

A 1-dimensional feature.

Most of the normal traffic used 256-bit elliptic curve cryptography for the public keys, but most of the malicious traffic used 2048-bit RSA public keys.

0x20：the list of selected ciphersuites (TLS feature - server-based TLS-specific features)

"scs": "c030",

0x21：supported extensions (TLS feature - server-based TLS-specific features)

The malicious traffic most often selected obsolete ciphersuites.

The normal traffic contained a wider variety of supported TLS extensions by the servers.

0x22：number of certificates (TLS feature - server-based TLS-specific features)

The certificate message passes the server’s certificate chain to the client.

We observed that the number of certificates in the chain for the malware and normal data were roughly the same.

However, if we restrict our focus to the length-1 chains (root certificates):

∼70% were self-signed for malware
∼0.1% were self-signed for the normal traffic

0x23：number of SAN names (TLS feature - server-based TLS-specific features)

The number of names in the SubjectAltName (SAN) X.509 extension also differed in the two datasets.

0x24：validity in days (TLS feature - server-based TLS-specific features)

Similar to the other data features, the period of validity for a server certificate has notable differences in the malicious and normal traffic.

0x25：whether there was a self-signed certificate
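
The server-certificate features from 0x22 to 0x25 can all be read from the s_cert array shown earlier. The sketch below is illustrative: cert_features is a hypothetical helper, it assumes the server's own certificate is the first entry of s_cert, and it treats a certificate as self-signed simply when issuer and subject are identical, without verifying any signature.

```python
from datetime import datetime

DATE_FMT = "%b %d %H:%M:%S %Y GMT"  # e.g. "Sep  6 20:57:50 2017 GMT"
SAN_KEY = "X509v3 Subject Alternative Name"

def validity_days(not_before, not_after):
    """Number of whole days between the two validity timestamps."""
    start = datetime.strptime(not_before, DATE_FMT)
    end = datetime.strptime(not_after, DATE_FMT)
    return (end - start).days

def cert_features(s_cert):
    """Chain length, SAN count, validity period, and self-signed flag of the first certificate."""
    leaf = s_cert[0]  # assumption: the server certificate comes first
    san = ""
    for ext in leaf.get("extensions", []):
        san = ext.get(SAN_KEY, san)
    return {
        "num_certs": len(s_cert),
        "num_san": len([n for n in san.split(",") if n.strip()]),
        "validity_days": validity_days(
            leaf["validity_not_before"], leaf["validity_not_after"]),
        "self_signed": int(leaf["issuer"] == leaf["subject"]),
    }

# Trimmed copy of the first certificate from the tls sample above.
leaf = {
    "issuer": [{"commonName": "Microsoft IT TLS CA 1"}],
    "subject": [{"commonName": "*.pipe.aria.microsoft.com"}],
    "extensions": [{SAN_KEY: "DNS:*.pipe.aria.microsoft.com, DNS:pipe.skype.com, DNS:*.pipe.skype.com"}],
    "validity_not_before": "Sep  6 20:57:50 2017 GMT",
    "validity_not_after": "Sep  6 20:57:50 2019 GMT",
}
features = cert_features([leaf, {}])
```

Note that %b in the date format depends on the process locale; the sketch assumes the default C/English locale.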

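joy can capture HTTP features from a pcap or from a live network interface: http=1

An HTTP flow is joined with a TLS flow as context when the source IP of the HTTP flow equals the source IP of the TLS flow, i.e. during the session the client carried out plaintext HTTP communication in addition to the encrypted TLS communication.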

"http": [{
        "out": [{
            "method": "POST"
        }, {
            "uri": "/api/adp/addStat"
        }, {
            "version": "HTTP/1.1"
        }, {
            "Host": "adsp.xunlei.com"
        }, {
            "App-Type": "Mac"
        }, {
            "Accept": "*/*"
        }, {
            "Peer-Id": "6C96CFE0952F005V"
        }, {
            "Product-Id": "25"
        }, {
            "Platform-Version": "3.2.0"
        }, {
            "Version-Code": "3.2.0"
        }, {
            "Accept-Language": "zh-cn"
        }, {
            "Accept-Encoding": "gzip, deflate"
        }, {
            "Content-Length": "224"
        }, {
            "User-Agent": "迅雷/3450 CFNetwork/897.15 Darwin/17.5.0 (x86_64)"
        }, {
            "Connection": "keep-alive"
        }, {
            "Version-Name": "3.2.0"
        }, {
            "Content-Type": "application/x-www-form-urlencoded"
        }, {
            "Cookie": "usernewno=543039754; usernick=éªçèçéä¸ç"
        }, {
            "body": "00000000000000000000000000000000"
        }],
        "in": [{
            "version": "HTTP/1.1"
        }, {
            "code": "200"
        }, {
            "reason": "OK"
        }, {
            "Server": "openresty"
        }, {
            "Date": "Tue, 26 Jun 2018 08:38:15 GMT"
        }, {
            "Content-Type": "text/plain; charset=utf-8"
        }, {
            "Connection": "close"
        }, {
            "Content-Length": "12"
        }, {
            "body": "7b2272657475726e223a307d00000000"
        }]
    }],

There is a single feature vector of binary variables representing all of the observed HTTP headers.

If any of the HTTP flows have a specific header value, then that feature will be a 1 regardless of the other HTTP flows.

We used seven types of features from the HTTP data. For each feature, we selected all specific values used by at least 1% of either the malware or benign samples and an “other” category.

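The seven types were:

1. the presence of outbound HTTP fields
2. the presence of inbound HTTP fields
3. Content-Type
4. User-Agent
5. Accept-Language
6. Server
7. code

These features can either be specified by hand from domain knowledge or be derived from the normal and malware samples through one-hot encoding.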
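
The "presence of fields" features can be sketched as a binary vector over a fixed vocabulary of header names, where a field counts as present if any HTTP exchange of the flow carries it. The function name and the three-entry vocabularies are illustrative assumptions; the data layout follows the http sample above, where each direction is a list of single-key objects.

```python
def http_presence_vector(http, direction, vocabulary):
    """Binary vector marking which header names appear in any exchange for the given direction."""
    seen = set()
    for exchange in http:
        for field in exchange.get(direction, []):
            # Header names are compared case-insensitively.
            seen.update(key.lower() for key in field)
    return [int(name in seen) for name in vocabulary]

# Hypothetical vocabularies for illustration only.
OUT_FIELDS = ["user-agent", "accept-encoding", "accept-language"]
IN_FIELDS = ["server", "set-cookie", "location"]

http = [{
    "out": [{"method": "POST"}, {"User-AgEnt": "x"}, {"Accept": "*/*"}],
    "in": [{"code": "200"}, {"Server": "openresty"}],
}]
out_vec = http_presence_vector(http, "out", OUT_FIELDS)
in_vec = http_presence_vector(http, "in", IN_FIELDS)
```

Lowercasing makes the lookup robust to inconsistent capitalization, at the cost of discarding the capitalization itself, which could be kept as a separate feature.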

0x26：inbound HTTP fields (HTTP header feature)

Malicious HTTP is more likely to make use of the

1. Server
2. Set-Cookie
3. and Location fields

while the Normal HTTP traffic was more likely to make use of the

1. Connection
2. Expires
3. and Last-Modified fields.

0x27：outbound HTTP fields (HTTP header feature)

For outbound HTTP fields, the normal HTTP was more likely to make use of the

1. User-Agent
2. Accept-Encoding
3. and Accept-Language fields.

0x28：HTTP Content-Type (HTTP header feature)

The dominant HTTP Content-Type for the normal traffic was image/*,

and the malware traffic was mostly text/*.

As with the previous features, a good way to handle this one is one-hot vector encoding.

0x29：inbound Server field (HTTP header feature)

Malware most often says that it is using a version-less nginx server,

and the benign traffic most often says that it is using either the version-less Apache or nginx server.

0x30：outbound User-Agent (HTTP header feature)

The User-Agent field had a very long tail, several thousand unique strings in both datasets.

For the malware data, the most common advertised User-Agent string was Opera/9.50 (Windows NT 6.0; U; en), followed by several variations of Mozilla/5.0 and Mozilla/4.0.

All of the top User-Agent strings in the Normal data were Windows and OS X variants of Mozilla/5.0.

This field also had the most diverse set of capitalizations that we observed:

• User-Agent
• user-agent
• User-agent
• USER-AGENT
• User-AgEnt

Non-standard spellings like these show up especially often in malware traffic.

Relevant Link:

https://cse.buffalo.edu/~jcorso/t/CSE555/files/lecture_hmm.pdf 
https://github.com/cisco/joy/tree/master/test
https://www.cisco.com/c/dam/en/us/solutions/collateral/enterprise-networks/enterprise-network-security/nb-09-encrytd-traf-anlytcs-wp-cte-en.pdf
https://medium.com/@austin_57472/understanding-ciscos-new-anti-malware-tech-eta-encrypted-traffic-analytics-c5664b9efca1
http://delivery.acm.org/10.1145/3000000/2996768/p35-anderson.pdf?ip=42.120.75.129&id=2996768&acc=ACTIVE%20SERVICE&key=C8BAF422464E9FCC%2EC8BAF422464E9FCC%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1529913021_0e48b0742e1cffbb6ddfabb4b56e87cb 
https://blogs.cisco.com/security/detecting-encrypted-malware-traffic-without-decryption

3. TODO

Try the community-compiled build of joy and check whether it extracts all of the features
Study joy and the Cisco paper
http://delivery.acm.org/10.1145/3000000/2996768/p35-anderson.pdf?ip=42.120.75.129&id=2996768&acc=ACTIVE%20SERVICE&key=C8BAF422464E9FCC%2EC8BAF422464E9FCC%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1529913021_0e48b0742e1cffbb6ddfabb4b56e87cb

Feature extraction from a pcap
tcpdump -i en0 -w /Users/zhenghan/Downloads/xxx.pcap
/Users/zhenghan/Downloads/joy/bin/joy output=/Users/zhenghan/Downloads/jou_parsed.txt /Users/zhenghan/Downloads/xxx.pcap

Live capture
sudo /Users/zhenghan/Downloads/joy/bin/joy interface=en0 bidir=1 dist=1 entropy=1 http=1 hd=1 wht=1 dns=1 ssh=1 tls=1 dhcp=1 http=1 ike=1 salt=1 ppi=1 show_config=1 retrans=0 output=/Users/zhenghan/Downloads/data.json.gz

paper
https://publishup.uni-potsdam.de/opus4-ubp/frontdoor/deliver/index/docId/10094/file/httpsmalware.pdf
https://dspace.cvut.cz/bitstream/handle/10467/68528/F3-BP-2017-Strasak-Frantisek-strasak_thesis_2017.pdf
http://ecmlpkdd2017.ijs.si/papers/paperID193.pdf
https://2018.bsidesbud.com/wp-content/uploads/2018/03/seba_garcia_frantisek_strasak.pdf

posted @ 2019-03-17 16:08 郑瀚
