Smart Contracts Vulnerability Classification (with GitHub source code)
In this project, we explored various deep learning techniques, including Convolutional Neural Networks (CNNs), Swin Transformer, and ConvNeXt, for detecting and classifying vulnerabilities in smart contracts deployed on the Ethereum mainnet.
1 Introduction
Ethereum smart contracts are typically programmed using Solidity, a Turing-complete programming language. While this feature provides blockchain developers with the ability to implement complex business logic solutions and has been crucial for the development of decentralized applications (dApps), it also increases the likelihood of bugs and vulnerabilities in the code. Malicious users can exploit these issues, which poses a significant problem for smart contracts, as they cannot be patched after deployment due to the immutable nature of the ledger.
1.1 Problem Statement
Therefore, before deploying smart contracts, developers should check for potentially vulnerable pieces of code using security patterns created by experts to ensure the code's reliability and mitigate the risk of losing digital assets such as tokens. However, defining security patterns requires in-depth knowledge of the blockchain's internal workings and Solidity code, making it a task that only field experts can perform. Additionally, it is time-consuming, especially with the discovery of new vulnerabilities.
1.2 Motivation
To address this problem, automated vulnerability detection tools have been proposed, such as Oyente and Mythril based on symbolic execution, and Slither and Smartcheck based on rule-based methods. While these tools have high detection accuracy for known bugs, they are either time-consuming or rely on expert-made detection rules, which do not completely solve the problem. As a result, researchers are now investigating Machine Learning (ML) and Deep Learning (DL) based techniques, proposing solutions that are usually fast and do not require heavy feature engineering.
DL techniques based on CNNs have shown promising results in malware detection and classification. The executable malware file is transformed into a grayscale image and fed into a convolutional architecture to extract relevant features and patterns. A final linear classification head predicts the malware class or declares the program safe for use. Since there are similarities between a program's executable file and the bytecode of a smart contract, researchers are actively investigating whether similar techniques can be used for detecting vulnerabilities in Solidity code [1].
However, the efficiency of CNN-based DL techniques is relatively low. To achieve better results faster, we consider Swin Transformer and ConvNeXt, both of which are backbone networks. We use Swin Transformer to capture fine-grained details: its key innovation is the use of shifted windows, which allows it to capture local details while still maintaining a multi-stage, hierarchical transformer design. The model also uses an efficient batch computation approach with a masking mechanism to limit self-attention computation to within each window [2].
ConvNeXt starts from a ResNet-50 model, trains it with techniques similar to those used for vision Transformers, and then gradually modernizes the architecture through a series of design decisions: 1) macro design, 2) ResNeXt, 3) inverted bottleneck, 4) large kernel size, and 5) various layer-wise micro designs. The result is a family of pure ConvNets named ConvNeXt [3].
2 Dataset Description
As a relatively new research area, labeled datasets of smart contracts are not yet abundant, and most available datasets are relatively small. Two such datasets are the SmartBugs [4] and ScrawlD [5] datasets, which were labeled using different tools to minimize the probability of false positives (i.e., detecting vulnerabilities where none exist). However, these datasets only contain 6.7k and 47k elements, respectively, making them insufficient for training a deep model from scratch.
Therefore, we have created and released our own large-scale dataset, consisting of over 100k labeled smart contracts. We used the Slither static analyzer to label the dataset, which applies a set of rule-based detectors to the code and produces a JSON file with information about any detected vulnerabilities. We then mapped the 38 detectors that identified vulnerabilities in our dataset to the following five classes:
The first is access control, which is a common issue in all types of programs. If the visibility of a contract's fields or functions is not correctly set to private, malicious users may be able to access them.
The second class is arithmetic issues, which are particularly dangerous in smart contracts where unsigned integers are prevalent. These issues are mainly related to integer underflow and overflow, which can turn benevolent pieces of code into tools for DoS attacks and theft.
The third class is reentrancy, which is perhaps the most famous Ethereum vulnerability. It occurs when a call to an external contract is allowed to make new calls to the calling contract before the initial execution is complete. This can cause the contract state to change in the middle of the execution of a function.
The fourth class is unchecked calls, which relates to Solidity's low-level functions such as call(), callcode(), delegatecall(), and send(). On failure, these functions simply return false while execution continues, so developers should always check the return value of such low-level calls. We also include in this class the results of the Slither detector unused-return, which checks whether the return value of an external call is not stored in a local or state variable.
Finally, the fifth class is called "others," which includes the results of all the other relevant Solidity detectors not included in the previous classes. Examples include uninitialized-state, which checks for uninitialized state variables, incorrect-equality, which checks for the use of strict equalities to determine if an account has enough Ether or tokens (something that can be easily manipulated by an attacker), and backdoor, which simply detects the presence of a function called "backdoor."
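To make the mapping concrete, the snippet below is a minimal, illustrative sketch of how Slither's JSON output can be reduced to our five classes. The detector names shown are examples drawn from Slither's detector list, and `DETECTOR_TO_CLASS` / `label_contract` are hypothetical names for illustration, not the exact code used to build the dataset.

```python
# Illustrative sketch: mapping Slither detector names to the five classes.
# The real mapping covers all 38 detectors that fired on our dataset;
# only a handful of representative detector names are listed here.
DETECTOR_TO_CLASS = {
    "suicidal": "access-control",
    "arbitrary-send": "access-control",
    "divide-before-multiply": "arithmetic",
    "reentrancy-eth": "reentrancy",
    "reentrancy-no-eth": "reentrancy",
    "unchecked-lowlevel": "unchecked-calls",
    "unchecked-send": "unchecked-calls",
    "unused-return": "unchecked-calls",
    "uninitialized-state": "other",
    "incorrect-equality": "other",
    "backdoor": "other",
}

def label_contract(slither_json: dict) -> set:
    """Collect the class labels triggered by the detectors in a Slither JSON report."""
    labels = set()
    for finding in slither_json.get("results", {}).get("detectors", []):
        cls = DETECTOR_TO_CLASS.get(finding.get("check"))
        if cls is not None:
            labels.add(cls)
    return labels
```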
To generate an RGB image from the contract bytecode, we can follow a procedure similar to that used in [1]. For example, if we have the bytecode "606080", its three bytes form one RGB pixel with channel values (R: 0x60, G: 0x60, B: 0x80). We use this method to generate images from the bytecode, and then center crop and resize them to achieve a consistent image size before passing them as input to our convolutional neural networks. Below, we include some examples of images produced using this technique, as well as a sample batch of images.
Figure 1 A Sample Batch of Images
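For concreteness, the following is a minimal sketch of the bytecode-to-image conversion as we understand the procedure of [1]; the padding and reshaping details, and the helper name `bytecode_to_rgb`, are our own illustrative choices rather than the exact implementation.

```python
# Minimal sketch (our reading of [1]): read the hex bytecode two characters at
# a time, group every three bytes into one RGB pixel, reshape into a roughly
# square image, then resize to a fixed resolution for the CNN.
import numpy as np
from PIL import Image

def bytecode_to_rgb(bytecode_hex: str, size: int = 224) -> Image.Image:
    bytecode_hex = bytecode_hex.removeprefix("0x")
    # one byte per two hex characters, e.g. "606080" -> [0x60, 0x60, 0x80]
    byte_vals = [int(bytecode_hex[i:i + 2], 16)
                 for i in range(0, len(bytecode_hex) - 1, 2)]
    # pad so the length is a multiple of 3 (one RGB pixel per 3 bytes)
    while len(byte_vals) % 3 != 0:
        byte_vals.append(0)
    pixels = np.array(byte_vals, dtype=np.uint8).reshape(-1, 3)
    # arrange the pixels into an (approximately) square image
    side = int(np.ceil(np.sqrt(len(pixels))))
    canvas = np.zeros((side * side, 3), dtype=np.uint8)
    canvas[: len(pixels)] = pixels
    img = Image.fromarray(canvas.reshape(side, side, 3), mode="RGB")
    return img.resize((size, size))
```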
2.1 Exploratory Data Analysis
Let's first show the number of data points contained in each split.
Figure 2
Now, let's show the distribution of each class label in the train, test, and validation splits. As we can see, the classes are quite unbalanced, but their distribution stays similar across splits, which is what we expected.
Figure 3
Here we show the histogram of the length (in words) of the source code files once the comments were removed. As we can see, for all three splits the majority of the files have a length of around 1,000 words; there is, however, a small percentage of files whose length reaches almost 30,000 words. A quick look at the source code of these long contracts shows that this is mostly due to developers including library code (e.g., OpenZeppelin) in the codebase published with their contract.
Figure 4
Here we show instead how the bytecode lengths do not have a single clear peak, even if the majority of them still seem to be around 1,000-2,000 opcodes long.
Figure 5
Finally, we show that the percentage of elements for which the bytecode was not available is negligible for all splits.
Table 1 Percentage of elements without available bytecode

| Split | Percentage |
|---|---|
| train | 0.002850 |
| test | 0.003193 |
| val | 0.002762 |
3 Model Building
In previous experiments, researchers explored two main types of architectures for the model. The first type is a traditional two-dimensional convolutional neural network (CNN) applied to RGB images created from contract bytecodes, as described above. The second type is a one-dimensional CNN applied directly to the contract bytecode, treating it as a signal and normalizing it to between -1 and 1.
However, in our experiments, we used backbone neural network models for the first architecture, Swin Transformer and ConvNeXt, respectively.
Below, we provide more details about each of the architectures we used in our experiments.
Figure 6 Multiple Methods
3.1 LSTM Baseline
The baseline LSTM model we used in our experiments consisted of a simple network architecture. It included an Embedding layer, which was trained from scratch to produce an embedding for each opcode in the contract bytecode.
The model also had three stacked bidirectional LSTM layers and two linear layers that served as the classification head. These linear layers took as input the concatenation of the final hidden states of the last forward and backward LSTMs and computed the prediction.
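A minimal PyTorch sketch of this baseline is shown below; the embedding and hidden dimensions are illustrative placeholders rather than the exact hyperparameters of our experiments.

```python
# Sketch of the LSTM baseline: opcode embedding -> 3 stacked bidirectional
# LSTM layers -> 2-layer linear head on the concatenated final hidden states.
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64,
                 hidden_dim: int = 128, num_classes: int = 5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, opcode_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(opcode_ids)                # (B, T, E)
        _, (h_n, _) = self.lstm(x)                    # h_n: (num_layers * 2, B, H)
        # concatenate the final hidden states of the last forward and backward LSTMs
        last = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (B, 2H)
        return self.head(last)
```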
In this part, we introduce the Long Short-Term Memory (LSTM) network [6]. The LSTM network improves on the basic RNN structure by introducing the concept of gates. This mitigates the vanishing- and exploding-gradient problems during propagation, so the model can retain longer-term memory and is well suited to sequence processing.
Its gating mechanism allows it to selectively remember or forget information, which makes it less prone to the vanishing gradient problem than traditional RNNs. It can also handle input sequences of variable length, making it more flexible than many other types of neural networks.
Figure 7 The structure of an LSTM memory block
Figure 7 shows the structure of an LSTM memory block. The LSTM network uses three gates, namely the forget gate, the input gate, and the output gate, to control which information is discarded or added, thereby realizing the forgetting and memorization functions.
LSTM networks replace hidden layer neurons with memory blocks. Each memory block is composed of four parts: the memory cell \(\boldsymbol{c}_{t}\), the input gate \(\boldsymbol{i}_{t}\), the output gate \(\boldsymbol{o}_{t}\), and the forget gate \(\boldsymbol{f}_{t}\). The internal state \(\boldsymbol{c}_{t}\) is calculated by the formula below,
\[
\boldsymbol{c}_{t}=\boldsymbol{f}_{t} \odot \boldsymbol{c}_{t-1}+\boldsymbol{i}_{t} \odot \tilde{\boldsymbol{c}}_{t}, \qquad \boldsymbol{h}_{t}=\boldsymbol{o}_{t} \odot \tanh \left(\boldsymbol{c}_{t}\right).
\]
The forget gate \(\boldsymbol{f}_{t}\) controls how much information of the previous internal state \(\boldsymbol{c}_{t-1}\) needs to be forgotten,
\[
\boldsymbol{f}_{t}=\sigma\left(\boldsymbol{W}_{f} \boldsymbol{x}_{t}+\boldsymbol{U}_{f} \boldsymbol{h}_{t-1}+\boldsymbol{b}_{f}\right).
\]
The input gate \(\boldsymbol{i}_{t}\) controls how much information of the current candidate state \(\tilde{\boldsymbol{c}}_{t}\) needs to be saved,
\[
\boldsymbol{i}_{t}=\sigma\left(\boldsymbol{W}_{i} \boldsymbol{x}_{t}+\boldsymbol{U}_{i} \boldsymbol{h}_{t-1}+\boldsymbol{b}_{i}\right), \qquad \tilde{\boldsymbol{c}}_{t}=\tanh \left(\boldsymbol{W}_{c} \boldsymbol{x}_{t}+\boldsymbol{U}_{c} \boldsymbol{h}_{t-1}+\boldsymbol{b}_{c}\right).
\]
The output gate \(\boldsymbol{o}_{t}\) controls how much information of the current internal state \(\boldsymbol{c}_{t}\) needs to be output to the external state \(\boldsymbol{h}_{t}\),
\[
\boldsymbol{o}_{t}=\sigma\left(\boldsymbol{W}_{o} \boldsymbol{x}_{t}+\boldsymbol{U}_{o} \boldsymbol{h}_{t-1}+\boldsymbol{b}_{o}\right).
\]
Through this recurrent unit, the whole network can establish long-range temporal dependencies; the gate computations can be briefly described in compact form as
\[
\begin{bmatrix}\tilde{\boldsymbol{c}}_{t} \\ \boldsymbol{o}_{t} \\ \boldsymbol{i}_{t} \\ \boldsymbol{f}_{t}\end{bmatrix}=\begin{bmatrix}\tanh \\ \sigma \\ \sigma \\ \sigma\end{bmatrix}\left(\boldsymbol{W}\begin{bmatrix}\boldsymbol{x}_{t} \\ \boldsymbol{h}_{t-1}\end{bmatrix}+\boldsymbol{b}\right),
\]
where \(x_{t} \in \mathbb{R}^{M}\) is the input at the current time step, and \(\boldsymbol{W} \in \mathbb{R}^{4 D \times(D+M)}\) and \(\boldsymbol{b} \in \mathbb{R}^{4 D}\) are the network parameters.
3.2 Conv2D Models
For the 2D CNN models, we utilized two commonly used models from the computer vision field, specifically ResNet-18 and Inception v3. ResNet served as a sort of baseline convolutional model, while the Inception network was selected because it has a proven track record of achieving good results in malware detection and classification according to the literature. It's important to note that both models were not trained from scratch but instead were initialized with ImageNet weights. This pre-training is useful and has been shown to improve performance even on domains that are quite different from the original ImageNet dataset.
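In practice, setting up these pretrained backbones amounts to loading the ImageNet weights from torchvision and replacing the classification head, roughly as sketched below. The snippet assumes a recent torchvision weights API; older versions use `pretrained=True` instead, and the five-class head size is ours.

```python
# Sketch: load ImageNet-pretrained ResNet-18 and Inception v3 and swap the heads.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5

resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
resnet.fc = nn.Linear(resnet.fc.in_features, NUM_CLASSES)

# Inception v3 expects 299x299 inputs and has an auxiliary classifier during training.
inception = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
inception.fc = nn.Linear(inception.fc.in_features, NUM_CLASSES)
inception.AuxLogits.fc = nn.Linear(inception.AuxLogits.fc.in_features, NUM_CLASSES)
```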
3.3 Conv1D Model
Finally, some literature [7,8] suggested that 1D convolutions may be a good fit for this task. This is because traditional 2D CNNs are structured in such a way that the shallow layers capture low-level features which then get aggregated into high-level ones in subsequent layers. However, the patterns which are interesting and useful to detect code vulnerability are most likely low-level pixel-by-pixel ones. In practice, as the network grows deeper, we tend to lose some of the pixel-level information, and as a result, the semantics and context of the smart contract can be destroyed.
At the same time, applying 1D convolutions over the contract bytecode used as a signal (i.e., not reshaped as an RGB image) may be better equipped to maintain this information. Thus, we implemented and tested a ResNet-inspired 1D convolutional neural network. The first of the two pictures below shows how we defined the 1D ResBlock, while the second one shows the architecture as a whole.
Figure 8 How we defined the 1D ResBlock
Figure 9 The overall architecture of the 1D CNN
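As a rough indication of what such a block looks like in code, below is a minimal ResNet-style 1D residual block in PyTorch; the channel counts and kernel size are illustrative, and the exact block we used is the one shown in Figure 8.

```python
# Sketch of a ResNet-style 1D residual block operating on the bytecode signal
# (one channel of opcode values normalized to [-1, 1] at the network input).
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride, padding=pad, bias=False),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_ch, out_ch, kernel_size, padding=pad, bias=False),
            nn.BatchNorm1d(out_ch),
        )
        # 1x1 convolution on the shortcut when the shape changes
        self.shortcut = (
            nn.Identity() if in_ch == out_ch and stride == 1 else
            nn.Sequential(nn.Conv1d(in_ch, out_ch, 1, stride=stride, bias=False),
                          nn.BatchNorm1d(out_ch))
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + self.shortcut(x))
```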
3.4 Swin Transformer
Swin Transformer is a new type of transformer-based architecture for image recognition tasks, introduced in the paper "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" [2]. Swin Transformer is designed to address the limitations of existing transformer-based architectures, such as their high computational cost and limited ability to capture fine-grained details.
Swin Transformer achieves these goals by introducing a hierarchical architecture in which the input image is first divided into smaller patches, and then a multi-stage transformer network is applied to the patches. The key innovation in Swin Transformer is the use of shifted windows, where the windows used by the self-attention mechanism in each layer are shifted in position relative to the previous layer. This allows the network to capture fine-grained details while still maintaining global context.
Swin Transformer has achieved state-of-the-art performance on a range of benchmark image recognition datasets, including ImageNet and COCO. Its success has led to increasing interest in transformer-based architectures for image recognition tasks, and Swin Transformer is likely to be an important reference point for future research in this area.
Figure 10 Swin Transformer vs. ViT
Swin Transformer is a general-purpose Transformer backbone proposed for vision tasks, such as semantic segmentation, that require dense prediction at the pixel level. As shown on the left of Figure 10, Swin Transformer builds hierarchical feature maps by starting from small-sized patches (shown in gray) and gradually merging neighboring patches in deeper Transformer layers. It has linear computational complexity with respect to the input image size because self-attention is computed only within each local, non-overlapping window (shown in red), with a fixed number of patches per window. This makes Swin Transformer suitable for various vision tasks, in contrast to previous Transformer-based architectures, shown on the right of Figure 10, which produce feature maps of a single resolution and have quadratic complexity. The hierarchical feature maps also allow Swin Transformer to conveniently leverage advanced techniques for dense prediction such as feature pyramid networks (FPN) or U-Net.
Figure 11 An illustration of the shifted window approach for computing self-attention in the proposed Swin Transformer architecture.
A key design element of Swin Transformer is its shift of the window partition between consecutive self-attention layers, which bridges the windows of the preceding layer, providing connections among them that significantly enhance modeling power. An illustration of the shifted window approach is shown in Figure 11. This strategy is also efficient in terms of real-world latency, as all query patches within a window share the same key set, which facilitates memory access in hardware. The shifted window approach is much faster than earlier sliding-window self-attention approaches, yet is similar in modeling power.
The network starts by splitting an input RGB image into non-overlapping patches, each treated as a "token", and applies a linear embedding layer on the raw-valued features to project them to an arbitrary dimension. The architecture of a Swin Transformer (Swin-T) is shown on the left of Figure 12. Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on these patch tokens, followed by patch merging layers to reduce the number of tokens and produce a hierarchical representation. Two successive Swin Transformer blocks are shown on the right of Figure 12. The Swin Transformer block replaces the standard multi-head self-attention module in a Transformer block with a module based on shifted windows, followed by a 2-layer MLP with GELU nonlinearity in between. LayerNorm layers are applied before each MSA module and each MLP, and a residual connection is applied after each module.
Figure 12 The architecture of a Swin Transformer (Swin-T) and two successive Swin Transformer Blocks
The proposed Swin Transformer architecture uses a shifted window partitioning strategy for its self-attention computation in order to enhance modeling power and improve efficiency in terms of real-world latency. To address the issue of smaller windows resulting from the shifted configuration, a cyclic-shifting batch computation approach with a masking mechanism is employed for efficient computation. This approach is shown in Figure 13.
Figure 13 Illustration of an efficient batch computation approach for self-attention in shifted window partitioning.
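To illustrate the mechanism, the sketch below shows the window partition and the cyclic shift (via `torch.roll`) that Figures 11 and 13 describe; it covers only the tensor bookkeeping, not the attention computation or the masking itself. The helper `window_partition` follows the shape convention of the reference implementation, and the sizes used are illustrative.

```python
# Sketch of regular vs. shifted window partitioning on a feature map.
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """(B, H, W, C) -> (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

B, H, W, C, window = 1, 8, 8, 96, 4
feat = torch.randn(B, H, W, C)

# Regular (W-MSA) windows: partition the map into non-overlapping windows.
windows = window_partition(feat, window)                    # (4, 4, 4, 96)

# Shifted (SW-MSA) windows: cyclically shift by half a window before
# partitioning, so each window mixes tokens from neighbouring windows.
shifted = torch.roll(feat, shifts=(-window // 2, -window // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, window)         # (4, 4, 4, 96)
# In the full model, an attention mask restricts self-attention to tokens that
# were adjacent before the shift, and torch.roll with positive shifts undoes it.
```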
3.5 ConvNeXt
ConvNeXt is a family of pure ConvNet models that are designed to achieve high performance on computer vision tasks. It is created by gradually modernizing a standard ResNet towards the design of a Vision Transformer, and discovering several key components that contribute to the performance difference. The models are constructed entirely from standard ConvNet modules, and are competitive with Transformers in terms of accuracy and scalability. ConvNeXt achieves 87.8% ImageNet top-1 accuracy, and outperforms Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets [3].
ConvNeXt is proposed as a family of pure ConvNets that gradually modernizes the architecture of a standard ResNet (e.g. ResNet50) to the construction of a hierarchical vision Transformer (e.g. Swin-T), exploring the impact of design decisions in Transformers on ConvNets' performance. The exploration discovers several key components that contribute to the performance difference, leading to the proposal of ConvNeXt.
The exploration aimed to improve the performance of a standard ConvNet by investigating different design decisions from a Swin Transformer. It started with a ResNet-50 model, trained it with similar techniques used to train vision Transformers, and then gradually modernized the architecture with the design decisions listed as 1) macro design, 2) ResNeXt, 3) inverted bottleneck, 4) large kernel size, and 5) various layer-wise micro designs. The result is a family of pure ConvNets named ConvNeXt.
Figure 14 Block designs for a ResNet, a Swin Transformer, and a ConvNeXt. Swin Transformer's block is more sophisticated due to the presence of multiple specialized modules and two residual connections. For simplicity, we note the linear layers in Transformer MLP blocks also as “1 × 1 convs” since they are equivalent.
- Macro Design
Swin Transformers follow ConvNets to use a multi-stage design, where each stage has a different feature map resolution. There are two interesting design considerations: the stage compute ratio, and the “stem cell” structure.
Changing stage compute ratio. The original design of the computation distribution across stages in ResNet was largely empirical. The heavy “res4” stage was meant to be compatible with downstream tasks like object detection, where a detector head operates on the 14×14 feature plane. Swin-T, on the other hand, followed the same principle but with a slightly different stage compute ratio of 1:1:3:1. For larger Swin Transformers, the ratio is 1:1:9:1. Following the design, we adjust the number of blocks in each stage from (3, 4, 6, 3) in ResNet-50 to (3, 3, 9, 3), which also aligns the FLOPs with Swin-T.
Changing stem to “Patchify”. Typically, the stem cell design is concerned with how the input images will be processed at the network’s beginning. Due to the redundancy inherent in natural images, a common stem cell will aggressively downsample the input images to an appropriate feature map size in both standard ConvNets and vision Transformers.
The stem cell in standard ResNet contains a 7×7 convolution layer with stride 2, followed by a max pool, which results in a 4× downsampling of the input images. In vision Transformers, a more aggressive “patchify” strategy is used as the stem cell, which corresponds to a large kernel size (e.g.kernel size = 14 or 16) and non-overlapping convolution.
Swin Transformer uses a similar “patchify” layer, but with a smaller patch size of 4 to accommodate the architecture’s multi-stage design. The exploration replaces the ResNet-style stem cell with a patchify layer implemented using a 4×4, stride 4 convolutional layer.
- ResNeXt
At a high level, ResNeXt’s guiding principle is to “use more groups, expand width”. More precisely, ResNeXt employs grouped convolution for the 3×3 conv layer in a bottleneck block. As this significantly reduces the FLOPs, the network width is expanded to compensate for the capacity loss.
- Inverted Bottleneck
One important design in every Transformer block is that it creates an inverted bottleneck, i.e., the hidden dimension of the MLP block is four times wider than the input dimension. Interestingly, this Transformer design is connected to the inverted bottleneck design with an expansion ratio of 4 used in ConvNets. The idea was popularized by MobileNetV2, and has subsequently gained traction in several advanced ConvNet architectures.
By using the inverted bottleneck design, the FLOPs for the depthwise convolution layer increased, but the overall network FLOPs were reduced to 4.6G, mainly due to the significant reduction of FLOPs in the downsampling residual blocks' shortcut 1×1 conv layer. Despite this reduction in FLOPs, there was a slight improvement in performance.
ResNet-50 originally used the normal residual bottleneck, which is changed to the inverted bottleneck, as shown in Figure 15 (from (a) to (b)). Although the computation cost of the depthwise conv increases, the cost of the 1×1 conv used for the shortcut in residual blocks containing downsampling is greatly reduced. As a result, the FLOPs of the final model are reduced to 4.6G, and the change has a relatively small impact on ResNet-50 accuracy (80.5% to 80.6%). In Figure 15, (a) is a ResNeXt block; in (b) an inverted bottleneck block is created; and in (c) the position of the spatial depthwise conv layer is moved up.
Figure 15 Block modifications and resulted specifications.
- Large Kernel Size
One of the most distinguishing aspects of vision Transformers is their non-local self-attention, which enables each layer to have a global receptive field. While large kernel sizes have been used in the past with ConvNets, the gold standard (popularized by VGGNet) is to stack small kernel-sized (3×3) conv layers, which have efficient hardware implementations on modern GPUs. Although Swin Transformers reintroduced the local window to the self-attention block, the window size is at least 7×7, significantly larger than the ResNe(X)t kernel size of 3×3.
Moving up depthwise conv layer. To explore large kernels, one prerequisite is to move up the position of the depthwise conv layer. That is a design decision also evident in Transformers: the MSA block is placed prior to the MLP layers. As we have an inverted bottleneck block, this is a natural design choice — the complex/inefficient modules (MSA, large-kernel conv) will have fewer channels, while the efficient, dense 1×1 layers will do the heavy lifting.
Increasing the kernel size. With all of these preparations, the benefit of adopting larger kernel-sized convolutions is significant.
- Various Layer-wise Micro Designs
Most of the explorations here are done at the layer level, focusing on specific choices of activation functions and normalization layers.
Replacing ReLU with GELU. One discrepancy between NLP and vision architectures is the specifics of which activation functions to use. Numerous activation functions have been developed over time, but the Rectified Linear Unit (ReLU) is still extensively used in ConvNets due to its simplicity and efficiency. ReLU is also used as an activation function in the original Transformer paper. The Gaussian Error Linear Unit, or GELU, which can be thought of as a smoother variant of ReLU, is utilized in the most advanced Transformers, including Google’s BERT and OpenAI’s GPT-2, and, most recently, ViTs.
Fewer activation functions. One minor distinction between a Transformer and a ResNet block is that Transformers have fewer activation functions. Consider a Transformer block with key/query/value linear embedding layers, the projection layer, and two linear layers in an MLP block. There is only one activation function present in the MLP block. In comparison, it is common practice to append an activation function to each convolutional layer, including the 1 × 1 convs.
Fewer normalization layers. Transformer blocks usually have fewer normalization layers as well. Here we remove two BatchNorm (BN) layers, leaving only one BN layer before the conv 1 × 1 layers.
Substituting BN with LN. BatchNorm is an essential component in ConvNets as it improves the convergence and reduces overfitting. However, BN also has many intricacies that can have a detrimental effect on the model’s performance. There have been numerous attempts at developing alternative normalization techniques, but BN has remained the preferred option in most vision tasks. On the other hand, the simpler Layer Normalization (LN) has been used in Transformers, resulting in good performance across different application scenarios.
Directly substituting LN for BN in the original ResNet results in suboptimal performance. However, the ConvNet model does not have any difficulty training with LN; in fact, the performance is slightly better, reaching an accuracy of 81.5%.
Separate downsampling layers. In ResNet, the spatial downsampling is achieved by the residual block at the start of each stage, using a 3×3 conv with stride 2 (and a 1×1 conv with stride 2 at the shortcut connection). In Swin Transformers, a separate downsampling layer is added between stages. Further investigation shows that adding normalization layers wherever the spatial resolution is changed can help stabilize training. These include several LN layers also used in Swin Transformers: one before each downsampling layer, one after the stem, and one after the final global average pooling.
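Putting the pieces above together, the following is a condensed sketch of a ConvNeXt block plus the patchify stem, following the design in [3]: a large-kernel depthwise conv moved to the top of the block, a single LayerNorm, an inverted bottleneck with 4× expansion, and a single GELU. Details such as LayerScale and stochastic depth are omitted for brevity, so this is illustrative rather than the full model.

```python
# Condensed sketch of a ConvNeXt block and the "patchify" stem.
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # large-kernel (7x7) depthwise conv, moved up in the block
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)          # a single LN instead of several BNs
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # inverted bottleneck: 4x expansion
        self.act = nn.GELU()                    # a single GELU instead of several ReLUs
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (B, C, H, W) -> (B, H, W, C) for LN/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x

# "Patchify" stem: a 4x4, stride-4 convolution (the stem LayerNorm is omitted here).
stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)
block = ConvNeXtBlock(96)
out = block(stem(torch.randn(1, 3, 224, 224)))  # -> (1, 96, 56, 56)
```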
ConvNeXt, a pure ConvNet model, can perform as well as a hierarchical vision Transformer on image classification, object detection, instance segmentation, and semantic segmentation tasks.
ConvNeXt may be more suited for certain tasks, while Transformers may be more flexible for others. A case in point is multi-modal learning, in which a cross-attention module may be preferable for modeling feature interactions across many modalities. Additionally, Transformers may be more flexible when used for tasks requiring discretized, sparse, or structured outputs. We believe the architecture choice should meet the needs of the task at hand while striving for simplicity.
3.6 Model Comparison
RNN (Recurrent Neural Networks), 1D-CNN (One-Dimensional Convolutional Neural Networks), and Transformers are all widely used in Natural Language Processing (NLP) tasks, but they differ in their approach and architecture.
RNNs are a type of neural network that are designed to process sequential data. They are well-suited for tasks such as language modeling, speech recognition, and machine translation. The main advantage of RNNs is their ability to capture the context and dependencies between previous inputs and current outputs, making them effective for tasks that require memory and context awareness. However, RNNs can suffer from vanishing and exploding gradients and can be computationally expensive.
1D-CNNs are a type of neural network that use convolutional layers to extract local features from sequential data. They are particularly good at capturing short-term dependencies and are often used in tasks such as sentiment analysis, text classification, and named entity recognition. 1D-CNNs are computationally efficient and require less training time than RNNs. However, they are less effective at capturing long-term dependencies and context.
Transformers are a type of neural network that use self-attention mechanisms to process sequential data. They are particularly well-suited for tasks that require long-term dependencies and context awareness, such as language modeling, machine translation, and summarization. Transformers are highly parallelizable and can process large amounts of data in parallel, making them very efficient. However, they require a large amount of training data and are computationally expensive.
We have used multiple architectures here to address the task of smart contract vulnerability detection. It is worth mentioning that we did not use a traditional Transformer to process the binary code of the smart contract; instead, we used the Swin Transformer architecture to analyze the contract, rendered as an image, from a visual perspective (like the 2D CNNs).
Swin Transformer is a recently proposed variant of the Transformer architecture that has achieved state-of-the-art performance on a number of computer vision tasks, such as image classification and object detection. The Swin Transformer differs from the original Transformer architecture in several key ways.
Firstly, the Swin Transformer uses a hierarchical feature representation that consists of multiple levels of feature maps, rather than a single level of feature maps as used in the original Transformer. This allows the model to capture features at multiple scales and resolutions, which is particularly useful for computer vision tasks.
Secondly, the Swin Transformer uses a shifted window approach to divide the input image into patches, rather than a fixed-size non-overlapping grid of patches as used in the original Transformer. This reduces the number of patches required and enables the model to capture long-range dependencies more effectively.
Thirdly, the Swin Transformer uses a multi-stage architecture that processes the input image in multiple stages, with each stage processing a different level of feature maps. This enables the model to capture both local and global information effectively.
The Swin Transformer improves upon the original Transformer architecture by incorporating several innovations that are specifically tailored for computer vision tasks. By leveraging a hierarchical feature representation, a shifted window approach, and a multi-stage architecture, the Swin Transformer achieves state-of-the-art performance on a range of computer vision benchmarks.
ConvNeXt draws to some extent on the successful experience of Swin Transformer, referencing its architecture and training recipe to design a modern CNN that surpasses Swin Transformer on some tasks.
4 Results and Error Analysis
4.1 Predictive Analytics
Figure 16 Comparing the performance of different models
4.1.1 Compare 1D-CNN and 2D-CNN
As shown in the figure above, we implemented a variety of models and applied them to this downstream task. On the validation set, the 1D-CNN performs poorly: its accuracy is only 0.4755, and it does not capture the characteristics of the smart contract bytecode as well as the LSTM, contrary to our expectation; there is a significant gap with the LSTM baseline, whose accuracy is 0.6934. After switching to a 2D CNN (i.e., ResNet-18), the overall results improve significantly. Although 2D CNNs are characterized by effectively capturing local information, after stacking multiple convolutional layers the model is able to capture global information as well. This is because each convolutional layer learns to extract increasingly abstract features from the input, with higher-level layers capturing more global information relevant to the task at hand. As the input passes through multiple convolutional layers, the receptive field of the output feature map gradually increases, allowing the model to capture more global patterns and structures.
Clearly, 1D-CNN can grasp the relationship between adjacent bytecodes, which theoretically better reflects the sequential execution process of smart contracts. However, why is the result not as good as 2D-CNN?
1D-CNN has the advantage of being able to capture the relationship between adjacent bytecodes, which is important for understanding the sequential execution process of smart contracts. However, despite this advantage, the results of 1D-CNN models in analyzing smart contracts are not always as good as those of 2D-CNN models.
One reason for this is that 2D-CNN models are better at capturing spatial relationships between different parts of the contract code, which can be important for detecting patterns and anomalies in the data. 2D-CNN models are able to extract features from different regions of the contract code and learn spatial correlations between these regions, which allows them to better identify complex patterns that may be missed by 1D-CNN models.
Intuitively, it is believed that 2D-CNN models, which are typically used in computer vision, are better at capturing the call information between functions when analyzing source code for vulnerabilities. This is because the structure of a program can be thought of as a two-dimensional grid, with each function call or declaration representing a spatially distinct location within the code.
By using a 2D-CNN model, it is possible to capture the spatial relationship between different parts of the code and learn correlations between them, which can be useful for detecting patterns and anomalies that may be indicative of vulnerabilities. Additionally, 2D-CNN models have a larger receptive field than 1D-CNN models, which allows them to capture more complex patterns and relationships in the data.
Furthermore, call information between functions is a critical aspect of program execution and can often be a source of vulnerabilities, such as in cases of code injection attacks or buffer overflow vulnerabilities. By using a 2D-CNN model to analyze the code, it may be possible to detect these vulnerabilities more effectively and accurately than with a 1D-CNN model.
In summary, in terms of both efficiency and effectiveness, we believe the computer vision approach outperforms the natural language processing approach for smart contract vulnerability detection, although the 2D-CNN and the subsequent Swin Transformer and ConvNeXt models show signs of overfitting.
4.1.2 Innovation
Due to the success of computer vision methods in smart contract vulnerability detection, we further applied more powerful backbone networks to this task, namely Swin Transformer and ConvNeXt.
The powerful Swin Transformer further improves performance compared with ResNet-18, and the newer ConvNeXt makes a further leap compared with Swin Transformer, ultimately surpassing the LSTM baseline in terms of accuracy (0.7032) and F1-score (0.8095) and becoming the best model in our study.
4.1.3 Image Analysis
Figure 17 Color representation of different models on different datasets. When the validation set and training set appear on the same graph, the color of the curve corresponding to the validation set is darker.
Figure 18 Accuracy Curve
Figure 19 Loss Curve
Computer vision methods face the problem of overfitting.
The loss and accuracy curves for the 2D-CNN, Swin Transformer, and ConvNeXt models in computer vision tasks indicate that these models suffer from overfitting. Overfitting occurs when a model performs well on the training set, but poorly on the validation or test set, which suggests that the model has memorized the training data rather than learning to generalize to new data.
This problem is evident in the accuracy curves, which show high accuracy on the training set, but a significant drop in accuracy on the validation set. Additionally, the loss curves for the validation set show an initial decrease in loss, followed by an increase, indicating that the model is unable to generalize well to new data.
One potential reason for this overfitting is that the models are too complex, with too many parameters relative to the amount of training data. This can cause the model to learn noise in the training data, rather than meaningful patterns, which leads to poor performance on new data.
To address overfitting, techniques such as regularization, data augmentation, and early stopping can be used. These techniques can help to prevent the model from memorizing the training data and improve its ability to generalize to new data.
It is worth mentioning that we used early stopping during training, so different methods were trained for different numbers of epochs.
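For clarity, the early-stopping rule we refer to is the usual one sketched below: training stops when the validation metric has not improved for a fixed number of consecutive epochs. The patience value shown is illustrative, not the one we used.

```python
# Minimal sketch of an early-stopping rule on a validation metric (e.g. accuracy).
class EarlyStopping:
    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0

    def step(self, val_metric: float) -> bool:
        """Call once per epoch; returns True when training should stop."""
        if self.best is None or val_metric > self.best + self.min_delta:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```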
The training efficiency of the natural language processing approach is too low.
The accuracy curve for the 1D-CNN model shows that its accuracy is still increasing at 100 epochs, but the rise is slow. In contrast, computer vision methods such as 2D-CNN, Swin Transformer, and ConvNeXt show great advantages in this task, as they can converge to a good solution within 50 epochs, even if they suffer from overfitting.
The slow convergence of the 1D-CNN model may be due to its limited capacity to capture complex patterns and relationships in the data. This makes it more difficult for the model to learn meaningful representations and achieve high accuracy within a reasonable number of epochs.
On the other hand, computer vision methods such as 2D-CNN, Swin Transformer, and ConvNeXt are better equipped to handle complex patterns and relationships in the data, which enables them to achieve good performance even with a smaller number of epochs. However, these methods may suffer from overfitting, as mentioned earlier.
4.1.4 Future Research
Improving the performance of 1D-CNN models: Despite its limitations, the 1D-CNN model has potential for improvement. Future work could explore ways to improve its ability to capture complex patterns and relationships in the data, which could lead to better performance on various tasks.
Developing hybrid models: Hybrid models that combine the strengths of 1D-CNN and computer vision methods could be explored to achieve better performance. For example, a model could use a 1D-CNN for sequential data processing and a 2D-CNN for spatial data processing, or combine 1D-CNN and Swin Transformer architectures.
Exploring new architectures for computer vision: While 2D-CNN, Swin Transformer, and ConvNeXt are currently popular architectures for computer vision tasks, future work could explore new architectures that are better suited to specific tasks or have better performance.
Addressing the problem of overfitting: Overfitting remains a significant challenge in deep learning. Future work could explore new techniques for preventing overfitting, such as advanced regularization methods or ensemble learning.
The project code has been uploaded to GitHub (except the LSTM baseline); see https://github.com/Tracker1701/Smart_Contracts_Vulnerability_Detection.
Acknowledgement
Thanks to Qingbin, Zijiao and Qiuning who have made great contributions to our project. Thanks to Mr. Jin Fusheng for his guidance.
References
[1] Huang T H D. Hunting the ethereum smart contract: Color-inspired inspection of potential attacks [J]. arXiv preprint arXiv:1807.01868, 2018.
[2] Liu Z, Lin Y, Cao Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows [C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 10012-10022.
[3] Liu Z, Mao H, Wu C Y, et al. A ConvNet for the 2020s[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11976-11986.
[4] Durieux T, Ferreira J F, Abreu R, et al. Empirical review of automated analysis tools on 47,587 Ethereum smart contracts[C]//Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 2020: 530-541.
[5] Yashavant C S, Kumar S, Karkare A. Scrawld: A dataset of real world ethereum smart contracts labelled with vulnerabilities[J]. arXiv preprint arXiv:2202.11409, 2022.
[6] Malhotra P, Vig L, Shroff G, et al. Long Short Term Memory Networks for Anomaly Detection in Time Series.[C]//ESANN: vol. 2015. 2015: 89.
[7] Hwang S J, Choi S H, Shin J, et al. CodeNet: Code-targeted convolutional neural network architecture for smart contract vulnerability detection[J]. IEEE Access, 2022, 10: 32595-32607.
[8] Lin W C, Yeh Y R. Efficient malware classification by binary sequences with one-dimensional convolutional neural networks[J]. Mathematics, 2022, 10(4): 608.