A closer look at Ownership in Rust

Original article: A closer look at Ownership in Rust

So you want to learn Rust and keep hearing about Ownership and Borrowing, but can’t quite wrap your head around what they are. Ownership is so essential that it pays to understand it early in your journey of learning Rust, not least because it helps you avoid compiler errors that keep you from implementing your programs.

In our previous article, we talked about the Ownership model from a JavaScript developer’s perspective. In this article we’ll take a closer look at how Rust manages memory, why this affects how we write code in Rust, and how it preserves memory safety.

Once you’re done reading this, you might want to check out our articles on References in Rust and on the difference between String and &str.

What is Memory Safety anyway?

First and foremost, it’s good to understand what memory safety actually means when discussing what makes Rust stand out as a programming language. If you come from a non-systems programming background, or have mainly worked with garbage-collected languages, this fundamental feature of Rust can be a bit harder to appreciate.

As Will Crichton states in his great article Memory Safety in Rust: A Case Study with C:

Memory safety is the property of a program where memory pointers used always point to valid memory, i.e. allocated and of the correct type/size. Memory safety is a correctness issue—a memory unsafe program may crash or produce nondeterministic output depending on the bug.

In practice, this means that some languages allow us to write “memory unsafe” code, in the sense that it’s fairly easy to introduce bugs. Some of those bugs are:

  • Dangling pointers: Pointers that point to invalid data (this will make more sense once we look at how data is stored in memory). You can read more about dangling pointers here.
  • Double frees: Trying to free the same memory location twice, which can lead to “undefined behaviour”. More on that here.

To illustrate the concept of a dangling pointer, let’s take a look at the following C++ code and how it is represented in memory:

std::string s = "Have a nice day";

The initialized string is usually represented in memory using the stack and the heap like this:

                     buffer
                   /   capacity
                 /   /    length
               /   /    /
            +–––+––––+––––+
stack frame │ • │ 16 │ 15 │ <– s
            +–│–+––––+––––+
              │
            [–│––––––––––––––––––––––––– capacity ––––––––––––––––––––––––––]
              │
            +–V–+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+
       heap │ H │ a │ v │ e │   │ a │   │ n │ i │ c │ e │   │ d │ a │ y │   │
            +–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+

            [––––––––––––––––––––––––– length ––––––––––––––––––––––––––]

We’ll get into what the stack and the heap are in a second, but for now it’s important to appreciate that what gets stored on the stack is the std::string object itself, which has a fixed size of three words. Its fields are a pointer to the heap-allocated buffer that holds the actual data, the buffer’s capacity, and the length of the text. In other words, the std::string owns its buffer. When the program destroys this string, the string’s destructor frees the corresponding buffer as well.

However, it’s entirely possible to create other pointers to a character living inside that same buffer. Those pointers are not destroyed along with the string, so they become invalid once the string has been destroyed, and there we have it: a dangling pointer!

If you wonder why this is not an issue when you write programs in languages like JavaScript or Python, the reason is that those languages are garbage collected. This means the language comes with a program that, at run-time, traverses memory and frees everything that is no longer in use. Such a program is called a Garbage Collector. While this sounds like a nice thing to have, garbage collection comes at a cost: since it happens while your program is running, it can affect the program’s overall run-time performance.

Rust does not come with garbage collection. Instead, it guarantees memory safety using ownership and borrowing. When we say that Rust comes with memory safety, we refer to the fact that, by default, Rust’s compiler doesn’t even allow us to write code that is not memory safe. How cool is that?
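To make this concrete, here is a minimal sketch of the dangling-pointer scenario from above, transplanted to Rust. The helper first_word is made up for illustration and is not part of the original article. The commented-out line is the interesting part: enabling it would destroy the string while a reference into its buffer is still in use, and the compiler would refuse to build the program.

```rust
// Hypothetical helper for illustration: returns the first word of a
// string by borrowing from the string's own buffer (no copy is made).
fn first_word(s: &str) -> &str {
    s.split(' ').next().unwrap_or("")
}

fn main() {
    let s = String::from("Have a nice day");
    let word = first_word(&s); // `word` points into the buffer owned by `s`

    // drop(s); // rejected at compile time: `s` is still borrowed by `word`

    println!("{}", word); // safe: `s` is guaranteed to be alive here
}
```

In the C++ version, nothing stops the equivalent pointer from outliving the string. Here the same mistake is caught before the program ever runs.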

Stack and Heap

Before we jump into how Rust handles Ownership of data, let’s quickly touch on what the stack and the heap are and how they determine which data gets stored where.

Both the stack and the heap are parts of memory, but they are organized differently. The stack is… well, a stack: values are stored in the order they come in and removed in the opposite order, which makes both operations very fast. The heap is less organized: the allocator has to find a free block that is large enough for each request and keep track of it, so storing data there, and reaching it through a pointer, takes a bit more computational effort.

What goes onto the stack and what goes onto the heap depends on the data we’re dealing with. In Rust, any data of fixed size (that is, of a size “known” at compile time), such as machine integers, floating-point numeric types, pointer types and a few others, is stored on the stack. Dynamic and “unsized” data is stored on the heap. This is because types of unknown size often either need to be able to grow dynamically, or need to do certain “clean up” work when destructed (more than just popping a value off the stack).

That’s why, in the previous example, the string object itself (a buffer pointer, a capacity and a length) is always of fixed size and is stored on the stack, whereas the buffer (the raw data) is stored on the heap.
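The split between a fixed-size handle and a dynamically sized buffer can be observed from Rust itself. The sketch below assumes a current Rust toolchain, where a String occupies three machine words; the language does not formally guarantee this layout, so treat the number as an observation rather than a rule. The function name string_size_in_words is invented for this example.

```rust
use std::mem::size_of;

// Hypothetical helper: size of the String handle, measured in machine words.
fn string_size_in_words() -> usize {
    size_of::<String>() / size_of::<usize>()
}

fn main() {
    let s = "Have a nice day".to_string();

    // The handle (pointer, capacity, length) has the same size for every String.
    println!("String handle: {} words", string_size_in_words());

    // The text itself lives in the heap buffer; its size is a run-time property.
    println!("length: {}, capacity: {}", s.len(), s.capacity());
}
```

The handle size stays the same no matter how long the text is. Only the heap buffer grows.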

As for Rust, the language generally avoids storing data on the heap, and the compiler will never do so implicitly. To make heap allocation explicit, Rust comes with certain pointer types such as Box, which we’ll cover in another article. For more information on the stack and the heap, I highly recommend taking a look at Rust’s official chapter on Ownership.
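As a small preview of what “explicit” means here, the following sketch puts one integer on the stack and one on the heap via Box. It is only meant to show that heap storage is something we ask for by name; the function boxed_sum is a made-up example.

```rust
// Hypothetical example: adds a stack value and a heap value.
fn boxed_sum() -> i32 {
    let on_stack: i32 = 40;              // plain integer, stored on the stack
    let on_heap: Box<i32> = Box::new(2); // explicit request for heap storage
    on_stack + *on_heap                  // `*` reads the value behind the pointer
} // `on_heap` is dropped here, which frees its heap allocation

fn main() {
    println!("{}", boxed_sum());
}
```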

Understanding Ownership

Now that we have a better understanding of how data is stored, let’s take a closer look at Ownership in Rust. In Rust, every value has a single owner that determines its lifetime. If we take the C++ code from above and look at the Rust equivalent, the data is stored in memory in pretty much the same way.

let s = "Have a nice day".to_string();

Similarly, when the owner of some value is “freed”, or in Rust lingo “dropped”, the owned value is dropped as well. When are values dropped? This is where it gets interesting. When the program leaves the block in which a variable is declared, that variable is dropped, and its value is dropped with it.

A block could be a function body, an if statement, or pretty much anything that introduces a new code block with curly braces. Assuming we have the following function:

fn greeting() {
  let s = "Have a nice day".to_string();
  println!("{}", s);
} // `s` goes out of scope and is dropped here

Just by looking at the code, we know the lifetime of s, because we know that Rust will drop its value when execution reaches the end of the function block. The same applies when we deal with more complex data structures. Let’s take a look at the following code:

let names = vec!["Pascal".to_string(), "Christoph".to_string()];

This creates a vector of names. A vector in Rust is like an array or a list, but it’s dynamic in size: we can push() values into it at run-time. Our memory will look something like this:

            [–– names ––]
            +–––+–––+–––+
stack frame │ • │ 3 │ 2 │
            +–│–+–––+–––+
              │
            [–│–– 0 –––] [–––– 1 ––––]
            +–V–+–––+–––+–––+––––+–––+–––+–––+
       heap │ • │ 8 │ 6 │ • │ 12 │ 9 │       │
            +–│–+–––+–––+–│–+––––+–––+–––+–––+
              │\   \   \  │
              │ \   \    length
              │  \    capacity
              │    buffer │
              │           │
            +–V–+–––+–––+–––+–––+–––+–––+–––+
            │ P │ a │ s │ c │ a │ l │   │   │
            +–––+–––+–––+–––+–––+–––+–––+–––+
                          │
                          │
                        +–V–+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+
                        │ C │ h │ r │ i │ s │ t │ o │ p │ h │   │   │   │
                        +–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+

Notice how the vector object itself, similar to the string object earlier, is stored on the stack together with its capacity and length. It also holds a pointer to the location on the heap where the vector’s data lives. The string objects of the vector are stored on the heap, and each of them in turn owns its dedicated buffer.

This creates a tree structure of data in which every value has a single owner. When names goes out of scope, its values are dropped, which eventually causes the string buffers to be dropped as well.
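The cascade can be made visible with a small experiment. The sketch below is not from the original article: Tracked is a hypothetical stand-in for String that records its own destruction in a shared log (the Rc and RefCell types are used only so that the log can be shared, and are not otherwise relevant here). When the vector goes out of scope, each element it owns is dropped along with it.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical type for illustration: notes its own destruction in a shared log.
struct Tracked {
    name: String,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("dropped {}", self.name));
    }
}

fn drop_order() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let names = vec![
            Tracked { name: "Pascal".to_string(), log: Rc::clone(&log) },
            Tracked { name: "Christoph".to_string(), log: Rc::clone(&log) },
        ];
        log.borrow_mut().push(format!("{} names alive", names.len()));
    } // `names` goes out of scope: the vector drops every element it owns
    let events = log.borrow().clone();
    events
}

fn main() {
    for event in drop_order() {
        println!("{}", event);
    }
}
```

Nothing in the code frees the elements by hand. Dropping the single owner at the root of the tree is enough.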

This probably raises a couple of questions, though. How does Rust ensure that only a single variable owns a value? How can we have multiple variables point at the same data? Are we forced to copy everything to ensure that only a single variable owns some value?

Moves and Borrowing

Let’s start with the first question: how does Rust ensure that only a single variable owns a value? It turns out that Rust moves values to their new owner when doing things like assigning a value or passing it to a function. This is a very important concept, as it affects how we write code in Rust.

Let’s take a look at the following code:

let name = "Pascal".to_string();
let a = name;
let b = name;

Coming from languages like Python or JavaScript, we’d probably expect both a and b to hold a reference to name and therefore to point at the same data. However, when we try to compile this code, we soon realize that this is not the case:

error[E0382]: use of moved value: `name`
 --> src/main.rs:4:11
  |
2 |   let name = "Pascal".to_string();
  |       ---- move occurs because `name` has type `std::string::String`, which does not implement the `Copy` trait
3 |   let a = name;
  |           ---- value moved here
4 |   let b = name;
  |           ^^^^ value used here after move

We get a compiler error with a lot of (useful) information. The compiler tells us that we’re trying to assign the value of name to b after it has been moved to a. The problem is that, by the time we try to assign the value of name to b, name doesn’t actually own the value anymore. Why? Because ownership has been moved to a in the meantime.

Let’s take a look at what happens in memory to get a better understanding of what’s going on. When name is initialized, it looks very similar to our earlier examples:

            +–––+–––+–––+
stack frame │ • │ 8 │ 6 │ <– name
            +–│–+–––+–––+
              │
            +–V–+–––+–––+–––+–––+–––+–––+–––+
       heap │ P │ a │ s │ c │ a │ l │   │   │
            +–––+–––+–––+–––+–––+–––+–––+–––+

However, when we assign the value of name to a, we move ownership to a as well, leaving name uninitialized:

            [–– name ––] [––– a –––]
            +–––+–––+–––+–––+–––+–––+
stack frame │   │   │   │ • │ 8 │ 6 │
            +–––+–––+–––+–│–+–––+–––+
                          │
              +–––––––––––+
              │
            +–V–+–––+–––+–––+–––+–––+–––+–––+
       heap │ P │ a │ s │ c │ a │ l │   │   │
            +–––+–––+–––+–––+–––+–––+–––+–––+

At this point, it’s no surprise that the statement let b = name results in an error. What’s important to appreciate here is that all of this is static analysis done by the compiler, without actually running our code!

Remember when I said Rust’s compiler doesn’t allow us to write memory unsafe code?
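The following sketch shows the move from the example in a form that compiles, with the offending line left as a comment. It also contrasts this with a type that implements the Copy trait mentioned in the error message: an i32 is copied on assignment rather than moved. Both function names are made up for illustration.

```rust
// Hypothetical example: a String is moved on assignment.
fn move_into_a() -> String {
    let name = "Pascal".to_string();
    let a = name; // ownership of the heap buffer moves from `name` to `a`

    // let b = name; // would not compile: `name` was moved in the line above

    a // `a` is the sole owner, so it can hand the value on to the caller
}

// Hypothetical example: an i32 is copied on assignment.
fn copy_integer() -> (i32, i32) {
    let x = 42;
    let y = x; // i32 implements `Copy`, so `x` stays usable afterwards
    (x, y)
}

fn main() {
    println!("{}", move_into_a());
    println!("{:?}", copy_integer());
}
```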

So how do we handle cases like these? What if we really want to have multiple variables point at the same data? There are two ways to deal with this, and which one we choose depends on the case. Probably the easiest, but also the most costly, way to handle this scenario is to copy or clone the value. Obviously, that also means we end up duplicating the data in memory:

let name = "Pascal".to_string();
let a = name;
let b = a.clone();

Notice that we don’t need to clone the value from name into a, because we’re not trying to read a value from name after it has been assigned to a. When we run this program, the data will be represented in memory like this before it’s dropped:

            [–– name ––] [––– a –––][–––– b ––––]
            +–––+–––+–––+–––+–––+–––+–––+–––+–––+
stack frame │   │   │   │ • │ 8 │ 6 │ • │ 8 │ 6 │
            +–––+–––+–––+–│–+–––+–––+–│–+–––+–––+
                          │           │
              +–––––––––––+           +–––––––+
              │                               │
            +–V–+–––+–––+–––+–––+–––+–––+–––+–V–+–––+–––+–––+–––+–––+–––+–––+
       heap │ P │ a │ s │ c │ a │ l │   │   │ P │ a │ s │ c │ a │ l │   │   │
            +–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+–––+
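That the clone really owns a second buffer can be checked by comparing buffer addresses and by changing one copy. This is a minimal sketch; the function name and the appended "!" are invented for the demonstration.

```rust
// Hypothetical example: returns both strings plus a flag telling whether
// they were backed by different heap buffers.
fn clone_is_independent() -> (String, String, bool) {
    let name = "Pascal".to_string();
    let a = name;          // move: no string data is copied
    let mut b = a.clone(); // clone: a second heap buffer with the same bytes

    // Each string owns its own buffer, so the buffer addresses differ.
    let separate_buffers = a.as_ptr() != b.as_ptr();

    b.push_str("!"); // changing `b` leaves `a` untouched
    (a, b, separate_buffers)
}

fn main() {
    let (a, b, separate_buffers) = clone_is_independent();
    println!("a = {}, b = {}, separate buffers: {}", a, b, separate_buffers);
}
```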

Obviously, cloning data isn’t always an option. Depending on what data we’re dealing with, it can be quite an expensive operation that puts a lot of pressure on memory. Often, all we really need is a reference to a value. This is especially useful when we write functions that don’t actually need ownership of a value. Imagine a function greet() that takes a name and simply outputs it:

fn greet(name: String) {
  println!("Hello, {}!", name);
}

This function doesn’t need ownership to output the value it takes. Taking ownership also prevents us from calling the function multiple times with the same variable:

let name = "Pascal".to_string();
greet(name);
greet(name); // Move happened earlier so this won't compile

To get a reference to a variable we use the & symbol. With it, we can be explicit about when we expect a reference rather than a value:

fn greet(name: &String) {
  println!("Hello, {}!", name);
}

For the record, we would probably design this API to expect a &str instead, for various reasons, but I don’t want to make it more confusing than it needs to be, so we’ll just stick with a &String for now.
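For the curious, here is a rough sketch of what the &str variant could look like. One practical reason for preferring it is that the same function then accepts both borrowed String values and string literals. The name greet_str is made up to keep it apart from greet(), and it returns the greeting instead of printing it so that the result is easy to inspect.

```rust
// Hypothetical &str-based variant of greet(); returns the text instead of
// printing it.
fn greet_str(name: &str) -> String {
    format!("Hello, {}!", name)
}

fn main() {
    let owned = "Pascal".to_string();
    println!("{}", greet_str(&owned));      // a &String coerces to &str
    println!("{}", greet_str("Christoph")); // a string literal already is a &str
    println!("{}", greet_str(&owned));      // `owned` was only borrowed, never moved
}
```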

greet() now expects a string reference, which also enables us to call it multiple times like this:

let name = "Pascal".to_string();
greet(&name);
greet(&name);

When a function expects a reference to a value, it borrows it. Notice that it never gets ownership of the values that are passed to it.

We can address the variable assignment from earlier in a similar fashion:

let name = "Pascal".to_string();
let a = &name;
let b = &name;

With this code, name never loses ownership of its value, and a and b are just pointers to the same data. The same can be expressed with:

let name = "Pascal".to_string();
let a = &name;
let b = a;

Calling greet() in between those assignments is no longer a problem either:

let name = "Pascal".to_string();
let a = &name;
greet(a);
let b = a;
greet(a);
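The reason let b = a does not move anything away from a is that a shared reference is itself a small value that gets copied on assignment. The sketch below checks that both references point at the same String. As an assumption made for easier inspection, this variant of greet() returns the greeting instead of printing it, and shared_borrows is an invented name.

```rust
// Variant of greet() that returns the greeting instead of printing it
// (an assumption made here so that the result can be inspected).
fn greet(name: &String) -> String {
    format!("Hello, {}!", name)
}

// Hypothetical example: two references to one String.
fn shared_borrows() -> (String, String, bool) {
    let name = "Pascal".to_string();
    let a = &name;
    let b = a; // copies the reference, not the string; `a` stays usable
    let same_target = std::ptr::eq(a, b); // both point at the same String
    (greet(a), greet(b), same_target)
}

fn main() {
    println!("{:?}", shared_borrows());
}
```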

Conclusion

This was really just the tip of the iceberg. There are a few more things to consider when it comes to Ownership, Borrowing and Moving data, but hopefully this article conveys a good basic understanding of what goes on behind the scenes when Rust ensures memory safety.

 

posted @ 2023-09-22 12:01  ImreW