Why is size_t so important?

Original article: Why size_t matters


Using size_t appropriately can make your code more portable, more readable, and more efficient.


Numerous functions in the Standard C library accept arguments or return values that represent object sizes in bytes. For example, the lone argument in malloc(n) specifies the size of the object to be allocated, and the last argument in memcpy(s1, s2, n) specifies the size of the object to be copied. The return value of strlen(s) yields the length of (the number of characters in) null-terminated character array s excluding the null character, which isn't exactly the size of s, but it's in the ballpark.




You might reasonably expect these parameters and return types that represent sizes to be declared with type int (possibly long and/or unsigned), but they aren't. Rather, the C standard declares them as type size_t. According to the standard, the declaration for malloc should appear in <stdlib.h> as something equivalent to:


void *malloc(size_t n);



and the declarations for memcpy and strlen should appear in <string.h> looking much like:


void *memcpy(void *s1, void const *s2, size_t n);
size_t strlen(char const *s);
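As a minimal sketch of how these three declarations fit together, the hypothetical helper below (copy_string is my name, not part of the standard library) keeps the length in a size_t from the moment strlen returns it until malloc and memcpy consume it:

```c
#include <stdlib.h>  /* malloc, free, size_t */
#include <string.h>  /* memcpy, strlen */

/* Hypothetical helper: returns a heap-allocated copy of s, or NULL if
   allocation fails. The caller is responsible for calling free(). */
char *copy_string(char const *s)
{
    size_t len = strlen(s);        /* strlen returns a size_t */
    char *copy = malloc(len + 1);  /* malloc takes a size_t; +1 for '\0' */
    if (copy != NULL) {
        memcpy(copy, s, len + 1);  /* memcpy's third parameter is a size_t */
    }
    return copy;
}
```

Because every size stays a size_t, no conversion between integer types happens anywhere along the way.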



The type size_t also appears throughout the C++ standard library. In addition, the C++ library uses a related symbol size_type, possibly even more than it uses size_t.




In my experience, most C and C++ programmers are aware that the standard libraries use size_t, but they really don't know what size_t represents or why the libraries use size_t as they do. Moreover, they don't know if and when they should use size_t themselves.




In this column, I'll explain what size_t is, why it exists, and how you should use it in your code.



A portability problem


Classic C (the early dialect of C described by Brian Kernighan and Dennis Ritchie in The C Programming Language, Prentice-Hall, 1978) didn't provide size_t. The C standards committee introduced size_t to eliminate a portability problem, illustrated by the following example.




Let's examine the problem of writing a portable declaration for the standard memcpy function. We'll look at a few different declarations and see how well they work when compiled for different architectures with different-sized address spaces and data paths.




Recall that calling memcpy(s1, s2, n) copies the first n bytes from the object pointed to by s2 to the object pointed to by s1, and returns s1. The function can copy objects of any type, so the pointer parameters and return type should be declared as "pointer to void." Moreover, memcpy doesn't modify the source object, so the second parameter should really be "pointer to const void." None of this poses a problem.




The real concern is how to declare the function's third parameter, which represents the size of the source object. I suspect many programmers would choose plain int, as in:


void *memcpy(void *s1, void const *s2, int n);



which works fine most of the time, but it's not as general as it could be. Plain int is signed--it can represent negative values. However, sizes are never negative. Using unsigned int instead of int as the type of the third parameter lets memcpy copy larger objects, at no additional cost.




On most machines, the largest unsigned int value is roughly twice the largest positive int value. For example, on a 16-bit two's-complement machine, the largest unsigned int value is 65,535 and the largest positive int value is 32,767. Using an unsigned int as memcpy's third parameter lets you copy objects roughly twice as big as when using int.



Although the size of an int varies among C implementations, on any given implementation int objects are always the same size as unsigned int objects. Thus, passing an unsigned int argument is always the same cost as passing an int.

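The two claims above (same storage size, roughly double the range) can be checked on whatever compiler you happen to be using. This is only a sketch: the function names are mine, and the ratio of 2 is what I would expect on ordinary two's-complement targets rather than something the C standard promises.

```c
#include <limits.h>  /* INT_MAX, UINT_MAX */

/* Returns 1 if int and unsigned int occupy the same number of bytes. */
int int_and_unsigned_same_size(void)
{
    return sizeof(int) == sizeof(unsigned int);
}

/* Integer ratio of the largest unsigned int to the largest int.
   Where UINT_MAX == 2 * INT_MAX + 1, this evaluates to 2. */
unsigned int unsigned_to_int_max_ratio(void)
{
    return UINT_MAX / (unsigned int)INT_MAX;
}
```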



Using unsigned int as the parameter type, as in:


void *memcpy(void *s1, void const *s2, unsigned int n);



works just dandy on any platform in which an unsigned int can represent the size of the largest data object. This is generally the case on any platform in which integers and pointers have the same size, such as IP16, in which both integers and pointers occupy 16 bits, or IP32, in which both occupy 32 bits. (See the sidebar on C data model notation.)



C data model notation

Of late, I've run across several articles that employ a compact notation for describing the C language data representation on different target platforms. I have yet to find the origins of this notation, a formal syntax, or even a name for it, but it appears to be simple enough to be usable without a formal definition. The general form of the notation appears to be:
I nI L nL LL nLL P nP

where each capital letter (or pair thereof) represents a C data type, and each corresponding n is the number of bits that the type occupies. I stands for int, L stands for long, LL stands for long long, and P stands for pointer (to data, not pointer to function). Each letter and number is optional.

For example, an I16P32 architecture supports 16-bit int and 32-bit pointers, without describing whether it supports long or long long. If two consecutive types have the same size, you typically omit the first number. For example, you typically write I16L32P32 as I16LP32, which is an architecture that supports 16-bit int, 32-bit long, and 32-bit pointers.

The notation typically arranges the letters so their corresponding numbers appear in ascending order. For example, IL32LL64P32 denotes an architecture with 32-bit int, 32-bit long, 64-bit long long, and 32-bit pointers; however, it appears more commonly as ILP32LL64.
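To see which data model a given compiler uses, you could build the unabbreviated form of the notation from sizeof and CHAR_BIT. The sketch below is illustrative only: describe_data_model and bits_of are hypothetical names, the widths include any padding bits, and no attempt is made to merge equal sizes or reorder the letters.

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t */
#include <stdio.h>   /* snprintf */

/* Width in bits of a type whose sizeof is `bytes`. */
static size_t bits_of(size_t bytes)
{
    return bytes * CHAR_BIT;
}

/* Hypothetical helper: writes a string of the form I<n>L<n>LL<n>P<n>
   into buf and returns whatever snprintf returns. */
int describe_data_model(char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "I%zuL%zuLL%zuP%zu",
                    bits_of(sizeof(int)),
                    bits_of(sizeof(long)),
                    bits_of(sizeof(long long)),
                    bits_of(sizeof(void *)));
}
```

On a common 64-bit Linux target I would expect this to yield I32L64LL64P64, but the point of the helper is to ask the compiler rather than assume.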



Unfortunately, this declaration for memcpy comes up short on an I16LP32 processor (16 bits for int and 32 bits for long and pointers), such as the first generation Motorola 68000. In this case, the processor can copy objects larger than 65,535 bytes, but this memcpy can't because parameter n can't handle values that large.
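The effect is easy to simulate on a modern machine. In this sketch, uint16_t stands in for the 16-bit unsigned int of an I16LP32 target (an assumption made purely for illustration; the function name is hypothetical), and the conversion shows what byte count such a parameter would actually receive:

```c
#include <stdint.h>  /* uint16_t */

/* Simulates passing a byte count through a 16-bit unsigned parameter.
   Conversion to an unsigned type reduces the value modulo 65,536. */
unsigned long count_seen_by_16bit_param(unsigned long requested)
{
    uint16_t n = (uint16_t)requested;  /* stand-in for a 16-bit unsigned int */
    return n;
}
```

A request to copy 70,000 bytes would arrive as 4,464 (70,000 - 65,536), so the copy would silently stop far short of the end of the object.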



Easy to fix, you say? Just change the type of memcpy's third parameter:


void *memcpy(void *s1, void const *s2, unsigned long n);



You can use this declaration to write a memcpy for an I16LP32 target, and it will be able to copy large objects. It will also work on IP16 and IP32 platforms, so it does provide a portable declaration for memcpy. Unfortunately, on an IP16 platform, the machine code you get from using unsigned long here is almost certainly a little less efficient (the code is both bigger and slower) than what you get from using an unsigned int.




In Standard C, a long (whether signed or unsigned) must occupy at least 32 bits. Thus, an IP16 platform that supports Standard C really must be an IP16L32 platform. Such platforms typically implement each 32-bit long as a pair of 16-bit words. In that case, moving a 32-bit long usually requires two machine instructions, one to move each 16-bit chunk. In fact, almost all 32-bit operations on these platforms require at least two instructions, if not more.




Thus, declaring memcpy's third parameter as an unsigned long in the name of portability exacts a performance toll on some platforms, something we'd like to avoid. Using size_t avoids that toll.




Type size_t is a typedef that's an alias for some unsigned integer type, typically unsigned int or unsigned long, but possibly even unsigned long long. Each Standard C implementation is supposed to choose the unsigned integer that's big enough--but no bigger than needed--to represent the size of the largest possible object on the target platform.
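A few of these properties can be probed directly. The following sketch (function names are mine) assumes a C99 or later compiler, since SIZE_MAX comes from <stdint.h>:

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t */
#include <stdint.h>  /* SIZE_MAX (C99) */

/* Returns 1 if size_t is unsigned: converting -1 to an unsigned type
   wraps around to its maximum value, which is greater than zero. */
int size_t_is_unsigned(void)
{
    return (size_t)-1 > 0;
}

/* Returns 1 if that wrapped value equals SIZE_MAX. */
int size_max_is_all_ones(void)
{
    return (size_t)-1 == SIZE_MAX;
}

/* Storage width of size_t in bits on this implementation. */
size_t size_t_bits(void)
{
    return sizeof(size_t) * CHAR_BIT;
}
```

The width reported by size_t_bits varies by platform, which is exactly the flexibility the typedef is meant to provide.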


Using size_t


The definition for size_t appears in several Standard C headers, namely, <stddef.h>, <stdio.h>, <stdlib.h>, <string.h>, <time.h>, and <wchar.h>. It also appears in the corresponding C++ headers, <cstddef>, <cstdio>, and so on. You should include at least one of these headers in your code before referring to size_t.




Including any of the C headers (in a program compiled as either C or C++) declares size_t as a global name. Including any of the C++ headers (something you can do only in C++) defines size_t as a member of namespace std.




By definition, size_t is the result type of the sizeof operator. Thus, the appropriate way to declare n to make the assignment:


n = sizeof(thing);



both portable and efficient is to declare n with type size_t. Similarly, the appropriate way to declare a function foo to make the call:


foo(sizeof(thing));



both portable and efficient is to declare foo's parameter with type size_t. Functions with parameters of type size_t often have local variables that count up to or down from that size and index into arrays, and size_t is often a good type for those variables.

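Here is a sketch of what such functions can look like. Both are hypothetical illustrations (copy_bytes is not how any real library implements memcpy); they show a size_t parameter with a size_t index counting up, and the idiom needed when an unsigned index counts down:

```c
#include <stddef.h>  /* size_t */

/* Hypothetical byte-wise copy: the index i counts up to n. */
void *copy_bytes(void *s1, void const *s2, size_t n)
{
    unsigned char *dst = s1;
    unsigned char const *src = s2;
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
    return s1;
}

/* Reverses n bytes in place: the index i counts down from n / 2.
   Testing "i-- > 0" avoids the trap of "i >= 0", which is always
   true for an unsigned type. */
void reverse_bytes(unsigned char *a, size_t n)
{
    for (size_t i = n / 2; i-- > 0; ) {
        unsigned char tmp = a[i];
        a[i] = a[n - 1 - i];
        a[n - 1 - i] = tmp;
    }
}
```

The countdown loop is the one place where the unsignedness of size_t needs care: the decrement happens inside the test, so the body never sees a wrapped-around index.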



Using size_t appropriately makes your source code a little more self-documenting. When you see an object declared as a size_t, you immediately know it represents a size in bytes or an index, rather than an error code or a general arithmetic value.




Expect to see me using size_t in other examples in upcoming columns.


Posted 2014-09-21 16:38 by Noble_