GPU编程和流式多处理器（三）

3. Floating-Point Support

快速的本机浮点硬件是GPU的存在理由，并且在许多方面，它们在浮点实现方面都等于或优于CPU。全速支持异常可以根据每条指令指定直接舍入，特殊功能单元可为六种流行的单精度先验函数，提供高性能的近似函数。相比之下，x86 CPU在微代码中实现异常，其运行速度可能比在规范化浮点算子上运行的速度慢100倍。舍入方向是由一个控制字指定的，该控制字需要数十个时钟周期来更改，并且SSE指令集中唯一的超越逼近函数是用于倒数和倒数平方根的，给出必须用牛顿-微分精确的12位近似值。使用Raphson迭代之前。

由于GPU的核心数量较多，但其较低的时钟频率会有所抵消，开发人员可以期望在公平竞争的环境下最多提高10倍（或大约10倍）的速度。如果一篇论文报告了从将优化的CPU实现移植到CUDA所实现的100倍或更高的速度，则可能是上述“指令集不匹配”之一。

3.1. Formats

图2描述了CUDA支持的三（3）个IEEE标准浮点格式：双精度（64位），单精度（32位）和半精度（16位）。值分为三个字段：正负号，指数和尾数。对于double，single和half，指数字段的大小分别为11位，8位和5位。相应的尾数字段为52、23和10位。

Figure 2 Floating-point formats.

指数字段更改浮点值的解释。最常见的（“正常”）表示形式将一个隐式1位编码为尾数，然后将该值乘以2个^e-bias，其中bias是在编码为浮点表示形式之前添加到实际指数的值。例如，单精度的偏差为127。

表7总结了如何对浮点值进行编码。对于大多数指数值（所谓的“正常”浮点值），尾数假定为隐式1，并将其乘以指数的偏置值。最大指数值保留用于无穷大和非数字值。除以零（或除以除法）会产生无穷大；执行无效运算（例如取平方根或负数的对数）会产生NaN。最小指数值保留给太小而无法用隐式前导1表示的值。随着所谓的正规数接近零，它们失去有效精度的位，这种现象称为逐渐下溢。表8给出了三种格式的某些极值的编码和值。

Table 7 Floating-Point Representations

Table 8 Floating-Point Extreme Values

Rounding舍入取整法

IEEE标准提供了四（4）个回合模式。

四舍五入到最近（也称为“四舍五入”）
向零舍入（也称为“截断”或“斩”）
向下舍入（或“朝负无穷大舍入”）
向上舍入（或“朝正无穷大舍入”）

舍入取整法是迄今为止最常用的舍入模式，在每次操作后将中间值舍入为最接近的可表示浮点值。向上舍入和向下舍入（“定向舍入模式”）用于间隔算术，其中一对浮点值用于括住计算的中间结果。为了正确地将结果括起来，该间隔的下限值和上限值必须分别四舍五入为负无穷大（“向下”）和正无穷大（“ up”）。

C语言不提供任何方式来按指令指定舍入模式，并且CUDA硬件不提供用于隐式指定舍入模式的控制字。因此，CUDA提供了一组内部函数来指定操作的舍入模式，如表9所示。

Table 9 Intrinsics for Rounding

Conversion

通常，开发人员可以使用标准C结构，在不同的浮点表示形式和/或整数之间进行转换：隐式转换或显式类型转换。但是，如有必要，开发人员可以使用表10中列出的内在函数，执行C语言规范中不包含的转换，例如，使用定向舍入的转换。

由于一半未在C编程语言中标准化，因此CUDA在__half2float（）和__float2half（）的接口中使用unsigned short。__float2half（）仅支持舍入到最近舍入模式。

float __half2float( unsigned short );

unsigned short __float2half( float );

3.2. Single Precision (32-Bit)

单精度浮点支持是GPU计算的主力军。GPU已经过优化，可以在此数据类型上原生提供高性能，¹¹不仅适用于核心标准IEEE操作（例如加法和乘法），还适用于非标准操作（例如对先验的近似（例如sin（）和log（）））。32位值与整数保存在同一寄存器文件中，因此单精度浮点值和32位整数（使用__float_as_int（）和__int_as_float（））之间的强制转换是免费的。

加法，乘法和乘加

编译器自动将浮点值的+，–和*运算符转换为加，乘和乘加指令。__fadd_rn（）和__fmul_rn（）内部函数，可以被用于加入抑制融合和乘法操作进入乘加指令。

对等与除法

对于具有2.x或更高版本的计算能力的设备，使用--prec-div = true编译代码时，除法运算符符合IEEE规定。对于具有计算能力1.x的设备或对于具有计算能力2.x的设备，当使用--prec-div = false编译代码时，除法运算符和__fdividef（x，y）具有相同的精度，

SM中的特殊功能单元（SFU）实现了六个常见超越功能的快速版本。

正弦和余弦
对数和指数
倒数和倒数平方根

表11（摘自有关Tesla架构的论文）摘录了受支持的操作和相应的精度。SFU并不能完全实现精度，可以很好地近似于这些功能，并且速度很快。对于比优化的CPU等效端口（例如25倍或更多）快得多的CUDA端口，代码很可能依赖SFU。

Table 11 SFU Accuracy

使用表12中给出的内在函数访问SFU。指定--fast-math编译器选项，将使编译器将常规C运行时runtime，调用替换为上面列出的相应SFU内部函数。

Table 12 SFU Intrinsics

Miscellaneous

__saturate(x) returns 0 if x<0, 1 if x>1, and x otherwise.

3.3. Double Precision (64-Bit)

带有SM 1.3的CUDA中添加了双精度浮点支持（最早在GeForce GTX 280中实现），而SM 2.0则提供了大大改进的双精度支持（功能和性能）。CUDA对双精度的硬件支持具有全速异常功能，并且从SM 2.x开始，符合IEEE 754 c的本机融合乘法加法指令（FMAD）。2008年，仅执行一个舍入步骤。FMAD除了是本质上有用的操作之外，还可以使某些功能（与Newton-Raphson迭代收敛）完全准确。

与单精度运算一样，编译器会自动将标准C运算符转换为乘法，加法和乘加指令。__dadd_rn（）和__dmul_rn（）内部函数可以被用于加入抑制融合和乘法操作进入乘加指令。

3.4. Half Precision (16-Bit)

对于5位指数和10位有效数，半值具有足够的精度以用于HDR（高动态范围）图像，并可用于保存其它不需要浮点精度的值，例如角度。半精度值仅用于存储，而不用于计算，硬件仅提供指令以转换为32位或从32位转换。这些指令以__halftofloat（）和__floattohalf（）内部函数公开。

float __halftofloat( unsigned short );

unsigned short __floattohalf( float );

这些内部函数使用unsigned short，因为C语言尚未标准化半浮点类型。

3.5. Case Study: float→half Conversion

研究float → half转换操作，了解浮点编码和舍入细节的有用方法。这是一个简单的一元运算，可以专注于编码和舍入，而不会被浮点运算的细节和中间表示的精度所分散。

当从float转换为half时，对于任何太大而无法表示的float的正确输出是half infinity。任何太小而不能代表一半（甚至是反常的一半）的浮点数都必须固定为0.0。舍入为0.0的一半的最大浮点数为0x32FFFFFF或2.98，而舍入为无穷大的一半的最小浮点数为65520.0。此范围内的float值可以通过传播符号位转换为一半，重新偏置指数（因为float的8位指数偏差为127，一半的5位指数偏差为15），然后将float的尾数四舍五入到最接近的一半尾数。在所有情况下，舍入都是简单的，除非输入值恰好落在两个可能的输出值之间。在这种情况下，IEEE标准指定四舍五入到“最近的偶数”值。在十进制算术中，这意味着四舍五入为1.5到2.0，但也四舍五入为2.5到2.0，以及（例如）四舍五入到0.5到0.0。

清单3显示了一个C例程，该例程完全复制了CUDA硬件实现的浮点到一半转换操作。变量exp和mag包含输入指数和“幅值”，尾数和指数以及被屏蔽的符号位。可以对幅度执行许多操作，例如比较和舍入操作，而无需分离指数和尾数。

清单3中使用的宏LG_MAKE_MASK创建具有给定位数的掩码：#define LG_MAKE_MASK（bits）（（1 << bits）-1）。联合用于治疗相同的32位值作为浮子和无符号整型; 诸如*（（float *）（＆u））之类的习惯用法不是可移植的。该例程首先传播输入符号位并将其屏蔽掉。

提取幅度和指数后，该函数处理输入浮点为INF或NaN的特殊情况，并尽早退出。注意，INF是带符号的，但NaN具有规范的无符号值。将输入浮点值钳位到与可表示的半值相对应的最小值或最大值，然后重新计算钳位值的大小。不要被构造f32MinRInfin和f32MaxRf16_zero的复杂代码所迷惑；它们是分别具有值0x477ff000和0x32ffffff的常量。

例程的其余部分处理输出正常和异常的情况（输入异常在前面的代码中被钳位，因此mag对应于正常float）。与钳位代码一样，f32Minf16Normal是一个常量，其值为0x38ffffff。

要构建法线，必须计算新的指数（第92和93行），正确舍入的10位尾数将移入输出。要构造非正规数，必须将隐式1与输出尾数进行“或”运算，然后将所得尾数偏移与输入指数对应的量。对于正向和非正向，输出尾数的舍入分两步完成。舍入是通过添加一个1的掩码来实现的，该掩码的末尾刚好等于输出的LSB，如图3所示。

Figure 3 Rounding mask (half).

如果输入的位12置1，则此操作将增加输出尾数；否则，将增加输出尾数。如果输入尾数全为1，则溢出会导致输出指数正确递增。如果向此调整的最高有效位再加1，将获得小学风格的四舍五入，其中平局决胜数更大。反之，即使设置了舍入到最接近值，如果设置了10位输出的LSB，则有条件地增加输出尾数（图8.4）。注意，这些步骤可以按任何顺序执行，也可以以许多不同的方式重新制定。

Figure 4 Round-to-nearest-even (half).

Listing 3. ConvertToHalf().

* exponent shift and mantissa bit count are the same.

* When we are shifting, we use [f16|f32]ExpShift

* When referencing the number of bits in the mantissa,

* we use [f16|f32]MantissaBits

const int f16ExpShift = 10;

const int f16MantissaBits = 10;

const int f16ExpBias = 15;

const int f16MinExp = -14;

const int f16MaxExp = 15;

const int f16SignMask = 0x8000;

const int f32ExpShift = 23;

const int f32MantissaBits = 23;

const int f32ExpBias = 127;

const int f32SignMask = 0x80000000;

unsigned short

ConvertFloatToHalf( float f )

{

* Use a volatile union to portably coerce

* 32-bit float into 32-bit integer

volatile union {

float f;

unsigned int u;

} uf;

uf.f = f;

// return value: start by propagating the sign bit.

unsigned short w = (uf.u >> 16) & f16SignMask;

// Extract input magnitude and exponent

unsigned int mag = uf.u & ~f32SignMask;

int exp = (int) (mag >> f32ExpShift) - f32ExpBias;

// Handle float32 Inf or NaN

if ( exp == f32ExpBias+1 ) { // INF or NaN

if ( mag & LG_MAKE_MASK(f32MantissaBits) )

return 0x7fff; // NaN

// INF - propagate sign

return w|0x7c00;

}

* clamp float32 values that are not representable by float16

{

// min float32 magnitude that rounds to float16 infinity

unsigned int f32MinRInfin = (f16MaxExp+f32ExpBias) <<

f32ExpShift;

f32MinRInfin |= LG_MAKE_MASK( f16MantissaBits+1 ) <<

(f32MantissaBits-f16MantissaBits-1);

if (mag > f32MinRInfin)

mag = f32MinRInfin;

}

{

// max float32 magnitude that rounds to float16 0.0

unsigned int f32MaxRf16_zero = f16MinExp+f32ExpBias

(f32MantissaBits-f16MantissaBits-1);

f32MaxRf16_zero <<= f32ExpShift;

f32MaxRf16_zero |= LG_MAKE_MASK( f32MantissaBits );

if (mag < f32MaxRf16_zero)

mag = f32MaxRf16_zero;

}

* compute exp again, in case mag was clamped above

exp = (mag >> f32ExpShift) - f32ExpBias;

// min float32 magnitude that converts to float16 normal

unsigned int f32Minf16Normal = ((f16MinExp+f32ExpBias)<<

f32ExpShift);

f32Minf16Normal |= LG_MAKE_MASK( f32MantissaBits );

if ( mag >= f32Minf16Normal ) {

// Case 1: float16 normal

// Modify exponent to be biased for float16, not float32

mag += (unsigned int) ((f16ExpBias-f32ExpBias)<<

f32ExpShift);

int RelativeShift = f32ExpShift-f16ExpShift;

// add rounding bias

mag += LG_MAKE_MASK(RelativeShift-1);

// round-to-nearest even

mag += (mag >> RelativeShift) & 1;

w |= mag >> RelativeShift;

}

else {

* Case 2: float16 denormal

// mask off exponent bits - now fraction only

mag &= LG_MAKE_MASK(f32MantissaBits);

// make implicit 1 explicit

mag |= (1<<f32ExpShift);

int RelativeShift = f32ExpShift-f16ExpShift+f16MinExp-exp;

// add rounding bias

mag += LG_MAKE_MASK(RelativeShift-1);

// round-to-nearest even

mag += (mag >> RelativeShift) & 1;

w |= mag >> RelativeShift;

}

return w;

}

实际上，开发人员应使用__floattohalf（）内在函数将float转换为一半，编译器会将其转换为单个F2F机器指令。提供此示例例程的目的纯粹，为了帮助理解浮点布局和舍入。同样，检查所有INF / NAN和非标准值的特殊情况代码，有助于说明IEEE规范的这些功能，自诞生以来就一直引起争议：使硬件变慢，成本更高，或者由于增加的硅片面积和工程设计而使硬件变慢验证工作。

清单3中的ConvertFloatToHalf（）例程被合并到名为float_to_float16.cu的程序中，该程序针对每个32位浮点值测试其输出。

3.6. Math Library

CUDA包括一个以C运行时库为模型的内置数学库，但有一些小区别：CUDA硬件不包括舍入模式寄存器（取而代之的是，舍入模式是按指令编码的），作为引用当前舍入模式的rint（），始终舍入为最接近值。此外，硬件不会引发浮点异常。异常运算的结果（例如，取负数的平方根）将编码为NaN。

表13列出了数学库函数以及每个函数的最大误差（单位为ulps）。在float上操作的大多数函数的函数名称后均带有“ f”，例如，计算正弦函数的函数如下。

double sin( double angle );

float sinf( float angle );

Table 13 Math Library

			ULP ERROR
FUNCTION	OPERATION	EXPRESSION	32	64
x+y	Addition	x+y	01	0
x/y	Multiplication	x*y	01	0
x/y	Division	x/y	22	0
1/x	Reciprocal	1/x	12	0
acos[f](x)	Inverse cosine	cos–1 x	3	2
acosh[f](x)	Inverse hyperbolic cosine		4	2
asin[f](x)	Inverse sine	sin–1x	4	2
asinh[f](x)	Inverse hyperbolic sine		3	2
atan[f](x)	Inverse tangent	tan–1x	2	2
atan2[f](y,x)	Inverse tangent of y/x		3	2
atanh[f](x)	Inverse hyperbolic tangent	tanh–1	3	2
cbrt[f](x)	Cube root		1	1
ceil[f](x)	“Ceiling,” nearest integer greater than or equal to x	⌊x⌋	0
copysign[f](x,y)	Sign of y, magnitude of x		n/a
cos[f](x)	Cosine	cos x	2	1
cosh[f](x)	Hyperbolic cosine		2
cospi[f](x)	Cosine, scaled byΠ	cosΠx	2
erf[f](x)	Error function		3	2
erfc[f](x)	Complementary error function		6	4
erfcinv[f](y)	Inverse complementary error function	Return x for which y=1-erff(x)	7	8
erfcx[f](x)	Scaled error function	ex2[erff(x))	6	3
erfinv[f](y)	Inverse error function	Return x for which y=erff(x)	3	5
exp[f](x)	Natural exponent	ex	2	1
exp10[f](x)	Exponent (base 10)	10x	2	1
exp2[f](x)	Exponent (base 2)	2x	2	1
expm1[f](x)	Natural exponent, minus one	ex–1	1	1
fabs[f](x)	Absolute value	\|x\|	0	0
fdim[f](x,y)	Positive difference	x–1,x>y+0,x≤NAN,X or y NAN	0	0
floor[f](x)	“Floor,” nearest integer less than or equal to x	⌊x⌋	0	0
fma[f](x,y,z)	Multiply-add	xy + z	0	0
fmax[f](x,y)	Maximum	x, x y or isNaN(y)y, otherwise	0	0
fmin[f](x,y)	Minimum	x, x y or isNaN(y)y, otherwise	0	0
fmod[f](x,y)	Floating-point remainder		0	0
frexp[f](x,exp)	Fractional component		0	0
hypot[f](x,y)	Length of hypotenuse		3	2
ilogb[f](x)	Get exponent		0	0
isfinite(x)	Nonzero if x is not ±INF			n/a
isinf(x)	Nonzero if x is ±INF			n/a
isnan(x)	Nonzero if x is a NaN			n/a
j0[f](x)	Bessel function of the first kind (n=0)	J0(x)	93	73
j1[f](x)	Bessel function of the first kind (n=1)	J1(x)	93	73
jn[f](n,x)	Bessel function of the first kind	Jn(x)		*
ldexp[f](x,exp)	Scale by power of 2	x2exp	0	0
lgamma[f](x)	Logarithm of gamma function	ln(Γ(x))	64	44
llrint[f](x)	Round to long long		0	0
llround[f](x)	Round to long long		0	0
lrint[f](x)	Round to long		0	*
lround[f](x)	Round to long		0	0
log[f](x)	Natural logarithm	ln(x)	1	1
log10[f](x)	Logarithm (base 10)	log10 x	3	1
log1p[f](x)	Natural logarithm of x+1	ln(x + 1)	2	1
log2[f](x)	Logarithm (base 2)	log2 x	3	1
logb[f](x)	Get exponent		0	0
modff(x,iptr)	Split fractional and integer parts		0	0
nan[f](cptr)	Returns NaN	NaN		n/a
nearbyint[f](x)	Round to integer		0	0
nextafter[f](x,y)	Returns the FP value closest to x in the direction of y			n/a
normcdf[f](x)	Normal cumulative distribution		6	5
normcdinv[f](x)	Inverse normal cumulative distribution		5	8
pow[f](x,y)	Power function	xy	8	2
rcbrt[f](x)	Inverse cube root		2	1
remainder[f](x,y)	Remainder		0	0
remquo[f](x,y,iptr)	Remainder (also returns quotient)		0	0
rsqrt[f](x)	rsqrt[f](x) Reciprocal		2	1
rint[f](x)	Round to nearest int		0	0
round[f](x)	Round to nearest int		0	0
scalbln[f](x,n)	Scale x by 2n (n is long int)	x2n	0	0
scalbn[f](x,n)	Scale x by 2n (n is int)	x2n	0	0
signbit(x)	Nonzero if x is negative		n/a	0
sin[f](x)	Sine	sin x	2	1
sincos[f](x,s,c)	Sine and cosine	s=sin(x);c=cos(x);	2	1
sincospi[f](x,s,c)	Sine and cosine	s=sin(Πx);c=cos(Πx);	2	1
sinh[f](x)	Hyperbolic sine		3	1
Sine, scaled by Π	sin Πx		2	1
sqrt[f](x)	Square root		36	0
tan[f](x)	Tangent	tan x	4	2
tanh[f](x)	Hyperbolic tangent		2	1
tgamma[f](x)	True gamma function	γ(x)	11	8
trunc[f](x)	Truncate (round to integer toward zero)		0	0
y0[f](x)	Bessel function of the second kind (n=0)	Y0(x)	93	73
y1[f](x)	Bessel function of the second kind (n=1)	Y1(x)	93	73
yn[f](n,x)	Bessel function of the second kind	Yn(x)		..