[ICPC Japan 2018] Floating-Point Numbers: bit manipulation + careful handling of details

 

 Floating-Point Numbers 

In this problem, we consider floating-point number formats, data representation formats to approximate real numbers on computers.
Scientific notation is a method to express a number, frequently used for numbers too large or too small to be written tersely in usual decimal form. In scientific notation, all numbers are written in the form m × 10^e. Here, m (called significand) is a number greater than or equal to 1 and less than 10, and e (called exponent) is an integer. For example, a number 13.5 is equal to 1.35 × 10^1, so we can express it in scientific notation with significand 1.35 and exponent 1.
As binary number representation is convenient on computers, let's consider binary scientific notation with base two, instead of ten. In binary scientific notation, all numbers are written in the form m × 2^e. Since the base is two, m is limited to be less than 2. For example, 13.5 is equal to 1.6875 × 2^3, so we can express it in binary scientific notation with significand 1.6875 and exponent 3. The significand 1.6875 is equal to 1 + 1/2 + 1/8 + 1/16, which is (1.1011)_2 in binary notation. Similarly, the exponent 3 can be expressed as (11)_2 in binary notation.
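To make the binary scientific notation concrete, here is a minimal sketch that splits a number into its significand and exponent. The helper name `binary_scientific` is made up for illustration, and the sketch assumes the input is at least 1, as in this problem.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Split x (assumed >= 1) into {m, e} with x == m * 2^e and 1 <= m < 2.
// binary_scientific is a hypothetical helper name, not part of the problem.
std::pair<double, int> binary_scientific(double x) {
    assert(x >= 1.0);
    int e = 0;
    double half = std::frexp(x, &e);  // x == half * 2^e with 0.5 <= half < 1
    return {half * 2.0, e - 1};       // rescale so the significand lies in [1, 2)
}
```

For 13.5 this gives the significand 1.6875 and the exponent 3, matching the example in the statement.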
A floating-point number expresses a number in binary scientific notation in finite number of bits. Although the accuracy of the significand and the range of the exponent are limited by the number of bits, we can express numbers in a wide range with reasonably high accuracy.
In this problem, we consider a 64-bit floating-point number format, simplified from one actually used widely, in which only those numbers greater than or equal to 1 can be expressed. Here, the first 12 bits are used for the exponent and the remaining 52 bits for the significand. Let's denote the 64 bits of a floating-point number by b_64...b_1. With e an unsigned binary integer (b_64...b_53)_2, and with m a binary fraction represented by the remaining 52 bits plus one (1.b_52...b_1)_2, the floating-point number represents the number m × 2^e.
The bit string of the representation of 13.5 in the format described above consists of the exponent bits 000000000011, followed by the significand bits 1011 and then 48 zeros.
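The 64-bit layout can be checked with a short sketch that encodes an ordinary `double` into this simplified format. The function name `encode_simplified` is an assumption for illustration. It relies on the fact that a finite `double` of at least 1 also has a 52-bit fraction and an exponent that fits in 12 bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <string>

// Encode x (assumed finite and >= 1) in the problem's format:
// 12 exponent bits followed by the 52 fraction bits of the significand.
// encode_simplified is a hypothetical helper name.
std::string encode_simplified(double x) {
    assert(std::isfinite(x) && x >= 1.0);
    int e = 0;
    double half = std::frexp(x, &e);  // x == half * 2^e with 0.5 <= half < 1
    e -= 1;                           // exponent for a significand in [1, 2)
    // half * 2^53 is an integer in [2^52, 2^53): the 53-bit significand.
    std::uint64_t sig = static_cast<std::uint64_t>(std::ldexp(half, 53));
    std::string bits;
    for (int i = 11; i >= 0; --i) bits += ((e >> i) & 1) ? '1' : '0';
    // Bit 52 is the implicit leading 1 and is not stored.
    for (int i = 51; i >= 0; --i) bits += ((sig >> i) & 1) ? '1' : '0';
    return bits;
}
```

Calling `encode_simplified(13.5)` yields 000000000011, then 1011, then 48 zeros.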
In floating-point addition operations, the results have to be approximated by numbers representable in floating-point format. Here, we assume that the approximation is by truncation. When the sum of two floating-point numbers a and b is expressed in binary scientific notation as a + b = m × 2^e (1 ≤ m < 2, 0 ≤ e < 2^12), the result of addition operation on them will be a floating-point number with its first 12 bits representing e as an unsigned integer and the remaining 52 bits representing the first 52 bits of the binary fraction of m.
A disadvantage of this approximation method is that the approximation error accumulates easily. To verify this, let's make an experiment of adding a floating-point number many times, as in the pseudocode shown below. Here, s and a are floating-point numbers, and the results of individual addition are approximated as described above.
s := a
for n times {
    s := s + a
}
For a given floating-point number a and a number of repetitions n, compute the bits of the floating-point number s when the above pseudocode finishes.
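The pseudocode can be simulated directly with integers, which is handy for checking a fast solution on small n. It is far too slow for n up to 10^18. This is a sketch under my own conventions: `simulate_naive` is a hypothetical name, and the significand is held as a 53-bit integer (the value scaled by 2^52, leading 1 included).

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Direct simulation of the pseudocode, feasible only for small n.
// a53 is the 53-bit significand of a (leading 1 included); a has exponent 0.
// Returns {e, m}, where m is the 53-bit significand of s.
// simulate_naive is a hypothetical helper name.
std::pair<int, std::uint64_t> simulate_naive(std::uint64_t a53, std::uint64_t n) {
    const std::uint64_t LIMIT = 1ULL << 53;
    assert(a53 >= (1ULL << 52) && a53 < LIMIT);
    int e = 0;
    std::uint64_t m = a53;  // s := a
    for (std::uint64_t i = 0; i < n; ++i) {
        // Align a to the exponent of s; the bits shifted out are truncated.
        std::uint64_t aligned = (e < 53) ? (a53 >> e) : 0;
        m += aligned;
        if (m >= LIMIT) {   // significand reached 2: renormalize
            m >>= 1;        // dropping the low bit is the truncation
            ++e;
        }
    }
    return {e, m};
}
```

Truncating the aligned a before adding gives the same result as truncating the exact sum, because m is already an integer at the current scale.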

 

Input

The input consists of at most 1000 datasets, each in the following format.

n
b_52...b_1
n is the number of repetitions. (1 ≤ n ≤ 10^18) For each i, b_i is either 0 or 1. As for the floating-point number a in the pseudocode, the exponent is 0 and the significand is b_52...b_1.

The end of the input is indicated by a line containing a zero.
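Reading the 52 characters into an integer might look like the sketch below. The name `parse_significand` is my own, and the result includes the implicit leading 1, so it always lies in [2^52, 2^53).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Turn the 52 input characters b_52...b_1 into the 53-bit integer significand,
// i.e. the value (1.b_52...b_1)_2 scaled by 2^52.
// parse_significand is a hypothetical helper name.
std::uint64_t parse_significand(const std::string& bits) {
    assert(bits.size() == 52);
    std::uint64_t value = 1;  // implicit leading 1
    for (char c : bits) {
        assert(c == '0' || c == '1');
        value = (value << 1) | static_cast<std::uint64_t>(c - '0');
    }
    return value;
}
```

Since n can be as large as 10^18, it needs a 64-bit type such as `long long` when read.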

 

Output

For each dataset, the 64 bits of the floating-point number s after finishing the pseudocode should be output as a sequence of 64 digits, each being 0 or 1 in one line.
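Producing the 64 output digits from an exponent and a 53-bit significand can be sketched as follows. `format_bits` is a hypothetical name, and it assumes the significand still carries its leading 1 in bit 52, which is not printed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build the 64-character answer: 12 exponent bits, then the 52 fraction bits.
// m is the 53-bit significand with its leading 1 in bit 52 (not printed).
// format_bits is a hypothetical helper name.
std::string format_bits(unsigned e, std::uint64_t m) {
    assert(e < (1u << 12));
    assert((m >> 52) == 1);
    std::string out;
    out.reserve(64);
    for (int i = 11; i >= 0; --i) out += ((e >> i) & 1u) ? '1' : '0';
    for (int i = 51; i >= 0; --i) out += ((m >> i) & 1ULL) ? '1' : '0';
    return out;
}
```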
 

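Overall, this is a bit-manipulation problem. I am weak in this area, so I decided to study an expert's code carefully until I understood it.

The rough idea is to store the input in a long long and then simulate the double-style addition.

During the simulation one problem comes up: the precision runs out, which is exactly why a double loses accuracy. So what do we do when the precision is not enough?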

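As an analogy, suppose a problem asks for 3 decimal places. If the exact result is 0.1234, the output is 0.123.

In the same way, when the precision runs out, we discard the last bit.

In other words, when we find during the computation that the precision is insufficient, we can simply not compute the last bit at all. This is the core idea for handling the loss of precision.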
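In integer terms, the analogy above comes down to integer division for decimal digits and a right shift for binary digits. A tiny sketch, with both function names made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Drop the lowest `digits` decimal digits of x (truncation, no rounding).
// truncate_decimal is a hypothetical helper name.
std::uint64_t truncate_decimal(std::uint64_t x, int digits) {
    assert(digits >= 0);
    while (digits-- > 0) x /= 10;
    return x;
}

// Drop the lowest `bits` binary digits of x; a right shift truncates the same way.
// truncate_binary is a hypothetical helper name.
std::uint64_t truncate_binary(std::uint64_t x, int bits) {
    assert(bits >= 0 && bits < 64);
    return x >> bits;
}
```

Here 1234 with one digit dropped becomes 123, just as 0.1234 becomes 0.123, and the binary value 10111 with one bit dropped becomes 1011.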

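The rest of the explanation is in the code comments ("overflow", 爆掉 in the original comments, means exceeding the precision range).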

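#include <bits/stdc++.h>
using namespace std;
typedef long long ll;
string s;
int main()
{
    ll n;
    while (scanf("%lld", &n) == 1 && n)       // stop at EOF or at the terminating 0
    {
        cin >> s;
        s = "1" + s;                          // prepend the implicit leading 1
        ll a = 0;                             // s as a 53-bit integer
        for (int i = 52; i >= 0; i--)         // convert the bit string to an integer
            if (s[i] == '1') a += (1ll << (52 - i));
        ll e = 0, m = a;                      // the sum is m x 2^e, with m scaled by 2^52
        while (n)
        {
            ll t = ((1ll << 53) - m) / a;     // how many additions until m overflows 53 bits
            if (((1ll << 53) - m) % a) t++;   // ceiling: division floors, so add 1 unless it divides exactly
            if (n < t)                        // no overflow: add everything at once
            {
                m += a * n;
                break;
            }
            else                              // overflow: drop the part that no longer fits
            {
                m += a * t;
                e++;                          // one truncation = one overflow = one carry into the exponent
                m >>= 1, a >>= 1;             // right shift truncates the lowest bit
                n -= t;                       // these t additions are done
                if (!a) break;                // a has shifted down to 0, so later additions change nothing
            }
        }
        for (int i = 11; i >= 0; i--) {       // print e as 12 binary digits
            if (e & (1ll << i)) printf("1");  // is bit i of e set?
            else printf("0");
        }
        for (int i = 51; i >= 0; i--) {       // print the 52 fraction bits of m
            if (m & (1ll << i)) printf("1");
            else printf("0");
        }
        puts("");
    }
    return 0;
}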
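The step that makes the solution fast is counting how many additions fit before the significand reaches 2^53, so that a whole run of additions is applied at once. A sketch of just that ceiling division, using a hypothetical helper name `steps_until_carry`:

```cpp
#include <cassert>
#include <cstdint>

// Number of additions of a needed before the significand m reaches 2^53,
// i.e. the smallest t with m + a * t >= 2^53.
// steps_until_carry is a hypothetical helper name.
std::uint64_t steps_until_carry(std::uint64_t m, std::uint64_t a) {
    const std::uint64_t LIMIT = 1ULL << 53;
    assert(a > 0 && m < LIMIT);
    std::uint64_t gap = LIMIT - m;
    return (gap + a - 1) / a;  // ceiling of gap / a
}
```

Each carry halves the aligned a, so it reaches 0 after at most 53 carries. The outer loop therefore runs only a few dozen times, however large n is.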

 

 

posted @ 2018-08-23 21:21  阿枔