Compressed Integer In .NET/CLI Metadata
URL: http://www.cnblogs.com/AndersLiu/archive/2010/02/09/en-compressed-integer-in-metadata.html
Author: Anders Liu
Abstract: Compressed integers are widely used in .NET/CLI PE files. The algorithm stores a 32-bit integer in 1, 2, or 4 bytes depending on its value, which reduces the size of a PE file, especially when the values are small. This document describes the compression algorithm and gives a reference implementation.
Bibliographies
- ECMA-335: Common Language Infrastructure (CLI) 4th Edition, June 2006.
- Expert .NET 2.0 IL Assembler, Serge Lidin, Apress, 2006.
Introduction
In short, the compression algorithm stores a 32-bit integer (which normally takes 4 bytes) in as few bytes as possible (1, 2, or 4).
The algorithm is widely used in .NET/CLI PE files, for example in metadata signatures, the #Blob stream, and the #US stream. There, integers store record counts or the sizes of data blocks. Because such counts and sizes are usually very small, storing them as 32-bit integers would leave most bytes set to 0 and waste space. Compressed integers reduce the disk space a PE file takes and the network bandwidth needed to transfer it.
Some places where compressed integers appear in a PE file are listed below:
- At the beginning of each record in the Blob heap (the storage format of the #Blob and #US streams), a compressed unsigned integer stores the size of the record data.
- In a method signature, a compressed unsigned integer stores the number of parameters.
- In metadata signatures, the lower bound of each array dimension is stored as a compressed signed integer.
Note that all compression and decompression algorithms described here apply to 32-bit integers. Unless stated otherwise, all integers are written in big-endian order (the most significant byte appears on the left or on top).
Compression and Decompression for Unsigned Integer
Compression Algorithm for Unsigned Integer
Compressing an unsigned integer is simple: split the range of unsigned integers into 3 sub-ranges, then store the value in 1, 2, or 4 bytes depending on which sub-range it falls in. Table 1 lists the ranges and the format of the compressed value.
Range | Bytes Used | Mask | Binary Format |
---|---|---|---|
[00000000h, 0000007Fh] | 1 | 80h | 0BBBBBBB |
[00000080h, 00003FFFh] | 2 | C0h | 10BBBBBB BBBBBBBB |
[00004000h, 1FFFFFFFh] | 4 | E0h | 110BBBBB BBBBBBBB BBBBBBBB BBBBBBBB |
In Table 1,
- Range lists the minimum and maximum value (both inclusive) of the range.
- Bytes Used lists how many bytes the compressed value takes.
- Mask lists the mask to apply to the first byte of the compressed value to identify its length:
  - If the compressed value takes 1 byte, the first byte & (bitwise and) 80h is 00h;
  - If the compressed value takes 2 bytes, the first byte & C0h is 80h;
  - If the compressed value takes 4 bytes, the first byte & E0h is C0h.
- Binary Format lists the bit layout of the compressed value, where 1 and 0 are fixed bits and B is a significant bit.
From Table 1 we can see that only unsigned integers in [0h, 1FFFFFFFh] can be compressed; values larger than 1FFFFFFFh are not supported.
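The range check in Table 1 can be sketched on its own. The names `CompressedSizeSketch` and `UnsignedSize` are illustrative and do not come from any library, and returning 0 for an unsupported value is an arbitrary convention chosen for this sketch.

```csharp
using System;

public static class CompressedSizeSketch
{
    // Returns the number of bytes Table 1 assigns to a value,
    // or 0 when the value does not fit in 29 bits (sketch convention).
    public static int UnsignedSize(uint value)
    {
        if (value <= 0x7F) return 1;        // 7 significant bits
        if (value <= 0x3FFF) return 2;      // 14 significant bits
        if (value <= 0x1FFFFFFF) return 4;  // 29 significant bits
        return 0;                           // larger than 1FFFFFFFh: not supported
    }
}
```

For example, `UnsignedSize(0x7F)` is 1, `UnsignedSize(0x80)` is 2, `UnsignedSize(0x4000)` is 4, and `UnsignedSize(0x20000000)` is 0.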
Code 1 shows a reference implementation of unsigned integer compression.
Code 1 – Reference implementation of unsigned integer compression

    public static byte[] CompressUInt(uint data)
    {
        if (data <= 0x7F)
        {
            // 1 byte: 0BBBBBBB
            var bytes = new byte[1];
            bytes[0] = (byte)data;
            return bytes;
        }
        else if (data <= 0x3FFF)
        {
            // 2 bytes: 10BBBBBB BBBBBBBB
            var bytes = new byte[2];
            bytes[0] = (byte)(((data & 0xFF00) >> 8) | 0x80);
            bytes[1] = (byte)(data & 0x00FF);
            return bytes;
        }
        else if (data <= 0x1FFFFFFF)
        {
            // 4 bytes: 110BBBBB BBBBBBBB BBBBBBBB BBBBBBBB
            var bytes = new byte[4];
            bytes[0] = (byte)(((data & 0xFF000000) >> 24) | 0xC0);
            bytes[1] = (byte)((data & 0x00FF0000) >> 16);
            bytes[2] = (byte)((data & 0x0000FF00) >> 8);
            bytes[3] = (byte)(data & 0x000000FF);
            return bytes;
        }
        else
            throw new NotSupportedException();
    }

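Another way to read the format is to OR the length tag into the whole value and then write the result most significant byte first. The sketch below takes that view and is not the reference implementation; `CompressedUIntSketch`, `Encode`, and `EncodeHex` are names invented for illustration.

```csharp
using System;

public static class CompressedUIntSketch
{
    // Encodes a value by OR-ing the length tag into the whole word and
    // writing the tagged word in big-endian order.
    public static byte[] Encode(uint value)
    {
        if (value <= 0x7F)
            return new[] { (byte)value };               // tag bit 0 is already clear

        if (value <= 0x3FFF)
        {
            uint tagged = value | 0x8000;               // tag bits 10 in the first byte
            return new[] { (byte)(tagged >> 8), (byte)(tagged & 0xFF) };
        }

        if (value <= 0x1FFFFFFF)
        {
            uint tagged = value | 0xC0000000;           // tag bits 110 in the first byte
            return new[]
            {
                (byte)(tagged >> 24),
                (byte)((tagged >> 16) & 0xFF),
                (byte)((tagged >> 8) & 0xFF),
                (byte)(tagged & 0xFF)
            };
        }

        throw new ArgumentOutOfRangeException(nameof(value));
    }

    // Formats the encoded bytes as an upper-case hex string without separators.
    public static string EncodeHex(uint value)
    {
        return BitConverter.ToString(Encode(value)).Replace("-", "");
    }
}
```

Worked by hand from Table 1, `EncodeHex(0x03)` gives `03`, `EncodeHex(0x80)` gives `8080`, `EncodeHex(0x2E57)` gives `AE57`, `EncodeHex(0x4000)` gives `C0004000`, and `EncodeHex(0x1FFFFFFF)` gives `DFFFFFFF`.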
Decompression Algorithm for Unsigned Integer
Decompressing an unsigned integer is as simple as compressing it:
- If the first byte has the form 0bbbbbbb (its bitwise and with 80h is 00h), the compressed value occupies 1 byte (b0), and the original value is b0.
- If the first byte has the form 10bbbbbb (its bitwise and with C0h is 80h), the compressed value occupies 2 bytes (b0, b1 in order), and the original value is (b0 & 0x3F) << 8 | b1.
- If the first byte has the form 110bbbbb (its bitwise and with E0h is C0h), the compressed value occupies 4 bytes (b0, b1, b2, b3 in order), and the original value is (b0 & 0x1F) << 24 | b1 << 16 | b2 << 8 | b3.
Code 2 gives a reference implementation of unsigned integer decompression.
Code 2 – Reference implementation of unsigned integer decompression

    public static uint DecompressUInt(byte[] data)
    {
        if (data == null)
            throw new ArgumentNullException("data");
        if (data.Length == 0)
            throw new ArgumentException("The buffer is empty.", "data");

        if ((data[0] & 0x80) == 0 && data.Length == 1)
        {
            // 1 byte: 0bbbbbbb
            return (uint)data[0];
        }
        else if ((data[0] & 0xC0) == 0x80 && data.Length == 2)
        {
            // 2 bytes: 10bbbbbb bbbbbbbb
            return (uint)((data[0] & 0x3F) << 8 | data[1]);
        }
        else if ((data[0] & 0xE0) == 0xC0 && data.Length == 4)
        {
            // 4 bytes: 110bbbbb bbbbbbbb bbbbbbbb bbbbbbbb
            return (uint)((data[0] & 0x1F) << 24 | data[1] << 16 | data[2] << 8 | data[3]);
        }
        else
            throw new NotSupportedException();
    }

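In a metadata reader the compressed value usually sits inside a larger buffer, so its length is not known in advance and has to be taken from the first byte. The sketch below reads from an offset and advances it, then uses that to read a length-prefixed record such as a Blob heap entry. All names (`CompressedUIntReaderSketch`, `Read`, `ReadBlob`) and the choice of `FormatException` are assumptions made for this example.

```csharp
using System;

public static class CompressedUIntReaderSketch
{
    // Reads one compressed unsigned integer starting at 'offset' and
    // advances 'offset' past it. The size is derived from the first byte.
    public static uint Read(byte[] buffer, ref int offset)
    {
        if (buffer == null) throw new ArgumentNullException(nameof(buffer));
        if (offset < 0 || offset >= buffer.Length)
            throw new ArgumentOutOfRangeException(nameof(offset));

        byte b0 = buffer[offset];
        int size;
        if ((b0 & 0x80) == 0x00) size = 1;
        else if ((b0 & 0xC0) == 0x80) size = 2;
        else if ((b0 & 0xE0) == 0xC0) size = 4;
        else throw new FormatException("Invalid first byte.");

        if (offset + size > buffer.Length)
            throw new FormatException("Truncated compressed integer.");

        uint value;
        switch (size)
        {
            case 1:
                value = b0;
                break;
            case 2:
                value = (uint)((b0 & 0x3F) << 8 | buffer[offset + 1]);
                break;
            default:
                value = (uint)((b0 & 0x1F) << 24 | buffer[offset + 1] << 16
                               | buffer[offset + 2] << 8 | buffer[offset + 3]);
                break;
        }

        offset += size;
        return value;
    }

    // Reads a record whose data is preceded by its compressed length.
    public static byte[] ReadBlob(byte[] heap, ref int offset)
    {
        int length = (int)Read(heap, ref offset);
        if (offset + length > heap.Length)
            throw new FormatException("Truncated blob.");

        var blob = new byte[length];
        Array.Copy(heap, offset, blob, 0, length);
        offset += length;
        return blob;
    }
}
```

With the buffer `{ 0x00, 0x03, 0x41, 0x42, 0x43 }` and a starting offset of 1, `ReadBlob` returns the three bytes `41 42 43` and leaves the offset at 5.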
Compression and Decompression for Signed Integer
Compression Algorithm for Signed Integer
Compressing a signed integer is slightly more complex than compressing an unsigned one, because the sign bit has to be handled. In short, after determining how many bytes the compressed value will take, shift the significant bits of the integer left by 1 bit, place the sign bit in the least significant bit (0 for non-negative, 1 for negative), and then set the tag bits of the first byte in the same way as for a compressed unsigned integer.
To determine how many bytes the compressed signed integer needs, first take the 'semi-absolute value' of the original integer: a non-negative value is used as is, while a negative value is replaced by its bitwise complement (not its arithmetic negation). Then shift the 'semi-absolute value' left by 1 bit and look up the result in Table 1 to find the number of bytes.
Alternatively, skip the left shift and look up the 'semi-absolute value' directly in Table 2.
Range | Bytes Used | Significant Bit Mask |
---|---|---|
[00000000h, 0000003Fh] | 1 | 0000003Fh |
[00000040h, 00001FFFh] | 2 | 00001FFFh |
[00002000h, 0FFFFFFFh] | 4 | 0FFFFFFFh |
In Table 2,
- Range lists the minimum and maximum 'semi-absolute value' (both inclusive) of each range.
- Bytes Used lists the number of bytes that the compressed value will take.
- Significant Bit Mask lists the mask to & with the original integer value to extract its significant bits. The bits to the left of the mask carry no information and can be dropped: they are all 0 for a non-negative value and all 1 for a negative value.
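The size lookup in Table 2 can be sketched as follows. `SignedSizeSketch`, `SemiAbsolute`, and `SignedSize` are illustrative names, and returning 0 for an out-of-range value is a convention of this sketch only.

```csharp
using System;

public static class SignedSizeSketch
{
    // Non-negative values are used unchanged; negative values are replaced
    // by their bitwise complement, so -1 maps to 0 and -64 maps to 63.
    public static uint SemiAbsolute(int value)
    {
        return value >= 0 ? (uint)value : (uint)~value;
    }

    // Looks up Table 2; returns 0 when the value cannot be compressed.
    public static int SignedSize(int value)
    {
        uint u = SemiAbsolute(value);
        if (u <= 0x3F) return 1;
        if (u <= 0x1FFF) return 2;
        if (u <= 0x0FFFFFFF) return 4;
        return 0;
    }
}
```

Because the complement is used instead of the negation, the ranges are asymmetric: `SignedSize(63)` and `SignedSize(-64)` are both 1, while `SignedSize(64)` and `SignedSize(-65)` are both 2.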
After extracting the significant bits by and-ing with the corresponding mask, shift them left by 1 bit. Then, if the original integer is negative, set the least significant bit (the sign bit) to 1.
Finally, set the tag bits in the first byte of the compressed value, using the same rule as for a compressed unsigned integer.
The signed integers that can be compressed are [0h, 0FFFFFFFh] ([0, 268435455]) for non-negative values and [F0000000h, FFFFFFFFh] ([-268435456, -1]) for negative values. Integers outside these ranges are not supported.
Code 3 gives a reference implementation of signed integer compression.
Code 3 – Reference implementation of signed integer compression

    public static byte[] CompressInt(int data)
    {
        // 'Semi-absolute value': used only to choose the size.
        var u = data >= 0 ? (uint)data : ~(uint)data;

        if (u <= 0x3F)
        {
            // 1 byte: 6 significant bits plus the sign bit
            var uv = ((uint)data & 0x0000003F) << 1;
            if (data < 0)
                uv |= 0x01;
            var bytes = new byte[1];
            bytes[0] = (byte)uv;
            return bytes;
        }
        else if (u <= 0x1FFF)
        {
            // 2 bytes: 13 significant bits plus the sign bit
            var uv = ((uint)data & 0x00001FFF) << 1;
            if (data < 0)
                uv |= 0x01;
            var bytes = new byte[2];
            bytes[0] = (byte)(((uv & 0xFF00) >> 8) | 0x80);
            bytes[1] = (byte)(uv & 0x00FF);
            return bytes;
        }
        else if (u <= 0x0FFFFFFF)
        {
            // 4 bytes: 28 significant bits plus the sign bit
            var uv = ((uint)data & 0x0FFFFFFF) << 1;
            if (data < 0)
                uv |= 0x01;
            var bytes = new byte[4];
            bytes[0] = (byte)(((uv & 0xFF000000) >> 24) | 0xC0);
            bytes[1] = (byte)((uv & 0x00FF0000) >> 16);
            bytes[2] = (byte)((uv & 0x0000FF00) >> 8);
            bytes[3] = (byte)(uv & 0x000000FF);
            return bytes;
        }
        else
            throw new NotSupportedException();
    }

Note that the 'semi-absolute value' is used only to determine how many bytes the compressed value takes. Once the size is known, the original integer value, treated as unsigned, is what gets compressed.
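The steps above are easiest to check against a few boundary values. The sketch below is a compact variant of the same steps, not the reference implementation: it ORs the tag into the whole word before splitting it into bytes. `CompressedIntSketch`, `Encode`, and `EncodeHex` are hypothetical names.

```csharp
using System;

public static class CompressedIntSketch
{
    public static byte[] Encode(int value)
    {
        uint u = value >= 0 ? (uint)value : (uint)~value;  // semi-absolute value, size only
        uint sign = value < 0 ? 1u : 0u;
        uint bits = unchecked((uint)value);                // two's complement bit pattern

        if (u <= 0x3F)
        {
            uint v = ((bits & 0x3F) << 1) | sign;
            return new[] { (byte)v };
        }

        if (u <= 0x1FFF)
        {
            uint v = ((bits & 0x1FFF) << 1) | sign | 0x8000;
            return new[] { (byte)(v >> 8), (byte)(v & 0xFF) };
        }

        if (u <= 0x0FFFFFFF)
        {
            uint v = ((bits & 0x0FFFFFFF) << 1) | sign | 0xC0000000;
            return new[]
            {
                (byte)(v >> 24),
                (byte)((v >> 16) & 0xFF),
                (byte)((v >> 8) & 0xFF),
                (byte)(v & 0xFF)
            };
        }

        throw new ArgumentOutOfRangeException(nameof(value));
    }

    // Formats the encoded bytes as an upper-case hex string without separators.
    public static string EncodeHex(int value)
    {
        return BitConverter.ToString(Encode(value)).Replace("-", "");
    }
}
```

Worked by hand, 3 encodes to `06`, -3 to `7B`, 64 to `8080`, -64 to `01`, 8192 to `C0004000`, -8192 to `8001`, 268435455 to `DFFFFFFE`, and -268435456 to `C0000001`. Note that -64 still fits in one byte while 64 needs two.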
Decompression Algorithm for Signed Integer
Since compressed signed and unsigned integers share the same binary format, signed decompression can be built on unsigned decompression.
First, decompress the value as unsigned to get a 32-bit unsigned integer. Then read the sign of the original integer from the least significant bit (the sign bit).
If the sign bit is 0, the original integer is non-negative: shift the decompressed value right by 1 bit and convert it to a signed integer to get the original value.
If the sign bit is 1, the original integer is negative: shift the decompressed value right by 1 bit and restore the leading 1 bits that were dropped during compression:
- If the compressed value takes 1 byte, perform | (bitwise or) operation with FFFFFFC0h;
- If the compressed value takes 2 bytes, perform | operation with FFFFE000h;
- If the compressed value takes 4 bytes, perform | operation with F0000000h.
Finally, convert the result to a signed integer to get the original negative value.
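The sign-restoring step can be isolated in a small sketch that takes the already-decompressed unsigned value and the size of the compressed form. `SignExtendSketch` and `ToSigned` are names made up for this illustration.

```csharp
using System;

public static class SignExtendSketch
{
    // 'raw' is the value obtained by decompressing as unsigned;
    // 'size' is the number of bytes the compressed form occupied (1, 2, or 4).
    public static int ToSigned(uint raw, int size)
    {
        uint shifted = raw >> 1;            // drop the sign bit
        if ((raw & 1) == 0)
            return (int)shifted;            // non-negative: nothing to restore

        uint fill;
        switch (size)
        {
            case 1: fill = 0xFFFFFFC0; break;   // restore bits 6..31
            case 2: fill = 0xFFFFE000; break;   // restore bits 13..31
            case 4: fill = 0xF0000000; break;   // restore bits 28..31
            default: throw new ArgumentOutOfRangeException(nameof(size));
        }
        return unchecked((int)(shifted | fill));
    }
}
```

For instance, `ToSigned(0x7B, 1)` is -3, `ToSigned(0x01, 1)` is -64, `ToSigned(0x0001, 2)` is -8192, and `ToSigned(0x06, 1)` is 3. The same raw value 1 therefore decodes to different numbers depending on the size, which is why the size has to be known.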
Code 4 gives a reference implementation of signed integer decompression.
Code 4 – Reference implementation of signed integer decompression

    public static int DecompressInt(byte[] data)
    {
        var u = DecompressUInt(data);

        // Sign bit 0: non-negative, just drop the sign bit.
        if ((u & 0x00000001) == 0)
            return (int)(u >> 1);

        // Sign bit 1: negative, restore the leading 1 bits for this size.
        var nb = GetCompressedIntSize(data[0]);
        uint sm;
        switch (nb)
        {
            case 1: sm = 0xFFFFFFC0; break;
            case 2: sm = 0xFFFFE000; break;
            case 4: sm = 0xF0000000; break;
            default: throw new NotSupportedException();
        }
        return (int)((u >> 1) | sm);
    }

This code calls a utility method, GetCompressedIntSize, which determines how many bytes the compressed value takes by inspecting its first byte. The method is very simple; see Code 5.
Code 5 – Getting the size of the compressed value from its first byte

    public static uint GetCompressedIntSize(byte firstByte)
    {
        if ((firstByte & 0x80) == 0)
            return 1;
        else if ((firstByte & 0xC0) == 0x80)
            return 2;
        else if ((firstByte & 0xE0) == 0xC0)
            return 4;
        else
            throw new NotSupportedException();
    }

Implementation Issues
Compressed signed integers are rarely used in .NET/CLI metadata. As far as I know, they appear only as array lower bounds in metadata signatures (which means that negative array lower bounds are natively supported by .NET/CLI). Almost all CLI implementations have problems, to a greater or lesser degree, when handling compressed signed integers, and none of the references in the bibliography describes the signed compression algorithm clearly enough. Fortunately, most high-level programming languages don't support arrays with negative lower bounds, and the CLS requires all array lower bounds to be 0, so these problems have no serious implications for real projects.
The following sections list the problems found in the CLI implementations I have examined, followed by the issues in the references.
ILASM/ILDASM
Evidently, Microsoft itself was not clear about the compression algorithm for signed integers. ILASM is the only compiler I have used that accepts negative array lower bounds, and it is the compiler I used most while researching this question. ILASM handles positive array lower bounds correctly. For negative lower bounds, however, it produces an incorrect compressed value when the lower bound is between -8192 (inclusive) and -8129 (inclusive).
In addition, ILASM uses a decompression algorithm for signed integers that differs from the one described in this article and cannot cover all theoretically supported integers ([-268435456, 268435455]): when the lower bound is less than or equal to -268427265, the value is also incorrect.
ILDASM cannot be tested precisely because of the problems in ILASM. However, when decompressing the incorrect values generated by ILASM, ILDASM and the reference implementation in this article produce the same result, so I am inclined to consider ILDASM's decompression algorithm correct. The incorrect values do, however, make ILDASM crash at random.
These problems appear in versions 2.0, 3.0, and 3.5 of ILASM and are all resolved in version 4.0 Beta. The ILASM shipped with the .NET Framework SDK 4.0 Beta accepts every supported signed value as an array lower bound and generates the correct compressed value, and ILDASM decompresses it correctly.
Mono Cecil
After reading the source code of Mono Cecil, I found that it follows the ECMA-335 standard faithfully. However, ECMA-335 makes a mistake in its description of array lower bounds (see the Revision of Bibliographies section below): it treats the lower bound as an unsigned integer rather than a signed one.
As a result, Mono Cecil provides compression and decompression only for unsigned integers (see the 'Mono.Cecil.Metadata.Utilities.WriteCompressedInteger(BinaryWriter, Int32) : Int32' method and the 'Mono.Cecil.Metadata.Utilities.ReadCompressedInteger(Byte[], Int32, Int32&) : Int32' method in Mono.Cecil.dll). When writing and reading array lower bounds, it also treats them as unsigned integers (see the 'Mono.Cecil.Signatures.SignatureWriter.Write(SigType) : Void' method and the 'Mono.Cecil.Signatures.SignatureReader.ReadType(Byte[], Int32, Int32&) : SigType' method in the same library).
When you reflect over an assembly with Mono Cecil, a positive array lower bound is reported as twice its real value (because the right shift is missing), and a negative lower bound is reported as a completely wrong value.
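The effect of reading a signed lower bound as unsigned can be shown with a single byte. This sketch does not reproduce any Mono Cecil code; it only contrasts the two interpretations of the same 1-byte value, and the names `LowerBoundMisreadSketch`, `ReadAsSigned`, and `ReadAsUnsigned` are invented for the example.

```csharp
using System;

public static class LowerBoundMisreadSketch
{
    // Interprets a 1-byte compressed value (high bit clear) as signed.
    public static int ReadAsSigned(byte b)
    {
        int shifted = b >> 1;                                   // drop the sign bit
        if ((b & 1) == 0)
            return shifted;
        return unchecked((int)((uint)shifted | 0xFFFFFFC0));    // restore leading 1 bits
    }

    // Interprets the same byte as a 1-byte compressed unsigned value.
    public static int ReadAsUnsigned(byte b)
    {
        return b;
    }
}
```

A lower bound of 3 is stored as `06`: read as signed it is 3, read as unsigned it is 6. A lower bound of -1 is stored as `7F`: read as signed it is -1, read as unsigned it is 127.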
I examined only version 0.6 of Mono Cecil and cannot speak for other versions; you can check them yourself.
CCI Metadata
CCI Metadata does treat the array lower bound as a signed integer, but it uses an oversimplified algorithm: shift the absolute value of the original integer left by 1 bit, place the sign bit in the least significant bit (see the 'Microsoft.Cci.BinaryWriter.WriteCompressedInt(Int32) : Void' method in Microsoft.Cci.PeWriter.dll), and compress the result as an unsigned integer. Decompression is the reverse: decompress the value as an unsigned integer, read the sign from the least significant bit, shift the decompressed value right by 1 bit, then convert it to a signed integer and apply the sign (see the 'Microsoft.Cci.UtilityDataStructures.MemoryReader.ReadCompressedInt32() : Int32' method in Microsoft.Cci.PeReader.dll).
CCI Metadata uses the same algorithm as Expert .NET 2.0 IL Assembler, which is also problematic (see the Revision of Bibliographies section below).
Version 2.0.49.23471 of CCI Metadata was examined.
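The difference between the absolute-value scheme described above and the algorithm in this article shows up only for negative values. The sketch below is written from the prose description, not from CCI source code, and is limited to the 1-byte range for brevity; `AbsoluteValueSchemeSketch`, `AbsoluteValueForm`, and `TwosComplementForm` are hypothetical names.

```csharp
using System;

public static class AbsoluteValueSchemeSketch
{
    // Scheme described above: shift the absolute value left by 1 bit and put
    // the sign in bit 0. The result would then be compressed as unsigned.
    // (Math.Abs throws for int.MinValue, which is out of range anyway.)
    public static uint AbsoluteValueForm(int value)
    {
        uint magnitude = (uint)Math.Abs(value);
        return (magnitude << 1) | (value < 0 ? 1u : 0u);
    }

    // Algorithm from this article, restricted to the 1-byte range [-64, 63]:
    // keep the low 6 bits of the two's complement pattern, then add the sign.
    public static uint TwosComplementForm(int value)
    {
        if (value < -64 || value > 63)
            throw new ArgumentOutOfRangeException(nameof(value));
        uint bits = unchecked((uint)value);
        return ((bits & 0x3F) << 1) | (value < 0 ? 1u : 0u);
    }
}
```

For 3 both forms give 06h. For -1 the absolute-value form gives 03h while the article's algorithm gives 7Fh, and for -64 the absolute-value form gives 81h (which no longer fits in one byte) while the article's algorithm gives 01h.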
Implementations Not Researched
Some other implementations are not covered in this article, such as:
- System.Reflection/System.Reflection.Emit
- Shared Source CLI (Rotor)
Revision of Bibliographies
Expert .NET 2.0 IL Assembler
This book describes the compression algorithm in Chapter 8, in the paragraph after Table 8-4 (the first paragraph on page 150). The description is incorrect; for the correct description, see the Compression Algorithm for Signed Integer section of this article.
ECMA-335: Common Language Infrastructure (CLI) 4th Edition
The ECMA-335 standard doesn't distinguish between compressed unsigned integers and compressed signed integers; both are simply called compressed integers.
Section 23.2, Blobs and signatures, of ECMA-335 Partition II: Metadata Definition and Semantics defines the compression algorithm for compressed integers (page 153). It is in fact the algorithm for compressed unsigned integers, and it is correct when applied to unsigned integers.
Section 23.2.13, ArrayShape, of ECMA-335 Partition II: Metadata Definition and Semantics defines the array shape used in metadata signatures (page 161). Both the Size and LoBound elements are described there as compressed integers, which is incorrect because LoBound is signed.
The suggested revision is to introduce the term compressed unsigned integer for every compressed integer other than LoBound in ArrayShape, to introduce the term compressed signed integer for LoBound in ArrayShape, and to add a description of the signed integer compression algorithm along the lines of the Compression Algorithm for Signed Integer section of this article.
(End)