.NET Assembly Structure

2010-09-06 15:48  Johnny Qian

Every once in a while, a non-.NET developer (VB6, C++ etc.) asks me to explain "how .NET works, how GC works, why boxing is bad" and so on. I usually try to find a link and save myself some time, but for some of these subjects I can't find an appropriate one (they are either too broad, or too short and only partially answer the question). So, to avoid repeating the same whiteboard session in the future, I decided to write a couple of blog posts explaining .NET foundations. That, plus I got bored of all these architecture posts :)

Today's post tries to answer the first half of the "how .NET works" question by explaining the .NET assembly structure only. The next post will cover the second half by explaining the .NET assembly execution model.

General level explanation

The .NET Framework gives developers the freedom to choose the language they want to use for .NET programming (C#, VB, C++/CLI ...). It even allows multiple languages in the same project, with code written in different languages cooperating seamlessly. This is possible because the .NET Framework operates only on intermediate language (IL) code. IL code is created at compile time by the language compilers, which translate high-level (C#, VB ...) concepts into a combination of language-agnostic IL code and its metadata. That IL code, together with its metadata and headers, makes up a unit called a managed module.

One or more managed modules and zero or more resource files are linked by the language compiler or the assembly linker into a managed assembly, which we see as a .NET DLL or EXE file. Every assembly also contains an embedded manifest, which describes the structure of the assembly, its type and member definitions, its references to members of external assemblies, etc.

image

This diagram and the general-level explanation are roughly sufficient for an L100 explanation, but my personal preference is to always complement the big-picture explanation with some concrete implementation details, so I'll do that here too by explaining the structure of a managed module in more detail. I'll try to minimize talking and maximize illustrations and pictures, so that the post is shorter and more reader friendly while still having some weight.

C# code file

In this post I use a very simple example: a single-class console application that writes two lines to the console.

Something like this:

namespace CSharp_ILCode
{
    class Program
    {
        static void Main(string[] args)
        {
            System.Console.WriteLine("Hello world!");
            Hello2();
        }

        static void Hello2()
        {
            System.Console.WriteLine("Hello world 2x!");
        }
    }
}

As we can see in the diagram above, at compile time that code is "translated" to IL code with the appropriate metadata definitions; together they become one managed module and, through it, a .NET assembly.

Managed module

Any managed module, regardless of the language its code was written in, consists of the following four big parts:

  1. PE32 header
  2. CLR header
  3. Metadata
  4. IL code

PE header

Every managed module contains the standard Windows PE executable header, just like unmanaged (native) applications do. The only difference is that for managed code the bulk of the PE header information is simply ignored, while for native code the PE header describes the native CPU code.

To get some information about the PE header, execute the following command in the Visual Studio command prompt:

dumpbin /all assembly_name > result.txt

That command creates a result.txt file which contains, among much other information, the following PE header specific information:

PE Header

We can see in this image that the PE header contains information about:

  • what type of module it is,
  • the time stamp of the module's creation,
  • which CPU architecture the module targets (PE32 runs on 32-bit or 64-bit Windows, PE32+ on 64-bit Windows only),
  • the entry point, which represents the memory address of the _CorExeMain() function (more about this in the assembly execution part of the post).
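
The same fields can be read straight from the file bytes without dumpbin. Below is a minimal Python sketch (Python is used only as a convenient tool for poking at bytes, it is not part of the original example). The function name read_pe_summary and the dictionary keys are my own, hypothetical names; the offsets follow the documented PE/COFF layout.

```python
import struct
from datetime import datetime, timezone

MACHINE_NAMES = {0x014C: "x86", 0x8664: "x64", 0x0200: "Itanium"}
MAGIC_NAMES = {0x010B: "PE32", 0x020B: "PE32+"}
IMAGE_FILE_DLL = 0x2000  # Characteristics bit that marks a DLL


def read_pe_summary(data: bytes) -> dict:
    """Return a few PE header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # The DWORD at offset 0x3C holds the file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF file header follows the signature.
    (machine, _sections, timestamp, _symtab, _nsyms,
     _opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    # The optional header follows the COFF header.
    optional = pe_offset + 24
    (magic,) = struct.unpack_from("<H", data, optional)
    (entry_rva,) = struct.unpack_from("<I", data, optional + 16)
    return {
        "module_type": "DLL" if characteristics & IMAGE_FILE_DLL else "EXE",
        "timestamp": datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat(),
        "machine": MACHINE_NAMES.get(machine, hex(machine)),
        "format": MAGIC_NAMES.get(magic, hex(magic)),
        "entry_point_rva": entry_rva,
    }
```

Assuming the example was compiled to CSharp_ILCode.exe, calling read_pe_summary(open("CSharp_ILCode.exe", "rb").read()) should report the same module type, time stamp, format and entry point that dumpbin prints.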

I'll use this opportunity (while we are at the PE header part) to answer one of the common .NET questions I hear:

"How can we recognize from the PE header information whether a module is a managed module?"

If we scroll down through the optional header values in the dumpbin result text file, we see that the number of directories is 10h (16). The entry specific to managed code is the "COM Descriptor Directory" (the 15th entry, index 14): it is empty in native images, while in a managed module it is the entry in this "table of contents" that describes where to find the CLR header, and through it the metadata and the IL, during execution.
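
The managed-module check can be sketched in a few lines of Python. This is only an illustration: the helper names clr_header_directory and is_managed are hypothetical, and the code assumes the standard PE optional header layout, where the directory count sits at offset 92 (PE32) or 108 (PE32+) and the COM descriptor is directory index 14.

```python
import struct

COM_DESCRIPTOR_INDEX = 14  # data directory slot that points at the CLR header


def clr_header_directory(data: bytes):
    """Return (rva, size) of the COM descriptor directory, or None if it is empty."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    optional = pe_offset + 24
    (magic,) = struct.unpack_from("<H", data, optional)
    if magic == 0x010B:      # PE32
        count_offset = optional + 92
    elif magic == 0x020B:    # PE32+
        count_offset = optional + 108
    else:
        raise ValueError("unknown optional header magic")
    (count,) = struct.unpack_from("<I", data, count_offset)
    if count <= COM_DESCRIPTOR_INDEX:
        return None
    # Each directory entry is an (RVA, size) pair of two DWORDs.
    entry_offset = count_offset + 4 + 8 * COM_DESCRIPTOR_INDEX
    rva, size = struct.unpack_from("<II", data, entry_offset)
    return (rva, size) if rva and size else None


def is_managed(data: bytes) -> bool:
    """A module is treated as managed when its COM descriptor entry is filled in."""
    return clr_header_directory(data) is not None
```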

image

CLR header

If we scroll the result.txt file to the CLR header section, we see the following:

image

Here we can see:

  • the targeted version of the CLR for this module is 2.05 (.NET 2 SP1),
  • the module consists only of managed code,
  • the managed module's entry point, the Main method, has the metadata token value 06000001.
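
For the curious, the first fields of the CLR header can be decoded as follows. This is a rough Python sketch under stated assumptions: the function name parse_clr_header is hypothetical, and it expects the raw bytes of the header itself (locating those bytes in the file requires translating the directory RVA through the section table, which is left out here).

```python
import struct

COMIMAGE_FLAGS_ILONLY = 0x00000001  # set when the module contains only IL


def parse_clr_header(header: bytes) -> dict:
    """Decode the leading fields of the CLR (COR20) header."""
    (size, major, minor, metadata_rva, metadata_size,
     flags, entry_token) = struct.unpack_from("<IHHIIII", header, 0)
    return {
        "header_size": size,
        "runtime_version": f"{major}.{minor:02d}",
        "metadata_rva": metadata_rva,
        "metadata_size": metadata_size,
        "il_only": bool(flags & COMIMAGE_FLAGS_ILONLY),
        "entry_point_token": f"{entry_token:08X}",
    }
```

For the example assembly this should give the same three facts listed above: the runtime version, the IL-only flag and the entry point token.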

Metadata

While we still have the dumpbin result.txt open, let's take a quick look at something very cool: how to recognize where in the module the metadata segment begins.

If we scroll to the raw data #1 section, we see something like this:

image

The start of the metadata definition block is marked with the 4 bytes 42 53 4A 42 (BSJB), which are the initials of the developers who implemented the metadata part of the framework in .NET 1.0. I spent 2 hours trying to find their names but with no success... Looks like either no one knows who they are or no one wants to name them.
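
Finding that marker programmatically is a one-liner. The Python sketch below is a naive illustration (find_metadata_root is a hypothetical name): it simply searches the whole file for the BSJB bytes, which could in principle also match unrelated data, and then reads the version string that follows the signature in the metadata root.

```python
import struct


def find_metadata_root(data: bytes):
    """Locate the BSJB signature and read the metadata root that follows it."""
    offset = data.find(b"BSJB")
    if offset < 0:
        return None
    # After the 4-byte signature: major (2), minor (2), reserved (4), length (4).
    major, minor, _reserved, length = struct.unpack_from("<HHII", data, offset + 4)
    # The version string is null padded up to `length` bytes.
    raw_version = data[offset + 16:offset + 16 + length]
    version = raw_version.split(b"\0", 1)[0].decode("utf-8")
    return {
        "offset": offset,
        "format_version": (major, minor),
        "runtime_version": version,
    }
```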

With that said, we can close the dumpbin result file, because for investigating metadata and IL code we will use the ILDasm.exe tool and its output.

To use the tool, we again open the Visual Studio command prompt, navigate to the folder where the resulting assembly is and execute the following command:

ildasm CSharp_ILCode.exe

Once that is executed, we see the ILDasm application window, which shows the embedded manifest as the first item of the assembly.

image

To see what the manifest contains, I double-clicked it. The resulting window contains the definition of assembly-level data and the data required for binding to external assemblies.

In our example, the definition of the mscorlib reference looks like this:

image

We already saw in the CLR header part that it holds the metadata token of the assembly entry point, which had the value 06000001.

Knowing that tokens starting with 06 are MethodDef tokens leads us to examine the MethodDef-related metadata. So, with the ILDasm window in focus, I pressed <Ctrl>+<M> and easily found the MethodDef with the given token value, which points to the Main method (as we already saw in the C# code).

image

Summarized:

  • the CLR header defines the entry point token value,
  • that token value is then used in the metadata to look up the appropriate MethodDef entry,
  • that entry describes the part of the IL code which is executed as the entry point.
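
The token itself is easy to take apart: the top byte selects the metadata table and the lower three bytes are the row number in that table. A small Python sketch (decode_token is a hypothetical helper, and the table list is deliberately partial):

```python
# A few well-known token types; the top byte of a token selects one of them.
TABLE_NAMES = {
    0x01: "TypeRef",
    0x02: "TypeDef",
    0x04: "Field",
    0x06: "MethodDef",
    0x0A: "MemberRef",
    0x23: "AssemblyRef",
    0x70: "String (user string heap)",
}


def decode_token(token: int):
    """Split a 32-bit metadata token into (table name, row number)."""
    table = token >> 24
    row = token & 0x00FFFFFF
    return TABLE_NAMES.get(table, f"table 0x{table:02X}"), row
```

For the entry point token from the CLR header, decode_token(0x06000001) returns ("MethodDef", 1), that is, the first row of the MethodDef table.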

The metadata binary block consists of several tables, which can be categorized into definition tables, reference tables and manifest tables, but a deeper explanation is beyond the scope of this post. (In case you are interested in more details on metadata, check out the MSDN metadata start page.)

IL Code

I then expanded the ILDasm tree and double clicked the Main method entry.

image

As we can see, the static method Main is marked as .entrypoint.

IL is stack based, which means that operand values are pushed onto the evaluation stack and results are popped off the stack, without manipulating registers.

Therefore, at L_0000 the code pushes the "Hello world!" string onto the evaluation stack, to be used by the call at L_0005.
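
To make the push/pop idea concrete, here is a toy Python interpreter for just the handful of instructions used in this example. It is a teaching sketch, not how the CLR executes IL: the instruction tuples, the METHODS table and the function names are all made up for illustration, and Console.WriteLine is simulated by appending to a list.

```python
WRITE_LINE = "[mscorlib]System.Console::WriteLine"

# The two methods of the example, written as (opcode, operand) tuples.
METHODS = {
    "CSharp_ILCode.Program::Main": [
        ("ldstr", "Hello world!"),
        ("call", WRITE_LINE),
        ("call", "CSharp_ILCode.Program::Hello2"),
        ("ret",),
    ],
    "CSharp_ILCode.Program::Hello2": [
        ("ldstr", "Hello world 2x!"),
        ("call", WRITE_LINE),
        ("ret",),
    ],
}


def execute(method_name, methods, console):
    """Run one method; every invocation gets its own evaluation stack."""
    stack = []
    for instruction in methods[method_name]:
        opcode, *operands = instruction
        if opcode == "nop":
            continue
        if opcode == "ldstr":
            stack.append(operands[0])        # push the string operand
        elif opcode == "call":
            target = operands[0]
            if target == WRITE_LINE:
                console.append(stack.pop())  # pop the argument and "print" it
            else:
                execute(target, methods, console)
        elif opcode == "ret":
            return
        else:
            raise ValueError(f"unsupported opcode: {opcode}")


def run_main():
    console = []
    execute("CSharp_ILCode.Program::Main", METHODS, console)
    return console
```

Calling run_main() returns ["Hello world!", "Hello world 2x!"], the two lines the C# program writes.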

IL is namespace ignorant, which means that C# namespaces become just a prefix of the "full" type name in IL. In IL code every member is referenced in full-name format, like "Namespace.Type::MemberName".

That's why we have in:

  • L_0005  System.Console::WriteLine(string) (System namespace, Console type, WriteLine member)
  • L_000a CSharp_ILCode.Program::Hello2()    (CSharp_ILCode namespace, Program type, Hello2 member)

As highlighted in the IL code image, the full type name at L_0005 has one additional prefix, [mscorlib], because the Console type is defined in an external assembly (in this case the core .NET assembly, mscorlib.dll).
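
The full-name format, as ILDasm prints it with "::" between type and member and an optional [assembly] prefix, can be taken apart mechanically. The Python sketch below is an illustration only: split_member_reference is a hypothetical helper and the pattern covers just the simple shapes seen in this example (no generics, nested types or return types).

```python
import re

MEMBER_REFERENCE = re.compile(
    r"^(?:\[(?P<assembly>[^\]]+)\])?"   # optional [assembly] prefix
    r"(?P<type>[\w.]+)"                 # Namespace.Type
    r"::(?P<member>\w+)"                # ::Member
    r"\((?P<params>[^)]*)\)$"           # (parameter list)
)


def split_member_reference(text: str) -> dict:
    """Split '[asm]Namespace.Type::Member(params)' into its parts."""
    match = MEMBER_REFERENCE.match(text)
    if match is None:
        raise ValueError(f"unrecognized member reference: {text}")
    # Everything before the last dot is the namespace prefix.
    namespace, _, type_name = match.group("type").rpartition(".")
    return {
        "assembly": match.group("assembly"),
        "namespace": namespace,
        "type": type_name,
        "member": match.group("member"),
        "parameters": match.group("params"),
    }
```

A reference without the [assembly] prefix, such as CSharp_ILCode.Program::Hello2(), comes back with "assembly" set to None, which mirrors the fact that the member lives in the current assembly.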

With that said, one question is inevitable:

"How does .NET know where to find that [mscorlib]?"

The answer is very easy and was already shown in the part describing the manifest content, where we saw, at the beginning of the manifest data, the mscorlib reference with its public key token, which is used to locate that assembly in the GAC.
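
As a side note on where such a token comes from: a public key token is the last 8 bytes of the SHA-1 hash of the assembly's public key, written in reverse order. A minimal Python sketch, where public_key_token is a hypothetical helper and the input is assumed to be the raw public key blob:

```python
import hashlib


def public_key_token(public_key: bytes) -> str:
    """Return the 8-byte public key token as a hex string."""
    digest = hashlib.sha1(public_key).digest()
    # Take the last 8 bytes of the hash and reverse their order.
    return digest[-8:][::-1].hex()
```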

While we are still at IL code window, let's answer one more question:

"How does .NET debugging work?"

The IL code shown in the previous screenshot was built optimized, in release mode (I've blogged about compiler optimizations in more detail here), and we all know that we cannot debug code built in release mode. To see why we need debug builds for debugging, let's take a quick look at the IL code for the same C# code built in debug mode:

image

As we can see, the compiler inserted one nop instruction before each line. When we set a breakpoint on a line in the Visual Studio IDE, the breakpoint is in fact set on the nop instruction before that line. In release mode the compiler does not emit these nop instructions, and therefore there is nothing to set the breakpoint on.
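
The effect of the extra nop instructions on the instruction offsets can be reproduced with a short Python sketch. The release sequence matches the labels quoted earlier (L_0000, L_0005, L_000a); the debug sequence is my assumption of what the screenshot shows, since the image itself is not reproduced here. The names SIZES and with_offsets are hypothetical.

```python
# Encoded size in bytes: one opcode byte, plus a 4-byte metadata token
# operand for ldstr and call.
SIZES = {"nop": 1, "ldstr": 5, "call": 5, "ret": 1}

RELEASE_MAIN = ["ldstr", "call", "call", "ret"]
# Assumed debug layout: a nop at method entry and after each call.
DEBUG_MAIN = ["nop", "ldstr", "call", "nop", "call", "nop", "ret"]


def with_offsets(opcodes):
    """Pair every opcode with the label of its byte offset in the method body."""
    offset = 0
    listing = []
    for opcode in opcodes:
        listing.append((f"L_{offset:04x}", opcode))
        offset += SIZES[opcode]
    return listing
```

with_offsets(RELEASE_MAIN) yields L_0000 ldstr, L_0005 call, L_000a call, L_000f ret, while with_offsets(DEBUG_MAIN) shifts every real instruction by the nop bytes in front of it.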

From : http://blog.vuscode.com/malovicn/archive/2007/12/24/net-foundations-net-assembly-structure.aspx