The 8 phases of translating C source code

Source

https://en.cppreference.com/w/c/language/translation_phases

Phase 1: Character mapping

1) The individual bytes of the source code file (which is generally a text file in some multibyte encoding such as UTF-8) are mapped, in implementation defined manner, to the characters of the source character set. In particular, OS-dependent end-of-line indicators are replaced by newline characters.

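In short: the byte stream of the source file is mapped to the source character set. In particular, OS-dependent end-of-line indicators are all mapped to the newline character of the source character set.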

The source character set is a multibyte character set which includes the basic source character set as a single-byte subset, consisting of the following 96 characters:

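The source character set is a multibyte character set that contains the basic source character set as a single-byte subset. The basic source character set has 96 single-byte characters.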

a) 5 whitespace characters (space, horizontal tab, vertical tab, form feed, new-line)

b) 10 digit characters from '0' to '9'

c) 52 letters from 'a' to 'z' and from 'A' to 'Z'

d) 29 punctuation characters: _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

In total: 5 whitespace characters + 10 digits + 52 letters + 29 punctuation characters = 96 characters.

2) Trigraph sequences are replaced by corresponding single-character representations.

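That is, each trigraph sequence (for example ??= for #) is converted to its corresponding single character. Trigraphs were removed in C23.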
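The character counts given for the basic source character set can be checked with a small sketch like the one below. The array names and the helper basic_source_charset_size are made up for this illustration; they are not part of any standard API.

```c
#include <string.h>

/* The four groups of the basic source character set, as listed in phase 1. */
static const char WHITESPACE[]  = " \t\v\f\n";
static const char DIGITS[]      = "0123456789";
static const char LETTERS[]     = "abcdefghijklmnopqrstuvwxyz"
                                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char PUNCTUATION[] = "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

/* Hypothetical helper: total number of characters across the four groups. */
size_t basic_source_charset_size(void)
{
    return strlen(WHITESPACE) + strlen(DIGITS)
         + strlen(LETTERS) + strlen(PUNCTUATION);
}
```

Calling basic_source_charset_size() from a test program should give 5 + 10 + 52 + 29 = 96.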

Phase 2: Line-end handling (line splicing)

1) Whenever backslash appears at the end of a line (immediately followed by the newline character), both backslash and newline are deleted, combining two physical source lines into one logical source line. This is a single-pass operation: a line ending in two backslashes followed by an empty line does not combine three lines into one.

In short: every backslash + newline pair is deleted, so two physical source lines are merged into one logical line.

2) If a non-empty source file does not end with a newline character after this step (whether it had no newline originally, or it ended with a backslash), the behavior is undefined.

In short: after the step above, if a non-empty source file does not end with a newline character, the behavior is undefined.
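The line splicing described in this phase can be sketched as follows. The macro and function names are invented for this example. The second function relies on the fact that splicing happens before tokenization, so it can join the two halves of an identifier; this is legal but not something to do in real code.

```c
/* A macro body continued over several physical lines: each
   backslash + newline pair is deleted in phase 2, leaving one logical line. */
#define SUM_OF_THREE(a, b, c) \
    ((a) +                    \
     (b) +                    \
     (c))

int spliced_macro(void)
{
    return SUM_OF_THREE(1, 2, 3);
}

/* Splicing runs before tokenization (phase 3), so it can even join
   the two halves of an identifier. */
int spliced_identifier(void)
{
    int long_na\
me = 42;
    return long_name;
}
```

Note that the backslash must be the very last character on the line; whitespace between the backslash and the newline prevents splicing in a strictly conforming implementation.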

Phase 3: Tokenization (characters are grouped into preprocessing tokens)

1) The source file is decomposed into comments, sequences of whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), and preprocessing tokens, which are the following:

In short: the character stream output by phase 2 is split into comments, whitespace, and preprocessing tokens.

a) header names: <stdio.h> or "myfile.h"

b) identifiers

c) preprocessing numbers, which cover integer constants and floating constants, but also cover some invalid tokens such as 1..E+3.foo or 0JBK

d) character constants and string literals

e) operators and punctuators, such as +, <<=, <%, or ##.

f) individual non-whitespace characters that do not fit in any other category

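Preprocessing tokens fall into 6 categories:

header names, e.g. <stdio.h> or "myfile.h"

identifiers

preprocessing numbers, e.g. 1..e+3

character constants and string literals

operators and punctuators

any other individual non-whitespace characters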

2) Each comment is replaced by one space character

Each comment is replaced by a space character, so that only whitespace and preprocessing tokens remain.

3) Newlines are kept, and it's implementation-defined whether non-newline whitespace sequences may be collapsed into single space characters.

Newline characters are kept. The implementation decides whether other runs of non-newline whitespace are collapsed into a single space character.
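The following sketch illustrates two of the points above, using the stringification operator # to make the preprocessing tokens visible as text. STR and the function names are hypothetical helpers for this example. It shows that spellings such as 1..E+3.foo, which are not valid C constants, still pass through phases 3 and 4 as preprocessing tokens, and that a comment between two tokens behaves like a single space.

```c
#include <string.h>

/* STR turns its argument's preprocessing tokens into a string literal. */
#define STR(x) #x

/* These spellings are accepted as preprocessing numbers even though
   they could never become valid constants in phase 7. */
int invalid_numbers_survive_preprocessing(void)
{
    return strcmp(STR(1..E+3.foo), "1..E+3.foo") == 0
        && strcmp(STR(0JBK), "0JBK") == 0;
}

/* The comment is replaced by one space in phase 3, so the two
   identifiers stay separate tokens with whitespace between them. */
int comment_becomes_one_space(void)
{
    return strcmp(STR(a/* gone */b), "a b") == 0;
}
```

Both functions are expected to return 1. Using 0JBK directly as an expression, rather than inside STR, would be rejected when the preprocessing token is converted to a token in phase 7.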

If the input has been parsed into preprocessing tokens up to a given character, the next preprocessing token is generally taken to be the longest sequence of characters that could constitute a preprocessing token, even if that would cause subsequent analysis to fail. This is commonly known as maximal munch.

In short: when characters are grouped into preprocessing tokens, the rule is "longest match" (maximal munch).

int foo = 1;

int bar = 0xE+foo; // error: invalid preprocessing number 0xE+foo

int baz = 0xE + foo; // OK

int pub = bar+++baz; // OK: bar++ + baz

int ham = bar++-++baz; // OK: bar++ - ++baz

int qux = bar+++++baz; // error: bar++ ++ +baz, not bar++ + ++baz.
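The valid cases above can be turned into a small runnable sketch to confirm how the operators are grouped. The function names are made up for this illustration.

```c
/* a+++b is tokenized as a ++ + b, so a is post-incremented
   and its old value is added to b. */
int munch_plus(int *a_after)
{
    int a = 1, b = 10;
    int r = a+++b;      /* a ++ + b : 1 + 10 */
    *a_after = a;       /* a is now 2 */
    return r;
}

/* a---b is tokenized as a -- - b. */
int munch_minus(void)
{
    int a = 5, b = 3;
    return a---b;       /* a -- - b : 5 - 3 */
}
```

If the tokenizer instead read a+++b as a + ++b, b would be modified and a left unchanged, which is not what happens.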

The sole exception to the maximal munch rule is:

Header name preprocessing tokens are only formed within a #include directive and in implementation-defined locations within a #pragma directive.

In other words, the only case that does not follow maximal munch is the header-name token: outside a #include directive or an implementation-defined location in a #pragma directive, the longest-match rule never produces a header-name token.

#define MACRO_1 1

#define MACRO_2 2

#define MACRO_3 3

#define MACRO_EXPR (MACRO_1 <MACRO_2> MACRO_3) // OK: <MACRO_2> is not a header-name

Phase 4: Preprocessing

1) Preprocessor is executed.

2) Each file introduced with the #include directive goes through phases 1 through 4, recursively.

3) At the end of this phase, all preprocessor directives are removed from the source.

In short: once phase 4 is finished, all preprocessing directives have been removed.
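As a minimal sketch of what the preprocessor does in this phase, the fragment below combines macro expansion with conditional inclusion. SQUARE, LANGUAGE_LEVEL, and the two functions are names invented for this example; __STDC_VERSION__ is the standard predefined macro.

```c
/* Function-like macro: expanded textually during phase 4. */
#define SQUARE(x) ((x) * (x))

/* Conditional inclusion: only one of the two #define lines survives. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#define LANGUAGE_LEVEL "C99 or later"
#else
#define LANGUAGE_LEVEL "C90"
#endif

int square_of(int v)
{
    return SQUARE(v);   /* becomes ((v) * (v)) after preprocessing */
}

const char *language_level(void)
{
    return LANGUAGE_LEVEL;
}
```

With gcc or clang, the -E option stops after preprocessing and prints the result, which makes it easy to see that no directives remain in the output.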

Phase 5: Conversion of character constants and string literals to the execution character set

1) All characters and escape sequences in character constants and string literals are converted from source character set to execution character set (which may be a multibyte character set such as UTF-8, as long as all 96 characters from the basic source character set listed in phase 1 have single-byte representations). If the character specified by an escape sequence isn't a member of the execution character set, the result is implementation-defined, but is guaranteed to not be a null (wide) character.

Note: the conversion performed at this stage can be controlled by command line options in some implementations: gcc and clang use -finput-charset to specify the encoding of the source character set, -fexec-charset and -fwide-exec-charset to specify the encodings of the execution character set in the string literals and character constants that don't have an encoding prefix (since C11).

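In short: the encoding of character constants and string literals is converted from the source file encoding to the encoding specified by their prefix. If there is no prefix, the conversion is implementation-defined. Some compilers provide command-line options to control both the source file encoding and the encoding of character constants and string literals in the resulting executable.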
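The effect of this conversion on escape sequences can be sketched as follows. The function names are hypothetical. The first check does not depend on the execution character set; the second one assumes an ASCII-compatible execution character set, which is the common case but is not guaranteed by the standard.

```c
/* Each escape sequence becomes exactly one character of the
   execution character set. */
int escapes_are_single_characters(void)
{
    /* One character plus the terminating null, and two characters
       plus the terminating null, respectively. */
    return sizeof("\n") == 2 && sizeof("\x41\102") == 3;
}

/* Assumption: an ASCII-compatible execution character set,
   where 0x41 is 'A' and octal 102 is 'B'. */
int numeric_escapes_match_letters(void)
{
    return '\x41' == 'A' && '\102' == 'B';
}
```

Both functions should return 1 on a typical desktop toolchain; on a non-ASCII target the second one may return 0.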

Phase 6: String literal concatenation

Adjacent string literals are concatenated.
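A short sketch of concatenation, with names invented for this example. The second function also shows why the order of phases 5 and 6 matters: escape sequences are converted before the literals are joined, so an escape cannot extend across the boundary between two literals.

```c
#include <string.h>

/* Adjacent literals become one string literal in phase 6. */
static const char GREETING[] = "Hello, " "world";

int greeting_is_concatenated(void)
{
    /* 12 characters plus the terminating null. */
    return sizeof(GREETING) == 13
        && strcmp(GREETING, "Hello, world") == 0;
}

/* Escape sequences were already converted in phase 5, so the result
   holds the two characters '\x1' and '2', not the single character '\x12'. */
int escapes_resolved_before_concatenation(void)
{
    return sizeof("\x1" "2") == 3;
}
```

Splitting a literal this way is a common technique for ending a hexadecimal escape that would otherwise swallow the following hex digits.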

Phase 7: Compilation

Compilation takes place: the tokens are syntactically and semantically analyzed and translated as a translation unit.

Phase 8: Linking

Linking takes place: Translation units and library components needed to satisfy external references are collected into a program image which contains information needed for execution in its execution environment (the OS).
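The split between phases 7 and 8 can be illustrated with a single external reference. In this sketch, puts is declared by hand instead of through <stdio.h>, which is only done here to make the point visible; call_external_function is a made-up name. The compiler needs nothing more than the declaration to translate the call, and the definition is supplied later by the C library when the program is linked.

```c
/* Declared by hand instead of via <stdio.h>: phase 7 only needs this
   declaration; the definition lives in the C library. */
int puts(const char *s);

int call_external_function(void)
{
    /* puts returns a non-negative value on success. */
    return puts("resolved at link time") >= 0;
}
```

With gcc or clang, compiling with -c stops after phase 7 and produces an object file that still contains an unresolved reference to puts; a normal build without -c goes on to link it.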

posted @ 2022-01-05 21:01 张志伟122