The 8 translation phases of C source code
Source
https://en.cppreference.com/w/c/language/translation_phases
Phase 1: Character mapping
1) The individual bytes of the source code file (which is generally a text file in some multibyte encoding such as UTF-8) are mapped, in implementation defined manner, to the characters of the source character set. In particular, OS-dependent end-of-line indicators are replaced by newline characters.
In short: the byte stream of the source file is mapped to the source character set. In particular, OS-specific line endings are all mapped to the newline character of the source character set.
The source character set is a multibyte character set which includes the basic source character set as a single-byte subset, consisting of the following 96 characters:
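The source character set is a multibyte character set that contains the basic source character set as a single-byte subset. The basic source character set consists of 96 single-byte characters.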
a) 5 whitespace characters (space, horizontal tab, vertical tab, form feed, new-line)
b) 10 digit characters from '0' to '9'
c) 52 letters from 'a' to 'z' and from 'A' to 'Z'
d) 29 punctuation characters: _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
In total: 5 whitespace characters + 10 digits + 52 letters + 29 punctuation characters = 96 characters.
2) Trigraph sequences are replaced by corresponding single-character representations.
Each trigraph (a three-character sequence) is converted to its corresponding single character.
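The trigraph replacement step can be illustrated with a small string-rewriting routine. This is only a sketch of the idea, not how any particular compiler implements it; the names `trigraph_replacement` and `replace_trigraphs` are made up for this example. The mapping table itself is the standard set of nine trigraphs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given the third character of a candidate trigraph,
   returns the replacement character, or 0 if it is not a trigraph. */
static char trigraph_replacement(char c)
{
    static const char third[]  = "=/'()!<>-";
    static const char single[] = "#\\^[]|{}~";
    const char *p = (c != '\0') ? strchr(third, c) : NULL;
    return p ? single[p - third] : 0;
}

/* Copies src to dst, replacing every trigraph by its single character.
   dst must have room for at least strlen(src) + 1 bytes. */
void replace_trigraphs(const char *src, char *dst)
{
    while (*src != '\0') {
        char r = 0;
        if (src[0] == '?' && src[1] == '?')
            r = trigraph_replacement(src[2]);
        if (r != 0) {
            *dst++ = r;      /* three input characters become one */
            src += 3;
        } else {
            *dst++ = *src++; /* ordinary character: copy unchanged */
        }
    }
    *dst = '\0';
}
```

The test inputs would need to spell two adjacent question marks as `?\?` inside string literals, so that a compiler with trigraphs enabled does not rewrite the test data itself.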
Phase 2: Line splicing
1) Whenever backslash appears at the end of a line (immediately followed by the newline character), both backslash and newline are deleted, combining two physical source lines into one logical source line. This is a single-pass operation: a line ending in two backslashes followed by an empty line does not combine three lines into one.
Every backslash-newline pair is deleted, so the two physical source lines are merged into one logical line.
2) If a non-empty source file does not end with a newline character after this step (whether it had no newline originally, or it ended with a backslash), the behavior is undefined.
If, after the step above, a non-empty source file does not end with a newline character, the behavior is undefined.
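A short sketch of line splicing in practice. The names `SUM3`, `spliced_value`, and `sum3_demo` are invented for illustration. The second case (an identifier split across lines) is legal only because splicing happens before tokenization; it is shown to make the ordering of the phases visible, not as recommended style.

```c
/* A macro definition must fit on one logical line; backslash-newline
   lets it span several physical lines. */
#define SUM3(a, b, c) \
    ((a) +            \
     (b) +            \
     (c))

/* Splicing happens before tokenization, so even an identifier can be
   split across two physical lines. The continuation must start in the
   first column, otherwise whitespace would end up inside the name. */
int spliced_va\
lue = 42;

int sum3_demo(void)
{
    return SUM3(1, 2, 3) + spliced_value;
}
```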
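Phase 3: Preprocessing tokenization (grouping characters into preprocessing tokens)
1) The source file is decomposed into comments, sequences of whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), and preprocessing tokens, which are the following:
The character stream produced by phase 2 is split into comments, whitespace, and preprocessing tokens.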
a) header names: <stdio.h> or "myfile.h"
b) identifiers
c) preprocessing numbers, which cover integer constants and floating constants, but also cover some invalid tokens such as 1..E+3.foo or 0JBK
d) character constants and string literals
e) operators and punctuators, such as +, <<=, <%, or ##.
f) individual non-whitespace characters that do not fit in any other category
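Preprocessing tokens fall into 6 categories:
Header names, e.g. <stdio.h> or "myfile.h"
Identifiers
Preprocessing numbers, e.g. 1..E+3.foo
Character constants and string literals
Operators and punctuators
Any other single non-whitespace character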
2) Each comment is replaced by one space character
Each comment is replaced by one space character, so only whitespace and preprocessing tokens remain.
3) Newlines are kept, and it's implementation-defined whether non-newline whitespace sequences may be collapsed into single space characters.
Newline characters are kept. The implementation decides whether a run of whitespace characters other than newline is collapsed into a single space.
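Two of the points above can be observed with the stringification operator `#`, which turns the preprocessing tokens of a macro argument into a string literal. The following is a minimal sketch; `STR` and the function names are made up for this example. It relies on the fact that a preprocessing number which is only stringified never has to become a valid C constant.

```c
#include <string.h>

/* Turns the preprocessing tokens of its argument into a string literal. */
#define STR(x) #x

/* 0JBK and 1..E+3.foo are each one valid preprocessing number, even
   though neither is a valid integer or floating constant. Returns 1
   if both were stringified unchanged. */
int pp_numbers_survive(void)
{
    return strcmp(STR(0JBK), "0JBK") == 0
        && strcmp(STR(1..E+3.foo), "1..E+3.foo") == 0;
}

/* The empty comment is replaced by one space in phase 3, so a and b
   stay two separate tokens instead of fusing into the identifier ab. */
const char *comment_as_space(void)
{
    return STR(a/**/b);
}
```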
If the input has been parsed into preprocessing tokens up to a given character, the next preprocessing token is generally taken to be the longest sequence of characters that could constitute a preprocessing token, even if that would cause subsequent analysis to fail. This is commonly known as maximal munch.
When characters are grouped into preprocessing tokens, the rule is "longest match" (maximal munch).
int foo = 1;
int bar = 0xE+foo; // error: invalid preprocessing number 0xE+foo
int baz = 0xE + foo; // OK
int pub = bar+++baz; // OK: bar++ + baz
int ham = bar++-++baz; // OK: bar++ - ++baz
int qux = bar+++++baz; // error: bar++ ++ +baz, not bar++ + ++baz.
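A runnable sketch of the `+++` case, to confirm which operand the increment binds to. The struct and function names are hypothetical and chosen only for this illustration.

```c
/* Holds the observable effects of evaluating a+++b. */
struct munch_result {
    int sum;      /* value of the expression */
    int a_after;  /* value of a afterwards */
    int b_after;  /* value of b afterwards */
};

struct munch_result munch_demo(void)
{
    int a = 1, b = 10;
    struct munch_result r;

    /* Maximal munch yields the tokens: a  ++  +  b, i.e. (a++) + b. */
    r.sum = a+++b;
    r.a_after = a;
    r.b_after = b;
    return r;
}
```

If the expression had instead been read as `a + (++b)`, the sum would be 12 and `b` would have changed rather than `a`.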
The sole exception to the maximal munch rule is:
Header name preprocessing tokens are only formed within a #include directive and in implementation-defined locations within a #pragma directive.
That is, outside these two contexts a sequence such as <MACRO_2> is never tokenized as a header name, even though a header name would be the longest match.
#define MACRO_1 1
#define MACRO_2 2
#define MACRO_3 3
#define MACRO_EXPR (MACRO_1 <MACRO_2> MACRO_3) // OK: <MACRO_2> is not a header-name
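To see that `<MACRO_2>` really is treated as two relational operators around an identifier, the macro can be evaluated. This sketch repeats the definitions above so that it compiles on its own; `macro_expr_value` is a name invented here. Some compilers may warn about the chained comparison, which is expected for this deliberately odd expression.

```c
#define MACRO_1 1
#define MACRO_2 2
#define MACRO_3 3
#define MACRO_EXPR (MACRO_1 <MACRO_2> MACRO_3)

/* Expands to (1 < 2 > 3), which groups as ((1 < 2) > 3):
   1 < 2 yields 1, and 1 > 3 yields 0. */
int macro_expr_value(void)
{
    return MACRO_EXPR;
}
```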
Phase 4: Preprocessing
1) Preprocessor is executed.
2) Each file introduced with the #include directive goes through phases 1 through 4, recursively.
3) At the end of this phase, all preprocessor directives are removed from the source.
When phase 4 finishes, all preprocessing directives have been removed.
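A small sketch of what "the preprocessor is executed" means for macro expansion and conditional inclusion. All macro and function names here are invented for the example. After phase 4 the compiler proper sees only the expanded tokens noted in the comments.

```c
#define BUFFER_SIZE 16
#define SQUARE(x) ((x) * (x))

/* Only one branch of the conditional survives phase 4. */
#if BUFFER_SIZE > 8
#define MODE "large"
#else
#define MODE "small"
#endif

/* After preprocessing: return ((16) * (16)); */
int square_of_size(void)
{
    return SQUARE(BUFFER_SIZE);
}

/* After preprocessing: return "large"; */
const char *mode_name(void)
{
    return MODE;
}
```

With gcc or clang, the `-E` option stops after preprocessing and prints the resulting text, which is a convenient way to inspect the output of this phase.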
Phase 5: Character-set conversion of character constants and string literals
1) All characters and escape sequences in character constants and string literals are converted from source character set to execution character set (which may be a multibyte character set such as UTF-8, as long as all 96 characters from the basic source character set listed in phase 1 have single-byte representations). If the character specified by an escape sequence isn't a member of the execution character set, the result is implementation-defined, but is guaranteed to not be a null (wide) character.
Note: the conversion performed at this stage can be controlled by command line options in some implementations: gcc and clang use -finput-charset to specify the encoding of the source character set, -fexec-charset and -fwide-exec-charset to specify the encodings of the execution character set in the string literals and character constants that don't have an encoding prefix (since C11).
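The encoding of character constants and string literals is converted from the source file's encoding to the encoding selected by their prefix. Without a prefix, the conversion is determined by the implementation. Some compilers provide command-line options to control both the encoding of the source file and the encoding of character constants and string literals in the resulting executable.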
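The effect of this conversion can be observed through `sizeof` on string literals. The following is a minimal sketch with made-up function names. It assumes a C11 or later compiler for the `u8` prefix, and it spells the non-ASCII character as the universal character name `\u00e9` so that the result does not depend on the encoding of the source file.

```c
#include <stddef.h>

/* The escape sequence \n becomes a single character:
   one byte plus the terminating null. */
size_t escaped_newline_size(void)
{
    return sizeof "\n";
}

/* Here \\ is one backslash, followed by the letter n:
   two bytes plus the terminating null. */
size_t backslash_then_n_size(void)
{
    return sizeof "\\n";
}

/* The u8 prefix selects UTF-8, where U+00E9 takes two bytes:
   two bytes plus the terminating null. */
size_t utf8_e_acute_size(void)
{
    return sizeof u8"\u00e9";
}
```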
Phase 6: String literal concatenation
Adjacent string literals are concatenated.
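A brief sketch of concatenation, including a literal that comes from a macro (the macro was already expanded in phase 4, so by phase 6 it is just another adjacent literal). `GREETING` and the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define GREETING "Hello, "

/* The three adjacent literals are joined into "Hello, world!". */
const char *greeting(void)
{
    return GREETING "world" "!";
}

size_t greeting_length(void)
{
    return strlen(greeting());
}
```

The same mechanism is what makes it possible to break a long string across several source lines without using backslash-newline.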
Phase 7: Compilation
Compilation takes place: the tokens are syntactically and semantically analyzed and translated as a translation unit.
Phase 8: Linking
Linking takes place: Translation units and library components needed to satisfy external references are collected into a program image which contains information needed for execution in its execution environment (the OS).