relocation overflow log

Background:
https://airflow-megengine.iap.hh-d.brainpp.cn/log?dag_id=megbrain-release&task_id=prebuild-cu111&execution_date=2022-10-08T06%3A06%3A51%2B00%3A00#

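During a MegEngine release, the cu11.1 prebuild task FAILED. The failure occurs while linking libmegengine.so, and the reported cause is a relocation overflow. See the link above for the full log; with some output omitted, it looks like this: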

[2378/2388] Linking CXX shared library src/libmegengine.so                                                                                                            
FAILED: src/libmegengine.so                                                                                                                                           
: && /usr/bin/c++ -fPIC -include /home/tangke/MegBrain/src/bin_reduce_cmake.h -ffunction-sections -fdata-sections -Wall -Wextra -Wno-unused-parameter -Wno-extra -m64 
-msse4.2 -mfpmath=sse  -g -O3 -DNDEBUG -fno-finite-math-only  -Wl,--gc-sections -flto=full -fuse-ld=gold   -Wl,--no-undefined -Wl,--version-script=/home/tangke/MegBra
in/src/version.ld -shared -Wl,-soname,libmegengine.so -o src/libmegengine.so @CMakeFiles/megengine.rsp  && :                                                          
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):cutlass_init.compute_80.cudafe1.cpp:function __sti____cudaRegisterAll(): error
: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o)                     
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):matmul_cutlass_template.compute_80.cudafe1.cpp:function __sti____cudaRegisterA
ll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o)          
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):matmul_cutlass_template.compute_80.cudafe1.cpp:function __sti____cudaRegisterA
ll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o)          
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:function __
sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)                                                                                                                                                            
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:function __
sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)                                                                                                                                                            
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:function __
sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)                                                                                                                                                            
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:function __
sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)                                                                                                                                                            
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:function __sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)

Approach
First, read the log the linker gives us carefully.

Linking CXX shared library src/libmegengine.so: the failure happens while linking libmegengine.so.

-fPIC: a compile/link flag worth noting; it may matter during later debugging.

--version-script=/home/tangke/MegBrain/src/version.ld: a version script that controls dynamic linking is in use.

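The script exports a large number of symbols from libmegengine.so (GLOBAL, i.e. placed in the dynamic symbol table). We need to pay attention to how much of .text/.data/.bss these symbols occupy.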

@CMakeFiles/megengine.rsp: a response file generated by CMake that lists the link inputs; ignore it for now.

/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): the failing relocations involve the cuBLAS static library.

reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): the overflow occurs while relocating a symbol from the cuBLAS static archive during the link of libmegengine.so.
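When reading a long linker log like this, it can help to tally the diagnostics mechanically. The following is a minimal sketch, assuming the gold diagnostics have been re-joined onto single lines (the terminal wrapped them in the paste above). The function name `summarize_overflows` and the regular expression are my own for illustration; they are not part of any tool.

```python
import re
from collections import Counter

# Matches one gold "relocation overflow" diagnostic, assuming the terminal's
# hard line wrapping has been undone so each error sits on a single line.
OVERFLOW_RE = re.compile(
    r"^(?P<archive>[^()\s]+)\((?P<member>[^()]+)\):"
    r"(?P<source>[^:]+):function (?P<func>.+?): "
    r"error: relocation overflow: reference to local symbol (?P<symbol>\d+)"
)

def summarize_overflows(log_text):
    """Count overflow errors per (archive member, source file, symbol index)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = OVERFLOW_RE.match(line.strip())
        if m:
            counts[(m["member"], m["source"], m["symbol"])] += 1
    return counts

# Three diagnostics taken from the log above, re-joined onto single lines.
ARCHIVE = "/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o)"
SAMPLE_LOG = "\n".join(
    f"{ARCHIVE}:{src}:function __sti____cudaRegisterAll(): error: relocation overflow: "
    f"reference to local symbol 96350 in {ARCHIVE}"
    for src in (
        "cutlass_init.compute_80.cudafe1.cpp",
        "matmul_cutlass_template.compute_80.cudafe1.cpp",
        "matmul_cutlass_template.compute_80.cudafe1.cpp",
    )
)

if __name__ == "__main__":
    for (member, source, symbol), n in summarize_overflows(SAMPLE_LOG).items():
        print(f"{n}x  {member}  {source}  local symbol {symbol}")
```

In the log above, every diagnostic names the same archive member and the same local symbol index, 96350, which suggests the references all point at one far-away target rather than at many unrelated ones.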

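Next, we need to understand what "relocation overflow" actually means. Suppose our guess is that too many symbols are exported, so that .text + .data overflows, i.e. exceeds 2 GB. Then we first need to check which symbols are present in total. CMake's --graphviz option can visualize the overall compile and link order; the goal is to see the dependencies involved in linking libmegengine.so. The options CMake uses when generating the graph are set in CMakeGraphVizOptions.cmake.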

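In the graph, the meaning of the line styles is explained in the CMake graphviz documentation. As for node borders, a double border is a shared library, such as cuda_stub, and a single border is a static library, such as libnvinfer. This matches what ldd megengine_shared.so shows: of the libraries in the graph, only cuda_stub is listed as a dynamic dependency, while libnvinfer is not.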
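The ldd check can be scripted. Below is a minimal sketch that parses ldd-style output and tests whether a library name appears among the dynamic dependencies. The sample text is hypothetical and was not captured from the real build; the library file name `libcuda_stub.so`, its path, and the load addresses are assumptions made purely for illustration.

```python
def dynamic_dependencies(ldd_output):
    """Return the library names listed in ldd-style output."""
    names = []
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=>" in line:
            # "libfoo.so.1 => /path/libfoo.so.1 (0x...)" or "libfoo.so.1 => not found"
            names.append(line.split("=>", 1)[0].strip())
        else:
            # "linux-vdso.so.1 (0x...)" or "/lib64/ld-linux-x86-64.so.2 (0x...)"
            names.append(line.split(" (", 1)[0].strip())
    return names

def is_dynamic_dependency(ldd_output, fragment):
    """True if any listed dependency name contains `fragment`."""
    return any(fragment in name for name in dynamic_dependencies(ldd_output))

# Hypothetical ldd output, for illustration only (not from the real build).
SAMPLE_LDD = """\
    linux-vdso.so.1 (0x00007ffc0b5f2000)
    libcuda_stub.so => /hypothetical/path/libcuda_stub.so (0x00007f1c2a000000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c29c00000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f1c2a400000)
"""

if __name__ == "__main__":
    print(is_dynamic_dependency(SAMPLE_LDD, "cuda_stub"))  # shared: listed
    print(is_dynamic_dependency(SAMPLE_LDD, "nvinfer"))    # static: not listed
```

A statically linked library never shows up here because its code has already been copied into the shared object at link time.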

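Solid lines in the graph are CMake PUBLIC link dependencies, dashed lines are INTERFACE, and dotted lines are PRIVATE. These three keywords control whether a target's usage requirements, such as header include paths, are visible to whatever links against it. See the references for details.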

So whether these symbols are exported into libmegengine.so has nothing to do with these keywords.

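We want to find out directly which symbols are exported to libmegengine.so. One way is to read the CMake files and check whether each dependency of libmegengine.so is static or shared. A second way is to locate the libraries that appear as dependencies in the graph and check whether each one is shared or static.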
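A third, more direct check is to list the dynamic symbol table of an already built shared object with `nm -D --defined-only` and tally the entries. The sketch below only parses such output; the sample text and symbol names in it are hypothetical, and the helper name `count_dynamic_symbols` is my own.

```python
from collections import Counter

def count_dynamic_symbols(nm_output):
    """Tally defined dynamic symbols by nm type letter (T=text, D=data, B=bss)."""
    counts = Counter()
    for line in nm_output.splitlines():
        parts = line.split()
        # Expected shape of each line: "<hex address> <type letter> <name>"
        if len(parts) == 3 and len(parts[1]) == 1:
            counts[parts[1]] += 1
    return counts

# Hypothetical `nm -D --defined-only` output, for illustration only.
SAMPLE_NM = """\
0000000000401000 T hypothetical_api_init
0000000000401040 T hypothetical_api_run
0000000000604010 D hypothetical_table
0000000000604100 B hypothetical_buffer
"""

if __name__ == "__main__":
    counts = count_dynamic_symbols(SAMPLE_NM)
    print(dict(counts), "total:", sum(counts.values()))
```

Note that this counts symbols, not bytes: a large exported symbol count does not by itself say how big the sections are.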

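The next step is to analyze why exporting these symbols to libmegengine.so would cause an overflow. For example, the default code model, -mcmodel=small, requires .text + .data to stay below 2 GB. So we need to check out the revision before rrconv and look at the sizes of these two sections, then check out the revision after rrconv and look at the .text + .data sizes of the .o/.a files involved in linking libmegengine.so (since they are merged into the .text and .data of libmegengine.so).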
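The size comparison can be sketched as follows, assuming the section sizes come from `size -A` (SysV format) output. The sample numbers are invented to illustrate the arithmetic and are not measurements of libmegengine.so. Treat the 2 GiB comparison as a rough proxy: what actually has to fit is the distance between each referencing instruction and its target.

```python
TWO_GIB = 2**31  # reach of a signed 32-bit displacement, in bytes

def section_sizes(size_output):
    """Parse `size -A` style output into {section name: size in bytes}."""
    sizes = {}
    for line in size_output.splitlines():
        parts = line.split()
        # Section rows look like ".text  <size>  <addr>"; skip headers and totals.
        if len(parts) == 3 and parts[0].startswith(".") and parts[1].isdigit():
            sizes[parts[0]] = int(parts[1])
    return sizes

def combined_size(sizes, sections):
    """Sum the sizes of the named sections, treating missing ones as 0."""
    return sum(sizes.get(name, 0) for name in sections)

# Hypothetical `size -A` output with made-up numbers, for illustration only.
SAMPLE_SIZE = """\
libhypothetical.so  :
section          size        addr
.text      1500000000        4096
.data       300000000  1500004352
.bss        500000000  1800004352
Total      2300000000
"""

if __name__ == "__main__":
    sizes = section_sizes(SAMPLE_SIZE)
    for group in ((".text", ".data"), (".text", ".data", ".bss")):
        total = combined_size(sizes, group)
        print("+".join(group), total, "exceeds 2 GiB:", total > TWO_GIB)
```

In this made-up example .text + .data alone stays under the limit, but adding .bss pushes the total past it, so .bss is worth including in the comparison.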

Assuming .text + .data is what overflows, would simply setting -mcmodel=large solve the problem? Relinking with that flag produces the following:

[2378/2388] Linking CXX shared library src/libmegengine_shared.so                                                                                                     
FAILED: src/libmegengine_shared.so                                                                                                                                    
: && /usr/bin/c++ -fPIC -include /home/tangke/MegBrain/src/bin_reduce_cmake.h -ffunction-sections -fdata-sections -Wall -Wextra -Wno-unused-parameter -Wno-extra -m64 
-msse4.2 -mfpmath=sse  -g -O3 -DNDEBUG -fno-finite-math-only  -Wl,--gc-sections -flto=full -mcmodel=large   -Wl,--no-undefined -Wl,--version-script=/home/tangke/MegBr
ain/src/version.ld -shared -Wl,-soname,libmegengine_shared.so -o src/libmegengine_shared.so @CMakeFiles/megengine_shared.rsp  && :                                    
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':                                    
cutlass_init.compute_80.cudafe1.cpp:(.text.startup+0x36fc6): relocation truncated to fit: R_X86_64_PC32 against `.bss'                                                
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':                                    
matmul_cutlass_template.compute_80.cudafe1.cpp:(.text.startup+0x37018): relocation truncated to fit: R_X86_64_PC32 against `.bss'                                     
matmul_cutlass_template.compute_80.cudafe1.cpp:(.text.startup+0x371c2): relocation truncated to fit: R_X86_64_PC32 against `.bss'                                     
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':                                    
sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x37218): relocation truncated to fit: R_X86_64_PC32 against `.bss'                 
sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x395e6): relocation truncated to fit: R_X86_64_PC32 against `.bss'                 
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':                                    
sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:(.text.startup+0x39638): relocation truncated to fit: R_X86_64_PC32 against `.bss'                 
sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:(.text.startup+0x3ba06): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:(.text.startup+0x3ba58): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:(.text.startup+0x3c8f2): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x3c948): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x3cd76): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status

[2379/2388] Linking CXX shared library src/libmegengine.so
FAILED: src/libmegengine.so
: && /usr/bin/c++ -fPIC -include /home/tangke/MegBrain/src/bin_reduce_cmake.h -ffunction-sections -fdata-sections -Wall -Wextra -Wno-unused-parameter -Wno-extra -m64
-msse4.2 -mfpmath=sse  -g -O3 -DNDEBUG -fno-finite-math-only  -Wl,--gc-sections -flto=full -mcmodel=large   -Wl,--no-undefined -Wl,--version-script=/home/tangke/MegBr
ain/src/version.ld -shared -Wl,-soname,libmegengine.so -o src/libmegengine.so @CMakeFiles/megengine.rsp  && :
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
cutlass_init.compute_80.cudafe1.cpp:(.text.startup+0x36fc6): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
matmul_cutlass_template.compute_80.cudafe1.cpp:(.text.startup+0x37018): relocation truncated to fit: R_X86_64_PC32 against `.bss'
matmul_cutlass_template.compute_80.cudafe1.cpp:(.text.startup+0x371c2): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x37218): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x395e6): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:(.text.startup+0x39638): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:(.text.startup+0x3ba06): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:(.text.startup+0x3ba58): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:(.text.startup+0x3c8f2): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x3c948): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x3cd76): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status

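This log says that many relocations use PC-relative addressing, and the actual displacement exceeds what 32 bits can represent. R_X86_64_PC32 is the relocation type that patches an instruction with a 32-bit PC-relative offset. Note that the failing references sit inside the prebuilt libcublasLt_static.a, which adding -mcmodel=large to our own compile line does not recompile.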
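The range check the linker performs for this relocation type can be written down directly. For R_X86_64_PC32 the x86-64 psABI defines the stored value as S + A - P (symbol address, plus addend, minus the address of the field being patched), and that value must fit in a signed 32-bit integer. The addresses below are made up for illustration; only the arithmetic is the point.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1

def pc32_value(symbol_addr, addend, place):
    """Value stored by R_X86_64_PC32: S + A - P."""
    return symbol_addr + addend - place

def fits_pc32(symbol_addr, addend, place):
    """True if the displacement can be encoded in a signed 32-bit field."""
    return INT32_MIN <= pc32_value(symbol_addr, addend, place) <= INT32_MAX

if __name__ == "__main__":
    place = 0x1000  # hypothetical address of the 4-byte field being patched
    near = place + 0x7000_0000       # target less than 2 GiB away
    far = place + 2**31 + 0x100      # target just over 2 GiB away
    print("near target fits:", fits_pc32(near, -4, place))
    print("far target fits: ", fits_pc32(far, -4, place))
```

When the check fails, GNU ld reports "relocation truncated to fit" and gold reports "relocation overflow", which are the two wordings seen in the logs above.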

So the question is: why does it exceed the 32-bit range?

This brings to mind version.ld, which marks many symbols as global, i.e. to be exported. What is the effect of exporting so many symbols?

REF

https://www.technovelty.org/c/relocation-truncated-to-fit-wtf.html

https://man7.org/conf/lca2006/shared_libraries/slide18c.html

https://users.informatik.haw-hamburg.de/~krabat/FH-Labor/gnupro/5_GNUPro_Utilities/c_Using_LD/ldLinker_scripts.html

https://sourceware.org/binutils/docs/ld/VERSION.html

https://stackoverflow.com/questions/61416129/c-language-global-symbol-local-symbol-clarification

https://stackoverflow.com/questions/69783203/examples-of-when-public-private-interface-should-be-used-in-cmake

https://cmake.org/cmake/help/v3.18/command/target_link_libraries.html?highlight=target_link

https://cmake.org/cmake/help/v3.18/module/CMakeGraphVizOptions.html?highlight=graphviz

https://leimao.github.io/blog/CMake-Public-Private-Interface/

posted @ 2024-03-27 13:12  ijpq