relocation overflow log
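During a MegEngine release, the cu11.1 prebuild FAILED. The failure happens while linking libmegengine.so, and the reported cause is a relocation overflow. See the link above for details; the output, with some parts omitted, is as follows: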
[2378/2388] Linking CXX shared library src/libmegengine.so
FAILED: src/libmegengine.so
: && /usr/bin/c++ -fPIC -include /home/tangke/MegBrain/src/bin_reduce_cmake.h -ffunction-sections -fdata-sections -Wall -Wextra -Wno-unused-parameter -Wno-extra -m64
-msse4.2 -mfpmath=sse -g -O3 -DNDEBUG -fno-finite-math-only -Wl,--gc-sections -flto=full -fuse-ld=gold -Wl,--no-undefined -Wl,--version-script=/home/tangke/MegBra
in/src/version.ld -shared -Wl,-soname,libmegengine.so -o src/libmegengine.so @CMakeFiles/megengine.rsp && :
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):cutlass_init.compute_80.cudafe1.cpp:function __sti____cudaRegisterAll(): error
: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o)
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):matmul_cutlass_template.compute_80.cudafe1.cpp:function __sti____cudaRegisterA
ll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o)
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):matmul_cutlass_template.compute_80.cudafe1.cpp:function __sti____cudaRegisterA
ll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o)
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:function __
sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:function __
sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:function __
sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:function __
sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o):sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:function __sti____cudaRegisterAll(): error: relocation overflow: reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_s
tatic.a.o)
Approach
The first step is to read the log produced by the linker carefully.
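Linking CXX shared library src/libmegengine.so: the error occurred while linking megengine.so.
-fPIC: a link flag worth noting; it may matter during later debugging.
--version-script=/home/tangke/MegBrain/src/version.ld: a version script is used to control dynamic linking.
The script exports a large number of symbols (GLOBAL, placed in the dynamic symbol table) from megengine.so, so it is worth checking how much of .text/.data/.bss these symbols occupy.
@CMakeFiles/megengine.rsp: a response file generated by CMake; ignore it for now.
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): the relocation involves the cuBLAS static library.
reference to local symbol 96350 in /usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): the overflow occurred while relocating a symbol from the cuBLAS static library into megengine.so.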
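The log repeats the same message many times, so it can help to condense it before reasoning about it. The following is a minimal sketch, not part of the original build tooling, that groups the "relocation overflow" lines by archive, source file, and local symbol index. The regular expression is an assumption based only on the message format shown in the log above, and `summarize_overflows` is a hypothetical helper name.

```python
import os
import re
from collections import Counter

# Pattern inferred from the gold linker messages quoted in the log above.
OVERFLOW_RE = re.compile(
    r"^(?P<archive>[^()]+)\((?P<member>[^()]+)\):(?P<src>[^:]+):"
    r"function (?P<func>.+?): error: relocation overflow: "
    r"reference to local symbol (?P<sym>\d+) in"
)


def summarize_overflows(log_text):
    """Count overflow messages per (archive name, source file, symbol index)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = OVERFLOW_RE.match(line.strip())
        if m:
            key = (os.path.basename(m["archive"]), m["src"], m["sym"])
            counts[key] += 1
    return counts


ARCHIVE = "/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o)"
SAMPLE_LOG = "\n".join(
    f"{ARCHIVE}:{src}:function __sti____cudaRegisterAll(): error: "
    f"relocation overflow: reference to local symbol 96350 in {ARCHIVE}"
    for src in (
        "cutlass_init.compute_80.cudafe1.cpp",
        "matmul_cutlass_template.compute_80.cudafe1.cpp",
        "matmul_cutlass_template.compute_80.cudafe1.cpp",
    )
)

if __name__ == "__main__":
    for (archive, src, sym), n in summarize_overflows(SAMPLE_LOG).items():
        print(f"{n} x {archive} / {src} / local symbol {sym}")
```

Because the terminal wrapped the pasted log, each message would have to be rejoined into a single line before this sketch can match it.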
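Next, figure out what "relocation overflow" means. Suppose the guess is that too many symbols are exported, so .text + .data overflows, i.e. exceeds 2 GB. Then the first thing to check is which symbols there are in total. CMake's --graphviz option can visualize the overall compile and link order; the goal is to see the dependencies involved in linking megengine.so. When CMake generates the graph, it reads its settings from CMakeGraphVizOptions.cmake.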
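In the graph, the meaning of the edges is explained in the CMake graphviz documentation. As for node borders, a double border is a shared library, such as cuda_stub in the graph, and a single border is a static library, such as libnvinfer. Running ldd on megengine_shared.so confirms this: of the dependencies in the graph, only cuda_stub shows up as a dynamic dependency, while libnvinfer does not.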
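In the graph, solid lines are CMake PUBLIC link dependencies, dashed lines are INTERFACE, and dotted lines are PRIVATE. These three keywords control whether a dependency's usage requirements, such as header include paths, are visible to targets that link against it. See the references for details.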
So whether these symbols are exported from libmegengine.so has nothing to do with these keywords.
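To find out directly which symbols are exported to megengine.so, there are two options. One is to read the CMake files and see whether each dependency of megengine.so is static or shared. The other is to locate the libraries that the graph shows as dependencies and check whether each one is static or shared.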
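The second option can be partly automated by reading the generated DOT file instead of the rendered picture. The sketch below is a rough illustration under stated assumptions: it assumes the node shapes used by recent CMake versions (octagon for a static library, doubleoctagon for a shared library), and the DOT text is a hypothetical sample written for this example, not the real MegEngine graph. `classify_targets` is a made-up helper name.

```python
import re

# Assumed mapping from CMake graphviz node shapes to target kinds.
SHAPE_TO_KIND = {"octagon": "static", "doubleoctagon": "shared"}

# Matches node declarations such as:
#   "node0" [ label = "megengine", shape = doubleoctagon ];
NODE_RE = re.compile(
    r'"(?P<id>[^"]+)"\s*\[\s*label\s*=\s*"(?P<label>[^"]+)"\s*,?\s*'
    r'shape\s*=\s*"?(?P<shape>\w+)"?'
)


def classify_targets(dot_text):
    """Map each target label to 'static', 'shared', or 'other'."""
    kinds = {}
    for m in NODE_RE.finditer(dot_text):
        kinds[m["label"]] = SHAPE_TO_KIND.get(m["shape"], "other")
    return kinds


# Hypothetical DOT fragment; node ids and edges are illustrative only.
SAMPLE_DOT = """
digraph "MegEngine" {
    "node0" [ label = "megengine", shape = doubleoctagon ];
    "node1" [ label = "cuda_stub", shape = doubleoctagon ];
    "node2" [ label = "libnvinfer", shape = octagon ];
    "node0" -> "node1" [ style = dotted ];
    "node0" -> "node2";
}
"""

if __name__ == "__main__":
    for label, kind in classify_targets(SAMPLE_DOT).items():
        print(f"{label}: {kind}")
```

Targets classified as static are the ones whose code and data get merged into megengine.so at link time, so they are the ones worth measuring.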
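The next step is to analyze why exporting these symbols to megengine.so would cause an overflow. For example, the default code model is mcmodel=small, which requires .text + .data to be smaller than 2 GB. So check out the commit before rrconv and look at the sizes of these two sections, then check out the commit after rrconv and look at the .text + .data sizes of the .o/.a files involved in linking megengine.so (because they are merged into the .text and .data of megengine.so).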
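The size comparison described above is simple arithmetic once the section sizes are known. As a sketch, assuming the sizes come from the SysV-style output of binutils `size -A`, the helper below sums selected sections and compares the total with the 2 GiB reach of a signed 32-bit displacement. The sample output and its numbers are invented for illustration and are not measurements of libmegengine.so.

```python
LIMIT = 2**31  # 2 GiB, the reach of a signed 32-bit displacement


def section_sizes(size_a_output):
    """Parse `size -A` style text into {section name: size in bytes}."""
    sizes = {}
    for line in size_a_output.splitlines():
        parts = line.split()
        # Section rows look like: <name> <size> <addr>
        if len(parts) == 3 and parts[0].startswith(".") and parts[1].isdigit():
            sizes[parts[0]] = sizes.get(parts[0], 0) + int(parts[1])
    return sizes


def total_bytes(sizes, sections=(".text", ".data")):
    """Sum the sizes of the requested sections (missing ones count as 0)."""
    return sum(sizes.get(name, 0) for name in sections)


# Hypothetical numbers, chosen only to show both outcomes.
SAMPLE_SIZE_OUTPUT = """\
libexample.so  :
section        size        addr
.text     900000000        4096
.data     300000000   900004096
.bss     1200000000  1200004096
Total    2400000000
"""

if __name__ == "__main__":
    sizes = section_sizes(SAMPLE_SIZE_OUTPUT)
    for group in ((".text", ".data"), (".text", ".data", ".bss")):
        total = total_bytes(sizes, group)
        verdict = "exceeds" if total >= LIMIT else "fits within"
        print(f"{'+'.join(group)} = {total} bytes, {verdict} 2 GiB")
```

Including .bss in the second sum is a choice made for this sketch; zero-initialized data also occupies address space at run time, so it can contribute to the distance between code and data.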
Assuming the overflow is caused by .text + .data, does simply setting mcmodel=large solve the problem? It does not; the link still fails:
[2378/2388] Linking CXX shared library src/libmegengine_shared.so
FAILED: src/libmegengine_shared.so
: && /usr/bin/c++ -fPIC -include /home/tangke/MegBrain/src/bin_reduce_cmake.h -ffunction-sections -fdata-sections -Wall -Wextra -Wno-unused-parameter -Wno-extra -m64
-msse4.2 -mfpmath=sse -g -O3 -DNDEBUG -fno-finite-math-only -Wl,--gc-sections -flto=full -mcmodel=large -Wl,--no-undefined -Wl,--version-script=/home/tangke/MegBr
ain/src/version.ld -shared -Wl,-soname,libmegengine_shared.so -o src/libmegengine_shared.so @CMakeFiles/megengine_shared.rsp && :
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
cutlass_init.compute_80.cudafe1.cpp:(.text.startup+0x36fc6): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
matmul_cutlass_template.compute_80.cudafe1.cpp:(.text.startup+0x37018): relocation truncated to fit: R_X86_64_PC32 against `.bss'
matmul_cutlass_template.compute_80.cudafe1.cpp:(.text.startup+0x371c2): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x37218): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x395e6): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:(.text.startup+0x39638): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:(.text.startup+0x3ba06): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:(.text.startup+0x3ba58): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:(.text.startup+0x3c8f2): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x3c948): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x3cd76): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status
[2379/2388] Linking CXX shared library src/libmegengine.so
FAILED: src/libmegengine.so
: && /usr/bin/c++ -fPIC -include /home/tangke/MegBrain/src/bin_reduce_cmake.h -ffunction-sections -fdata-sections -Wall -Wextra -Wno-unused-parameter -Wno-extra -m64
-msse4.2 -mfpmath=sse -g -O3 -DNDEBUG -fno-finite-math-only -Wl,--gc-sections -flto=full -mcmodel=large -Wl,--no-undefined -Wl,--version-script=/home/tangke/MegBr
ain/src/version.ld -shared -Wl,-soname,libmegengine.so -o src/libmegengine.so @CMakeFiles/megengine.rsp && :
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
cutlass_init.compute_80.cudafe1.cpp:(.text.startup+0x36fc6): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
matmul_cutlass_template.compute_80.cudafe1.cpp:(.text.startup+0x37018): relocation truncated to fit: R_X86_64_PC32 against `.bss'
matmul_cutlass_template.compute_80.cudafe1.cpp:(.text.startup+0x371c2): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x37218): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x395e6): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:(.text.startup+0x39638): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part1.asm.compute_61.cudafe1.cpp:(.text.startup+0x3ba06): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:(.text.startup+0x3ba58): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_50_52_53_60_61_62_sass_wrapper_part2.asm.compute_61.cudafe1.cpp:(.text.startup+0x3c8f2): relocation truncated to fit: R_X86_64_PC32 against `.bss'
/usr/local/cuda_dir/cuda-11.1/lib64/../lib/libcublasLt_static.a(libcublasLt_static.a.o): In function `__sti____cudaRegisterAll()':
sm_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x3c948): relocation truncated to fit: R_X86_64_PC32 against `.bss'
sm_53_60_61_62_sass_wrapper_part0.asm.compute_61.cudafe1.cpp:(.text.startup+0x3cd76): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status
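This log says that a large number of relocations use PC-relative addressing, but the actual displacement exceeds what a 32-bit field can represent. R_X86_64_PC32 is a 32-bit PC-relative relocation, i.e. the fix-up the linker applies to an instruction during relocation. Note that the failing relocations are all inside the prebuilt libcublasLt_static.a, so passing mcmodel=large when compiling our own sources does not change them.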
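To make the 32-bit limit concrete: the x86-64 psABI defines the value stored for R_X86_64_PC32 as S + A - P, where S is the symbol address, A the addend, and P the address of the field being patched, and the result has to fit in a signed 32-bit word. The sketch below just evaluates that formula; the addresses in the usage part are made up for illustration.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def pc32_value(S, A, P):
    """Value the linker would store for R_X86_64_PC32: S + A - P."""
    return S + A - P


def fits_pc32(S, A, P):
    """True if the relocation value fits in a signed 32-bit field."""
    return INT32_MIN <= pc32_value(S, A, P) <= INT32_MAX


if __name__ == "__main__":
    # Code near its data: the displacement is small and fits.
    print(fits_pc32(S=0x2000, A=-4, P=0x1000))
    # Data placed more than 2 GiB away from the instruction: truncation.
    print(fits_pc32(S=0x9000_0000, A=-4, P=0x1000))
```

In the second call the target lies more than 2 GiB beyond the instruction, which is the situation the linker reports as "relocation truncated to fit".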
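So the question is: why does it exceed the 32-bit range?
Recall that version.ld declares many symbols as global, meaning they are to be exported. What effect does exporting so many symbols have?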
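One way to start on this question is to measure how many symbols actually end up exported. The following sketch assumes the input is the text printed by `nm -D --defined-only` on the built library, with three whitespace-separated columns (address, type letter, name), and it tallies the symbols by type letter. The sample lines and symbol names are hypothetical, and `count_exports` is a made-up helper name.

```python
from collections import Counter


def count_exports(nm_output):
    """Tally dynamic symbols by nm type letter (e.g. T = text, D = data, B = bss)."""
    counts = Counter()
    for line in nm_output.splitlines():
        parts = line.split()
        # Expected row shape: <address> <type letter> <symbol name>
        if len(parts) == 3 and len(parts[1]) == 1:
            counts[parts[1]] += 1
    return counts


# Hypothetical nm output; addresses and names are illustrative only.
SAMPLE_NM = """\
0000000000001139 T example_init
0000000000001200 T example_run
0000000000004010 D example_table
0000000000004040 B example_buffer
"""

if __name__ == "__main__":
    for letter, n in sorted(count_exports(SAMPLE_NM).items()):
        print(f"{letter}: {n}")
```

A count like this only says how many symbols are exported; it does not by itself show how much space they occupy, which is what the section sizes are for.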
REF
https://www.technovelty.org/c/relocation-truncated-to-fit-wtf.html
https://man7.org/conf/lca2006/shared_libraries/slide18c.html
https://sourceware.org/binutils/docs/ld/VERSION.html
https://stackoverflow.com/questions/61416129/c-language-global-symbol-local-symbol-clarification
https://cmake.org/cmake/help/v3.18/command/target_link_libraries.html?highlight=target_link
https://cmake.org/cmake/help/v3.18/module/CMakeGraphVizOptions.html?highlight=graphviz
https://leimao.github.io/blog/CMake-Public-Private-Interface/
This article comes from cnblogs (博客园), author: ijpq. When reposting, please cite the original link: https://www.cnblogs.com/ijpq/p/18098760