Post category - CUDA
Summary: https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/ Efficient Matrix Multiplication on GPUs. Compute intensity = (time complexity / space complexity) = O(N^3)/O(N^2) = O(N) //
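The O(N^3)/O(N^2) ratio above can be made concrete with a small host-side sketch. The function names, and the assumption that each of the three N x N matrices moves through memory exactly once, are illustrative and do not come from the linked post.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations for C = A * B with N x N matrices:
// N*N outputs, each needing N multiplies and N adds.
double gemm_flops(std::size_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Lower bound on bytes moved, ASSUMING each matrix is touched exactly once
// (read A, read B, write C). Real kernels usually move more.
double gemm_min_bytes(std::size_t n, std::size_t bytes_per_element) {
    assert(bytes_per_element > 0);
    const double d = static_cast<double>(n);
    return 3.0 * d * d * static_cast<double>(bytes_per_element);
}

// Compute intensity in FLOPs per byte: 2N^3 / (3 N^2 b) = 2N / (3b),
// which grows linearly with N.
double gemm_intensity(std::size_t n, std::size_t bytes_per_element) {
    assert(n > 0);
    return gemm_flops(n) / gemm_min_bytes(n, bytes_per_element);
}
```

Under these assumptions, doubling N doubles the intensity: with 4-byte elements, `gemm_intensity(6, 4)` is 1 FLOP per byte and `gemm_intensity(12, 4)` is 2.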
Summary: 0x00 basics of im2col https://zhuanlan.zhihu.com/p/491307328 0x01 basics of implicit GEMM https://zhuanlan.zhihu.com/p/372973726 So far: for 0x00, focus on im2col; 0x0
Summary: Resources: first GTC talk about CUTLASS (GTC 2018); first GTC talk about CUTLASS (GTC 2018), Diigo PDF; best NVIDIA tech blog about CUTLASS; Accelerating Convolution wit
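Summary: I read the basic introduction at this link: https://www.techcenturion.com/nvidia-cuda-cores/ The following passage from it explains the abstraction in an easy-to-follow way: Let us consider an example to understand the working of CUDA cores. Thi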
Summary: First introduced in the Volta architecture: https://images.nvidia.cn/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf NVIDIA tech blog about Tensor Cores: https://d
Summary: CUDA Toolkit v11.8 docs, link: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html preface; assess (evaluate); application; heterogeneous computing; application profil
Summary: shfl_xor: search the CUDA docs for shfl_xor. https://tschmidt23.github.io/cse599i/CSE%20599%20I%20Accelerated%20Computing%20-%20Programming%20GPUs%20Lecture%2018.pd
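Since the excerpt above is truncated, the following is only a generic illustration of the XOR (butterfly) exchange pattern commonly built on CUDA's `__shfl_xor_sync`, modeled on the host in plain C++. The helper names `shuffle_xor` and `butterfly_all_reduce_sum` are hypothetical, and a warp size of 32 is assumed.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // ASSUMPTION: 32 lanes per warp

// Host-side model of one exchange step: every lane reads the value held by
// lane (lane ^ lane_mask). On the device this role is played by
// __shfl_xor_sync; here it is just an array permutation.
std::array<int, kWarpSize> shuffle_xor(const std::array<int, kWarpSize>& vals,
                                       int lane_mask) {
    assert(lane_mask > 0 && lane_mask < kWarpSize);
    std::array<int, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = vals[lane ^ lane_mask];
    }
    return out;
}

// Butterfly all-reduce: after log2(32) = 5 exchange steps with masks
// 16, 8, 4, 2, 1, every lane holds the sum of all 32 inputs.
std::array<int, kWarpSize> butterfly_all_reduce_sum(
        std::array<int, kWarpSize> vals) {
    for (int mask = kWarpSize / 2; mask > 0; mask /= 2) {
        const std::array<int, kWarpSize> partner = shuffle_xor(vals, mask);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            vals[lane] += partner[lane];
        }
    }
    return vals;
}
```

Because each lane both sends and receives at every step, no lane ends up as a special "root": the result is available everywhere, which is what distinguishes the XOR pattern from a one-directional shuffle-down reduction.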
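Summary: # profiler ### nvprof The earliest profiler; CLI only. ### nvvp An evolved version of nvprof that adds a GUI. ### ncu At the time of writing, CUDA no longer supports nvprof, and nvvp has become extremely hard to use (because many features, such as metrics, were removed). The current recommendation is to use nsight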
Summary: 0x00 Background. Prior to the 5.0 release, CUDA did not support separate compilation, so CUDA code could not call device functions or access variables across
Summary: Assume A is 3x4 and B is 4x3. Physical structure: A[0,1,2,...,11]; B[0,...,11]. Logical structure: A[0,1,2,3] A[4,5,6,7] A[8,9,10,11]; B[0,1,2] B[3,4,5] B[6,7,8] B[9,10,11
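The physical/logical split above corresponds to row-major storage, where the element at (row, col) lives at flat offset row * cols + col. A minimal sketch follows, with hypothetical helper names and `int` elements chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map a logical (row, col) position to its offset in the flat,
// row-major physical array.
std::size_t row_major_index(std::size_t row, std::size_t col,
                            std::size_t rows, std::size_t cols) {
    assert(row < rows && col < cols);
    return row * cols + col;
}

// Multiply A (m x k) by B (k x n), both stored as flat row-major arrays,
// and return C (m x n) in the same layout.
std::vector<int> matmul_flat(const std::vector<int>& a,
                             const std::vector<int>& b,
                             std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<int> c(m * n, 0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t p = 0; p < k; ++p) {
                c[row_major_index(i, j, m, n)] +=
                    a[row_major_index(i, p, m, k)] *
                    b[row_major_index(p, j, k, n)];
            }
        }
    }
    return c;
}
```

For the 3x4 matrix A, `row_major_index(1, 2, 3, 4)` is 6, matching the second logical row A[4,5,6,7]; for the 4x3 matrix B, `row_major_index(3, 0, 4, 3)` is 9, the start of its fourth row.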
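Summary: Overview. This is an ECE408 assignment whose goal is to implement 3D convolution. For testing, the script at link is run against the test data. The course's test environment is a GTX 1080. My own RTX 2070 hits a bug, while the lab server's Titan Xp works. There are two ways to write the solution; so far I have implemented only the one that is easier to understand but less efficient. I think the eff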