Post category - CUDA

Summary: https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/ Efficient Matrix Multiplication on GPUs. Compute intensity = (time complexity / space complexity) = O(N^3)/O(N^2) = O(N) // Read more
posted @ 2024-03-26 13:47 ijpq Views (19) Comments (0) Recommended (0)
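The O(N^3)/O(N^2) = O(N) estimate in the entry above can be made concrete with a small calculation. This is only a sketch under stated assumptions: the function names are hypothetical, each multiply-add is counted as 2 floating-point operations, and memory traffic is idealized as reading A and B once and writing C once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Floating-point operations for C = A * B with square N x N matrices.
// Assumption: N^3 multiply-adds, each counted as 2 flops.
double gemm_flops(std::uint64_t n) {
    return 2.0 * n * n * n;
}

// Idealized memory traffic in bytes: read A, read B, write C (3 * N^2 elements).
// Assumption: every element moves between memory and the chip exactly once.
double gemm_bytes(std::uint64_t n, std::size_t elem_size) {
    assert(elem_size > 0);
    return 3.0 * n * n * elem_size;
}

// Compute intensity in flops per byte; grows linearly with N.
double gemm_intensity(std::uint64_t n, std::size_t elem_size) {
    assert(n > 0);
    return gemm_flops(n) / gemm_bytes(n, elem_size);
}
```

With 4-byte elements the ratio works out to N/6, so doubling N doubles the intensity, which matches the O(N) growth quoted in the summary.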
Summary: 0x00 base of im2col https://zhuanlan.zhihu.com/p/491307328 0x01 base of implicit GEMM https://zhuanlan.zhihu.com/p/372973726 So far, for 0x00 focus on im2col, 0x0 Read more
posted @ 2022-11-03 14:19 ijpq Views (128) Comments (0) Recommended (0)
Summary: resources: first GTC talk about CUTLASS, GTC 2018; first GTC talk about CUTLASS, GTC 2018 (diigo pdf); best NVIDIA tech blog about CUTLASS; Accelerating Convolution wit Read more
posted @ 2022-10-31 15:27 ijpq Views (52) Comments (0) Recommended (0)
Summary: I read the basic introduction at this link: https://www.techcenturion.com/nvidia-cuda-cores/ Its high-level explanation is fairly easy to follow: Let us consider an example to understand the working of CUDA cores. Thi Read more
posted @ 2022-10-31 15:26 ijpq Views (409) Comments (0) Recommended (0)
Summary: First introduced in the Volta architecture: https://images.nvidia.cn/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf NVIDIA tech blog about Tensor Cores: https://d Read more
posted @ 2022-10-31 14:17 ijpq Views (41) Comments (0) Recommended (0)
Summary: CUDA Toolkit v11.8 docs, link: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html preface; assess (evaluate); application; heterogeneous computing; application profil Read more
posted @ 2022-10-25 18:37 ijpq Views (61) Comments (0) Recommended (0)
Summary: shfl_xor: search the CUDA docs for shfl_xor https://tschmidt23.github.io/cse599i/CSE%20599%20I%20Accelerated%20Computing%20-%20Programming%20GPUs%20Lecture%2018.pd Read more
posted @ 2022-07-12 17:12 ijpq Views (85) Comments (0) Recommended (0)
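For the shfl_xor entry above, the access pattern is easier to see on the CPU first. The sketch below emulates it for one 32-lane warp: in each step, lane i reads the value held by lane i XOR mask, which is the documented behavior of CUDA's `__shfl_xor_sync`. The names `Warp`, `shfl_xor_emulated`, and `butterfly_sum` are hypothetical, and this is a plain C++ simulation, not device code.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;
using Warp = std::array<int, kWarpSize>;

// Emulates one XOR shuffle step: lane i receives the value of lane (i ^ lane_mask).
Warp shfl_xor_emulated(const Warp& vals, int lane_mask) {
    assert(lane_mask > 0 && lane_mask < kWarpSize);
    Warp out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = vals[lane ^ lane_mask];
    }
    return out;
}

// Butterfly reduction: after log2(32) = 5 steps every lane holds the full sum.
Warp butterfly_sum(Warp vals) {
    for (int mask = kWarpSize / 2; mask > 0; mask >>= 1) {
        const Warp partner = shfl_xor_emulated(vals, mask);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            vals[lane] += partner[lane];
        }
    }
    return vals;
}
```

On a GPU the inner loop over lanes disappears, because each thread in the warp is one lane and executes a single add per step. A useful property of the XOR variant is that all lanes end up with the total, not just lane 0.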
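Summary: # profiler ### nvprof The earliest profiler; CLI only. ### nvvp An evolved version of nvprof that adds a GUI. ### ncu At the time of writing, CUDA no longer supports nvprof, and nvvp has become extremely hard to use (because many features, such as metrics, were removed). The current recommendation is to use nsight Read more
posted @ 2022-04-12 20:34 ijpq Views (734) Comments (0) Recommended (0)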
Summary: 0x00 Basics. Prior to the 5.0 release, CUDA did not support separate compilation, so CUDA code could not call device functions or access variables across Read more
posted @ 2022-01-14 21:41 ijpq Views (436) Comments (0) Recommended (0)
Summary: Assume A is 3x4 and B is 4x3. physical structure A[0,1,2,...,11]; B[0,...,11] logical structure A[0,1,2,3] A[4,5,6,7] A[8,9,10,11] B[0,1,2] B[3,4,5] B[6,7,8] B[9,10,11 Read more
posted @ 2021-10-20 11:52 ijpq Views (73) Comments (0) Recommended (0)
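The physical versus logical layout in the entry above corresponds to row-major storage: element (row, col) of a matrix with `num_cols` columns lives at flat index `row * num_cols + col`. The following sketch multiplies a 3x4 A by a 4x3 B stored as flat arrays. The helper names are hypothetical, and row-major order is assumed because it matches the logical rows listed in the summary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat index of element (row, col) in a row-major matrix with num_cols columns.
std::size_t row_major_index(std::size_t row, std::size_t col, std::size_t num_cols) {
    return row * num_cols + col;
}

// C (m x n) = A (m x k) * B (k x n), all stored as flat row-major arrays.
std::vector<int> matmul_row_major(const std::vector<int>& a,
                                  const std::vector<int>& b,
                                  std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);
    std::vector<int> c(m * n, 0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            int acc = 0;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[row_major_index(i, p, k)] * b[row_major_index(p, j, n)];
            }
            c[row_major_index(i, j, n)] = acc;
        }
    }
    return c;
}
```

With A and B both filled with 0 through 11, row 1 of A starts at flat index 4 and row 3 of B starts at flat index 9, as in the logical structure listed in the entry.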
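Summary: overview This is an ECE408 assignment whose goal is to implement 3D convolution. For testing, run the script at link against the test data. The course's test environment is a GTX1080. My own RTX2070 produces bugs, while the lab server's Titan Xp works. There are two ways to write this; so far I have implemented only the one that is easier to understand but less efficient. I think the eff Read more
posted @ 2021-10-14 08:52 ijpq Views (762) Comments (0) Recommended (0)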
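As a point of reference for the 3D convolution entry above, a naive CPU version is handy for checking GPU output. This sketch is not the post's implementation. It assumes a cubic mask with odd width, zero padding outside the volume, x as the fastest-varying dimension, and a mask applied without flipping. The function name `conv3d_reference` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive 3D convolution reference. Input and output have shape zs x ys x xs;
// the mask has shape mw x mw x mw. Out-of-range inputs are treated as zero.
std::vector<float> conv3d_reference(const std::vector<float>& in,
                                    int zs, int ys, int xs,
                                    const std::vector<float>& mask, int mw) {
    assert(zs > 0 && ys > 0 && xs > 0);
    assert(mw > 0 && mw % 2 == 1);
    assert(in.size() == static_cast<std::size_t>(zs) * ys * xs);
    assert(mask.size() == static_cast<std::size_t>(mw) * mw * mw);
    const int r = mw / 2;  // mask radius
    std::vector<float> out(in.size(), 0.0f);
    for (int z = 0; z < zs; ++z) {
        for (int y = 0; y < ys; ++y) {
            for (int x = 0; x < xs; ++x) {
                float acc = 0.0f;
                for (int mz = 0; mz < mw; ++mz) {
                    for (int my = 0; my < mw; ++my) {
                        for (int mx = 0; mx < mw; ++mx) {
                            const int iz = z + mz - r;
                            const int iy = y + my - r;
                            const int ix = x + mx - r;
                            // Skip taps that fall outside the volume (zero padding).
                            if (iz < 0 || iz >= zs || iy < 0 || iy >= ys ||
                                ix < 0 || ix >= xs) {
                                continue;
                            }
                            acc += in[(iz * ys + iy) * xs + ix] *
                                   mask[(mz * mw + my) * mw + mx];
                        }
                    }
                }
                out[(z * ys + y) * xs + x] = acc;
            }
        }
    }
    return out;
}
```

A GPU kernel can be compared element by element against this output with a small tolerance. Doing six nested loops per output is slow, but it keeps the indexing explicit.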
