CUDA - Post Category - ijpq

CUTLASS: Fast Linear Algebra in CUDA C++

Summary: https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/ Efficient Matrix Multiplication on GPUs: compute intensity = (time complexity / space complexity) = O(N^3)/O(N^2) = O(N) // Read more

posted @ 2024-03-26 13:47 ijpq Views(66) Comments(0) Recommendations(0)
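
The O(N^3)/O(N^2) = O(N) estimate in the CUTLASS entry above can be made concrete with a small host-side sketch. The constants are assumptions not stated in the summary: 2N^3 floating-point operations (one multiply and one add per inner-product term) and 3N^2 elements moved (A and B read once, C written once, i.e. ideal reuse). The function names are hypothetical.

```cpp
#include <cstdint>

// FLOPs for C = A * B with N x N matrices (assumed model):
// N*N outputs, each needing N multiplies and N adds.
constexpr std::uint64_t gemm_flops(std::uint64_t n) { return 2 * n * n * n; }

// Elements moved if each matrix crosses the memory bus exactly once
// (assumed ideal reuse): read A, read B, write C.
constexpr std::uint64_t gemm_elements(std::uint64_t n) { return 3 * n * n; }

// Compute intensity in FLOPs per element moved; algebraically 2N/3,
// so it grows linearly with N, matching the O(N) in the summary.
constexpr double compute_intensity(std::uint64_t n) {
    return static_cast<double>(gemm_flops(n)) /
           static_cast<double>(gemm_elements(n));
}
```

Under this model, doubling N doubles the work available per element moved, which is why large GEMMs can be compute-bound when data is reused well.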

Implicit GEMM

Summary: 0x00 base of im2col https://zhuanlan.zhihu.com/p/491307328 0x01 base of implicit GEMM https://zhuanlan.zhihu.com/p/372973726 So far, for 0x00 focus on im2col, 0x0 Read more

posted @ 2022-11-03 14:19 ijpq Views(182) Comments(0) Recommendations(0)

all about CUTLASS

Summary: resources first GTC about cutlass, gtc2018 first GTC about cutlass, gtc2018 diigo pdf best nvidia tech blog about cutlass Accelerating Convolution wit Read more

posted @ 2022-10-31 15:27 ijpq Views(72) Comments(0) Recommendations(0)

cuda cores

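Summary: The basic introduction comes from this link: https://www.techcenturion.com/nvidia-cuda-cores/ In it, the following passage explains the abstraction in an easy-to-follow way: Let us consider an example to understand the working of CUDA cores. Thi Read more

posted @ 2022-10-31 15:26 ijpq Views(518) Comments(0) Recommendations(0)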

tensor cores

Summary: First introduced in the Volta architecture, https://images.nvidia.cn/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf nvidia tech blog about tensor cores: https://d Read more

posted @ 2022-10-31 14:17 ijpq Views(55) Comments(0) Recommendations(0)

Reading the CUDA docs - best practices

Summary: cuda toolkit v11.8 docs, link: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html preface assess (evaluate) application heterogeneous computing application profil Read more

posted @ 2022-10-25 18:37 ijpq Views(131) Comments(0) Recommendations(0)

shfl_*

Summary: shfl_xor: search the CUDA docs for shfl_xor https://tschmidt23.github.io/cse599i/CSE%20599%20I%20Accelerated%20Computing%20-%20Programming%20GPUs%20Lecture%2018.pd Read more

posted @ 2022-07-12 17:12 ijpq Views(111) Comments(0) Recommendations(0)

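CUDA profiler

Summary: profiler nvprof: the earliest profiler, CLI only. nvvp: an evolved version of nvprof that provides a GUI. ncu: at the time of writing, CUDA no longer supports nvprof, and nvvp has become extremely hard to use (because many features, such as metrics, were removed). Nsight Compute is now the recommended tool; it is divided into Read more

posted @ 2022-04-12 20:34 ijpq Views(945) Comments(0) Recommendations(0)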

NVCC

Summary: 0x00 Basics: Prior to the 5.0 release, CUDA did not support separate compilation, so CUDA code could not call device functions or access variables across Read more

posted @ 2022-01-14 21:41 ijpq Views(559) Comments(0) Recommendations(0)

cuda matrix tiled multiply

Summary: Assume A is 3x4 and B is 4x3. physical structure A[0,1,2,...,11]; B[0,...,11] logical structure A[0,1,2,3] A[4,5,6,7] A[8,9,10,11] B[0,1,2] B[3,4,5] B[6,7,8] B[9,10,11 Read more

posted @ 2021-10-20 11:52 ijpq Views(109) Comments(0) Recommendations(0)
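
The physical/logical layout described in the entry above (A stored as 12 consecutive elements, read as 3 rows of 4; B read as 4 rows of 3) corresponds to row-major indexing. The following is a minimal host-side sketch of that mapping plus an untiled reference product; it is not the tiled kernel from the post, and the names `flat_index`, `multiply`, `M`, `K`, `N` are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Row-major offset of logical element (row, col) in a matrix
// that has `cols` columns.
constexpr std::size_t flat_index(std::size_t row, std::size_t col,
                                 std::size_t cols) {
    return row * cols + col;
}

// Shapes taken from the summary: A is 3x4, B is 4x3, so C is 3x3.
constexpr std::size_t M = 3, K = 4, N = 3;

// Untiled reference product over the flat (physical) arrays.
// A tiled version would compute the same C, one block at a time.
std::array<int, M * N> multiply(const std::array<int, M * K>& a,
                                const std::array<int, K * N>& b) {
    std::array<int, M * N> c{};  // zero-initialized accumulator
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < K; ++k)
                c[flat_index(i, j, N)] +=
                    a[flat_index(i, k, K)] * b[flat_index(k, j, N)];
    return c;
}
```

With this mapping, logical A[1][0] is physical A[4] and logical B[3][2] is physical B[11], which agrees with the row groupings listed in the summary.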

CUDA 3D convolution

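Summary: overview: This is an ECE408 assignment whose goal is to implement 3D convolution. For testing, the script at link is run against the test data. The course-provided test environment is a GTX 1080. My own RTX 2070 hits a bug, while the lab server's Titan Xp works. There are two ways to write this; so far only one is implemented, the one that is easier to understand but less efficient. I think the effi Read more

posted @ 2021-10-14 08:52 ijpq Views(827) Comments(0) Recommendations(0)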

