CUDA Learning Path
You can start by reading this blog for a simple introduction: Introductory Blog
Then read this: Casual Discussion on High-Performance Computing and Performance Optimization: Memory Access
After that, read this translation of the official documentation: Official Documentation Translation
While reading the official docs, you can start hands-on experiments with simple matrix multiplication:
CUDA Matrix Multiplication Ultimate Optimization Guide
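When experimenting with matrix multiplication kernels, it helps to have a plain CPU reference to compare GPU results against. The following is a minimal sketch, not taken from the guide above: the names matmul_ref and max_abs_diff are hypothetical, and it assumes row-major float matrices stored in std::vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes C = A * B on the CPU.
// Assumes row-major storage: A is M x K, B is K x N, C is M x N.
std::vector<float> matmul_ref(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    // i-k-j loop order walks B and C row by row, which is friendlier
    // to the CPU cache than the textbook i-j-k order.
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t k = 0; k < K; ++k) {
            const float a = A[i * K + k];
            for (std::size_t j = 0; j < N; ++j) {
                C[i * N + j] += a * B[k * N + j];
            }
        }
    }
    return C;
}

// Hypothetical helper: largest element-wise absolute difference,
// useful for checking a kernel's output against the reference.
float max_abs_diff(const std::vector<float>& X, const std::vector<float>& Y) {
    assert(X.size() == Y.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < X.size(); ++i) {
        worst = std::max(worst, std::fabs(X[i] - Y[i]));
    }
    return worst;
}
```

A typical use would be to copy the kernel's result back to the host and check that max_abs_diff against matmul_ref stays below a small tolerance. Exact equality is usually too strict, because a GPU kernel may sum the products in a different order.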
Also, learn about NVIDIA's two performance analysis tools for CUDA: Nsight Systems and Nsight Compute.
There isn't much material on these tools: a few simple tutorials on Bilibili, plus the official documentation (available only in English).
NVIDIA Performance Analysis Tool nsight-compute Getting Started
This is my own matrix multiplication implementation, with some notes on what I learned along the way; feel free to use it as a reference. If you want to practice matrix multiplication yourself, start from the template in the folder "0" inside this archive.