Following the C++/CUDA extension tutorial on the PyTorch website, and having a look at the linked source code, I have created my own CUDA kernel. It does not do anything useful; it is a learning project. The kernel copies the data from a const 3D input matrix into a 3D output matrix:

template <typename scalar_t>
...
const torch::PackedTensorAccessor32<scalar_t, 3> matrix,
...
const int y = blockIdx.y * blockDim.y + threadIdx.y;
const int x = blockIdx.x * blockDim.x + threadIdx.x;

My issue is that in this simple example I either get the zero matrix back in Python or, if cudaDeviceSynchronize() is added after the kernel call, an illegal memory access error (code 700).

If I remove cudaDeviceSynchronize() from reduce_cuda(…) in reduce_cuda_kernel.cu, the result I get back in Python is the zero matrix created with auto output = torch::zeros_like(matrix); — it has the same shape and dtype as the input and is on the same CUDA device, but it is all zeros. No values are copied to the output, yet CUDA does not report an error: printf("Cuda Error: %d \n", cudaGetLastError()); prints error code 0, and I get all "Hello from CUDA If" printouts. If I add cudaDeviceSynchronize(), error 700 (illegal memory access) is thrown when I try to print the result in Python. I cannot find the mistake, since other code, e.g. the example mentioned above, does more or less the same thing when dispatching the kernel. The entire project and the setup.py install output are available in this GitHub Gist. Any help is appreciated, thanks in advance.

When processing 2D images with CUDA, a natural intuition is to use a 2D block and grid shape. If we want to set the maximum possible block size, we have to make sure that the product of its dimensions does not exceed the limit of 1024 threads per block; (32, 32), (16, 64) and (8, 128) are a few examples of valid block sizes.

Next comes the calculation of the 2D grid size. If we want to map one thread to every pixel, the grid should be created such that the total number of threads in each dimension is at least equal to the corresponding image dimension. Remember that the grid size is the number of blocks in each dimension, so the total number of threads in a dimension is the product of the grid size and the block size in that dimension: for a 2D grid, the number of threads in the X dimension is block.x * grid.x and in the Y dimension block.y * grid.y.

Assuming you have an image of size 400 x 900, the total number of threads in each dimension should be at least that large. Let's say you choose a block of size (32, 32). The number of blocks for the x and y dimensions of the image should then be 400/32 and 900/32. But neither image dimension is an integer multiple of the corresponding block dimension, so integer division produces a grid of size 12 x 28, which gives a total of 384 x 896 threads (because 32 x 12 = 384 and 32 x 28 = 896). As we can see, the total number of threads in each dimension is less than the corresponding image dimension.

What we need to do is round up the number of blocks, so that if an image dimension is not a multiple of the block dimension, we create an additional block to cover the remaining pixels. Instead of computing the number of blocks with integer division, we use floating-point division and ceil the result:

int image_width = 400;
Grid.x = ceil( float(image_width) / block.x );
Grid.y = ceil( float(image_height) / block.y );

Another smart way is to use the following formula:

Grid.x = (image_width + block.x - 1) / block.x;
Grid.y = (image_height + block.y - 1) / block.y;

When the grid is created in either of these ways, you end up with a grid of size 13 x 29, i.e. a total of 416 x 928 threads. The total number of threads in each dimension is now greater than the corresponding image dimension, which means some of the threads will access memory outside the image bounds, causing undefined behavior. The solution for this problem is to perform bounds checks inside the kernel and do processing only with those threads that fall inside the image bounds.
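The two rounding formulas discussed above can be checked on the host with a short plain-C++ sketch. This is only an illustration of the arithmetic, using the 400 x 900 image and (32, 32) block from the example; grid_ceil and grid_trick are hypothetical helper names, not part of any CUDA API:

```cpp
#include <cmath>
#include <cstdio>

// Round up via floating-point division and ceil, as in the first formula.
static int grid_ceil(int image_dim, int block_dim) {
    return static_cast<int>(std::ceil(static_cast<float>(image_dim) / block_dim));
}

// Round up via the integer trick (n + b - 1) / b, as in the second formula.
static int grid_trick(int image_dim, int block_dim) {
    return (image_dim + block_dim - 1) / block_dim;
}

int main() {
    const int image_width = 400, image_height = 900;
    const int block_x = 32, block_y = 32;

    // Plain integer division undershoots: 12 x 28 blocks -> 384 x 896 threads.
    printf("truncated: %d x %d\n", image_width / block_x, image_height / block_y);

    // Both rounded-up versions agree: 13 x 29 blocks -> 416 x 928 threads.
    printf("ceil:  %d x %d\n", grid_ceil(image_width, block_x),
                               grid_ceil(image_height, block_y));
    printf("trick: %d x %d\n", grid_trick(image_width, block_x),
                               grid_trick(image_height, block_y));
    return 0;
}
```

The integer trick is usually preferred in kernel-launch code because it avoids the int-to-float-to-int round trip, which can misbehave for very large dimensions where float loses integer precision.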
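Putting the grid rounding and the bounds check together, a guarded 2D copy kernel might look like the sketch below. This is an illustration of the pattern under discussion, not the questioner's actual reduce_cuda kernel; copy_kernel, launch_copy, and the raw-pointer interface are assumptions made for the example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Copy one single-channel width x height image; names are illustrative.
__global__ void copy_kernel(const float* in, float* out, int width, int height) {
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Bounds check: the rounded-up grid launches more threads than pixels,
    // so threads outside the image must return before touching memory.
    if (x >= width || y >= height) return;

    out[y * width + x] = in[y * width + x];
}

void launch_copy(const float* d_in, float* d_out, int width, int height) {
    const dim3 block(32, 32);
    const dim3 grid((width + block.x - 1) / block.x,
                    (height + block.y - 1) / block.y);
    copy_kernel<<<grid, block>>>(d_in, d_out, width, height);

    // cudaGetLastError reports launch-configuration problems immediately;
    // cudaDeviceSynchronize surfaces errors raised while the kernel runs,
    // such as the illegal memory access reported as error 700 above.
    printf("launch: %d, sync: %d\n",
           (int)cudaGetLastError(), (int)cudaDeviceSynchronize());
}
```

Without the if (x >= width || y >= height) guard, the extra threads of a 13 x 29 grid read and write past the end of the image, which is exactly the kind of illegal access that stays silent until the host synchronizes.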