I'm trying to build a shared library containing CUDA code using CMake. I'm using the FindCUDA package. I have a problem in the linking phase: Linking CXX shared library shlibcuda.so /usr/bin/c++ -fPIC -std=c++0x -fopenmp -O3 -DNDEBUG -shared -Wl,-son
I have a very strange bug in my program. I have spent many hours on it but have not found a solution. I wrote a simple program to reproduce my issue. Maybe someone can help me. I tried cuda-memcheck & What is the canonical way to check for errors using the CUDA
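The error-checking pattern usually recommended for questions like this is a macro that wraps every CUDA runtime call and reports the file and line on failure. A minimal sketch (the macro name is just a convention, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; on failure, print the error, file, and
// line, then abort so the failure is not silently ignored.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Kernel launches return no status, so check cudaGetLastError() (and,
// while debugging, cudaDeviceSynchronize()) after each launch:
//   gpuErrchk(cudaMalloc(&d_buf, size));
//   myKernel<<<grid, block>>>(d_buf);
//   gpuErrchk(cudaGetLastError());
//   gpuErrchk(cudaDeviceSynchronize());
```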
Recently I developed a new method. The new method works perfectly with CUDA (at 20 to 40 FPS) and I have already tested it successfully. The problem comes when I try to make a comparison with an old method. The old method was implemented on the CPU. It does
I'm trying to query the CUDA devices without adding the pycuda dependency. Here's what I've got so far: import ctypes cudart = ctypes.cdll.LoadLibrary('libcudart.so') numDevices = ctypes.c_int() cudart.cudaGetDeviceCount(ctypes.byref(numDevices)) pri
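A defensive version of this ctypes approach can check the return code and degrade gracefully when the runtime is missing. A sketch, assuming `libcudart.so` is on the dynamic loader path (the exact soname, e.g. `libcudart.so.12`, varies by installation):

```python
import ctypes

def cuda_device_count():
    """Return the number of CUDA devices, or None if the runtime is unavailable."""
    try:
        cudart = ctypes.CDLL("libcudart.so")
    except OSError:
        return None  # CUDA runtime not installed or not on the loader path
    count = ctypes.c_int()
    # cudaGetDeviceCount returns a cudaError_t; 0 (cudaSuccess) means OK.
    err = cudart.cudaGetDeviceCount(ctypes.byref(count))
    return count.value if err == 0 else None

print(cuda_device_count())
```

Note that `cudaGetDeviceCount` itself can fail (e.g. no driver), so checking the returned `cudaError_t` matters even once the library loads.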
I am trying to replicate the linear programming solver that this person has made: http://www.idi.ntnu.no/~elster/master-studs/spampinato/spampinato-linear-prog-gpu-report.pdf. First off, the device I am using is a Quadro FX 1800M with compute capability
I have a main.cu file that includes test.h, which is the header for test.c, and all three files are in the same project. test.h code: typedef struct { int a; } struct_a; void a(struct_a a); test.c code: void a(struct_a a) { printf("%d", a.a); } main.cu code
I have a pool of particles represented by an array of float4 where the w component is the particle's current lifetime in the range [0, 1]. I need to sort this array based on the lifetime of the particles in descending order so that I can keep an accu
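One way to express a descending sort on the `w` component is `thrust::sort` with a custom comparator; a sketch assuming the pool lives in a `thrust::device_vector<float4>`:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Order particles by lifetime (the w component), largest first.
struct LifetimeGreater {
    __host__ __device__
    bool operator()(const float4 &a, const float4 &b) const {
        return a.w > b.w;
    }
};

// Assuming d_particles is a thrust::device_vector<float4>:
//   thrust::sort(d_particles.begin(), d_particles.end(), LifetimeGreater());
```

Since `float4` is not a primitive key type, Thrust falls back to a comparison sort here rather than its radix-sort path; for large pools it can be faster to sort 32-bit keys (the lifetimes) with attached indices and gather the particles afterwards.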
I am trying to set up CUDA on my laptop with a GeForce G210M, but I am not sure which architecture I should use when compiling: nvcc -arch=... Here is my deviceQuery output: ./deviceQuery Starting... CUDA Device Query (Runtime API) version (CUDART static
I am doing a video encode demo with CUDA 6.5, C/C++, and SM 3.0. The encode work uses the NvEncCreateInputBuffer API to allocate the input buffer. I want to reuse this buffer as kernel output memory when the kernel finishes the re-size operation, so the output of the re-size
I have a list of 5 million 32-bit integers (actually a 2048 x 2560 image) that is 90% zeros. The non-zero cells are labels (e.g. 2049, 8195, 1334300, 34320923, 4320932) that are not sequential or consecutive in any way (it is the output of our
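Extracting the 10% of non-zero labels from an image like this is a stream-compaction problem, for which `thrust::copy_if` is one common tool. A sketch (the sizes match the question; the variable names are illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Predicate selecting the labelled (non-zero) cells.
struct IsNonZero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Compact the 2048 x 2560 label image down to its non-zero entries:
//   thrust::device_vector<int> labels = /* the image, row-major */;
//   thrust::device_vector<int> nonzero(labels.size());
//   auto end = thrust::copy_if(labels.begin(), labels.end(),
//                              nonzero.begin(), IsNonZero());
//   nonzero.resize(end - nonzero.begin());  // ~10% of the original size
```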
Suppose I compile the following with NVIDIA CUDA's nvcc compiler: template<typename T, typename Operator> __global__ void fooKernel(T t1, T t2) { Operator op; doSomethingWith(t1, t2); } template<typename T> __device__ __host__ void T bar(
NVIDIA does not distribute the Nsight IDE for Windows (only Linux and Mac OS X). I don't want to use Visual Studio because I'm not familiar with it; being a Java developer, I prefer Eclipse. I want to use Maven because, well, everyone should, and Mavenize
I have a 5000x500 matrix and I want to sort each row separately with CUDA. I can use ArrayFire, but this is just a for loop over thrust::sort, which should not be efficient. https://github.com/arrayfire/arrayfire/blob/devel/src/backend/cuda/kernel
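One well-known alternative to per-row `thrust::sort` calls is the two-pass stable-sort trick: sort all values globally, then stable-sort by row index, which regroups the rows while stability preserves each row's sorted order. A sketch, assuming a row-major `float` matrix:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>

// Map a flat element index to its row (row-major layout assumed).
struct RowOf {
    int width;
    __host__ __device__ int operator()(int i) const { return i / width; }
};

// Sort all 5000 rows of 500 elements in two sort passes:
//   thrust::device_vector<float> vals = /* 5000 * 500 elements */;
//   thrust::device_vector<int> rows(vals.size());
//   thrust::transform(thrust::counting_iterator<int>(0),
//                     thrust::counting_iterator<int>((int)vals.size()),
//                     rows.begin(), RowOf{500});
//   thrust::stable_sort_by_key(vals.begin(), vals.end(), rows.begin());
//   thrust::stable_sort_by_key(rows.begin(), rows.end(), vals.begin());
```

Two large sorts generally keep the GPU far busier than 5000 tiny 500-element sorts, at the cost of carrying the key array.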
My image is grayscale. I want to compute the average of a column in one thread using the float* type, and I want to add the average value to my output pixel's value. When my code runs, I can only see one row. I can't understand what is happening. __global__ vo
I want to achieve the same effect as gcc -dM -E - < /dev/null (as described here) - but for nvcc. That is, I want to dump all of nvcc's preprocessor defines. Alas, nvcc does not support -dM. What do I do instead?
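One possible workaround, assuming a gcc host compiler, is to forward `-dM` through nvcc to the host preprocessor, and to use `--dryrun` to see which `-D` flags nvcc itself injects; this is a sketch, not a complete substitute for `-dM` (device-side built-ins like `__CUDA_ARCH__` only exist during the device compilation passes):

```shell
# Forward -dM to the host compiler's preprocessing pass:
nvcc -E -Xcompiler -dM -x cu /dev/null

# Inspect the -D flags nvcc adds to each compilation stage:
nvcc --dryrun -c -x cu /dev/null -o /dev/null 2>&1 | grep -o '\-D[A-Za-z_][A-Za-z_0-9]*' | sort -u
```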
I have upgraded to Yosemite but nvcc doesn't like the new gcc 4.9: gcc: warning: couldn’t understand kern.osversion ‘14.0.0’ gcc: warning: couldn’t understand kern.osversion ‘14.0.0’ In file included from /Developer/NVIDIA/CUDA-5.0/bin/../include/cuda_ru
In CUDA, there are two metrics I don't quite understand clearly: "requested global load throughput" and "global load throughput". From What's the difference between "gld/st_throughput" and "dram_read/write_throughput" metrics? I know the di
I am using CMake 3.1.0 and CUDA 6.5. I am trying to build OpenCV with CUDA support. But while building OpenCV I get the following error: CMake Error: The following variables are used in this project, but they are set to NOTFOUND. Please set them or m
I'm trying to do a parallel reduction to sum an array in CUDA. Currently I pass in an array in which to store the sum of the elements in each block. This is my code: #include <cstdlib> #include <iostream> #include <cuda.h> #include
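The "one partial sum per block" approach described here is usually written as a shared-memory tree reduction; a sketch of one classic shape (the host, or a second kernel pass, then sums the per-block results):

```cuda
#include <cuda_runtime.h>

// Each block reduces its slice of `in` in shared memory and writes a
// single partial sum to blockSums[blockIdx.x].
__global__ void reduceSum(const float *in, float *blockSums, int n)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}

// Launch with a power-of-two block size and dynamic shared memory:
//   reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_sums, n);
```

The `__syncthreads()` inside the loop must be reached by all threads of the block, which is why the guard is on the addition only, not on the barrier.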
I am using a CUDA kernel to convert RGB data to YUV data, then encoding that YUV data to H.264 using CudaEncoder. I am using two device pointers and I do a device-to-device copy. I am allocating memory using cuMemAllocPitch() for device poin