Q1.a. See hw3.cu

Q1.b. 304 GFlops

Q1.c. I wrote a simple kernel (see logistic1_kernel). It achieved about 120 GFlops. Running nvcc --ptxas-options=-v, I saw that it uses only 6 registers per thread. A kernel can use up to 21 registers per thread and still have 48 active warps (the SM has 32768 registers, and 32768 / 1536 threads is about 21). I revised the loop to update 5 elements of the x array per iteration, which uses 20 registers per thread (see the logistic-map sketches at the end of this writeup). I then experimented with values for nblks, tpb (threads per block), and m (number of generations of the logistic map to compute):
  tpb: throughput increased with tpb up to tpb=256. For larger values, throughput did not change noticeably.
  nblks: multiples of 6 are good. This makes sense: 6 blocks of 256 threads is 1536 threads, which fills an SM. At 12 blocks I saw 295 GFlops; performance gradually increased and seemed to peak around 240 blocks.
  m: increasing m gave significant increases in throughput up to m=500, small increases up to m=8000, and not much change after that.
Based on these observations, I ran with tpb=256, nblks=240, and m=8000 and got 304.88 GFlops (on lin22). The GTX 550 Ti has a peak specification of 691 GFlops, but that counts each fused multiply-add as two floating-point operations, and the logistic map does not benefit from fused multiply-add. The achievable peak should therefore be around 345 GFlops; my code gets 88% of that.

Q2.a. See hw3.cu

Q2.b. 68.3 Gbytes/sec

Q2.c. I wrote a simple kernel (see norm1_kernel) and got about 53 Gbytes/sec. I manually unrolled the loop: using two "accumulators" for the sum and adding two terms of the dot product to each accumulator per loop iteration seemed to work best (see the norm-kernel sketch at the end of this writeup). Then I experimented with nblks, tpb, and m as for Q1; nblks=48, tpb=384, and m=10000 seemed to do well. The GTX 550 Ti has a peak memory bandwidth specification of 95 Gbytes/sec. Those are 2^30-byte Gbytes, so that's about 102 Gbytes/sec with 10^9-byte Gbytes; my code gets about 2/3 of that. I suspect this is because the GTX 550 Ti has four GDDR5 memory devices connected to the GPU through three GDDR5 interfaces. I haven't found a way to get higher bandwidth. I'll be interested to see if any of the submitted solutions did better.

Q3.a. See hw3.cu

Q3.b. 13.24 GRand/sec (giga-random-units/sec)

Q3.c. I wrote the obvious kernel, and it got 0.69 GRand/sec. I changed the code to compute the sum of the m random numbers per thread and store only the sum in global memory -- that doesn't satisfy the assignment, but it ran *way* faster (nearly 17 GRand/sec). This suggests that global memory bandwidth was the problem. I restored my code to store each random uint in a separate location. I stared at the code, trying to think of a way to unroll the loop or change the memory access pattern, but couldn't come up with anything. Then I tried copying the random-number-generator state into shared memory, which removes the global memory accesses for the state (see the generator sketch at the end of this writeup); the throughput increased to 8.7 GRand/sec. From there, I adjusted nblks, tpb, and m. This time it helped to have a large number of threads per block, a large value for m, and a fairly small number of blocks. I got the highest throughput with nblks=12, tpb=768, and m=20000.
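
Logistic-map sketch for Q1.c. A minimal sketch of the 5-elements-per-thread unrolling described above, assuming a grid-stride data layout; the kernel name logistic5_kernel and the map parameter A are illustrative choices, not necessarily what hw3.cu uses. The point is that the five running values stay in registers, which pushes the per-thread register count to about 20.

// Sketch of the unrolled logistic-map kernel (illustrative, not hw3.cu).
// Each thread owns 5 elements of x, spaced one grid-stride apart.
#define A 3.9f   // logistic-map parameter (assumed value)

__global__ void logistic5_kernel(float *x, int m) {
    int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // The five running values live in registers for the whole kernel.
    float x0 = x[i];
    float x1 = x[i + stride];
    float x2 = x[i + 2 * stride];
    float x3 = x[i + 3 * stride];
    float x4 = x[i + 4 * stride];

    for (int k = 0; k < m; k++) {      // m generations of the map
        x0 = A * x0 * (1.0f - x0);
        x1 = A * x1 * (1.0f - x1);
        x2 = A * x2 * (1.0f - x2);
        x3 = A * x3 * (1.0f - x3);
        x4 = A * x4 * (1.0f - x4);
    }

    x[i]              = x0;
    x[i + stride]     = x1;
    x[i + 2 * stride] = x2;
    x[i + 3 * stride] = x3;
    x[i + 4 * stride] = x4;
}

With this layout the array length is 5 * tpb * nblks, and consecutive threads read and write consecutive words, so the (small) global traffic at the start and end of the kernel is coalesced.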
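
A host-side launch and timing loop in the same spirit, showing where the tpb=256, nblks=240, m=8000 settings plug in. The event-based timing and the count of 3 floating-point operations per map update (one subtract, two multiplies) are assumptions about how the GFlops numbers above were obtained, not a transcription of hw3.cu.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void) {
    const int tpb = 256, nblks = 240, m = 8000;
    const int n = 5 * tpb * nblks;                 // 5 elements per thread

    // Seed every element with a value in (0, 1).
    float *x_h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++)
        x_h[i] = 0.25f + 0.5f * (float)i / (float)n;

    float *x_d;
    cudaMalloc(&x_d, n * sizeof(float));
    cudaMemcpy(x_d, x_h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    logistic5_kernel<<<nblks, tpb>>>(x_d, m);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // 3 flops per update, n elements, m generations.
    double gflops = 3.0 * n * (double)m / (ms * 1e-3) / 1e9;
    printf("%.2f GFlops\n", gflops);

    cudaFree(x_d);
    free(x_h);
    return 0;
}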
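
Norm-kernel sketch for Q2.c. A minimal sketch of the two-accumulator unrolling, with two squared terms added to each accumulator per loop pass. The name norm2_kernel, the grid-stride layout, and the block-level reduction into a per-block partial sum (summed on the host afterwards) are illustrative assumptions about how hw3.cu is organized.

#define TPB_NORM 384   // threads per block; must match the launch configuration

// Sum of squares of x[0..n-1]; one partial sum per block (illustrative, not hw3.cu).
// Assumes n is a multiple of 4 * TPB_NORM * gridDim.x.
__global__ void norm2_kernel(const float *x, float *block_sums, int n) {
    __shared__ float partial[TPB_NORM];
    int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Two independent accumulators, each fed two terms per pass, so four
    // loads are outstanding before any of them is needed.
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int j = i; j < n; j += 4 * stride) {
        float a = x[j];
        float b = x[j + stride];
        float c = x[j + 2 * stride];
        float d = x[j + 3 * stride];
        acc0 += a * a + b * b;
        acc1 += c * c + d * d;
    }
    partial[threadIdx.x] = acc0 + acc1;
    __syncthreads();

    // Tree reduction within the block (handles a non-power-of-two block size).
    for (int s = blockDim.x; s > 1; ) {
        int half = (s + 1) / 2;
        if (threadIdx.x < s - half)
            partial[threadIdx.x] += partial[threadIdx.x + half];
        __syncthreads();
        s = half;
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = partial[0];
}

Each element is read exactly once, so the Gbytes/sec figure would be 4*n bytes per pass divided by the kernel time; the arithmetic around the accumulators is essentially free in a bandwidth-bound kernel like this one.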
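
Generator sketch for Q3.c. The actual random-number generator from the assignment is not reproduced here; this sketch uses a stand-in per-thread xorshift128 state (four words) purely to illustrate the pattern described above: load the state into shared memory once, generate m numbers straight into global memory with coalesced stores, and write the state back at the end. The state layout in global memory is also an assumption.

#define TPB_RAND 768   // threads per block; must match the launch configuration

// Illustrative kernel (not hw3.cu): per-thread 4-word generator state staged
// in shared memory so the inner loop makes no global-memory accesses for state.
// The state array must be seeded so that no thread's four words are all zero.
__global__ void rand_shared_kernel(unsigned int *state, unsigned int *out, int m) {
    __shared__ unsigned int s[4][TPB_RAND];
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int nthd = blockDim.x * gridDim.x;

    // Copy this thread's generator state from global to shared memory once.
    for (int w = 0; w < 4; w++)
        s[w][threadIdx.x] = state[w * nthd + tid];

    // Element k from this thread goes to out[k * nthd + tid], so consecutive
    // threads write consecutive words (coalesced stores).
    for (int k = 0; k < m; k++) {
        unsigned int t = s[0][threadIdx.x];          // xorshift128 step
        t ^= t << 11;
        t ^= t >> 8;
        unsigned int v = s[3][threadIdx.x];
        v = v ^ (v >> 19) ^ t;
        s[0][threadIdx.x] = s[1][threadIdx.x];
        s[1][threadIdx.x] = s[2][threadIdx.x];
        s[2][threadIdx.x] = s[3][threadIdx.x];
        s[3][threadIdx.x] = v;
        out[(size_t)k * nthd + tid] = v;
    }

    // Write the updated state back so a later launch continues the stream.
    for (int w = 0; w < 4; w++)
        state[w * nthd + tid] = s[w][threadIdx.x];
}

With tpb=768 the shared-memory footprint is 4 x 4 x 768 = 12 Kbytes per block; at most two 768-thread blocks are resident per SM, so that fits in the SM's 48 Kbytes of shared memory.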