Q1.a. See hw3.cu

Q1.b. 304 GFlops

Q1.c. I wrote a simple kernel (see logistic1_kernel). It achieved about 120 GFlops. Running nvcc --ptxas-options=-v, I saw that it uses only 6 registers per thread. A kernel can use up to 21 registers per thread and still have 48 active warps (the SM has 32768 registers, and 32768 / 1536 threads is about 21). I revised the loop to update 5 elements of the x array per iteration, which uses 20 registers per thread (see the logistic-map sketches at the end of this writeup). I then experimented with values for nblks, tpb (threads per block), and m (number of generations of the logistic map to compute):
  tpb: throughput increased with tpb up to tpb=256. For larger values, throughput did not change noticeably.
  nblks: multiples of 6 are good. This makes sense: 6 blocks of 256 threads is 1536 threads, which fills an SM. At 12 blocks I saw 295 GFlops; performance gradually increased and seemed to peak around 240 blocks.
  m: increasing m gave significant increases in throughput up to m=500, small increases up to m=8000, and not much change after that.
Based on these observations, I ran with tpb=256, nblks=240, and m=8000 and got 304.88 GFlops (on lin22). The GTX 550 Ti has a peak specification of 691 GFlops, but that counts each fused multiply-add as two floating-point operations, and the logistic map does not benefit from fused multiply-add. The achievable peak should therefore be around 345 GFlops; my code gets 88% of that.

Q2.a. See hw3.cu

Q2.b. 68.3 Gbytes/sec

Q2.c. I wrote a simple kernel (see norm1_kernel) and got about 53 Gbytes/sec. I manually unrolled the loop: using two "accumulators" for the sum and adding two terms of the dot product to each accumulator per loop iteration seemed to work best (see the norm-kernel sketch at the end of this writeup). Then I experimented with nblks, tpb, and m as for Q1; nblks=48, tpb=384, and m=10000 seemed to do well. The GTX 550 Ti has a peak memory bandwidth specification of 95 Gbytes/sec. Those are 2^30-byte Gbytes, so that's about 102 Gbytes/sec with 10^9-byte Gbytes; my code gets about 2/3 of that. I suspect this is because the GTX 550 Ti has four GDDR5 memory devices connected to the GPU through three GDDR5 interfaces. I haven't found a way to get higher bandwidth. I'll be interested to see if any of the submitted solutions did better.

Q3.a. See hw3.cu

Q3.b. 13.24 GRand/sec (giga-random-units/sec)

Q3.c. I wrote the obvious kernel, and it got 0.69 GRand/sec. I changed the code to compute the sum of the m random numbers per thread and store only the sum in global memory -- that doesn't satisfy the assignment, but it ran *way* faster (nearly 17 GRand/sec). This suggests that global memory bandwidth was the problem. I restored my code to store each random uint in a separate location. I stared at the code, trying to think of a way to unroll the loop or change the memory access pattern, but couldn't come up with anything. Then I tried copying the random-number-generator state into shared memory, which removes the global memory accesses for the state (see the generator sketch at the end of this writeup); the throughput increased to 8.7 GRand/sec. From there, I adjusted nblks, tpb, and m. This time it helped to have a large number of threads per block, a large value for m, and a fairly small number of blocks. I got the highest throughput with nblks=12, tpb=768, and m=20000.
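
Logistic-map sketch for Q1.c. A minimal sketch of the 5-elements-per-thread unrolling described above, assuming a grid-stride data layout; the kernel name logistic5_kernel and the map parameter A are illustrative choices, not necessarily what hw3.cu uses. The point is that the five running values stay in registers, which pushes the per-thread register count to about 20.

// Sketch of the unrolled logistic-map kernel (illustrative, not hw3.cu).
// Each thread owns 5 elements of x, spaced one grid-stride apart.
#define A 3.9f   // logistic-map parameter (assumed value)

__global__ void logistic5_kernel(float *x, int m) {
    int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // The five running values live in registers for the whole kernel.
    float x0 = x[i];
    float x1 = x[i + stride];
    float x2 = x[i + 2 * stride];
    float x3 = x[i + 3 * stride];
    float x4 = x[i + 4 * stride];

    for (int k = 0; k < m; k++) {      // m generations of the map
        x0 = A * x0 * (1.0f - x0);
        x1 = A * x1 * (1.0f - x1);
        x2 = A * x2 * (1.0f - x2);
        x3 = A * x3 * (1.0f - x3);
        x4 = A * x4 * (1.0f - x4);
    }

    x[i]              = x0;
    x[i + stride]     = x1;
    x[i + 2 * stride] = x2;
    x[i + 3 * stride] = x3;
    x[i + 4 * stride] = x4;
}

With this layout the array length is 5 * tpb * nblks, and consecutive threads read and write consecutive words, so the (small) global traffic at the start and end of the kernel is coalesced.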
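
A host-side launch and timing loop in the same spirit, showing where the tpb=256, nblks=240, m=8000 settings plug in. The event-based timing and the count of 3 floating-point operations per map update (one subtract, two multiplies) are assumptions about how the GFlops numbers above were obtained, not a transcription of hw3.cu.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void) {
    const int tpb = 256, nblks = 240, m = 8000;
    const int n = 5 * tpb * nblks;                 // 5 elements per thread

    // Seed every element with a value in (0, 1).
    float *x_h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++)
        x_h[i] = 0.25f + 0.5f * (float)i / (float)n;

    float *x_d;
    cudaMalloc(&x_d, n * sizeof(float));
    cudaMemcpy(x_d, x_h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    logistic5_kernel<<<nblks, tpb>>>(x_d, m);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // 3 flops per update, n elements, m generations.
    double gflops = 3.0 * n * (double)m / (ms * 1e-3) / 1e9;
    printf("%.2f GFlops\n", gflops);

    cudaFree(x_d);
    free(x_h);
    return 0;
}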
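
Norm-kernel sketch for Q2.c. A minimal sketch of the two-accumulator unrolling, with two squared terms added to each accumulator per loop pass. The name norm2_kernel, the grid-stride layout, and the block-level reduction into a per-block partial sum (summed on the host afterwards) are illustrative assumptions about how hw3.cu is organized.

#define TPB_NORM 384   // threads per block; must match the launch configuration

// Sum of squares of x[0..n-1]; one partial sum per block (illustrative, not hw3.cu).
// Assumes n is a multiple of 4 * TPB_NORM * gridDim.x.
__global__ void norm2_kernel(const float *x, float *block_sums, int n) {
    __shared__ float partial[TPB_NORM];
    int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Two independent accumulators, each fed two terms per pass, so four
    // loads are outstanding before any of them is needed.
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int j = i; j < n; j += 4 * stride) {
        float a = x[j];
        float b = x[j + stride];
        float c = x[j + 2 * stride];
        float d = x[j + 3 * stride];
        acc0 += a * a + b * b;
        acc1 += c * c + d * d;
    }
    partial[threadIdx.x] = acc0 + acc1;
    __syncthreads();

    // Tree reduction within the block (handles a non-power-of-two block size).
    for (int s = blockDim.x; s > 1; ) {
        int half = (s + 1) / 2;
        if (threadIdx.x < s - half)
            partial[threadIdx.x] += partial[threadIdx.x + half];
        __syncthreads();
        s = half;
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = partial[0];
}

Each element is read exactly once, so the Gbytes/sec figure would be 4*n bytes per pass divided by the kernel time; the arithmetic around the accumulators is essentially free in a bandwidth-bound kernel like this one.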
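
Generator sketch for Q3.c. The actual random-number generator from the assignment is not reproduced here; this sketch uses a stand-in per-thread xorshift128 state (four words) purely to illustrate the pattern described above: load the state into shared memory once, generate m numbers straight into global memory with coalesced stores, and write the state back at the end. The state layout in global memory is also an assumption.

#define TPB_RAND 768   // threads per block; must match the launch configuration

// Illustrative kernel (not hw3.cu): per-thread 4-word generator state staged
// in shared memory so the inner loop makes no global-memory accesses for state.
// The state array must be seeded so that no thread's four words are all zero.
__global__ void rand_shared_kernel(unsigned int *state, unsigned int *out, int m) {
    __shared__ unsigned int s[4][TPB_RAND];
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int nthd = blockDim.x * gridDim.x;

    // Copy this thread's generator state from global to shared memory once.
    for (int w = 0; w < 4; w++)
        s[w][threadIdx.x] = state[w * nthd + tid];

    // Element k from this thread goes to out[k * nthd + tid], so consecutive
    // threads write consecutive words (coalesced stores).
    for (int k = 0; k < m; k++) {
        unsigned int t = s[0][threadIdx.x];          // xorshift128 step
        t ^= t << 11;
        t ^= t >> 8;
        unsigned int v = s[3][threadIdx.x];
        v = v ^ (v >> 19) ^ t;
        s[0][threadIdx.x] = s[1][threadIdx.x];
        s[1][threadIdx.x] = s[2][threadIdx.x];
        s[2][threadIdx.x] = s[3][threadIdx.x];
        s[3][threadIdx.x] = v;
        out[(size_t)k * nthd + tid] = v;
    }

    // Write the updated state back so a later launch continues the stream.
    for (int w = 0; w < 4; w++)
        state[w * nthd + tid] = s[w][threadIdx.x];
}

With tpb=768 the shared-memory footprint is 4 x 4 x 768 = 12 Kbytes per block; at most two 768-thread blocks are resident per SM, so that fits in the SM's 48 Kbytes of shared memory.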