diff --git a/README.md b/README.md
index cad1abd..7b013eb 100644
--- a/README.md
+++ b/README.md
@@ -1,20 +1,92 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
-
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Jonathan Lee
+* Tested on: Tested on: Windows 7, i7-7700 @ 4.2GHz 16GB, GTX 1070 (Personal Machine)
+
+# Overview
+
+In this project, I was able to implement a basic graphics pipeline in CUDA including the *Vertex Shading*, *Primitive Assembly*, *Rasterization*, and *Fragment Shading* stages.
+
+![](renders/milktruck.PNG)
+
+## Features
+* UV Texture Mapping with Bilinear Filtering and Perspective Correct Coordinates
+* Line and Point Rasterization
+* Naive Backface Culling 
+
+## Flags
+* `DRAWLINES`
+* `DRAWPOINTS`
+* `TEXTURE`
+* `BILINEAR`
+* `PERPSECTIVE_CORRECT`
+* `MUTEX`
+
+# Results
+
+## Rasterization Types
+### Duck
+Textured  |  Triangles    |   Lines     | Points
+:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:
+![](renders/duck_tri.PNG) | ![](renders/duck_diff.PNG) | ![](renders/duck_line.PNG) | ![](renders/duck_point.PNG) | 
+
+### Cesium Milk Truck
+Textured  |  Triangles    |   Lines     | Points
+:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:
+![](renders/milktruck_diff_texture.png) | ![](renders/milktruck_diff.png) | ![](renders/milktruck_line.png) | ![](renders/milktruck_point.png) | 
+
+## Texturing
+### Perspective Correctness
+
+Perspective correctness accounts for the vertex positions in 3D space rather than interpolating in screen space. This is where most of the overhead comes from in the rasterization step.
+
+Without Perspective Correct Coordinates  | With Perspective Correct Coordinates
+:-------------------------:|:-------------------------:
+![](renders/noperspective2.png) | ![](renders/perspective2.png) 
+
 
-### (TODO: Your README)
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+### Flickering
 
+Having a lock for dealing with each specific pixel is fairly important. This prevents the flickering seen in the left gif. This is caused when two threads are working on the same pixel. 
+
+Without Locking  | With Locking
+:-------------------------:|:-------------------------:
+![](renders/(no)mutex.gif) | ![](renders/mutex.gif) 
+
+# Analysis
+
+Below is the data from rendering the Cesium Milk Truck model. Overall, the rasterization step takes the most time to process. This makes sense since all of the main computation occurs at this step. 
+
+#### Cesium Milk Truck
+
+Having the mutex lock causes a slight overhead across the texture methods which is well worth the overhead. 
+
+![](images/milktruck_pipeline.png)
+
+![](images/pipeline.PNG)
+
+![](images/rasterization_types.png)
+
+![](images/rastertype.PNG)
+
+![](images/line_point.png)
+
+Clearly, there is a huge dropoff between rasterizing triangles and lines and points. There are no calculations needed to rasterize points. To rasterize lines, `_rasterizeLine()` gets called three times to account for the three lines that make up the triangle: P1-P2, P2-P3, P3-P1. Using the slope, we can find the position along the line as we move from point to point.
 
 ### Credits
 
 * [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@soyoyo](https://github.com/syoyo)
 * [glTF Sample Models](https://github.com/KhronosGroup/glTF/blob/master/sampleModels/README.md)
+
+### References
+* https://www.scratchapixel.com/lessons/3d-basic-rendering/rasterization-practical-implementation/perspective-correct-interpolation-vertex-attributes
+*  https://www.opengl.org/discussion_boards/showthread.php/170651-Is-it-possible-to-get-the-pixel-color
+* https://stackoverflow.com/questions/35005603/get-color-of-the-texture-at-uv-coordinate
+* https://en.wikipedia.org/wiki/Bilinear_filtering
+* http://groups.csail.mit.edu/graphics/classes/6.837/F02/lectures/6.837-7_Line.pdf
+* https://en.wikipedia.org/wiki/Back-face_culling
+
diff --git a/images/chart.png b/images/chart.png
new file mode 100644
index 0000000..3101d06
Binary files /dev/null and b/images/chart.png differ
diff --git a/images/line_point.png b/images/line_point.png
new file mode 100644
index 0000000..5a83644
Binary files /dev/null and b/images/line_point.png differ
diff --git a/images/milktruck_pipeline.png b/images/milktruck_pipeline.png
new file mode 100644
index 0000000..6332482
Binary files /dev/null and b/images/milktruck_pipeline.png differ
diff --git a/images/pipeline.PNG b/images/pipeline.PNG
new file mode 100644
index 0000000..a9b32d7
Binary files /dev/null and b/images/pipeline.PNG differ
diff --git a/images/rasterization_types.png b/images/rasterization_types.png
new file mode 100644
index 0000000..872ea77
Binary files /dev/null and b/images/rasterization_types.png differ
diff --git a/images/rastertype.PNG b/images/rastertype.PNG
new file mode 100644
index 0000000..a919aa7
Binary files /dev/null and b/images/rastertype.PNG differ
diff --git a/renders/(no)mutex.gif b/renders/(no)mutex.gif
new file mode 100644
index 0000000..fd0cb32
Binary files /dev/null and b/renders/(no)mutex.gif differ
diff --git a/renders/cow.PNG b/renders/cow.PNG
new file mode 100644
index 0000000..431f1f3
Binary files /dev/null and b/renders/cow.PNG differ
diff --git a/renders/duck.PNG b/renders/duck.PNG
new file mode 100644
index 0000000..0b63559
Binary files /dev/null and b/renders/duck.PNG differ
diff --git a/renders/duck_diff.PNG b/renders/duck_diff.PNG
new file mode 100644
index 0000000..d91fbc4
Binary files /dev/null and b/renders/duck_diff.PNG differ
diff --git a/renders/duck_line.PNG b/renders/duck_line.PNG
new file mode 100644
index 0000000..ce14335
Binary files /dev/null and b/renders/duck_line.PNG differ
diff --git a/renders/duck_point.PNG b/renders/duck_point.PNG
new file mode 100644
index 0000000..43b8900
Binary files /dev/null and b/renders/duck_point.PNG differ
diff --git a/renders/duck_tri.PNG b/renders/duck_tri.PNG
new file mode 100644
index 0000000..b5f3be8
Binary files /dev/null and b/renders/duck_tri.PNG differ
diff --git a/renders/fail1.PNG b/renders/fail1.PNG
new file mode 100644
index 0000000..097bb5d
Binary files /dev/null and b/renders/fail1.PNG differ
diff --git a/renders/frontfaceculling.PNG b/renders/frontfaceculling.PNG
new file mode 100644
index 0000000..5af574e
Binary files /dev/null and b/renders/frontfaceculling.PNG differ
diff --git a/renders/lines.PNG b/renders/lines.PNG
new file mode 100644
index 0000000..6ea2130
Binary files /dev/null and b/renders/lines.PNG differ
diff --git a/renders/milk.PNG b/renders/milk.PNG
new file mode 100644
index 0000000..bc81cef
Binary files /dev/null and b/renders/milk.PNG differ
diff --git a/renders/milk2.PNG b/renders/milk2.PNG
new file mode 100644
index 0000000..88dcdb3
Binary files /dev/null and b/renders/milk2.PNG differ
diff --git a/renders/milktruck.PNG b/renders/milktruck.PNG
new file mode 100644
index 0000000..e4bb99f
Binary files /dev/null and b/renders/milktruck.PNG differ
diff --git a/renders/milktruck_diff.png b/renders/milktruck_diff.png
new file mode 100644
index 0000000..e105eed
Binary files /dev/null and b/renders/milktruck_diff.png differ
diff --git a/renders/milktruck_diff_texture.png b/renders/milktruck_diff_texture.png
new file mode 100644
index 0000000..bca6682
Binary files /dev/null and b/renders/milktruck_diff_texture.png differ
diff --git a/renders/milktruck_diffuse.PNG b/renders/milktruck_diffuse.PNG
new file mode 100644
index 0000000..35f54d2
Binary files /dev/null and b/renders/milktruck_diffuse.PNG differ
diff --git a/renders/milktruck_line.png b/renders/milktruck_line.png
new file mode 100644
index 0000000..931d3f1
Binary files /dev/null and b/renders/milktruck_line.png differ
diff --git a/renders/milktruck_lines.PNG b/renders/milktruck_lines.PNG
new file mode 100644
index 0000000..46e6c18
Binary files /dev/null and b/renders/milktruck_lines.PNG differ
diff --git a/renders/milktruck_point.png b/renders/milktruck_point.png
new file mode 100644
index 0000000..a08dd06
Binary files /dev/null and b/renders/milktruck_point.png differ
diff --git a/renders/milktruck_points.PNG b/renders/milktruck_points.PNG
new file mode 100644
index 0000000..c1d75d5
Binary files /dev/null and b/renders/milktruck_points.PNG differ
diff --git a/renders/milktruck_texture.PNG b/renders/milktruck_texture.PNG
new file mode 100644
index 0000000..3b6acbd
Binary files /dev/null and b/renders/milktruck_texture.PNG differ
diff --git a/renders/mutex.gif b/renders/mutex.gif
new file mode 100644
index 0000000..8585e49
Binary files /dev/null and b/renders/mutex.gif differ
diff --git a/renders/nobilinear.gif b/renders/nobilinear.gif
new file mode 100644
index 0000000..3f25707
Binary files /dev/null and b/renders/nobilinear.gif differ
diff --git a/renders/noperspective.PNG b/renders/noperspective.PNG
new file mode 100644
index 0000000..a87fbde
Binary files /dev/null and b/renders/noperspective.PNG differ
diff --git a/renders/noperspective2.png b/renders/noperspective2.png
new file mode 100644
index 0000000..eb45bb2
Binary files /dev/null and b/renders/noperspective2.png differ
diff --git a/renders/perspective.PNG b/renders/perspective.PNG
new file mode 100644
index 0000000..e601a96
Binary files /dev/null and b/renders/perspective.PNG differ
diff --git a/renders/perspective2.png b/renders/perspective2.png
new file mode 100644
index 0000000..6f73131
Binary files /dev/null and b/renders/perspective2.png differ
diff --git a/renders/uvindex.PNG b/renders/uvindex.PNG
new file mode 100644
index 0000000..6475fec
Binary files /dev/null and b/renders/uvindex.PNG differ
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..be649d5 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -17,6 +17,19 @@
 #include "rasterize.h"
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
+#include <thrust/execution_policy.h>
+#include <thrust/random.h>
+#include <thrust/remove.h>
+#include <thrust/partition.h>
+#include "../stream_compaction/common.h"
+
+#define BACKFACE_CULLING 0
+#define DRAWLINES 0
+#define DRAWPOINTS 0
+#define TEXTURE 1
+#define BILINEAR 1
+#define PERPSECTIVE_CORRECT 1
+#define MUTEX 1
 
 namespace {
 
@@ -28,12 +41,13 @@ namespace {
 
 	typedef unsigned char BufferByte;
 
-	enum PrimitiveType{
-		Point = 1,
-		Line = 2,
-		Triangle = 3
+	enum PrimitiveType {
+		Point		= 1,
+		Line		= 2,
+		Triangle	= 3
 	};
 
+	// Assembled and transformed vertices
 	struct VertexOut {
 		glm::vec4 pos;
 
@@ -43,10 +57,10 @@ namespace {
 
 		 glm::vec3 eyePos;	// eye space position used for shading
 		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
-		// glm::vec3 col;
+		 glm::vec3 col;
 		 glm::vec2 texcoord0;
 		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
+		 int texWidth, texHeight;
 		// ...
 	};
 
@@ -62,13 +76,17 @@ namespace {
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
 
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
+		 glm::vec3 eyePos;	// eye space position used for shading
+		 glm::vec3 eyeNor;
+		 VertexAttributeTexcoord texcoord0;
+		 TextureData* dev_diffuseTex;
+		 int diffuseTexWidth;
+		 int diffuseTexHeight;
 		// ...
 	};
 
+	// Buffers from glTF
+	// Passed into _vertexTransformAndAssembly 
 	struct PrimitiveDevBufPointers {
 		int primitiveMode;	//from tinygltfloader macro
 		PrimitiveType primitiveType;
@@ -110,6 +128,7 @@ static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
 
 static int * dev_depth = NULL;	// you might need this buffer when doing depth test
+static int * dev_mutex = NULL;
 
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
@@ -143,10 +162,83 @@ void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
     int index = x + (y * w);
 
     if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
+		glm::vec3 lightPos(10.f, 20.f, 30.f);
+		glm::vec3 intensity(1.5f);
 
-		// TODO: add your fragment shader code here
+		if (DRAWPOINTS || DRAWLINES) {
+			framebuffer[index] = fragmentBuffer[index].color;
+			return;
+		}
+
+		// Use Lambert
+		if (fragmentBuffer[index].dev_diffuseTex == NULL || !TEXTURE) {
+			glm::vec3 lightWi = glm::normalize(fragmentBuffer[index].eyePos - lightPos);
+
+			float absDot = glm::abs(glm::dot(fragmentBuffer[index].eyeNor, lightWi));
+			glm::vec3 col = fragmentBuffer[index].color * intensity * absDot;
+
+			framebuffer[index] = glm::clamp(col, 0.f, 1.f);
+		}
+		// Use texture
+		else if(TEXTURE) {
+			glm::vec3 color(0.f);
+
+			if (BILINEAR) {
+				// Bilinear
+				TextureData* tex = fragmentBuffer[index].dev_diffuseTex;
+				int width = fragmentBuffer[index].diffuseTexWidth;
+				int height = fragmentBuffer[index].diffuseTexHeight;
+
+				float u = fragmentBuffer[index].texcoord0.x * width;
+				float v = fragmentBuffer[index].texcoord0.y * height;
+
+				int x = glm::floor(u);
+				int y = glm::floor(v);
+
+				float uRatio = u - x;
+				float vRatio = v - y;
 
+				float uOpposite = 1.f - uRatio;
+				float vOpposite = 1.f - vRatio;
+
+				int uvIndex1 = 3 * (x + y * width);
+				int uvIndex2 = 3 * (x + 1 + y * width);
+				int uvIndex3 = 3 * (x + (y + 1) * width);
+				int uvIndex4 = 3 * (x + 1 + (y + 1) * width);
+
+				// https://en.wikipedia.org/wiki/Bilinear_filtering
+				float r = (tex[uvIndex1] * uOpposite + tex[uvIndex2] * uRatio) * vOpposite +
+					(tex[uvIndex3] * uOpposite + tex[uvIndex4] * uRatio) * vRatio;
+				float g = (tex[uvIndex1 + 1] * uOpposite + tex[uvIndex2 + 1] * uRatio) * vOpposite +
+					(tex[uvIndex3 + 1] * uOpposite + tex[uvIndex4 + 1] * uRatio) * vRatio;
+				float b = (tex[uvIndex1 + 2] * uOpposite + tex[uvIndex2 + 2] * uRatio) * vOpposite +
+					(tex[uvIndex3 + 2] * uOpposite + tex[uvIndex4 + 2] * uRatio) * vRatio;
+
+				color = glm::vec3(r, g, b);
+				color /= 255.f;
+
+				framebuffer[index] = color;
+			}
+			else {
+				// Not bilinear
+				int u = fragmentBuffer[index].texcoord0.x * fragmentBuffer[index].diffuseTexWidth;
+				int v = fragmentBuffer[index].texcoord0.y * fragmentBuffer[index].diffuseTexHeight;
+
+				// https://stackoverflow.com/questions/35005603/get-color-of-the-texture-at-uv-coordinate
+				int uvIndex = 3 * (u + (v * fragmentBuffer[index].diffuseTexWidth));
+
+				// https://www.opengl.org/discussion_boards/showthread.php/170651-Is-it-possible-to-get-the-pixel-color
+				float r = fragmentBuffer[index].dev_diffuseTex[uvIndex];
+				float g = fragmentBuffer[index].dev_diffuseTex[uvIndex + 1];
+				float b = fragmentBuffer[index].dev_diffuseTex[uvIndex + 2];
+
+				color = glm::vec3(r, g, b);
+				color /= 255.f;
+
+				framebuffer[index] = color;
+			}
+		}
+		// TODO: add your fragment shader code here
     }
 }
 
@@ -166,6 +258,10 @@ void rasterizeInit(int w, int h) {
 	cudaFree(dev_depth);
 	cudaMalloc(&dev_depth, width * height * sizeof(int));
 
+	cudaFree(dev_mutex);
+	cudaMalloc(&dev_mutex, width * height * sizeof(int));
+	cudaMemset(&dev_mutex, 0, width * height * sizeof(int));
+
 	checkCUDAError("rasterizeInit");
 }
 
@@ -617,18 +713,15 @@ void rasterizeSetBuffers(const tinygltf::Scene & scene) {
 
 		checkCUDAError("Free BufferView Device Mem");
 	}
-
-
 }
 
-
-
 __global__ 
 void _vertexTransformAndAssembly(
 	int numVertices, 
 	PrimitiveDevBufPointers primitive, 
 	glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, 
-	int width, int height) {
+	int width, int height) 
+{
 
 	// vertex id
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
@@ -639,14 +732,46 @@ void _vertexTransformAndAssembly(
 		// Then divide the pos by its w element to transform into NDC space
 		// Finally transform x and y to viewport space
 
+		// World coordinates
+		glm::vec4 pWorld(primitive.dev_position[vid], 1.f);
+		glm::vec3 norWorld(primitive.dev_normal[vid]);
+
+		// Take point to homogenous coordinates/clip space
+		glm::vec4 pClip = MVP * pWorld;
+		// Take point to NDC
+		pClip /= pClip.w;
+
+		// Take point to screen space
+		glm::vec4 pScreen(pClip);
+		pScreen.x = ((pClip.x + 1.f) / 2.f) * (float)width;
+		pScreen.y = ((1.f - pClip.y) / 2.f) * (float)height;
+		pScreen.z = (-(pClip.z + 1.f) / 2.f);
+
+		// Get the point's position in camera space
+		glm::vec3 pEye(MV * pWorld);
+		glm::vec3 norEye(glm::normalize(MV_normal * norWorld));
+
 		// TODO: Apply vertex assembly here
 		// Assemble all attribute arraies into the primitive array
-		
+
+		primitive.dev_verticesOut[vid].pos = pScreen;
+		primitive.dev_verticesOut[vid].eyePos = pEye;
+		primitive.dev_verticesOut[vid].eyeNor = norEye;
+
+		// Texture stuff
+		if (primitive.dev_diffuseTex == NULL) {
+			primitive.dev_verticesOut[vid].dev_diffuseTex = NULL;
+		}
+		else {
+			primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid];
+			primitive.dev_verticesOut[vid].dev_diffuseTex = primitive.dev_diffuseTex;
+			primitive.dev_verticesOut[vid].texWidth = primitive.diffuseTexWidth;
+			primitive.dev_verticesOut[vid].texHeight = primitive.diffuseTexHeight;
+		}
 	}
 }
 
 
-
 static int curPrimitiveBeginId = 0;
 
 __global__ 
@@ -660,17 +785,194 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 		// TODO: uncomment the following code for a start
 		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		int pid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+			pid = iid / (int)primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		} 
+		else if (primitive.primitiveMode == TINYGLTF_MODE_POINTS) {
+			
+		}
+		else if (primitive.primitiveMode == TINYGLTF_MODE_LINE) {
+			
+		}
+
+		// TODO: other primitive types (point, line)
+	}
+}
+
+// http://groups.csail.mit.edu/graphics/classes/6.837/F02/lectures/6.837-7_Line.pdf
+__device__
+void _rasterizeLine(int width, int height,
+	Fragment *dev_fragmentBuffer,
+	glm::vec3 &p1, glm::vec3 &p2)
+{
+	int x1 = p1.x;
+	int x2 = p2.x;
+	int y1 = p1.y;
+	int y2 = p2.y;
+
+	int dx = x2 - x1;
+	int dy = y2 - y1;
+	float m = dy / dx;
+
+	for (int x = x1; x <= x2; x++) {
+		int y = y1 + (int)m * (x - x1);
+
+		int index = x + (y * width);
+		if (x <= width - 1 && x >= 0 && y <= height - 1 && height >= 0)
+			dev_fragmentBuffer[index].color = glm::vec3(1.f, 0.f, 0.f);
+	}
+}
+
+__device__
+void _rasterizePoints(int width, int height, glm::vec3 *triPos, Fragment *dev_fragmentBuffer)
+{
+	// Since we only need to draw the points
+	// we can just set a color for each of the 
+	// points on the triangle. 
+	// For whatever reason, I couldn't set the color during render()
+	for (int i = 0; i < 3; i++) {
+		int x = triPos[i].x;
+		int y = triPos[i].y;
+
+		int index = x + (y * width);
+		dev_fragmentBuffer[index].color = glm::vec3(0.f, 1.f, 1.f);
+	}
+}
+
+struct cullPrim
+{
+	__host__ __device__
+	bool operator()(const Primitive &prim)
+	{
+		glm::vec3 triEyePos[3];
+		triEyePos[0] = prim.v[0].eyePos;
+		triEyePos[1] = prim.v[1].eyePos;
+		triEyePos[2] = prim.v[2].eyePos;
+
+		glm::vec3 triEyeNor[3];
+		triEyeNor[0] = prim.v[0].eyeNor;
+		triEyeNor[1] = prim.v[1].eyeNor;
+		triEyeNor[2] = prim.v[2].eyeNor;
+
+		return (glm::dot(triEyePos[0], triEyeNor[0]) >= 0.f);
+	}
+};
+
+__global__ 
+void _rasterize(int numPrims, int width, int height,
+				Primitive *dev_primitives, 
+				Fragment *dev_fragmentBuffer,
+				int *dev_depth, int *dev_mutex)
+{
+	// Primitive ID
+	int pid = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (pid < numPrims) {
+		Primitive prim = dev_primitives[pid];
+		
+		glm::vec3 triPos[3];
+		triPos[0] = glm::vec3(prim.v[0].pos);
+		triPos[1] = glm::vec3(prim.v[1].pos);
+		triPos[2] = glm::vec3(prim.v[2].pos);
+
+		glm::vec3 triEyePos[3];
+		triEyePos[0] = prim.v[0].eyePos;
+		triEyePos[1] = prim.v[1].eyePos;
+		triEyePos[2] = prim.v[2].eyePos;
+
+		glm::vec3 triEyeNor[3];
+		triEyeNor[0] = prim.v[0].eyeNor;
+		triEyeNor[1] = prim.v[1].eyeNor;
+		triEyeNor[2] = prim.v[2].eyeNor;
+
+		//int culledPrims = numPrims;
+		//if (BACKFACE_CULLING) {
+		//	Primitive *unculledPrims = thrust::partition(thrust::device, dev_primitives, dev_primitives + culledPrims, cullPrim());
+		//	culledPrims = unculledPrims - dev_primitives;
 		//}
 
+		// https://en.wikipedia.org/wiki/Back-face_culling
+		if (glm::dot(triEyePos[0], triEyeNor[0]) >= 0.f && BACKFACE_CULLING) {
+			return;
+		}
 
-		// TODO: other primitive types (point, line)
+		// Get bounding box for the primitive
+		AABB triBB = getAABBForTriangle(triPos);
+
+#if DRAWPOINTS
+		// Pass in triangle and draw each point
+		_rasterizePoints(width, height, triPos, dev_fragmentBuffer);
+#elif DRAWLINES
+		// Pass in 2 points of a triangle and draw the line connecting them
+		_rasterizeLine(width, height, dev_fragmentBuffer, triPos[0], triPos[1]);
+		_rasterizeLine(width, height, dev_fragmentBuffer, triPos[1], triPos[2]);
+		_rasterizeLine(width, height, dev_fragmentBuffer, triPos[2], triPos[0]);
+// Triangles
+#else 
+		for (int i = triBB.min.x; i < triBB.max.x; i++) {
+			for (int j = triBB.min.y; j < triBB.max.y; j++) {
+				// Get the barycentric coordinate 
+				glm::vec2 pixel(i, j);
+				glm::vec3 p = calculateBarycentricCoordinate(triPos, pixel);
+
+				// Check to see if the point is in the triangle
+				if (isBarycentricCoordInBounds(p)) {
+					int index = i + (j * width);
+					
+					bool isLocked = false;
+#if MUTEX
+					do
+					{
+						isLocked = (atomicCAS(&dev_mutex[index], 0, 1) == 0);
+#endif		// end mutex
+						// Get the z coordinate and scale by some factor
+						int depth = getZAtCoordinate(p, triPos) * INT_MAX;
+
+						// Store the resulting minimum depth at dev_depth[index]
+						atomicMin(&dev_depth[index], depth);
+
+						// Need to check if the the current depth is the closest one
+						if (dev_depth[index] == depth) {
+							// Color point if in bounds
+							dev_fragmentBuffer[index].color = glm::vec3(1.f);
+							dev_fragmentBuffer[index].eyePos = triEyePos[0] * p.x + triEyePos[1] * p.y + triEyePos[2] * p.z;
+							dev_fragmentBuffer[index].eyeNor = triEyeNor[0] * p.x + triEyeNor[1] * p.y + triEyeNor[2] * p.z;
+
+							// Texture stuff
+							dev_fragmentBuffer[index].dev_diffuseTex = prim.v[0].dev_diffuseTex;
+							dev_fragmentBuffer[index].diffuseTexWidth = prim.v[0].texWidth;
+							dev_fragmentBuffer[index].diffuseTexHeight = prim.v[0].texHeight;
+							dev_fragmentBuffer[index].texcoord0 = p.x * prim.v[0].texcoord0 + p.y * prim.v[1].texcoord0 + p.z * prim.v[2].texcoord0;
+
+							if (PERPSECTIVE_CORRECT) {
+								// Perspective Correct Texture Coordinate
+								// Divide all attributes (in this case barycentric) by eye space z
+								// https://www.scratchapixel.com/lessons/3d-basic-rendering/rasterization-practical-implementation/perspective-correct-interpolation-vertex-attributes
+								glm::vec3 w(p.x / triEyePos[0].z, p.y / triEyePos[1].z, p.z / triEyePos[2].z);
+
+								float s = w.x * prim.v[0].texcoord0.x + w.y * prim.v[1].texcoord0.x + w.z * prim.v[2].texcoord0.x;
+								float t = w.x * prim.v[0].texcoord0.y + w.y * prim.v[1].texcoord0.y + w.z * prim.v[2].texcoord0.y;
+
+								float z = 1.f / (w.x + w.y + w.z);
+								s *= z;
+								t *= z;
+
+								dev_fragmentBuffer[index].texcoord0 = glm::vec2(s, t);
+							}
+						}
+#if MUTEX
+						if (isLocked && MUTEX) {
+							dev_mutex[index] = 0;
+						}
+					} while (!isLocked);
+#endif		// end mutex
+				}
+			}
+		}
+#endif
 	}
-	
 }
 
 
@@ -679,6 +981,11 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
  * Perform rasterization.
  */
 void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
+	StreamCompaction::Common::PerformanceTimer vertexTransformTimer;
+	StreamCompaction::Common::PerformanceTimer primitiveAssemblyTimer;
+	StreamCompaction::Common::PerformanceTimer rasterizationTimer;
+	StreamCompaction::Common::PerformanceTimer renderTimer;
+
     int sideLength2d = 8;
     dim3 blockSize2d(sideLength2d, sideLength2d);
     dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
@@ -695,6 +1002,9 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 		auto it = mesh2PrimitivesMap.begin();
 		auto itEnd = mesh2PrimitivesMap.end();
 
+		float vertTotal = 0.f;
+		float assembleTotal = 0.f;
+
 		for (; it != itEnd; ++it) {
 			auto p = (it->second).begin();	// each primitive
 			auto pEnd = (it->second).end();
@@ -702,19 +1012,29 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 
+				vertexTransformTimer.startGpuTimer();
 				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+				vertexTransformTimer.endGpuTimer();
+				vertTotal += vertexTransformTimer.getGpuElapsedTimeForPreviousOperation();
 				checkCUDAError("Vertex Processing");
 				cudaDeviceSynchronize();
+				
+				primitiveAssemblyTimer.startGpuTimer();
 				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
 					(p->numIndices, 
 					curPrimitiveBeginId, 
 					dev_primitives, 
 					*p);
+				primitiveAssemblyTimer.endGpuTimer();
+				assembleTotal += primitiveAssemblyTimer.getGpuElapsedTimeForPreviousOperation();
 				checkCUDAError("Primitive Assembly");
 
 				curPrimitiveBeginId += p->numPrimitives;
 			}
 		}
+		
+		printf("Vertex Transformation: %f\n", vertexTransformTimer.getGpuElapsedTimeForPreviousOperation());
+		printf("Primitive Assembly: %f\n", primitiveAssemblyTimer.getGpuElapsedTimeForPreviousOperation());
 
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
@@ -723,11 +1043,26 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
 	
 	// TODO: rasterize
+	
+	{
+		dim3 numThreadsPerBlock(128);
+		dim3 numBlocksForPrims((totalNumPrimitives + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 
-
+		rasterizationTimer.startGpuTimer();
+		_rasterize << < numBlocksForPrims, numThreadsPerBlock >> > (totalNumPrimitives, width, height, dev_primitives, dev_fragmentBuffer, dev_depth, dev_mutex);
+		rasterizationTimer.endGpuTimer();
+		printf("Rasterization: %f\n", rasterizationTimer.getGpuElapsedTimeForPreviousOperation());
+		
+		checkCUDAError("rasterization");
+	}
+	
 
     // Copy depthbuffer colors into framebuffer
+	renderTimer.startGpuTimer();
 	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
+	renderTimer.endGpuTimer();
+	printf("Render: %f\n", renderTimer.getGpuElapsedTimeForPreviousOperation());
+
 	checkCUDAError("fragment shader");
     // Copy framebuffer into OpenGL buffer for OpenGL previewing
     sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
@@ -772,5 +1107,8 @@ void rasterizeFree() {
 	cudaFree(dev_depth);
 	dev_depth = NULL;
 
+	cudaFree(dev_mutex);
+	dev_mutex = NULL;
+
     checkCUDAError("rasterize Free");
 }
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
new file mode 100644
index 0000000..ac358c9
--- /dev/null
+++ b/stream_compaction/CMakeLists.txt
@@ -0,0 +1,7 @@
+set(SOURCE_FILES
+    )
+
+cuda_add_library(stream_compaction
+    ${SOURCE_FILES}
+    OPTIONS -arch=sm_20
+    )
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
new file mode 100644
index 0000000..7652631
--- /dev/null
+++ b/stream_compaction/common.cu
@@ -0,0 +1,7 @@
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Common {
+
+    }
+}
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
new file mode 100644
index 0000000..d8cfa93
--- /dev/null
+++ b/stream_compaction/common.h
@@ -0,0 +1,107 @@
+#pragma once
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#include <cstdio>
+#include <cstring>
+#include <cmath>
+#include <algorithm>
+#include <chrono>
+#include <stdexcept>
+
+namespace StreamCompaction {
+    namespace Common {
+	    /**
+	    * This class is used for timing the performance
+	    * Uncopyable and unmovable
+        *
+        * Adapted from WindyDarian(https://github.com/WindyDarian)
+	    */
+	    class PerformanceTimer
+	    {
+	    public:
+		    PerformanceTimer()
+		    {
+			    cudaEventCreate(&event_start);
+			    cudaEventCreate(&event_end);
+		    }
+
+		    ~PerformanceTimer()
+		    {
+			    cudaEventDestroy(event_start);
+			    cudaEventDestroy(event_end);
+		    }
+
+		    void startCpuTimer()
+		    {
+			    if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
+			    cpu_timer_started = true;
+
+			    time_start_cpu = std::chrono::high_resolution_clock::now();
+		    }
+
+		    void endCpuTimer()
+		    {
+			    time_end_cpu = std::chrono::high_resolution_clock::now();
+
+			    if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
+
+			    std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
+			    prev_elapsed_time_cpu_milliseconds =
+				    static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());
+
+			    cpu_timer_started = false;
+		    }
+
+		    void startGpuTimer()
+		    {
+			    if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
+			    gpu_timer_started = true;
+
+			    cudaEventRecord(event_start);
+		    }
+
+		    void endGpuTimer()
+		    {
+			    cudaEventRecord(event_end);
+			    cudaEventSynchronize(event_end);
+
+			    if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
+
+			    cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+			    gpu_timer_started = false;
+		    }
+
+		    float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+		    {
+			    return prev_elapsed_time_cpu_milliseconds;
+		    }
+
+		    float getGpuElapsedTimeForPreviousOperation() //noexcept
+		    {
+			    return prev_elapsed_time_gpu_milliseconds;
+		    }
+
+		    // remove copy and move functions
+		    PerformanceTimer(const PerformanceTimer&) = delete;
+		    PerformanceTimer(PerformanceTimer&&) = delete;
+		    PerformanceTimer& operator=(const PerformanceTimer&) = delete;
+		    PerformanceTimer& operator=(PerformanceTimer&&) = delete;
+
+	    private:
+		    cudaEvent_t event_start = nullptr;
+		    cudaEvent_t event_end = nullptr;
+
+		    using time_point_t = std::chrono::high_resolution_clock::time_point;
+		    time_point_t time_start_cpu;
+		    time_point_t time_end_cpu;
+
+		    bool cpu_timer_started = false;
+		    bool gpu_timer_started = false;
+
+		    float prev_elapsed_time_cpu_milliseconds = 0.f;
+		    float prev_elapsed_time_gpu_milliseconds = 0.f;
+	    };
+    }
+}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
new file mode 100644
index 0000000..06cef1a
--- /dev/null
+++ b/stream_compaction/cpu.cu
@@ -0,0 +1,103 @@
+#include <cstdio>
+#include "cpu.h"
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace CPU {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+	        static PerformanceTimer timer;
+	        return timer;
+        }
+
+        /**
+         * CPU scan (prefix sum).
+         * For performance analysis, this is supposed to be a simple for loop.
+         * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
+         */
+        void scan(int n, int *odata, const int *idata) {
+			timer().startCpuTimer();
+            // TODO
+
+			// Exclusive scan
+			odata[0] = 0;
+			for (int i = 1; i < n; i++) {
+				odata[i] = odata[i - 1] + idata[i - 1];
+			}
+
+			timer().endCpuTimer();
+        }
+
+        /**
+         * CPU stream compaction without using the scan function.
+         *
+         * @returns the number of elements remaining after compaction.
+         */
+        int compactWithoutScan(int n, int *odata, const int *idata) {
+	        timer().startCpuTimer();
+            // TODO
+
+			int idx = 0;
+			for (int i = 0; i < n; i++) {
+				if (idata[i] != 0) {
+					odata[idx] = idata[i];
+					idx++;
+				}
+			}
+			
+	        timer().endCpuTimer();
+            return idx;
+        }
+
+		void exclusiveScan(int n, int *odata, const int *idata) {
+			// Exclusive scan
+			odata[0] = 0;
+			for (int i = 1; i < n; i++) {
+				odata[i] = odata[i - 1] + idata[i - 1];
+			}
+		}
+
+        /**
+         * CPU stream compaction using scan and scatter, like the parallel version.
+         *
+         * @returns the number of elements remaining after compaction.
+         */
+        int compactWithScan(int n, int *odata, const int *idata) {
+			int *temp = new int[n];
+			int *scanned = new int[n];
+
+			timer().startCpuTimer();
+	        // TODO
+
+			// Create a temporary array 
+			for (int i = 0; i < n; i++) {
+				if (idata[i] == 0) {
+					temp[i] = 0;
+				}
+				else {
+					temp[i] = 1;
+				}
+			}
+
+			// Exclusive Scan
+			exclusiveScan(n, scanned, temp);
+
+			// Scatter
+			int numElements = 0;
+			for (int i = 0; i < n; i++) {
+				if (temp[i] == 1) {
+					odata[scanned[i]] = idata[i];
+					numElements++;
+				}
+			}
+
+			timer().endCpuTimer();
+			delete[]temp;
+			delete[]scanned;
+
+			return numElements;
+        }
+    }
+}
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
new file mode 100644
index 0000000..236ce11
--- /dev/null
+++ b/stream_compaction/cpu.h
@@ -0,0 +1,15 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace CPU {
+	    StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+
+        int compactWithoutScan(int n, int *odata, const int *idata);
+
+        int compactWithScan(int n, int *odata, const int *idata);
+    }
+}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
new file mode 100644
index 0000000..cbb03f4
--- /dev/null
+++ b/stream_compaction/efficient.cu
@@ -0,0 +1,218 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+#include "efficient.h"
+
+#define blocksize 128
+
+namespace StreamCompaction {
+    namespace Efficient {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+		__global__ void upSweep(int n, int pow2dPlus1, int pow2d, int *odata, bool reachedRoot)
+		{
+			int index = threadIdx.x + (blockIdx.x * blockDim.x);
+			if (index >= n) {
+				return;
+			}
+
+			if (reachedRoot) {
+				odata[n - 1] = 0;
+			}
+			else {
+				index *= pow2dPlus1;
+				if (index < n)
+					odata[index + pow2dPlus1 - 1] += odata[index + pow2d - 1];
+			}
+		}
+
+		__global__ void downSweep(int n, int pow2dPlus1, int pow2d, int *odata)
+		{
+			int index = threadIdx.x + (blockIdx.x * blockDim.x);
+			if (index >= n) {
+				return;
+			}
+
+			index *= pow2dPlus1;
+			if (index < n) {
+				int t = odata[index + pow2d - 1];
+				odata[index + pow2d - 1] = odata[index + pow2dPlus1 - 1];
+				odata[index + pow2dPlus1 - 1] += t;
+			}
+		}
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+			dim3 fullBlocksPerGrid((n + blocksize - 1) / blocksize);
+			
+			// Get the next power of 2
+			int currPow = ilog2ceil(n) - 1;
+			int nextPow = 2 << currPow;
+
+			int *temp = new int[nextPow];
+			for (int i = 0; i < nextPow; i++) {
+				if (i < n) {
+					temp[i] = idata[i];
+				}
+				// Fill the rest of the array with 0 if not a power of 2.
+				else {
+					temp[i] = 0;
+				}
+			}
+
+			int *out;
+			cudaMalloc((void**)&out, nextPow * sizeof(int));
+			checkCUDAError("cudaMalloc out failed!");
+			cudaMemcpy(out, temp, sizeof(int) * nextPow, cudaMemcpyHostToDevice);
+
+			timer().startGpuTimer();
+            // TODO
+
+			// Up-Sweep
+			for (int d = 0; d <= ilog2ceil(nextPow) - 1; d++) {
+				int pow2dPlus1 = pow(2, d + 1);
+				int pow2d = pow(2, d);
+
+				// If we hit the end of the depth then we should be writing to the very last spot in the array.
+				bool reachedRoot = (d == ilog2ceil(nextPow) - 1);
+				upSweep << < fullBlocksPerGrid, blocksize >> > (nextPow, pow2dPlus1, pow2d, out, reachedRoot);
+			}
+
+			// Down-Sweep
+			for (int d = ilog2ceil(nextPow) - 1; d >= 0; d--) {
+				int pow2dPlus1 = pow(2, d + 1);
+				int pow2d = pow(2, d);
+			
+				downSweep << < fullBlocksPerGrid, blocksize >> > (nextPow, pow2dPlus1, pow2d, out);
+			}
+
+            timer().endGpuTimer();
+
+			// Copy final values into odata
+			cudaMemcpy(odata, out, sizeof(int) * nextPow, cudaMemcpyDeviceToHost);
+
+			delete[]temp;
+			cudaFree(out);
+        }
+
+		void exclusiveScan(int n, int *odata, const int *idata) {
+			dim3 fullBlocksPerGrid((n + blocksize - 1) / blocksize);
+
+			// Get the next power of 2
+			int currPow = ilog2ceil(n) - 1;
+			int nextPow = 2 << currPow;
+
+			int *temp = new int[nextPow];
+			for (int i = 0; i < nextPow; i++) {
+				if (i < n) {
+					temp[i] = idata[i];
+				}
+				// Fill the rest of the array with 0 if not a power of 2.
+				else {
+					temp[i] = 0;
+				}
+			}
+
+			int *out;
+			cudaMalloc((void**)&out, nextPow * sizeof(int));
+			checkCUDAError("cudaMalloc out failed!");
+			cudaMemcpy(out, temp, sizeof(int) * nextPow, cudaMemcpyHostToDevice);
+
+			// Up-Sweep
+			for (int d = 0; d <= ilog2ceil(nextPow) - 1; d++) {
+				//int pow2dPlus1 = pow(2, d + 1);
+				//int pow2d = pow(2, d);
+
+				int pow2dPlus1 = pow(2, d + 1);
+				int pow2d = pow(2, d);
+
+				// If we hit the end of the depth then we should be writing to the very last spot in the array.
+				bool reachedRoot = (d == ilog2ceil(nextPow) - 1);
+				upSweep << < fullBlocksPerGrid, blocksize >> > (nextPow, pow2dPlus1, pow2d, out, reachedRoot);
+			}
+
+			// Down-Sweep
+			for (int d = ilog2ceil(nextPow) - 1; d >= 0; d--) {
+				int pow2dPlus1 = pow(2, d + 1);
+				int pow2d = pow(2, d);
+
+				downSweep << < fullBlocksPerGrid, blocksize >> > (nextPow, pow2dPlus1, pow2d, out);
+			}
+
+			// Copy final values into odata
+			cudaMemcpy(odata, out, sizeof(int) * nextPow, cudaMemcpyDeviceToHost);
+
+			delete[]temp;
+			cudaFree(out);
+		}
+
+
+        /**
+         * Performs stream compaction on idata, storing the result into odata.
+         * All zeroes are discarded.
+         *
+         * @param n      The number of elements in idata.
+         * @param odata  The array into which to store elements.
+         * @param idata  The array of elements to compact.
+         * @returns      The number of elements remaining after compaction.
+         */
+		int compact(int n, int *odata, const int *idata) {
+			dim3 fullBlocksPerGrid((n + blocksize - 1) / blocksize);
+
+			// Get the next power of 2
+			int currPow = ilog2ceil(n) - 1;
+			int nextPow = 2 << currPow;
+
+			int *dev_in;
+			cudaMalloc((void**)&dev_in, sizeof(int) * nextPow);
+			cudaMemcpy(dev_in, idata, sizeof(int) * nextPow, cudaMemcpyHostToDevice);
+
+			int *dev_out;
+			cudaMalloc((void**)&dev_out, sizeof(int) * nextPow);
+
+			int *dev_indices;
+			cudaMalloc((void**)&dev_indices, sizeof(int) * nextPow);
+
+			int *bools = new int[nextPow];
+			int *dev_bools;
+			cudaMalloc((void**)&dev_bools, sizeof(int) * nextPow);
+
+			timer().startGpuTimer();
+			// TODO
+
+			StreamCompaction::Common::kernMapToBoolean << < fullBlocksPerGrid, blocksize >> > (nextPow, dev_bools, dev_in);
+			cudaMemcpy(bools, dev_bools, sizeof(int) * nextPow, cudaMemcpyDeviceToHost);
+
+			exclusiveScan(nextPow, dev_indices, bools);
+
+			StreamCompaction::Common::kernScatter << < fullBlocksPerGrid, blocksize >> > (nextPow, dev_out, dev_in, dev_bools, dev_indices);
+
+			timer().endGpuTimer();
+
+			int numElements = 0;
+			for (int i = 0; i < n; i++) {
+				if (bools[i] == 1) {
+					numElements++;
+				}
+			}
+
+			cudaMemcpy(odata, dev_out, sizeof(int) * numElements, cudaMemcpyDeviceToHost);
+
+			cudaFree(dev_in);
+			cudaFree(dev_out);
+			cudaFree(dev_indices);
+			cudaFree(dev_bools);
+
+			delete[]bools;
+
+			return numElements;
+		}
+    }
+}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
new file mode 100644
index 0000000..803cb4f
--- /dev/null
+++ b/stream_compaction/efficient.h
@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Efficient {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+
+        int compact(int n, int *odata, const int *idata);
+    }
+}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
new file mode 100644
index 0000000..c3e4df5
--- /dev/null
+++ b/stream_compaction/naive.cu
@@ -0,0 +1,79 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+#include "naive.h"
+
+#define blocksize 128
+
+namespace StreamCompaction {
+    namespace Naive {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+        // TODO: __global__
+
+		__global__ void kernScanHelper(int n, int pow2dMinus1, int *in, int *out)
+		{
+			int index = threadIdx.x + (blockIdx.x * blockDim.x);
+			if (index >= n) {
+				return;
+			}
+
+			// Inclusive Scan
+			if (index >= pow2dMinus1) {
+				out[index] = in[index - pow2dMinus1] + in[index];
+			}
+			else {
+				out[index] = in[index];
+			}
+		}
+
+		__global__ void inclusiveToExclusive(int n, int *in, int *out)
+		{
+			int index = threadIdx.x + (blockIdx.x * blockDim.x);
+			if (index >= n) {
+				return;
+			}
+
+			out[index] = index > 0 ? in[index - 1] : 0;
+		}
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+			dim3 fullBlocksPerGrid((n + blocksize - 1) / blocksize);
+
+			int *in;
+			int *out;
+			cudaMalloc((void**)&in, n * sizeof(int));
+			cudaMalloc((void**)&out, n * sizeof(int));
+			cudaMemcpy(in, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
+			timer().startGpuTimer();
+            // TODO
+
+			for (int d = 1; d <= ilog2ceil(n); d++) {
+				int pow2dMinus1 = pow(2, d - 1);
+				
+				kernScanHelper << <fullBlocksPerGrid, blocksize >> > (n, pow2dMinus1, in, out);
+
+				std::swap(in, out);
+			}
+
+			// Shift to the right
+			inclusiveToExclusive << < fullBlocksPerGrid, blocksize >> > (n, in, out);
+            
+			timer().endGpuTimer();
+
+			// Copy final values into odata
+			cudaMemcpy(odata, out, sizeof(int) * n, cudaMemcpyDeviceToHost);
+
+			cudaFree(in);
+			cudaFree(out);
+		}
+    }
+}
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
new file mode 100644
index 0000000..37dcb06
--- /dev/null
+++ b/stream_compaction/naive.h
@@ -0,0 +1,11 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Naive {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void scan(int n, int *odata, const int *idata);
+    }
+}