<!DOCTYPE html>
<!-- saved from url=(0014)about:internet -->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

<title>CodeBook</title>

<style type="text/css">
body, td {
   font-family: sans-serif;
   background-color: white;
   font-size: 12px;
   margin: 8px;
}

tt, code, pre {
   font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}

h1 {
   font-size:2.2em;
}

h2 {
   font-size:1.8em;
}

h3 {
   font-size:1.4em;
}

h4 {
   font-size:1.0em;
}

h5 {
   font-size:0.9em;
}

h6 {
   font-size:0.8em;
}

a:visited {
   color: rgb(50%, 0%, 50%);
}

pre {
   margin-top: 0;
   max-width: 95%;
   border: 1px solid #ccc;
   white-space: pre-wrap;
}

pre code {
   display: block; padding: 0.5em;
}

code.r, code.cpp {
   background-color: #F8F8F8;
}

table, td, th {
  border: none;
}

blockquote {
   color:#666666;
   margin:0;
   padding-left: 1em;
   border-left: 0.5em #EEE solid;
}

hr {
   height: 0px;
   border-bottom: none;
   border-top-width: thin;
   border-top-style: dotted;
   border-top-color: #999999;
}

@media print {
   * {
      background: transparent !important;
      color: black !important;
      filter:none !important;
      -ms-filter: none !important;
   }

   body {
      font-size:12pt;
      max-width:100%;
   }

   a, a:visited {
      text-decoration: underline;
   }

   hr {
      visibility: hidden;
      page-break-before: always;
   }

   pre, blockquote {
      padding-right: 1em;
      page-break-inside: avoid;
   }

   tr, img {
      page-break-inside: avoid;
   }

   img {
      max-width: 100% !important;
   }

   @page :left {
      margin: 15mm 20mm 15mm 10mm;
   }

   @page :right {
      margin: 15mm 10mm 15mm 20mm;
   }

   p, h2, h3 {
      orphans: 3; widows: 3;
   }

   h2, h3 {
      page-break-after: avoid;
   }
}

</style>


</head>

<body>
<h1>CodeBook</h1>

<h2>Coursera Getting and Cleaning Data: Project</h2>

<h3>Purpose</h3>

<p>This Code Book describes the variables, the data, and any transformations or work that was performed to clean up the data and produce &ldquo;tidydata.csv&rdquo;.</p>

<h3>Variables</h3>

<p>Below is an extract that describes each of the variable types in the dataset &ldquo;tidydata.csv&rdquo;.</p>

<ul>
<li>id: Unique record identifier in tidydata.csv</li>
<li>Subjects: The subjects who undertook the trial</li>
<li>activityDesc: A description of the activity that was carried out, e.g. WALKING, STANDING, etc.</li>
<li>Features which have been selected from the original UCI HAR Dataset. These features are described below:</li>
</ul>

<h4>Feature Selection</h4>

<p>The features selected for this database come from the accelerometer and gyroscope 3-axial raw signals tAcc-XYZ and tGyro-XYZ. These time domain signals (prefix &#39;t&#39; to denote time) were captured at a constant rate of 50 Hz. Then they were filtered using a median filter and a 3rd order low pass Butterworth filter with a corner frequency of 20 Hz to remove noise. Similarly, the acceleration signal was then separated into body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ) using another low pass Butterworth filter with a corner frequency of 0.3 Hz. </p>

<p>Subsequently, the body linear acceleration and angular velocity were differentiated with respect to time to obtain Jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ). The magnitudes of these three-dimensional signals were also calculated using the Euclidean norm (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, tBodyGyroJerkMag). </p>
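
<p>As a rough illustration of the two derived quantities above, the following Python sketch computes a Euclidean-norm magnitude and a Jerk signal. It is not the code used to build the UCI HAR Dataset: the function names, the sample values and the simple forward-difference scheme are assumptions made for this example. Only the 50 Hz sampling rate comes from the description above.</p>

```python
import math

SAMPLE_RATE_HZ = 50  # sampling rate stated in the feature description


def magnitude(x, y, z):
    """Euclidean norm of a single 3-axial sample."""
    return math.sqrt(x * x + y * y + z * z)


def jerk(signal, rate_hz=SAMPLE_RATE_HZ):
    """Approximate the time derivative of a 1-D signal.

    A forward difference is an assumption; the dataset authors do not
    say which differentiation scheme they used.
    """
    return [(b - a) * rate_hz for a, b in zip(signal, signal[1:])]


# Made-up values, for illustration only
acc_x = [0.0, 1.0, 3.0]
mag = magnitude(3.0, 4.0, 0.0)
jerk_x = jerk(acc_x)
print(mag, jerk_x)
```

<p>A Jerk signal computed this way has one sample fewer than its input, because each value is built from a pair of neighbouring samples.</p>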

<p>Finally a Fast Fourier Transform (FFT) was applied to some of these signals producing fBodyAcc-XYZ, fBodyAccJerk-XYZ, fBodyGyro-XYZ, fBodyAccJerkMag, fBodyGyroMag, fBodyGyroJerkMag. (Note the &#39;f&#39; to indicate frequency domain signals). </p>

<p>These signals were used to estimate variables of the feature vector for each pattern:<br/>
&#39;-XYZ&#39; is used to denote 3-axial signals in the X, Y and Z directions.</p>

<p>tBodyAcc-XYZ<br/>
tGravityAcc-XYZ<br/>
tBodyAccJerk-XYZ<br/>
tBodyGyro-XYZ<br/>
tBodyGyroJerk-XYZ<br/>
tBodyAccMag<br/>
tGravityAccMag<br/>
tBodyAccJerkMag<br/>
tBodyGyroMag<br/>
tBodyGyroJerkMag<br/>
fBodyAcc-XYZ<br/>
fBodyAccJerk-XYZ<br/>
fBodyGyro-XYZ<br/>
fBodyAccMag<br/>
fBodyAccJerkMag<br/>
fBodyGyroMag<br/>
fBodyGyroJerkMag  </p>

<p>The set of variables estimated from these signals is: </p>

<p>mean(): Mean value<br/>
std(): Standard deviation<br/>
mad(): Median absolute deviation<br/>
max(): Largest value in array<br/>
min(): Smallest value in array<br/>
sma(): Signal magnitude area<br/>
energy(): Energy measure. Sum of the squares divided by the number of values.<br/>
iqr(): Interquartile range<br/>
entropy(): Signal entropy<br/>
arCoeff(): Autoregression coefficients with Burg order equal to 4<br/>
correlation(): correlation coefficient between two signals<br/>
maxInds(): index of the frequency component with largest magnitude<br/>
meanFreq(): Weighted average of the frequency components to obtain a mean frequency<br/>
skewness(): skewness of the frequency domain signal<br/>
kurtosis(): kurtosis of the frequency domain signal<br/>
bandsEnergy(): Energy of a frequency interval within the 64 bins of the FFT of each window.<br/>
angle(): Angle between two vectors.  </p>
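
<p>To make a few of these definitions concrete, the sketch below computes mean(), std(), mad(), max(), min() and energy() for one window of samples using only the Python standard library. The function name <code>summarize</code> and the sample window are hypothetical, and the use of the population standard deviation is an assumption: the source does not say whether the population or sample form was used.</p>

```python
import statistics


def summarize(window):
    """Return a handful of the listed statistics for one signal window."""
    med = statistics.median(window)
    return {
        "mean": statistics.fmean(window),
        # Population form is an assumption, not confirmed by the source
        "std": statistics.pstdev(window),
        # Median of the absolute deviations from the median
        "mad": statistics.median([abs(v - med) for v in window]),
        "max": max(window),
        "min": min(window),
        # Sum of the squares divided by the number of values
        "energy": sum(v * v for v in window) / len(window),
    }


# Made-up window, for illustration only
window = [1.0, 2.0, 3.0, 4.0, 10.0]
stats = summarize(window)
print(stats)
```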

<p>Additional vectors are obtained by averaging the signals in a signal window sample. These are used in the angle() variable:</p>

<p>gravityMean<br/>
tBodyAccMean<br/>
tBodyAccJerkMean<br/>
tBodyGyroMean<br/>
tBodyGyroJerkMean  </p>

<p>The complete list of variables of each feature vector is available in &#39;UCI HAR Dataset/features.txt&#39; </p>

<h3>Data</h3>

<p>Data can be found in your working directory in &ldquo;tidydata.csv&rdquo; and &ldquo;averages.csv&rdquo;.</p>

<h4>tidydata.csv</h4>

<p>This file contains the means and standard deviations for each set of the measurements for each feature plus the activity that was taking place and the subject that carried out the trial. There are multiple records for each activity and each subject.    </p>

<p>For a complete list of variables in &ldquo;tidydata.csv&rdquo;, please see variables.md.</p>

<h4>averages.csv</h4>

<p>This file contains the averages of the means and standard deviations for all the measurements in both the training and test data sets. The averages have been calculated for each subject within each activity. The data is ordered by activity, then by subject.</p>

<p>Variables in this file are:  </p>

<ul>
<li>activityDesc<br/></li>
<li>Subjects<br/></li>
<li>average of means and standard deviations&hellip;.</li>
</ul>

<p>For a complete list of variables in &ldquo;averages.csv&rdquo;, please see variables.md.</p>

<h3>Transformations</h3>

<p>Steps involved in producing &ldquo;tidydata.csv&rdquo; and &ldquo;averages.csv&rdquo;:</p>

<h4>UCI HAR Dataset Data Model</h4>

<p>For an idea of how the UCI HAR data links together, see the data model at the link below:<br/>
<a href="https://docs.google.com/presentation/d/1c38KQPjkOHfm-b4j9FZWysKkAPRU-CrxhdJAYTKkL4Q/edit?usp=sharing">https://docs.google.com/presentation/d/1c38KQPjkOHfm-b4j9FZWysKkAPRU-CrxhdJAYTKkL4Q/edit?usp=sharing</a></p>

<h4>Import Data from UCI HAR Dataset and produce &ldquo;tidydata.csv&rdquo;</h4>

<ol>
<li>Read in reference data from files and add column names. This reference data (contained in features.txt and activities.txt) provides the link to the activities that occurred to produce the results in X_test.txt, Y_test.txt, X_train.txt and Y_train.txt.</li>
<li>Read in Test data from files and add column names. Note: X_test.txt gets its column names from features.txt</li>
<li>Make a large flat data frame for Test Data called &#39;tt&#39;</li>
<li>Repeat the same steps for the Training data. Read in Training data from files and add column names. Note: X_train.txt gets its column names from features.txt</li>
<li>Make a large flat data frame for Training Data called &#39;tn&#39;</li>
<li>Merge the training and test data sets using a union methodology (rbind to append the test data to the training dataset). During the creation of the Training and Test datasets, a column called dataSetInd was added to indicate which dataset each record originally came from, e.g. Training or Test.</li>
<li>Next get only measurements on the mean and standard deviation for each measurement as the assignment requests.</li>
<li>Add these measurements to the Tidy Dataset and write it to file as &ldquo;tidydata.csv&rdquo;.</li>
</ol>
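
<p>The steps above can be sketched in Python as follows. This is an illustration of the workflow, not the script that produced &ldquo;tidydata.csv&rdquo;: the helper names <code>label_rows</code> and <code>keep_mean_and_std</code> are hypothetical, the rows are made-up values rather than real measurements, and selecting features whose names contain the literal text mean() or std() is an assumption about how the mean and standard deviation columns were identified. File reading and writing are omitted.</p>

```python
def label_rows(rows, feature_names, subjects, activities, data_set_ind):
    """Attach column names, subject, activity and an origin flag to raw rows."""
    labelled = []
    for values, subject, activity in zip(rows, subjects, activities):
        record = dict(zip(feature_names, values))
        record["Subjects"] = subject
        record["activityDesc"] = activity
        record["dataSetInd"] = data_set_ind
        labelled.append(record)
    return labelled


def keep_mean_and_std(records, id_columns=("Subjects", "activityDesc", "dataSetInd")):
    """Keep identifier columns plus features named with mean() or std()."""
    kept = []
    for record in records:
        kept.append({
            name: value
            for name, value in record.items()
            if name in id_columns or "mean()" in name or "std()" in name
        })
    return kept


# Made-up rows standing in for X_train.txt and X_test.txt
features = ["tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-max()-X"]
tn = label_rows([[0.1, 0.2, 0.9]], features, [1], ["WALKING"], "Training")
tt = label_rows([[0.3, 0.4, 0.8]], features, [2], ["STANDING"], "Test")

merged = tn + tt                  # union: test rows appended to training rows
tidy = keep_mean_and_std(merged)  # only mean and standard deviation features
print(tidy)
```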

<h4>Calculate Averages for each measurement in &ldquo;tidydata.csv&rdquo; and export as &ldquo;averages.csv&rdquo;</h4>

<ol>
<li>Next, create a second, independent dataset with the averages of each variable for each activity and each subject. Export it as &ldquo;averages.csv&rdquo;.</li>
<li>Create data.table called dt (data table is faster for aggregations and merges)</li>
<li>Remove &#39;-&#39;, &#39;,&#39; and &#39;()&#39; from names in data table so they can be accessed in the data.table</li>
<li>Group by Activities and Subjects and calculate the mean for each column of mean and standard deviation measurements</li>
<li>Order data by Activity then by Subjects </li>
<li>Write data to file averages.csv in your working directory</li>
</ol>
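
<p>A minimal Python sketch of the aggregation described above is shown below. The original work uses an R data.table; this standard-library version only mirrors the logic. The function names <code>clean_name</code> and <code>average_by_group</code> and the sample records are hypothetical, and the sketch assumes every non-key column is numeric, so a text column such as dataSetInd would have to be dropped first.</p>

```python
import re
from collections import defaultdict
from statistics import fmean


def clean_name(name):
    """Remove '-', ',' and '()' characters from a column name."""
    return re.sub(r"[-,()]", "", name)


def average_by_group(records, keys=("activityDesc", "Subjects")):
    """Average every measurement column per activity and subject."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record)

    averaged = []
    # Sorting the group keys orders rows by activity, then by subject
    for group_key in sorted(groups):
        members = groups[group_key]
        row = dict(zip(keys, group_key))
        for column in members[0]:
            if column not in keys:
                row[clean_name(column)] = fmean(m[column] for m in members)
        averaged.append(row)
    return averaged


# Made-up records, for illustration only
records = [
    {"activityDesc": "WALKING", "Subjects": 2, "tBodyAcc-mean()-X": 1.0},
    {"activityDesc": "WALKING", "Subjects": 1, "tBodyAcc-mean()-X": 2.0},
    {"activityDesc": "WALKING", "Subjects": 1, "tBodyAcc-mean()-X": 4.0},
    {"activityDesc": "STANDING", "Subjects": 1, "tBodyAcc-mean()-X": 0.5},
]
averages = average_by_group(records)
print(averages)
```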

</body>

</html>