Added required assignment documents - Chad #5
Open: cjc2238 wants to merge 1 commit into feature-engineering-studio:master from cjc2238:master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| # Project | ||
| Course Project | ||
|
|
||
| This page introduces the anonymised Open University Learning Analytics Dataset (OULAD). It contains data about courses, students and their interactions with the Virtual Learning Environment (VLE) for seven selected courses (called modules). Presentations of courses start in February and October; they are marked by “B” and “J” respectively. The dataset consists of tables connected using unique identifiers. All tables are stored in the csv format. | ||
|
|
||
| The dataset used for this project is sourced from the Open University and is available here: https://analyse.kmi.open.ac.uk/open_dataset | ||
|
|
||
| Kuzilek, J., Hlosta, M., Herrmannova, D., Zdrahal, Z. and Wolff, A. OU Analyse: Analysing At-Risk Students at The Open University. Learning Analytics Review, no. LAK15-1, March 2015, ISSN: 2057-7494. | ||
|
|
||
| Description of the data: | ||
|
|
||
| courses.csv | ||
|
|
||
| File contains the list of all available modules and their presentations. The columns are: | ||
|
|
||
| code_module – code name of the module, which serves as the identifier. | ||
| code_presentation – code name of the presentation. It consists of the year and “B” for the presentation starting in February and “J” for the presentation starting in October. | ||
| length - length of the module-presentation in days. | ||
|
|
||
| The structure of B and J presentations may differ, so it is good practice to analyse the B and J presentations separately. Nevertheless, for some presentations the corresponding previous B/J presentation does not exist, and therefore the J presentation must be used to inform the B presentation or vice versa. In the dataset this is the case for the CCC, EEE and GGG modules. | ||
|
|
||
|
|
||
| assessments.csv | ||
|
|
||
| This file contains information about assessments in module-presentations. Usually, every presentation has a number of assessments followed by the final exam. The file contains the following columns: | ||
|
|
||
| code_module – identification code of the module, to which the assessment belongs. | ||
| code_presentation - identification code of the presentation, to which the assessment belongs. | ||
| id_assessment – identification number of the assessment. | ||
| assessment_type – type of assessment. Three types of assessments exist: Tutor Marked Assessment (TMA), Computer Marked Assessment (CMA) and Final Exam (Exam). | ||
| date – information about the final submission date of the assessment calculated as the number of days since the start of the module-presentation. The starting date of the presentation has number 0 (zero). | ||
| weight - weight of the assessment in %. Typically, Exams are treated separately and have the weight 100%; the sum of all other assessments is 100%. | ||
|
|
||
| If the final exam date is missing, the exam takes place at the end of the last presentation week. | ||
|
|
||
| vle.csv | ||
|
|
||
| The csv file contains information about the available materials in the VLE. Typically these are html pages, pdf files, etc. Students have access to these materials online and their interactions with the materials are recorded. The vle.csv file contains the following columns: | ||
|
|
||
| id_site – an identification number of the material. | ||
| code_module – an identification code for the module. | ||
| code_presentation - the identification code of the presentation. | ||
| activity_type – the role associated with the module material. | ||
| week_from – the week from which the material is planned to be used. | ||
| week_to – week until which the material is planned to be used. | ||
|
|
||
| studentInfo.csv | ||
|
|
||
| This file contains demographic information about the students together with their results. File contains the following columns: | ||
|
|
||
| code_module – an identification code for a module on which the student is registered. | ||
| code_presentation - the identification code of the presentation during which the student is registered on the module. | ||
| id_student – a unique identification number for the student. | ||
| gender – the student’s gender. | ||
| region – identifies the geographic region, where the student lived while taking the module-presentation. | ||
| highest_education – highest student education level on entry to the module presentation. | ||
| imd_band – specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation. | ||
| age_band – band of the student’s age. | ||
| num_of_prev_attempts – the number of times the student has attempted this module. | ||
| studied_credits – the total number of credits for the modules the student is currently studying. | ||
| disability – indicates whether the student has declared a disability. | ||
| final_result – student’s final result in the module-presentation. | ||
|
|
||
| studentRegistration.csv | ||
|
|
||
| This file contains information about the time when the student registered for the module presentation. For students who unregistered, the date of unregistration is also recorded. The file contains five columns: | ||
|
|
||
| code_module – an identification code for a module. | ||
| code_presentation - the identification code of the presentation. | ||
| id_student – a unique identification number for the student. | ||
| date_registration – the date of the student’s registration on the module presentation, given as the number of days relative to the start of the module-presentation (e.g. the value -30 means that the student registered 30 days before it started). | ||
| date_unregistration – the date of the student’s unregistration from the module presentation, given as the number of days relative to the start of the module-presentation. Students who completed the course have this field empty. Students who unregistered have Withdrawal as the value of the final_result column in the studentInfo.csv file. | ||
|
|
||
| studentAssessment.csv | ||
|
|
||
| This file contains the results of students’ assessments. If the student does not submit the assessment, no result is recorded. Final exam submissions are missing if the result of the assessment is not stored in the system. This file contains the following columns: | ||
|
|
||
| id_assessment – the identification number of the assessment. | ||
| id_student – a unique identification number for the student. | ||
| date_submitted – the date of student submission, measured as the number of days since the start of the module presentation. | ||
| is_banked – a status flag indicating that the assessment result has been transferred from a previous presentation. | ||
| score – the student’s score in this assessment. Marks range from 0 to 100; a score lower than 40 is interpreted as Fail. | ||
|
|
||
| studentVle.csv | ||
|
|
||
| The studentVle.csv file contains information about each student’s interactions with the materials in the VLE. This file contains the following columns: | ||
| code_module – an identification code for a module. | ||
| code_presentation - the identification code of the module presentation. | ||
| id_student – a unique identification number for the student. | ||
| id_site - an identification number for the VLE material. | ||
| date – the date of student’s interaction with the material measured as the number of days since the start of the module-presentation. | ||
| sum_click – the number of times a student interacts with the material on that day. |
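
The column descriptions above imply how the tables link together: studentAssessment.csv carries no module or presentation columns, so it reaches a module-presentation only through id_assessment. The following is a minimal sketch of that link on a few toy rows. The values are invented for illustration and are not taken from OULAD; the `failed` and `days_late` columns are hypothetical names introduced here, not part of the dataset.

```r
# Toy rows only: every value below is made up for illustration.
assessments <- data.frame(
  code_module       = c("FFF", "FFF"),
  code_presentation = c("2014J", "2014J"),
  id_assessment     = c(1L, 2L),
  assessment_type   = c("TMA", "CMA"),
  date              = c(19L, 54L),  # final submission date, days since presentation start
  weight            = c(10, 20),
  stringsAsFactors  = FALSE
)

student_assessment <- data.frame(
  id_assessment  = c(1L, 1L, 2L),
  id_student     = c(101L, 102L, 101L),
  date_submitted = c(18L, 21L, 54L),  # days since presentation start
  is_banked      = c(0L, 0L, 0L),
  score          = c(78, 35, 62)
)

# Link each submission to its assessment (and so to its module-presentation).
joined <- merge(student_assessment, assessments, by = "id_assessment")

# Scores lower than 40 are interpreted as Fail, per the score description.
joined$failed <- joined$score < 40

# Hypothetical derived column: positive values mean the submission
# arrived after the final submission date.
joined$days_late <- joined$date_submitted - joined$date

print(joined[, c("id_student", "id_assessment", "assessment_type",
                 "score", "failed", "days_late")])
```

Because both date columns count days from the start of the module-presentation, they can be subtracted directly without any calendar conversion.
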
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| #################### | ||
| ## Load Libraries ## | ||
| #################### | ||
|
|
||
| library(tidyr) | ||
| library(dplyr) | ||
|
|
||
| ################################# | ||
| ## Load Data Frames from Files ## | ||
| ################################# | ||
|
|
||
| # Set working directory to data folder | ||
| setwd("~/GitHub/Project Data Files") | ||
|
|
||
| # Load .CSV files and convert to dataframes | ||
| assessments_df <- read.csv("assessments.csv") | ||
| courses_df <- read.csv("courses.csv") | ||
| std_assessments_df <- read.csv("studentAssessment.csv") | ||
| std_info_df <- read.csv("studentInfo.csv") | ||
| std_registration_df <- read.csv("studentRegistration.csv") | ||
| std_vle_df <- read.csv("studentVle.csv") | ||
| vle_df <- read.csv("vle.csv") | ||
|
|
||
| # Set Project working directory | ||
| setwd("~/GitHub/Feature-Engineering-Project") | ||
|
|
||
| ################################### | ||
| #### Begin Data Processing #### | ||
| ################################### | ||
|
|
||
| ## Create data frames for presentation 2014J of module "FFF" ## | ||
|
|
||
| assessment_fff_2014j <- subset(assessments_df, code_module %in% "FFF" & code_presentation %in% "2014J", select = c(id_assessment, assessment_type, date, weight)) | ||
|
|
||
| std_info_df_fff_2014j <- subset(std_info_df, code_module %in% "FFF" & code_presentation %in% "2014J", select = c(id_student, gender, region, highest_education, imd_band, age_band, num_of_prev_attempts, studied_credits, disability, final_result)) | ||
|
|
||
| std_registration_fff_2014j <- subset(std_registration_df, code_module %in% "FFF" & code_presentation %in% "2014J", select = c(id_student, date_registration, date_unregistration)) | ||
|
|
||
| std_vle_fff_2014j <- subset(std_vle_df, code_module %in% "FFF" & code_presentation %in% "2014J", select = c(id_student, id_site, date, sum_click)) | ||
|
|
||
| vle_fff_2014j <- subset(vle_df, code_module %in% "FFF" & code_presentation %in% "2014J", select = c(id_site, activity_type)) | ||
|
|
||
|
|
||
| ######################## | ||
| ## Feature Generation ## | ||
| ######################## | ||
|
|
||
| # Create new data frame by combining Assessment data with Student Assessment data | ||
| combined_assessments_fff_2014j <- merge(std_assessments_df, assessment_fff_2014j, by = "id_assessment", all = FALSE) | ||
|
|
||
| # Create new data frame by combining Assessment data with Student VLE data | ||
| std_assessment_vle_fff_2014j <- merge(combined_assessments_fff_2014j, std_vle_fff_2014j, by= "id_student") | ||
|
|
||
| # Create new data frame by combining student VLE data with VLE description data | ||
| std_assessment_vle_fff_2014j <- merge(std_assessment_vle_fff_2014j, vle_fff_2014j, by= "id_site") | ||
|
|
||
| # Create data frame of average score for specific assessment type of each student. | ||
| assessment_score_avg_fff_2014j <- std_assessment_vle_fff_2014j %>% dplyr::group_by(assessment_type, id_student) %>% dplyr::summarise(mean(score)) | ||
|
|
||
| # Create table with average number of clicks per student and activity type | ||
| activity_type_sum_click_fff_2014j <- std_assessment_vle_fff_2014j %>% dplyr::group_by(activity_type, id_student) %>% dplyr::summarise(mean(sum_click)) | ||
|
|
||
| # Spread new tables | ||
| spread_assessment_core_avg_fff_2014j <- tidyr::spread(assessment_score_avg_fff_2014j, assessment_type, 'mean(score)') | ||
|
|
||
| spread_activity_type_avg_click_fff_2014j <- tidyr::spread(activity_type_sum_click_fff_2014j, activity_type, 'mean(sum_click)') | ||
|
|
||
| # Merge the averaged features with student info and registration data | ||
| df <- merge(spread_activity_type_avg_click_fff_2014j, spread_assessment_core_avg_fff_2014j, by = "id_student") | ||
| df <- merge(df, std_info_df_fff_2014j, by ="id_student") | ||
| df <- merge(df, std_registration_fff_2014j, by = "id_student") | ||
|
|
||
| ################################## | ||
| ## Prepare Data.frame for model ## | ||
| ################################## | ||
|
|
||
| # Check for missing values in data frame | ||
| apply(df,2,function(x) sum(is.na(x))) | ||
|
|
||
| ################################# | ||
| ## Export and Save New Dataset ## | ||
| ################################# | ||
|
|
||
| # Set working directory to the export folder | ||
| setwd("~/GitHub/Feature-Engineering-Project/Data Upload Assignment") | ||
|
|
||
| write.csv(df, file = "Tidy_Data_fff_2014j_Dataset.csv") | ||
|
|
||
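
One point a reviewer might raise about the feature-generation step: merging per-student assessment rows with per-student VLE rows on id_student pairs every assessment row with every VLE row for that student, so the intermediate table grows multiplicatively before it is averaged. A possible alternative, sketched below on invented toy data, is to aggregate each source to one row per student first and merge afterwards. This is a sketch under assumptions, not a drop-in replacement for the script: the data frames, their values, and the names `mean_score` and `mean_click` are hypothetical, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Toy inputs: all values are made up for illustration.
scores <- data.frame(
  id_student       = c(1L, 1L, 2L),
  assessment_type  = c("TMA", "CMA", "TMA"),
  score            = c(80, 60, 50),
  stringsAsFactors = FALSE
)
clicks <- data.frame(
  id_student       = c(1L, 1L, 1L, 2L),
  activity_type    = c("forumng", "forumng", "quiz", "quiz"),
  sum_click        = c(2, 4, 3, 5),
  stringsAsFactors = FALSE
)

# Joining the raw rows repeats each score once per click row of the same student.
naive <- merge(scores, clicks, by = "id_student")

# Aggregate each source to one row per student, then widen.
score_avg <- scores %>%
  group_by(id_student, assessment_type) %>%
  summarise(mean_score = mean(score), .groups = "drop") %>%
  spread(assessment_type, mean_score)

click_avg <- clicks %>%
  group_by(id_student, activity_type) %>%
  summarise(mean_click = mean(sum_click), .groups = "drop") %>%
  spread(activity_type, mean_click)

# Merge the two small per-student tables.
features <- merge(click_avg, score_avg, by = "id_student")

print(nrow(naive))   # rows in the raw many-to-many join
print(features)      # one row per student
```

Since each student's rows are repeated evenly in the raw join, the per-student means should match those computed on the joined table; the gain is in memory and run time, which matters because studentVle.csv holds one row per student, material and day.
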
Excellent work