diff --git a/Stata Fundamentals I/Stata1_Polls.docx b/Stata Fundamentals I/Stata1_Polls.docx index 64bb9e8..5f9508d 100644 Binary files a/Stata Fundamentals I/Stata1_Polls.docx and b/Stata Fundamentals I/Stata1_Polls.docx differ diff --git a/Stata Fundamentals I/workshop1_content.do b/Stata Fundamentals I/workshop1_content.do index bb23e9b..4d3033f 100644 --- a/Stata Fundamentals I/workshop1_content.do +++ b/Stata Fundamentals I/workshop1_content.do @@ -28,102 +28,103 @@ * SETTING UP * ****************************** +* In this section, we'll learn how to import your data into Stata and how to run your code. In Stata, we often refer to the file that contains your code as a ".do file". -* clear all previously open data, variables, labels, matrices, memory, etc., -* and close all open files, graph windows, etc +* first, clear all previously open data, variables, labels, matrices, memory, etc., and close all open files, graph windows, etc clear all +* SET A WORKING DIRECTORY -* SET A PROJECT DIRECTORY +* The working directory is the folder on your computer containing the files associated with a given Stata file - it's where Stata will look when you try to load data, and where it will export your data, unless you specify otherwise. Your working directory doesn't necessarily have to be the same folder where your do-file is saved. Stata will choose a default directory based on your application settings, and we can use a command to check which directory that is: pwd // check the current working directory +* Often, the current working directory isn't actually the folder we want to use to store our work - below, we show how you can change the working directory to your desired folder. 
* Method: Copy & Paste /**** - This is not the most efficient method for telling Stata to locate - and open your files, but it is the simplest, so we will work with - this for day 1 of the workshop +This is not the most efficient method for telling Stata where to locate and open your files, but it is the simplest, so we will work with this for day 1 of the workshop ****/ - - - -/* Step 1: File > Change Working Directory > Navigate to the folder where you - have saved the data file nlsw88.dta */ +/* Step 1: File > Change Working Directory > Navigate to the folder where you have saved the data file nlsw88.dta */ /* Step 2: Copy-paste the last command that shows up on result screen. My results window shows this:*/ -cd "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals I" +cd "\\Client\C$\Users\salma\Box\dlab_workshops-s21\stata-fundamentals\Stata Fundamentals I" + - /*** - We paste this command above so that next time we can just run this - do-file from the top and it will run smoothly. We will not need to - use the file menu or copy-paste again. We should be able to run - everything from the do-file. +We paste this command above so that next time we can just run this do-file from the top and it will run smoothly. We will not need to use the file menu or copy-paste again. We should be able to run everything from the do-file. ***/ // POLL 1 // +/*** +Run the command “pwd”. Is your working directory set to the proper folder on YOUR computer? +(1) Yes +(2) No +(3) Don’t know +***/ - -/* Step 3: Open the data file */ +/* Step 3: Open the data file.*/ use nlsw88.dta , clear // open data file -/* You can also write: */ - -use nlsw88 , clear -// don't have to specify .dta, that is the default extension - - /*** - If you are using an older version of Stata, then please use: - -use nlsw88_13, clear +Stata uses a special file type called .dta files to save data in a table format (similar to how Excel has their own .xlsx file types). 
At the end of today's workshop, we'll go over how to import other file types that are more common (e.g. .csv files). ***/ +/* You can also write: */ +use nlsw88 , clear // don't have to specify .dta, that is the default extension +/*** +If you are using an older version of Stata, then please use: use nlsw88_13, clear +***/ /* - Pause here, and now highlight everything above this point and click - on the "do" button. It should run smoothly! You've started to create a - functional do-file! +Pause here, and now highlight everything above this point and click on the "do" button. It should run smoothly! You've started to create a functional .do file! */ - ****************************** * COMMENTING * ****************************** -// There are a bunch of ways to comment your .do file. -// Commenting is key for understanding your work. +// Comments are meant to make your code easier to understand. There are a bunch of ways to comment your .do file; comments will show up in green text while commands (the steps you want Stata to execute) will show up in black or blue text. -* You can also just put an asterisk at the beginning of a line -* You can use * to comment out lines of code that you want to suspend -// you can use double slash to make comments at the end of a command line or just as a line by itself (like this one) -// asterisk (*) cannot be placed at the end of a command line +/* +If you want to write a really long and super informative comment and you want to clearly show where the comment begins and ends, you can wrap it in a slash-asterisk (/* at the start, and */ at the end), like this one we're typing right now. 
+*/ +* You can put an asterisk at the beginning of a line to comment it out +* You can also use * to comment out lines of code that you want to suspend +// You can use double slash to make comments at the end of a command line or just as a line by itself (like this one) +// asterisk (*) cannot be placed at the end of a command line - it can only be used on a line by itself. For example: des // describes the variables in the data des * describes the variables in the data <-- this is wrong! * des // this suspended the command altogether -/* But then say you wanted to write a really long -and super informative comment that you didn't want -to have all on one line, like this one we're typing -right now. */ - +/* +Try highlighting and running just the three lines above. The first line should run smoothly, giving us a description of the data. The second should give an error because it's not commented properly! +*/ // POLL 2 // +/*** +Which of these include both a command and a comment (and would run the command without error)? 
-** CHALLENGE ** +(1) /*clear all – make sure environment is clear before reading in new data*/ +(2) clear all * make sure environment is clear before reading in new data +(3) *clear all – make sure environment is clear before reading in new data +(4) // clear all // make sure environment is clear before reading in new data +(5) clear all // make sure environment is clear before reading in new data +***/ + +/* Challenge question 1 */ /* (1) write "describes data" NEXT to the command "des" below as a comment @@ -134,62 +135,68 @@ des sum count - - ********************************************** * EXAMINING A DATA SET: THE BASICS * ********************************************** -** It is good practice to LOOK at your data before you start working with it -** That way, you get an idea of its shape and the variables in it quickly +* It is good practice to LOOK at your data before you start working with it +* That way, you get an idea of its shape and the variables in it quickly * DESCRIBE des // describes dataset and variables - * BROWSE br // browse data in data editor -**What do the different colors mean? -des married married_txt - +/* Challenge question 2 */ +/* +What do the different colors mean in the data editor? +*/ +des married married_txt // we can describe selected variables by specifying which ones - here, we're just describing the married and married_txt fields * CODEBOOK +* The codebook command in Stata provides additional details about each variable, beyond what is output by des +* We could use it to get a codebook of all the variables, which will result in a lengthy output. For now, let's just look at the contents of the variable union codebook union // shows the contents of the variable union - // POLL 3 // - +/*** +What information is NOT included in the output of the command “codebook union”? 
+(1) variable type +(2) value labels +(3) number of observations +(4) mean +***/ * COUNT count // counts the number of observations +* The command above counts the total number of observations in our dataset. But we can also count observations with a condition. For instance, if we want to know how many rows represent individuals who are over 40, we can use: +count if age > 40 // counts the number of observations where age is greater than 40 -/* CHALLENGE ** -(3) Count the number of observations that are union members. -variable: union +/* Challenge question 3 */ +/* +Count the number of observations that are union members. +(hint: you can use the command + codebook union +to first figure out what different values the variable union can have) */ +* SUMMARIZE +* shows number of observations, mean, min & max of all or some vars +sum // summarize all variables - - -* SUMMARIZE * - -* shows number of observations, mean, min & max of all/some vars -sum - - -* MISSING VALUES * - - -/* Notice the observation numbers. Why do some variables have fewer observations? */ +* MISSING VALUES +* Notice the observation numbers. Why do some variables have fewer observations? misstable summarize // tabulates missing values -codebook union +/* +Stata uses a period (".") to indicate missing values - so the count of missing values for each variable can be found in the column "Obs=." 
+*/ ************************************************************** @@ -203,13 +210,13 @@ Now let's start looking at some summary statistics using command: "summarize" (or "sum" for short) */ -sum // summarize the data, presents summary statistics -sum wage -sum wage, detail +sum // summarize all data, presents summary statistics +sum wage // summarize only the wage variable +sum wage, detail // summarize the wage variable with detailed summary statistics * we can combine conditional operators with the summarize command -* What is the average wage of observations who are married in this sample +* For example, we can determine the average wage of observations who are married in this sample sum wage if married==1 @@ -228,49 +235,47 @@ For the college graduates in the sample, those who are unmarried earn more (11.30) on average than those who are married (10.10).*/ - -** CHALLENGES ** - +/* Challenge question 4 */ /* -(4) What is the mean wage for those who are not married? +What is the mean wage for those who are not married? variables: wage married (hint: Use the operator "if") */ - +/* Challenge question 5 */ /* -(5) What is the average wage of those who have worked 10 or more years? +What is the average wage of those who have worked 10 or more years? variables: wage tenure +(hint: For numerical variables, Stata considers missing values to be very, very large numbers - so if you are selecting a subset of your data by value (e.g. "age > 40"), it is important to additionally specify that you want to exclude missing values (i.e. "age != .") */ - +/* Challenge question 6 */ /* -(6) What is the average number of hours worked in the sample? +What is the average number of hours worked in the sample? variable: hours */ - +/* Challenge question 7 */ /* -(7) What is the average age and age range of this sample? +What is the average age and age range of this sample? Variable: age */ - +/* Challenge question 8 */ /* -(8) What is the average age for non-married observations? 
+What is the average age for non-married observations? variables: age, married */ - // Let's look at how missing variables can affect results: // Suppose we want to summarize wages for those individuals who are in unions (union=1) @@ -283,87 +288,100 @@ variables: age, married /* E */ sum wage if union==1 // POLL 4 // - +/*** +Suppose we want to summarize wages for those individuals who are in unions (union=1) +Which of the five options in the do file are correct? +(1) All +(2) E, only +(3) A, B +(4) C, D, E +(5) B, C, D +***/ * TABULATION & CROSS TABULATION * +/* +Another useful command for summarizing variables is "tabulate", or "tab" for short. +*/ - - -// Very helpful for categorical variables +// It's particularly helpful for categorical variables tab race tab collgrad tab union tab union if hours>=60 & hours<. -* Twoway tables +// The tables we created above are one way tables - they summarize one variable. But we can also give tab a list of two variables and create two-way tables tab union collgrad -tab union collgrad, col -tab union collgrad, row -tab union collgrad, col row - -tab union collgrad, cell +tab union collgrad, col // the option "col" shows the relative frequency of each row value (union worker) within each column value (collgrad) - for instance, we can figure out what percentage of college graduates in our dataset are union workers +tab union collgrad, row // the option "row" shows the relative frequency of each column value (collgrad) within each row value (union worker) - for instance, we can figure out what percentage of non-union workers are college graduates +tab union collgrad, col row // combine "col" and "row" to get both sets of relative percentages +tab union collgrad, cell // the option "cell" shows the relative frequency of each row-column combination within the dataset - for instance, we can figure out what percentage of the dataset are union workers who graduated college // POLL 5 // +/*** +If you wanted to know what percentage of 
white respondents lived in central cities, what code would you write? +(1) tab race c_city, col +(2) tab race c_city, row +(3) tab c_city race, row +(4) tab c_city race, cell +(5) tab race c_city, cell +***/ - -** CHALLENGE ** +/* Challenge question 9 */ /* -(9) How many observations in this dataset fall into each race group? +How many observations in this dataset fall into each race group? Variables: race */ - +/* Challenge question 10 */ /* -(10) What percent of the sample is white? +What percent of the sample is white? Variable: race */ +* TABULATE, SUMMARIZE * +* You can combine tabulating and summary statistics using tab, summarize; this command gives summary statistics of one variable with respect to others -* TABULATE, SUMMARIZE -* Summary statistics of one variable with respect to others /* e.g. What is the average wage for married/non-married OR college graduates/non-graduates? */ +help tabulate_summarize // pull up the documentation for this command +tab collgrad, summarize(wage) means // get mean hourly wage for college graduates/non-graduates +tab married collgrad, summarize(wage) means // get mean hourly wage for combinations of married/non-married and college graduates/non-graduates -help tabulate_summarize -tab collgrad, summarize(wage) means -tab married collgrad, summarize(wage) means - -** CHALLENGE ** +/* Challenge question 11 */ /* -(11) Find average wage by industry. +Find average wage by industry. Variables: industry wage */ +// When you find the average wage by industry, do you notice anything strange? +// Mining wages are much higher than other industries, so let's explore the mining wage... - -// do you notice anything strange about the wages here? -// mining wages are the highest?? - -// Let's explore the mining wage... -// Let's take a look at the observations that work in mining -// first we have to find the industry code that belongs to mining +// Let's take a look at the observations that work in mining. 
First we have to find the industry code that belongs to mining * Finding numeric codes attached to value labels * - -br if industry==Mining // no luck, industry is a numerical variable +// We can start by using the browse command ("br") to open up the data editor for the observations we're interested in. To start, we can try specifying that we only want to look at observations where the industry is Mining... +br if industry==Mining +// ...no luck, industry is a numerical variable! If we try tabulating industry again, we'll only see the text labels... tab industry +// ...but we can figure out what the associated numeric values are by specifying the option "nolabel": tab industry, nolabel +// comparing the no label and the labelled tables shows that mining is given a value of 2: br if industry==2 -/* OR */ -// find name of value label + +// another way to figure out the number associated with each label is to use codebook: codebook industry // you can also use: des industry -// then list the contents of that label +// then list the contents of that label: label list indlbl br if industry==2 @@ -381,13 +399,13 @@ br if industry==2 * (1) Simple numeric variables +// the command "gen" (short for generate) creates a new variable -gen year88=1 +gen year88=1 // create a variable year88, that is equal to 1 for all observations -gen wage_day = wage*8 // wage per day (8 hour workday) - -gen tenure_sqr = tenure^2 +gen wage_day = wage*8 // create a variable wage_day, that estimates the wage per day (assuming an 8 hour workday) +gen tenure_sqr = tenure^2 // create a variable tenure_sqr, that is equal to the tenure squared * (2) Turning our string variable into a numeric variable @@ -397,47 +415,40 @@ gen tenure_sqr = tenure^2 sum wage if married_txt=="Married" sum wage if married == 1 -*Do you see any issues that might arrise from using a string variable? +*Do you see any issues that might arise from using a string variable? 
tab married_txt married - /* Method 1: manual labor */ - *Remember married and married_txt? tab married married_txt tab married_txt -gen married2=1 if married_txt=="M" | married_txt=="Married" | married_txt=="m" /// - | married_txt=="maried" | married_txt=="married" +// we can use a combination of "gen" and "replace" to clean up married_txt by creating another variable, married2 +gen married2=1 if married_txt=="M" | married_txt=="Married" | married_txt=="m" | married_txt=="maried" | married_txt=="married" -replace married2=0 if married_txt=="single" | married_txt=="S" | married_txt=="SINGLE" /// - | married_txt=="Single" | married_txt=="s" | married_txt=="sIngle" /// - | married_txt=="singLe" | married_txt=="single" | married_txt=="single " /// - | married_txt=="single " | married_txt==" single" | married_txt=="single " +replace married2=0 if married_txt=="single" | married_txt=="S" | married_txt=="SINGLE" | married_txt=="Single" | married_txt=="s" | married_txt=="sIngle" | married_txt=="singLe" | married_txt=="single" | married_txt=="single " | married_txt=="single " | married_txt==" single" | married_txt=="single " +// now we have a numerical variable, married2, that is equal to 1 for married observations and 0 for non-married observations + +// if the married_txt string were clean, we could simply use "encode" to convert it to a numerical value encode married_txt, gen(married3) // good when strings are clean -gen married_txt2 = married_txt -replace married_txt2=trim(married_txt2) // removes leading and trailing spaces -replace married_txt2=proper(married_txt2) -replace married_txt2=lower(married_txt2) -replace married_txt2=upper(married_txt2) +// another way to clean text is to use the "trim", "proper", "lower", and "upper" functions +gen married_txt2 = married_txt // create a new variable, married_txt2, that is equal to married_txt (for now) +replace married_txt2=trim(married_txt2) // remove leading and trailing spaces +replace 
married_txt2=proper(married_txt2) // change to "proper" case - first letter upper case +replace married_txt2=lower(married_txt2) // change to lower case +replace married_txt2=upper(married_txt2) // change to upper case /* Method 2 (advanced!): regular expressions */ - - /* -Regular expressions are one way that you can work with strings in variables -Regular expressions are methods that allows for searching, matching and replacing within strings. -There are two main commands: regexr and regexm. +Regular expressions are one way that you can work with strings in variables. Regular expressions are methods that allow for searching, matching, and replacing within strings. There are two main functions: regexr and regexm. regexr REPLACES a value within a string with a new variable - regexm, which we'll use today, combines strings and conditional operators. -regexm lets you search within a string for a given character; it returns 1, or -TRUE, if the string has the character, and 0 otherwise. +regexm lets you search within a string for a given character; it returns 1, or TRUE, if the string has the character, and 0 otherwise. */ gen married4=1 if regexm(married_txt,"m") | regexm(married_txt,"M") @@ -449,7 +460,6 @@ br married_txt married2 married3 married4 tab married married2 - * (3) Create numeric variables from other numeric variables // create a variable that indicates highschool graduate @@ -457,7 +467,7 @@ tab married married2 // (1) missing values // (2) what should =1 and what should =0 for your variable -//let's first look at grade +// let's first look at grade codebook grade // it has missing values, so beware @@ -496,6 +506,12 @@ drop hs2 hs3 // POLL 6 // +/*** +We want to create a variable indicating if people worked full time (40 hrs or more). Which option would NOT generate this variable properly? +(1) gen full_time=hours>=40 +(2) gen full_time=hours>=40 if hours<. 
+(3) recode hours (0/39 = 0) (40/80=1), gen(full_time) +***/ @@ -506,8 +522,6 @@ tab hs1 * LABEL VARIABLES AND ADD VALUE LABELS * - - // Let's rename hs1 as hs rename hs1 hs @@ -535,34 +549,38 @@ label list hs_vallabel // POLL 7 // +/*** +We want to see the value labels and the values that they correspond to for the variable hs. Which option would NOT show us this? +(1) br hs +(2) label list +(3) label list hs_vallabel +(4) tab hs +(5) codebook hs +***/ -** CHALLENGE ** +/* Challenge question 12a */ /* -Let's make and label a new variable about college attendance. -(12a) Create a variable called somecollege, i.e. more than 12 and less than - 16 years of schooling) (call it somecollege, using any of the three methods - we used to create hs. +Let's make and label a new variable about college attendance. Create a variable called somecollege (i.e. more than 12 and less than 16 years of schooling), using any of the three methods we used to create hs. */ - - +/* Challenge question 12b */ /* -(12b) Label somecollege "Attended some years of college" +Label somecollege "Attended some years of college" */ - +/* Challenge question 12c */ /* -(12c) Create a new value label called somecollege_vallabel that assigns labels to 1 and 0 +Create a new value label called somecollege_vallabel that assigns labels to 1 and 0 */ - +/* Challenge question 12d */ /* -(12d) Add your new value label to somecollege and check it has added +Add your new value label to somecollege and check that it has been added */ diff --git a/Stata Fundamentals I/workshop1_solutions.do b/Stata Fundamentals I/workshop1_solutions.do index c4d29c4..1da8f94 100644 --- a/Stata Fundamentals I/workshop1_solutions.do +++ b/Stata Fundamentals I/workshop1_solutions.do @@ -1,7 +1,6 @@ ******************************** * STATA FUNDAMENTALS: PART 1 * SPRING 2021, D-LAB -* SOLUTIONS DO FILE ******************************** @@ -29,108 +28,113 @@ * SETTING UP * ****************************** +* In this section, we'll 
learn how to import your data into Stata and how to run your code. In Stata, we often refer to the file that contains your code as a "do-file". -* clear all previously open data, variables, labels, matrices, memory, etc., -* and close all open files, graph windows, etc +* first, clear all previously open data, variables, labels, matrices, memory, etc., and close all open files, graph windows, etc clear all +* SET A WORKING DIRECTORY -* SET A PROJECT DIRECTORY +* The working directory is the folder on your computer containing the files associated with a given Stata file - it's where Stata will look when you try to load data, and where it will export your data, unless you specify otherwise. Your working directory doesn't necessarily have to be the same folder where your do-file is saved. Stata will choose a default directory based on your application settings, and we can use a command to check which directory that is: pwd // check the current working directory +* Often, the current working directory isn't actually the folder we want to use to store our work - below, we show how you can change the working directory to your desired folder. * Method: Copy & Paste /**** - This is not the most efficient method for telling Stata to locate - and open your files, but it is the simplest, so we will work with - this for day 1 of the workshop +This is not the most efficient method for telling Stata where to locate and open your files, but it is the simplest, so we will work with this for day 1 of the workshop ****/ - - - -/* Step 1: File > Change Working Directory > Navigate to the folder where you - have saved the data file nlsw88.dta */ +/* Step 1: File > Change Working Directory > Navigate to the folder where you have saved the data file nlsw88.dta */ /* Step 2: Copy-paste the last command that shows up on result screen. 
My results window shows this:*/ -cd "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals I" +cd "\\Client\C$\Users\salma\Box\dlab_workshops-s21\stata-fundamentals\Stata Fundamentals I" + - /*** - We paste this command above so that next time we can just run this - do-file from the top and it will run smoothly. We will not need to - use the file menu or copy-paste again. We should be able to run - everything from the do-file. +We paste this command above so that next time we can just run this do-file from the top and it will run smoothly. We will not need to use the file menu or copy-paste again. We should be able to run everything from the do-file. ***/ // POLL 1 // +/*** +Run the command “pwd”. Is your working directory set to the proper folder on YOUR computer? +(1) Yes +(2) No +(3) Don’t know +(If you ran the command above without any errors, the answer should be Yes) +***/ - -/* Step 3: Open the data file */ +/* Step 3: Open the data file.*/ use nlsw88.dta , clear // open data file -/* You can also write: */ - -use nlsw88 , clear -// don't have to specify .dta, that is the default extension - - /*** - If you are using an older version of Stata, then please use: - -use nlsw88_13, clear +Stata uses a special file type called .dta files to save data in a table format (similar to how Excel has their own .xlsx file types). At the end of today's workshop, we'll go over how to import other file types that are more common (e.g. .csv files). ***/ +/* You can also write: */ +use nlsw88 , clear // don't have to specify .dta, that is the default extension +/*** +If you are using an older version of Stata, then please use: use nlsw88_13, clear +***/ /* - Pause here, and now highlight everything above this point and click - on the "do" button. It should run smoothly! You've started to create a - functional do-file! +Pause here, and now highlight everything above this point and click +on the "do" button. It should run smoothly! 
You've started to create a functional do-file! */ - ****************************** * COMMENTING * ****************************** -// There are a bunch of ways to comment your .do file. -// Commenting is key for understanding your work. +// Comments are meant to make your code easier to understand. There are a bunch of ways to comment your .do file; comments will show up in green text while commands (the steps you want Stata to execute) will show up in black or blue text. -* You can also just put an asterisk at the beginning of a line -* You can use * to comment out lines of code that you want to suspend -// you can use double slash to make comments at the end of a command line or just as a line by itself (like this one) -// asterisk (*) cannot be placed at the end of a command line +/* +If you want to write a really long and super informative comment and you want to clearly show where the comment begins and ends, you can wrap it in a slash-asterisk (/* at the start, and */ at the end), like this one we're typing right now. +*/ +* You can put an asterisk at the beginning of a line to comment it out +* You can also use * to comment out lines of code that you want to suspend +// You can use double slash to make comments at the end of a command line or just as a line by itself (like this one) +// asterisk (*) cannot be placed at the end of a command line - it can only be used on a line by itself. For example: des // describes the variables in the data des * describes the variables in the data <-- this is wrong! * des // this suspended the command altogether -/* But then say you wanted to write a really long -and super informative comment that you didn't want -to have all on one line, like this one we're typing -right now. */ +/* +Try highlighting and running just the three lines above. The first line should run smoothly, giving us a description of the data. The second should give an error because it's not commented properly! 
+*/ // POLL 2 // +/*** +Which of these include both a command and a comment (and would run the command without error)? -** CHALLENGE ** -/* +(1) /*clear all – make sure environment is clear before reading in new data*/ +(2) clear all * make sure environment is clear before reading in new data +(3) *clear all – make sure environment is clear before reading in new data +(4) // clear all // make sure environment is clear before reading in new data +(5) clear all // make sure environment is clear before reading in new data <-- CORRECT ANSWER +***/ + +/* Challenge question 1 */ +/* (1) write "describes data" NEXT to the command "des" below as a comment (2) Suspend all 3 lines of code below using one pair of /**/ */ +/* Challenge question 1 solution */ /* des // describe data sum @@ -141,57 +145,69 @@ count ********************************************** * EXAMINING A DATA SET: THE BASICS * ********************************************** -** It is good practice to LOOK at your data before you start working with it -** That way, you get an idea of its shape and the variables in it quickly +* It is good practice to LOOK at your data before you start working with it +* That way, you get an idea of its shape and the variables in it quickly * DESCRIBE -des // describes dataset and variables - +des // describes dataset and all variables * BROWSE br // browse data in data editor -**What do the different colors mean? -des married married_txt - +/* Challenge question 2 */ +/* +What do the different colors mean in the data editor? +*/ +/* Challenge question 2 solution */ +/* +The colors designate different data types - in my data editor, strings are red. 
+*/ +des married married_txt // we can describe selected variables by specifying which ones - here, we're just describing the married and married_txt fields * CODEBOOK +* The codebook command in Stata provides additional details about each variable, beyond what is output by des +* We could use it to get a codebook of all the variables, which will result in a long output. For now, let's just look at the contents of the variable union: codebook union // shows the contents of the variable union - // POLL 3 // - +/*** +What information is NOT included in the output of the command “codebook union”? +(1) variable type +(2) value labels +(3) number of observations +(4) mean <-- CORRECT ANSWER +***/ * COUNT count // counts the number of observations +* The command above counts the total number of observations in our dataset. But we can also count observations with a condition. For instance, if we want to know how many rows represent individuals who are over 40, we can use: +count if age > 40 // counts the number of observations where age is greater than 40 -/* CHALLENGE ** -(3) Count the number of observations that are union members. -variable: union +/* Challenge question 3 */ +/* +Count the number of observations that are union members. (hint: you can use the command + codebook union +to first figure out what different values the variable union can have) */ count if union == 1 +* SUMMARIZE +* shows number of observations, mean, min & max of all or some vars +sum // summarize all variables - -* SUMMARIZE * - -* shows number of observations, mean, min & max of all/some vars -sum - - -* MISSING VALUES * - - -/* Notice the observation numbers. Why do some variables have fewer observations? */ +* MISSING VALUES +* Notice the observation numbers. Why do some variables have fewer observations? 
misstable summarize // tabulates missing values -codebook union +/* +Stata uses a period (".") to indicate missing values - so the count of missing values for each variable can be found in the column "Obs=." +*/ ************************************************************** @@ -205,13 +221,13 @@ Now let's start looking at some summary statistics using command: "summarize" (or "sum" for short) */ -sum // summarize the data, presents summary statistics -sum wage -sum wage, detail +sum // summarize all data, presents summary statistics +sum wage // summarize only the wage variable +sum wage, detail // summarize the wage variable with detailed summary statistics * we can combine conditional operators with the summarize command -* What is the average wage of observations who are married in this sample +* For example, we can determine the average wage of observations who are married in this sample sum wage if married==1 @@ -231,10 +247,9 @@ unmarried earn more (11.30) on average than those who are married (10.10).*/ -** CHALLENGES ** - +/* Challenge question 4 */ /* -(4) What is the mean wage for those who are not married? +What is the mean wage for those who are not married? variables: wage married (hint: Use the operator "if") */ @@ -244,32 +259,37 @@ sum wage if married != 1 +/* Challenge question 5 */ /* -(5) What is the average wage of those who have worked 10 or more years? +What is the average wage of those who have worked 10 or more years? variables: wage tenure +(hint: For numerical variables, Stata considers missing values to be very, very large numbers - so if you are selecting a subset of your data by value (e.g. "age > 40"), it is important to additionally specify that you want to exclude missing values (i.e. "age != .") */ sum wage if tenure >= 10 & tenure != . +/* Challenge question 6 */ /* -(6) What is the average number of hours worked in the sample? +What is the average number of hours worked in the sample? 
variable: hours */ sum hours +/* Challenge question 7 */ /* -(7) What is the average age and age range of this sample? +What is the average age and age range of this sample? Variable: age */ sum age +/* Challenge question 8 */ /* -(8) What is the average age for non-married observations? +What is the average age for non-married observations? variables: age, married */ sum age if married == 0 @@ -287,86 +307,103 @@ sum age if married == 0 /* E */ sum wage if union==1 // POLL 4 // - +/*** +Suppose we want to summarize wages for those individuals who are in unions (union=1) +Which of the five options in the do file are correct? +(1) All +(2) E, only +(3) A, B +(4) C, D, E <--- CORRECT ANSWER +(5) B, C, D +***/ * TABULATION & CROSS TABULATION * +/* +Another useful command for summarizing variables is "tabulate", or "tab" for short. +*/ - - -// Very helpful for categorical variables +// It's particularly helpful for categorical variables tab race tab collgrad tab union tab union if hours>=60 & hours<. -* Twoway tables +// The tables we created above are one way tables - they summarize one variable. 
But we can also give tab a list of two variables and create two-way tables tab union collgrad -tab union collgrad, col -tab union collgrad, row -tab union collgrad, col row - -tab union collgrad, cell +tab union collgrad, col // the option "col" shows the relative frequency of each row value (union worker) within each column value (collgrad) - for instance, we can figure out what percentage of college graduates in our dataset are union workers +tab union collgrad, row // the option "row" shows the relative frequency of each column value (collgrad) within each row value (union worker) - for instance, we can figure out what percentage of non-union workers are college graduates +tab union collgrad, col row // combine "col" and "row" to get both sets of relative percentages +tab union collgrad, cell // the option "cell" shows the relative frequency of each row-column combination within the dataset - for instance, we can figure out what percentage of the dataset are union workers who graduated college // POLL 5 // +/*** +If you wanted to know what percentage of white respondents lived in central cities, what code would you write? +(1) tab race c_city, col +(2) tab race c_city, row <--- CORRECT ANSWER +(3) tab c_city race, row +(4) tab c_city race, cell +(5) tab race c_city, cell +***/ -** CHALLENGE ** +/* Challenge question 9 */ /* -(9) How many observations in this dataset fall into each race group? +How many observations in this dataset fall into each race group? Variables: race */ tab race +/* Challenge question 10 */ /* -(10) What percent of the sample is white? +What percent of the sample is white? Variable: race */ tab race -* TABULATE, SUMMARIZE -* Summary statistics of one variable with respect to others +* TABULATE, SUMMARIZE * +* You can combine tabulating and summary statistics using tab, summarize; this command gives summary statistics of one variable with respect to others + /* e.g. 
What is the average wage for married/non-married OR college graduates/non-graduates? */ - -help tabulate_summarize -tab collgrad, summarize(wage) means -tab married collgrad, summarize(wage) means +help tabulate_summarize // pull up the documentation for this command +tab collgrad, summarize(wage) means // get mean hourly wage for college graduates/non-graduates +tab married collgrad, summarize(wage) means // get mean hourly wage for combinations of married/non-married and college graduates/non-graduates -** CHALLENGE ** +/* Challenge question 11 */ /* -(11) Find average wage by industry. +Find average wage by industry. Variables: industry wage */ tab industry, sum(wage) means +// When you find the average wage by industry, do you notice anything strange? +// Mining wages are much higher than other industries, so let's explore the mining wage... -// do you notice anything strange about the wages here? -// mining wages are the highest?? - -// Let's explore the mining wage... -// Let's take a look at the observations that work in mining -// first we have to find the industry code that belongs to mining +// Let's take a look at the observations that work in mining. First we have to find the industry code that belongs to mining * Finding numeric codes attached to value labels * - -br if industry==Mining // no luck, industry is a numerical variable +// We can start by using the browse command ("br") to open up the data editor for the observations we're interested in. To start, we can try specifying that we only want to look at observations where the industry is Mining... +br if industry==Mining +// ...no luck, industry is a numerical variable! If we try tabulating industry again, we'll only see the text labels... 
tab industry +// ...but we can figure out what the associated numeric values are by specifying the option "nolabel": tab industry, nolabel +// comparing the no label and the labelled tables shows that mining is given a value of 2: br if industry==2 -/* OR */ -// find name of value label + +// another way to figure out the number associated with each label is to use codebook: codebook industry // you can also use: des industry -// then list the contents of that label +// then list the contents of that label: label list indlbl br if industry==2 @@ -384,13 +421,13 @@ br if industry==2 * (1) Simple numeric variables +// the command "gen" (short for generate) creates a new variable -gen year88=1 +gen year88=1 // create a variable year88, that is equal to 1 for all observations -gen wage_day = wage*8 // wage per day (8 hour workday) - -gen tenure_sqr = tenure^2 +gen wage_day = wage*8 // create a variable wage_day, that estimates the wage per day (assuming an 8 hour workday) +gen tenure_sqr = tenure^2 // create a variable tenure_sqr, that is equal to the tenure squared * (2) Turning our string variable into a numeric variable @@ -400,47 +437,41 @@ gen tenure_sqr = tenure^2 sum wage if married_txt=="Married" sum wage if married == 1 -*Do you see any issues that might arrise from using a string variable? +*Do you see any issues that might arise from using a string variable? tab married_txt married - /* Method 1: manual labor */ - *Remember married and married_txt? 
tab married married_txt tab married_txt -gen married2=1 if married_txt=="M" | married_txt=="Married" | married_txt=="m" /// - | married_txt=="maried" | married_txt=="married" +// we can use a combination of "gen" and "replace" to clean up married_txt by creating another variable, married2 +gen married2=1 if married_txt=="M" | married_txt=="Married" | married_txt=="m" | married_txt=="maried" | married_txt=="married" -replace married2=0 if married_txt=="single" | married_txt=="S" | married_txt=="SINGLE" /// - | married_txt=="Single" | married_txt=="s" | married_txt=="sIngle" /// - | married_txt=="singLe" | married_txt=="single" | married_txt=="single " /// - | married_txt=="single " | married_txt==" single" | married_txt=="single " +replace married2=0 if married_txt=="single" | married_txt=="S" | married_txt=="SINGLE" | married_txt=="Single" | married_txt=="s" | married_txt=="sIngle" | married_txt=="singLe" | married_txt=="single" | married_txt=="single " | married_txt=="single " | married_txt==" single" | married_txt=="single " -encode married_txt, gen(married3) // good when strings are clean +// now we have a numerical variable, married2, that is equal to "1" for married observations and "0" for non-married observations -gen married_txt2 = married_txt -replace married_txt2=trim(married_txt2) // removes leading and trailing spaces -replace married_txt2=proper(married_txt2) -replace married_txt2=lower(married_txt2) -replace married_txt2=upper(married_txt2) +// if the married_txt string was clean, we could simply use "encode" to convert it to a numerical value +encode married_txt, gen(married3) // good when strings are clean +// another way to clean text is to use the "trim", "proper", "lower", and "upper" functions +gen married_txt2 = married_txt // create a new variable, married_txt2, that is equal to married_txt (for now) +replace married_txt2=trim(married_txt2) // remove leading and trailing spaces +replace married_txt2=proper(married_txt2) // change to "proper" 
case - first letter upper case +replace married_txt2=lower(married_txt2) // change to lower case +replace married_txt2=upper(married_txt2) // change to upper case -/* Method 2 (advanced!): regular expressions */ +/* Method 2 (advanced!): regular expressions */ /* -Regular expressions are one way that you can work with strings in variables -Regular expressions are methods that allows for searching, matching and replacing within strings. -There are two main commands: regexr and regexm. +Regular expressions are one way that you can work with strings in variables. Regular expressions are methods that allows for searching, matching and replacing within strings. There are two main commands: regexr and regexm. regexr REPLACES a value within a string with a new variable - regexm, which we'll use today, combines strings and conditional operators. -regexm lets you search within a string for a given character; it returns 1, or -TRUE, if the string has the character, and 0 otherwise. +regexm lets you search within a string for a given character; it returns 1, or TRUE, if the string has the character, and 0 otherwise. */ gen married4=1 if regexm(married_txt,"m") | regexm(married_txt,"M") @@ -499,7 +530,12 @@ drop hs2 hs3 // POLL 6 // - +/*** +We want to create a variable indicating if people worked full time (40 hrs or more). Which option would NOT generate this variable properly? +(1) gen full_time=hours>=40 <--- CORRECT ANSWER +(2) gen full_time=hours>=40 if hours<. +(3) recode hours (0/39 = 0) (40/80=1), gen(full_time) +***/ // Let's tabulate our new variable @@ -538,13 +574,18 @@ label list hs_vallabel // POLL 7 // +/*** +We want to see the value labels and the values that they correspond to for the variable hs. Which option would NOT show us this? +(1) br hs <--- CORRECT ANSWER +(2) label list +(3) label list hs_vallabel +(4) tab hs +(5) codebook hs +***/ -** CHALLENGE ** +/* Challenge question 12a */ /* -Let's make and label a new variable about college attendance. 
-(12a) Create a variable called somecollege, i.e. more than 12 and less than - 16 years of schooling) (call it somecollege, using any of the three methods - we used to create hs). +Let's make and label a new variable about college attendance. Create a variable called somecollege, i.e. more than 12 and less than 16 years of schooling) (call it somecollege, using any of the three methods we used to create hs. */ // method 1 gen somecollege1 = 1 if grade > 12 & grade < 16 @@ -563,23 +604,26 @@ rename somecollege1 somecollege +/* Challenge question 12b */ /* -(12b) Label somecollege "Attended some years of college" +Label somecollege "Attended some years of college" */ label variable somecollege "Attended some years of college" +/* Challenge question 12c */ /* -(12c) Create a new value label called somecollege_vallabel that assigns labels to 1 and 0 +Create a new value label called somecollege_vallabel that assigns labels to 1 and 0 */ label define somecollege_vallabel 0 "did not attend college or completed college" 1 "attended some college" +/* Challenge question 12d */ /* -(12d) Add your new value label to somecollege and check it has added +Add your new value label to somecollege and check it has added */ label val somecollege somecollege_vallabel tab somecollege diff --git a/Stata Fundamentals II/workshop2_content.do b/Stata Fundamentals II/workshop2_content.do index e17ce45..7b34444 100644 --- a/Stata Fundamentals II/workshop2_content.do +++ b/Stata Fundamentals II/workshop2_content.do @@ -23,7 +23,7 @@ /* Step 2: Copy-paste the last command that shows up on result screen. 
My result window shows this:*/ -cd "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals II" +cd "\\Client\C$\Users\salma\Box\dlab_workshops-s21\stata-fundamentals\Stata Fundamentals II" /*** We paste this command above so that next time we can just run this @@ -33,6 +33,12 @@ cd "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals II" ***/ // POLL 1 // +/* +Run the command “pwd”. Is your working directory set to the proper folder on YOUR computer? +(1) Yes +(2) No +(3) Don’t know +*/ ********************************************** * 0. WORKSHOP I WRAP-UP @@ -52,13 +58,23 @@ use nlsw88.dta , clear help histogram //let's take a look at the histogram command // POLL 2 // - +/* +Refresher on help files: based on the help file syntax, what information MUST be provided to run a command? +(1) bolded words +(2) bolded words and italicized arguments that are NOT in brackets +(3) bolded words and arguments in brackets +(4) bolded words and words after commas +(5) bolded words, italicized (non-bracketed) arguments, and words after commas +*/ + +/* the default histogram command gives the density of values per bin, but using the option freq allows us to visualize the frequency of values per bin */ histogram age histogram age, freq histogram wage histogram wage, freq +/* the option discrete allows us to specify that the data are discrete, which means that we visualize a separate bin for each value */ histogram age, discrete histogram wage, discrete @@ -85,10 +101,12 @@ histogram wage, freq width(2) /// xtitle("Hourly Wage in 1988 Dollars") -** CHALLENGE ** -* 1. Plot a histogram of weekly hours worked in which each bar represents 5 hours. -* Label the x-axis "Weekly Hours" +** Challenge question 1 ** +/* +(1) Plot a histogram of weekly hours worked in which each bar represents 5 hours. 
+(2) Label the x-axis "Weekly Hours" // variable: hours +*/ *** Additional options for a Histogram @@ -106,21 +124,27 @@ histogram wage, by(married) name(hist_wageXmarried) // POLL 3 // - -** CHALLENGE ** -* 2. Create a graph with one historgram of wage for each industry. +/* +Which of the following options can be combined in Stata’s histogram command? +(1) bin and width +(2) density and frequency +(3) start and width +(4) discrete and bin +(5) none of the above +*/ + +** Challenge question 2 ** +/* +(1) Create a graph with one historgram of wage for each industry. // variables: wage, industry -* Bonus: Include a (single) title for the whole graph +(2) Bonus: Include a (single) title for the whole graph // hint: this is an option WITHIN an option +*/ - - - *** SCATTERPLOT *** - help scatter //now scatterplots scatter wage age @@ -139,18 +163,15 @@ scatter wage age, title("Hourly vs. Age") scheme(s1mono) scatter wage age, title("Hourly vs. Age") scheme(s1mono) mcolor(blue) - - //There are other formatting changes we can also make scatter wage age, title("Hourly vs. Age") legend(on) /// mcolor(blue) xlabel(34(1)46, format(%2.0f)) ylabel(,format(%2.1f)) - *** COMBINE GRAPHS help twoway -//We want to make a scatterplot, and add a linear prediction-based line of best fit +// We want to make a scatterplot, and add a linear prediction-based line of best fit twoway (scatter wage age, mcolor(blue)) /// (lfit wage age), title("Hourly vs. Age") xlabel(34(1)46, format(%2.0f)) /// ylabel(,format(%2.1f)) legend(on) @@ -176,17 +197,13 @@ graph save hist_wageXmarried "hist_wageXmarried.gph", replace graph display hist_wageXmarried graph export "hist_wageXmarried.png", name(hist_wageXmarried) replace - *Remember- you can code all these graphs on one line without the /// *I have them broken up into multiple lines for easy display in class *Do what is best for you! 
- ** Additional options for a Scatter Plot - - *Scatter plot by wage and age- separate graph for each scatter wage age, by(race) @@ -196,8 +213,6 @@ scatter wage age, by(race, total) ***This is the same syntax as the histogram above! - - *** More Advanced Plotting Options *** *What if we want to put two histograms on the same plot? @@ -227,29 +242,32 @@ twoway (histogram wage if union==1, percent fcolor(blue%50) lcolor(black) start( legend(order (1 "Union" 2 "Non-Union")) title("Wage by Union Status") -** CHALLENGE ** -* 3. Create a graph with a scatter plot of wage (y-axis) and total work experience (x-axis) -* for (1) white women and (2) black women on the same set of axes. -* Include a legend that labels the plot for each race - // variables: wage, ttl_exp, race (1=white, 2=black) +** Challenge question 3 ** +/* +Create a graph with a scatter plot of wage (y-axis) and total work experience (x-axis) for (1) white women and (2) black women on the same set of axes. -* BONUS: change the marker colors from the default to 2 different fun colors - // hint: help colorstyle - - - - - +Include a legend that labels the plot for each race + // variables: wage, ttl_exp, race (1=white, 2=black) +BONUS: change the marker colors from the default to 2 different fun colors + // hint: help colorstyle +*/ + // POLL 4 // +/* +Which of the following is true about plotting in Stata? +(1) You can plot UP TO two graphs at a time +(2) There is only one right way to write code for any desired graph +(3) You can only have one graph open at a time +(4) Once you set a scheme, you cannot adjust the appearance of your graph +(5) None of the above +*/ ********************************************** * II. CORRELATION AND T-TESTS ********************************************** - - *CORRELATION AND T-TESTS //How do we use the correlation command in Stata? @@ -261,24 +279,29 @@ corr age wage // What if we want to look at age wage and tenure? // Notice anything different? 
-//COMMAND: +// COMMAND: corr age wage tenure pwcorr age wage tenure // POLL 5 // -corr age grade wage hours ttl_exp - - -** CHALLENGE ** -* 4. Correlate ALL of the continuous variables in the dataset that are non-missing for ALL variables - // hint: continuous variables are numeric variables for which a "unit increase" - // (or decrease) has inherent meaning. +corr age grade wage hours ttl_exp // run this line to answer poll 5 +/* +Which pair of variables has the WEAKEST correlation? +(1) age and wage +(2) hours and age +(3) wage and hours +(4) grade and ttl_exp +*/ + +** Challenge question 4 ** +/* +Correlate ALL of the continuous variables in the dataset that are non-missing for ALL variables + // hint: continuous variables are numeric variables for which a "unit increase" (or decrease) has inherent meaning. +*/ - - *T-TESTS //Now, let's test whether wages are different by union membership @@ -293,14 +316,12 @@ ttest wage, by(south) *How would you interpret this? -** CHALLENGE ** -* 5. Is there a statistically significant difference in the mean wage of white and black women? +** Challenge question 5 ** +/* +Is there a statistically significant difference in the mean wage of white and black women? // variables: wage race // hint: the ttest approach requires a conditional statement - - - - +*/ ************************************************** @@ -309,7 +330,7 @@ ttest wage, by(south) *LINEAR REGRESSION -help regress //Let's look at the doccumentation for the regress commend +help regress //Let's look at the documentation for the regress command *Lets regress wage and age reg wage age @@ -317,10 +338,16 @@ reg wage age *How about wage on age, union, and married? reg wage age union married - - // POLL 6 // - +reg wage age union married // run this line to answer poll 6 +/* +Which coefficients are statistically significant (at the conventional 95% confidence level)? 
+(1) age +(2) union +(3) married +(4) union and married +(5) age, union, and married +*/ *Why can't we use married_txt? @@ -330,43 +357,33 @@ reg wage age union married_txt *What happens when we do a categorical variable? *What does this output mean? -reg wage age union married industry // Not right +reg wage age union married industry // not right: an increase or decrease in the value of industry does not have inherent meaning -*We want to treat each industry number as its own category instead of assuming a linear -*relationship between them +*We want to treat each industry number as its own category instead of assuming a linear relationship between them **How do we fix this? //COMMAND: - reg wage age union married i.industry - *The i. here lets us split up the categorical industry variable into dummies by value - - - //What if we only want to run this regression for certain industries? //COMMAND: - - reg wage age union married if industry==5 reg wage age union married if industry==12 - - *Note number of observations in these regressions *Do all of them match? *Why not? 
-* OMMITTED CATEGORY + +* OMITTED CATEGORY * when we run a regression with a categorical variable, there is always an ommitted category -* the coefficients are interpretted relative to the ommitted category - reg wage age union married i.industry - - * you can change the ommitted category - say you wanted it to be relative to farmers - codebook occupation - label list occlbl // farmers = 9 - reg wage ttl_exp collgrad union ib9.occupation +* the coefficients are interpreted relative to the omitted category +reg wage age union married i.industry +* you can change the omitted category - say you wanted it to be relative to farmers +codebook occupation +label list occlbl // farmers = 9 +reg wage ttl_exp collgrad union ib9.occupation * a bivariate (two variable) regression is equivalent to testing the difference in group means reg wage i.race @@ -376,19 +393,17 @@ reg wage i.race ttest wage if race <3, by(race) -** CHALLENGE ** -* 6. Regress wage (dependent variable) on: -* total experience, -* college graduation, -* union status, and -* occupation. -* Omit respondents in occupations that are: -* (1) unknown (i.e., "other" or missing) or (2) have fewer than 20 respondents. - // variables: wage ttl_exp collgrad union occupation - - - - +** Challenge question 6 ** +/* +Regress wage (dependent variable) on: + total experience, + college graduation, + union status, and + occupation. +Omit respondents in occupations that are: + (1) unknown (i.e., "other" or missing) or (2) have fewer than 20 respondents. + variables: wage ttl_exp collgrad union occupation +*/ *INTERACTIONS @@ -397,10 +412,8 @@ ttest wage if race <3, by(race) *Basic regression reg wage age union married collgrad - gen marriedXcollgrad= married*collgrad - reg wage age union married collgrad marriedXcollgrad // Another way to do this: @@ -413,13 +426,11 @@ reg wage age union married##collgrad // How do these two specifications differ? - - ****************************************** * IV. 
POST-ESTIMATION ****************************************** -//We can do more than just display coefficients following regression -//Examples from linear regression +// We can do more than just display coefficients following regression +// Examples from linear regression help regress postestimation // here is the relevant help file @@ -427,21 +438,16 @@ help regress postestimation // here is the relevant help file reg wage union age married estat hettest - - *WALD TESTS reg wage union age married test union = married - - -** CHALLENGE ** -* 7. Are the wages of clerical/unskilled workers significantly different from -* unskilled workers? -*(hint: two methods - one using Wald test, another using ommitted categories) - - +** Challenge question 7** +/* +Are the wages of clerical/skilled workers significantly different from unskilled workers? +(hint: two methods - one using Wald test, another using omitted categories) +*/ ****************************************** @@ -450,7 +456,7 @@ test union = married // Sometimes, we may want to display results in figures rather than tables -//you will need to run the below to install this very useful user-written command +// you will need to run the below to install this very useful user-written command ssc install coefplot reg wage union age married i.industry @@ -458,12 +464,9 @@ coefplot coefplot, horizontal coefplot, drop(_cons) horizontal - -//What if you want to use 99 percent confidence intervals instead of 95? -//Use the help file for coefplot to figure out how to plot the above figure that way +// What if you want to use 99 percent confidence intervals instead of 95? 
+// Use the help file for coefplot to figure out how to plot the above figure that way //COMMAND: - - reg wage union age married i.industry coefplot, levels(99 95) diff --git a/Stata Fundamentals II/workshop2_solutions.do b/Stata Fundamentals II/workshop2_solutions.do index 21b0c73..67233a5 100644 --- a/Stata Fundamentals II/workshop2_solutions.do +++ b/Stata Fundamentals II/workshop2_solutions.do @@ -1,7 +1,6 @@ ******************************** * STATA FUNDAMENTALS: WORKSHOP 2 * SPRING 2021, D-LAB -* SOLUTIONS DO FILE ******************************** ************************************************** @@ -24,7 +23,7 @@ /* Step 2: Copy-paste the last command that shows up on result screen. My result window shows this:*/ -cd "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals II" +cd "\\Client\C$\Users\salma\Box\dlab_workshops-s21\stata-fundamentals\Stata Fundamentals II" /*** We paste this command above so that next time we can just run this @@ -34,6 +33,12 @@ cd "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals II" ***/ // POLL 1 // +/* +Run the command “pwd”. Is your working directory set to the proper folder on YOUR computer? +(1) Yes <-- hopefully this is the case! +(2) No <-- if this is your answer, double check that you've run the "cd" command above +(3) Don’t know <-- if this is your answer, check if pwd returns the same directory as the folder that contains nlsw88.dta +*/ ********************************************** * 0. WORKSHOP I WRAP-UP @@ -53,13 +58,23 @@ use nlsw88.dta , clear help histogram //let's take a look at the histogram command // POLL 2 // - +/* +Refresher on help files: based on the help file syntax, what information MUST be provided to run a command? 
+(1) bolded words +(2) bolded words and italicized arguments that are NOT in brackets <-- correct answer +(3) bolded words and arguments in brackets +(4) bolded words and words after commas +(5) bolded words, italicized (non-bracketed) arguments, and words after commas +*/ + +/* the default histogram command gives the density of values per bin, but using the option freq allows us to visualize the frequency of values per bin */ histogram age histogram age, freq histogram wage histogram wage, freq +/* the option discrete allows us to specify that the data are discrete, which means that we visualize a separate bin for each value */ histogram age, discrete histogram wage, discrete @@ -85,14 +100,14 @@ histogram wage, freq width(2) /// title("Histogram by Wage in National Labor Survery in 1988") /// xtitle("Hourly Wage in 1988 Dollars") - -** CHALLENGE ** -* 1. Plot a histogram of weekly hours worked in which each bar represents 5 hours. -* Label the x-axis "Weekly Hours" +** Challenge question 1 ** +/* +(1) Plot a histogram of weekly hours worked in which each bar represents 5 hours. +(2) Label the x-axis "Weekly Hours" // variable: hours - +*/ +/* Solution*/ histogram hours, width(5) start(0) xtitle("Weekly Hours") - * OR sum hours // max is 80 histogram hours, bin(16) start(0) xtitle("Weekly Hours") @@ -113,26 +128,30 @@ histogram wage, by(married) name(hist_wageXmarried) // POLL 3 // - -** CHALLENGE ** -* 2. Create a graph with one historgram of wage for each industry. +/* +Which of the following options can be combined in Stata’s histogram command? +(1) bin and width +(2) density and frequency +(3) start and width <-- correct answer +(4) discrete and bin +(5) none of the above +*/ + +** Challenge question 2 ** +/* +(1) Create a graph with one historgram of wage for each industry. 
// variables: wage, industry -histogram wage, by(industry) - -* Bonus: Include a (single) title for the whole graph +(2) Bonus: Include a (single) title for the whole graph // hint: this is an option WITHIN an option - +*/ +/* Solution */ +histogram wage, by(industry) histogram wage, by(industry, title("Wage by Industry")) - - - - *** SCATTERPLOT *** - help scatter //now scatterplots scatter wage age @@ -151,13 +170,10 @@ scatter wage age, title("Hourly vs. Age") scheme(s1mono) scatter wage age, title("Hourly vs. Age") scheme(s1mono) mcolor(blue) - - //There are other formatting changes we can also make scatter wage age, title("Hourly vs. Age") legend(on) /// mcolor(blue) xlabel(34(1)46, format(%2.0f)) ylabel(,format(%2.1f)) - *** COMBINE GRAPHS help twoway @@ -193,12 +209,8 @@ graph export "hist_wageXmarried.png", name(hist_wageXmarried) replace *I have them broken up into multiple lines for easy display in class *Do what is best for you! - - ** Additional options for a Scatter Plot - - *Scatter plot by wage and age- separate graph for each scatter wage age, by(race) @@ -208,8 +220,6 @@ scatter wage age, by(race, total) ***This is the same syntax as the histogram above! - - *** More Advanced Plotting Options *** *What if we want to put two histograms on the same plot? @@ -239,32 +249,36 @@ twoway (histogram wage if union==1, percent fcolor(blue%50) lcolor(black) start( legend(order (1 "Union" 2 "Non-Union")) title("Wage by Union Status") -** CHALLENGE ** -* 3. Create a graph with a scatter plot of wage (y-axis) and total work experience (x-axis) -* for (1) white women and (2) black women on the same set of axes. -* Include a legend that labels the plot for each race - // variables: wage, ttl_exp, race (1=white, 2=black) +** Challenge question 3 ** +/* +Create a graph with a scatter plot of wage (y-axis) and total work experience (x-axis) for (1) white women and (2) black women on the same set of axes. 
-* BONUS: change the marker colors from the default to 2 different fun colors - // hint: help colorstyle +Include a legend that labels the plot for each race + // variables: wage, ttl_exp, race (1=white, 2=black) +BONUS: change the marker colors from the default to 2 different fun colors + // hint: help colorstyle +*/ +/* Solution */ twoway (scatter wage ttl_exp if race==1, col(cranberry)) /// (scatter wage ttl_exp if race==2, col(teal)), /// legend(label(1 "White") label(2 "Black")) - - - - // POLL 4 // +/* +Which of the following is true about plotting in Stata? +(1) You can plot UP TO two graphs at a time +(2) There is only one right way to write code for any desired graph +(3) You can only have one graph open at a time +(4) Once you set a scheme, you cannot adjust the appearance of your graph +(5) None of the above <-- Correct answer +*/ ********************************************** * II. CORRELATION AND T-TESTS ********************************************** - - *CORRELATION AND T-TESTS //How do we use the correlation command in Stata? @@ -282,18 +296,24 @@ corr age wage tenure pwcorr age wage tenure // POLL 5 // -corr age grade wage hours ttl_exp - - -** CHALLENGE ** -* 4. Correlate ALL of the continuous variables in the dataset that are non-missing for ALL variables - // hint: continuous variables are numeric variables for which a "unit increase" - // (or decrease) has inherent meaning. - +corr age grade wage hours ttl_exp // run this line to answer poll 5 +/* +Which pair of variables has the WEAKEST correlation? +(1) age and wage +(2) hours and age <-- correct answer +(3) wage and hours +(4) grade and ttl_exp +*/ + +** Challenge question 4 ** +/* +Correlate ALL of the continuous variables in the dataset that are non-missing for ALL variables + // hint: continuous variables are numeric variables for which a "unit increase" (or decrease) has inherent meaning. 
+*/ +/* Solution */ corr age grade wage hours ttl_exp tenure - *T-TESTS //Now, let's test whether wages are different by union membership @@ -308,18 +328,18 @@ ttest wage, by(south) *How would you interpret this? -** CHALLENGE ** -* 5. Is there a statistically significant difference in the mean wage of white and black women? +** Challenge question 5 ** +/* +Is there a statistically significant difference in the mean wage of white and black women? // variables: wage race // hint: the ttest approach requires a conditional statement -ttest wage if race<3, by(race) - -/* Answer: - Yes, white and black women earn significantly different wages on average. - White women earn $1.24 more per hour than black women. +*/ +/* Solution */ +ttest wage if race < 3, by(race) +/* +Yes, white and black women earn significantly different wages on average. White women earn $1.24 more per hour than black women. */ - ************************************************** * III. REGRESSION AND ITS OUTPUT @@ -327,7 +347,7 @@ ttest wage if race<3, by(race) *LINEAR REGRESSION -help regress //Let's look at the doccumentation for the regress commend +help regress //Let's look at the documentation for the regress command *Lets regress wage and age reg wage age @@ -335,10 +355,16 @@ reg wage age *How about wage on age, union, and married? reg wage age union married - - // POLL 6 // - +reg wage age union married // run this line to answer poll 6 +/* +Which coefficients are statistically significant (at the conventional 95% confidence level)? +(1) age +(2) union <-- correct answer +(3) married +(4) union and married +(5) age, union, and married +*/ *Why can't we use married_txt? @@ -348,43 +374,33 @@ reg wage age union married_txt *What happens when we do a categorical variable? *What does this output mean? 
-reg wage age union married industry // Not right +reg wage age union married industry // not right: an increase or decrease in the value of industry does not have inherent meaning -*We want to treat each industry number as its own category instead of assuming a linear -*relationship between them +*We want to treat each industry number as its own category instead of assuming a linear relationship between them **How do we fix this? //COMMAND: - reg wage age union married i.industry - *The i. here lets us split up the categorical industry variable into dummies by value - - - //What if we only want to run this regression for certain industries? //COMMAND: - - reg wage age union married if industry==5 reg wage age union married if industry==12 - - *Note number of observations in these regressions *Do all of them match? *Why not? -* OMMITTED CATEGORY + +* OMITTED CATEGORY * when we run a regression with a categorical variable, there is always an ommitted category -* the coefficients are interpretted relative to the ommitted category - reg wage age union married i.industry - - * you can change the ommitted category - say you wanted it to be relative to farmers - codebook occupation - label list occlbl // farmers = 9 - reg wage ttl_exp collgrad union ib9.occupation +* the coefficients are interpreted relative to the omitted category +reg wage age union married i.industry +* you can change the omitted category - say you wanted it to be relative to farmers +codebook occupation +label list occlbl // farmers = 9 +reg wage ttl_exp collgrad union ib9.occupation * a bivariate (two variable) regression is equivalent to testing the difference in group means reg wage i.race @@ -394,20 +410,21 @@ reg wage i.race ttest wage if race <3, by(race) -** CHALLENGE ** -* 6. Regress wage (dependent variable) on: -* total experience, -* college graduation, -* union status, and -* occupation. 
-* Omit respondents in occupations that are: -* (1) unknown (i.e., "other" or missing) or (2) have fewer than 20 respondents. - // variables: wage ttl_exp collgrad union occupation - +** Challenge question 6 ** +/* +Regress wage (dependent variable) on: + total experience, + college graduation, + union status, and + occupation. +Omit respondents in occupations that are: + (1) unknown (i.e., "other" or missing) or (2) have fewer than 20 respondents. + variables: wage ttl_exp collgrad union occupation +*/ +/* Solution */ /* STEP 1 */ tab occupation tab occupation, nolab - * OR codebook occupation tab occupation @@ -417,18 +434,14 @@ label list occlbl reg wage ttl_exp collgrad union i.occupation if occupation<9 - - *INTERACTIONS // Let's add an interaction term for being married and graduating from college *Basic regression reg wage age union married collgrad - gen marriedXcollgrad= married*collgrad - reg wage age union married collgrad marriedXcollgrad // Another way to do this: @@ -441,8 +454,6 @@ reg wage age union married##collgrad // How do these two specifications differ? - - ****************************************** * IV. POST-ESTIMATION ****************************************** @@ -455,19 +466,17 @@ help regress postestimation // here is the relevant help file reg wage union age married estat hettest - - *WALD TESTS reg wage union age married test union = married - - -** CHALLENGE ** -* 7. Are the wages of clerical/unskilled workers significantly different from -* unskilled workers? -*(hint: two methods - one using Wald test, another using ommitted categories) +** Challenge question 7** +/* +Are the wages of clerical/skilled workers significantly different from unskilled workers? 
+(hint: two methods - one using Wald test, another using omitted categories) +*/ +/* Solution */ label list occlbl @@ -476,11 +485,8 @@ reg wage i.occupation test 4.occupation = 5.occupation /* Solution 2 */ - reg wage ib4.occupation - *OR - reg wage ib5.occupation @@ -490,7 +496,7 @@ reg wage ib5.occupation // Sometimes, we may want to display results in figures rather than tables -//you will need to run the below to install this very useful user-written command +// you will need to run the below to install this very useful user-written command ssc install coefplot reg wage union age married i.industry @@ -498,12 +504,9 @@ coefplot coefplot, horizontal coefplot, drop(_cons) horizontal - -//What if you want to use 99 percent confidence intervals instead of 95? -//Use the help file for coefplot to figure out how to plot the above figure that way +// What if you want to use 99 percent confidence intervals instead of 95? +// Use the help file for coefplot to figure out how to plot the above figure that way //COMMAND: - - reg wage union age married i.industry coefplot, levels(99 95) diff --git a/Stata Fundamentals III/nlsw88_complete.dta b/Stata Fundamentals III/nlsw88_complete.dta new file mode 100644 index 0000000..a3d0054 Binary files /dev/null and b/Stata Fundamentals III/nlsw88_complete.dta differ diff --git a/Stata Fundamentals III/nlsw88_wave1and2.dta b/Stata Fundamentals III/nlsw88_wave1and2.dta new file mode 100644 index 0000000..725d11b Binary files /dev/null and b/Stata Fundamentals III/nlsw88_wave1and2.dta differ diff --git a/Stata Fundamentals III/workshop3_content.do b/Stata Fundamentals III/workshop3_content.do index 6692725..47b2e90 100644 --- a/Stata Fundamentals III/workshop3_content.do +++ b/Stata Fundamentals III/workshop3_content.do @@ -16,23 +16,26 @@ ************************************************** -/* Step 1: File > Change Working Directory > Navigate to the folder where you - have saved the data file nlsw88.dta */ +/* Step 1: File > Change 
Working Directory > Navigate to the folder where you have saved the data file nlsw88.dta */ -/* Step 2: Copy-paste the last command that shows up on result screen. - My result window shows this:*/ +/* Step 2: Copy-paste the last command that shows up on result screen. My result window shows this:*/ -cd "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals III" +cd "\\Client\C$\Users\salma\Box\dlab_workshops-s21\stata-fundamentals\Stata Fundamentals III" /*** - We paste this command above so that next time we can just run this - do-file from the top and it will run smoothly. We will not need to - use the file menu or copy-paste again. We should be able to run - everything from the do-file. +We paste this command above so that next time we can just run this do-file from the top and it will run smoothly. We will not need to +use the file menu or copy-paste again. We should be able to run everything from the do-file. ***/ pwd // POLL 1 // +/* +Run the command "pwd". Is your working directory the proper folder on your computer? + +(1) Yes +(2) No +(3) Not sure (IOKN2K!) +*/ ********************************************** * 0. WORKSHOP II WRAP-UP @@ -56,8 +59,6 @@ sum idcode // what is the range of id numbers in this dataset? br // let's browse the data - - * Data for round 2, use nlsw88_wave2.dta, clear @@ -82,16 +83,14 @@ br // let's browse the data save nlsw88_wave1and2.dta, replace - * MERGE DATASETS * Data for wave 1 use nlsw88_wave1and2.dta -isid idcode // check if id makes a unique identifier +isid idcode // check if id makes a unique identifier. 
If we don't get an error message, then idcode is a unique identifier duplicates report idcode - *Lets look at the second part of the dataset use nlsw88_childvars, clear @@ -103,7 +102,7 @@ duplicates report idcode // another way of checking if idcode is unique * Merge use nlsw88_wave1and2.dta, clear -merge 1:1 idcode using nlsw88_childvars //one-to-one merge +merge 1:1 idcode using nlsw88_childvars // one-to-one merge on idcode *What does it mean to have different _merge values? tab _merge @@ -119,6 +118,15 @@ drop _merge_1 //or drop merge save nlsw88_complete.dta, replace // POLL 2 // +/* +Which of the following are true about appending and merging? + +(1) to merge, a variable must uniquely identify obs in both datasets +(2) appending only works if all variables appear in both datasets +(3) if not all observations merge, an error has occurred +(4) appending requires fewer arguments than merging +(5) you can never lose information appending or merging +*/ ********************************************** * II. 
RESHAPING @@ -127,11 +135,12 @@ save nlsw88_complete.dta, replace *Load data use nlsw88_complete.dta, clear -//Some of this data is in "wide" format +// Some of this data is in "wide" format list idcode childage1 childage2 childage3 childage4 in 1/10 //print the data to the screen -//First, we want to try and reshape it to "long" format -//Now, each row will be not a single individual, but an individual-child +// First, we want to try and reshape it to "long" format +// Now, each row will be not a single individual, but an individual-child +// "long" indicates that we want to go from wide to long format, and "childage" indicates that we will be creating another id based on the childage# variables that we will name childidcode reshape long childage, i(idcode) j(childidcode) list idcode childidcode childage in 1/10 //print again @@ -149,44 +158,33 @@ j() will be how we're reshaping it*/ reshape wide childage, i(idcode) j(childidcode) // POLL 3 // +/* +When merging and reshaping data, Stata uses ‘idcode’ as its unique identifier because… +(1) it has ‘id’ in the name +(2) it has ‘code’ in the name +(3) because it has no duplicates +(4) because we tell Stata in the command +(5) Both 1 and 4 are correct +*/ +/* Challenge question 1 */ +/* +Rather than merging nlsw88_childvars into nlsw88_wave1and2 and then reshaping, we could instead have first reshaped nlsw88_childvars, and then done a many-to-one merge. Let's try that now! -/* CHALLENGE: RESHAPING AND MERGING ** - Rather than merging nlsw88_childvars into nlsw88_wave1and2 and then reshaping, - we could instead have first reshaped nlsw88_childvars, and then done a - many-to-one merge. 
Let's try that now!*/ - -/*1.1: Open up nlsw88_childvars, and reshape it to long format*/ - - - +1.1: Open up nlsw88_childvars, and reshape it to long format -/*1.2: Merge nlsw88_wave1and2 (using) into nlsw88_childvars (master) - using a many-to-one syntax*/ +1.2: Merge nlsw88_wave1and2 (using) into nlsw88_childvars (master) using a many-to-one syntax +1.3: We want this data to be organized at the woman-child level, meaning we should have a number of observations for each woman matching the number of children she has. For example, if a woman has 3 children, there should be 3 observations for her. +1.3.1: How many observations are there initially? How many women are there in our data? (hint: use the user-written command --unique-- by typing "ssc install unique" and then looking at the help file) - -/*1.3: We want this data to be organized at the woman-child level, - meaning we should have a number of observations for each woman matching - the number of children she has. For example, if a women has 3 children, there - should be 3 observations for her. - - 1.3.1: How many observations are there initially? How many women are there in our data? - (hint: use the user-written command --unique-- by typing "install ssc unique" - and then looking at the help file) - - 1.3.2: How could you check if there are women with extra observations? - (note: there are many ways to 'answer' this question) +1.3.2: How could you check if there are women with extra observations? (note: there are many ways to 'answer' this question) - 1.3.3: Can you find a way to drop observations for "fake" (created by the reshape) - child observations? +1.3.3: Can you find a way to drop observations for "fake" (created by the reshape) child observations? - 1.3.4: What is the correct number of observations in the end?*/ - - - - +1.3.4: What is the correct number of observations in the end? 
+*/ ********************************************** @@ -197,9 +195,9 @@ reshape wide childage, i(idcode) j(childidcode) use nlsw88_complete.dta, clear *LOCALS - -local i=1 -disp `i' +// locals are a way to save a value in memory until you close Stata, or run another ado file +local i=1 // save the value 1 to variable i +disp `i' // when referring to a local variable, use the `' syntax (i.e. `i') disp "The local called i has the value `i'" //Now we increase i by 2. local i=`i'+2 @@ -228,12 +226,19 @@ local industry_lab: value label industry display "The value label for industry is `industry_lab'." // POLL 4 // +/* +Which of the following would allow you to display a local that contains the string “I love Stata”? +(1) display `local’ +(2) display $local +(3) display “`local’” +(4) display “$local” +(5) 1 or 3 +*/ *GLOBALS - -//Considered bad form in programming, use sparingly. -//Easy list of long set of variable names -//Set the file path name for different computers +// Considered bad form in programming, use sparingly. +// Easy list of long set of variable names +// Set the file path name for different computers /* Making a Global:*/ @@ -241,7 +246,6 @@ display "The value label for industry is `industry_lab'." 
pwd * copy your own working directory and replace mine below* - global mycomp "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals II" //Check if it worked @@ -253,7 +257,7 @@ cd "$mycomp" // check it worked pwd -// this global will be useful latter when we save files to a different folder +// this global will be useful later when we save files to a different folder ********************************************** @@ -272,25 +276,33 @@ foreach var in wage ttl_exp hours { reg `var' grade } - - //Instead of using foreach var in, we can also use foreach var of //This works only with variables foreach fudge of varlist wage ttl_exp hours { reg `fudge' grade } - //Using of varlist lets us do interesting things like search our variable list foreach fudge of varlist t* { reg `fudge' grade } -/*You may notice that the output from inside a loop is not +/* +You may notice that the output from inside a loop is not quite as well documented as from outside a loop -It can be helpful to add display lines explaining where the code is*/ +It can be helpful to add display lines explaining where the code is +*/ // POLL 5 // +/* +Look at the code for the loop on the screen. How many times will the code inside this loop run? +(1) One time +(2) Two times +(3) Three times +(4) Four times +(5) Six times +(6) There’s no way to know +*/ foreach fudge in wage ttl_exp hours { disp "********This regresses `fudge' on grade **********" @@ -366,64 +378,58 @@ foreach var of varlist `outcomes' { } // POLL 6 // - - -/** CHALLENGE 2: LOCALS AND LOOPS ** - Let's use nlsw88_complete to explore locals and loops further! Let's imagine we want to - make a "dictionary" from this dataset, or print on the screen some information - about each of the variables in the data. +/* +Which of the following is true about loops? 
+(1) you must select part of EACH line of the loop (including the close bracket) for it to run +(2) forval loops can loop over any list of numbers +(3) you cannot create/change locals inside a loop +(4) foreach loops can only loop over variables +(5) if looping over a macro, you always need to use typical macro syntax ($global or `local’) +*/ + + +/** Challenge question 2: locals and loops **/ +/* +Let's use nlsw88_complete to explore locals and loops further! Let's imagine we want to make a "dictionary" from this dataset, or print on the screen some information about each of the variables in the data. - In this exercise, we'll focus on ttl_exp, tenure, south and smsa.*/ +In this exercise, we'll focus on ttl_exp, tenure, south and smsa. +*/ -/*2.1: Use the --help extended_fcn-- file to make a local containing the variable -label of ttl_exp, and display it. The command can be found under the subheading -"Macro functions for extracting data attributes" in the help file extended_fcn*/ +/* +2.1: Use the --help extended_fcn-- file to make a local containing the variable label of ttl_exp, and display it. The command can be found under the subheading "Macro functions for extracting data attributes" in the help file extended_fcn * (hint: the variable label is the explanation for what the variable is) +*/ +/* +2.2: Make a loop which goes over ttl_exp, tenure, south and smsa and +lists the variable label for each one. +*/ -/*2.2: Make a loop which goes over ttl_exp, tenure, south and smsa and -lists the variable label for each one.*/ - - - - - - -/*2.3: Display the sentence - using locals and extended functions, not words - - in the following format: "ttl_exp (float) contains the total work experence for each - woman in the dataset." */ - - - +/* +2.3: Display the sentence - using locals and extended functions, not words - in the following format: "ttl_exp (float) contains the total work experence for each woman in the dataset." 
+*/ -/*2.4: Make a loop which takes your sentence above, and fills it in for - ttl_exp, tenure, south and smsa. Put a number at the beginning of each sentence - which updates by one every time your loop runs*/ - - - - +/* +2.4: Make a loop which takes your sentence above, and fills it in for ttl_exp, tenure, south and smsa. Put a number at the beginning of each sentence which updates by one every time your loop runs +*/ -/*2.5 (CHALLENGE): Write a loop which produces the exact same results, - but this time use a forvalues loop to loop over the numbers 1 to 4 to do so. +/* +2.5 (CHALLENGE): Write a loop which produces the exact same results, but this time use a forvalues loop to loop over the numbers 1 to 4 to do so. +Hint: check the extended function help file and look at "word # of string". +*/ - Hint: check the extended function help file and look at "word # of string".*/ - - - - ********************************************** * V. EXPORTING RESULTS ********************************************** //Create a folder to store output -//the mkdir folder creates the folder specified in " " (if the file path makes sense) -//cap, or capture, is a Stata command which tells Stata to keep going even if it can't implement that command +// the mkdir folder creates the folder specified in " " (if the file path makes sense) +// cap, or capture, is a Stata command which tells Stata to keep going even if it can't implement that command // we are making use of your global $mycomp so that we don't have to write out the whole filepath @@ -432,10 +438,17 @@ cap mkdir "$mycomp/Output" *OUTREG2 //To install outreg2: ssc install outreg2 - global controlvars south married union // POLL 7 // +/* +Take a look at the outreg2 help file on the screen. Which parts of the command must be specified for the command to run in the Full Syntax? 
+(1) Whether the command replaces or appends +(2) A column title +(3) A list of variables or estimations to export +(4) A file name for where the results will be stored +(5) None of the above +*/ // Export results to EXCEL (default is text file) diff --git a/Stata Fundamentals III/workshop3_solutions.do b/Stata Fundamentals III/workshop3_solutions.do index afcb260..c164028 100644 --- a/Stata Fundamentals III/workshop3_solutions.do +++ b/Stata Fundamentals III/workshop3_solutions.do @@ -1,7 +1,6 @@ ******************************* * STATA FUNDAMENTALS: WORKSHOP 3 * SPRING 2021, D-LAB -* SOLUTIONS DO FILE ******************************** ************************************************** @@ -17,23 +16,26 @@ ************************************************** -/* Step 1: File > Change Working Directory > Navigate to the folder where you - have saved the data file nlsw88.dta */ +/* Step 1: File > Change Working Directory > Navigate to the folder where you have saved the data file nlsw88.dta */ -/* Step 2: Copy-paste the last command that shows up on result screen. - My result window shows this:*/ +/* Step 2: Copy-paste the last command that shows up on result screen. My result window shows this:*/ -cd "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals III" +cd "\\Client\C$\Users\salma\Box\dlab_workshops-s21\stata-fundamentals\Stata Fundamentals III" /*** - We paste this command above so that next time we can just run this - do-file from the top and it will run smoothly. We will not need to - use the file menu or copy-paste again. We should be able to run - everything from the do-file. +We paste this command above so that next time we can just run this do-file from the top and it will run smoothly. We will not need to +use the file menu or copy-paste again. We should be able to run everything from the do-file. ***/ pwd // POLL 1 // +/* +Run the command "pwd". Is your working directory the proper folder on your computer? + +(1) Yes <-- good job! 
+(2) No <-- try running the command above again +(3) Not sure (IOKN2K!) <-- look for the nlsw88.dta file on your computer. Is that folder the same folder as the one that shows up when you type "pwd"? +*/ ********************************************** * 0. WORKSHOP II WRAP-UP @@ -57,8 +59,6 @@ sum idcode // what is the range of id numbers in this dataset? br // let's browse the data - - * Data for round 2, use nlsw88_wave2.dta, clear @@ -83,16 +83,14 @@ br // let's browse the data save nlsw88_wave1and2.dta, replace - * MERGE DATASETS * Data for wave 1 use nlsw88_wave1and2.dta -isid idcode // check if id makes a unique identifier +isid idcode // check if id makes a unique identifier. If we don't get an error message, then idcode is a unique identifier duplicates report idcode - *Lets look at the second part of the dataset use nlsw88_childvars, clear @@ -104,7 +102,7 @@ duplicates report idcode // another way of checking if idcode is unique * Merge use nlsw88_wave1and2.dta, clear -merge 1:1 idcode using nlsw88_childvars //one-to-one merge +merge 1:1 idcode using nlsw88_childvars // one-to-one merge on idcode *What does it mean to have different _merge values? tab _merge @@ -120,6 +118,15 @@ drop _merge_1 //or drop merge save nlsw88_complete.dta, replace // POLL 2 // +/* +Which of the following are true about appending and merging? + +(1) to merge, a variable must uniquely identify obs in both datasets +(2) appending only works if all variables appear in both datasets +(3) if not all observations merge, an error has occurred +(4) appending requires fewer arguments than merging <-- correct answer +(5) you can never lose information appending or merging +*/ ********************************************** * II. 
RESHAPING @@ -128,11 +135,12 @@ save nlsw88_complete.dta, replace *Load data use nlsw88_complete.dta, clear -//Some of this data is in "wide" format +// Some of this data is in "wide" format list idcode childage1 childage2 childage3 childage4 in 1/10 //print the data to the screen -//First, we want to try and reshape it to "long" format -//Now, each row will be not a single individual, but an individual-child +// First, we want to try and reshape it to "long" format +// Now, each row will be not a single individual, but an individual-child +// "long" indicates that we want to go from wide to long format, and "childage" indicates that we will be creating another id based on the childage# variables that we will name childidcode reshape long childage, i(idcode) j(childidcode) list idcode childidcode childage in 1/10 //print again @@ -150,38 +158,35 @@ j() will be how we're reshaping it*/ reshape wide childage, i(idcode) j(childidcode) // POLL 3 // - - -/* CHALLENGE: RESHAPING AND MERGING ** - Rather than merging nlsw88_childvars into nlsw88_wave1and2 and then reshaping, - we could instead have first reshaped nlsw88_childvars, and then done a - many-to-one merge. Let's try that now!*/ - -/*1.1: Open up nlsw88_childvars, and reshape it to long format*/ +/* +When merging and reshaping data, Stata uses ‘idcode’ as its unique identifier because… +(1) it has ‘id’ in the name +(2) it has ‘code’ in the name +(3) because it has no duplicates +(4) because we tell Stata in the command <-- correct answer +(5) Both 1 and 4 are correct +*/ + +/* Challenge question 1 */ +/* +Rather than merging nlsw88_childvars into nlsw88_wave1and2 and then reshaping, we could instead have first reshaped nlsw88_childvars, and then done a many-to-one merge. Let's try that now! 
+ +1.1: Open up nlsw88_childvars, and reshape it to long format +*/ use nlsw88_childvars, clear reshape long childage, i(idcode) j(childidcode) - - - -/*1.2: Merge nlsw88_wave1and2 (using) into nlsw88_childvars (master) - using a many-to-one syntax*/ +/* +1.2: +Merge nlsw88_wave1and2 (using) into nlsw88_childvars (master) using a many-to-one syntax +*/ merge m:1 idcode using nlsw88_wave1and2 - - - -/*1.3: We want this data to be organized at the woman-child level, - meaning we should have a number of observations for each woman matching - the number of children she has. For example, if a women has 3 children, there - should be 3 observations for her. - 1.3.1: How many observations are there initially? How many women are there in our data? - (hint: use the user-written command --unique-- by typing "install ssc unique" - and then looking at the help file) - 1.3.2: How could you check if there are women with extra observations? - (note: there are many ways to 'answer' this question) - 1.3.3: Can you find a way to drop observations for "fake" (created by the reshape) - child observations? - 1.3.4: What is the correct number of observations in the end?*/ - +/* +1.3: We want this data to be organized at the woman-child level, meaning we should have a number of observations for each woman matching the number of children she has. For example, if a woman has 3 children, there should be 3 observations for her. +1.3.1: How many observations are there initially? How many women are there in our data? (hint: use the user-written command --unique-- by typing "ssc install unique" and then looking at the help file) +1.3.2: How could you check if there are women with extra observations? (note: there are many ways to 'answer' this question) +1.3.3: Can you find a way to drop observations for "fake" (created by the reshape) child observations? +1.3.4: What is the correct number of observations in the end? 
+*/ *1.3.1 count // 12,635 observations unique idcode // this user-written command will tells us there should be 3167 women @@ -200,9 +205,6 @@ drop if (child_num==0 | child_num==.) & childidcode>1 & _merge!=2 // this keeps count // 6,568 obserations unique idcode // still 3167 - - - ********************************************** * III. MACROS @@ -212,9 +214,9 @@ unique idcode // still 3167 use nlsw88_complete.dta, clear *LOCALS - -local i=1 -disp `i' +// locals are a way to save a value in memory until you close Stata, or run another ado file +local i=1 // save the value 1 to variable i +disp `i' // when referring to a local variable, use the `' syntax (i.e. `i') disp "The local called i has the value `i'" //Now we increase i by 2. local i=`i'+2 @@ -243,12 +245,19 @@ local industry_lab: value label industry display "The value label for industry is `industry_lab'." // POLL 4 // +/* +Which of the following would allow you to display a local that contains the string “I love Stata”? +(1) display `local' +(2) display $local +(3) display “`local'” <-- correct answer +(4) display “$local” +(5) 1 or 3 +*/ *GLOBALS - -//Considered bad form in programming, use sparingly. -//Easy list of long set of variable names -//Set the file path name for different computers +// Considered bad form in programming, use sparingly. +// Easy list of long set of variable names +// Set the file path name for different computers /* Making a Global:*/ @@ -256,7 +265,6 @@ display "The value label for industry is `industry_lab'." 
pwd * copy your own working directory and replace mine below* - global mycomp "C:\Users\heroa\Google Drive\DLab\stata-fundamentals\Stata Fundamentals II" //Check if it worked @@ -268,7 +276,7 @@ cd "$mycomp" // check it worked pwd -// this global will be useful latter when we save files to a different folder +// this global will be useful later when we save files to a different folder ********************************************** @@ -287,25 +295,33 @@ foreach var in wage ttl_exp hours { reg `var' grade } - - //Instead of using foreach var in, we can also use foreach var of //This works only with variables foreach fudge of varlist wage ttl_exp hours { reg `fudge' grade } - //Using of varlist lets us do interesting things like search our variable list foreach fudge of varlist t* { reg `fudge' grade } -/*You may notice that the output from inside a loop is not +/* +You may notice that the output from inside a loop is not quite as well documented as from outside a loop -It can be helpful to add display lines explaining where the code is*/ +It can be helpful to add display lines explaining where the code is +*/ // POLL 5 // +/* +Look at the code for the loop on the screen. How many times will the code inside this loop run? +(1) One time +(2) Two times +(3) Three times <-- correct answer (for loop directly below this poll) +(4) Four times +(5) Six times +(6) There’s no way to know +*/ foreach fudge in wage ttl_exp hours { disp "********This regresses `fudge' on grade **********" @@ -381,50 +397,52 @@ foreach var of varlist `outcomes' { } // POLL 6 // - - -/** CHALLENGE 2: LOCALS AND LOOPS ** - Let's use nlsw88_complete to explore locals and loops further! Let's imagine we want to - make a "dictionary" from this dataset, or print on the screen some information - about each of the variables in the data. +/* +Which of the following is true about loops? 
+(1) you must select part of EACH line of the loop (including the close bracket) for it to run +(2) forval loops can loop over any list of numbers +(3) you cannot create/change locals inside a loop +(4) foreach loops can only loop over variables +(5) if looping over a macro, you always need to use typical macro syntax ($global or `local’) +*/ + + +/** Challenge question 2: locals and loops **/ +/* +Let's use nlsw88_complete to explore locals and loops further! Let's imagine we want to make a "dictionary" from this dataset, or print on the screen some information about each of the variables in the data. - In this exercise, we'll focus on ttl_exp, tenure, south and smsa.*/ +In this exercise, we'll focus on ttl_exp, tenure, south and smsa. +*/ -/*2.1: Use the --help extended_fcn-- file to make a local containing the variable -label of ttl_exp, and display it. The command can be found under the subheading -"Macro functions for extracting data attributes" in the help file extended_fcn*/ +/* +2.1: Use the --help extended_fcn-- file to make a local containing the variable label of ttl_exp, and display it. The command can be found under the subheading "Macro functions for extracting data attributes" in the help file extended_fcn * (hint: the variable label is the explanation for what the variable is) - +*/ local lbl : var label ttl_exp display "The variable label of ttl_exp is `lbl'." -/*2.2: Make a loop which goes over ttl_exp, tenure, south and smsa and -lists the variable label for each one.*/ - +/* +2.2: Make a loop which goes over ttl_exp, tenure, south and smsa and +lists the variable label for each one. +*/ foreach var of varlist t* s* { local lbl : var label `var' display "The variable label of `var' is `lbl'." } - - -/*2.3: Display the sentence - using locals and extended functions, not words - - in the following format: "ttl_exp (float) contains the total work experence for each - woman in the dataset." 
 */ - +/* +2.3: Display the sentence - using locals and extended functions, not words - in the following format: "ttl_exp (float) contains the total work experience for each woman in the dataset." +*/ local lbl : var label ttl_exp local type : type ttl_exp display "ttl_exp (`type') contains `lbl' for each woman in the dataset." - - -/*2.4: Make a loop which takes your sentence above, and fills it in for - ttl_exp, tenure, south and smsa. Put a number at the beginning of each sentence - which updates by one every time your loop runs*/ - +/* +2.4: Make a loop which takes your sentence above, and fills it in for ttl_exp, tenure, south and smsa. Put a number at the beginning of each sentence which updates by one every time your loop runs +*/ local x=1 foreach var of varlist t* s* { local lbl : var label `var' @@ -433,15 +451,10 @@ foreach var of varlist t* s* { local x = `x' + 1 } - - - - -/*2.5 (CHALLENGE): Write a loop which produces the exact same results, - but this time use a forvalues loop to loop over the numbers 1 to 4 to do so. - - Hint: check the extended function help file and look at "word # of string".*/ - +/* +2.5 (CHALLENGE): Write a loop which produces the exact same results, but this time use a forvalues loop to loop over the numbers 1 to 4 to do so. +Hint: check the extended function help file and look at "word # of string". +*/ local var_list ttl_exp tenure south smsa forvalues x=1/4 { local var : word `x' of `var_list' @@ -450,8 +463,6 @@ forvalues x=1/4 { display "`x'. `var' (`type') contains `lbl' for each woman in the dataset." } - - ********************************************** * V. EXPORTING RESULTS @@ -465,13 +476,21 @@ forvalues x=1/4 { cap mkdir "$mycomp/Output" + *OUTREG2 //To install outreg2: ssc install outreg2 - global controlvars south married union // POLL 7 // +/* +Take a look at the outreg2 help file on the screen. Which parts of the command must be specified for the command to run in the Full Syntax? 
+(1) Whether the command replaces or appends +(2) A column title +(3) A list of variables or estimations to export +(4) A file name for where the results will be stored <-- correct answer +(5) None of the above +*/ // Export results to EXCEL (default is text file)