Project 4
Instructions
As usual, you will work on Posit Cloud for this project. Write your responses in a Quarto/RMarkdown file in the project4 tab in Posit Cloud.
Specific questions
1. Load the `atlas_training.rds` file.
2. Produce simple summary statistics (mean and standard deviation) for the 10 predictors you selected from the data and for `kfr_pooled_p25`.
3. Run a linear regression of `kfr_pooled_p25` on only 10 predictors, inspect the results, and comment on your findings. That is, interpret the predicted changes in mobility as your 10 predictors change.
4. How well does your linear regression predict `kfr_pooled_p25` in-sample? Present the RMSE.
5. How well does your linear regression predict `kfr_pooled_p25` out-of-sample? Present the RMSE.
6. Implement a decision tree model to predict `kfr_pooled_p25` using the code below (covered in class). Plot the decision tree if possible. What are the main predictors?
7. How well does your decision tree predict `kfr_pooled_p25` in-sample? Present the RMSE.
8. How well does your decision tree predict `kfr_pooled_p25` out-of-sample? Present the RMSE.
9. Which model predicts better?
Data Description
The total data (`train` + `test`) consist of n = 73,278 U.S. Census tracts. For more details on the construction of the variables included in this data set, please see Chetty, Raj, John Friedman, Nathaniel Hendren, Maggie R. Jones, and Sonya R. Porter. 2018. “The Opportunity Atlas: Mapping the Childhood Roots of Social Mobility.” NBER Working Paper No. 25147.
Table 1
Definitions of Variables in `train` and `test`
Variable name | Label | Obs. |
---|---|---|
(1) | (2) | (3) |
1. Geographic identifiers | ||
tract | Tract FIPS Code (6-digit) 2010 | 73,278 |
county | County FIPS Code (3-digit) | 73,278 |
state | State FIPS Code (2-digit) | 73,278 |
cz | Commuting Zone Identifier (1990 Definition) | 72,473 |
2. Characteristics of Census tracts | ||
hhinc_mean2000 | Mean Household Income 2000 | 72,302 |
mean_commutetime2000 | Average Commute Time of Working Adults in 2000 | 72,313 |
frac_coll_plus2010 | Fraction of Residents with a College Degree or More in 2010 | 72,993 |
frac_coll_plus2000 | Fraction of Residents with a College Degree or More in 2000 | 72,343 |
foreign_share2010 | Share of Population Born Outside the U.S. | 72,279 |
med_hhinc2016 | Median Household Income in 2016 | 72,763 |
med_hhinc1990 | Median Household Income in 1990 | 72,313 |
popdensity2000 | Population Density (per square mile) in 2000 | 72,469 |
poor_share2010 | Poverty Rate 2010 | 72,933 |
poor_share2000 | Poverty Rate 2000 | 72,315 |
poor_share1990 | Poverty Rate 1990 | 72,323 |
share_black2010 | Share black 2010 | 73,111 |
share_hisp2010 | Share Hispanic 2010 | 73,111 |
share_asian2010 | Share Asian 2010 | 71,945 |
share_black2000 | Share black 2000 | 72,368 |
share_white2000 | Share white 2000 | 72,368 |
share_hisp2000 | Share Hispanic 2000 | 72,368 |
share_asian2000 | Share Asian 2000 | 71,050 |
gsmn_math_g3_2013 | Average School District Level Standardized Test Scores in 3rd Grade in 2013 | 72,090 |
rent_twobed2015 | Average Rent for Two-Bedroom Apartment in 2015 | 56,607 |
singleparent_share2010 | Share of Single-Headed Households with Children 2010 | 72,564 |
singleparent_share1990 | Share of Single-Headed Households with Children 1990 | 72,196 |
singleparent_share2000 | Share of Single-Headed Households with Children 2000 | 72,285 |
traveltime15_2010 | Share of Working Adults w/ Commute Time of 15 Minutes Or Less in 2010 | 72,939 |
emp2000 | Employment Rate 2000 | 72,344 |
mail_return_rate2010 | Census Form Return Rate 2010 | 72,547 |
ln_wage_growth_hs_grad | Log wage growth for HS Grad., 2005-2014 | 51,635 |
jobs_total_5mi_2015 | Number of Primary Jobs within 5 Miles in 2015 | 72,311 |
jobs_highpay_5mi_2015 | Number of High-Paying (>USD40,000 annually) Jobs within 5 Miles in 2015 | 72,311 |
nonwhite_share2010 | Share of People who are not white 2010 | 73,111 |
popdensity2010 | Population Density (per square mile) in 2010 | 73,194 |
ann_avg_job_growth_2004_2013 | Average Annual Job Growth Rate 2004-2013 | 70,664 |
job_density_2013 | Job Density (per square mile) in 2013 | 72,463 |
3. Measures of Upward Mobility from the Opportunity Atlas | ||
kfr_pooled_p25 | Household income ($) at age 31-37 for children with parents at the 25th percentile of the national income distribution | 72,011 |
kfr_pooled_p75 | Household income ($) at age 31-37 for children with parents at the 75th percentile of the national income distribution | 72,012 |
kfr_pooled_p100 | Household income ($) at age 31-37 for children with parents at the 100th percentile of the national income distribution | 71,968 |
kfr_natam_p25 | Household income ($) at age 31-37 for Native American children with parents at the 25th percentile of the national income distribution | 1,733 |
kfr_natam_p75 | Household income ($) at age 31-37 for Native American children with parents at the 75th percentile of the national income distribution | 1,728 |
kfr_natam_p100 | Household income ($) at age 31-37 for Native American children with parents at the 100th percentile of the national income distribution | 1,594 |
kfr_asian_p25 | Household income ($) at age 31-37 for Asian children with parents at the 25th percentile of the national income distribution | 15,434 |
kfr_asian_p75 | Household income ($) at age 31-37 for Asian children with parents at the 75th percentile of the national income distribution | 15,360 |
kfr_asian_p100 | Household income ($) at age 31-37 for Asian children with parents at the 100th percentile of the national income distribution | 13,480 |
kfr_black_p25 | Household income ($) at age 31-37 for Black children with parents at the 25th percentile of the national income distribution | 34,086 |
kfr_black_p75 | Household income ($) at age 31-37 for Black children with parents at the 75th percentile of the national income distribution | 34,049 |
kfr_black_p100 | Household income ($) at age 31-37 for Black children with parents at the 100th percentile of the national income distribution | 32,536 |
kfr_hisp_p25 | Household income ($) at age 31-37 for Hispanic children with parents at the 25th percentile of the national income distribution | 37,611 |
kfr_hisp_p75 | Household income ($) at age 31-37 for Hispanic children with parents at the 75th percentile of the national income distribution | 37,579 |
kfr_hisp_p100 | Household income ($) at age 31-37 for Hispanic children with parents at the 100th percentile of the national income distribution | 35,987 |
kfr_white_p25 | Household income ($) at age 31-37 for white children with parents at the 25th percentile of the national income distribution | 67,978 |
kfr_white_p75 | Household income ($) at age 31-37 for white children with parents at the 75th percentile of the national income distribution | 67,968 |
kfr_white_p100 | Household income ($) at age 31-37 for white children with parents at the 100th percentile of the national income distribution | 67,627 |
3. Counts of number of children under 18 in 2000 (to calculate weighted summary statistics) | ||
count_pooled | Count of all children | 72,451 |
count_white | Count of White children | 72,451 |
count_black | Count of Black children | 72,451 |
count_asian | Count of Asian children | 72,451 |
count_hisp | Count of Hispanic children | 72,451 |
count_natam | Count of Native American children | 72,451 |
4. Measures of Social Capital | ||
ec_zip | Baseline definition of economic connectedness: two times the share of high-SES friends among low-SES individuals, averaged over all low-SES individuals in the ZIP code. See equations (1), (2), and (3) of Chetty et al. (2022a) for a formal definition. | 71,516 |
ec_high_zip | Economic connectedness for high-SES individuals: two times the share of high-SES friends among high-SES individuals, averaged over all high-SES individuals in the ZIP code. | 71,516 |
clustering_zip | The average fraction of an individual’s friend pairs who are also friends with each other. See equations (4) and (5) of Chetty et al. (2022a). They include links to people outside the ZIP code when calculating individual clustering (equation 4), but only average individual clustering over users in the relevant ZIP code to compute clustering at the ZIP code level (equation 5). | 71,950 |
volunteering_rate_zip | The percentage of Facebook users who are members of a group which is predicted to be about ‘volunteering’ or ‘activism’ based on group title and other group characteristics. We do not include groups that have the privacy setting ‘secret’ enabled. We additionally manually review the 50 largest such groups in the United States and the largest group in each state, and remove the very small number of groups that are clearly misclassified. | 71,950 |
civic_organizations_zip | The number of Facebook Pages predicted to be “Public Good” pages based on page title, category, and other page characteristics, per 1,000 users in the ZIP code. They remove pages that do not have a website linked, do not have a description on their Facebook page or do not have an address listed. We then assign the page to a ZIP code on the basis of its listed address. | 71,938 |
5. Other variables | ||
GenderIncome Inequality_2018 | Gender Income Inequality in 2018, for person 15 years or older with income. | |
MedianIncome Person_2020 | Median Income per person in 2020 | |
MedianAgePerson_2020 | Median Age per person in 2020 | |
CountPersonNoHealth Insurance_2020 | Number of people with no health insurance (public or private) in 2020. | |
CountPerson Divorced_2020 | Number of people divorced in 2020 | |
CountGedOrAlternative Credential_2020 | Number of people with GED or alternative credential in 2020 | |
PersonWithDisability_2019 | Count of people with some type of disability in 2019 | |
CountHouseholdInternet WithoutSubscription_2020 | Number of households with internet access without subscription in 2020 | |
LimitedEnglishSpeaking Household_SpanishSpokenAtHome_2019 | Count of households that speak limited English and speak Spanish at home in 2019 | |
Household_WithFoodStamps InThePast12Months_AbovePovertyLevelInThePast12Months_2019 | Number of households in 2019 that received food stamps and that are above the poverty level in the past 12 months | |
Count_Household_With 0AvailableVehicles_2020 | Number of households that have 0 vehicles in 2020 | |
Count_Person_Single MotherFamilyHousehold_2020 | Number of single mother family households in 2020 | |
Count_Person_Single FatherFamilyHousehold_2020 | Number of single father family households in 2020 | |
Count_NotAUS Citizen_2020 | Number of people that were not US Citizens in 2020 | |
Count_Person_Speak EnglishNotAtAll_2020 | Number of people that do not speak English at all in 2020 | |
Count_Medicare Enrollee_2016 | Number of people enrolled in Medicare in 2016 | |
Count_Death_2017 | Number of deaths in 2017 | |
LowerConfInterval_Percent_ Person_BingeDrinking_2018 | Percent of people that practice binge drinking in 2018 (reported as the lower confidence interval) | |
LowerConfInterval_Percent_ Person_Obesity_2018 | Percent of people with obesity in 2018 (reported as the lower confidence interval) | |
Value Percent_Person_ WithDiabetes_2018 | Percent of population with diabetes in 2018 | |
Median_Cost_HousingUnit_ WithMortgage_2020 | Median Cost of a housing unit with mortgage in 2020 | |
Count_Person_19To 34Years_2020 | Count of people that are between 19 to 34 years old in 2020 | |
Median_Income_Household_ HouseholderRaceHispanic OrLatino_2020 | Median household income for households where the householder’s race is Hispanic or Latino in 2020 | |
Median_Income_Household_ HouseholderRaceWhite Alone_2020 | Median household income for households where the householder’s race is White only in 2020 | |
count_hh_bachhigher_ married_belowp2019 | Number of households in 2019 where the householder has a bachelor’s degree or higher, for a married household below the poverty level in the past 12 months |
To see all other social capital variables not defined above, see here.
Cheatsheet commands
How to tell R that you have a categorical variable?
Recall that if you modify a variable in one dataset, you should do the same in the other one as well (the datasets are `train` and `test`).
Below, `X` is just a placeholder for the actual variable you want to convert into a factor.
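A minimal sketch of the conversion, assuming base R's `as.factor` is the intended tool. The small data frames and their values are made up for illustration; with the real data you would apply the same two lines to your own variable in place of `X`.

```r
# Toy stand-ins for the real train and test data (hypothetical values)
train <- data.frame(X = c(1, 2, 2, 3), y = c(10, 12, 11, 15))
test  <- data.frame(X = c(3, 1), y = c(14, 9))

# Convert X to a categorical variable (factor) in BOTH datasets
train$X <- as.factor(train$X)
test$X  <- as.factor(test$X)

# Check that the conversion worked
is.factor(train$X)
is.factor(test$X)
```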
Run regression
As before, I have written a whole section explaining regression in more detail: Section 14. Please see that for further details. But here is a quick help.
- Multivariate linear regression
You might want to understand the relationship between `yvar` and variable `xvar1` while holding fixed another variable `xvar2`, only for neighborhoods in Milwaukee. You can do this by regressing `yvar` on both variables using only the Milwaukee observations.
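A minimal sketch of such a regression on simulated data. The column `czname` used to pick out Milwaukee is a hypothetical name (it is not listed in Table 1), so replace it with whichever identifier marks Milwaukee in your data.

```r
set.seed(1)

# Toy stand-in for train; czname is a hypothetical column naming the area
train <- data.frame(
  czname = rep(c("Milwaukee", "Madison"), each = 50),
  xvar1  = rnorm(100),
  xvar2  = rnorm(100)
)
train$yvar <- 2 + 3 * train$xvar1 - train$xvar2 + rnorm(100)

# Regress yvar on xvar1 and xvar2, using only the Milwaukee rows
model <- lm(yvar ~ xvar1 + xvar2,
            data = subset(train, czname == "Milwaukee"))
```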
You would see the output from the model by running:
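One common way to inspect a fitted `lm` object is `summary()`; the sketch below assumes that is the intended command and uses simulated data in place of `train`.

```r
set.seed(1)

# Toy stand-in for train (simulated values)
train <- data.frame(xvar1 = rnorm(100), xvar2 = rnorm(100))
train$yvar <- 2 + 3 * train$xvar1 - train$xvar2 + rnorm(100)

model <- lm(yvar ~ xvar1 + xvar2, data = train)

# Coefficients, standard errors, p-values, and R-squared
summary(model)
```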
Measures of accuracy in prediction in a Multiple regression
To assess how good your model is at predicting the outcome, you can use the Root Mean Square Error (RMSE). It takes the prediction error for each observation, squares it, averages the squared errors, and then takes the square root of that average.
You can estimate this measure in-sample (within the `train` dataset) or out-of-sample (for the `test` dataset).
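In symbols, the verbal description above can be written as follows, where the notation is introduced here for illustration: $y_i$ is the observed outcome for observation $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of observations in the dataset being evaluated.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$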
To do it in-sample, you can calculate it directly from the `residuals` stored in the estimated linear model, where `residuals` is the prediction error of the model for each observation (the observed value minus the fitted value).
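A minimal sketch of the in-sample calculation on simulated data, assuming the fitted model is stored in an object called `model` (a name chosen here for illustration).

```r
set.seed(1)

# Toy stand-in for train (simulated values)
train <- data.frame(xvar1 = rnorm(100), xvar2 = rnorm(100))
train$yvar <- 2 + 3 * train$xvar1 - train$xvar2 + rnorm(100)

model <- lm(yvar ~ xvar1 + xvar2, data = train)

# In-sample RMSE: square the residuals, average them, take the root
rmse_in <- sqrt(mean(model$residuals^2))
rmse_in
```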
To do it out-of-sample, you need to use the function `predict`, which requires a model and a dataset containing the same variables used in the model. In this case, you would first generate predictions for `test` and then compute the RMSE from those predictions.
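The two steps can be sketched as below on simulated data. The helper `make_data` and the object names `pred_test` and `rmse_out` are hypothetical and exist only to make the example self-contained.

```r
set.seed(1)

# Hypothetical helper that simulates a small dataset
make_data <- function(n) {
  d <- data.frame(xvar1 = rnorm(n), xvar2 = rnorm(n))
  d$yvar <- 2 + 3 * d$xvar1 - d$xvar2 + rnorm(n)
  d
}
train <- make_data(100)
test  <- make_data(50)

model <- lm(yvar ~ xvar1 + xvar2, data = train)

# Step 1: predictions for the test data
pred_test <- predict(model, newdata = test)

# Step 2: out-of-sample RMSE; na.rm = TRUE skips rows with missing values
rmse_out <- sqrt(mean((test$yvar - pred_test)^2, na.rm = TRUE))
rmse_out
```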
Estimating a Decision Tree
To estimate the decision tree, list the variables you want to use for the prediction, as before. Say you want to use x1, x2, and x3 for the prediction; you can do:
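A minimal sketch, assuming the `rpart` package is the one used in class (the `tree` package is another common choice). The data are simulated, and `y`, `x1`, `x2`, and `x3` are placeholders for your outcome and predictors.

```r
library(rpart)  # assumed package for decision trees

set.seed(1)

# Toy stand-in for train (simulated values)
train <- data.frame(x1 = runif(200), x2 = runif(200), x3 = runif(200))
train$y <- 10 * (train$x1 > 0.5) + 5 * train$x2 + rnorm(200)

# Regression tree using only x1, x2, and x3 as predictors
tree <- rpart(y ~ x1 + x2 + x3, data = train, method = "anova")
tree
```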
If you want to use all other variables for the prediction, you can do:
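A minimal sketch using the formula shorthand `y ~ .`, again assuming the `rpart` package and simulated data with placeholder names.

```r
library(rpart)  # assumed package for decision trees

set.seed(1)

# Toy stand-in for train (simulated values)
train <- data.frame(x1 = runif(200), x2 = runif(200), x3 = runif(200))
train$y <- 10 * (train$x1 > 0.5) + 5 * train$x2 + rnorm(200)

# The dot means "every other column in the data"
tree <- rpart(y ~ ., data = train, method = "anova")
tree
```

Because the dot picks up every column other than the outcome, drop identifiers and other outcome variables from the data first if you do not want them used as predictors.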
To plot the tree you estimated, just do this:
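A minimal sketch of plotting with the base `plot` and `text` methods that `rpart` provides, assuming the tree was fitted with `rpart`. The data are simulated. The last line shows one way to see which predictors matter most.

```r
library(rpart)  # assumed package for decision trees

set.seed(1)

# Toy stand-in for train (simulated values)
train <- data.frame(x1 = runif(200), x2 = runif(200), x3 = runif(200))
train$y <- 10 * (train$x1 > 0.5) + 5 * train$x2 + rnorm(200)

tree <- rpart(y ~ ., data = train, method = "anova")

# Draw the tree skeleton, then add split rules and fitted values as labels
plot(tree, margin = 0.1)
text(tree, use.n = TRUE)

# Variable importance scores stored in the fitted rpart object
tree$variable.importance
```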
Measures of accuracy in a Decision Tree
The measurement is the same, but to calculate it you need to use the `predict` function as before. For in-sample prediction you would do:
# In-sample prediction
p <- predict(tree, train)
# Root mean squared error = 1944.108 (in sample)
sqrt(mean((train$kfr_pooled_p25 - p)^2, na.rm = TRUE))
For out-of-sample prediction, you can run:
pred_tree_outofsample <- predict(tree, test)
# RMSE out of sample = 5965.278
sqrt(mean((test$kfr_pooled_p25 - pred_tree_outofsample)^2, na.rm = TRUE))