Project 3
Instructions
As usual, you will work on Posit Cloud for this project. Write your responses within a Quarto/RMarkdown here file in the project3 tab in Posit Cloud.
Specific questions to address in your narrative
- Once more, start by looking up the city where you grew up on the Social Capital Atlas. Zoom in to the area around your home (or the one you explored in Project 1).
Figure 1 should be a map of the Zip codes in your hometown from the Social Capital Atlas. Figure 2 should be a map of the High Schools in your hometown from the Social Capital Atlas. Lastly, Figure 3 should be a map of the colleges in your hometown from the Social Capital Atlas (if there are none, show how the closest set of colleges look in the map).
Describe the data being used by the authors. What period do the data you are analyzing come from? How should we interpret any possible correlation between these measures of social capital and those of mobility (take into account when each dataset is being measured)?
Now turn to the
atlas_socialk.rds
data set. How does economic connectedness (ec_zip
), cohesiveness (clustering_zip
) and civic engagement (volunteering_rate_zip
andcivic_organizations_zip
) look like in your home Census tract compare to mean (population-weighted, usingcount_pooled
) across the state and the U.S. overall?
Hint: Same as before. You can find the tract, county, and state FIPS codes for your home address from the Opportunity Atlas.
Recall that it’s usually a good idea to load the data and packages at the very beggining of your Quarto file. You will need to load
tidyverse
andHmisc
.
What is the standard deviation of the different social capital measures (population-weighted) in your home county? Is it larger or smaller than the standard deviation across tracts in your state? Across tracts in the country? What do you learn from these comparisons?
Using a linear regression, estimate the relationship between economic connectedness (
ec_zip
) and economic mobility (kfr_pooled_p25
) across the US and in your state. Generate a scatter plot to visualize this regression. Interpret what you find. In particular, is the relationship statistically significant? Would you conclude that there is a causal effect between the variables? Why?How does the results in (5) change if you consider
ec_high_zip
instead? How do you interpret this difference?What happens to the results in (5) if you consider adjusting for other covariates? Identify 2 or 3 additional covariates which could explain mobility differences and include them in a multiple regression which includes
ec_zip
. Some examples of covariates you might examine include housing prices, income inequality, fraction of children with single parents, job density, etc. For 2 or 3 of these, report estimated correlation coefficients along with their 95% confidence intervals.Putting together all the analyses you did above, what have you learned about the relationship between social capital and economic opportunity? Mention any important caveats to your conclusions; for example, can we conclude that the variable you identified as a key predictor in the question above has a causal effect (i.e., changing it would change upward mobility) based on that analysis? Why or why not?
Data Description
The data consist of n = 73,278 U.S. Census tracts. For more details on the construction of the variables included in this data set, please see Chetty, Raj, John Friedman, Nathaniel Hendren, Maggie R. Jones, and Sonya R. Porter. 2018. “The Opportunity Atlas: Mapping the Childhood Roots of Social Mobility.”, NBER Working Paper No. 25147.
Table 1
Definitions of Variables in atlas.rds
Variable name | Label | Obs. |
---|---|---|
(1) | (2) | (3) |
1. Geographic identifiers | ||
tract | Tract FIPS Code (6-digit) 2010 | 73,278 |
county | County FIPS Code (3-digit) | 73,278 |
state | State FIPS Code (2-digit) | 73,278 |
cz | Commuting Zone Identifier (1990 Definition) | 72,473 |
2. Characteristics of Census tracts | ||
hhinc_mean2000 | Mean Household Income 2000 | 72,302 |
mean_commutetime2000 | Average Commute Time of Working Adults in 2000 | 72,313 |
frac_coll_plus2010 | Fraction of Residents with a College Degree or More in 2010 | 72,993 |
frac_coll_plus2000 | Fraction of Residents with a College Degree or More in 2000 | 72,343 |
foreign_share2010 | Share of Population Born Outside the U.S. | 72,279 |
med_hhinc2016 | Median Household Income in 2016 | 72,763 |
med_hhinc1990 | Median Household Income in 1999 | 72,313 |
popdensity2000 | Population Density (per square mile) in 2000 | 72,469 |
poor_share2010 | Poverty Rate 2010 | 72,933 |
poor_share2000 | Poverty Rate 2000 | 72,315 |
poor_share1990 | Poverty Rate 1990 | 72,323 |
share_black2010 | Share black 2010 | 73,111 |
share_hisp2010 | Share Hispanic 2010 | 73,111 |
share_asian2010 | Share Asian 2010 | 71,945 |
share_black2000 | Share black 2000 | 72,368 |
share_white2000 | Share white 2000 | 72,368 |
share_hisp2000 | Share Hispanic 2000 | 72,368 |
share_asian2000 | Share Asian 2000 | 71,050 |
gsmn_math_g3_2013 | Average School District Level Standardized Test Scores in 3rd Grade in 2013 | 72,090 |
rent_twobed2015 | Average Rent for Two-Bedroom Apartment in 2015 | 56,607 |
singleparent_share2010 | Share of Single-Headed Households with Children 2010 | 72,564 |
singleparent_share1990 | Share of Single-Headed Households with Children 1990 | 72,196 |
singleparent_share2000 | Share of Single-Headed Households with Children 2000 | 72,285 |
traveltime15_2010 | Share of Working Adults w/ Commute Time of 15 Minutes Or Less in 2010 | 72,939 |
emp2000 | Employment Rate 2000 | 72,344 |
mail_return_rate2010 | Census Form Rate Return Rate 2010 | 72,547 |
ln_wage_growth_hs_grad | Log wage growth for HS Grad., 2005-2014 | 51,635 |
jobs_total_5mi_2015 | Number of Primary Jobs within 5 Miles in 2015 | 72,311 |
jobs_highpay_5mi_2015 | Number of High-Paying (>USD40,000 annually) Jobs within 5 Miles in 2015 | 72,311 |
nonwhite_share2010 | Share of People who are not white 2010 | 73,111 |
popdensity2010 | Population Density (per square mile) in 2010 | 73,194 |
ann_avg_job_growth_2004_2013 | Average Annual Job Growth Rate 2004-2013 | 70,664 |
job_density_2013 | Job Density (in square miles) in 2013 | 72,463 |
3. Measures of Upward Mobility from the Opportunity Atlas | ||
kfr_pooled_p25 | Household income ($) at age 31-37 for children with parents at the 25th percentile of the national income distribution | 72,011 |
kfr_pooled_p75 | Household income ($) at age 31-37 for children with parents at the 75th percentile of the national income distribution | 72,012 |
kfr_pooled_p100 | Household income ($) at age 31-37 for children with parents at the 100th percentile of the national income distribution | 71,968 |
kfr_natam_p25 | Household income ($) at age 31-37 for Native American children with parents at the 25th percentile of the national income distribution | 1,733 |
kfr_natam_p75 | Household income ($) at age 31-37 for Native American children with parents at the 75th percentile of the national income distribution | 1,728 |
kfr_natam_p100 | Household income ($) at age 31-37 for Native American children with parents at the 100th percentile of the national income distribution | 1,594 |
kfr_asian_p25 | Household income ($) at age 31-37 for Asian children with parents at the 25th percentile of the national income distribution | 15,434 |
kfr_asian_p75 | Household income ($) at age 31-37 for Asian children with parents at the 75th percentile of the national income distribution | 15,360 |
kfr_asian_p100 | Household income ($) at age 31-37 for Asian children with parents at the 100th percentile of the national income distribution | 13,480 |
kfr_black_p25 | Household income ($) at age 31-37 for Black children with parents at the 25th percentile of the national income distribution | 34,086 |
kfr_black_p75 | Household income ($) at age 31-37 for Black children with parents at the 75th percentile of the national income distribution | 34,049 |
kfr_black_p100 | Household income ($) at age 31-37 for Black children with parents at the 100th percentile of the national income distribution | 32,536 |
kfr_hisp_p25 | Household income ($) at age 31-37 for Hispanic children with parents at the 25th percentile of the national income distribution | 37,611 |
kfr_hisp_p75 | Household income ($) at age 31-37 for Hispanic children with parents at the 75th percentile of the national income distribution | 37,579 |
kfr_hisp_p100 | Household income ($) at age 31-37 for Hispanic children with parents at the 100th percentile of the national income distribution | 35,987 |
kfr_white_p25 | Household income ($) at age 31-37 for white children with parents at the 25th percentile of the national income distribution | 67,978 |
kfr_white_p75 | Household income ($) at age 31-37 for white children with parents at the 75th percentile of the national income distribution | 67,968 |
kfr_white_p100 | Household income ($) at age 31-37 for white children with parents at the 100th percentile of the national income distribution | 67,627 |
3. Counts of number of children under 18 in 2000 (to calculate weighted summary statistics) | ||
count_pooled | Count of all children | 72,451 |
count_white | Count of White children | 72,451 |
count_black | Count of Black children | 72,451 |
count_asian | Count of Asian children | 72,451 |
count_hisp | Count of Hispanic children | 72,451 |
count_natam | Count of Native American children | 72,451 |
4. Measures of Social Capital | ||
ec_zip | Baseline definition of economic connectedness: two times the share of high-SES friends among low-SES individuals, averaged over all low-SES individuals in the ZIP code. See equations (1), (2), and (3) of Chetty et al. (2022a) for a formal definition. | 71,516 |
ec_high_zip | Economic connectedness for high-SES individuals: two times the share of high-SES friends among high-SES individuals, averaged over all high-SES individuals in the ZIP code. | 71,516 |
clustering_zip | The average fraction of an individual’s friend pairs who are also friends with each other. See equations (4) and (5) of Chetty et al. (2022a). They include links to people outside the ZIP code when calculating individual clustering (equation 4), but only average individual clustering over users in the relevant ZIP code to compute clustering at the ZIP code level (equation 5). | 71,950 |
volunteering_rate_zip | The percentage of Facebook users who are members of a group which is predicted to be about ‘volunteering’ or ‘activism’ based on group title and other group characteristics. We do not include groups that have the privacy setting ‘secret’ enabled. We additionally manually review the 50 largest such groups in the United States and the largest group in each state, and remove the very small number of groups that are clearly misclassified. | 71,950 |
civic_organizations_zip | The number of Facebook Pages predicted to be “Public Good” pages based on page title, category, and other page characteristics, per 1,000 users in the ZIP code. They remove pages that do not have a website linked, do not have a description on their Facebook page or do not have an address listed. We then assign the page to a ZIP code on the basis of its listed address. | 71,938 |
To see all other social capital variables not defined above, see here.
Cheatsheat commands
R command Description
Here I present a summary of the commands you could use to work on this project. There are two important issues you should keep in mind while reading this:
- Notice that whenever you see
yvar
this is not a real variable. It is only a place holder for the appropriate variable that you decide to analyze or use. For example, if you want to see the mean across neighborhoods of the average household income as measured in 2000, you would not domean(atlas$yvar, na.rm=TRUE)
butmean(atlas$hhinc_mean2000, na.rm=TRUE)
.
Important!
‘yvar’ is not a real variable. You should replace it for the appropriate variable in your code.
- The data
atlas
has missing information for some neighborhoods for some variables. These are calledNA
or missing. Most R functions do not like that you include missings in the function, because R does not know what to do with that. What is5+NA
? NA !! So, for many of these functions, we will explicitely tell R to ignore NAs. That is what the optionna.rm=TRUE
does. It does not not exist for every function, but it does for most of the ones we will use here.
Important!
Careful with missing values (also called ‘NA’)! We will use ‘na.rm=TRUE’ as an option for several functions to tell R to ignore the missings.
Load package
If you wanted to install and open the package Hmisc
(which you will need to calculate the weighted statistics), run:
Weighted summary statistics
You can weight means or other statistics. In our case, we want to use the population weighted statistics in several cases. That is, we want to put more weight on the value of a tract in which more people live than in another with lower population. Recall that the population variable is count_pooled
.
- Weighted mean:
- Weighted standard deviation:
Subset observations
- State level:
If you want to select a subset of observations, you can add the rule for selecting those observations, and the filter
function. Here we subset the observations for the State of Wisconsin, and called the resulting dataset atlas_wisconsin
.
- County level:
We can do the same but now for a specific county, adding an extra rule after a comma ,
or &
. Here we subset the observations for Milwaukee County in Wisconsin:
Standardize variables
You can standardize variables by substracting the mean and dividing by the standard deviation. Let us say that you want to standardize only considering the variables in Milwaukee, then you can do this:
atlas_milwaukee<- atlas_milwaukee %>%
mutate(x_std=(xvar - mean(atlas_milwaukee$xvar, na.rm=TRUE))/sd(atlas_milwaukee$xvar, na.rm=TRUE))
As an example, let’s say you want to standardize both the measure of mobility and the annual job growth for the data in Wisconsin:
atlas_wisconsin<- atlas_wisconsin %>%
mutate(kfr_pooled_p25_std=(kfr_pooled_p25 - mean(atlas_wisconsin$kfr_pooled_p25, na.rm=TRUE))/sd(atlas_wisconsin$kfr_pooled_p25, na.rm=TRUE),
ann_avg_job_growth_2004_2013_std=(ann_avg_job_growth_2004_2013 - mean(atlas_wisconsin$ann_avg_job_growth_2004_2013, na.rm=TRUE))/sd(atlas_wisconsin$ann_avg_job_growth_2004_2013, na.rm=TRUE))
Run regression
I have written a whole section explaining regression in more detail: Section 14. Please see that for further details. But here is a quick help.
- Simple linear regression
Let’s say you want to run a simple regression of variable yvar
on variable xvar1
for the county of Milwaukee. We will save the results of that regression in an object call mod1
(we could give it any name). Then you would do this:
To see what the outcome of the regression is, you would use the function summary
and apply it to our new object mod1
, like this:
As an example, we could regress the mobility of children from parents in the 25th percentile (kfr_pooled_p25
) on the average annual job growth rate between 2004 and 2013 (ann_avg_job_growth_2004_2013
). To do that, we would run:
- Multivariate linear regression
You might want to understand the relationship between yvar
and variable xvar1
while holding fixed another variable xvar2
for neighborhoods only in Milwaukee. You can do this:
How to read the regression output?
To simplify the interpretation, let’s run a regression where you use the standardize both the measure of mobility and the annual job growth for the data in Wisconsin:
mod3 <- lm(kfr_pooled_p25_std ~ ann_avg_job_growth_2004_2013_std, data = atlas_wisconsin)
summary(mod3)
##
## Call:
## lm(formula = kfr_pooled_p25_std ~ ann_avg_job_growth_2004_2013_std,
## data = atlas_wisconsin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.829 -0.569 0.035 0.684 5.128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.00000978 0.02681564 0.00 1.000
## ann_avg_job_growth_2004_2013_std -0.05131843 0.02679889 -1.91 0.056 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.999 on 1386 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.00264, Adjusted R-squared: 0.00192
## F-statistic: 3.67 on 1 and 1386 DF, p-value: 0.0557
You should focus here on interpreting the coefficient for ann_avg_job_growth_2004_2013_std
. You should pay attention to both the magnitude of the number, and the sign. In this case, you would read it like this:
In Wisconsin, increasing one standard deviation the average annual job growth rate is correlated with an reduction of 0.05 standard deviations in the economic mobility for children with parents in the 25th percertile.
Confidence Intervals
To see the confidence intervals for the coefficients, you can run the function confint
on the saved linear regression model:
## 2.5 % 97.5 %
## (Intercept) -0.0526 0.05259
## ann_avg_job_growth_2004_2013_std -0.1039 0.00125
It will give you the range in which the estimated would fall 95 our of 100 times if you would run the same exercise with a different sample of the population.
Plotting the linear relationship
You need to load the ggplot2
package (which should be installed already).
Suppose you want to visually see the linear relationship between two variables in the atlas_milwaukee
dataset that we filtered above. Then, you can do this:
ggplot(data = atlas_milwaukee) + geom_point(aes(x = xvar1, y = yvar)) +
geom_smooth(aes(x = xvar1, y = yvar), method = "lm", se = F)
The function geom_smooth()
adds the line that you calculated above for mod1
, where you ran mod1 <- lm(yvar~xvar1, data = atlas_milwaukee)
.
For more, see Section 12.
Social Capital and Economic Mobility
Posted: Wednesday, October 25, 2023
Due: Midnight on Wednesday, November 8, 2023
You have already explored the Opportunity Atlas in Project 1. Now you will explore the new data on economic mobility that some of the previous researchers and new ones have made freely available in the Social Capital Atlas. It also provides an interactive mapping tool, as before, that attempts to trace the strength of our relationships and communities.
In this empirical project, you will use new data on social capital to estimate it’s relationship to economic mobility. To answer some of the questions you will need to refer to the following papers (both of which are in Blackboard under Supplementary Readings):
Chetty, Raj; Jackson, Matthew O; Kuchler, Theresa; Stroebel, Johannes; Hendren, Nathaniel; Fluegge, Robert B; Gong, Sara; Gonzalez, Federico; Grondin, Armelle; Jacob, Matthew; Johnston, Drew; Koenen, Martin; Laguna-Muggenburg, Eduardo; Mudekereza, Florian; Rutter, Tom; Thor, Nicolaj; Townsend, Wilbur; Zhang, Ruby; Bailey, Mike; Barberá, Pablo; Bhole, Monica; Wernerfelt, Nils, 2022. “Social capital I: measurement and associations with economic mobility”, Nature 608 (7921): 108-121
Chetty, Raj; Jackson, Matthew O; Kuchler, Theresa; Stroebel, Johannes; Hendren, Nathaniel; Fluegge, Robert B; Gong, Sara; Gonzalez, Federico; Grondin, Armelle; Jacob, Matthew; Johnston, Drew; Koenen, Martin; Laguna-Muggenburg, Eduardo; Mudekereza, Florian; Rutter, Tom; Thor, Nicolaj; Townsend, Wilbur; Zhang, Ruby; Bailey, Mike; Barberá, Pablo; Bhole, Monica; Wernerfelt, Nils, 2022. “Social capital II: determinants of economic connectedness”, Nature 608 (7921): 122-134
The data file
atlas_socialk.rds
consists of the same information inatlas
that you used for Project 1, but it now includes information from the Social Captal Atlas as well. This version includes the new estimates of social capital described in the papers above at the ZIP level, which in this case have been connected with the tract-level data.You can load the data by using the following code: