Project 3

Social Capital and Economic Mobility

Posted: Wednesday, October 25, 2023
Due: Midnight on Wednesday, November 8, 2023

You have already explored the Opportunity Atlas in Project 1. Now you will explore the new data on economic mobility that some of the previous researchers and new ones have made freely available in the Social Capital Atlas. It also provides an interactive mapping tool, as before, that attempts to trace the strength of our relationships and communities.

In this empirical project, you will use new data on social capital to estimate it’s relationship to economic mobility. To answer some of the questions you will need to refer to the following papers (both of which are in Blackboard under Supplementary Readings):

The data file atlas_socialk.rds consists of the same information in atlas that you used for Project 1, but it now includes information from the Social Captal Atlas as well. This version includes the new estimates of social capital described in the papers above at the ZIP level, which in this case have been connected with the tract-level data.

You can load the data by using the following code:

atlas_socialk <- readRDS(gzcon(url("https://raw.githubusercontent.com/jrm87/ECO3253_repo/master/data/atlas_socialk_tract.rds")))

Instructions

As usual, you will work on Posit Cloud for this project. Write your responses within a Quarto/RMarkdown here file in the project3 tab in Posit Cloud.

Specific questions to address in your narrative

Once more, start by looking up the city where you grew up on the Social Capital Atlas. Zoom in to the area around your home (or the one you explored in Project 1).

Figure 1 should be a map of the Zip codes in your hometown from the Social Capital Atlas. Figure 2 should be a map of the High Schools in your hometown from the Social Capital Atlas. Lastly, Figure 3 should be a map of the colleges in your hometown from the Social Capital Atlas (if there are none, show how the closest set of colleges look in the map).

Describe the data being used by the authors. What period do the data you are analyzing come from? How should we interpret any possible correlation between these measures of social capital and those of mobility (take into account when each dataset is being measured)?
Now turn to the atlas_socialk.rds data set. How does economic connectedness (ec_zip), cohesiveness (clustering_zip) and civic engagement (volunteering_rate_zip and civic_organizations_zip) look like in your home Census tract compare to mean (population-weighted, using count_pooled) across the state and the U.S. overall?

Hint: Same as before. You can find the tract, county, and state FIPS codes for your home address from the Opportunity Atlas.

Recall that it’s usually a good idea to load the data and packages at the very beggining of your Quarto file. You will need to load tidyverse and Hmisc.

What is the standard deviation of the different social capital measures (population-weighted) in your home county? Is it larger or smaller than the standard deviation across tracts in your state? Across tracts in the country? What do you learn from these comparisons?
Using a linear regression, estimate the relationship between economic connectedness (ec_zip) and economic mobility (kfr_pooled_p25) across the US and in your state. Generate a scatter plot to visualize this regression. Interpret what you find. In particular, is the relationship statistically significant? Would you conclude that there is a causal effect between the variables? Why?
How does the results in (5) change if you consider ec_high_zip instead? How do you interpret this difference?
What happens to the results in (5) if you consider adjusting for other covariates? Identify 2 or 3 additional covariates which could explain mobility differences and include them in a multiple regression which includes ec_zip. Some examples of covariates you might examine include housing prices, income inequality, fraction of children with single parents, job density, etc. For 2 or 3 of these, report estimated correlation coefficients along with their 95% confidence intervals.
Putting together all the analyses you did above, what have you learned about the relationship between social capital and economic opportunity? Mention any important caveats to your conclusions; for example, can we conclude that the variable you identified as a key predictor in the question above has a causal effect (i.e., changing it would change upward mobility) based on that analysis? Why or why not?

Data Description

The data consist of n = 73,278 U.S. Census tracts. For more details on the construction of the variables included in this data set, please see Chetty, Raj, John Friedman, Nathaniel Hendren, Maggie R. Jones, and Sonya R. Porter. 2018. “The Opportunity Atlas: Mapping the Childhood Roots of Social Mobility.”, NBER Working Paper No. 25147.

Table 1

Definitions of Variables in atlas.rds

Variable name	Label	Obs.
(1)	(2)	(3)
1. Geographic identifiers
tract	Tract FIPS Code (6-digit) 2010	73,278
county	County FIPS Code (3-digit)	73,278
state	State FIPS Code (2-digit)	73,278
cz	Commuting Zone Identifier (1990 Definition)	72,473

2. Characteristics of Census tracts
hhinc_mean2000	Mean Household Income 2000	72,302
mean_commutetime2000	Average Commute Time of Working Adults in 2000	72,313
frac_coll_plus2010	Fraction of Residents with a College Degree or More in 2010	72,993
frac_coll_plus2000	Fraction of Residents with a College Degree or More in 2000	72,343
foreign_share2010	Share of Population Born Outside the U.S.	72,279
med_hhinc2016	Median Household Income in 2016	72,763
med_hhinc1990	Median Household Income in 1999	72,313
popdensity2000	Population Density (per square mile) in 2000	72,469
poor_share2010	Poverty Rate 2010	72,933
poor_share2000	Poverty Rate 2000	72,315
poor_share1990	Poverty Rate 1990	72,323
share_black2010	Share black 2010	73,111
share_hisp2010	Share Hispanic 2010	73,111
share_asian2010	Share Asian 2010	71,945
share_black2000	Share black 2000	72,368
share_white2000	Share white 2000	72,368
share_hisp2000	Share Hispanic 2000	72,368
share_asian2000	Share Asian 2000	71,050
gsmn_math_g3_2013	Average School District Level Standardized Test Scores in 3rd Grade in 2013	72,090
rent_twobed2015	Average Rent for Two-Bedroom Apartment in 2015	56,607
singleparent_share2010	Share of Single-Headed Households with Children 2010	72,564
singleparent_share1990	Share of Single-Headed Households with Children 1990	72,196
singleparent_share2000	Share of Single-Headed Households with Children 2000	72,285
traveltime15_2010	Share of Working Adults w/ Commute Time of 15 Minutes Or Less in 2010	72,939
emp2000	Employment Rate 2000	72,344
mail_return_rate2010	Census Form Rate Return Rate 2010	72,547
ln_wage_growth_hs_grad	Log wage growth for HS Grad., 2005-2014	51,635
jobs_total_5mi_2015	Number of Primary Jobs within 5 Miles in 2015	72,311
jobs_highpay_5mi_2015	Number of High-Paying (>USD40,000 annually) Jobs within 5 Miles in 2015	72,311
nonwhite_share2010	Share of People who are not white 2010	73,111
popdensity2010	Population Density (per square mile) in 2010	73,194
ann_avg_job_growth_2004_2013	Average Annual Job Growth Rate 2004-2013	70,664
job_density_2013	Job Density (in square miles) in 2013	72,463

3. Measures of Upward Mobility from the Opportunity Atlas
kfr_pooled_p25	Household income ($) at age 31-37 for children with parents at the 25th percentile of the national income distribution	72,011
kfr_pooled_p75	Household income ($) at age 31-37 for children with parents at the 75th percentile of the national income distribution	72,012
kfr_pooled_p100	Household income ($) at age 31-37 for children with parents at the 100th percentile of the national income distribution	71,968
kfr_natam_p25	Household income ($) at age 31-37 for Native American children with parents at the 25th percentile of the national income distribution	1,733
kfr_natam_p75	Household income ($) at age 31-37 for Native American children with parents at the 75th percentile of the national income distribution	1,728
kfr_natam_p100	Household income ($) at age 31-37 for Native American children with parents at the 100th percentile of the national income distribution	1,594
kfr_asian_p25	Household income ($) at age 31-37 for Asian children with parents at the 25th percentile of the national income distribution	15,434
kfr_asian_p75	Household income ($) at age 31-37 for Asian children with parents at the 75th percentile of the national income distribution	15,360
kfr_asian_p100	Household income ($) at age 31-37 for Asian children with parents at the 100th percentile of the national income distribution	13,480
kfr_black_p25	Household income ($) at age 31-37 for Black children with parents at the 25th percentile of the national income distribution	34,086
kfr_black_p75	Household income ($) at age 31-37 for Black children with parents at the 75th percentile of the national income distribution	34,049
kfr_black_p100	Household income ($) at age 31-37 for Black children with parents at the 100th percentile of the national income distribution	32,536
kfr_hisp_p25	Household income ($) at age 31-37 for Hispanic children with parents at the 25th percentile of the national income distribution	37,611
kfr_hisp_p75	Household income ($) at age 31-37 for Hispanic children with parents at the 75th percentile of the national income distribution	37,579
kfr_hisp_p100	Household income ($) at age 31-37 for Hispanic children with parents at the 100th percentile of the national income distribution	35,987
kfr_white_p25	Household income ($) at age 31-37 for white children with parents at the 25th percentile of the national income distribution	67,978
kfr_white_p75	Household income ($) at age 31-37 for white children with parents at the 75th percentile of the national income distribution	67,968
kfr_white_p100	Household income ($) at age 31-37 for white children with parents at the 100th percentile of the national income distribution	67,627

3. Counts of number of children under 18 in 2000 (to calculate weighted summary statistics)
count_pooled	Count of all children	72,451
count_white	Count of White children	72,451
count_black	Count of Black children	72,451
count_asian	Count of Asian children	72,451
count_hisp	Count of Hispanic children	72,451
count_natam	Count of Native American children	72,451

4. Measures of Social Capital
ec_zip	Baseline definition of economic connectedness: two times the share of high-SES friends among low-SES individuals, averaged over all low-SES individuals in the ZIP code. See equations (1), (2), and (3) of Chetty et al. (2022a) for a formal definition.	71,516
ec_high_zip	Economic connectedness for high-SES individuals: two times the share of high-SES friends among high-SES individuals, averaged over all high-SES individuals in the ZIP code.	71,516
clustering_zip	The average fraction of an individual’s friend pairs who are also friends with each other. See equations (4) and (5) of Chetty et al. (2022a). They include links to people outside the ZIP code when calculating individual clustering (equation 4), but only average individual clustering over users in the relevant ZIP code to compute clustering at the ZIP code level (equation 5).	71,950
volunteering_rate_zip	The percentage of Facebook users who are members of a group which is predicted to be about ‘volunteering’ or ‘activism’ based on group title and other group characteristics. We do not include groups that have the privacy setting ‘secret’ enabled. We additionally manually review the 50 largest such groups in the United States and the largest group in each state, and remove the very small number of groups that are clearly misclassified.	71,950
civic_organizations_zip	The number of Facebook Pages predicted to be “Public Good” pages based on page title, category, and other page characteristics, per 1,000 users in the ZIP code. They remove pages that do not have a website linked, do not have a description on their Facebook page or do not have an address listed. We then assign the page to a ZIP code on the basis of its listed address.	71,938

To see all other social capital variables not defined above, see here.

Cheatsheat commands

R command Description

Here I present a summary of the commands you could use to work on this project. There are two important issues you should keep in mind while reading this:

Notice that whenever you see yvar this is not a real variable. It is only a place holder for the appropriate variable that you decide to analyze or use. For example, if you want to see the mean across neighborhoods of the average household income as measured in 2000, you would not do mean(atlas$yvar, na.rm=TRUE) but mean(atlas$hhinc_mean2000, na.rm=TRUE).

Important!

‘yvar’ is not a real variable. You should replace it for the appropriate variable in your code.

The data atlas has missing information for some neighborhoods for some variables. These are called NA or missing. Most R functions do not like that you include missings in the function, because R does not know what to do with that. What is 5+NA ? NA !! So, for many of these functions, we will explicitely tell R to ignore NAs. That is what the option na.rm=TRUE does. It does not not exist for every function, but it does for most of the ones we will use here.

Important!

Careful with missing values (also called ‘NA’)! We will use ‘na.rm=TRUE’ as an option for several functions to tell R to ignore the missings.

Unweighted summary statistics

summary(atlas$yvar)
mean(atlas$yvar, na.rm=TRUE)
sd(atlas$yvar, na.rm=TRUE)

Load package

If you wanted to install and open the package Hmisc (which you will need to calculate the weighted statistics), run:

install.packages("Hmisc")
library(Hmisc)

Weighted summary statistics

You can weight means or other statistics. In our case, we want to use the population weighted statistics in several cases. That is, we want to put more weight on the value of a tract in which more people live than in another with lower population. Recall that the population variable is count_pooled.

Weighted mean:

wtd.mean(atlas$yvar, atlas$count_pooled)

Weighted standard deviation:

sqrt(wtd.var(atlas$yvar, atlas$count_pooled))

Subset observations

State level:

If you want to select a subset of observations, you can add the rule for selecting those observations, and the filter function. Here we subset the observations for the State of Wisconsin, and called the resulting dataset atlas_wisconsin.

atlas_wisconsin <- atlas %>% 
  filter(state == 55)

County level:

We can do the same but now for a specific county, adding an extra rule after a comma , or &. Here we subset the observations for Milwaukee County in Wisconsin:

atlas_milwaukee <- atlas %>% 
  filter(state == 55 & county == 079)

Standardize variables

You can standardize variables by substracting the mean and dividing by the standard deviation. Let us say that you want to standardize only considering the variables in Milwaukee, then you can do this:

atlas_milwaukee<- atlas_milwaukee %>%
  mutate(x_std=(xvar - mean(atlas_milwaukee$xvar, na.rm=TRUE))/sd(atlas_milwaukee$xvar, na.rm=TRUE))

As an example, let’s say you want to standardize both the measure of mobility and the annual job growth for the data in Wisconsin:

atlas_wisconsin<- atlas_wisconsin %>%
  mutate(kfr_pooled_p25_std=(kfr_pooled_p25 - mean(atlas_wisconsin$kfr_pooled_p25, na.rm=TRUE))/sd(atlas_wisconsin$kfr_pooled_p25, na.rm=TRUE),
         ann_avg_job_growth_2004_2013_std=(ann_avg_job_growth_2004_2013 - mean(atlas_wisconsin$ann_avg_job_growth_2004_2013, na.rm=TRUE))/sd(atlas_wisconsin$ann_avg_job_growth_2004_2013, na.rm=TRUE))

Run regression

I have written a whole section explaining regression in more detail: Section 14. Please see that for further details. But here is a quick help.

Simple linear regression

Let’s say you want to run a simple regression of variable yvar on variable xvar1 for the county of Milwaukee. We will save the results of that regression in an object call mod1 (we could give it any name). Then you would do this:

mod1 <- lm(yvar~xvar1, data = atlas_milwaukee)

To see what the outcome of the regression is, you would use the function summary and apply it to our new object mod1, like this:

summary(mod1)

As an example, we could regress the mobility of children from parents in the 25th percentile (kfr_pooled_p25) on the average annual job growth rate between 2004 and 2013 (ann_avg_job_growth_2004_2013). To do that, we would run:

mod1 <- lm(kfr_pooled_p25~ann_avg_job_growth_2004_2013, data = atlas_milwaukee)

Multivariate linear regression

You might want to understand the relationship between yvar and variable xvar1 while holding fixed another variable xvar2 for neighborhoods only in Milwaukee. You can do this:

mod2 <- lm(yvar~xvar1+xvar2 + xvar3, data = atlas_milwaukee)

How to read the regression output?

To simplify the interpretation, let’s run a regression where you use the standardize both the measure of mobility and the annual job growth for the data in Wisconsin:

mod3 <- lm(kfr_pooled_p25_std ~ ann_avg_job_growth_2004_2013_std, data = atlas_wisconsin)
summary(mod3)

## 
## Call:
## lm(formula = kfr_pooled_p25_std ~ ann_avg_job_growth_2004_2013_std, 
##     data = atlas_wisconsin)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.829 -0.569  0.035  0.684  5.128 
## 
## Coefficients:
##                                     Estimate  Std. Error t value Pr(>|t|)  
## (Intercept)                      -0.00000978  0.02681564    0.00    1.000  
## ann_avg_job_growth_2004_2013_std -0.05131843  0.02679889   -1.91    0.056 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.999 on 1386 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.00264,    Adjusted R-squared:  0.00192 
## F-statistic: 3.67 on 1 and 1386 DF,  p-value: 0.0557

You should focus here on interpreting the coefficient for ann_avg_job_growth_2004_2013_std. You should pay attention to both the magnitude of the number, and the sign. In this case, you would read it like this:

In Wisconsin, increasing one standard deviation the average annual job growth rate is correlated with an reduction of 0.05 standard deviations in the economic mobility for children with parents in the 25th percertile.

Confidence Intervals

To see the confidence intervals for the coefficients, you can run the function confint on the saved linear regression model:

confint(mod3)

##                                    2.5 %  97.5 %
## (Intercept)                      -0.0526 0.05259
## ann_avg_job_growth_2004_2013_std -0.1039 0.00125

It will give you the range in which the estimated would fall 95 our of 100 times if you would run the same exercise with a different sample of the population.

Plotting the linear relationship

You need to load the ggplot2 package (which should be installed already).

library(ggplot2)

Suppose you want to visually see the linear relationship between two variables in the atlas_milwaukee dataset that we filtered above. Then, you can do this:

ggplot(data = atlas_milwaukee) + geom_point(aes(x = xvar1, y = yvar)) + 
  geom_smooth(aes(x = xvar1, y = yvar), method = "lm", se = F)

The function geom_smooth() adds the line that you calculated above for mod1, where you ran mod1 <- lm(yvar~xvar1, data = atlas_milwaukee).

For more, see Section 12.