In this lab, we will be analyzing a newly available data set that links 1940 census records to Social Security Administration death records of men who died over age 65 between 1975 and 2005.

This data, called CenSoc, is useful for studying socio-economic differentials in old-age mortality because it includes a large number of interesting census variables such as race, education, income, and place of birth.

We will use the CenSoc data to study the relationship between education and age at death. Our questions are:

1. Is education associated with longer life?
1. Does every additional year of education help the same amount, or do there appear to be important thresholds, e.g., high school graduation? What might this tell us about the nature of the connections between education and longevity?
1. Does education provide the same longevity advantages to all groups? (Here we will compare Blacks and Whites.)
1. Does education matter because of its effect on income, or is it something else? (We will do this by comparing people of similar incomes but different education.)

How could education be linked to longevity?

• One possibility is that what people learn in school helps them to make better health decisions or, in the language of the Grossman model, to be “more efficient producers” of health.

• A second possibility is that education increases income, and that income can “buy” better health.

• A third possibility is that education is proxying for some other factor, say family wealth or cognitive ability, which “causes” both higher educational attainment and longer life.

• A fourth possibility is reverse-causality with health status influencing educational attainment. Those with health difficulties may drop out of school earlier.

The multiplicity of these and other pathways means that our study of the links between education and longevity is not a causal analysis. We are not doing an experiment, giving some people more and others less education. Instead we are observing the longevity of those who happen to have more schooling. Our hope is that the patterns of association we observe between education and longevity will provide us with insights on what the possible causal mechanisms may be.

(Technical note: One approach that has been successfully used for causal inference is to take advantage of mandatory schooling laws which changed the level of education in some states but not others, depending on the timing of these laws. See Lleras-Muney 2005 referenced in the reading.)

The structure of this lab is to walk through an analysis of education and longevity for White men. Your mission in the “graded questions” will then be to repeat the analysis for Black men, and to compare and interpret the results.

# Do not edit this chunk, but *do* press the green button to load the answer key for the quiz info (the unreadable string below)
tot = 0
library(quizify)
source.coded.txt(answer.key)

## Preparing the data

library(data.table) ## for fast processing of big data sets
library(arm) ## for nice display of regression results
source("/data175/labels_and_variables_for_censoc.R")
## shrink down dataset to only what we need to create a bit more room on server
print(dim(dt))
[1] 7564451      33
dt <- dt[dyear >= 1975 & dyear <= 2005 & age.at.death >= 65,
.(age.at.death, byear, educyrs, educ, race,
ownershp, incwage, incnonwg)]
print(dim(dt))
[1] 5612040       8

This means we have about 5.6 million linked census-death records, with 8 variables.

Let’s look at the distribution of deaths for those born in 1910:

age.at.death <- dt$age.at.death
byear <- dt$byear
s.1910 = byear == 1910
hist(age.at.death[s.1910])

Q1.0 What is the most common age at death for the cohort born in 1910?

##  "Replace the NA with your answer (e.g., 'A' in quotes)"
answer1.0 = 'B' ## important to note: this is the "modal age at death, conditional on survival to 65"
quiz.check(answer1.0)
Your  answer1.0 : B
Correct.
Explanation:  This is called modal age at death, conditional on survival to age 65.
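The modal age can also be read off a table of deaths by single year of age rather than from the histogram bars. The sketch below is only an illustration: `sim.age` is simulated (it is not CenSoc data), and it assumes ages are recorded in whole years. With the lab data, the corresponding input would be `age.at.death[s.1910]`.

```r
## Simulated ages at death for a single cohort (NOT CenSoc data):
## whole-number ages restricted to the 65-95 observation window.
set.seed(1)
sim.age <- round(rnorm(10000, mean = 80, sd = 8))
sim.age <- sim.age[sim.age >= 65 & sim.age <= 95]

## Count deaths at each age and pick the age with the largest count
age.counts <- table(sim.age)
modal.age <- as.numeric(names(age.counts)[which.max(age.counts)])
print(modal.age)
```

Because `hist()` may group several ages into one bar, the table gives the mode at single-year resolution, while the histogram gives it only to the width of a bar.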

Let’s compare the distributions of death for different educational levels.

First let’s look at the distribution of education itself:

educyrs = dt$educyrs
barplot(table(educyrs[s.1910]))

Here we can see that the big groupings of educational attainment are completion of pre-High School (8 years), completion of High School (12 years), and completion of college (16 years).

Now we can look at the relationship between education and longevity:

educyrs = dt$educyrs
s.high.educ <- educyrs >= 12
s.low.educ <- educyrs < 12
high.educ = age.at.death[s.1910 & s.high.educ]
low.educ= age.at.death[s.1910 & s.low.educ]
## We can view the histograms
par(mfrow = c(2,2))
hist(low.educ, probability = TRUE, col = rgb(0,0,1, 1/4),
main = "Probability histogram of deaths, low education group", cex.main = .6)
hist(high.educ, probability = TRUE,  col = rgb(1,0,0,1/4),
main = "Probability histogram of deaths, high education group", cex.main = .6)
hist(low.educ, probability = TRUE, col = rgb(0,0,1, 1/4),
main = "Probability histogram of deaths, by education group", cex.main = .6)
hist(high.educ, probability = TRUE, add = TRUE, col = rgb(1,0,0,1/4))
## or we can use the fancier "kernel density estimator"
## which is a bit easier to read but not quite as clear about
## actual limits of the data (e.g., it smooths so much it extends age limits)
par(mfrow = c(1,1))

d1 <- density(high.educ, na.rm = T, adjust = 2)
d2 <- density(low.educ, na.rm = T, adjust = 2)
plot(d1, lwd = 3, col = "red",
main = "Smoothed densities of age at death \n for high and low education groups, birth cohort of 1910", cex.main = .8)
lines(d2, lwd = 3, col = "blue")

We can clearly see that the more educated are living longer.
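The code comments above warn that the kernel density estimator smooths past the actual age limits of the data. A minimal sketch of that behavior, using simulated ages (not CenSoc data) and the same `adjust = 2` setting, is:

```r
## Simulated ages at death within the 65-95 window (NOT CenSoc data)
set.seed(2)
sim.age <- runif(5000, min = 65, max = 95)

## Same smoothing setting as in the lab code
d <- density(sim.age, adjust = 2)

## The grid on which the density is evaluated reaches past the data
print(range(sim.age))
print(range(d$x))

## Restricting the evaluation grid with from/to keeps the curve inside the window
d.window <- density(sim.age, adjust = 2, from = 65, to = 95)
print(range(d.window$x))
```

Note that `from` and `to` only limit where the curve is drawn. They do not change how the smoothing treats observations near the edges, so the histogram remains the clearer guide to the true limits of the data.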

Q1.1 For the cohort born in 1910, the CenSoc records contain deaths aged 65 to 95 (from 1975 to 2005). So when we look at average age of death for this cohort, we are:

A. Ignoring deaths before age 65

B. Ignoring deaths after age 95

C. Getting something close to average age of death for those who survive to age 65, since few die after age 95.

D. All of the above.

##  "Replace the NA with your answer (e.g., 'A' in quotes)"
quiz.check(answer1.1)
Your  answer1.1 : D
Correct.
Explanation:  All of the above

Q1.2 For the cohort born in 1915, the average age at death in the CenSoc data will be younger than for the cohort born in 1910.

A. True, because we will only observe deaths up to age 90, instead of up to 95 (for the 1910 cohort).

B. False, because people born in 1915 will tend to live longer because of mortality improvement.

##  "Replace the NA with your answer (e.g., 'A' in quotes)"
quiz.check(answer1.2)
Your  answer1.2 : A
Correct.
Explanation:  For more recent cohorts, the average age at death in the CenSoc data requires a more careful interpretation, since there will be more people dying after the truncated age.
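The truncation argument can be checked with a small simulation. In the sketch below both cohorts are drawn from the same lifespan distribution, so any difference in observed mean age at death comes only from the 1975-2005 observation window. The lifespans are simulated and continuous, which is a simplification, and `observed.mean()` is a hypothetical helper, not part of the lab.

```r
## Simulated lifespans (NOT CenSoc data). Both cohorts are drawn from the
## SAME distribution, so any difference below comes from the window alone.
set.seed(3)
lifespan <- rnorm(200000, mean = 78, sd = 10)

## Deaths are observed only at ages 65+ and in calendar years 1975-2005
observed.mean <- function(byear, lifespan) {
    dyear <- byear + lifespan          # approximate year of death
    keep <- lifespan >= 65 & dyear >= 1975 & dyear <= 2005
    mean(lifespan[keep])
}

mean.1910 <- observed.mean(1910, lifespan)   # window: ages 65 to 95
mean.1915 <- observed.mean(1915, lifespan)   # window: ages 65 to 90
print(c(cohort.1910 = mean.1910, cohort.1915 = mean.1915))
```

The 1915 cohort shows the lower observed mean even though, by construction, it does not die any younger.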

## The Effect of Schooling on Longevity

The method we will use to estimate the effects of schooling on longevity is to use R’s lm() function to calculate the mean age at death according to various characteristics.

(Technical note: because each cohort is not seen at all ages, it is important to estimate the effects of covariates within each cohort. We do this by including the term ‘+ as.factor(byear)’ in the regression. This method of calculating averages within cohorts is pretty good, but not perfect. Better methods explicitly use the fact that each cohort is truncated at different, known ages. These methods are more advanced, and we will not cover them here.)
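To see what the cohort term does, it may help to fit a regression with only cohort indicators on a tiny made-up data set. The sketch below uses toy numbers (not CenSoc data) and shows that `lm()` then reproduces the cohort means: the intercept is the mean of the reference cohort, and each coefficient is that cohort's mean minus the reference mean.

```r
## Toy data (NOT CenSoc): ages at death for three birth cohorts
toy <- data.frame(
    byear = rep(c(1905, 1906, 1907), each = 4),
    age.at.death = c(82, 78, 90, 86,   # 1905
                     80, 76, 88, 84,   # 1906
                     79, 75, 87, 83))  # 1907

## Cohort means computed directly
cohort.means <- tapply(toy$age.at.death, toy$byear, mean)

## The same information from a regression on cohort indicators
fit <- lm(age.at.death ~ as.factor(byear), data = toy)
print(cohort.means)
print(coef(fit))
```

Once each cohort has its own level in this way, a covariate such as `educyrs` added to the formula is estimated from comparisons among men born in the same year.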

In the worked portion of the lab, we will estimate only for “white” men. For the graded portion, you will re-estimate for “black” men.

1. We begin by treating every year of additional schooling the same, predicting age at death by years of education.
## fitting model
model.1.white <- lm(age.at.death ~ educyrs + as.factor(byear),
data = dt,
subset = race == "white" & byear %in% 1905:1915)

Looking at the results:

display(model.1.white)
lm(formula = age.at.death ~ educyrs + as.factor(byear), data = dt,
subset = race == "white" & byear %in% 1905:1915)
coef.est coef.se
(Intercept)          79.41     0.03
educyrs               0.18     0.00
as.factor(byear)1906 -0.52     0.03
as.factor(byear)1907 -0.95     0.03
as.factor(byear)1908 -1.43     0.03
as.factor(byear)1909 -1.90     0.03
as.factor(byear)1910 -2.27     0.03
as.factor(byear)1911 -2.39     0.03
as.factor(byear)1912 -2.53     0.03
as.factor(byear)1913 -2.78     0.03
as.factor(byear)1914 -3.01     0.03
as.factor(byear)1915 -3.31     0.03
---
n = 1747795, k = 12
residual sd = 7.52, R-Squared = 0.02

The regression output can be interpreted as a prediction of how long – on average – each person will live, depending on their characteristics. Everyone gets the “Intercept”, plus their years of education times the “educyrs” coefficient, plus the coefficient of the year they were born in. (The year of birth not shown, 1905, is the “reference category” and has a coefficient of zero.)

Q1.3 A white man in the CenSoc sample born in 1910 with a High School diploma (12 years of education) had an average age of death of

A. 79.41 years

B. 79.41 + (12 * 0.18) + (-2.27) years

C. 79.41 + (12 * 0.18) years

##  "Replace the NA with your answer (e.g., 'A' in quotes)"
quiz.check(answer1.3)
Your  answer1.3 : B
Correct.
Explanation:  =79.3 years
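The arithmetic behind this answer can be written out directly. The values below are copied from the `display(model.1.white)` output above; the variable names are introduced here only for illustration.

```r
## Coefficients copied from the display(model.1.white) output above
intercept   <- 79.41
b.educyrs   <- 0.18
b.byear1910 <- -2.27

## Predicted mean age at death: born 1910, 12 years of education
pred <- intercept + 12 * b.educyrs + b.byear1910
print(round(pred, 1))
```

Because the displayed coefficients are rounded to two decimals, a prediction from the fitted model object itself (for example with `predict()` and a `newdata` data frame) could differ slightly in the second decimal.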

Our interpretation of the “educyrs” coefficient here is that each additional year of education is associated with a delay in age of death of 0.18 years.

Q2.1 Which of the following is FALSE?

A. If we had forced a person born in these cohorts to attend another year of schooling, they would have lived – on average – 0.18 years more.

B. People who attended an additional year of schooling lived – on average – about 0.18 years more.

C. The effect of an additional year of schooling might depend on which year we are talking about. This regression averages the effect over all levels of schooling.

D. This analysis doesn’t consider graduation as anything special.

##  "Replace the NA with your answer (e.g., 'A' in quotes)"
quiz.check(answer2.1)
Your  answer2.1 : A
Correct.
Explanation:  We are not seeing the causal effect of forcing a person to do another year of school. Rather we are seeing the 'association' of an additional year of schooling with increases in longevity. See the section on 'how education could be linked to longevity' above for various possible mechanisms behind the association.

## A more detailed look at how different levels of education might influence longevity.

## fitting model
model.2.white <- lm(age.at.death ~ as.factor(educ) + as.factor(byear),
data = dt,
subset = race == "white" & byear %in% 1905:1915)
## show the results of the model in a table
display(model.2.white)
lm(formula = age.at.death ~ as.factor(educ) + as.factor(byear),
data = dt, subset = race == "white" & byear %in% 1905:1915)
coef.est coef.se
(Intercept)          80.61     0.08
as.factor(educ)1     -0.16     0.17
as.factor(educ)2     -0.10     0.12
as.factor(educ)3     -0.28     0.11
as.factor(educ)4     -0.21     0.09
as.factor(educ)5     -0.21     0.09
as.factor(educ)6     -0.19     0.09
as.factor(educ)7     -0.07     0.08
as.factor(educ)8      0.27     0.08
as.factor(educ)9      0.21     0.08
as.factor(educ)10     0.28     0.08
as.factor(educ)11     0.32     0.08
as.factor(educ)12     0.94     0.08
as.factor(educ)13     1.29     0.09
as.factor(educ)14     1.18     0.09
as.factor(educ)15     1.21     0.09
as.factor(educ)16     1.93     0.08
as.factor(educ)17     2.19     0.09
as.factor(byear)1906 -0.52     0.03
as.factor(byear)1907 -0.94     0.03
as.factor(byear)1908 -1.42     0.03
as.factor(byear)1909 -1.89     0.03
as.factor(byear)1910 -2.25     0.03
as.factor(byear)1911 -2.37     0.03
as.factor(byear)1912 -2.51     0.03
as.factor(byear)1913 -2.76     0.03
as.factor(byear)1914 -2.98     0.03
as.factor(byear)1915 -3.28     0.03
---
n = 1747795, k = 28
residual sd = 7.52, R-Squared = 0.02
## Make a nice graph of the education coefficients
## extract the coefficient and standard error information
educ.effects <- c(0, coef(model.2.white)[2:18])
educ.se <- c(0, summary(model.2.white)$coefficients[2:18,"Std. Error"])
educ.yrs <- 0:17 # for the x-axis
## make plot
par(mar = c(5,4,4,2) + .1, mfrow = c(1,1))
plot(educ.yrs, educ.effects,
ylab = "additional years of life",
xlab = "years of education",
ylim = c(-1, 3),
axes = F,
type = "p")
title("Regression Estimates of Effects of Education on Longevity \n (reference category is 0 years of education)")
axis(1, educ.yrs, educ.yrs)
axis(2)
abline(v = c(12, 16), col = "grey", lwd = 2, lty = 2)
abline(h = seq(-1,3,1), col = "grey")
abline(h = seq(-1,3,.5), col = "grey", lty = 3)
segments(x0 = 0:17, x1 = 0:17,
y0 = educ.effects - 2*educ.se,
y1 = educ.effects + 2*educ.se)
segments(x0 = 0:17, x1 = 0:17,
y0 = educ.effects - 1*educ.se,
y1 = educ.effects + 1*educ.se,
lwd = 2)

Each point tells how many additional years a person with that education level lives on average compared to a person with zero years of education. So for example, someone with 8 years of education would live about 0.3 years longer than someone with no education, and someone with 12 years of education would live about 1 year longer than someone with no education. The marginal effect of a year of education is the difference between two adjacent coefficients.

(The lines going through the points are uncertainty bounds. So we should not pay too much attention to say the difference between 13 and 14 years of schooling since their uncertainty bounds largely overlap. But we should pay attention to the difference between 11 and 12 years, where there is no overlap of the uncertainty bounds.)

We see that each year of schooling does not seem to have the same incremental effect. Attending high school appears to only have a small effect if you don’t graduate, but a big effect if you do. Similarly, attending college helps only slightly (as can be seen by comparing 1, 2, and 3 years of college with just high school), but graduating helps a lot.
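The marginal effects and the overlap of the uncertainty bounds described above can be computed from the printed coefficients. The sketch below copies the rounded values from the `display(model.2.white)` output, so the results are only as precise as that table; `bounds.overlap()` is a hypothetical helper, not part of the lab.

```r
## Education coefficients and standard errors copied from the
## display(model.2.white) output above (0 years is the reference category)
educ.effects <- c(0, -0.16, -0.10, -0.28, -0.21, -0.21, -0.19, -0.07,
                  0.27, 0.21, 0.28, 0.32, 0.94, 1.29, 1.18, 1.21, 1.93, 2.19)
educ.se <- c(0, 0.17, 0.12, 0.11, 0.09, 0.09, 0.09, 0.08,
             0.08, 0.08, 0.08, 0.08, 0.08, 0.09, 0.09, 0.09, 0.08, 0.09)
names(educ.effects) <- names(educ.se) <- 0:17

## Marginal effect of reaching year k = coefficient[k] - coefficient[k-1]
marginal <- diff(educ.effects)
print(round(marginal, 2))

## Do the +/- 2 standard error bounds of two education levels overlap?
## (hypothetical helper, not part of the lab)
bounds.overlap <- function(a, b) {
    a <- as.character(a); b <- as.character(b)
    lo <- educ.effects - 2 * educ.se
    hi <- educ.effects + 2 * educ.se
    unname(lo[a] <= hi[b] & lo[b] <= hi[a])
}
print(bounds.overlap(13, 14))
print(bounds.overlap(11, 12))
```

With these rounded inputs the two largest single-year gains fall at 12 and 16 years of education, the years in which diplomas are awarded.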
Such jumps in the effects of education at schooling levels when diplomas are awarded are known as “sheepskin” effects in the economic literature. (The term comes from the historical tradition of using parchment from the skin of sheep as the material for graduation diplomas.) There are at least two main interpretations of sheepskin effects:

1. That they are evidence that we don’t learn much in school but rather use school as a way to signal our innate capacities. This is based on the idea that if we were learning each year we were in school, we would see steady increases with each additional year of schooling, without a big jump in the year the diploma is awarded.
1. Alternatively, one can make the opposite interpretation – that people learn a lot in school. According to this interpretation, people graduate because of how much they are learning in school, and those who don’t graduate are those who aren’t learning much.

Whatever the interpretation about the content of education, it is clear that there is a big difference in longevity between those who graduate and those who simply attend.

When we consider the pattern for Black men, we should keep in mind that it was more common for Blacks to drop out of school for non-educational reasons (e.g., the death of a family member or the need to earn money even as a child). And so we might not expect graduation itself to have such a large effect.

Q2.2 What is most striking about the results of each year of schooling on longevity?

A. That more schooling is associated with longer life

B. That there is a big difference between those who finish a category of schooling and those who drop out

C. That there appears to be no effect at all of staying additional years in school, if you don’t graduate.

D. That people with 1 year of schooling don’t live as long as people with no schooling.

##  "Replace the NA with your answer (e.g., 'A' in quotes)"
answer2.2 = 'C'
quiz.check(answer2.2)
Your  answer2.2 : C
Correct.
Explanation:  All answers are true, but C is the most interesting because it suggests that sitting in the classroom for an additional year might have close to zero effect on health and longevity.

## Does education operate through higher income?

To get deeper into how education might influence health, we look to see if adding controls for wage income, non-wage income, and home ownership reduces the observed effect of education on longevity. If it does, then this suggests that education is operating through these pathways. If not, it suggests that education itself – or at least other factors associated with education which we failed to control for – is responsible for the increases in longevity.

model.3.white <- lm(age.at.death ~ educyrs +
ownershp +
log(incwage) +
incnonwg +
as.factor(byear),
data = dt,
subset = race == "white" &
byear %in% 1905:1915 &
incwage > 0)
## show the average effect of a year of education
## with controls
print("results with controls")
[1] "results with controls"
display(model.3.white)
lm(formula = age.at.death ~ educyrs + ownershp + log(incwage) +
incnonwg + as.factor(byear), data = dt, subset = race == "white" &
byear %in% 1905:1915 & incwage > 0)
coef.est coef.se
(Intercept)          77.84     0.06
educyrs               0.17     0.00
ownershprent         -0.32     0.01
log(incwage)          0.25     0.01
incnonwgyes           0.28     0.02
as.factor(byear)1906 -0.50     0.03
as.factor(byear)1907 -0.90     0.03
as.factor(byear)1908 -1.38     0.03
as.factor(byear)1909 -1.82     0.03
as.factor(byear)1910 -2.19     0.03
as.factor(byear)1911 -2.29     0.03
as.factor(byear)1912 -2.40     0.03
as.factor(byear)1913 -2.63     0.03
as.factor(byear)1914 -2.83     0.03
as.factor(byear)1915 -3.13     0.03
---
n = 1316489, k = 15
residual sd = 7.51, R-Squared = 0.02
## and our old results without controls
print("results without controls")
[1] "results without controls"
display(model.1.white)
lm(formula = age.at.death ~ educyrs + as.factor(byear), data = dt,
subset = race == "white" & byear %in% 1905:1915)
coef.est coef.se
(Intercept)          79.41     0.03
educyrs               0.18     0.00
as.factor(byear)1906 -0.52     0.03
as.factor(byear)1907 -0.95     0.03
as.factor(byear)1908 -1.43     0.03
as.factor(byear)1909 -1.90     0.03
as.factor(byear)1910 -2.27     0.03
as.factor(byear)1911 -2.39     0.03
as.factor(byear)1912 -2.53     0.03
as.factor(byear)1913 -2.78     0.03
as.factor(byear)1914 -3.01     0.03
as.factor(byear)1915 -3.31     0.03
---
n = 1747795, k = 12
residual sd = 7.52, R-Squared = 0.02

Here we see that the effect of a year of education is nearly the same (a bit less than 0.2 years more of life for every year in school) whether or not we control for income and wealth related variables. Our measures of income are not very good – just earnings in the last 12 months and whether or not someone has more than \$50 of non-wage earnings in the last 12 months. So our analysis is not conclusive. However, it does suggest that the effects of education are not driven by these simple proxies of economic situation. To do a better job we could try to get many years of income data or, using this data set, look within occupations.
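The comparison of the two models can be summarized with a little arithmetic on the printed output. The numbers below are copied from the two `display()` tables above; the variable names are introduced here only for illustration. The coefficients are rounded to two decimals, so the relative change is rough.

```r
## Values copied from the two display() outputs above
b.educ.no.controls <- 0.18   # model.1.white
b.educ.controls    <- 0.17   # model.3.white
n.no.controls <- 1747795
n.controls    <- 1316489     # only men with incwage > 0

## Proportional change in the education coefficient after adding controls
rel.change <- (b.educ.controls - b.educ.no.controls) / b.educ.no.controls
print(round(100 * rel.change, 1))

## Share of the original sample that remains in the model with controls
share.kept <- n.controls / n.no.controls
print(round(share.kept, 2))
```

The second quantity is a reminder that the model with controls is fit only to men with positive wage income, so part of any difference between the two education coefficients could reflect the change in sample rather than the controls themselves.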

We can also see if the sheepskin effects persist after controlling for income:

model.4.white <- lm(age.at.death ~ as.factor(educ) +
ownershp +
log(incwage) +
incnonwg +
as.factor(byear),
data = dt,
subset = race == "white" &
byear %in% 1905:1915 &
incwage > 0)
## graph the results
educ.effects <- c(0, coef(model.4.white)[2:18])
educ.se <- c(0, summary(model.4.white)$coefficients[2:18,"Std. Error"])
educ.yrs <- 0:17
par(mar = c(5,4,4,2) + .1, mfrow = c(1,1))
plot(educ.yrs, educ.effects,
ylab = "additional years of life",
xlab = "years of education",
main = "Effects of Education on Longevity (Whites) \n Controlling for income and housing ownership \n (reference category is 0 years of education)",
ylim = c(-1, 3),
type = "p")
axis(1, educ.yrs, educ.yrs)
abline(v = c(12, 16), col = "grey", lwd = 2, lty = 2)
abline(h = seq(0,2,1), col = "grey")
segments(x0 = 0:17, x1 = 0:17,
y0 = educ.effects - 2*educ.se,
y1 = educ.effects + 2*educ.se)
segments(x0 = 0:17, x1 = 0:17,
y0 = educ.effects - 1*educ.se,
y1 = educ.effects + 1*educ.se,
lwd = 2)

## Graded assignment

Note: as with many of the labs, we are looking for reasonable answers to the “explain” part of these questions, not marking you right or wrong based on a single “correct” answer.

In order to answer the questions about the differences between Black and White men, you will need to repeat much of the analysis we have done for Blacks rather than Whites. One way to do this is to save this lab as another file – e.g., “lab_13_black.Rmd” – and then modify this file. This approach requires that you be careful to make sure that you are really generating new results, since you will probably be using the same names for some of the variables.

1. According to your analysis, does a year of schooling increase longevity as much for Black men as White men? If there is a difference, what might explain it?

model.4.black <- lm(age.at.death ~ as.factor(educ) +
ownershp +
log(incwage) +
incnonwg +
as.factor(byear),
data = dt,
subset = race == "black" &
byear %in% 1905:1915 &
incwage > 0)
## graph the results
educ.effects <- c(0, coef(model.4.black)[2:18])
educ.se <- c(0, summary(model.4.black)$coefficients[2:18,"Std. Error"])
educ.yrs <- 0:17
par(mar = c(5,4,4,2) + .1, mfrow = c(1,1))
plot(educ.yrs, educ.effects,
ylab = "additional years of life",
xlab = "years of education",
main = "Effects of Education on Longevity (Blacks) \n Controlling for income and housing ownership \n (reference category is 0 years of education)",
ylim = c(-1, 3),
type = "p")
axis(1, educ.yrs, educ.yrs)
abline(v = c(12, 16), col = "grey", lwd = 2, lty = 2)
abline(h = seq(0,2,1), col = "grey")
segments(x0 = 0:17, x1 = 0:17,
y0 = educ.effects - 2*educ.se,
y1 = educ.effects + 2*educ.se)
segments(x0 = 0:17, x1 = 0:17,
y0 = educ.effects - 1*educ.se,
y1 = educ.effects + 1*educ.se,
lwd = 2)

2. The Grossman model considers time spent improving health ($T^H$) as itself generating productive time $T^P$. If the only benefit of education were to lengthen life, then how big would the coefficient in the regression of length of life on years of education have to be in order to make another year of education worthwhile? Is the estimated effect we find large enough to meet this standard? If not, what other benefits of education might explain the decision to continue school for another year? In the language of Grossman, does this involve $Z$ or $H$?

1. There were clear “sheepskin” effects for White men. Are these also present for Black men? If not, what might be an explanation for why not?
1. Chapter 4 of Bhattacharya, Hyde and Tu offers a set of possible pathways for education on longevity. Given the results of your data analysis, what evidence do you find that supports or negates:
1. the idea that it is what we learn in school that helps us be more efficient producers of health (p. 58)
2. the idea that schooling increases income which allows us to buy more inputs to health (p. 62), and
3. the idea that low status itself is bad for health (p. 63-4).

Address these pathways in a short paragraph each, for a total of 3 paragraphs, giving examples from your results. (You can also, optionally, provide any caution that should be taken in your interpretation.)

Congratulations! You have finished all of the required labs for Econ/Demog c175. Very well done!

## Additional analysis for those who just can’t get enough (non-credit).

Let’s see if there is a sheepskin effect on wages.

model.incwage.on.educ <- lm(incwage ~ as.factor(educ) + as.factor(byear),
data = dt,
subset = byear %in% 1905:1915 & incwage > 0)
coefplot(model.incwage.on.educ, vertical = F,
main = "Wage income regression estimates")
abline(v = 12, col = "grey")
abline(v = 16, col = "grey")

Do you see sheepskin effects on wage income? If so, are they as stark as those for longevity?

Wage income is only for those who are paid wages. Doctors, business owners, and many others don’t usually report their income as “wages”. We can look at the existence of non-wage income by education. Here each education coefficient is the difference, relative to the reference category, in the probability of having any non-wage income (> \$50 a year).
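Reading regression coefficients as probabilities works because a TRUE/FALSE outcome is treated as 1/0, so its mean is a proportion. The sketch below checks this on toy data (not CenSoc data, and with education as the only predictor, unlike the lab model, which also includes birth year).

```r
## Toy data (NOT CenSoc): non-wage income flag by education group
toy <- data.frame(
    educ = rep(c(8, 12, 17), each = 5),
    incnonwg = c("no", "no", "no", "no", "yes",      # 8 years:  1 of 5
                 "no", "no", "no", "yes", "yes",     # 12 years: 2 of 5
                 "no", "yes", "yes", "yes", "yes"))  # 17 years: 4 of 5

## Share with non-wage income in each education group
share.yes <- tapply(toy$incnonwg == "yes", toy$educ, mean)

## Regressing the TRUE/FALSE outcome on education indicators
fit <- lm(incnonwg == "yes" ~ as.factor(educ), data = toy)
print(share.yes)
print(coef(fit))
```

In this toy example the intercept equals the share in the reference group (8 years), and the other coefficients equal the differences in shares from that group.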

model.incnonwg.on.educ <- lm(incnonwg == "yes" ~ as.factor(educ) + as.factor(byear),
data = dt,
subset = byear %in% 1905:1915)
coefplot(model.incnonwg.on.educ,vertical = F,
main = "Non-wage income regression estimates")
abline(v = 12, col = "grey")
abline(v = 16, col = "grey")

We can see that non-wage income is much more complicated. The only very clear pattern is that those with post-graduate education (17+ years) are much more likely to have non-wage income. Part of the story is that many of them are doctors, lawyers, and others who don’t receive their income as wages but rather have their own practices.