Introduction


The purpose of this empirical project is to investigate the real wage gap, in dollars, between two population groups: non-immigrants and immigrants. In addition to documenting the gap, the project investigates other factors that could explain the differences between the two groups. Regression analysis is used for this purpose, guided by economic theory and research. Once these explanatory factors are established, the second purpose is to find out whether the wage gap changes with the number of years since immigrants moved to the US.



Data


The data used for this empirical project is the 2019 “Current Population Survey (CPS)” extract provided by the “Center for Economic and Policy Research”. Since the purpose of this project is to investigate the real wage gap, the dependent variable is rw (real wage) and forborn (whether the individual is foreign born) is the base independent variable. However, the forborn factor alone cannot explain the real wage gap, so other factors, e.g., marital status and education, also need to be investigated to yield a more complete explanation.

The analysis excludes persons with missing values for rw, i.e., those who are either not in the labor force or unemployed; this leaves 154,279 persons. The real wage is given as rw, but it is transformed into log(rw). The reasons are that this is customary in economic research and that it yields a distribution that is more amenable to analysis. The base independent variable is whether the individual is an immigrant, for which forborn seems to be the best suited variable.
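A minimal sketch of this preparation step, assuming the variable names in the CPS extract (dt_wage and log_rw are illustrative names; the full code is in the appendices):

# Keep only observations with a non-missing real wage and add the log-transformed wage
dt_wage = dt[!is.na(rw)]         # data.table subsetting, as in the appendix code
dt_wage[, log_rw := log(rw)]     # log real wage used as the dependent variable
nrow(dt_wage)                    # 154,279 persons in the 2019 extract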
For the analysis of how the gap evolves with time in the US, the additional independent variable is prinusyr. The prinusyr variable is used rather than the arrived variable because it has more recent data and a nicer distribution. The other independent variables are chosen based on economic theory and research; the factors investigated in the model relate to human capital and demographics: age, education, sex, marital status, and race.
I have chosen not to exclude outliers, even though OLS is sensitive to them. Since the sample is large, the sampling distributions of the estimators are assumed to be approximately normal. I also ignore that some individuals hold multiple jobs. A 95% confidence level (5% significance level) is used throughout the report.



Empirical Approach


The potential independent variables are identified by using economic knowledge and available resources, such as textbooks and research papers, to outline which factors are most likely to affect the real wage gap.


Regression function for the wage gap (task a)


\[log(y_i) = \beta_0 + \beta_{fb} \cdot fb_i + \beta_{fe} \cdot fe_i + \beta_{age1} \cdot age_i + \beta_{age2} \cdot {age_i}^2 + \beta_{educ} \cdot educ_i + \beta_{marstat} \cdot marstat_i + \beta_{wbho} \cdot wbho_i + u_i\]


\(\beta_{fb}\) is the coefficient on the dummy variable indicating whether the individual is an immigrant.
\(\beta_{fe}\) is the coefficient on the dummy variable indicating whether the individual is a female.
Age is represented by two terms:
\(\beta_{age1}\) is the coefficient that models the increasing return to age.
\(\beta_{age2}\) is the counterpart of \(\beta_{age1}\), modelling the diminishing return at older ages.
\(\beta_{educ}\) is the coefficient on the level of education.
\(\beta_{marstat}\) is the coefficient on marital status.
\(\beta_{wbho}\) is the coefficient on the race of the individual.
\(u_i\) is the error term capturing the regression residual due to omitted variables.


The hypotheses are as follows. The coefficient on forborn should be negative, in the same direction as the difference in mean real wages reported in the next section. Females are expected to have lower wages than males, so that coefficient is expected to be negative. The return to age is expected to increase at younger ages and decrease at older ages, therefore \(\beta_{age1}\) should be positive and \(\beta_{age2}\) should be negative. The coefficients on the levels of education should be positive and increasing in magnitude as the level of education rises. For individuals who are not married, the coefficients are expected to be negative. The coefficients for the other race categories should be lower than for the base category, wbhoWhite.
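A minimal sketch of how this specification can be estimated in R, using the variable names from the appendix code (model_a is an illustrative name; the appendix calls it model_final, and lm drops observations with missing rw automatically):

# Task (a): log real wage on immigrant status plus human-capital and demographic controls
model_a = lm(log(rw) ~ forborn + age + I(age^2) + female + educ + marstat + wbho, data = dt)
summary(model_a)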


Regression function for analysing the wage gap over time (task b)


\[log(y_i) = \beta_0 + \beta_{fb} \cdot fb_i + \beta_{fe} \cdot fe_i + \beta_{age1} \cdot age_i + \beta_{age2} \cdot {age_i}^2 + \beta_{educ} \cdot educ_i + \beta_{marstat} \cdot marstat_i + \beta_{prinusyr} \cdot prinusyr_i + \beta_{wbho} \cdot wbho_i + u_i\]


\(\beta_{fb}\) is the coefficient on the dummy variable indicating whether the individual is an immigrant.
\(\beta_{fe}\) is the coefficient on the dummy variable indicating whether the individual is a female.
Age is represented by two terms:
\(\beta_{age1}\) is the coefficient that models the increasing return to age.
\(\beta_{age2}\) is the counterpart of \(\beta_{age1}\), modelling the diminishing return at older ages.
\(\beta_{educ}\) is the coefficient on the level of education.
\(\beta_{marstat}\) is the coefficient on marital status.
\(\beta_{wbho}\) is the coefficient on the race of the individual.
\(\beta_{prinusyr}\) is the coefficient on the variable indicating the period in which the immigrant entered the US.
\(u_i\) is the error term.


The model is similar to the one above, but the variable prinusyr is added to give a fuller picture. The hypothesis is that the wage gap becomes smaller as the number of years in the US increases, which could be caused by effects such as language barriers and integration diminishing over time. Note that prinusyr is a categorical variable indicating the period in which the individual moved to the US: a value of 1 corresponds to moving before 1950, and higher values correspond to more recent arrivals. For natives, a prinusyr value of 0 is added.
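A minimal sketch of this step, following the appendix code: prinusyr is set to 0 for natives and entered as a factor, and the model is estimated with heteroskedasticity-robust (HC1) standard errors via fixest (model_b is an illustrative name):

# Natives get prinusyr = 0 so that they form the reference category of the factor
dt$prinusyr[dt$forborn == 0] = 0
# Task (b): same controls as task (a) plus the entry-period dummies
model_b = feols(log(rw) ~ forborn + as.factor(prinusyr) + age + I(age^2) +
                  female + educ + wbho + marstat, data = dt, vcov = "HC1")
summary(model_b)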



Results


Before the log transformation, the gap in mean real wages is $25.95 - $25.32 = $0.63, which corresponds to a reduction of approximately 2.4% for immigrants. Summary statistics for the real wage are listed in Table 1, and the distributions are shown in Figures 1 and 2.
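The relative gap is computed directly from the group means:
\[\frac{25.95 - 25.32}{25.95} = \frac{0.63}{25.95} \approx 0.024 \approx 2.4\%\]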


Table 1: Real wage of non-immigrants and immigrants
Name Mean SD Min Max Median Quantile.05 Quantile.95 N
Non-immigrants - rw 25.95 19.60 1 392.30 20.00 9.0 64.60 132180
Immigrants - rw 25.32 20.56 1 288.33 18.00 9.0 68.83 22099
Non-immigrants - log(rw) 3.05 0.63 0 5.97 3.00 2.2 4.17 132180
Immigrants - log(rw) 3.00 0.65 0 5.66 2.89 2.2 4.23 22099



Hypothesis testing of the mean real wage


Two-sided t-test


Null hypothesis:
\(H_0: \mu_{im} = \mu_{non}\)


Alternative hypothesis:
\(H_1: \mu_{im} \neq \mu_{non}\)
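The output below comes from the pooled two-sample t-test used in the appendix code:

t.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0],
       alternative = "two.sided", conf.level = 0.95, var.equal = TRUE)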


## 
##  Two Sample t-test
## 
## data:  dt$rw[dt$forborn == 1] and dt$rw[dt$forborn == 0]
## t = -4.4178, df = 154277, p-value = 9.977e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9149068 -0.3525823
## sample estimates:
## mean of x mean of y 
##  25.31543  25.94918


From the t-test, we can reject the null hypothesis that the true means for immigrants and non-immigrants are equal at the 5% significance level.



The regression analysis of immigrants and non-immigrants


Regression function for the wage gap (task a)


\[log(y_i) = \beta_0 + \beta_{fb} \cdot fb_i + \beta_{age1} \cdot age_i + \beta_{age2} \cdot {age_i}^2 + \beta_{fe} \cdot fe_i + \beta_{educ} \cdot educ_i + \beta_{marstat} \cdot marstat_i + \beta_{wbho} \cdot wbho_i + u_i\]


## 
## Call:
## lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ + 
##     marstat + wbho, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7393 -0.2985 -0.0088  0.2983  2.9723 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.755e+00  1.357e-02 129.307  < 2e-16 ***
## forborn1             -5.437e-02  4.435e-03 -12.257  < 2e-16 ***
## age                   4.643e-02  5.806e-04  79.963  < 2e-16 ***
## I(age^2)             -4.566e-04  6.286e-06 -72.630  < 2e-16 ***
## female1              -2.252e-01  2.639e-03 -85.327  < 2e-16 ***
## educHS                1.785e-01  5.889e-03  30.317  < 2e-16 ***
## educSome college      2.899e-01  5.944e-03  48.775  < 2e-16 ***
## educCollege           6.612e-01  6.090e-03 108.577  < 2e-16 ***
## educAdvanced          8.778e-01  6.528e-03 134.472  < 2e-16 ***
## marstatWidowed       -6.391e-02  9.565e-03  -6.681 2.38e-11 ***
## marstatDivorced      -5.914e-02  4.433e-03 -13.340  < 2e-16 ***
## marstatSeparated     -8.588e-02  9.847e-03  -8.721  < 2e-16 ***
## marstatNever Married -9.660e-02  3.596e-03 -26.860  < 2e-16 ***
## wbhoBlack            -1.260e-01  4.489e-03 -28.064  < 2e-16 ***
## wbhoHispanic         -5.312e-02  4.305e-03 -12.339  < 2e-16 ***
## wbhoOther             2.422e-02  5.323e-03   4.550 5.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5119 on 154263 degrees of freedom
##   (137111 observations deleted due to missingness)
## Multiple R-squared:  0.3376, Adjusted R-squared:  0.3376 
## F-statistic:  5243 on 15 and 154263 DF,  p-value: < 2.2e-16


The first regression determines which factors contribute to the wage gap between immigrants and non-immigrants. The possible factors are included in the regression model, and the results are checked against the hypotheses. Some possible factors are left out due to the limits of my knowledge and to avoid overcrowding the model; they are mentioned where they could plausibly contribute to the difference in the real wage. The result shows that 33.76% of the variation in the log real wage can be accounted for by the six mentioned factors (including forborn), with F = 5243 and p < 0.001. All the coefficients are statistically significant at the 5% significance level, with p < 0.001. The coefficients are in the directions expected from the economic theories mentioned above. The correlation plots are in the appendices.


intercept

The intercept is interpreted as the expected real wage of an individual in all the base categories: a non-immigrant, white (wbhoWhite), married, male, with less than high school education (educLTHS), at age zero (which lies outside the observed range). This corresponds to an average real wage of about $5.78. The intercept serves as the base for the dummy variables in the model. The t-statistic and p-value indicate that it is statistically significant.
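Since the dependent variable is in logs, the dollar value is obtained by exponentiating the intercept:
\[e^{\beta_0} = e^{1.755} \approx 5.78\]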


forborn

Being an immigrant is associated with a reduction in the real wage of approximately 5.3% (from the coefficient -0.0544; see the conversion below). The direction is as expected: immigrants tend to have a lower real wage than non-immigrants. The magnitude is larger than the raw gap calculated from the sample, $0.63 (about -2.4%). Since the model explains 33.76% of the variation, there are other factors affecting the real wage that are not included. The t-statistic and p-value indicate that forborn1 is statistically significant.
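Because the dependent variable is log(rw), a coefficient \(\beta\) on a dummy variable translates into an approximate percentage effect of \(e^{\beta} - 1\); this conversion is used throughout the interpretations below. For forborn1:
\[e^{-0.0544} - 1 \approx -0.053 = -5.3\%\]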


age

Age is assumed to have a positive effect on the real wage at younger ages and a negative effect at older ages. For example, aging from 25 to 26 raises the real wage by approximately 2.3%, while aging from 62 to 63 reduces it by approximately 1.1%. The t-statistic and p-value indicate that age and I(age^2) are statistically significant.
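With the quadratic specification, the change in the log wage from age \(a\) to \(a + 1\) is \(\beta_{age1} + \beta_{age2}\left((a+1)^2 - a^2\right)\). For the step from 25 to 26:
\[0.0464 - 0.000457 \cdot (26^2 - 25^2) = 0.0464 - 0.000457 \cdot 51 \approx 0.0231 \;\Rightarrow\; e^{0.0231} - 1 \approx 2.3\%\]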


female

Being a female is associated with a reduction in the real wage of approximately 20.2%. The direction is as expected, and data from the “Office of Federal Contract Compliance Programs (OFCCP)” and economic research support that gender has a causal effect on the real wage. The t-statistic and p-value indicate that the coefficient is statistically significant.


educ

The educ variable is a factor with the values educLTHS (less than high school), educHS (high school), educSome college (some college), educCollege (college), and educAdvanced (advanced).

educLTHS (less than high school): This group is the base, so there is no additional effect from educ.
educHS (high school): Finishing high school increases the real wage by approximately 19.54%.
educSome college (some college): Having some college increases the real wage by approximately 33.63%.
educCollege (college): Having a college degree increases the real wage by approximately 93.71%.
educAdvanced (advanced): Having education beyond a college degree increases the real wage by approximately 140.56%.

As suspected, the level of education has an impact on the real wage, i.e., a higher level of education tends to increase the real wage.


marstat

The marstat variable is a factor with the values marstatMarried (married), marstatWidowed (widowed), marstatDivorced (divorced), marstatSeparated (separated), and marstatNever Married (never married). The t-statistics and p-values indicate that they are statistically significant.

marstatMarried (married): This group is the base, so there is no additional effect from marstat.
marstatWidowed (widowed): Compared to married individuals, they earn approximately 6.19% less.
marstatDivorced (divorced): Compared to married individuals, they earn approximately 5.74% less.
marstatSeparated (separated): Compared to married individuals, they earn approximately 8.23% less.
marstatNever Married (never married): Compared to married individuals, they earn approximately 9.21% less.

The directions of the coefficients are as expected: married individuals earn the most. This marital wage premium could be explained by, e.g., an increase in productivity. While the model may capture part of the causal effect of marital status on the real wage, some relevant variables are left out, for example, the number of children or whether the individual is the head of the family.


wbho

The wbho variable is a categorical variable with the categories wbhoWhite (white), wbhoBlack (black), wbhoHispanic (hispanic), and wbhoOther (other races, including Asians). The t-statistics and p-values indicate that all of them are statistically significant.
wbhoWhite (white): This is the base category, so there is no additional effect from the wbho variable.
wbhoBlack (black): This group has a real wage approximately 11.84% lower, which is in the expected direction.
wbhoHispanic (hispanic): This group has a real wage approximately 5.17% lower, a smaller reduction than for wbhoBlack.
wbhoOther (other): Compared to wbhoWhite, this group has a real wage approximately 2.45% higher.

The directions of the results are similar to the data from the “Office of Federal Contract Compliance Programs (OFCCP)”, i.e., there is a racial disparity in wages: compared to white individuals, black and hispanic individuals earn less, while the other races earn more. However, because native american, multiracial, and asian individuals are grouped into wbhoOther rather than modelled separately, the model cannot be used to draw causal inferences for those groups.



Regression function for analysing the wage gap over time (task b)


\[log(y_i) = \beta_0 + \beta_{fb} \cdot fb_i + \beta_{prinusyr} \cdot prinusyr_i + \beta_{age1} \cdot age_i + \beta_{age2} \cdot {age_i}^2 + \beta_{fe} \cdot fe_i + \beta_{educ} \cdot educ_i + \beta_{marstat} \cdot marstat_i + \beta_{wbho} \cdot wbho_i + u_i\]


## NOTE: 137,111 observations removed because of NA values (LHS: 137,111).
## The variable 'as.factor(prinusyr)25' has been removed because of collinearity (see $collin.var).
## OLS estimation, Dep. Var.: log(rw)
## Observations: 154,279 
## Standard-errors: Heteroskedasticity-robust 
##                        Estimate Std. Error   t value   Pr(>|t|)    
## (Intercept)            1.760898 0.01308710 134.55220  < 2.2e-16 ***
## forborn1              -0.123556 0.01275814  -9.68447  < 2.2e-16 ***
## as.factor(prinusyr)1  -0.436079 0.39530301  -1.10315 2.6996e-01    
## as.factor(prinusyr)2   0.122903 0.06972513   1.76269 7.7955e-02 .  
## as.factor(prinusyr)3   0.205241 0.05898235   3.47971 5.0210e-04 ***
## as.factor(prinusyr)4   0.181118 0.03765276   4.81022 1.5091e-06 ***
## as.factor(prinusyr)5   0.175386 0.02827880   6.20202 5.5885e-10 ***
## as.factor(prinusyr)6   0.124138 0.02196419   5.65182 1.5903e-08 ***
## as.factor(prinusyr)7   0.116691 0.02365050   4.93398 8.0656e-07 ***
## as.factor(prinusyr)8   0.108191 0.02802677   3.86028 1.1330e-04 ***
## as.factor(prinusyr)9   0.101984 0.02291253   4.45103 8.5521e-06 ***
## as.factor(prinusyr)10  0.121007 0.02459377   4.92023 8.6530e-07 ***
## as.factor(prinusyr)11  0.035057 0.02157768   1.62470 1.0423e-01    
## as.factor(prinusyr)12  0.104475 0.02012037   5.19249 2.0776e-07 ***
## as.factor(prinusyr)13  0.056727 0.02078213   2.72961 6.3416e-03 ** 
## as.factor(prinusyr)14  0.064537 0.01947024   3.31466 9.1774e-04 ***
## as.factor(prinusyr)15  0.072992 0.02071420   3.52377 4.2558e-04 ***
## as.factor(prinusyr)16  0.089448 0.01826536   4.89714 9.7340e-07 ***
## as.factor(prinusyr)17  0.072768 0.01690134   4.30543 1.6676e-05 ***
## as.factor(prinusyr)18  0.061826 0.01938976   3.18857 1.4301e-03 ** 
## as.factor(prinusyr)19  0.053517 0.01918289   2.78984 5.2741e-03 ** 
## as.factor(prinusyr)20  0.033724 0.01959665   1.72091 8.5270e-02 .  
## as.factor(prinusyr)21  0.055032 0.02020803   2.72325 6.4650e-03 ** 
## as.factor(prinusyr)22  0.069681 0.02096781   3.32326 8.8993e-04 ***
## as.factor(prinusyr)23  0.045798 0.02019255   2.26807 2.3326e-02 *  
## as.factor(prinusyr)24  0.042194 0.01955314   2.15791 3.0936e-02 *  
## age                    0.046300 0.00061285  75.54815  < 2.2e-16 ***
## I(age^2)              -0.000457 0.00000695 -65.78739  < 2.2e-16 ***
## female1               -0.225442 0.00264649 -85.18512  < 2.2e-16 ***
## educHS                 0.179269 0.00456312  39.28650  < 2.2e-16 ***
## educSome college       0.290143 0.00468610  61.91567  < 2.2e-16 ***
## educCollege            0.661835 0.00520991 127.03386  < 2.2e-16 ***
## educAdvanced           0.879434 0.00591695 148.62972  < 2.2e-16 ***
## wbhoBlack             -0.125143 0.00431665 -28.99070  < 2.2e-16 ***
## wbhoHispanic          -0.054870 0.00410348 -13.37148  < 2.2e-16 ***
## wbhoOther              0.025482 0.00556861   4.57598 4.7437e-06 ***
## marstatWidowed        -0.063384 0.00999178  -6.34359 2.2510e-10 ***
## marstatDivorced       -0.059469 0.00451037 -13.18505  < 2.2e-16 ***
## marstatSeparated      -0.086215 0.00928273  -9.28764  < 2.2e-16 ***
## marstatNever Married  -0.097451 0.00360065 -27.06479  < 2.2e-16 ***
## ... 1 variable was removed because of collinearity (as.factor(prinusyr)25)
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.511678   Adj. R2: 0.338051


The second regression determines whether the number of years since entry contributes to the wage gap. I use the same model but include prinusyr as a factor variable, where native-born individuals have a prinusyr value of zero, so that, holding the same factors as above constant, variation in the real wage within the immigrant group can be related to time since entry. The result shows that 33.8% of the variation in the log real wage can be accounted for by the seven mentioned factors (including prinusyr), with RMSE = 0.512. The coefficients are in the directions expected from the economic theories mentioned above. The prinusyr coefficients support the hypothesis that the wage gap narrows the longer immigrants have been in the US.


prinusyr

The wage gap becomes smaller the longer immigrants have resided in the US. As mentioned, prinusyr is a factor variable, where a higher value means a more recent entry. Because the prinusyr = 25 dummy is dropped due to collinearity with forborn, the prinusyr coefficients are interpreted relative to the most recent arrivals: for example, immigrants who entered during 2014-2017 (prinusyr = 24) earn approximately 4.31% more, and immigrants who entered during 1950-1959 (prinusyr = 2) earn approximately 13.08% more, than otherwise similar immigrants who entered most recently. Combined with the forborn1 coefficient, this means the gap relative to natives shrinks with time in the US. The t-statistics and p-values for prinusyr are statistically significant, except for prinusyr = 1 and 11; note that only 7 individuals have prinusyr = 1 (and rw not NA).
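As a rough check, the gap relative to natives for a given entry cohort combines the forborn1 coefficient with the cohort coefficient:
\[\text{prinusyr} = 25:\; -0.124, \qquad \text{prinusyr} = 24:\; -0.124 + 0.042 \approx -0.082, \qquad \text{prinusyr} = 2:\; -0.124 + 0.123 \approx -0.001\]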


Compared to the first model, the directions of the coefficients are the same, with slightly different magnitudes.


Table 2: Group by the variable prinusyr
prinusyr Mean rw Mean age Male Female LTHS HS Some college College Advanced Total
0 25.95 48.29 66465 65715 5857 36292 39468 32489 18074 132180
1 16.34 80.27 3 4 0 3 0 0 4 7
2 27.45 75.88 42 42 9 22 21 16 16 84
3 30.20 73.21 58 71 14 35 32 25 23 129
4 28.45 69.22 128 130 24 77 69 53 35 258
5 29.42 65.06 263 215 77 126 95 102 78 478
6 28.34 60.29 454 421 168 208 181 185 133 875
7 27.12 58.24 360 306 137 187 123 134 85 666
8 27.81 55.59 254 216 91 107 103 92 77 470
9 26.70 54.84 342 280 125 172 103 128 94 622
10 27.32 54.02 326 265 128 146 106 141 70 591
11 24.25 52.47 501 364 192 256 155 152 110 865
12 26.87 50.78 553 542 200 328 179 235 153 1095
13 24.88 49.26 446 425 185 229 159 170 128 871
14 24.74 48.84 569 479 208 309 186 197 148 1048
15 25.82 47.56 574 432 172 284 188 211 151 1006
16 25.78 45.59 810 624 299 394 244 272 225 1434
17 24.02 43.42 1005 833 425 533 294 326 260 1838
18 23.91 41.45 587 569 247 325 194 220 170 1156
19 23.28 40.80 672 622 304 356 201 240 193 1294
20 23.66 39.88 572 520 203 292 191 230 176 1092
21 25.15 41.06 601 452 182 282 172 212 205 1053
22 26.19 39.11 510 472 122 260 164 219 217 982
23 25.00 38.70 571 459 129 299 152 238 212 1030
24 25.09 36.58 718 571 167 336 173 309 304 1289
25 23.60 35.74 1051 815 254 464 274 451 423 1866






Summary and conclusion


The main objectives were to explain the real wage gap between immigrants and non-immigrants and to determine whether the gap changes with time since moving to the US. The results from the models point to several possible factors. The real wage increases with age up to a certain point and decreases thereafter. There are gender and race disparities in the sample, i.e., gender and race do affect the real wage. Individuals with a higher level of education earn more than individuals with a lower level of education. The model also shows the existence of a marital status premium in the sample.
The wage gap between immigrants and non-immigrants also shrinks as the number of years since the individual moved to the US increases.



References




Appendices


Table 3: Models for rw and log(rw)
 (1)   (2)
(Intercept) 25.9*** 3.0***
[25.8, 26.1] [3.0, 3.1]
s.e. = 0.1 s.e. = 0.0
t = 478.0 t = 1763.6
p = <0.1 p = <0.1
forborn1 -0.6** 0.0**
[-0.9, -0.4] [-0.1, 0.0]
s.e. = 0.1 s.e. = 0.0
t = -4.4 t = -10.6
p = <0.1 p = <0.1
Num.Obs. 154279 154279
R2 0.000 0.001
R2 Adj. 0.000 0.001
AIC 1358130.1 1233607.1
BIC 1358159.9 1233636.9
Log.Lik. -679062.029 -147325.022
RMSE 19.74 0.63
. p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0
Table 4: Models for age
 (1)   (2)
(Intercept) 2.7*** 1.5***
[2.6, 2.7] [1.4, 1.5]
s.e. = 0.0 s.e. = 0.0
t = 546.4 t = 117.7
p = <0.1 p = <0.1
forborn1 -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1]
s.e. = 0.0 s.e. = 0.0
t = -13.7 t = -23.7
p = <0.1 p = <0.1
age 0.0*** 0.1***
[0.0, 0.0] [0.1, 0.1]
s.e. = 0.0 s.e. = 0.0
t = 86.8 t = 117.4
p = <0.1 p = <0.1
I(age^2) 0.0***
[0.0, 0.0]
s.e. = 0.0
t = -103.4
p = <0.1
Num.Obs. 154279 154279
R2 0.047 0.109
R2 Adj. 0.047 0.109
AIC 1226253.9 1215912.2
BIC 1226293.7 1215961.9
Log.Lik. -143647.427 -138475.584
RMSE 0.61 0.59
. p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0
Table 5: Models for the final model
 (1)   (2)   (3)   (4)
(Intercept) 1.6*** 1.5*** 1.8*** 1.8***
[1.5, 1.6] [1.5, 1.6] [1.7, 1.8] [1.7, 1.8]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 125.7 t = 134.2 t = 129.8 t = 129.3
p = <0.1 p = <0.1 p = <0.1 p = <0.1
forborn1 -0.1** -0.1** -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1] [-0.1, -0.1] [-0.1, 0.0]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = -25.8 t = -15.3 t = -17.1 t = -12.3
p = <0.1 p = <0.1 p = <0.1 p = <0.1
age 0.1*** 0.1*** 0.0*** 0.0***
[0.1, 0.1] [0.1, 0.1] [0.0, 0.0] [0.0, 0.0]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 119.3 t = 97.6 t = 78.0 t = 80.0
p = <0.1 p = <0.1 p = <0.1 p = <0.1
I(age^2) 0.0*** 0.0*** 0.0*** 0.0***
[0.0, 0.0] [0.0, 0.0] [0.0, 0.0] [0.0, 0.0]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = -105.1 t = -84.7 t = -70.7 t = -72.6
p = <0.1 p = <0.1 p = <0.1 p = <0.1
female1 -0.2*** -0.2*** -0.2*** -0.2***
[-0.2, -0.2] [-0.2, -0.2] [-0.2, -0.2] [-0.2, -0.2]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = -64.3 t = -88.2 t = -86.5 t = -85.3
p = <0.1 p = <0.1 p = <0.1 p = <0.1
educHS 0.2** 0.2** 0.2**
[0.2, 0.2] [0.2, 0.2] [0.2, 0.2]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 31.6 t = 31.6 t = 30.3
p = <0.1 p = <0.1 p = <0.1
educSome college 0.3*** 0.3*** 0.3***
[0.3, 0.3] [0.3, 0.3] [0.3, 0.3]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 51.0 t = 50.6 t = 48.8
p = <0.1 p = <0.1 p = <0.1
educCollege 0.7*** 0.7*** 0.7***
[0.7, 0.7] [0.7, 0.7] [0.6, 0.7]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 114.1 t = 113.0 t = 108.6
p = <0.1 p = <0.1 p = <0.1
educAdvanced 0.9*** 0.9*** 0.9***
[0.9, 0.9] [0.9, 0.9] [0.9, 0.9]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 141.2 t = 139.5 t = 134.5
p = <0.1 p = <0.1 p = <0.1
marstatWidowed -0.1** -0.1**
[-0.1, -0.1] [-0.1, 0.0]
s.e. = 0.0 s.e. = 0.0
t = -7.2 t = -6.7
p = <0.1 p = <0.1
marstatDivorced -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1]
s.e. = 0.0 s.e. = 0.0
t = -14.3 t = -13.3
p = <0.1 p = <0.1
marstatSeparated -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1]
s.e. = 0.0 s.e. = 0.0
t = -10.5 t = -8.7
p = <0.1 p = <0.1
marstatNever Married -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1]
s.e. = 0.0 s.e. = 0.0
t = -31.0 t = -26.9
p = <0.1 p = <0.1
wbhoBlack -0.1**
[-0.1, -0.1]
s.e. = 0.0
t = -28.1
p = <0.1
wbhoHispanic -0.1**
[-0.1, 0.0]
s.e. = 0.0
t = -12.3
p = <0.1
wbhoOther 0.0**
[0.0, 0.0]
s.e. = 0.0
t = 4.6
p = <0.1
Num.Obs. 154279 154279 154279 154279
R2 0.132 0.329 0.334 0.338
R2 Adj. 0.132 0.329 0.333 0.338
AIC 1211830.1 1172231.8 1171136.5 1170190.9
BIC 1211889.7 1172331.2 1171275.7 1170360.0
Log.Lik. -136433.521 -116630.376 -116078.719 -115602.942
RMSE 0.59 0.52 0.51 0.51
. p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0
## Analysis of Variance Table
## 
## Response: log(rw)
##               Df Sum Sq Mean Sq  F value    Pr(>F)    
## forborn        1     44    44.4   169.24 < 2.2e-16 ***
## age            1   2840  2839.5 10835.09 < 2.2e-16 ***
## I(age^2)       1   3771  3771.0 14389.41 < 2.2e-16 ***
## female         1   1421  1420.7  5421.15 < 2.2e-16 ***
## educ           4  11991  2997.7 11438.82 < 2.2e-16 ***
## marstat        4    292    73.0   278.50 < 2.2e-16 ***
## wbho           3    250    83.4   318.13 < 2.2e-16 ***
## Residuals 154263  40427     0.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Import libraries
library(foreign)
library(dplyr)
library(ggplot2)
library(car)
library(lmtest)
library(sandwich)
library(fixest)
library(tibble)
library(tidyverse)
library(data.table)
library(modelsummary)
library(kableExtra)

print("Loaded library")
# Read the data as data table
dt = read.dta("cepr_org_2019.dta")
dt = as.data.table(dt)
print("Loaded data")
# Functions
# Summary statistics for a numeric vector: mean, sd, min, max, median, 5th/95th percentiles, N
dstat = function(x, ...) {
  c(mean = mean(x, ...),
    sd = sd(x, ...),
    min = min(x, ...),
    max = max(x, ...),
    median = median(x, ...),
    quantile_05 = quantile(x, 0.05, ...),
    quantile_95 = quantile(x, 0.95, ...),
    N = sum(!is.na(x)))
}

# Coefficient table with HC1 heteroskedasticity-robust standard errors
coeftest.hc1 = function (x, ...) {
  coeftest(x, vcovHC(x, type = "HC1"), ...)[1:x$rank,]
}

# Wrapper around modelsummary: confidence interval, s.e., t and p for each coefficient
sum_models = function(x, ...) {
  modelsummary(list(x, ...),
               fmt = 1,
               statistic = c("conf.int",
                             "s.e. = {std.error}",
                             "t = {statistic}",
                             "p = {p.value}"),
               conf_level = 0.95,
               stars = c("***" = 0,
                         "**" = 0.01,
                         "*" = 0.05,
                         "." = 0.1))
}

print("Loaded functions")
# To factors
dt$forborn = factor(dt$forborn)
dt$female = factor(dt$female)
dt$married = factor(dt$married)
dt$metro = factor(dt$metro)
dt$union = factor(dt$union)
dt$cert = factor(dt$cert)

print("Convert to factors")
# Answer to question 1
sum_tb = tibble(Mean = double(),
                SD = double(),
                Min = double(),
                Max = double(),
                Median = double(),
                Quantile.05 = double(),
                Quantile.95 = double(),
                N = integer())

# rw
sum_tb[nrow(sum_tb) + 1, ] = t(dstat(dt$rw[dt$forborn==0], na.rm = TRUE))
sum_tb[nrow(sum_tb) + 1, ] = t(dstat(dt$rw[dt$forborn==1], na.rm = TRUE))

# log(rw)
sum_tb[nrow(sum_tb) + 1, ] = t(dstat(log(dt$rw[dt$forborn==0]), na.rm = TRUE))
sum_tb[nrow(sum_tb) + 1, ] = t(dstat(log(dt$rw[dt$forborn==1]), na.rm = TRUE))

# Add "Name" column
sum_tb = sum_tb %>%
  add_column(Name = c("Non-immigrants - rw", "Immigrants - rw", "Non-immigrants -log(rw)", "Immigrants - log(rw)"), .before = "Mean")

# Print table
sum_tb %>%
  kbl(caption = "Table 1: Real wage of non-immigrants and immigrants", digits = 2) %>%
  column_spec(2, color = "cornflowerblue") %>%
  kable_minimal()

# Confidence interval for the difference in mean rw (immigrants vs. non-immigrants)
# t-test ---> Reject the H0
# H0: true difference in means is equal to 0
# H1: true difference in means is not equal to 0
t.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "two.sided", conf.level = 0.95, var.equal = TRUE)
# wilcox.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "two.sided", conf.level = 0.95)
# t-test ---> Fail to reject the H0
# H0: x has a smaller mean than y
# H1: x has a larger mean than y
t.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "greater", conf.level = 0.95)
# wilcox.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "greater", conf.level = 0.95)
# t-test ---> Reject the H0
# H0: x has a larger mean than y
# H1: x has a smaller mean than y
t.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "less", conf.level = 0.95)
# wilcox.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "less", conf.level = 0.95)
# For the wage gap
model_final = lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ + marstat + wbho, data = dt)

summary(model_final)

# Set prinusyr = 0 for forborn = 0
dt$prinusyr[dt$forborn==0] = 0
model_final_year = feols(log(rw) ~ forborn + as.factor(prinusyr) + age + I(age^2) + female + educ + wbho + marstat, data = dt, vcov = "HC1")
summary(model_final_year)
dt_by_prins = dt %>% group_by(prinusyr)%>%
        summarise(mean_rw = mean(rw, na.rm = TRUE),
                  mean_age = mean(age, na.rm = TRUE),
                  n_male = sum(!is.na(rw) & female==0),
                  n_female = sum(!is.na(rw) & female==1),
                  n_educ_lths = sum(!is.na(rw) & educ=="LTHS"),
                  n_educ_hs = sum(!is.na(rw) & educ=="HS"),
                  n_educ_sc = sum(!is.na(rw) & educ=="Some college"),
                  n_educ_c = sum(!is.na(rw) & educ=="College"),
                  n_educ_adv = sum(!is.na(rw) & educ=="Advanced"),
                  n = sum(!is.na(rw)))
dt_by_prins %>%
        kbl(caption = "Table 2: Group by the variable prinusyr", digits = 2, col.names = c("prinusyr", "Mean rw", "Mean age", "Male", "Female", "LTHS", "HS", "Some college", "College", "Advanced", "Total")) %>%
        column_spec(2, color = "cornflowerblue") %>%
        kable_minimal()
ggplot(data = dt_by_prins, aes(x = prinusyr)) +
        geom_line(aes(y = mean_rw, color = "red")) +
        geom_line(aes(y = mean_age, color = "green"))

ggplot(data = tail(dt_by_prins, -1), aes(x = prinusyr)) +
        geom_line(aes(y = n_female, color = "red")) +
        geom_line(aes(y = n_male, color = "green")) +
        geom_line(aes(y = n, color = "pink"))

ggplot(data = tail(dt_by_prins, -1), aes(x = prinusyr)) +
        geom_line(aes(y = n_educ_lths, color = "red")) +
        geom_line(aes(y = n_educ_hs, color = "green")) +
        geom_line(aes(y = n_educ_sc, color = "pink")) +
        geom_line(aes(y = n_educ_c, color = "orange")) +
        geom_line(aes(y = n_educ_adv, color = "blue"))
# Distributions of rw and log(rw)
ggplot(dt, aes(x = rw, color = forborn)) +
        geom_density(aes(fill = forborn), alpha = 0.5, na.rm = TRUE) +
        labs(title = "Figure 1: Distribution of real wage of immigrants and non-immigrants",
             x = "rw", y = "Density")

ggplot(dt, aes(x = log(rw), color = forborn)) +
        geom_density(aes(fill = forborn), alpha = 0.5, na.rm = TRUE) +
        labs(title = "Figure 2: Distribution of log(real wage) of immigrants and non-immigrants",
             x = "log(rw)", y = "Density")

ggplot(dt, aes(x = female, y = log(rw), color = forborn)) +
        geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) +
        labs(title = "Figure 3: Correlation between forborn and female",
             x = "Gender",
             y = "log(rw)")

ggplot(dt, aes(x = age, y = log(rw))) +
        geom_jitter(aes(color = forborn), width = 0.1, height = 0.1, na.rm = TRUE) +
        stat_summary(fun = mean, geom = "line", aes(group = 1), na.rm = TRUE) +
        facet_wrap(~forborn) +
        labs(title = "Figure 4: Correlation between forborn and age",
             x = "Age",
             y = "log(rw)")

ggplot(dt, aes(x = educ, y = log(rw), color = forborn)) +
        geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) +
        labs(title = "Figure 5: Correlation between forborn and educ",
             x = "Education",
             y = "log(rw)")

ggplot(dt, aes(x = marstat, y = log(rw), color = forborn)) +
        geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) +
        labs(title = "Figure 6: Correlation between forborn and marstat",
             x = "Marital status",
             y = "log(rw)")

ggplot(dt, aes(x = wbho, y = log(rw), color = forborn)) +
        geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) +
        labs(title = "Figure 7: Correlation between forborn and wbho",
             x = "Race",
             y = "log(rw)")

ggplot(dt, aes(x = prinusyr, y = log(rw))) +
        geom_jitter(width = 0.1, height = 0.1, na.rm = TRUE) +
        geom_smooth(method = lm, na.rm = TRUE, formula = y ~ x) +
        labs(title = "Figure 8: Correlation between log(rw) and prinusyr",
             x = "Years since entered",
             y = "log(rw)")


model = lm(formula = rw ~ forborn, data = dt)
model_base = lm(formula = log(rw) ~ forborn, data = dt)
model_base_age = lm(formula = log(rw) ~ forborn + age, data = dt)
model_base_age_poly = lm(formula = log(rw) ~ forborn + age + I(age^2), data = dt)
model_base_female = lm(formula = log(rw) ~ forborn + female, data = dt)
model_base_educ = lm(formula = log(rw) ~ forborn + educ, data = dt)
model_base_wbho = lm(formula = log(rw) ~ forborn + wbho, data = dt)
model_base_marstat = lm(formula = log(rw) ~ forborn + marstat, data = dt)
model_base_metro = lm(formula = log(rw) ~ forborn + metro, data = dt)
model_age_female = lm(formula = log(rw) ~ forborn + age + female, data = dt)
model_age_poly_female = lm(formula = log(rw) ~ forborn + age + I(age^2) + female, data = dt)
model_age_poly_female_educ = lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ, data = dt)
model_age_poly_female_educ_marstat = lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ + marstat, data = dt)
model_age_poly_female_educ_marstat_wbho = lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ + marstat + wbho, data = dt)


modelsummary(list(model, model_base),
             title = "Table 3: Models for rw and log(rw)",
             fmt = 1,
             statistic = c("conf.int",
                           "s.e. = {std.error}",
                           "t = {statistic}",
                           "p = {p.value}"),
             conf_level = 0.95,
             stars = c("***" = 0,
                       "**" = 0.01,
                       "*" = 0.05,
                       "." = 0.1),
             output = "kableExtra")

modelsummary(list(model_base_age, model_base_age_poly),
             title = "Table 4: Models for age",
             fmt = 1,
             statistic = c("conf.int",
                           "s.e. = {std.error}",
                           "t = {statistic}",
                           "p = {p.value}"),
             conf_level = 0.95,
             stars = c("***" = 0,
                       "**" = 0.01,
                       "*" = 0.05,
                       "." = 0.1),
             output = "kableExtra")

modelsummary(list(model_age_poly_female, model_age_poly_female_educ, model_age_poly_female_educ_marstat, model_age_poly_female_educ_marstat_wbho),
             title = "Table 5: Models for the final model",
             fmt = 1,
             statistic = c("conf.int",
                           "s.e. = {std.error}",
                           "t = {statistic}",
                           "p = {p.value}"),
             conf_level = 0.95,
             stars = c("***" = 0,
                       "**" = 0.01,
                       "*" = 0.05,
                       "." = 0.1),
             output = "kableExtra")
anova(model_final)
# anova(model_final_year)
par(mfrow=c(2, 2))
plot(model_final)
title(main = "Model plot for the task a")
coefplot(model_final_year)
title(main = "Coefficient plot for the task b")
par(mfrow=c(1, 1))