Introduction


The purpose of this empirical project is to investigate the real wage gap, in dollars, between two population groups: non-immigrants and immigrants. In addition to documenting the gap, the project investigates other factors that could explain the differences between the two groups. Regression analysis is used for this purpose, guided by economic theory and research. Once these explanatory factors are established, the second purpose is to find out whether the wage gap changes with the number of years since immigrants moved to the US.



Data


The data used for this empirical project is the 2019 “Current Population Survey (CPS)” extract provided by the “Center for Economic and Policy Research”. Since the purpose of this project is to investigate the real wage gap, the dependent variable is rw (real wage) and forborn (whether the individual is foreign born) is the base independent variable. However, the forborn factor alone cannot explain the real wage gap, so other factors, e.g., marital status and education, also need to be investigated to yield a more complete explanation.

The analysis excludes persons with missing values for rw, i.e., those who are either not in the labor force or unemployed; this leaves 154,279 persons. The real wage is given as rw, but it is transformed into log(rw). The reasons are that this is customary in economic research and that it yields a distribution that is more amenable to analysis. The base independent variable is whether the individual is an immigrant, for which forborn seems to be the best suited variable.
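A minimal sketch of this preparation step, assuming the variable names in the CPS extract (dt_wage and log_rw are illustrative names; the full code is in the appendices):

# Keep only observations with a non-missing real wage and add the log-transformed wage
dt_wage = dt[!is.na(rw)]         # data.table subsetting, as in the appendix code
dt_wage[, log_rw := log(rw)]     # log real wage used as the dependent variable
nrow(dt_wage)                    # 154,279 persons in the 2019 extract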
For the analysis of how the gap evolves with time in the US, the additional independent variable is prinusyr. The prinusyr variable is used rather than the arrived variable because it has more recent data and a nicer distribution. The other independent variables are chosen based on economic theory and research; the factors investigated in the model relate to human capital and demographics: age, education, sex, marital status, and race.
I have chosen not to exclude outliers, even though OLS is sensitive to them. Since the sample is large, the sampling distributions of the estimators are assumed to be approximately normal. I also ignore that some individuals hold multiple jobs. A 95% confidence level (5% significance level) is used throughout the report.



Empirical Approach


The potential independent variables are identified by using economic knowledge and available resources, such as textbooks and research papers, to outline which factors are most likely to affect the real wage gap.


Regression function for the wage gap (task a)


\[log(y_i) = \beta_0 + \beta_{fb} \cdot fb_i + \beta_{fe} \cdot fe_i + \beta_{age1} \cdot age_i + \beta_{age2} \cdot {age_i}^2 + \beta_{educ} \cdot educ_i + \beta_{marstat} \cdot marstat_i + \beta_{wbho} \cdot wbho_i + u_i\]


\(\beta_{fb}\) is the coefficient on the dummy variable indicating whether the individual is an immigrant.
\(\beta_{fe}\) is the coefficient on the dummy variable indicating whether the individual is a female.
Age is represented by two terms:
\(\beta_{age1}\) is the coefficient that models the increasing return to age.
\(\beta_{age2}\) is the counterpart of \(\beta_{age1}\), modelling the diminishing return at older ages.
\(\beta_{educ}\) is the coefficient on the level of education.
\(\beta_{marstat}\) is the coefficient on marital status.
\(\beta_{wbho}\) is the coefficient on the race of the individual.
\(u_i\) is the error term capturing the regression residual due to omitted variables.


The hypotheses are as follows. The coefficient on forborn should be negative, in the same direction as the difference in mean real wages reported in the next section. Females are expected to have lower wages than males, so that coefficient is expected to be negative. The return to age is expected to increase at younger ages and decrease at older ages, therefore \(\beta_{age1}\) should be positive and \(\beta_{age2}\) should be negative. The coefficients on the levels of education should be positive and increasing in magnitude as the level of education rises. For individuals who are not married, the coefficients are expected to be negative. The coefficients for the other race categories should be lower than for the base category, wbhoWhite.
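A minimal sketch of how this specification can be estimated in R, using the variable names from the appendix code (model_a is an illustrative name; the appendix calls it model_final, and lm drops observations with missing rw automatically):

# Task (a): log real wage on immigrant status plus human-capital and demographic controls
model_a = lm(log(rw) ~ forborn + age + I(age^2) + female + educ + marstat + wbho, data = dt)
summary(model_a)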


Regression function for analysing the wage gap over time (task b)


\[log(y_i) = \beta_0 + \beta_{fb} \cdot fb_i + \beta_{fe} \cdot fe_i + \beta_{age1} \cdot age_i + \beta_{age2} \cdot {age_i}^2 + \beta_{educ} \cdot educ_i + \beta_{marstat} \cdot marstat_i + \beta_{prinusyr} \cdot prinusyr_i + \beta_{wbho} \cdot wbho_i + u_i\]


\(\beta_{fb}\) is the coefficient on the dummy variable indicating whether the individual is an immigrant.
\(\beta_{fe}\) is the coefficient on the dummy variable indicating whether the individual is a female.
Age is represented by two terms:
\(\beta_{age1}\) is the coefficient that models the increasing return to age.
\(\beta_{age2}\) is the counterpart of \(\beta_{age1}\), modelling the diminishing return at older ages.
\(\beta_{educ}\) is the coefficient on the level of education.
\(\beta_{marstat}\) is the coefficient on marital status.
\(\beta_{wbho}\) is the coefficient on the race of the individual.
\(\beta_{prinusyr}\) is the coefficient on the variable indicating the period in which the immigrant entered the US.
\(u_i\) is the error term.


The model is similar to the one above, but the variable prinusyr is added to give a fuller picture. The hypothesis is that the wage gap becomes smaller as the number of years in the US increases, which could be caused by effects such as language barriers and integration diminishing over time. Note that prinusyr is a categorical variable indicating the period in which the individual moved to the US: a value of 1 corresponds to moving before 1950, and higher values correspond to more recent arrivals. For natives, a prinusyr value of 0 is added.
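A minimal sketch of this step, following the appendix code: prinusyr is set to 0 for natives and entered as a factor, and the model is estimated with heteroskedasticity-robust (HC1) standard errors via fixest (model_b is an illustrative name):

# Natives get prinusyr = 0 so that they form the reference category of the factor
dt$prinusyr[dt$forborn == 0] = 0
# Task (b): same controls as task (a) plus the entry-period dummies
model_b = feols(log(rw) ~ forborn + as.factor(prinusyr) + age + I(age^2) +
                  female + educ + wbho + marstat, data = dt, vcov = "HC1")
summary(model_b)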



Results


Before the log transformation, the gap in mean real wages is $25.95 - $25.32 = $0.63, which corresponds to a reduction of approximately 2.4% for immigrants. Summary statistics for the real wage are listed in Table 1, and the distributions are shown in Figures 1 and 2.
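The relative gap is computed directly from the group means:
\[\frac{25.95 - 25.32}{25.95} = \frac{0.63}{25.95} \approx 0.024 \approx 2.4\%\]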


Table 1: Real wage of non-immigrants and immigrants
Name Mean SD Min Max Median Quantile.05 Quantile.95 N
Non-immigrants - rw 25.95 19.60 1 392.30 20.00 9.0 64.60 132180
Immigrants - rw 25.32 20.56 1 288.33 18.00 9.0 68.83 22099
Non-immigrants - log(rw) 3.05 0.63 0 5.97 3.00 2.2 4.17 132180
Immigrants - log(rw) 3.00 0.65 0 5.66 2.89 2.2 4.23 22099



Hypothesis testing of the mean real wage


Two-sided t-test


Null hypothesis:
\(H_0: \mu_{im} = \mu_{non}\)


Alternative hypothesis:
\(H_1: \mu_{im} \neq \mu_{non}\)
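The output below comes from the pooled two-sample t-test used in the appendix code:

t.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0],
       alternative = "two.sided", conf.level = 0.95, var.equal = TRUE)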


## 
##  Two Sample t-test
## 
## data:  dt$rw[dt$forborn == 1] and dt$rw[dt$forborn == 0]
## t = -4.4178, df = 154277, p-value = 9.977e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9149068 -0.3525823
## sample estimates:
## mean of x mean of y 
##  25.31543  25.94918


From the t-test, we can reject the null hypothesis that the true means for immigrants and non-immigrants are equal at the 5% significance level.



The regression analysis of immigrants and non-immigrants


Regression function for the wage gap (task a)


\[log(y_i) = \beta_0 + \beta_{fb} \cdot fb_i + \beta_{age1} \cdot age_i + \beta_{age2} \cdot {age_i}^2 + \beta_{fe} \cdot fe_i + \beta_{educ} \cdot educ_i + \beta_{marstat} \cdot marstat_i + \beta_{wbho} \cdot wbho_i + u_i\]


## 
## Call:
## lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ + 
##     marstat + wbho, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7393 -0.2985 -0.0088  0.2983  2.9723 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.755e+00  1.357e-02 129.307  < 2e-16 ***
## forborn1             -5.437e-02  4.435e-03 -12.257  < 2e-16 ***
## age                   4.643e-02  5.806e-04  79.963  < 2e-16 ***
## I(age^2)             -4.566e-04  6.286e-06 -72.630  < 2e-16 ***
## female1              -2.252e-01  2.639e-03 -85.327  < 2e-16 ***
## educHS                1.785e-01  5.889e-03  30.317  < 2e-16 ***
## educSome college      2.899e-01  5.944e-03  48.775  < 2e-16 ***
## educCollege           6.612e-01  6.090e-03 108.577  < 2e-16 ***
## educAdvanced          8.778e-01  6.528e-03 134.472  < 2e-16 ***
## marstatWidowed       -6.391e-02  9.565e-03  -6.681 2.38e-11 ***
## marstatDivorced      -5.914e-02  4.433e-03 -13.340  < 2e-16 ***
## marstatSeparated     -8.588e-02  9.847e-03  -8.721  < 2e-16 ***
## marstatNever Married -9.660e-02  3.596e-03 -26.860  < 2e-16 ***
## wbhoBlack            -1.260e-01  4.489e-03 -28.064  < 2e-16 ***
## wbhoHispanic         -5.312e-02  4.305e-03 -12.339  < 2e-16 ***
## wbhoOther             2.422e-02  5.323e-03   4.550 5.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5119 on 154263 degrees of freedom
##   (137111 observations deleted due to missingness)
## Multiple R-squared:  0.3376, Adjusted R-squared:  0.3376 
## F-statistic:  5243 on 15 and 154263 DF,  p-value: < 2.2e-16


The first regression determines which factors contribute to the wage gap between immigrants and non-immigrants. The possible factors are included in the regression model, and the results are checked against the hypotheses. Some possible factors are left out due to the limits of my knowledge and to avoid overcrowding the model; they are mentioned where they could plausibly contribute to the difference in the real wage. The result shows that 33.76% of the variation in the log real wage can be accounted for by the six mentioned factors (including forborn), with F = 5243 and p < 0.001. All the coefficients are statistically significant at the 5% significance level, with p < 0.001. The coefficients are in the directions expected from the economic theories mentioned above. The correlation plots are in the appendices.


intercept

The intercept is interpreted as the expected real wage of an individual in all the base categories: a non-immigrant, white (wbhoWhite), married, male, with less than high school education (educLTHS), at age zero (which lies outside the observed range). This corresponds to an average real wage of about $5.78. The intercept serves as the base for the dummy variables in the model. The t-statistic and p-value indicate that it is statistically significant.
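Since the dependent variable is in logs, the dollar value is obtained by exponentiating the intercept:
\[e^{\beta_0} = e^{1.755} \approx 5.78\]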


forborn

Being an immigrant is associated with a reduction in the real wage of approximately 5.3% (from the coefficient -0.0544; see the conversion below). The direction is as expected: immigrants tend to have a lower real wage than non-immigrants. The magnitude is larger than the raw gap calculated from the sample, $0.63 (about -2.4%). Since the model explains 33.76% of the variation, there are other factors affecting the real wage that are not included. The t-statistic and p-value indicate that forborn1 is statistically significant.
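Because the dependent variable is log(rw), a coefficient \(\beta\) on a dummy variable translates into an approximate percentage effect of \(e^{\beta} - 1\); this conversion is used throughout the interpretations below. For forborn1:
\[e^{-0.0544} - 1 \approx -0.053 = -5.3\%\]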


age

Age is assumed to have a positive effect on the real wage at younger ages and a negative effect at older ages. For example, aging from 25 to 26 raises the real wage by approximately 2.3%, while aging from 62 to 63 reduces it by approximately 1.1%. The t-statistic and p-value indicate that age and I(age^2) are statistically significant.
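With the quadratic specification, the change in the log wage from age \(a\) to \(a + 1\) is \(\beta_{age1} + \beta_{age2}\left((a+1)^2 - a^2\right)\). For the step from 25 to 26:
\[0.0464 - 0.000457 \cdot (26^2 - 25^2) = 0.0464 - 0.000457 \cdot 51 \approx 0.0231 \;\Rightarrow\; e^{0.0231} - 1 \approx 2.3\%\]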


female

Being a female is associated with a reduction in the real wage of approximately 20.2%. The direction is as expected, and data from the “Office of Federal Contract Compliance Programs (OFCCP)” and economic research support that gender has a causal effect on the real wage. The t-statistic and p-value indicate that the coefficient is statistically significant.


educ

The educ variable is a factor with the values educLTHS (less than high school), educHS (high school), educSome college (some college), educCollege (college), and educAdvanced (advanced).

educLTHS (less than high school): This group is the base, so there is no additional effect from educ.
educHS (high school): Finishing high school increases the real wage by approximately 19.54%.
educSome college (some college): Having some college increases the real wage by approximately 33.63%.
educCollege (college): Having a college degree increases the real wage by approximately 93.71%.
educAdvanced (advanced): Having education beyond a college degree increases the real wage by approximately 140.56%.

As suspected, the level of education has an impact on the real wage, i.e., a higher level of education tends to increase the real wage.


marstat

The marstat variable is a factor with the values marstatMarried (married), marstatWidowed (widowed), marstatDivorced (divorced), marstatSeparated (separated), and marstatNever Married (never married). The t-statistics and p-values indicate that they are statistically significant.

marstatMarried (married): This group is the base, so there is no additional effect from marstat.
marstatWidowed (widowed): Compared to married individuals, they earn approximately 6.19% less.
marstatDivorced (divorced): Compared to married individuals, they earn approximately 5.74% less.
marstatSeparated (separated): Compared to married individuals, they earn approximately 8.23% less.
marstatNever Married (never married): Compared to married individuals, they earn approximately 9.21% less.

The directions of the coefficients are as expected: married individuals earn the most. This marital wage premium could be explained by, e.g., an increase in productivity. While the model may capture part of the causal effect of marital status on the real wage, some relevant variables are left out, for example, the number of children or whether the individual is the head of the family.


wbho

The wbho variable is a categorical variable with the categories wbhoWhite (white), wbhoBlack (black), wbhoHispanic (hispanic), and wbhoOther (other races, including Asians). The t-statistics and p-values indicate that all of them are statistically significant.
wbhoWhite (white): This is the base category, so there is no additional effect from the wbho variable.
wbhoBlack (black): This group has a real wage approximately 11.84% lower, which is in the expected direction.
wbhoHispanic (hispanic): This group has a real wage approximately 5.17% lower, a smaller reduction than for wbhoBlack.
wbhoOther (other): Compared to wbhoWhite, this group has a real wage approximately 2.45% higher.

The directions of the results are similar to the data from the “Office of Federal Contract Compliance Programs (OFCCP)”, i.e., there is a racial disparity in wages: compared to white individuals, black and hispanic individuals earn less, while the other races earn more. However, because native american, multiracial, and asian individuals are grouped into wbhoOther rather than modelled separately, the model cannot be used to draw causal inferences for those groups.



Regression function for analysing the wage gap over time (task b)


\[log(y_i) = \beta_0 + \beta_{fb} \cdot fb_i + \beta_{prinusyr} \cdot prinusyr_i + \beta_{age1} \cdot age_i + \beta_{age2} \cdot {age_i}^2 + \beta_{fe} \cdot fe_i + \beta_{educ} \cdot educ_i + \beta_{marstat} \cdot marstat_i + \beta_{wbho} \cdot wbho_i + u_i\]


## NOTE: 137,111 observations removed because of NA values (LHS: 137,111).
## The variable 'as.factor(prinusyr)25' has been removed because of collinearity (see $collin.var).
## OLS estimation, Dep. Var.: log(rw)
## Observations: 154,279 
## Standard-errors: Heteroskedasticity-robust 
##                        Estimate Std. Error   t value   Pr(>|t|)    
## (Intercept)            1.760898 0.01308710 134.55220  < 2.2e-16 ***
## forborn1              -0.123556 0.01275814  -9.68447  < 2.2e-16 ***
## as.factor(prinusyr)1  -0.436079 0.39530301  -1.10315 2.6996e-01    
## as.factor(prinusyr)2   0.122903 0.06972513   1.76269 7.7955e-02 .  
## as.factor(prinusyr)3   0.205241 0.05898235   3.47971 5.0210e-04 ***
## as.factor(prinusyr)4   0.181118 0.03765276   4.81022 1.5091e-06 ***
## as.factor(prinusyr)5   0.175386 0.02827880   6.20202 5.5885e-10 ***
## as.factor(prinusyr)6   0.124138 0.02196419   5.65182 1.5903e-08 ***
## as.factor(prinusyr)7   0.116691 0.02365050   4.93398 8.0656e-07 ***
## as.factor(prinusyr)8   0.108191 0.02802677   3.86028 1.1330e-04 ***
## as.factor(prinusyr)9   0.101984 0.02291253   4.45103 8.5521e-06 ***
## as.factor(prinusyr)10  0.121007 0.02459377   4.92023 8.6530e-07 ***
## as.factor(prinusyr)11  0.035057 0.02157768   1.62470 1.0423e-01    
## as.factor(prinusyr)12  0.104475 0.02012037   5.19249 2.0776e-07 ***
## as.factor(prinusyr)13  0.056727 0.02078213   2.72961 6.3416e-03 ** 
## as.factor(prinusyr)14  0.064537 0.01947024   3.31466 9.1774e-04 ***
## as.factor(prinusyr)15  0.072992 0.02071420   3.52377 4.2558e-04 ***
## as.factor(prinusyr)16  0.089448 0.01826536   4.89714 9.7340e-07 ***
## as.factor(prinusyr)17  0.072768 0.01690134   4.30543 1.6676e-05 ***
## as.factor(prinusyr)18  0.061826 0.01938976   3.18857 1.4301e-03 ** 
## as.factor(prinusyr)19  0.053517 0.01918289   2.78984 5.2741e-03 ** 
## as.factor(prinusyr)20  0.033724 0.01959665   1.72091 8.5270e-02 .  
## as.factor(prinusyr)21  0.055032 0.02020803   2.72325 6.4650e-03 ** 
## as.factor(prinusyr)22  0.069681 0.02096781   3.32326 8.8993e-04 ***
## as.factor(prinusyr)23  0.045798 0.02019255   2.26807 2.3326e-02 *  
## as.factor(prinusyr)24  0.042194 0.01955314   2.15791 3.0936e-02 *  
## age                    0.046300 0.00061285  75.54815  < 2.2e-16 ***
## I(age^2)              -0.000457 0.00000695 -65.78739  < 2.2e-16 ***
## female1               -0.225442 0.00264649 -85.18512  < 2.2e-16 ***
## educHS                 0.179269 0.00456312  39.28650  < 2.2e-16 ***
## educSome college       0.290143 0.00468610  61.91567  < 2.2e-16 ***
## educCollege            0.661835 0.00520991 127.03386  < 2.2e-16 ***
## educAdvanced           0.879434 0.00591695 148.62972  < 2.2e-16 ***
## wbhoBlack             -0.125143 0.00431665 -28.99070  < 2.2e-16 ***
## wbhoHispanic          -0.054870 0.00410348 -13.37148  < 2.2e-16 ***
## wbhoOther              0.025482 0.00556861   4.57598 4.7437e-06 ***
## marstatWidowed        -0.063384 0.00999178  -6.34359 2.2510e-10 ***
## marstatDivorced       -0.059469 0.00451037 -13.18505  < 2.2e-16 ***
## marstatSeparated      -0.086215 0.00928273  -9.28764  < 2.2e-16 ***
## marstatNever Married  -0.097451 0.00360065 -27.06479  < 2.2e-16 ***
## ... 1 variable was removed because of collinearity (as.factor(prinusyr)25)
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.511678   Adj. R2: 0.338051


The second regression determines whether the number of years since entry contributes to the wage gap. I use the same model but include prinusyr as a factor variable, where native-born individuals have a prinusyr value of zero, so that, holding the same factors as above constant, variation in the real wage within the immigrant group can be related to time since entry. The result shows that 33.8% of the variation in the log real wage can be accounted for by the seven mentioned factors (including prinusyr), with RMSE = 0.512. The coefficients are in the directions expected from the economic theories mentioned above. The prinusyr coefficients support the hypothesis that the wage gap narrows the longer immigrants have been in the US.


prinusyr

The wage gap becomes smaller the longer immigrants have resided in the US. As mentioned, prinusyr is a factor variable, where a higher value means a more recent entry. Because the prinusyr = 25 dummy is dropped due to collinearity with forborn, the prinusyr coefficients are interpreted relative to the most recent arrivals: for example, immigrants who entered during 2014-2017 (prinusyr = 24) earn approximately 4.31% more, and immigrants who entered during 1950-1959 (prinusyr = 2) earn approximately 13.08% more, than otherwise similar immigrants who entered most recently. Combined with the forborn1 coefficient, this means the gap relative to natives shrinks with time in the US. The t-statistics and p-values for prinusyr are statistically significant, except for prinusyr = 1 and 11; note that only 7 individuals have prinusyr = 1 (and rw not NA).
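As a rough check, the gap relative to natives for a given entry cohort combines the forborn1 coefficient with the cohort coefficient:
\[\text{prinusyr} = 25:\; -0.124, \qquad \text{prinusyr} = 24:\; -0.124 + 0.042 \approx -0.082, \qquad \text{prinusyr} = 2:\; -0.124 + 0.123 \approx -0.001\]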


Compared to the first model, the directions of the coefficients are the same, with slightly different magnitudes.


Table 2: Group by the variable prinusyr
prinusyr Mean rw Mean age Male Female LTHS HS Some college College Advanced Total
0 25.95 48.29 66465 65715 5857 36292 39468 32489 18074 132180
1 16.34 80.27 3 4 0 3 0 0 4 7
2 27.45 75.88 42 42 9 22 21 16 16 84
3 30.20 73.21 58 71 14 35 32 25 23 129
4 28.45 69.22 128 130 24 77 69 53 35 258
5 29.42 65.06 263 215 77 126 95 102 78 478
6 28.34 60.29 454 421 168 208 181 185 133 875
7 27.12 58.24 360 306 137 187 123 134 85 666
8 27.81 55.59 254 216 91 107 103 92 77 470
9 26.70 54.84 342 280 125 172 103 128 94 622
10 27.32 54.02 326 265 128 146 106 141 70 591
11 24.25 52.47 501 364 192 256 155 152 110 865
12 26.87 50.78 553 542 200 328 179 235 153 1095
13 24.88 49.26 446 425 185 229 159 170 128 871
14 24.74 48.84 569 479 208 309 186 197 148 1048
15 25.82 47.56 574 432 172 284 188 211 151 1006
16 25.78 45.59 810 624 299 394 244 272 225 1434
17 24.02 43.42 1005 833 425 533 294 326 260 1838
18 23.91 41.45 587 569 247 325 194 220 170 1156
19 23.28 40.80 672 622 304 356 201 240 193 1294
20 23.66 39.88 572 520 203 292 191 230 176 1092
21 25.15 41.06 601 452 182 282 172 212 205 1053
22 26.19 39.11 510 472 122 260 164 219 217 982
23 25.00 38.70 571 459 129 299 152 238 212 1030
24 25.09 36.58 718 571 167 336 173 309 304 1289
25 23.60 35.74 1051 815 254 464 274 451 423 1866






Summary and conclusion


The main objectives were to explain the real wage gap between immigrants and non-immigrants and to determine whether the gap changes with time since moving to the US. The results from the models point to several possible factors. The real wage increases with age up to a certain point and decreases thereafter. There are gender and race disparities in the sample, i.e., gender and race do affect the real wage. Individuals with a higher level of education earn more than individuals with a lower level of education. The model also shows the existence of a marital status premium in the sample.
The wage gap between immigrants and non-immigrants also shrinks as the number of years since the individual moved to the US increases.



References




Appendices


Table 3: Models for rw and log(rw)
 (1)   (2)
(Intercept) 25.9*** 3.0***
[25.8, 26.1] [3.0, 3.1]
s.e. = 0.1 s.e. = 0.0
t = 478.0 t = 1763.6
p = <0.1 p = <0.1
forborn1 -0.6** 0.0**
[-0.9, -0.4] [-0.1, 0.0]
s.e. = 0.1 s.e. = 0.0
t = -4.4 t = -10.6
p = <0.1 p = <0.1
Num.Obs. 154279 154279
R2 0.000 0.001
R2 Adj. 0.000 0.001
AIC 1358130.1 1233607.1
BIC 1358159.9 1233636.9
Log.Lik. -679062.029 -147325.022
RMSE 19.74 0.63
. p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0
Table 4: Models for age
 (1)   (2)
(Intercept) 2.7*** 1.5***
[2.6, 2.7] [1.4, 1.5]
s.e. = 0.0 s.e. = 0.0
t = 546.4 t = 117.7
p = <0.1 p = <0.1
forborn1 -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1]
s.e. = 0.0 s.e. = 0.0
t = -13.7 t = -23.7
p = <0.1 p = <0.1
age 0.0*** 0.1***
[0.0, 0.0] [0.1, 0.1]
s.e. = 0.0 s.e. = 0.0
t = 86.8 t = 117.4
p = <0.1 p = <0.1
I(age^2) 0.0***
[0.0, 0.0]
s.e. = 0.0
t = -103.4
p = <0.1
Num.Obs. 154279 154279
R2 0.047 0.109
R2 Adj. 0.047 0.109
AIC 1226253.9 1215912.2
BIC 1226293.7 1215961.9
Log.Lik. -143647.427 -138475.584
RMSE 0.61 0.59
. p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0
Table 5: Models for the final model
 (1)   (2)   (3)   (4)
(Intercept) 1.6*** 1.5*** 1.8*** 1.8***
[1.5, 1.6] [1.5, 1.6] [1.7, 1.8] [1.7, 1.8]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 125.7 t = 134.2 t = 129.8 t = 129.3
p = <0.1 p = <0.1 p = <0.1 p = <0.1
forborn1 -0.1** -0.1** -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1] [-0.1, -0.1] [-0.1, 0.0]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = -25.8 t = -15.3 t = -17.1 t = -12.3
p = <0.1 p = <0.1 p = <0.1 p = <0.1
age 0.1*** 0.1*** 0.0*** 0.0***
[0.1, 0.1] [0.1, 0.1] [0.0, 0.0] [0.0, 0.0]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 119.3 t = 97.6 t = 78.0 t = 80.0
p = <0.1 p = <0.1 p = <0.1 p = <0.1
I(age^2) 0.0*** 0.0*** 0.0*** 0.0***
[0.0, 0.0] [0.0, 0.0] [0.0, 0.0] [0.0, 0.0]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = -105.1 t = -84.7 t = -70.7 t = -72.6
p = <0.1 p = <0.1 p = <0.1 p = <0.1
female1 -0.2*** -0.2*** -0.2*** -0.2***
[-0.2, -0.2] [-0.2, -0.2] [-0.2, -0.2] [-0.2, -0.2]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = -64.3 t = -88.2 t = -86.5 t = -85.3
p = <0.1 p = <0.1 p = <0.1 p = <0.1
educHS 0.2** 0.2** 0.2**
[0.2, 0.2] [0.2, 0.2] [0.2, 0.2]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 31.6 t = 31.6 t = 30.3
p = <0.1 p = <0.1 p = <0.1
educSome college 0.3*** 0.3*** 0.3***
[0.3, 0.3] [0.3, 0.3] [0.3, 0.3]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 51.0 t = 50.6 t = 48.8
p = <0.1 p = <0.1 p = <0.1
educCollege 0.7*** 0.7*** 0.7***
[0.7, 0.7] [0.7, 0.7] [0.6, 0.7]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 114.1 t = 113.0 t = 108.6
p = <0.1 p = <0.1 p = <0.1
educAdvanced 0.9*** 0.9*** 0.9***
[0.9, 0.9] [0.9, 0.9] [0.9, 0.9]
s.e. = 0.0 s.e. = 0.0 s.e. = 0.0
t = 141.2 t = 139.5 t = 134.5
p = <0.1 p = <0.1 p = <0.1
marstatWidowed -0.1** -0.1**
[-0.1, -0.1] [-0.1, 0.0]
s.e. = 0.0 s.e. = 0.0
t = -7.2 t = -6.7
p = <0.1 p = <0.1
marstatDivorced -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1]
s.e. = 0.0 s.e. = 0.0
t = -14.3 t = -13.3
p = <0.1 p = <0.1
marstatSeparated -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1]
s.e. = 0.0 s.e. = 0.0
t = -10.5 t = -8.7
p = <0.1 p = <0.1
marstatNever Married -0.1** -0.1**
[-0.1, -0.1] [-0.1, -0.1]
s.e. = 0.0 s.e. = 0.0
t = -31.0 t = -26.9
p = <0.1 p = <0.1
wbhoBlack -0.1**
[-0.1, -0.1]
s.e. = 0.0
t = -28.1
p = <0.1
wbhoHispanic -0.1**
[-0.1, 0.0]
s.e. = 0.0
t = -12.3
p = <0.1
wbhoOther 0.0**
[0.0, 0.0]
s.e. = 0.0
t = 4.6
p = <0.1
Num.Obs. 154279 154279 154279 154279
R2 0.132 0.329 0.334 0.338
R2 Adj. 0.132 0.329 0.333 0.338
AIC 1211830.1 1172231.8 1171136.5 1170190.9
BIC 1211889.7 1172331.2 1171275.7 1170360.0
Log.Lik. -136433.521 -116630.376 -116078.719 -115602.942
RMSE 0.59 0.52 0.51 0.51
. p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0
## Analysis of Variance Table
## 
## Response: log(rw)
##               Df Sum Sq Mean Sq  F value    Pr(>F)    
## forborn        1     44    44.4   169.24 < 2.2e-16 ***
## age            1   2840  2839.5 10835.09 < 2.2e-16 ***
## I(age^2)       1   3771  3771.0 14389.41 < 2.2e-16 ***
## female         1   1421  1420.7  5421.15 < 2.2e-16 ***
## educ           4  11991  2997.7 11438.82 < 2.2e-16 ***
## marstat        4    292    73.0   278.50 < 2.2e-16 ***
## wbho           3    250    83.4   318.13 < 2.2e-16 ***
## Residuals 154263  40427     0.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Import libraries
library(foreign)
library(dplyr)
library(ggplot2)
library(car)
library(lmtest)
library(sandwich)
library(fixest)
library(tibble)
library(tidyverse)
library(data.table)
library(modelsummary)
library(kableExtra)

print("Loaded library")
# Read the data as data table
dt = read.dta("cepr_org_2019.dta")
dt = as.data.table(dt)
print("Loaded data")
# Functions
# Summary statistics for a numeric vector: mean, sd, min, max, median, 5th/95th percentiles, N
dstat = function(x, ...) {
  c(mean = mean(x, ...),
    sd = sd(x, ...),
    min = min(x, ...),
    max = max(x, ...),
    median = median(x, ...),
    quantile_05 = quantile(x, 0.05, ...),
    quantile_95 = quantile(x, 0.95, ...),
    N = sum(!is.na(x)))
}

# Coefficient table with HC1 heteroskedasticity-robust standard errors
coeftest.hc1 = function (x, ...) {
  coeftest(x, vcovHC(x, type = "HC1"), ...)[1:x$rank,]
}

# Wrapper around modelsummary: confidence interval, s.e., t and p for each coefficient
sum_models = function(x, ...) {
  modelsummary(list(x, ...),
               fmt = 1,
               statistic = c("conf.int",
                             "s.e. = {std.error}",
                             "t = {statistic}",
                             "p = {p.value}"),
               conf_level = 0.95,
               stars = c("***" = 0,
                         "**" = 0.01,
                         "*" = 0.05,
                         "." = 0.1))
}

print("Loaded functions")
# To factors
dt$forborn = factor(dt$forborn)
dt$female = factor(dt$female)
dt$married = factor(dt$married)
dt$metro = factor(dt$metro)
dt$union = factor(dt$union)
dt$cert = factor(dt$cert)

print("Convert to factors")
# Answer to question 1
sum_tb = tibble(Mean = double(),
                SD = double(),
                Min = double(),
                Max = double(),
                Median = double(),
                Quantile.05 = double(),
                Quantile.95 = double(),
                N = integer())

# rw
sum_tb[nrow(sum_tb) + 1, ] = t(dstat(dt$rw[dt$forborn==0], na.rm = TRUE))
sum_tb[nrow(sum_tb) + 1, ] = t(dstat(dt$rw[dt$forborn==1], na.rm = TRUE))

# log(rw)
sum_tb[nrow(sum_tb) + 1, ] = t(dstat(log(dt$rw[dt$forborn==0]), na.rm = TRUE))
sum_tb[nrow(sum_tb) + 1, ] = t(dstat(log(dt$rw[dt$forborn==1]), na.rm = TRUE))

# Add "Name" column
sum_tb = sum_tb %>%
  add_column(Name = c("Non-immigrants - rw", "Immigrants - rw", "Non-immigrants -log(rw)", "Immigrants - log(rw)"), .before = "Mean")

# Print table
sum_tb %>%
  kbl(caption = "Table 1: Real wage of non-immigrants and immigrants", digits = 2) %>%
  column_spec(2, color = "cornflowerblue") %>%
  kable_minimal()

# Confidence interval for the difference in mean rw (immigrants vs. non-immigrants)
# t-test ---> Reject the H0
# H0: true difference in means is equal to 0
# H1: true difference in means is not equal to 0
t.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "two.sided", conf.level = 0.95, var.equal = TRUE)
# wilcox.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "two.sided", conf.level = 0.95)
# t-test ---> Fail to reject the H0
# H0: x has a smaller mean than y
# H1: x has a larger mean than y
t.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "greater", conf.level = 0.95)
# wilcox.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "greater", conf.level = 0.95)
# t-test ---> Reject the H0
# H0: x has a larger mean than y
# H1: x has a smaller mean than y
t.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "less", conf.level = 0.95)
# wilcox.test(x = dt$rw[dt$forborn == 1], y = dt$rw[dt$forborn == 0], alternative = "less", conf.level = 0.95)
# For the wage gap
model_final = lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ + marstat + wbho, data = dt)

summary(model_final)

# Set prinusyr = 0 for forborn = 0
dt$prinusyr[dt$forborn==0] = 0
model_final_year = feols(log(rw) ~ forborn + as.factor(prinusyr) + age + I(age^2) + female + educ + wbho + marstat, data = dt, vcov = "HC1")
summary(model_final_year)
dt_by_prins = dt %>% group_by(prinusyr)%>%
        summarise(mean_rw = mean(rw, na.rm = TRUE),
                  mean_age = mean(age, na.rm = TRUE),
                  n_male = sum(!is.na(rw) & female==0),
                  n_female = sum(!is.na(rw) & female==1),
                  n_educ_lths = sum(!is.na(rw) & educ=="LTHS"),
                  n_educ_hs = sum(!is.na(rw) & educ=="HS"),
                  n_educ_sc = sum(!is.na(rw) & educ=="Some college"),
                  n_educ_c = sum(!is.na(rw) & educ=="College"),
                  n_educ_adv = sum(!is.na(rw) & educ=="Advanced"),
                  n = sum(!is.na(rw)))
dt_by_prins %>%
        kbl(caption = "Table 2: Group by the variable prinusyr", digits = 2, col.names = c("prinusyr", "Mean rw", "Mean age", "Male", "Female", "LTHS", "HS", "Some college", "College", "Advanced", "Total")) %>%
        column_spec(2, color = "cornflowerblue") %>%
        kable_minimal()
ggplot(data = dt_by_prins, aes(x = prinusyr)) +
        geom_line(aes(y = mean_rw, color = "red")) +
        geom_line(aes(y = mean_age, color = "green"))

ggplot(data = tail(dt_by_prins, -1), aes(x = prinusyr)) +
        geom_line(aes(y = n_female, color = "red")) +
        geom_line(aes(y = n_male, color = "green")) +
        geom_line(aes(y = n, color = "pink"))

ggplot(data = tail(dt_by_prins, -1), aes(x = prinusyr)) +
        geom_line(aes(y = n_educ_lths, color = "red")) +
        geom_line(aes(y = n_educ_hs, color = "green")) +
        geom_line(aes(y = n_educ_sc, color = "pink")) +
        geom_line(aes(y = n_educ_c, color = "orange")) +
        geom_line(aes(y = n_educ_adv, color = "blue"))
# Distributions of rw and log(rw)
ggplot(dt, aes(x = rw, color = forborn)) +
        geom_density(aes(fill = forborn), alpha = 0.5, na.rm = TRUE) +
        labs(title = "Figure 1: Distribution of real wage of immigrants and non-immigrants",
             x = "rw", y = "Density")

ggplot(dt, aes(x = log(rw), color = forborn)) +
        geom_density(aes(fill = forborn), alpha = 0.5, na.rm = TRUE) +
        labs(title = "Figure 2: Distribution of log(real wage) of immigrants and non-immigrants",
             x = "log(rw)", y = "Density")

ggplot(dt, aes(x = female, y = log(rw), color = forborn)) +
        geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) +
        labs(title = "Figure 3: Correlation between forborn and female",
             x = "Gender",
             y = "log(rw)")

ggplot(dt, aes(x = age, y = log(rw))) +
        geom_jitter(aes(color = forborn), width = 0.1, height = 0.1, na.rm = TRUE) +
        stat_summary(fun = mean, geom = "line", aes(group = 1), na.rm = TRUE) +
        facet_wrap(~forborn) +
        labs(title = "Figure 4: Correlation between forborn and age",
             x = "Age",
             y = "log(rw)")

ggplot(dt, aes(x = educ, y = log(rw), color = forborn)) +
        geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) +
        labs(title = "Figure 5: Correlation between forborn and educ",
             x = "Education",
             y = "log(rw)")

ggplot(dt, aes(x = marstat, y = log(rw), color = forborn)) +
        geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) +
        labs(title = "Figure 6: Correlation between forborn and marstat",
             x = "Marital status",
             y = "log(rw)")

ggplot(dt, aes(x = wbho, y = log(rw), color = forborn)) +
        geom_boxplot(outlier.alpha = 0.1, na.rm = TRUE) +
        labs(title = "Figure 7: Correlation between forborn and wbho",
             x = "Race",
             y = "log(rw)")

ggplot(dt, aes(x = prinusyr, y = log(rw))) +
        geom_jitter(width = 0.1, height = 0.1, na.rm = TRUE) +
        geom_smooth(method = lm, na.rm = TRUE, formula = y ~ x) +
        labs(title = "Figure 8: Correlation between log(rw) and prinusyr",
             x = "Years since entered",
             y = "log(rw)")


model = lm(formula = rw ~ forborn, data = dt)
model_base = lm(formula = log(rw) ~ forborn, data = dt)
model_base_age = lm(formula = log(rw) ~ forborn + age, data = dt)
model_base_age_poly = lm(formula = log(rw) ~ forborn + age + I(age^2), data = dt)
model_base_female = lm(formula = log(rw) ~ forborn + female, data = dt)
model_base_educ = lm(formula = log(rw) ~ forborn + educ, data = dt)
model_base_wbho = lm(formula = log(rw) ~ forborn + wbho, data = dt)
model_base_marstat = lm(formula = log(rw) ~ forborn + marstat, data = dt)
model_base_metro = lm(formula = log(rw) ~ forborn + metro, data = dt)
model_age_female = lm(formula = log(rw) ~ forborn + age + female, data = dt)
model_age_poly_female = lm(formula = log(rw) ~ forborn + age + I(age^2) + female, data = dt)
model_age_poly_female_educ = lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ, data = dt)
model_age_poly_female_educ_marstat = lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ + marstat, data = dt)
model_age_poly_female_educ_marstat_wbho = lm(formula = log(rw) ~ forborn + age + I(age^2) + female + educ + marstat + wbho, data = dt)


modelsummary(list(model, model_base),
             title = "Table 3: Models for rw and log(rw)",
             fmt = 1,
             statistic = c("conf.int",
                           "s.e. = {std.error}",
                           "t = {statistic}",
                           "p = {p.value}"),
             conf_level = 0.95,
             stars = c("***" = 0,
                       "**" = 0.01,
                       "*" = 0.05,
                       "." = 0.1),
             output = "kableExtra")

modelsummary(list(model_base_age, model_base_age_poly),
             title = "Table 4: Models for age",
             fmt = 1,
             statistic = c("conf.int",
                           "s.e. = {std.error}",
                           "t = {statistic}",
                           "p = {p.value}"),
             conf_level = 0.95,
             stars = c("***" = 0,
                       "**" = 0.01,
                       "*" = 0.05,
                       "." = 0.1),
             output = "kableExtra")

modelsummary(list(model_age_poly_female, model_age_poly_female_educ, model_age_poly_female_educ_marstat, model_age_poly_female_educ_marstat_wbho),
             title = "Table 5: Models for the final model",
             fmt = 1,
             statistic = c("conf.int",
                           "s.e. = {std.error}",
                           "t = {statistic}",
                           "p = {p.value}"),
             conf_level = 0.95,
             stars = c("***" = 0,
                       "**" = 0.01,
                       "*" = 0.05,
                       "." = 0.1),
             output = "kableExtra")
anova(model_final)
# anova(model_final_year)
par(mfrow=c(2, 2))
plot(model_final)
title(main = "Model plot for the task a")
coefplot(model_final_year)
title(main = "Coefficient plot for the task b")
par(mfrow=c(1, 1))