A bivariate approach to the Mincerian earnings equation

Abstract. This paper estimates bivariate regressions for wages and hours worked as an alternative to the univariate Mincerian earnings equation. The regressions for the bivariate vector of dependent variables include both common and specific covariates. Using individual-level data from the Brazilian National Household Sample Survey (PNAD), the Student t distribution produced the best fit to the data according to information criteria and the Mahalanobis distance. The bivariate estimation accounts for correlation between the dependent variables, identifies antagonistic effects of common covariates, and allows different bivariate distributions to be assumed. Education, type of employment contract and geographical region affect wages and hours worked in opposite directions.


Introduction
The Mincerian earnings equation introduced by Mincer (1974) is the baseline for a broad empirical literature on labor economics, including contributions by Senna (1976), Card (1999), Resende and Wyllie (2006) and Aali-Bujari et al. (2019). These studies generally seek to estimate the returns to education and experience on the wage rate earned by the worker. Mincer proposed that the distribution of wages among different occupations is positively correlated with the amount of investment in human capital, which positively affects productivity and economic growth. The Mincerian earnings equation was originally represented by a linear regression in which the wage rate was explained by education and experience. Following this approach by Mincer (1974), other explanatory variables were included in the regression, such as the individual characteristics of gender and race, which are used to assess the presence of discrimination in the labor market.
When deciding to join the labor market, a worker chooses the quantity of hours that he will supply to the market. Sedlacek and Santos (1991) used data from the Brazilian National Household Sample Survey (PNAD) to analyze the relationship between the husband's income and the labor supply by the spouse. They found that the higher the husband's income, the higher the reservation wage and the less likely the wife is to work. Moreover, the younger and more numerous the children in the family, the less likely wives are to join the labor market or, when they do so, the fewer hours of work they supply.
As far as estimation methods are concerned, since Mincer (1974), the literature has used the traditional ordinary least squares (OLS) method and its variants with instrumental variables, quantile regression, sample selection, and procedures based on maximum likelihood estimation [Chatterjee and Price (1991), Heckman (1979), Buchinsky (2001)]. In Brazil, the greater availability of microdata and the improvement of computational capacity contributed to the expansion of the empirical evidence, as highlighted by Maciel et al. (2001), Giuberti and Menezes-Filho (2005) and Madalozzo (2010).
A common feature in the literature is the use of earnings per hour as the dependent variable in the Mincerian equation. This variable, in general, is obtained by simple division of the wage earned by the hours worked in the period. Such an approach, however, merges into a single variable two distinct components, earnings and hours worked, which should be modeled separately. The determinants of earnings and hours worked are not necessarily the same, and those that enter both regressions might differ in either quantitative (magnitudes) or qualitative (signs) terms.
This feature is not captured by traditional estimates of the Mincerian equation that use the wage rate as the sole dependent variable. The stock of human capital, measured by formal education and experience, for instance, tends to increase the workers' remuneration, but it might also reduce the willingness to supply working hours in the labor market. Those who are more qualified might receive higher remuneration by working fewer hours than those who are less qualified. These antagonistic effects of education on wages and hours worked are not captured by the univariate estimates of the Mincerian earnings equation.
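To see why a single wage-per-hour regression cannot separate the two channels, consider a minimal sketch. It assumes, for illustration only, that log earnings $Y_{1i}$ and log hours $Y_{2i}$ are each linear in the same covariate vector $x_i$; the symbols $\beta_1$, $\beta_2$, $\varepsilon_{1i}$ and $\varepsilon_{2i}$ are hypothetical names introduced here.

```latex
\begin{aligned}
\log Y_{1i} &= x_i^\top \beta_1 + \varepsilon_{1i}, \\
\log Y_{2i} &= x_i^\top \beta_2 + \varepsilon_{2i}, \\
\log\!\left(\frac{Y_{1i}}{Y_{2i}}\right) &= x_i^\top (\beta_1 - \beta_2) + (\varepsilon_{1i} - \varepsilon_{2i}).
\end{aligned}
```

Under these assumptions the wage-rate regression identifies only the difference $\beta_1 - \beta_2$. A covariate with a positive effect on earnings and a negative effect on hours simply appears as a single, larger positive coefficient, so the two opposite signs are not visible.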
Therefore, there is a gap in the literature that this study seeks to fill. The common practice of using earnings per hour as the dependent variable might hide effects of covariates that would be distinct if assessed separately by regressions on wages and hours worked. In contrast to the classical approach, this paper estimates a bivariate regression for the Mincerian equation, considering earnings and hours worked as a bivariate vector of dependent variables. The regressions include both common and specific covariates for the bivariate vector of earnings and hours worked. The bivariate Normal, Student t, and Birnbaum-Saunders (BS) distributions are used in the estimation. For the sake of comparison, the univariate Mincerian earnings equation is also estimated, considering a single dependent variable represented by earnings per hour worked. Estimates are made for the Brazilian economy using data extracted from the Brazilian National Household Sample Survey (PNAD) for the period from 2013 to 2015.
Advantages of the bivariate regression approach include the possibility of modeling a correlation structure among the dependent variables. If there is correlation, the estimation of separate univariate regressions for earnings and hours worked might provide biased results [Marchant et al. (2016)]. The bivariate framework also makes it possible to identify antagonistic effects of common covariates on the two dependent variables.
Finally, there is flexibility to assume different bivariate distributions for the earnings and hours-worked model. As in Heckman (1976), the parameters will be estimated by maximum likelihood, which is efficient according to Mittelhammer et al. (2000). Thus, the bivariate model emerges as an important alternative to the univariate equation that is traditionally estimated for the Mincerian earnings equation.
The results indicate that some common explanatory variables have estimated coefficients with different signs and magnitudes in the bivariate regression of earnings and hours worked. Specifically, the estimated coefficients for education, type of employment contract, and geographical region have distinct signs and different magnitudes in the wage and hours-worked regressions. Considering education, for instance, more years in school imply a higher average wage and a lower supply of hours to the labor market.
In the univariate regression, however, only the positive effect of an additional year of study on the wage rate is observed. Furthermore, the bivariate model captures the correlation between the two dependent variables, which increases robustness relative to the estimation of separate univariate regressions. Thus, there are important advantages associated with the bivariate approach when compared to the univariate regression, suggesting that the former is more suitable for the estimation of the Mincerian earnings equation.
The paper is organized as follows. Section 2 describes the empirical model, presents the database, and reports and discusses the results. Finally, the third section is dedicated to the concluding remarks.

Econometric approach

Empirical model
The Mincerian earnings equation is typically described by the following univariate regression:

$$\log(\boldsymbol{w}) = \boldsymbol{X}\boldsymbol{\gamma} + \boldsymbol{\varepsilon}, \tag{1}$$

where $\log(\boldsymbol{w})$ is a vector with the logarithm of the wage per hour (dependent variable), $\boldsymbol{\gamma}$ is a vector of coefficients, $\boldsymbol{X}$ is a matrix of explanatory variables, such as education, experience, race, gender and others, and $\boldsymbol{\varepsilon}$ is a random error vector, usually assumed to follow a normal distribution. The distinguishing feature of the present paper is to model the earnings equation (1) as a bivariate regression of wages and hours worked separately, in order to capture different effects of the common explanatory variables on wages and labor supply. Furthermore, as earnings and hours worked are correlated, the bivariate regression is more appropriate than the univariate estimation of separate regressions.
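As a minimal sketch of the univariate benchmark, the snippet below fits a one-regressor version of the wage-per-hour regression by ordinary least squares. The data are hypothetical and noise-free, the helper name `ols_simple` is introduced only for illustration, and the paper's actual specification uses many more covariates and was estimated in R.

```python
import math


def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: years of schooling, monthly earnings, monthly hours.
schooling = [0, 4, 8, 11, 15]
earnings = [math.exp(6.0 + 0.05 * s) for s in schooling]
hours = [160.0] * len(schooling)

# Dependent variable of the univariate equation: log of earnings per hour.
log_wage_rate = [math.log(e / h) for e, h in zip(earnings, hours)]

intercept, slope = ols_simple(schooling, log_wage_rate)
print(f"intercept={intercept:.4f}, return to schooling={slope:.4f}")
```

Because hours are held constant in this toy data set, the slope reflects only the earnings channel; with varying hours the same regression would mix both channels.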
In the bivariate environment, the model can be estimated for a vector of dependent variables $\boldsymbol{Y}_i = (Y_{1i}, Y_{2i})^\top$, where $Y_{1i}$ is the wage in the main job and $Y_{2i}$ represents the hours dedicated to the main job by each individual $i$. This vector might be modeled by a set $\boldsymbol{x}$ of explanatory variables using one of the bivariate distributions described in the Appendix, such that:

i) Bivariate Normal distribution: $(\log Y_{1i}, \log Y_{2i})^\top \sim \mathrm{NBiv}(\mu_{1i}, \mu_{2i}, \sigma_1, \sigma_2, \rho)$;

ii) Bivariate t distribution: $(\log Y_{1i}, \log Y_{2i})^\top \sim \mathrm{tBiv}(\mu_{1i}, \mu_{2i}, \sigma_1, \sigma_2, \nu, \rho)$;

iii) Bivariate BS distribution: $(Y_{1i}, Y_{2i})^\top \sim \mathrm{BSBiv}(\mu_{1i}, \mu_{2i}, \delta_1, \delta_2, \rho)$,

where $\mu_{1i}$ and $\mu_{2i}$ depend on the explanatory variables. Notice that in the cases of the Normal and t, we assume that the dependent variables have bivariate log-normal and log-t distributions, which implies that the logarithms of the variables follow the bivariate Normal and t distributions, respectively [Vanegas and Paula (2016)]. For the bivariate BS distribution, it is not necessary to apply the logarithm because this distribution is parameterized in terms of its means [Saulo et al. (2021; 2020)]. Based on the literature, we defined the set of covariates used in the estimations and separated the covariates that affect both earnings and hours worked from those that affect only one of them.
The common covariates, which affect both earnings and hours worked, are:
• Gender: dummy variable that assumes value 1 for men and 0 for women;
• Race: dummy variable that assumes value 1 for Caucasians and 0 for non-Caucasians;
• Marital status: dummy variable that assumes value 1 for married and 0 for unmarried individuals;
• Age and age²: age, measured in years, and age squared are used as a proxy for the labor market experience of the individual, following the literature;
• Years of schooling: a proxy for education, ranging from 0 to 16 years of study in the sample;
• Category (high, high mean, mean, low mean, low): binary variables used to designate the occupational category, segmented according to socioeconomic criteria and having the low category as the reference;
• Type of employment contract (with employment record card, without employment record card, autonomous, civil servant): dummy variables that seek to capture the type of occupation of the individual in the labor market, having "with employment record card" as the base category.
The covariates that affect only earnings are:
• Labor union: dummy variable that assumes value 1 for individuals affiliated to any labor union and 0 for those who were not affiliated;
• Social Security: dummy variable that assumes value 1 for individuals who contributed to social security in the reference period and 0 for those who did not;
• Time in job: number of years employed in the current main job, ranging from 0 to 56 years in the sample.
The covariates that affect only hours worked are:
• Head: dummy variable that assumes value 1 if the reference individual is the head of the household and 0 otherwise (non-head);
• Minor: dummy variable used to capture whether there are children under 10 years old in the household;
• Inactivity: dummy variable that assumes value 1 if there are unemployed individuals in the household and 0 otherwise.
The database was collected from the Brazilian National Household Sample Survey (PNAD). Table 1 provides some descriptive statistics for earnings and hours worked in level and logarithmic scales, including sample size, average (avg), median, standard deviation (SD), coefficient of variation (CV), coefficient of asymmetry (CA), and coefficient of kurtosis (CK). These statistics indicate that earnings in level have high asymmetry and substantial kurtosis, suggesting that an asymmetric distribution with heavy tails would fit the data better. On the other hand, hours worked in level show low asymmetry and moderate kurtosis. Applying the logarithm tends to produce symmetry, especially in the case of earnings. Figure 1 shows histograms of earnings and hours worked in level and logarithmic scales.
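The descriptive statistics listed above can be sketched as follows. The exact formulas behind Table 1 are not stated in the text, so this example assumes the sample standard deviation and the usual moment-based coefficients of asymmetry and kurtosis (not excess kurtosis); the function name `describe` and the data are hypothetical.

```python
import math


def describe(values):
    """Return n, avg, median, SD, CV, CA (asymmetry) and CK (kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    ordered = sorted(values)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2.0
    # Central moments with divisor n (assumed convention for CA and CK).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2 * n / (n - 1))  # sample standard deviation
    return {
        "n": n,
        "avg": mean,
        "median": median,
        "sd": sd,
        "cv": sd / mean,
        "ca": m3 / m2 ** 1.5,
        "ck": m4 / m2 ** 2,
    }


# Hypothetical weekly hours for five workers.
stats = describe([1.0, 2.0, 3.0, 4.0, 5.0])
print(stats)
```

Applying the same function to the logarithm of a right-skewed variable would typically show CA moving toward zero, which is the pattern the paragraph describes for earnings.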

Investigation of the best fit
Initially, we estimate the Normal, t, and BS univariate regressions for earnings and hours worked, as well as their bivariate counterparts, to investigate the best fit to the data in each case. Table 2 reports the values of the Akaike (AIC) and Bayesian (BIC) information criteria, calculated as:

$$\mathrm{AIC} = -2\ell + 2k, \qquad \mathrm{BIC} = -2\ell + k\log(n),$$

where $\ell$ is the value of the log-likelihood function, $k$ denotes the number of parameters, and $n$ indicates the number of observations. According to Table 2, the univariate and bivariate models based on the t distribution yielded the best fits, as they produced the lowest values of both AIC and BIC. Thus, among the 3 distributions tested, the information criteria favor the univariate and bivariate t models. Notice that the t distribution has heavier tails than the normal distribution, implying robustness against outlying observations [see Lucas (1997)].

Once the best univariate and bivariate models were chosen, we applied the Mahalanobis distance to evaluate the quality of the fit to the data, as proposed by Marchant et al. (2016). In the case of the bivariate t distribution, this distance is given by equation (5), a quadratic form in $\boldsymbol{U} - \boldsymbol{\mu}$ weighted by $\boldsymbol{\psi}^{-1}$, where $\boldsymbol{U} \sim \mathrm{tBiv}(\mu_1, \mu_2, \sigma_1, \sigma_2, \nu, \rho)$ according to equation (15) in the Appendix, $\boldsymbol{\mu} = (\mu_1, \mu_2)^\top$, and $\boldsymbol{\psi}$ is the covariance matrix. According to (5), the Mahalanobis distance for the bivariate t distribution follows an $F_{2,\nu}$ distribution, that is, an F distribution with 2 and $\nu$ degrees of freedom.
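The information criteria defined above can be computed as in the sketch below. The log-likelihood values, parameter count and sample size are hypothetical placeholders and are not the figures reported in Table 2.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: -2 * loglik + 2 * k."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian information criterion: -2 * loglik + k * log(n)."""
    return -2.0 * loglik + k * math.log(n)


# Hypothetical maximized log-likelihoods for three candidate models.
candidates = {"normal": -1250.0, "t": -1190.0, "bs": -1230.0}
k, n = 12, 1000  # illustrative number of parameters and observations

scores = {name: (aic(ll, k), bic(ll, k, n)) for name, ll in candidates.items()}
best = min(scores, key=lambda name: scores[name][0])
print(scores, best)
```

Lower values indicate a better trade-off between fit and complexity, so the model with the smallest AIC (and BIC) is preferred.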
In the univariate case, the corresponding distribution is $F_{1,\nu}$. In order to obtain the estimated values of the Mahalanobis distance, the parameters are replaced by their maximum likelihood estimates, which asymptotically results in the same distribution as in (5). Figure 3 shows the PP plot of the transformed Mahalanobis distance for the bivariate t regression of earnings and hours worked. The results also suggest an excellent fit to the data in the bivariate case. Thus, in both the univariate and bivariate cases, the t regression models provided excellent fits to the data and might therefore be used.
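A minimal sketch of the quadratic form behind a bivariate Mahalanobis distance is given below. It assumes a 2 × 2 matrix built from $(\sigma_1, \sigma_2, \rho)$; whether the paper's $\boldsymbol{\psi}$ carries an additional scaling factor, and how the distance is normalized before comparison with the F distribution, is not settled here. The function name and inputs are hypothetical.

```python
def mahalanobis_sq_2d(u1, u2, mu1, mu2, sigma1, sigma2, rho):
    """Squared Mahalanobis distance of (u1, u2) from (mu1, mu2).

    The weighting matrix is assumed to be
    [[sigma1^2, rho*sigma1*sigma2], [rho*sigma1*sigma2, sigma2^2]].
    """
    z1 = (u1 - mu1) / sigma1
    z2 = (u2 - mu2) / sigma2
    return (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)


# Hypothetical fitted values for one observation (log wage, log hours).
d_uncorrelated = mahalanobis_sq_2d(1.0, 2.0, 0.0, 0.0, 1.0, 2.0, 0.0)
d_correlated = mahalanobis_sq_2d(1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.5)
print(d_uncorrelated, d_correlated)
```

In practice the parameters would be replaced by their maximum likelihood estimates and the location parameters by the fitted regression means for each individual.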

Estimations and analysis
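The Wald statistic, $W = (\hat{\theta} - \theta_0)/\widehat{\mathrm{se}}(\hat{\theta})$, is approximately distributed as a standard Normal under $H_0$, where $\hat{\theta}$ and $\theta_0$ are the estimate and its proposed value under $H_0$, respectively, and $\widehat{\mathrm{se}}(\hat{\theta})$ is the estimated standard error. In this case, our interest lies in testing $\theta_0 = 0$, that is, $H_0: \theta = 0$ versus $H_1: \theta \neq 0$, at a significance level of $\alpha = 0.05$ (or 5%).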
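The test described above can be sketched with the standard library alone. The estimate and standard error below are hypothetical, and the function name `wald_test` is introduced only for illustration.

```python
import math


def wald_test(estimate, std_error, null_value=0.0):
    """Return the Wald z statistic and its two-sided Normal p-value."""
    z = (estimate - null_value) / std_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical coefficient estimate and standard error.
z, p = wald_test(0.10, 0.05)
reject_at_5_percent = p < 0.05
print(z, p, reject_at_5_percent)
```

A p-value below 0.05 leads to rejecting the hypothesis that the coefficient is zero at the 5% level used in the paper.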
Regarding the interpretation of the estimated coefficients, the following cases deserve special attention:
• When the independent variable x is quantitative (for instance, number of years in school) and the estimated coefficient $\hat{\beta}$ is: (i) outside the range $-0.05 \le \hat{\beta} \le 0.05$, there is an increase (or a decrease, if the estimate is negative) of $(\exp(\hat{\beta}) - 1) \times 100\%$ in the expected value (mean) of the dependent variable due to an increase of 1 unit in x; (ii) within the range $-0.05 \le \hat{\beta} \le 0.05$, there is an increase (or a decrease, if the estimate is negative) of approximately $\hat{\beta} \times 100\%$ in the expected value (mean) of the dependent variable when x increases by one unit, since $\exp(\hat{\beta}) - 1 \approx \hat{\beta}$ for small $\hat{\beta}$.
• When the independent variable x is a dummy (for instance, gender) and the estimated coefficient $\hat{\beta}$ is: (i) outside the range $-0.05 \le \hat{\beta} \le 0.05$, there is an increase (or a decrease, if the estimate is negative) of $(\exp(\hat{\beta}) - 1) \times 100\%$ in the expected value (mean) of the dependent variable when x changes from 0 (women) to 1 (men); (ii) within the range $-0.05 \le \hat{\beta} \le 0.05$, there is an increase (or a decrease, if the estimate is negative) of approximately $\hat{\beta} \times 100\%$ in the expected value (mean) of the dependent variable when x changes from 0 (women) to 1 (men).
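The interpretation rule above can be illustrated with a short sketch that compares the exact percentage effect with the small-coefficient approximation. The coefficient values are hypothetical and are not taken from Table 3.

```python
import math


def percent_effect(beta):
    """Exact percentage change in the mean for a one-unit change in x."""
    return (math.exp(beta) - 1.0) * 100.0


def approx_percent_effect(beta):
    """Approximation beta * 100, adequate when |beta| <= 0.05."""
    return beta * 100.0


# Hypothetical coefficients: two small ones and one large one.
for beta in (0.01, 0.05, 0.30):
    exact = percent_effect(beta)
    approx = approx_percent_effect(beta)
    print(f"beta={beta:.2f} exact={exact:.2f}% approx={approx:.2f}%")
```

The gap between the two columns is negligible for the small coefficients and widens quickly for the large one, which is why the exact formula is used outside the stated range.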
Table 3 indicates that the estimated correlation of 0.1877 between earnings and hours worked is statistically significant at the 5% significance level.This means that the bivariate model is more appropriate than the univariate estimation of independent regressions, which might lead to biased results due to the untreated correlation between the two dependent variables.
Considering the estimated coefficients, the variable "Gender" indicates that men have an average income that is 34.58% higher than that of women and that they supply 8.62% more hours worked, on average, than women. On the other hand, the variable "Race" reveals that the wages of Caucasian individuals are, on average, 10.31% higher than the wages of non-Caucasians. However, when it comes to hours worked, Caucasians supply only 0.02% more hours than non-Caucasians. These results confirm that there is discrimination in the Brazilian labor market. Cavalieri and Fernandes (1998), for instance, found wage discrimination using data from the PNAD of 1989. They also found higher wages for men than for women and for Caucasian individuals than for non-Caucasians, even after controlling for age, years in school, and geographical region of residence.
The "Age" variable indicates that one additional year of experience in the labor market increases the wage by 5.27%, while hours worked rise by only 1.28%. Considering the variable "Education", an increase of one year of study raises the average wage by 4.68%.
However, this same increase in schooling leads to an average decrease of 0.09% in hours worked. Thus, the higher the individual's schooling, the higher his average wage and the lower his supply of hours in the labor market. This finding illustrates a fundamental advantage of the bivariate regression, since the effects of "Education" go in opposite directions in the bivariate regressions, and this cannot be captured by the traditional univariate estimation that considers wage per hour as the sole dependent variable. Lau et al. (1993) also found a positive effect of "Education" on earnings (per capita) due to the higher level of schooling. Gonzaga et al. (2002) argued that, in Brazil, the level of schooling is inversely related to hours worked. Taking into account the metropolitan regions, Brasília-DF presents an average wage 8.95% higher, and its workers supply 2.70% fewer hours of work, than São Paulo-SP. Again, it is possible to identify distinct effects of an explanatory variable in the bivariate regression that cannot be captured by the traditional univariate model. In order to explain this finding, the unobservable characteristics of the workers, such as skill and motivation, as well as specific differences among the sectors of activity and the geographical regions of the country, should be taken into account. In the specific case of Brasília, the differential is due to the location of the federal public administration, which pays higher average wages than the private sector.
Regarding the types of labor contracts, the estimates indicate that those "without employment record card" have an average wage that is 17.73% lower than the wages of individuals "with employment record card". In addition, they supply about 15.08% fewer hours worked than their peers "with employment record card". The "civil servant" category is associated, on average, with an increase of 26.90% in wage and an average of 4.10% fewer hours worked relative to the workers "with employment record card". Meanwhile, those who are "autonomous" have an average wage 12.82% lower and supply 16.04% fewer hours to the labor market than the workers "with employment record card". It is worth mentioning that the "civil servant" category also presents antagonistic effects on wages and hours worked that can be captured only by bivariate estimation.
For variables that affect only wages or only hours worked, Table 3 shows that individuals who contribute to social security have an average wage 42.70% higher than those who do not contribute. The "head" variable, which affects only hours worked, confirms that the head of the household supplies 1.80% more hours to the labor market on average than those who are not in this condition.
For comparison purposes, Table 4 presents the results of the traditional univariate regression in which the dependent variable is the wage rate (or wage per hour). In principle, some results are similar in magnitude to the estimates of the bivariate regression model. However, the univariate model cannot disentangle the effects of a given explanatory variable on wages and hours worked, as in the cases of "Education", "Brasília-DF", and "Civil Servant" discussed above. These variables displayed different signs in the estimated coefficients for the wage and hours-worked regressions. For "Education", for instance, the higher the individual's level of schooling, the higher his average wage and the lower his supply of hours of work. However, in the univariate regression reported in Table 4, only the positive effect of an additional year of study on the average wage rate can be estimated. In addition, the bivariate model captured a positive and statistically significant correlation between wages and hours worked, allowing for a more robust estimation than the simple fitting of two independent regressions. Therefore, there are important advantages coming from the bivariate model, including the evidence that the determinants of wages and hours worked might not be the same in both quantitative and qualitative terms. In this environment, the bivariate estimation emerges as an important alternative for the estimation of the Mincerian earnings equation.

Conclusion
This paper proposed an alternative approach to estimate the Mincerian earnings equation based on bivariate regression modeling. The combination of wages and hours worked in a single dependent variable, as is traditionally done in the empirical literature, prevents capturing distinct effects of common covariates on those dependent variables separately. On the other hand, the univariate estimation of independent regressions for earnings and hours worked is not advisable due to the correlation between these variables, which might bias the estimates. We proposed the estimation of a regression for wages and hours worked as a bivariate vector of dependent variables, including common and specific covariates among the explanatory variables and using the Normal, Student t and BS bivariate distributions. The estimates used data at the individual level extracted from the Brazilian National Household Sample Survey (PNAD) for the years from 2013 to 2015.
In the bivariate case, the Normal, t and BS distributions were used to jointly model wages and hours worked. The AIC and BIC information criteria and the Mahalanobis distance indicated that the Student t distribution yielded the best fit to the data. In addition, a positive and statistically significant correlation between wages and hours worked justified the use of the bivariate regression instead of two separate regressions for those variables, which would yield biased estimates.
The bivariate estimation indicated that a given common covariate might have distinct effects on wages and hours worked. The results for "Education", for instance, indicated that an additional year of study leads to an average increase of 4.68% in wages and an average decrease of 0.9% in hours worked. This suggests that individuals with more years of schooling, on average, have higher wages and work fewer hours than those with fewer years of schooling. Other covariates common to the bivariate vector, such as type of employment contract and geographic region of residence, also had antagonistic effects on earnings and hours worked. This evidence illustrates a fundamental advantage of the bivariate regression, which makes it possible to disentangle the distinct effects of a given common covariate on wages and hours worked. This cannot be done by the traditional estimation of the univariate regression that considers the wage per hour as the dependent variable.
Thus, the bivariate regression might be considered as an alternative approach for the estimation of the Mincerian earnings equation. As further work, one might implement the Heckman two-step correction for selection bias (Heckman, 1979), since the PNAD survey refers to individuals who were actually working in the sample period. However, the individual's earnings are associated with the decision to supply work, which ultimately depends on their opportunity cost. It is advantageous to work if the wage (or potential earnings) is greater than the opportunity cost (reservation wage). In addition, other bivariate probability distributions might be fitted to model wages and hours worked, such as the Pareto and its extensions, which are commonly used in income modeling. Finally, a bivariate logistic regression model might be used to estimate the influence of individual characteristics on the probability that a given worker belongs to a particular income group and type of work. Some of these extensions are the subject of our ongoing research.
It is also worth mentioning that, in further research, the study might benefit from moving towards a structural approach, with a careful modeling strategy for the labor market and the resulting wage equation. Here, our focus was just on the application of an alternative bivariate approach to estimate the traditional Mincerian wage equation using Brazilian microdata. In addition, due to well-known distortions in the Brazilian labor market, further extensions should consider an empirical analysis by sector of activity and type of occupation. We leave these issues for future research.

When ρ = 0, i.e., when the Normal variables are uncorrelated, (6) can be expressed as the product of 2 Normal CDFs.

Normal bivariate regression model
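Consider that there are r and s covariates, say $\boldsymbol{x}_i^{(1)} = (x_{i1}^{(1)}, \ldots, x_{ir}^{(1)})^\top$ and $\boldsymbol{x}_i^{(2)} = (x_{i1}^{(2)}, \ldots, x_{is}^{(2)})^\top$, associated with the first and second responses through $\mu_{ki} = \boldsymbol{x}_i^{(k)\top}\boldsymbol{\beta}_k$, where $\boldsymbol{\beta}_k = (\beta_{k1}, \beta_{k2}, \ldots, \beta_{kl})^\top$ is a vector of l unknown parameters, and $\boldsymbol{x}_i^{(k)}$ is the i-th row of the matrix $\boldsymbol{X}^{(k)}$, whose dimension is n × l, for k = 1, 2 and l = r, s. Thus, we have the following Normal bivariate model:

$$y_{1i} = \boldsymbol{x}_i^{(1)\top}\boldsymbol{\beta}_1 + \varepsilon_{1i}, \qquad y_{2i} = \boldsymbol{x}_i^{(2)\top}\boldsymbol{\beta}_2 + \varepsilon_{2i}, \qquad i = 1, \ldots, n,$$

where $y_{ki} = \log(Y_{ki})$ denotes the log response, $(\varepsilon_{1i}, \varepsilon_{2i}) \sim \mathrm{NBiv}(0, 0, \sigma_1, \sigma_2, \rho)$, and the error pairs are independently distributed across individuals.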
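The likelihood and log-likelihood functions of the observed sample can be written respectively as

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} f(y_{1i}, y_{2i}; \mu_{1i}, \mu_{2i}, \sigma_1, \sigma_2, \rho), \qquad \ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log f(y_{1i}, y_{2i}; \mu_{1i}, \mu_{2i}, \sigma_1, \sigma_2, \rho), \tag{14}$$

where $f$ is the joint PDF of the bivariate normal distribution, $y_{1i}$ and $y_{2i}$ denote the observed (log) responses, and $\boldsymbol{\theta}$ collects the model parameters. The parameter estimates are obtained by maximizing the log-likelihood function (14). This requires an iterative nonlinear optimization procedure; in particular, the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) method can be used. The BFGS method is implemented in the R software (http://cran.r-project.org) through the optim and optimx functions.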
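The objective that the optimizer maximizes can be sketched as follows. This is a minimal illustration in pure Python that only evaluates the bivariate Normal log-likelihood; it does not perform the BFGS search, which the paper carries out in R. The function names, argument layout and data are hypothetical.

```python
import math


def bivariate_normal_logpdf(y1, y2, mu1, mu2, sigma1, sigma2, rho):
    """Log density of a bivariate Normal at (y1, y2)."""
    z1 = (y1 - mu1) / sigma1
    z2 = (y2 - mu2) / sigma2
    q = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    return (-math.log(2.0 * math.pi * sigma1 * sigma2)
            - 0.5 * math.log(1.0 - rho * rho) - 0.5 * q)


def log_likelihood(beta1, beta2, sigma1, sigma2, rho, x1_rows, x2_rows, y1, y2):
    """Sum of log densities with means mu_ki = x_i^(k)' beta_k."""
    total = 0.0
    for row1, row2, a, b in zip(x1_rows, x2_rows, y1, y2):
        mu1 = sum(c * v for c, v in zip(beta1, row1))
        mu2 = sum(c * v for c, v in zip(beta2, row2))
        total += bivariate_normal_logpdf(a, b, mu1, mu2, sigma1, sigma2, rho)
    return total


# Hypothetical design: intercept plus schooling for log earnings,
# intercept plus a household-head dummy for log hours.
x1 = [[1.0, 8.0], [1.0, 12.0], [1.0, 16.0]]
x2 = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
y1 = [6.4, 6.6, 6.8]
y2 = [3.70, 3.72, 3.70]

ll_good = log_likelihood([6.0, 0.05], [3.7, 0.02], 0.5, 0.3, 0.2, x1, x2, y1, y2)
ll_bad = log_likelihood([0.0, 0.0], [3.7, 0.02], 0.5, 0.3, 0.2, x1, x2, y1, y2)
print(ll_good, ll_bad)
```

A numerical optimizer would search over the coefficients, the scale parameters and the correlation to maximize this function; parameter values closer to the data-generating ones yield a higher log-likelihood, as the two evaluations show.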
Here, differently from the BS regression model based on the classical parameterization [Rieck and Nedelman (1991)], there is no need for a logarithmic transformation; that is, the response data are modeled on their original scale.
In order to estimate the parameters, as in the Normal bivariate case, the maximum likelihood method is used. Consider a random sample of size n, say $\{(t_{1i}, t_{2i}, \boldsymbol{x}_i^{(1)}, \boldsymbol{x}_i^{(2)});\ i = 1, \ldots, n\}$. The likelihood and log-likelihood functions of the observed sample are given respectively by

$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} f(t_{1i}, t_{2i}; \mu_{1i}, \mu_{2i}, \delta_1, \delta_2, \rho), \qquad \ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log f(t_{1i}, t_{2i}; \mu_{1i}, \mu_{2i}, \delta_1, \delta_2, \rho), \tag{28}$$

where $f$ is the joint PDF of the bivariate BS distribution. The estimates of $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$, $\delta_1$, $\delta_2$ and $\rho$ are obtained by maximizing the log-likelihood function (28) using an iterative nonlinear optimization procedure, in this case the BFGS quasi-Newton method.
The common covariates also include:
• Metropolitan region (Belém-PA, Fortaleza-CE, Recife-PE, Salvador-BA, Belo Horizonte-MG, Rio de Janeiro-RJ, Curitiba-PR, Porto Alegre-RS, Brasília-DF and São Paulo-SP): dummy variables that designate the metropolitan regions of residence of the individuals, taking São Paulo as the reference category;
• Year (2013, 2014, and 2015): time dummies for the years of the sample, having 2013 as the reference year;
• Sector of activity (agriculture, industry, construction, commerce, food and others, education, health, and social services): dummy variables used to capture cluster effects by sector of activity of the individuals, having individuals working in the public sector as the reference.
The PNAD data cover the period from 2013 to 2015. This survey is annual and is produced and published by the Brazilian Institute of Geography and Statistics (IBGE). It provides a wide set of demographic and socioeconomic information about the Brazilian population at the individual and household levels. We considered a sample of individuals aged between 18 and 65 years with complete information on earnings and hours worked, totaling 167,271 observations. The sample refers to the 10 major metropolitan regions of the country, namely Belém-PA, Fortaleza-CE, Recife-PE, Salvador-BA, Belo Horizonte-MG, Rio de Janeiro-RJ, Curitiba-PR, Porto Alegre-RS, Brasília-DF, and São Paulo-SP. The nominal values of earnings were deflated by the National Consumer Price Index (INPC). Individuals are not tracked across years, so the data set is a pooled cross-section. All results were obtained in the R statistical software [https://www.r-project.org/].

Figure 1. Histogram for earnings and hours worked (level and logarithmic scales).
[See Vilca et al. (2014) for the asymptotic distribution in (5).] The Wilson-Hilferty approximation might then be applied to the Mahalanobis distance to obtain a standard Normal approximation for the distances in (5). Thus, the quality of the fit of the univariate and bivariate t regression models might be evaluated by the distances transformed with the Wilson-Hilferty approximation [Ibacache-Pulgar et al. (2014)]. In this case, the distances in (5) are adapted to accommodate the regression structure and the univariate or bivariate setting. Figure 2 displays the probability-probability (PP) plots of the transformed Mahalanobis distance for the univariate t regressions of earnings and hours worked. The PP plot is commonly used to assess how close 2 sets of data are by plotting the 2 corresponding cumulative distribution functions against each other. The closer the points are to the 45° line running from (0, 0) to (1, 1), the better the fit. Figure 2 shows the cumulative distribution function of the standard Normal versus the empirical cumulative distribution function of the transformed Mahalanobis distance. The results reveal an excellent fit of the univariate models.
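The transformation and the PP-plot coordinates can be sketched as below. The code uses the Wilson-Hilferty-type Normal approximation for an F variate (the form usually attributed to Paulson); whether the paper applies exactly this version is an assumption, and the function names and the F values are hypothetical.

```python
import math


def wilson_hilferty_f(f_value, d1, d2):
    """Approximate standard Normal score of an F(d1, d2) variate."""
    a = 1.0 - 2.0 / (9.0 * d2)
    b = 1.0 - 2.0 / (9.0 * d1)
    numerator = a * f_value ** (1.0 / 3.0) - b
    denominator = math.sqrt(
        2.0 / (9.0 * d1) + (2.0 / (9.0 * d2)) * f_value ** (2.0 / 3.0)
    )
    return numerator / denominator


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def pp_points(z_values):
    """Pairs (empirical probability, theoretical probability) for a PP plot."""
    ordered = sorted(z_values)
    n = len(ordered)
    return [((i + 0.5) / n, normal_cdf(z)) for i, z in enumerate(ordered)]


# Hypothetical distances already scaled to the F(2, 4) reference distribution.
f_draws = [0.2, 0.7, 1.0, 1.9, 4.5]
z_scores = [wilson_hilferty_f(f, 2, 4) for f in f_draws]
points = pp_points(z_scores)
print(points)
```

Plotting the first coordinate against the second and comparing the points with the 45° line reproduces the kind of diagnostic shown in Figures 2 and 3.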

Figure 2. PP plots of the transformed distances for the univariate t regression models of earnings (left) and hours worked (right). Legend: EC = empirical probability, TC = theoretical probability.

Figure 3. PP plots of the transformed distance for the bivariate t regression model of earnings and hours worked. Legend: EC = empirical probability, TC = theoretical probability.

Table 1. Descriptive statistics for wage and hours worked in level and logarithmic scales

Table 2. Information criteria for the univariate and bivariate models

Table 3 reports the results of the maximum likelihood estimation for the bivariate t distribution regression model of earnings and hours worked, with the respective standard errors, Wald statistics, and p-values. The model based on the t distribution presented the best fit to the data according to the AIC and BIC information criteria and the PP plot of the Mahalanobis distance reported in the previous section. The Wald statistic is used to test the following hypotheses: $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.

Table 3. Bivariate t distribution regression models for wages and hours worked (ν = 4)

Table 4. Univariate regression for wage per hour