Estimating Total Claim Size in the Auto Insurance Industry: a Comparison between Tweedie and Zero-Adjusted Inverse Gaussian Distribution

The objective of this article is to estimate insurance claims from an auto dataset using the Tweedie and zero-adjusted inverse Gaussian (ZAIG) methods. We identify factors that influence claim size and probability, and compare the results of the two methods, both of which forecast outcomes accurately. Vehicle characteristics like territory, age, origin and type distinctly influence claim size and probability. This distinct impact is not always present in the Tweedie estimated model. Auto insurers should consider estimating total claim size using both the Tweedie and ZAIG methods. This allows a confidence interval to be estimated from empirical quantiles using bootstrap simulation. Furthermore, the fitted models may be useful in developing a strategy to obtain premium pricing.


Introduction
There is a well known problem in the insurance industry that refers to the proper pricing of an insurance policy. An insurance company's pure premium for an insured individual is made up of two components: claim probability and expected claim size. The claim probability for any individual is related to the number of claims expected to occur in a given year. The claim size is simply the dollar cost associated with each claim. The difficulty of estimating the size and probability of claims in the insurance industry has been extensively reported in the literature (e.g. Jong & Heller, 2008). In the past, the main difficulty was related to the credibility of the insurance company datasets (Weisberg & Tomberlin, 1982). Insurance datasets were typically very large, containing from tens of thousands to millions of cases. Problems arose such as missing values and inconsistent or invalid records. As current information technology systems have become more sophisticated over the years, the processing of information has become more credible than ever before.
The challenge then is to employ a proper statistical technique to analyze insurance data. Claims and risks have long been estimated using a pure algorithmic technique or a simple stochastic technique (Wüthrich & Merz, 2008). These methods result in poor estimations. Huang, Zhao and Tang (2009) consider a risk model in which the claim number process is treated as a Poisson model and the individual claim size is assumed to be a fuzzy random variable. Jørgensen and De Souza (1994) suggested a Poisson sum of Gamma random variables called Tweedie to estimate insurance risk. According to Smyth and Jørgensen (2002), there is also another problem in that the proposed Tweedie model does not permit the separate estimation of probability and claim size.
Recent studies suggest that a zero-adjusted Inverse Gaussian (ZAIG) distribution may be appropriate to estimate claim and risk in insurance data (Heller, Stasinopoulos & Rigby, 2006). A mixed discrete-continuous model, with a probability mass at zero and an Inverse Gaussian continuous component, appears to estimate accurately when the distribution is extremely right-skewed. This suggests that probabilities can be calculated from datasets with a large number of zero claims. The ZAIG model explicitly specifies a logit-linear model for the occurrence of a claim (i.e. claim probability). When a claim has occurred, the ZAIG model also specifies log-linear models for the mean claim size and the dispersion of claim sizes. It is important to measure the probability and size of claims separately because the probability may depend on a set of independent variables that differs from the set influencing claim size. Therefore, ZAIG estimation appears to be more appropriate to estimate the price of insurance policies.
Once an estimation method has been defined, the challenge is to identify potential explanatory variables. Typically, policy holders are divided into discrete classes on the basis of certain measurable characteristics predictive of their propensity to generate losses. We evaluate claims by considering vehicle variables that are frequently used in the literature. In addition to territory, claims have also been studied in relation to the car manufacturer and vehicle's characteristics: age, type and origin. Based on previous research all of these variables must be used in the estimation.
Our objective is to present the ZAIG method of estimation to determine claim probability and expected claim size in the insurance industry, and to formally compare its results with an estimation based on a Tweedie regression model using an insurance dataset. Insurance data were collected to analyze the impact of factors estimated by the Tweedie and ZAIG methods.
This work is divided into five sections. In the next section, we will discuss the theoretical background based on previous research in insurance claim estimates. We also present the Tweedie and ZAIG methods in the next section. The third section discusses the methodology and the dataset. The fourth section presents the analysis of the results and a comparison of the findings from the two methods. Finally, we present our concluding remarks and highlight the major contributions of our study.

Insurance: Importance of Predictions and Predictors
Forecasting claim probability and claim size is very important, since an insurance company can use these estimates to decide whether to offer premium discounts depending on an individual client's characteristics, or to create strategies for detecting fraudulent claims (Viaene, Ayuso, Guillen, Van Gheel & Dedene, 2007). An insurance company can also estimate total claim size using vehicle characteristics to get an idea of how much will be spent on claims over a certain period and for a specific client portfolio. Insurance companies are constantly looking for ways to better predict claims. Overall, insurance involves the sum of a large number of individual risks of which very few will result in insurance claims being made. Meulbroek (2001) argues that insurance companies need to treat risk management as a series of related factors and events. Boland (2007) suggests that, in order to handle claims arising from incidents that have already occurred, insurers must employ predictive methods to deal with the extent of this liability. Therefore, an insurance company has to find ways to predict claims and appropriately charge a premium to cover this risk.
The prediction problem has to be considered in the light of a competitive insurance market (Weisberg, Tomberlin & Chatterjee, 1984). It is possible for an insurer to benefit at least temporarily by identifying segments of the market that are currently being overcharged and offering coverage at lower rates, or by avoiding segments that are being undercharged (Doherty, 1981). Regulators are usually concerned about the possibility of rate structures that severely penalize individuals with some characteristics (e.g. where they live, model of vehicle). Therefore, insurers look for better ways to capture the characteristics of individuals that affect claim size and probability, and consequently identify insured drivers that have a higher propensity for generating losses.
Insurance companies attempt to estimate reasonable prices for insurance policies based on the losses reported for certain kinds of policy holders. This estimate has to consider past data in order to grasp the trends that have occurred (Weisberg & Tomberlin, 1982). Information available to predict the price for a period in the future usually consists of the claim experience for a population or a large sample from the population over a period in the past. Accurate estimation may consider a large number of exposures in a dataset and a stable claim generation process over time.
The predictors for estimating the appropriate price for insurance policies were selected from the automobile industry. In our study, we consider the issue of price prediction in the context of the automobile industry because the most sophisticated proposals have been developed in this industry (Jong & Heller, 2008). Previous research in the automobile setting has used predictors such as territory (e.g. Chang & Fairley, 1979) and car manufacturer (e.g. Heller et al., 2006). Weisberg et al. (1984) suggest including variables associated with the status of the vehicle such as age, type and origin.
Previous studies have recognized the utility of the Tweedie method in estimating auto insurance claims (Smyth & Jørgensen, 2002) and recent studies have shown that the ZAIG method may produce accurate models of estimation (Heller et al., 2006). To set up the estimation, let $y_i$ be the amount spent on claims for client $i$ and let $x_i$ be a vector of independent variables related to this client. One may represent the variable $y_i$ as

$$y_i = \begin{cases} 0, & \text{with probability } 1 - \pi_i, \\ W_i, & \text{with probability } \pi_i, \end{cases}$$

where $W_i$ is a positive, right-skewed random variable. This type of variable belongs to the class of zero-inflated probability distributions (e.g. Gan, 2000). The parameter $\pi_i$ is the claim probability and $W_i$ represents the claim size related to client $i$.
It is important to note that a claim is, in general, a rare event. A small proportion of claims in a sample may lead to problems in predicting claim occurrence by a logistic model because, in this case, the predicted probabilities tend to be small. King and Zeng (2001) proposed a correction to be used in these situations. They used the fact that, in the presence of rare events, the independent variable coefficients are consistent, but the intercept may not be.
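The intercept adjustment can be illustrated with a minimal Python sketch of the prior correction discussed by King and Zeng (2001). It assumes the model was fitted on a sample whose claim rate differs from a known population claim rate; the function names and the 10% sample rate are hypothetical, while 1.17% is the annual claim probability reported for this dataset.

```python
import math


def prior_corrected_intercept(b0_hat, sample_rate, population_rate):
    """Shift a fitted logit intercept so that it reflects the population event rate.

    Slope coefficients are left untouched; only the intercept is adjusted.
    """
    odds_factor = ((1.0 - population_rate) / population_rate) * (
        sample_rate / (1.0 - sample_rate)
    )
    return b0_hat - math.log(odds_factor)


def logistic(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical intercept-only fit on a sample in which 10% of records are claims.
sample_rate = 0.10
population_rate = 0.0117  # annual claim probability reported in the text
b0_hat = math.log(sample_rate / (1.0 - sample_rate))

b0_corrected = prior_corrected_intercept(b0_hat, sample_rate, population_rate)
baseline_probability = logistic(b0_corrected)
```

In this intercept-only case the corrected baseline probability equals the population rate, which is a quick sanity check on the formula.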

Tweedie Regression Models
A Tweedie distribution (Jørgensen, 1987, 1997) is a member of the class of exponential dispersion models. It is defined as a distribution of the exponential family (e.g. Jong & Heller, 2008) with mean $\mu$ and variance $\phi \mu^{p}$; in this work, as in Jørgensen and De Souza (1994) and Smyth and Jørgensen (2002), we consider the case $1 < p < 2$. It is possible to write

$$y_i = \sum_{j=1}^{N_i} X_{ij},$$

with $y_i = 0$ when $N_i = 0$, where $N_i$ is a Poisson random variable that represents the number of claims that have occurred for client $i$ and $X_{i1}, \dots, X_{iN_i}$ are independent, identically distributed Gamma random variables (continuous variables). As a consequence, given the number of claims, $W_i$ also follows a Gamma distribution, which has a positive and right-skewed probability density function. In this work, we use a log-linear Tweedie regression model, given by $\log(\mu_i) = x_i^{\top} \beta$, where $x_i$ is a vector of independent variables and $\beta$ is the parameter vector.
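The Poisson sum of Gamma variables described above can be simulated directly. The sketch below is only an illustration: the Poisson rate and the Gamma shape and scale are hypothetical values, and the helper names are not taken from the paper. It checks two properties of the construction: the probability of a zero total is $e^{-\lambda}$ and the mean is the Poisson rate times the Gamma mean.

```python
import math
import random


def sample_poisson(rng, rate):
    """Knuth's multiplication method; adequate for small Poisson rates."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_compound_poisson_gamma(rng, rate, shape, scale):
    """Total claim amount: a Poisson number of i.i.d. Gamma claim sizes."""
    n_claims = sample_poisson(rng, rate)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_claims))


rng = random.Random(2024)
rate, shape, scale = 0.05, 2.0, 500.0  # hypothetical parameter values

draws = [sample_compound_poisson_gamma(rng, rate, shape, scale) for _ in range(100_000)]

zero_share = sum(1 for d in draws if d == 0) / len(draws)
sample_mean = sum(draws) / len(draws)

theoretical_zero_share = math.exp(-rate)   # P(N = 0)
theoretical_mean = rate * shape * scale    # E(N) * E(X)
```

Under this construction the variance power is $p = (\alpha + 2)/(\alpha + 1)$ for Gamma shape $\alpha$, which always lies strictly between 1 and 2; with the shape of 2 assumed here, $p = 4/3$.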

ZAIG Regression Model
The variable $y_i$ follows a ZAIG distribution (Heller et al., 2006) if $W_i$ is an inverse Gaussian random variable. The inverse Gaussian is a positive and highly skewed distribution with two parameters: the mean ($\mu_i$) and a dispersion parameter ($\lambda_i$). It may be proved that $E(y_i) = \pi_i \mu_i$ and $\mathrm{Var}(y_i) = \pi_i \mathrm{Var}(W_i) + \pi_i (1 - \pi_i) \mu_i^{2}$. In the context of this work, $\mu_i$ is the expected claim size and $\lambda_i$ is a parameter related to claim size dispersion. It is possible to propose regression models for $\pi_i$, $\mu_i$ and $\lambda_i$ as

$$h_1(\pi_i) = x_i^{\top} \beta, \qquad h_2(\mu_i) = z_i^{\top} \gamma, \qquad h_3(\lambda_i) = w_i^{\top} \delta,$$

where $h_1$, $h_2$ and $h_3$ are continuous, twice-differentiable, invertible functions, $\beta$, $\gamma$ and $\delta$ are parameter vectors, and $x_i$, $z_i$ and $w_i$ are vectors of independent variables for client $i$.
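A small simulation can make the zero-adjusted structure concrete. The sketch below draws from a zero-adjusted inverse Gaussian variable using the transformation method of Michael, Schucany and Haas for the continuous part. It is parameterised by a mean and a shape (so that the variance of the continuous part is mean cubed divided by shape), which is an assumption of this sketch; how the dispersion parameter in the text maps to this shape depends on the parameterisation chosen. All numeric values are hypothetical.

```python
import math
import random


def sample_inverse_gaussian(rng, mean, shape):
    """One inverse Gaussian draw with the given mean and shape."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (
        mean
        + (mean * mean * y) / (2.0 * shape)
        - (mean / (2.0 * shape)) * math.sqrt(4.0 * mean * shape * y + (mean * y) ** 2)
    )
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x


def sample_zaig(rng, claim_prob, mean, shape):
    """Zero with probability 1 - claim_prob, otherwise an inverse Gaussian claim size."""
    if rng.random() >= claim_prob:
        return 0.0
    return sample_inverse_gaussian(rng, mean, shape)


rng = random.Random(11)
claim_prob, mean, shape = 0.2, 1000.0, 2000.0  # hypothetical values

draws = [sample_zaig(rng, claim_prob, mean, shape) for _ in range(100_000)]

zero_share = sum(1 for d in draws if d == 0.0) / len(draws)
sample_mean = sum(draws) / len(draws)
expected_mean = claim_prob * mean  # claim probability times expected claim size
```

The sample mean should be close to the claim probability multiplied by the mean claim size, and the share of zeros close to one minus the claim probability.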
In an insurance context, it is highly convenient to use different sets of independent variables to model these three parameters. Consider, for instance, a variable that indicates the location of a car owner's residence. It is well known that robbery rates vary within a city, but the price of a vehicle does not. Since the location is expected to be important in explaining $\pi_i$ but not $\mu_i$, one may include the variable in the probability model but not in the expected claim size model. This example illustrates the statement by Heller et al. (2006): "…A problem with the Tweedie distribution model is that probabilities at zero cannot be modeled explicitly as a function of explanatory variables…".
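The following models are fitted:

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i^{\top} \beta, \qquad \log(\mu_i) = z_i^{\top} \gamma, \qquad \log(\lambda_i) = w_i^{\top} \delta.$$

In short, this modeling option ensures that $\mu_i$ and $\lambda_i$ are, as expected, always positive and that $\pi_i$ is modeled by a logistic regression. It is important to remember the poor performance of logistic models in predicting claims when the frequency of claims in the sample is small.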
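To show how a logit link for the claim probability and log links for the mean and dispersion turn linear predictors into valid parameters, here is a minimal Python sketch. The covariate vectors, the coefficient values and the names `beta`, `gamma` and `delta` are hypothetical and chosen only for illustration; they are not estimates from the paper.

```python
import math


def dot(values, coefficients):
    """Linear predictor: inner product of covariates and coefficients."""
    return sum(v * c for v, c in zip(values, coefficients))


def zaig_parameters(x, z, w, beta, gamma, delta):
    """Map three linear predictors to (claim probability, mean size, dispersion)."""
    claim_prob = 1.0 / (1.0 + math.exp(-dot(x, beta)))  # inverse logit, in (0, 1)
    mean_size = math.exp(dot(z, gamma))                 # inverse log, positive
    dispersion = math.exp(dot(w, delta))                # inverse log, positive
    return claim_prob, mean_size, dispersion


# Hypothetical client: intercept plus one binary indicator in each model.
x, beta = [1.0, 1.0], [-4.5, 0.3]    # probability model
z, gamma = [1.0, 0.0], [9.9, -0.4]   # mean claim size model
w, delta = [1.0], [-5.0]             # dispersion model

claim_prob, mean_size, dispersion = zaig_parameters(x, z, w, beta, gamma, delta)
expected_cost = claim_prob * mean_size
```

Because each parameter has its own covariate vector, an indicator can enter the probability model without entering the claim size model, as in the residence-location example.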

Dataset and sample summary statistics
A sample was collected from a major automobile insurance company, resulting in a dataset of 32,783 passenger vehicle records belonging to a corporate fleet. Because any employee of the corporation could drive a given vehicle, individual driver characteristics are not meaningful explanatory variables for claim probability and claim size.
The dataset was processed to remove missing values and generate a selection of relevant variables. We have focused the analysis on yearly claims that refer to robberies or accidents with claim sizes which were greater than the vehicle's value. Claim size refers to the dollar cost paid as a liability of a claim. Claim probability refers to the percentage of claims over the period of a year. The average annual claim probability is 1.17%, and the average claim size is $243.98. When the event occurs, the average claim size goes up to $21,048.03. For every insurance policy holder, twenty explanatory variables were employed. The variables correspond to vehicle characteristics and are coded by means of binary variables, described in Table 1. Table 2 shows the variables' descriptive statistics. Table 2 shows that claims occur more often with older cars, but the cost of the claim diminishes as the car's age increases. Domestic and imported vehicles have approximately the same percentage of claims and claim sizes. There are differences in the frequency and cost of claims depending on the model and the manufacturer (Model/Manuf). Region I has the largest percentage of claims as well as the highest cost for these claims. In terms of vehicle size, most of the claims are for small and midsize cars, while costs increase with the size of the vehicle.
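The three summary figures above are linked through the relation that the unconditional average claim size is the claim probability multiplied by the average size given a claim. The short Python check below uses only the numbers reported in the text; the variable names are illustrative.

```python
claim_probability = 0.0117          # average annual claim probability (1.17%)
mean_size_given_claim = 21048.03    # average claim size when a claim occurs
reported_mean_size = 243.98         # average claim size over all policies

# Unconditional mean implied by the two conditional figures.
implied_mean_size = claim_probability * mean_size_given_claim

# Claim probability implied by the two reported averages.
implied_probability = reported_mean_size / mean_size_given_claim

relative_gap = abs(implied_mean_size - reported_mean_size) / reported_mean_size
```

The implied and reported averages agree to within about one percent, and the small gap is presumably due to rounding in the reported figures.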

Inferential analysis
The ZAIG model was estimated with the GAMLSS package (Stasinopoulos & Rigby, 2007; Stasinopoulos, Rigby & Akantziliotou, 2008) for the R system (R Development Core Team, 2007). The Tweedie model was estimated using the SPSS package (version 16.0). In this section we divide the sample into two parts, a subsample of 22,783 to fit the models and a subsample of 10,000 to forecast the total claim size. Table 3 presents the results of the estimates. The dependent variable is the claim size and refers to robberies or accidents with repair costs greater than the vehicle's value. Several explanatory variables were significantly related to the dependent variables. Considering all vehicle age variables, we can say that there is a significant increase in the expected claim probability as the vehicle becomes older. On the other hand, the expected claim size decreases for older vehicles. This is in line with intuition and the descriptive analysis: old vehicles are less expensive to replace, and they are also more attractive targets. One might suggest that old vehicles are more attractive targets because there is a large market for replacement auto parts that is flooded with stolen parts for these old cars; in addition, old cars tend to be poorly maintained, which increases the probability of accidents.
The variable model/manufacturer is related to claim probability and size. In general, the model/manufacturer is more related to claim probability than claim size. It is noteworthy that there is no way to clearly identify whether claim size or probability is causing the significance of the Tweedie coefficients.
The variable for vehicle origin does not influence the claim probability or size. This suggests, ceteris paribus, that domestic and imported vehicles tend to have the same claim size. Territory is generally related to claim probability and size; in this case there are some regions that have more carjackings than others.
Vehicle type is related to claim size and probability. The claim probability decreases for luxury vehicles. However luxury cars lead to higher claim sizes compared to small and midsize cars. Looking at the Tweedie results, the difficulty in accurately predicting claims becomes obvious given the non-significance of the Tweedie coefficient for vehicle type. One might suggest that the non-significant coefficient is due to a negative claim probability effect and a positive claim size effect, as found in the ZAIG coefficients.
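The offsetting-effects argument can be made concrete with a toy calculation. Because the expected claim cost is the claim probability multiplied by the expected claim size, a lower probability and a higher size can leave the product nearly unchanged, which a single mean model cannot separate. All numbers below are hypothetical and chosen so that the two effects cancel exactly.

```python
# Hypothetical baseline vehicle (not estimates from the paper).
base_probability = 0.012
base_claim_size = 20000.0

# Hypothetical luxury effects: lower claim probability, higher claim size.
luxury_probability = base_probability * 0.8
luxury_claim_size = base_claim_size * 1.25

base_expected_cost = base_probability * base_claim_size
luxury_expected_cost = luxury_probability * luxury_claim_size

# Ratio of expected costs: what a model of the overall mean would pick up.
expected_cost_ratio = luxury_expected_cost / base_expected_cost
```

Here the ratio of expected costs is one, so a model for the overall mean would show no vehicle-type effect even though both components change.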
The total claim size forecast was obtained by adding together the individual forecast claim sizes based on the Tweedie and ZAIG models. Using parametric bootstrap simulation, we obtained a 95% confidence interval, based on empirical quantiles of 5,000 bootstrap estimates. For further details, see Efron and Tibshirani (1986). Table 4 shows the estimated and true total claim size and the 95% confidence interval. The ZAIG model was better than the Tweedie model in forecasting the total claim size and both models presented negative bias. We notice that the forecasts are inside the confidence bands for both models, indicating good estimation results. Using the lower and upper limits, the insurance company can get an idea of the dispersion of the total claim size.
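The percentile interval for the portfolio total can be sketched as follows. This is a simplified illustration, not the authors' procedure: it holds hypothetical fitted parameters fixed and simulates only the claim outcomes, whereas the text does not state whether the model was re-estimated in each replicate. The portfolio size, parameter values and number of replicates are assumptions chosen to keep the run short.

```python
import math
import random


def sample_inverse_gaussian(rng, mean, shape):
    """Inverse Gaussian draw (transformation method), variance = mean**3 / shape."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (
        mean
        + (mean * mean * y) / (2.0 * shape)
        - (mean / (2.0 * shape)) * math.sqrt(4.0 * mean * shape * y + (mean * y) ** 2)
    )
    return x if rng.random() <= mean / (mean + x) else mean * mean / x


def simulate_total(rng, policies):
    """One simulated portfolio total from per-policy (probability, mean, shape)."""
    total = 0.0
    for claim_prob, mean, shape in policies:
        if rng.random() < claim_prob:
            total += sample_inverse_gaussian(rng, mean, shape)
    return total


def empirical_quantile(sorted_values, q):
    """Nearest-rank empirical quantile of an already sorted list."""
    index = min(len(sorted_values) - 1, max(0, math.ceil(q * len(sorted_values)) - 1))
    return sorted_values[index]


rng = random.Random(7)
policies = [(0.05, 20000.0, 40000.0)] * 500  # hypothetical fitted parameters

totals = sorted(simulate_total(rng, policies) for _ in range(2000))
lower = empirical_quantile(totals, 0.025)
upper = empirical_quantile(totals, 0.975)
expected_total = sum(p * m for p, m, _ in policies)
```

The pair `lower` and `upper` plays the role of the 95% limits reported in Table 4, and the width of the interval gives a sense of how dispersed the portfolio total is.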
We also calculated the mean squared error (MSE) and the mean absolute error (MAE) for the residuals. The results are very similar for the Tweedie and ZAIG models.
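For reference, the two error measures can be computed as in the sketch below. The observed and predicted claim sizes are made-up numbers for four policies, used only to show the calculation.

```python
def mean_squared_error(actual, predicted):
    """Average of squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed and predicted claim sizes for four policies.
actual = [0.0, 0.0, 1500.0, 0.0]
predicted = [100.0, 50.0, 900.0, 0.0]

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
```

Because most policies have no claim, both measures are dominated by the few policies with large observed claims, which may be one reason the two models score similarly.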

Concluding Remarks
In this work we have tackled a well-known problem in the insurance industry: the proper pricing of an insurance policy. Employing the ZAIG estimation method for claims and risks in the insurance industry, we found distinct factors that influence claim size and probability. Factors such as territory, a vehicle's advanced age, origin and type distinctly influence claim size and probability. The distinct impact is not always present in the Tweedie estimated model. The ZAIG estimation method also allows insurance companies to create a score system to predict claims, based on the logistic model. This score system identifies policy holders who tend to be riskier. These estimated models may thus be employed to develop a strategy for premium pricing. In addition, insurance companies can use vehicle characteristics to estimate total claim size and thus get an idea of how much they will have to spend on claims over a certain period of time and for a specific client portfolio.
Some limitations to this study should be recognized. First, the methods require a high computational effort which may preclude the use of larger datasets. Second, there is room for developing suitable methods for longitudinal data analysis. Future work may consider the use of estimating equation techniques or multivariate ZAIG distributions. We concentrated our research on the auto insurance industry and specific vehicle variables. Further studies may address other insurance industries and include customer related variables.