WEIGHTED CROSS VALIDATION IN THE SELECTION OF ROBUST REGRESSION MODEL WITH CHANGE-POINT FOR TELEVISION RATING FORECAST

This paper proposes a weighted cross-validation (WCV) algorithm to select, among linear regression models with a change-point under scale mixtures of normal (SMN) distributions, the one that yields the best prediction results. SMN distributions are used to construct regression models that are robust to the influence of outliers on parameter estimation. Thus, we relax the usual assumption of normality and assume that the random errors follow an SMN distribution, specifically the Student-t distribution. In addition, we allow the parameters of the regression model to change from a specific, unknown point, called the change-point. In this context, the estimates of the model parameters, including the change-point, are obtained via an EM-type (Expectation-Maximization) algorithm. The WCV method is used to select the model that is most robust and has the smallest prediction error, with weights taken from step E of the EM-type algorithm. Finally, numerical examples with simulated and real data (television audience data) illustrate the proposed methodology.


INTRODUCTION
Linear regression models are widely used to describe the average relationship between a response variable and one or more explanatory variables. Their applications are found in several scientific areas: economics, agriculture, biology, medical sciences, and others.
In regression procedures it is usually assumed that a single model is valid for the entire data set. However, the behavior of the data may change from a specific value (a time, an index or a value of the predictor variable, for example), generating a change-point in the model. In the context of linear models, change-point problems with discontinuity under normality of the data have been widely treated (CHEN; GUPTA, 2001; GUPTA, 2011). For a continuous change-point, the works of Muggeo (2003) and Hofrichter (2007) can be cited as examples.
The assumption of normally distributed errors in a regression model is usually adopted in the literature. However, this assumption becomes unrealistic when the data follow a distribution with heavier tails and, moreover, the coefficient estimates are sensitive to extreme observations. Thus, flexible methodologies based on non-normal probability distributions, such as the SMN class of distributions (ANDREWS; MALLOWS, 1974), can be proposed as an alternative. Parameter estimation in this type of model is performed iteratively using EM-type algorithms, such as the ECM (Expectation-Conditional Maximization) algorithm, detailed in the works by Dempster, Laird and Rubin (1977) and Meng and Rubin (1993). Applications of distributions from the SMN class in regression models can be found in Lange, Little and Taylor (1989), Lange and Sinsheimer (1993), Yamaguchi (1990), Rosa, Padovani and Gianola (2003) and Osorio (2006). Results on regression models with a discontinuous change-point using robust distributions can be found in Huaira-Contreras (2014) and Young (2014).
To determine a prediction model from a data set, the usual strategy is to divide the set into two parts, the first to fit the model and the second to validate its ability to generalize. Cross-validation (CV) for assessing the predictive ability of a regression model is presented by Picard and Cook (1984). Shao (1993) proposes a method that uses CV to select the model with the best prediction. Ronchetti, Field and Blanchard (1997) and Markatou, Afendras and Agostinelli (2018) extend this idea and propose WCV methods for selecting models that are robust to extreme observations. Following this approach, this work proposes a WCV method for selecting a regression model with a continuous change-point that considers distributions with heavier tails (SMN) as an alternative to the normal distribution, in order to reduce the influence of outliers and to achieve high predictive ability. The presence of the change-point affects the design matrix and leads to a modification of the algorithm proposed by Markatou, Afendras and Agostinelli (2018).
The paper is organized as follows. Section 2 specifies the regression model with a continuous change-point and presents some important concepts about SMN distributions and the EM-type algorithm used to obtain the maximum likelihood estimators. Section 3 describes the WCV method and the respective algorithm in the context of the proposed model. Section 4 presents simulation studies to evaluate the proposed methodology. An application to television audience data from some regions of Brazil is presented in Section 5. Finally, some observations and conclusions are presented in Section 6.

SPECIFICATION OF THE REGRESSION MODEL
According to Muggeo (2003) and Young (2014), the linear regression model with a continuous and unknown change-point, for a sequence of observations $(x_i, Y_i)$, $i = 1, \ldots, n$, can be specified as

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 (x_i - \gamma)_+ + \varepsilon_i, \tag{1}$$

where $(x_i - \gamma)_+ = (x_i - \gamma)\, I(x_i > \gamma)$ and $\gamma$ is the change-point. In this way, the model defines a line to the left of the change-point given by $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, while to the right of the change-point we have $Y_i = (\beta_0 - \beta_2 \gamma) + (\beta_1 + \beta_2) x_i + \varepsilon_i$. That is, $\beta_1$ is the slope of the line to the left of the change-point and $(\beta_1 + \beta_2)$ is the slope of the line to the right. Thus, $\beta_2$ is the parameter that represents the "slope difference" between these lines. In addition, it is assumed that the explanatory variable is sorted in ascending order, $x_i \le x_{i+1}$, $i = 1, \ldots, n - 1$. The location of the change is not restricted to an observed $x_i$. Instead, it can be any value within the range $[a; b]$, where $a = \min x_i = x_{(1)}$ and $b = \max x_i = x_{(n)}$. In this work, we replace in equation (1) the usual assumption of normal distribution of the random errors $\varepsilon_i$, $i = 1, \ldots, n$, by

$$\varepsilon_i \sim \mathrm{SMN}(0, \sigma^2, \nu), \tag{2}$$

and so we have

$$Y_i \sim \mathrm{SMN}\big(\beta_0 + \beta_1 x_i + \beta_2 (x_i - \gamma)_+,\ \sigma^2,\ \nu\big). \tag{3}$$

Thus, the model proposed by equations (1)-(2) considers that the probability distribution of the errors belongs to the SMN class of distributions. Specifically, in this work we consider the Student-t distribution, a particular case of this class. Furthermore, assuming $\nu$ fixed in equation (3) (LUCAS, 1997), the parameter vector to be estimated is $\theta = (\gamma, \beta_0, \beta_1, \beta_2, \sigma^2)^T$ and the maximum likelihood estimators for $\theta$ are obtained via an EM-type algorithm.

Revista Mundi, Engenharia e Gestão, Paranaguá, PR, v. 5, n. 2, p. 233-01, 233-15, 2020 DOI: 10.21575/25254782rmetg2020vol5n21179
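The two-line reading of the model can be checked numerically. The following is a minimal sketch, not code from the paper: the function names `hinge`, `mean_response` and `design_row` are hypothetical, and the parameter values are only illustrative.

```python
def hinge(x, gamma):
    """Positive part (x - gamma)_+ used in the continuous change-point model."""
    return max(x - gamma, 0.0)


def mean_response(x, beta0, beta1, beta2, gamma):
    """Mean of Y at x: beta0 + beta1 * x + beta2 * (x - gamma)_+."""
    return beta0 + beta1 * x + beta2 * hinge(x, gamma)


def design_row(x, gamma):
    """Row (1, x, (x - gamma)_+) of the design matrix for a given change-point."""
    return [1.0, x, hinge(x, gamma)]


# Illustrative values only (assumed for this sketch).
beta0, beta1, beta2, gamma = 2.0, -1.0, 3.0, 4.0
left = mean_response(3.0, beta0, beta1, beta2, gamma)   # left of gamma: beta0 + beta1 * x
right = mean_response(6.0, beta0, beta1, beta2, gamma)  # right of gamma: (beta0 - beta2 * gamma) + (beta1 + beta2) * x
at_gamma = mean_response(gamma, beta0, beta1, beta2, gamma)  # both lines meet here
```

Because the third column of the design row depends on the change-point, the design matrix has to be rebuilt whenever the candidate value of the change-point changes.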

Scale Mixtures of Normal Distributions
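The class of symmetric SMN distributions (ANDREWS; MALLOWS, 1974) is constructed as the scale mixture of a normal random variable and a positive random variable, that is,

$$Y = \mu + \kappa(U)^{1/2} Z, \tag{4}$$

where $\mu \in \mathbb{R}$ is a location parameter, $Z \sim N(0, \sigma^2)$, $\kappa(\cdot)$ is a positive weight function and $U$ is a positive random variable with cdf $H(\cdot; \nu)$, independent of $Z$, whose distribution is indexed by the parameter $\nu$ that controls the tails of the SMN distribution. Conditionally on $U = u$, $Y$ has a normal distribution, that is,

$$Y \mid U = u \sim N(\mu, \kappa(u)\sigma^2). \tag{5}$$

Therefore, the pdf of $Y$ is given by

$$f(y) = \int_0^\infty \phi(y; \mu, \kappa(u)\sigma^2)\, dH(u; \nu), \tag{6}$$

where $\phi(\cdot; \mu, \kappa(u)\sigma^2)$ denotes the pdf of a normal distribution with mean $\mu$ and variance $\kappa(u)\sigma^2$. The normal distribution is a particular case of this class, obtained when $H$ is a degenerate cdf and $\kappa(u) = 1$. Other families belong to this class, for example the Student-t, which is obtained with $\kappa(u) = 1/u$ and $U \sim \mathrm{Gamma}(\nu/2, \nu/2)$.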
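The construction can be illustrated by simulation. The sketch below assumes the usual Student-t member of the class, with mixing variable U following a Gamma distribution with shape ν/2 and rate ν/2 and scale factor 1/u. The function name `sample_student_t_smn`, the seed and the sample size are choices made for this sketch, not part of the paper.

```python
import math
import random


def sample_student_t_smn(n, mu, sigma2, nu, rng):
    """Draw n values Y = mu + Z / sqrt(U), Z ~ N(0, sigma2), U ~ Gamma(nu/2, rate nu/2)."""
    draws = []
    for _ in range(n):
        # random.gammavariate takes (shape, scale); a rate of nu/2 means a scale of 2/nu.
        u = rng.gammavariate(nu / 2.0, 2.0 / nu)
        z = rng.gauss(0.0, math.sqrt(sigma2))
        draws.append(mu + z / math.sqrt(u))
    return draws


rng = random.Random(2020)  # arbitrary seed, for reproducibility only
y = sample_student_t_smn(20000, mu=0.0, sigma2=1.0, nu=3.0, rng=rng)
tail_share = sum(abs(v) > 3.0 for v in y) / len(y)  # share of draws beyond 3 scale units
```

With ν = 3 the share of draws beyond three scale units should be far larger than the roughly 0.3% expected under a normal distribution, which is the heavy-tail behaviour that motivates the robust model.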

The EM Algorithm
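Note that the model specified in equations (1)-(2) can be described hierarchically as

$$Y_i \mid U_i = u_i \sim N(\mu_i, \kappa(u_i)\sigma^2), \tag{7}$$

$$U_i \sim H(u_i; \nu), \tag{8}$$

independently for $i = 1, \ldots, n$, where $\mu_i = \beta_0 + \beta_1 x_i + \beta_2 (x_i - \gamma)_+$. In this estimation process, let $y = (y_1, \ldots, y_n)^T$ be the vector of observed responses for the $n$ sample units and $u = (u_1, \ldots, u_n)^T$. Under the hierarchical representation (7)-(8), the complete log-likelihood associated with $y_c = (y^T, u^T)^T$ is given by

$$\ell_c(\theta \mid y_c) = c - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \kappa^{-1}(u_i)\,(y_i - \mu_i)^2, \tag{9}$$

where $c$ is a constant that does not depend on $\theta$. Let $\theta^{(t)} = (\gamma^{(t)}, \beta_0^{(t)}, \beta_1^{(t)}, \beta_2^{(t)}, \sigma^{2(t)})^T$ be the estimate of $\theta$ at the $t$-th iteration. The conditional expectation of the complete log-likelihood function is

$$Q(\theta \mid \theta^{(t)}) = E\big[\ell_c(\theta \mid y_c) \mid y, \theta^{(t)}\big] = c^{*} - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \hat{u}_i^{(t)} (y_i - \mu_i)^2, \tag{10}$$

where $\hat{u}_i^{(t)} = E\big[\kappa^{-1}(U_i) \mid y_i, \theta^{(t)}\big]$ and $c^{*}$ does not depend on $\theta$. Assuming $\theta = (\theta_1, \theta_2^T)^T$, where $\theta_1 = \gamma$ and $\theta_2 = (\beta^T, \sigma^2)^T$, the $(t+1)$-th iteration of the EM-type algorithm is described as follows.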

Description of the EM-type Algorithm
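The EM-type algorithm consists of two well-defined steps, Expectation (E) and Maximization (M); in this proposal, the M step is divided into two conditional maximization steps, CM1 and CM2.
STEP E: From the estimates $\theta^{(t)}$ and a fixed value of $\nu$, the weights $\hat{u}_i^{(t)}$ are obtained through the conditional expectation

$$\hat{u}_i^{(t)} = E\big[U_i \mid y_i, \theta^{(t)}\big] = \frac{\nu + 1}{\nu + d_i^2}, \qquad d_i^2 = \frac{(y_i - \mu_i^{(t)})^2}{\sigma^{2(t)}}, \qquad i = 1, \ldots, n, \tag{11}$$

where $\mu_i^{(t)} = \beta_0^{(t)} + \beta_1^{(t)} x_i + \beta_2^{(t)} (x_i - \gamma^{(t)})_+$ and $d_i$ is the Mahalanobis distance of the $i$-th observation.
STEPS CM1 and CM2: Given the weights $\hat{u}_i^{(t)}$, the conditional expectation of the complete log-likelihood is maximized with respect to each of the blocks $\theta_1 = \gamma$ and $\theta_2 = (\beta^T, \sigma^2)^T$ in turn, with the other block held fixed, to get $\theta^{(t+1)}$.
Note that step E of the algorithm shows that the higher the value of $d_i$, the lower the value of $u_i$. Thus, the estimation procedure tends to give less weight to observations that are atypical in terms of the Mahalanobis distance. In other words, when distributions with tails heavier than the normal distribution are used, the EM algorithm accommodates atypical observations by giving them less weight in the estimation process. When $u_i = 1$, $i = 1, \ldots, n$, the results described above coincide with the maximum likelihood estimates of the linear regression model with change-point under normal errors.
Steps E and M must be repeated alternately until convergence is achieved. The convergence criterion used is $\| \theta^{(t+1)} - \theta^{(t)} \| < \delta$, where $\| \cdot \|$ indicates the norm of the vector and $\delta > 0$ is a prescribed tolerance.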
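The iteration can be sketched as follows under Student-t errors with ν fixed. This is an illustration under stated assumptions, not the authors' implementation: the conditional maximization over the change-point is replaced by a plain grid search that profiles the Student-t log-likelihood over candidate values. The weights are the usual Student-t weights (ν + 1)/(ν + d²), and the coefficient update is weighted least squares. All function names, the grid and the synthetic data are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_fixed_gamma(x, y, nu, gamma, tol=1e-6, max_iter=100):
    """EM iterations for (beta, sigma2) under Student-t errors, with gamma and nu held fixed."""
    n = len(y)
    rows = [[1.0, xi, max(xi - gamma, 0.0)] for xi in x]
    u = [1.0] * n  # u_i = 1 reproduces the normal-error (least-squares) fit
    beta, sigma2 = [0.0, 0.0, 0.0], 1.0
    for _ in range(max_iter):
        # Maximization for beta and sigma2: weighted least squares with the current weights.
        a = [[sum(w * r[j] * r[k] for w, r in zip(u, rows)) for k in range(3)] for j in range(3)]
        b = [sum(w * r[j] * yi for w, r, yi in zip(u, rows, y)) for j in range(3)]
        new_beta = solve(a, b)
        res = [yi - sum(bj * rj for bj, rj in zip(new_beta, r)) for yi, r in zip(y, rows)]
        new_sigma2 = sum(w * e * e for w, e in zip(u, res)) / n
        # E step: weights (nu + 1) / (nu + d_i^2), small for large standardized residuals.
        u = [(nu + 1.0) / (nu + e * e / new_sigma2) for e in res]
        change = math.sqrt(sum((p - q) ** 2 for p, q in zip(new_beta + [new_sigma2], beta + [sigma2])))
        beta, sigma2 = new_beta, new_sigma2
        if change < tol:
            break
    loglik = sum(
        math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
        - 0.5 * math.log(math.pi * nu * sigma2)
        - 0.5 * (nu + 1.0) * math.log1p(e * e / (nu * sigma2))
        for e in res
    )
    return beta, sigma2, u, loglik


def fit_changepoint_t(x, y, nu, grid):
    """Grid search over candidate change-points, keeping the largest Student-t log-likelihood."""
    best = None
    for gamma in grid:
        beta, sigma2, u, loglik = fit_fixed_gamma(x, y, nu, gamma)
        if best is None or loglik > best["loglik"]:
            best = {"gamma": gamma, "beta": beta, "sigma2": sigma2, "weights": u, "loglik": loglik}
    return best


# Small synthetic check (illustrative values, not the paper's data).
rng = random.Random(1)
x = sorted(rng.uniform(1.0, 10.0) for _ in range(120))
eps = [rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(1.5, 2.0 / 3.0)) for _ in x]  # t(0, 1, 3) errors
y = [2.0 - xi + 3.0 * max(xi - 6.0, 0.0) + e for xi, e in zip(x, eps)]
grid = [2.0 + 0.5 * k for k in range(15)]  # candidate change-points 2.0, 2.5, ..., 9.0
fit = fit_changepoint_t(x, y, nu=3.0, grid=grid)
```

The candidate grid is kept strictly inside the range of the explanatory variable so that the design matrix has full rank for every candidate. The returned `weights` are the quantities that the weighted cross-validation method reuses.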

WEIGHTED CROSS-VALIDATION METHOD
In this work, the suggested WCV algorithm is based on the proposal by Markatou, Afendras and Agostinelli (2018), who present a WCV algorithm to select the best set of covariates in a multiple regression model and use a nonparametric method to determine the weights of the observations in the data set. We consider instead a regression model with change-point and a parametric method to define the weights.
Two characteristics that are important for the proposed algorithm are emphasized. First, good generalizability: the performance of the algorithm on the training set accurately reflects its performance on the validation set, an important property in models used for forecasting. Second, stability, in the sense that the removal of any instance from the training set usually results in only a small change in the parameter estimates. The theoretical framework and discussions of the concepts of generalization and stability are presented in Bousquet and Elisseeff (2002).
In the WCV method proposed in this work, outliers are expected to have little influence, because these observations are weighted by the values $u_i$, $i = 1, \ldots, n$, coming from step E of the EM-type algorithm, expressed in equation (11). In other words, parameter estimates that are robust to extreme observations are obtained in the training sample and, later, these estimates are used to calculate the weights $u_i$, $i = 1, \ldots, n$, in the validation sample. Thus, the criterion for choosing the best model via the mean prediction error is also subject to only a limited influence from the outliers.
2. Select the estimate $\hat{\theta}(\nu)$ for which the log-likelihood function is maximum, among the estimates obtained in the previous step.
4. Randomly divide the total data set into two parts, for training ($c_t$) and validation ($c_v$), of sizes $n_t$ and $n_v$, such that $n_t + n_v = n$.
5. Partition the weighting matrix $W$ obtained in step 3 into $W_{c_t}$ and $W_{c_v}$, matching the training set ($c_t$) and the validation set ($c_v$).
8. Repeat steps 6 and 7 for all $\nu \in A$.
9. Repeat steps 4, 5 and 8 for all $N$ random subsamples of size $n_v$ and calculate $\Upsilon^{Total}_{\nu}(c_v)$.
10. Select the model with the smallest $\Upsilon^{Total}_{\nu}(c_v)$.
The following considerations apply to the proposed CV method.
• Unlike Markatou, Afendras and Agostinelli (2018), the weights are calculated for all $\nu \in A$ as part of the EM-type algorithm and then fixed when an estimate is selected (steps 1 and 2); they are associated with the Mahalanobis distance and, consequently, with these estimates.
• The structure of the matrix $X$ depends on the change-point parameter $\gamma$; thus, this structure is fixed in the final steps of the algorithm, in which the other parameters are optimized by evaluating the forecast.
• The selection of models is made over the set $A = \{\nu \in \mathbb{N} : 1 \le \nu \le \nu_{\max}\}$, where $\nu$ is the number of degrees of freedom of the Student-t distribution evaluated. Integer values between 1 and $\nu_{\max} = 30$ are evaluated. Recall that the Student-t distribution tends to the normal distribution when $\nu \to \infty$, so the normal model is also considered as an alternative.
• The Monte Carlo CV method proposed by Shao (1993) is considered. It uses $N$ random subsamples as an alternative to enumerating all $\binom{n}{n_v}$ combinations of size $n_v$ without replacement.
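The Monte Carlo part of the procedure can be sketched as below for one value of ν. The sketch rests on assumptions that the text does not confirm: the fit on the training set is taken to be weighted least squares with the change-point (and hence the design matrix) held fixed, and the prediction error is taken to be a weighted mean of squared errors on the validation set. The names `wls` and `weighted_cv_error`, the toy data and the weights are hypothetical.

```python
import random


def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def wls(rows, y, w):
    """Weighted least-squares coefficients for design rows, responses y and weights w."""
    p = len(rows[0])
    a = [[sum(wi * r[j] * r[k] for wi, r in zip(w, rows)) for k in range(p)] for j in range(p)]
    b = [sum(wi * r[j] * yi for wi, r, yi in zip(w, rows, y)) for j in range(p)]
    return solve(a, b)


def weighted_cv_error(rows, y, w, n_val, n_splits, seed=0):
    """Average weighted squared prediction error over random training/validation splits."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    total = 0.0
    for _ in range(n_splits):
        val = set(rng.sample(idx, n_val))          # validation set c_v
        train = [i for i in idx if i not in val]   # training set c_t
        beta = wls([rows[i] for i in train], [y[i] for i in train], [w[i] for i in train])
        sq = sum(w[i] * (y[i] - sum(bj * rj for bj, rj in zip(beta, rows[i]))) ** 2 for i in val)
        total += sq / sum(w[i] for i in val)       # weighted mean (assumed normalisation)
    return total / n_splits


# Toy data: an exact line plus one gross outlier in the last response.
rows = [[1.0, float(k)] for k in range(10)]
y = [1.0 + 2.0 * k for k in range(10)]
y[9] += 100.0
w_normal = [1.0] * 10            # normal model: all weights equal to 1
w_robust = [1.0] * 9 + [1e-9]    # hypothetical E-step weights that discount the outlier
err_normal = weighted_cv_error(rows, y, w_normal, n_val=3, n_splits=20)
err_robust = weighted_cv_error(rows, y, w_robust, n_val=3, n_splits=20)
```

To select among models, the same function would be called once per ν in A, each time with the design rows and weights obtained for that ν, and the ν with the smallest error would be kept.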

SIMULATION STUDIES
The objective is to verify whether the true parameter values can be recovered when the proposed EM-type algorithm is used to fit the Student-t linear regression model with a continuous change-point to artificially generated data. We generated 500 samples of each size $n = 25$, 50, 100 and 200 from the Student-t linear regression model with continuous change-point, considering $\gamma = 4$, 6 and 8. The explanatory variable was $x_i \sim U(1, 10)$ and the random errors were $\varepsilon_i \sim t(0, 1, 3)$, with the following parameter values: $\beta_0 = 2$, $\beta_1 = -1$ and $\beta_2 = 3$. Figure 1 shows the boxplots of the parameter estimates of the Student-t linear regression model with the simulated change-point at $\gamma = 8$. For the other cases the results are similar, so they are not shown here to save space. In general, the bias and variability of the parameter estimates decrease as the sample size increases, as expected. This agrees with the asymptotic properties of maximum likelihood estimators: consistency and efficiency. In addition, the mean, the median and the corresponding standard deviations of the estimates obtained via the EM-type algorithm were calculated; the results are shown in Table 1.
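The data-generating step of such a study can be sketched as follows. This is an illustration only: the function name `simulate_sample` and the seed are hypothetical, the Student-t errors are drawn through their scale-mixture representation, and only 5 replicates per scenario are generated here to keep the example fast, whereas the study uses 500.

```python
import math
import random


def simulate_sample(n, gamma, rng, beta=(2.0, -1.0, 3.0), nu=3.0):
    """One artificial sample: sorted x ~ U(1, 10), errors ~ t(0, 1, nu), change-point at gamma."""
    x = sorted(rng.uniform(1.0, 10.0) for _ in range(n))
    y = []
    for xi in x:
        # Student-t error: a standard normal draw divided by sqrt of a Gamma(nu/2, rate nu/2) draw.
        e = rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(nu / 2.0, 2.0 / nu))
        y.append(beta[0] + beta[1] * xi + beta[2] * max(xi - gamma, 0.0) + e)
    return x, y


rng = random.Random(500)  # arbitrary seed, for reproducibility only
samples = {
    (n, g): [simulate_sample(n, g, rng) for _ in range(5)]  # 5 replicates here; the study uses 500
    for n in (25, 50, 100, 200)
    for g in (4.0, 6.0, 8.0)
}
```

Each replicate would then be passed to the estimation routine, and the mean, median and standard deviation of the estimates would be summarized per scenario.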

APPLICATION IN TELEVISION AUDIENCE DATA
A study to measure television audiences using two different measuring instruments was conducted with data collected over several days. In a specific period $i$, two audience measurements are obtained, $(\mathrm{Aud1}_i, \mathrm{Aud2}_i)$, $i = 1, \ldots, n$, where $\mathrm{Aud1}_i$ is the audience measured by instrument 1 and $\mathrm{Aud2}_i$ is the audience measured by instrument 2. The number of periods evaluated during the study was $n = 2660$. With this data set, a regression model that represents the relationship between the audience values obtained by the two instruments should be proposed. The model will be used in the complete audience database and to predict new audiences via instrument 2 when necessary. The measurements from the two instruments show a relationship that changes when a certain audience level is reached, and measurements with extreme values are frequent in the data set. Thus, a regression model with a change-point that describes the relationship between the audience values obtained by instruments 1 and 2 is proposed as

$$\mathrm{Aud2}_i = \beta_0 + \beta_1 \mathrm{Aud1}_i + \beta_2 (\mathrm{Aud1}_i - \gamma)_+ + \varepsilon_i, \qquad i = 1, \ldots, n.$$

The errors $\varepsilon_i$ are taken to be distributed as Student-t and as normal, in order to compare weighting and non-weighting of the outliers. The proposed model follows all the specifications described in Section 2, and the weighted cross-validation method described in Section 3 is used to obtain the parameters. Table 2 presents the estimated parameter values in two situations: (i) via the EM-type algorithm on the total data set, choosing the model that maximizes the log-likelihood function, and (ii) via the cross-validation method, choosing the model that minimizes the prediction error, with $N = 200$ subsamples and $n_t = 370$.
In both situations, the Student-t model is always better than the normal model. In situation (i), the highest log-likelihood value occurs when $\nu = 26$, and in situation (ii) the prediction error is minimal when $\nu = 13$. The parameter values are similar, and the cross-validation method allows confidence intervals for the parameters to be calculated directly from Est and SD(Est). Figure 2 shows the model fitted to the analyzed data. Note that the change-point is necessary for a better fit of the model. Additionally, the weights used in the robust validation method are presented; they indicate that a large number of observations have small weights, which characterizes extreme values. It is important to point out that under the classical normal model the weights are not differentiated (all of them receive the value 1).
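One simple way to turn Est and SD(Est) into an interval is a normal approximation. The sketch below assumes that Est and SD(Est) are the mean and standard deviation of the estimates over the subsamples and that a symmetric normal interval is adequate. The function name `normal_interval` and the numbers are hypothetical and are not taken from Table 2.

```python
from statistics import NormalDist


def normal_interval(est, sd, level=0.95):
    """Interval est -/+ z * sd, with z the standard normal quantile for the given level."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return est - z * sd, est + z * sd


# Hypothetical numbers for illustration only; they are not taken from Table 2.
lower, upper = normal_interval(est=1.0, sd=0.1)
```

A percentile interval computed from the subsample estimates themselves would be an alternative when the distribution of the estimates is visibly skewed.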

CONCLUSIONS
In this work, a regression model with change-point that is robust to atypical values or outliers was developed, based on SMN distributions as an alternative to the normal distribution, and it presents better results in the fit. The inclusion of the weighted cross-validation method allows us to fit a model that improves the prediction results and also provides confidence intervals for the estimated parameters in a direct way. Future work will be oriented towards the use of robust asymmetric distributions and the inclusion of more than one change-point in the model. It is also possible to consider a method that selects the best model from a combination of a criterion based on the log-likelihood function (the Akaike Information Criterion or another), which assesses the estimation in the model, and the forecast error obtained from the validation sample.