## Proposal and validation of methodologies for the categorisation of continuous variables in the development of prediction models

##### Abstract

Prediction models are currently relevant in a number of fields such as physics, meteorology, finance and medicine, among others. In the medical field, prediction models are gaining importance as a support for decision-making, since increased knowledge of potential predictors helps the decision-making process. Clinical prediction models may provide the necessary input for shared decision-making by estimating an individual's risk of an unfavourable event or of developing a certain disease over a specific time period on the basis of his or her clinical and non-clinical profile. A vital factor in the development of prediction models is the selection of the predictors or covariates (clinical variables) to be used in the model. From a statistical perspective, categorising continuous variables is not advisable, since it may entail a loss of information and power. In addition, there are statistical modelling techniques, such as generalised additive models (GAM), which do not require any assumption of linearity between predictors and response variables, and so allow the relationship between the predictor and the outcome to be modelled more appropriately. Yet in clinical research and, more specifically, in the development of prediction models for use in clinical practice, both clinicians and health managers call for the categorisation of continuous parameters. Although categorisation is common practice in clinical research, there are no unified criteria for the selection of the cut points. Previous work has been done on the categorisation of continuous variables, but in almost all cases with the aim of dichotomising the predictor variable. In this dissertation, we focus on the categorisation of continuous variables to be used in the development of prediction models, considering that the use of more than two categories may be preferable. 
This serves to reduce the loss of information and enables the relationship between the covariate and the response variable to be retained. Our goal is to propose a methodology to categorise continuous predictor variables in regression-based prediction models, focussing mainly on the logistic and Cox regression models, which are those most widely used in the medical field for modelling dichotomous and time-to-event outcomes respectively. The work presented in this dissertation was initially motivated by the development of a prediction model in the context of patients with chronic obstructive pulmonary disease (COPD). Clinicians agreed on the use of a categorised version of some clinical parameters, such as the blood gas PCO2 or the respiratory rate, in the prediction model. However, they did not agree on the location and number of cut points. We noticed that the cut points were usually based on quartiles and that, when they were based on clinical criteria, there was no agreement between clinicians. Several proposals are available in the literature, but most are aimed at the selection of a single cut point. Thus, we considered developing a methodology to categorise continuous predictor variables in prediction models. In a first stage, we considered categorising a continuous predictor variable X by examining its graphical relationship with a binary response variable Y based on a logistic GAM with P-spline smoothers. We proposed to categorise X into a minimum of three categories, taking the limits of the average-risk category as the location of the cut points. The location of a third cut point, if needed, was to be based on clinical criteria or on a change in the slope of the graphical display. Nevertheless, this methodology had some restrictions: the location of the third cut point was subjective, X could not be categorised in a multivariate setting, and the approach was limited to a binary outcome. 
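The first-stage idea described above can be illustrated with a minimal sketch. It is not the procedure used in the dissertation, and it rests on several assumptions: a Gaussian kernel smoother stands in for the logistic GAM with P-spline smoothers; the "average-risk category" is taken, hypothetically, to be the region where the smoothed risk lies within a `tolerance` of the overall event rate; and the names `smoothed_risk`, `average_risk_limits`, `categorise`, `bandwidth` and `tolerance` are all illustrative. The data are simulated.

```python
import math
import random


def smoothed_risk(x0, xs, ys, bandwidth):
    """Gaussian-kernel weighted proportion of events around x0.

    Assumption: this is a simple stand-in for a fitted GAM curve,
    not a P-spline fit.
    """
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


def average_risk_limits(xs, ys, bandwidth=1.0, tolerance=0.05, grid_size=200):
    """First and last grid points whose smoothed risk is within `tolerance`
    of the overall event rate (hypothetical definition of 'average risk').

    Returns (lower, upper), or None if no grid point qualifies.
    """
    prevalence = sum(ys) / len(ys)
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    inside = [
        g for g in grid
        if abs(smoothed_risk(g, xs, ys, bandwidth) - prevalence) <= tolerance
    ]
    if not inside:
        return None
    return inside[0], inside[-1]


def categorise(x, cut_points):
    """Category index 0..k for x, given k sorted cut points.

    Category j collects values in (c_j, c_{j+1}]; values at or below the
    first cut point fall in category 0.
    """
    return sum(x > c for c in cut_points)


# Simulated example: risk of the event increases with X (logistic shape).
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(x - 5))) else 0 for x in xs]

limits = average_risk_limits(xs, ys)       # two cut points
x_cat = [categorise(x, limits) for x in xs]  # low, average and high risk
```

With two cut points, `x_cat` takes the values 0, 1 and 2, corresponding to below-average, average and above-average risk. The sketch shares the restrictions noted above: it handles a single predictor and a binary outcome only, and the result depends on subjective choices (here `bandwidth` and `tolerance`).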
Thus, in a second stage, we sought a proposal that would provide an optimal categorisation of a continuous predictor in a multivariate setting and for different distributions of the response variable. We started by developing a methodology in which the location of any given number $k$ of cut points for $X$ could be optimally selected in a logistic regression, with or without a set of other predictor variables, $Z = (Z_1, \ldots, Z_p)$. The proposal consisted of selecting a vector $v_k = (x_1, \ldots, x_k)$ of $k$ cut points in such a way that the best logistic predictive model was obtained for the response variable $Y$. Specifically, given $k$, the number of cut points set for categorising $X$ into $k + 1$ intervals, let us denote by $X_{cat_k}$ the corresponding categorised variable, taking values from 0 to $k$. Then, what we propose is that the vector of $k$ cut points $v_k = (x_1, \ldots, x_k)$ which maximises the area under the receiver operating characteristic (ROC) curve (AUC) of the logistic regression model shown in equation (2) is the vector of the optimal $k$ cut points. $\operatorname{logit}(P(Y = 1 \mid Z, X_{cat_k})) = \beta_0 + \sum_{r=1}^{p} \beta_r Z_r +$