An investigation of imputation methods for discrete databases and multi-variate time series
Abstract
In real-world data sets, parts of the data are often unavailable. This problem
can arise for various reasons, each of which produces a different pattern of
missingness. In the literature, this problem is known as Missing Data.
It can be handled in various ways: discarding incomplete observations,
estimating what the missing values originally were, or simply ignoring the fact
that some values are missing. The methods used to estimate missing data are called
Imputation Methods.
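As an illustration only, the difference between discarding incomplete observations and imputing them can be sketched as follows. This is a minimal sketch and not one of the methods studied in this thesis: the `MISSING` marker, the helper names `listwise_deletion` and `mode_imputation`, and the toy data are all assumptions made for the example, and mode imputation stands in for the simplest kind of imputation on discrete data.

```python
from collections import Counter

MISSING = None  # hypothetical marker for an unavailable value


def listwise_deletion(rows):
    """Drop every observation that has at least one missing value."""
    return [row for row in rows if MISSING not in row]


def mode_imputation(rows):
    """Replace each missing value with the most frequent observed value of its column."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not MISSING]
        # A column with no observed values cannot be imputed and stays missing.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return [[modes[j] if v is MISSING else v for j, v in enumerate(row)] for row in rows]


# Toy discrete data set (illustrative values only).
rows = [
    ["a", "x"],
    ["a", None],
    [None, "y"],
    ["b", "y"],
]

complete_cases = listwise_deletion(rows)  # keeps only fully observed rows
imputed = mode_imputation(rows)           # keeps all rows, fills the gaps
```

Deletion keeps only the two fully observed rows, while imputation keeps all four rows at the cost of introducing estimated values.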
The work presented in this thesis has two main goals.
The first one is to determine whether any kind of interaction exists between Missing
Data, Imputation Methods and Supervised Classification algorithms when they are
applied together. For this first problem we consider a scenario in which the databases
used are discrete, where discrete means that no relation is assumed
between observations. These datasets underwent processes involving different
combinations of the three components mentioned. The results showed that the missing data
pattern strongly influences the outcome produced by a classifier. Also, in some
cases, the complex imputation techniques investigated in the thesis obtained
better results than simple ones.
The second goal of this work is to propose a new imputation strategy, this time
restricting the previous problem to a special kind of dataset,
multivariate Time Series. We designed new imputation techniques for this particular
domain and combined them with some of the established strategies tested in the
previous chapter of this thesis. The time series were also subjected to processes involving
missing data and imputation, in order to finally propose an overall better imputation method.
In the final chapter of this work, a real-world example is presented, describing a
water quality prediction problem. The databases that characterize this problem contain
their own originally missing (latent) values, which provides a real-world benchmark to test the
algorithms developed in this thesis.