Building a predictive model for Kaggle's “Home Depot Product Search Relevance” Competition

Jácome Galarza, Luis Roberto

Date

2016-10-14

URI

http://hdl.handle.net/10810/19242

Abstract

This work analyses different techniques for building a predictive model for the Kaggle competition “Home Depot Product Search Relevance”. Several NLP methods were used for data preprocessing, including tokenization, lemmatization, and stop-word removal. Word overlap and Mikolov word embeddings were used for feature extraction, and the Random Forest algorithm was used for regression. The scripts were written in R, an open-source statistical language. The results indicate that distributed word representations are a very useful technique for many NLP applications: word embeddings helped to improve the accuracy of the predictive model, and the experience showed both the power of the technique and its ease of use. A major concern of the project was the long time needed to process the word embeddings on regular desktop and laptop computers. To reduce it, embeddings were extracted only for the words found in the datasets, and some of the datasets were split and processed on different machines. Other possible solutions to this problem include rented cloud computing, grid computing, parallel computing, and servers.
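The two feature types named in the abstract, and the trick of keeping embeddings only for words that occur in the datasets, can be illustrated with a minimal sketch. It is written in Python rather than the R used in the thesis, and it is not the author's code: the function names, the toy stop-word list, and the assumed embedding file layout (one word per line followed by its vector components) are all hypothetical. Lemmatization is omitted for brevity.

```python
import math
import re

# Toy stop-word list (an assumption; the thesis does not specify its list).
STOP_WORDS = {"a", "an", "the", "for", "of", "with", "and", "in", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def word_overlap(query, title):
    """Fraction of query tokens that also appear in the product title."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    title_tokens = set(tokenize(title))
    return sum(1 for t in query_tokens if t in title_tokens) / len(query_tokens)


def filter_embeddings(lines, vocabulary):
    """Keep vectors only for words in `vocabulary`.

    `lines` is any iterable of strings shaped like "word v1 v2 ..."
    (an assumed layout), so a large file can be streamed line by line.
    """
    kept = {}
    for line in lines:
        parts = line.split()
        if parts and parts[0] in vocabulary:
            kept[parts[0]] = [float(x) for x in parts[1:]]
    return kept


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is missing or all zeros."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In a pipeline of this kind, the overlap ratio and the cosine similarity between the averaged query and title vectors would become columns of a feature table passed to a regressor such as a Random Forest. Building the vocabulary from the datasets first and streaming the embedding file through `filter_embeddings` keeps only the needed vectors in memory.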

Collections

Máster Universitario en Ingeniería Computacional y Sistemas Inteligentes

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 4.0 International