Building a predictive model for Kaggle's “Home Depot Product Search Relevance” Competition
Date
2016-10-14
Author
Jácome Galarza, Luis Roberto
Abstract
The present work analyses different techniques for building a predictive model to solve Kaggle's competition "Home Depot Product Search Relevance". Several NLP methods were used for data preprocessing, such as tokenization, lemmatization, and stop-word removal. Word overlap and Mikolov-style word embeddings (word2vec) were used for feature extraction, and the Random Forest algorithm was used for regression. Finally, the scripts were built in the open-source statistical language R.
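The preprocessing and word-overlap steps described above can be sketched as follows. The original scripts were written in R; this is a minimal Python illustration with an assumed (tiny) stop-word list and the lemmatization step omitted for brevity:

```python
import re

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "in", "for", "of", "to", "and"}

def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def preprocess(text):
    """Tokenize and drop stop words (lemmatization omitted for brevity)."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

def word_overlap(query, title):
    """Fraction of query terms that also appear in the product title."""
    q, t = set(preprocess(query)), set(preprocess(title))
    return len(q & t) / len(q) if q else 0.0

print(word_overlap("angle bracket", "Simpson Strong-Tie Angle bracket"))  # → 1.0
```

A feature like this overlap ratio would then be fed, together with the embedding-based features, into the regression model.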
The results indicate that distributed word representations are a very useful technique for many NLP applications.
Word embeddings helped improve the accuracy of the predictive model; this experience demonstrated both the power of the technique and its ease of use.
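One common way to turn word embeddings into a relevance feature is to average the word vectors of the search query and of the product title and compare them with cosine similarity. The sketch below uses toy 3-dimensional vectors standing in for real pretrained word2vec embeddings (the vector values are assumptions for illustration only):

```python
import numpy as np

# Toy 3-d embeddings standing in for pretrained word2vec vectors (hypothetical values).
EMB = {
    "drill": np.array([0.9, 0.1, 0.0]),
    "cordless": np.array([0.7, 0.3, 0.1]),
    "hammer": np.array([0.1, 0.9, 0.2]),
}

def sentence_vector(tokens):
    """Average the embeddings of known tokens (zero vector if none are known)."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(u, v):
    """Cosine similarity, guarding against zero-length vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# A query should score higher against a related product than an unrelated one.
query = sentence_vector(["cordless", "drill"])
print(cosine(query, sentence_vector(["drill"])))   # high similarity
print(cosine(query, sentence_vector(["hammer"])))  # lower similarity
```

The resulting similarity score is a single numeric feature that a Random Forest regressor can combine with the word-overlap features.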
A major concern of the project was the long time needed to process the word embeddings on regular desktop/laptop computers. To reduce processing time, it was necessary to extract the embeddings only of the words found in the datasets. Moreover, some of the datasets were split and processed on different machines. Other possible solutions to this problem are renting cloud computing, grid computing, parallel computing, dedicated servers, etc.