MAPREDUCE - basic recommender system w/ Mahout

This recommender system was built on a CentOS virtual machine running Cloudera. After deploying the client configuration, I started ZooKeeper, Hive, and YARN. I first imported the data set into HDFS, cleaned it with Python, and created a subdirectory to store it. I then created tables in Hive, which made it easier to combine, filter, and analyze the data. Finally, I generated movie recommendations with Mahout based on each user's movie rating history.
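The idea of recommending movies from rating history can be sketched in a few lines of plain Python. This is only an illustration, not the Mahout job itself: it assumes a user-based approach with cosine similarity, and the users, movies, and ratings below are hypothetical.

```python
from math import sqrt

# Hypothetical ratings (not from the project's data set): user -> {movie: rating}
ratings = {
    "alice": {"Alien": 5, "Heat": 4, "Up": 1},
    "bob": {"Alien": 5, "Heat": 5, "Brazil": 4},
    "carol": {"Up": 5, "Brazil": 2},
}


def cosine(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank movies the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for movie, rating in other_ratings.items():
            if movie in ratings[user]:
                continue  # skip movies the user has already rated
            scores[movie] = scores.get(movie, 0.0) + sim * rating
            weights[movie] = weights.get(movie, 0.0) + sim
    predicted = {m: scores[m] / weights[m] for m in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


print(recommend("carol", ratings))
```

Each unseen movie is scored by the ratings of similar users, weighted by how similar they are. Mahout offers several recommender algorithms; which one this project used is not stated, so the choice here is an assumption.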


With Vagrant and Ambari I built a five-node cluster: five virtual machines, two of them masters and three of them slaves. Once the Ambari server was installed, I installed the MySQL connector (Java) and Anaconda (Python). After some customization, I deployed the software installation from Hortonworks. I then tested the master nodes and created the vagrant user’s home directory in HDFS before experimenting with MapReduce.


I created tables, and we can see that each retrieval returns a timestamp. Sparse population means that only the cells that are actually present are stored, so there are few columns and rows and no null values. If we had added more values to the ‘wiki’ key with the put command, those values would have been returned as well. There are simply more values in the ‘student’ key.
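The sparse, timestamped behavior described above can be modeled with a toy in-memory table. This is a sketch only: the `SparseTable` class, the column names, and the values are hypothetical, and it does not use the API of the actual data store.

```python
import time
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse table: only cells that were put are stored."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> {column: (timestamp, value)}

    def put(self, row_key, column, value, timestamp=None):
        # Every stored cell carries a timestamp (milliseconds if not supplied).
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        self._rows[row_key][column] = (ts, value)

    def get(self, row_key):
        # Returns only the cells that exist; absent columns are not filled with nulls.
        return dict(self._rows.get(row_key, {}))


table = SparseTable()
# Hypothetical columns and values, for illustration only.
table.put("student", "name", "Ana", timestamp=1)
table.put("student", "major", "Math", timestamp=2)
table.put("student", "year", "2", timestamp=3)
table.put("wiki", "title", "Hadoop", timestamp=4)

print(table.get("student"))  # three cells, each with its timestamp
print(table.get("wiki"))     # one cell, because only one value was put
```

A `get` on ‘wiki’ returns fewer cells than a `get` on ‘student’ simply because fewer values were put under that key.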


The word count that the inverted index is based on was obtained from Stack Exchange. Anaconda and Fastspark were installed on each of the five nodes using Python, with some added specifications and updates. An inverted index maps each word to the documents that contain it, which makes queries over large data sets fast. A familiar example is the list of results returned when you type a word into a search engine.
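A minimal single-machine sketch of an inverted index is shown here in plain Python. It is not the cluster job used in the project; the function name `build_inverted_index` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical documents, for illustration only.
docs = {
    "d1": "hadoop stores data in hdfs",
    "d2": "spark reads data from hdfs",
    "d3": "hive queries data",
}

index = build_inverted_index(docs)
print(index["hdfs"])  # ids of the documents that mention "hdfs"
```

Looking up a word is then a single dictionary access instead of a scan over every document, which is why a search query can return quickly.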