Apache™ Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes.
An algorithm library for scalable machine learning on Hadoop
Once big data is stored on the Hadoop Distributed File System (HDFS), Mahout provides the data science tools to automatically find meaningful patterns in those big data sets. The Apache Mahout project aims to make it faster and easier to turn big data into big information.
Mahout supports four main data science use cases: collaborative filtering, clustering, classification, and frequent itemset mining.
Mahout provides an implementation of various machine learning algorithms, some in local mode and some in distributed mode (for use with Hadoop). Each algorithm in the Mahout library can be invoked using the Mahout command line.
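To give a sense of what "distributed mode" means in practice, the following is a minimal single-process Python sketch of how a computation can be split into map and reduce phases. It counts how often pairs of items appear together, the kind of building block that frequent itemset mining and item-based recommendation rely on. This is not Mahout code and does not use Hadoop: the function names (`map_phase`, `shuffle`, `reduce_phase`, `cooccurrence_counts`) and the toy baskets are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(basket):
    """Emit ((item_a, item_b), 1) for every unordered pair of distinct items in one basket."""
    for pair in combinations(sorted(set(basket)), 2):
        yield pair, 1


def shuffle(mapped):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one item pair."""
    return key, sum(values)


def cooccurrence_counts(baskets):
    """Run map, shuffle, and reduce in sequence over all baskets."""
    mapped = (kv for basket in baskets for kv in map_phase(basket))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical toy data: each list is one shopping basket.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
]
counts = cooccurrence_counts(baskets)
```

In a real cluster the map calls would run in parallel on different blocks of the input and the framework would handle the grouping. Here everything runs in one process so the data flow is easy to follow.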
The following is a list of algorithms for use in distributed mode (Hadoop-compatible), classified into four categories: collaborative filtering, clustering, classification, and frequent itemset mining. Mahout also includes some machine learning algorithms that can be used locally, but those are not listed here. For a complete list of algorithms, please visit http://mahout.apache.org/users/basics/algorithms.html.
Algorithm | Category | Description |
---|---|---|
Distributed Item-based Collaborative Filtering | Collaborative Filtering | Estimates a user’s preference for one item by looking at their preferences for similar items |
Collaborative Filtering Using a Parallel Matrix Factorization | Collaborative Filtering | Predicts which items a user might prefer among those the user has not yet seen |
Canopy Clustering | Clustering | For preprocessing data before using a K-means or Hierarchical clustering algorithm |
Dirichlet Process Clustering | Clustering | Performs Bayesian mixture modeling |
Fuzzy K-Means | Clustering | Discovers soft clusters where a particular point can belong to more than one cluster |
Hierarchical Clustering | Clustering | Builds a hierarchy of clusters using either an agglomerative “bottom up” or divisive “top down” approach |
K-Means Clustering | Clustering | Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean |
Latent Dirichlet Allocation | Clustering | Automatically and jointly clusters words into “topics” and documents into mixtures of topics |
Mean Shift Clustering | Clustering | For finding modes or clusters in 2-dimensional space, where the number of clusters is unknown |
Minhash Clustering | Clustering | For quickly estimating similarity between two data sets |
Spectral Clustering | Clustering | Clusters points using eigenvectors of matrices derived from the data |
Bayesian | Classification | Used to classify objects into binary categories |
Random Forests | Classification | An ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees |
Parallel FP Growth Algorithm | Frequent Itemset Mining | Analyzes items in a group and then identifies which items typically appear together |
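
As an illustration of one entry in the table, the K-Means description (partition n observations into k clusters, each observation belonging to the cluster with the nearest mean) can be sketched in a few lines of plain Python. This is a local, in-memory sketch and not Mahout's distributed implementation. The function names, the `max_iter` and `seed` parameters, and the sample points are assumptions chosen for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, k=2)
```

A distributed version would typically run the assignment step in mappers and the centroid update in reducers, repeating the job once per iteration.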
Source: Apache Mahout – Hortonworks