“Scalable machine learning library”
Mahout is machine learning software that lets applications
analyse large sets of data. It is a solid Java framework
in the data mining/artificial intelligence area. It is a
machine learning project by the Apache Software Foundation
that tries to build intelligent algorithms that learn from
input data.
Before Mahout, machine learning tasks were difficult to perform
quickly at large scale. Mahout was among the first to take advantage
of Apache Hadoop's power by breaking a complex problem up
into multiple parallel tasks.
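The idea of splitting one large job into independent tasks and merging their partial results can be sketched in plain Java. This is not Hadoop or Mahout code: the class ParallelSumSketch and its methods are hypothetical names chosen for illustration, and parallel streams on a single machine stand in for a cluster.

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelSumSketch {

    // Split the input into fixed-size chunks; each chunk is an independent task.
    static List<double[]> split(double[] data, int chunkSize) {
        List<double[]> chunks = new ArrayList<>();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            double[] chunk = new double[end - start];
            System.arraycopy(data, start, chunk, 0, end - start);
            chunks.add(chunk);
        }
        return chunks;
    }

    // "Map" step: each task computes a partial sum over its own chunk.
    static double partialSum(double[] chunk) {
        double sum = 0.0;
        for (double v : chunk) {
            sum += v;
        }
        return sum;
    }

    // Run the tasks in parallel, then "reduce" the partial sums into one mean.
    static double parallelMean(double[] data, int chunkSize) {
        double total = split(data, chunkSize)
                .parallelStream()
                .mapToDouble(ParallelSumSketch::partialSum)
                .sum();
        return total / data.length;
    }

    public static void main(String[] args) {
        double[] ratings = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // made-up data
        System.out.println(parallelMean(ratings, 3));
    }
}
```

Each chunk can be processed without knowing about the others, which is what makes this kind of work easy to spread over many machines.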
Mahout offers three machine learning techniques: recommendation,
clustering, and classification.
User Info + Community Info = Recommendation
Mahout implements collaborative filtering and other algorithms
used in recommendation systems.
Collaborative filtering (CF) is a technique, popularized by
Amazon and others, that uses user information such as ratings,
clicks, and purchases to provide recommendations to other site users.
CF is often used to recommend consumer
items such as books, music, and movies, but it is also used in
other applications where multiple actors need to
collaborate to narrow down data. Chances are you’ve seen CF in
action on Amazon, as shown in the following figure:
Given a set of users and items, CF applications provide recommendations to the current user of the system.
Four ways of generating recommendations are typical:
All CF approaches end up calculating a notion of similarity
between users and their rated items.
There are many ways to compute similarity, and most CF systems
allow you to plug in different measures so
that you can determine which one works best for your data.
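The notion of pluggable similarity measures can be sketched as follows. This is a minimal plain-Java illustration, not Mahout's API: the Similarity interface and the PEARSON and COSINE constants are hypothetical names, and the sketch assumes both arrays hold ratings for the same items in the same order.

```java
public class SimilaritySketch {

    // Hypothetical plug-in point: any measure that scores two rating vectors.
    interface Similarity {
        double between(double[] a, double[] b);
    }

    // Pearson correlation: +1 when ratings rise and fall together, -1 when opposed.
    static final Similarity PEARSON = (a, b) -> {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) {
            meanA += a[i];
            meanB += b[i];
        }
        meanA /= n;
        meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0 ? 0.0 : cov / denom; // 0 when a user rates everything equally
    };

    // Cosine similarity: the cosine of the angle between the raw rating vectors.
    static final Similarity COSINE = (a, b) -> {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0.0 : dot / denom;
    };

    public static void main(String[] args) {
        // Ratings two users gave to the same three items (made-up data).
        double[] alice = {5, 3, 1};
        double[] bob = {4, 2, 0};
        // Swapping the measure is a one-word change at the call site.
        for (Similarity measure : new Similarity[] {PEARSON, COSINE}) {
            System.out.println(measure.between(alice, bob));
        }
    }
}
```

Because the caller only depends on the interface, trying a different measure on your own data does not require touching the rest of the system.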
Data Model: Storage for users, items, and preferences
User Similarity: Interface defining the similarity between two users
Item Similarity: Interface defining the similarity between two items
Recommender: Interface for providing recommendations
User Neighborhood: Interface for computing a neighborhood of similar users
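To show how the components listed above fit together, here is a small user-based recommender in plain Java. It mirrors the roles (data model, user similarity, user neighborhood, recommender) but does not use Mahout's classes; RecommenderSketch and all of its method names are hypothetical, cosine similarity is just one possible choice, and item similarity is left out because this sketch is user-based.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecommenderSketch {

    // Data model: user -> (item -> preference value).
    private final Map<String, Map<String, Double>> prefs = new HashMap<>();

    void setPreference(String user, String item, double value) {
        prefs.computeIfAbsent(user, u -> new HashMap<>()).put(item, value);
    }

    // User similarity: cosine over the items both users have rated.
    double userSimilarity(String u1, String u2) {
        Map<String, Double> p1 = prefs.get(u1), p2 = prefs.get(u2);
        double dot = 0, n1 = 0, n2 = 0;
        for (Map.Entry<String, Double> e : p1.entrySet()) {
            Double other = p2.get(e.getKey());
            if (other == null) continue; // item not rated by both users
            dot += e.getValue() * other;
            n1 += e.getValue() * e.getValue();
            n2 += other * other;
        }
        return (n1 == 0 || n2 == 0) ? 0.0 : dot / Math.sqrt(n1 * n2);
    }

    // User neighborhood: the n users most similar to the given user.
    List<String> neighborhood(String user, int n) {
        List<String> others = new ArrayList<>(prefs.keySet());
        others.remove(user);
        others.sort((a, b) ->
                Double.compare(userSimilarity(user, b), userSimilarity(user, a)));
        return others.subList(0, Math.min(n, others.size()));
    }

    // Recommender: score unseen items by similarity-weighted neighbor ratings.
    String recommend(String user, int neighbors) {
        Map<String, Double> scores = new HashMap<>();
        for (String other : neighborhood(user, neighbors)) {
            double sim = userSimilarity(user, other);
            for (Map.Entry<String, Double> e : prefs.get(other).entrySet()) {
                if (!prefs.get(user).containsKey(e.getKey())) {
                    scores.merge(e.getKey(), sim * e.getValue(), Double::sum);
                }
            }
        }
        String best = null;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (best == null || e.getValue() > scores.get(best)) best = e.getKey();
        }
        return best; // null when there is nothing new to recommend
    }

    public static void main(String[] args) {
        RecommenderSketch r = new RecommenderSketch();
        // Made-up preferences on a 1-5 scale.
        r.setPreference("alice", "book", 5);
        r.setPreference("alice", "film", 1);
        r.setPreference("bob", "book", 5);
        r.setPreference("bob", "film", 1);
        r.setPreference("bob", "album", 4);
        r.setPreference("carol", "book", 1);
        r.setPreference("carol", "film", 5);
        r.setPreference("carol", "game", 5);
        System.out.println(r.recommend("alice", 1));
    }
}
```

Here the "user info" is alice's own ratings and the "community info" is what her nearest neighbors rated, which is the equation at the top of this section in miniature.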
Clustering is one of the most popular techniques in the
machine learning field. It allows the system
to group numerous entities into separate clusters based
on certain characteristics or features of the entities.
Clustering is all about organizing
items from a given collection into groups of similar items.
Unlike classification, clustering doesn’t group data into
an existing set of known categories.
This is particularly useful when you aren’t sure how to
organize your data in the first place.
A well-known example of clustering is Google News.
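To make "grouping similar items without known categories" concrete, here is a minimal k-means sketch in plain Java. k-means is used only as a familiar example of a clustering algorithm; the class KMeansSketch, the one-dimensional data, and the starting centroids are assumptions made for illustration, and none of this is Mahout code.

```java
import java.util.Arrays;

public class KMeansSketch {

    // Assign each 1-D point to the nearest centroid, then move each centroid
    // to the mean of its points; repeat until no assignment changes.
    // Note: the centroids array is updated in place.
    static int[] cluster(double[] points, double[] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: similarity here is plain distance on a line.
            for (int i = 0; i < points.length; i++) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(points[i] - centroids[c])
                            < Math.abs(points[i] - centroids[nearest])) {
                        nearest = c;
                    }
                }
                if (assignment[i] != nearest) {
                    assignment[i] = nearest;
                    changed = true;
                }
            }
            if (!changed) break; // stopping condition: nothing moved
            // Update step: recompute each centroid as the mean of its members.
            for (int c = 0; c < centroids.length; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) {
                        sum += points[i];
                        count++;
                    }
                }
                if (count > 0) centroids[c] = sum / count; // empty clusters stay put
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 9.0, 9.5, 10.0}; // made-up data
        double[] centroids = {0.0, 5.0};                   // arbitrary starting guesses
        System.out.println(Arrays.toString(cluster(points, centroids, 10)));
        System.out.println(Arrays.toString(centroids));
    }
}
```

No labels are supplied anywhere: the two groups emerge from the distances between the points alone.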
Clustering a collection involves three things:
Mahout supports various clustering techniques
implemented in a distributed fashion.
Distributed/parallel implementations bring a drastic
improvement in the performance of the system
and overcome the hardware-imposed limit on input
data size found in standalone implementations.
Major clustering techniques available in Mahout 0.5 are:
The goal of categorization (often also called classification)
is to label unseen documents, thus grouping them.
Many classification approaches in machine learning calculate a
variety of statistics that associate the features of a
document with the specified label, thus creating a model that can
be used later to classify unseen documents.
Classification uses known data to determine how new data should be
classified into a set of existing categories.
When we mark or unmark an email as spam, we influence our email
classification engine for flagging future spam.
Classification is also called predictive analysis. Computer classification
systems are a form of machine learning
that use learning algorithms to provide a way for computers to make
decisions based on experience and, in the
process, emulate certain forms of human decision making.
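The spam example above can be sketched as a tiny naive Bayes classifier in plain Java. Naive Bayes is chosen here only as a simple instance of "statistics that associate features with a label"; the class SpamClassifierSketch, its methods, and the training sentences are hypothetical, and this is not Mahout's implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SpamClassifierSketch {

    // The model: word counts per label plus document counts per label.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Training: a human-provided label (marking an email) updates the model.
    void train(String text, String label) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, l -> new HashMap<>());
        for (String word : text.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Classification: pick the label with the highest log-probability,
    // using add-one (Laplace) smoothing so unseen words do not zero it out.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : text.toLowerCase().split("\\s+")) {
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best; // null if the model has not been trained yet
    }

    public static void main(String[] args) {
        SpamClassifierSketch model = new SpamClassifierSketch();
        // Made-up training emails.
        model.train("win cash prize now", "spam");
        model.train("cheap prize win win", "spam");
        model.train("meeting agenda for monday", "ham");
        model.train("lunch on monday", "ham");
        System.out.println(model.classify("win a prize"));
        System.out.println(model.classify("monday meeting"));
    }
}
```

Every call to train shifts the counts, which is the sense in which marking or unmarking an email influences how future messages are flagged.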
There are two main phases involved in building a classification system:
The above figure shows how a classification system works.
Inside the dotted lasso is the heart of
the classification system: a training algorithm that trains a model
to emulate human decisions.
A copy of the model is then used in evaluation or in production with
new input examples to estimate the target.
The key ideas listed in this table are discussed in the subsections that follow.
(Author is Big Data Team Lead @ Sesame Technologies Pvt. Ltd.)