Archive for May, 2014



Harnessing the Power of Machine Learning

As a species we are creatures of habit: everything we do has a pattern, and our unique signature is imprinted on it. Phenomena beyond our conscious control also exhibit order at both micro and macro scales. The sequence of normal heartbeats, for example, has characteristics that help distinguish it from pathology. Such examples are not limited to our physiology or biological makeup; they extend well into our collective behavior. Our preferences for products and services in the market have a repetitive structure as well. For example, there are groups of consumers who always prefer Coke over Pepsi, and another group that buys Surf over any other brand of detergent. It would be extremely useful for brands to profile these customers and channel their marketing towards individuals who fit the profile.

Data generated through our activities captures a plethora of information about our identity, likes, dislikes and more. This information has tremendous value in every aspect of human life. Programming computers to unravel this hidden information is what Machine Learning is all about. It is the art and science of deriving insights, patterns and predictions from data. What’s more, it is the core of data science and often the difference between a superficial understanding of the data and a deep insight into it.

Tom Mitchell, in his book Machine Learning, provides a short and simple definition of the field:

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”

In general, there are two aspects of Machine Learning: the algorithms (theory) and their implementation (application). Both are equally important, as they cover different facets of the process of learning from data and examples, unbound by assumptions about the structure of the data involved. The field of Machine Learning provides tools to make decisions from data automatically in order to achieve a goal or solve a problem. To practice Machine Learning, one needs to know both aspects well. It is for this purpose that the School of Data Science has put together a structured learning program for business and technology professionals to learn the Machine Learning theory and practical skills required to predict from data. This exclusive program, called Machine Learning Smackdown, starts on the 12th of June in London. Only a limited number of seats are available, so I encourage you to apply as soon as you can.


Some Machine Learning experts may try to convince you that it is a discipline that takes years of study to master. That is true of every discipline, yet it has never stopped anyone from learning one in sufficient depth in a shorter time. What most people lack, and what makes the task daunting, is proper instruction, something the School of Data Science has considered while meticulously working on the Machine Learning Smackdown courses. The courses are designed for professionals from the industry, services and public sectors who are interested in turning data into meaningful insights and intelligent predictions. These courses (Smackdown rounds) are:

Introduction to Machine Learning

The first round is all about building a solid foundation upon which more can be built. The course will cover the structure of the field, the fundamental skills required to practice it successfully, and the current “hot” topics of Machine Learning. Full of examples and applications, this course will trigger thinking about how Machine Learning can be useful to you. It is a great course for business and technology professionals who want to learn the basics of Machine Learning and its use cases.

Get Started in Machine Learning

A foundation-level course that aims to help you learn the basic principles behind the models and methods of Machine Learning. A series of hands-on examples steps through real-world applications of Machine Learning. Attending this course will enable you to understand the basic concepts, become confident in applying the tools and techniques, and gain a firm foundation from which to explore more advanced methods. All of its lab sessions are done using IPython and Scikit-learn.

Machine Learning Basics with R

A thorough introduction to Machine Learning, with emphasis on classification methods, through the use of the R programming platform. This course has several examples of datasets similar to those in the real world. Along with the hands-on learning of various methods for classification and regression, this course offers an understanding of validation and a selection strategy for the various techniques presented.

Fundamentals of Machine Learning

An advanced course offering a detailed view of a wide selection of Machine Learning methods and paradigms. It also covers Machine Learning theory and several applications of the field. The hands-on part of the course is in the R programming language.

The bottom line is that through these courses you will gain a solid understanding of Machine Learning and lay the foundations of the mindset of a practitioner. You will be able to comprehend a new method’s function and assess its performance, making further learning in the field easier and more interesting. Communication with stakeholders of all sorts will be easier for you and you will have a better appreciation of the signals that lurk in the data available. Most importantly, you will be able to do something useful with it and participate in the big data revolution that is taking place these days. Don’t miss the great opportunity to learn practical Machine Learning to solve real business and social problems.




Everyone can benefit from a solid understanding of data science.

As you have probably figured out by now, data science is my thing. How do I know that? Well, I asked myself the following question: what would I do if I had all the money in the world and all my needs were met? Believe it or not, I would practice and research data science (along with some travelling probably)! Contrary to what some people in the field think, data science is for everyone, regardless of what you do for a living. It’s not limited to the few people, like me, who have a passion for it. That’s because everyone can benefit from a solid understanding of data science.

Now, you may want to learn data science because you want to reap some of its fruits, or you may want to become a full-time professional in the field. If it’s the latter you are after, you may want to check out my book, which is finally becoming available in both paperback and electronic format. Unfortunately it took longer than I thought to get it out there, but there was a good reason for all the delays: we wanted it to be sufficiently good and as free of errors as possible. In this book I mention the various ways in which you can learn the skills needed to become a (good) data scientist. What I don’t mention is the series of data science and machine learning courses that are available at The School of Data Science (as these courses were not finalized at the time I wrote the text).


Persontyle is a social enterprise dedicated to data literacy. Through the educational programs offered by its School of Data Science, you can learn the practical skills required to be a player in this fascinating field and have a good time while doing so. You don’t have to start from the very basics if you already have some knowledge of the field. The courses cover various levels and are all self-contained. All of them also have a practical aspect, making sure that you get some hands-on experience. The best part is that they are all very short, so it is easy to fit them into your schedule. If you are unable to attend on the days they are held, there are also customized solutions, so that you can get training at a time and place of your convenience. This option is particularly good if you work with a team and everyone needs the training.

Note that the School of Data Science also has courses for businesspeople who wish to learn about the business aspects of the field. Perhaps you are interested in hiring a data scientist and want to learn about the field so that you can manage your expectations and assess your candidates better.

Whatever the case, the resources in this post are really useful and can turn into great assets that can bring a lot of value to your career and the organization you belong to. I say “organization” instead of company because data science is not limited to companies only. NPOs and charities can benefit from data science as well (actually this is one of the things that Persontyle promotes), so it’s not limited to the benefit of stockholders. I hope you take your time to look into these resources and find out how they can be beneficial to you, in an educational and enjoyable way.




Bootstrapping Machine Learning is launching today and Persontyle is a proud supporter!


Guest post by Louis Dorard, author of Bootstrapping Machine Learning

Prediction APIs are a growing trend and they are changing the way people approach Data Science. Recently, Persontyle partnered with BigML which is a company that provides one such API. Services like BigML abstract away the complexities of learning models from data and making predictions against these models. Thanks to Prediction APIs, anyone is now in a position to do Machine Learning.

However, apart from a few blog posts here and there, there was no long-form resource to introduce you to Machine Learning through Prediction APIs. All the books on the market will teach you how to implement Machine Learning algorithms. But most people who could benefit from Machine Learning are not willing to invest the time and effort required to understand how these algorithms work. As Bret Victor wrote: “Until machine learning is as accessible and effortless as typing the word ‘learn,’ it will never become widespread.”

I was really excited when I first learnt about Prediction APIs in 2011. I kept an eye on them and eventually decided to write the first guide to using them. Although they are indeed making machine learning quite effortless, people still need to be educated about its possibilities, its limitations, how to prepare the data to learn from, and what to do once a machine learning model has been created. As you can imagine, my core audience is not people wanting to become experts in the field but people looking to leverage these technologies for their apps or businesses. They can be hackers, startuppers, CTOs, lead devs, analysts, … They are not going to become Data Scientists, but rather what you could call Data Artisans, and they can now do things that only Data Scientists could do in the past.

Instead of writing a traditional book, I went for a self-published ebook, inspired by successful self-published authors such as Nathan Barry, Sacha Greif, or even Guy Kawasaki. The ebook is complemented by extra material such as videos, screencasts, tutorials, IPython notebooks, code, datasets, a Virtual Machine, and free subscriptions to BigML. The objective is to save time for anyone who wants to get started with BigML or even Google Prediction API.

For those who need more hands-on training or who want to be able to ask me questions in person, The School of Data Science and I will soon run a workshop on Prediction APIs: stay tuned! In the meantime, you can check out the book and start using Machine Learning within a day!

My goal is to help you create better apps by using Machine Learning and Prediction APIs. If you like you can read more about me and you can follow me on Twitter (@louisdorard) to see what I’m up to.

Download a free sample of the book with a detailed table of contents.




Clustering Algorithms to Tame Big Data

By John MacCuish

Data science and machine learning techniques are transforming science, technology, and industry. Vast quantities of data, big and small, are being explored and are adding business value. Cluster analysis, or clustering, namely finding groups in data, is an extremely important tool within this process (e.g. read this post by Mark van Rijmenam, author of “Think Bigger – Developing a Successful Big Data Strategy for Your Business”).

As humans, we tend to group things. It’s what we do. We can tell apples from oranges by their color and subtle differences in shape; or we divvy up a set of books into fiction and non-fiction, and subdivide them again into poetry and literature, or math books and stat books, etc. Sometimes it is hard to group things, as when distinguishing between friends, colleagues, and acquaintances. That doesn’t stop us. We love, or feel the necessity, to organize, to create tables and taxonomies. We classify things, ideas, and behaviors. Cluster analysis is just such an activity: quantitative methods for finding groups in data. Originally referred to as numerical taxonomy, it leans on the quantitative side of organization, and it explores, learns, and projects classes of items from a set of discriminatory features. It generates a hypothesis about a certain order, whether the found classes are distinct, fuzzy, or overlapping. It is useful wherever there is data, almost without exception: pick an industry, a field, or a discipline, and there is a use for it.


Cluster analysis is a machine learning technique for identifying the groups within a dataset. It groups a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups. Clustering is one of the most widely used techniques and has numerous applications: image segmentation; identifying groups or segments within CRM databases in order to offer more targeted services and products; identifying similar documents within the indexes of search engines like Google; and identifying communities, commonalities, and bands within large groups of users on social networks like Facebook, LinkedIn and Twitter. In a nutshell, clustering methods form the backbone of numerous machine learning applications in Big Data.
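To make the grouping idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm), one of the simplest partitional clustering methods. It is written in plain Python rather than R purely for illustration; the function name `kmeans`, the toy points, and the choice to seed the centroids with the first k points are assumptions of this sketch, not material from the course.

```python
import math

def kmeans(points, k, iterations=20):
    """Minimal k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    # Assumption for this sketch: seed centroids with the first k points
    # so the result is deterministic. Real implementations use smarter seeding.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: centroids no longer move.
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Points that share a label are closer to their own centroid than to the other one, which is the "more similar to each other than to those in other groups" property in its most basic form. On real data the number of clusters and the distance measure are choices that have to be justified, not defaults.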

“London is calling you.” This past February, Ali Syed from Persontyle contacted me via email about the possibility of designing and teaching a course in London on cluster analysis with the R language and environment. Given how clustering theory and applications continue to grow, coupled with the expanding utility and flexibility of R and R clustering software for data scientists, researchers and industry professionals, I jumped at the chance to design and teach the course. Having spent the past twenty years working with data and machine learning in applied settings across numerous industries and disciplines, originally much of it in S, but in the last X years in R, it is a great pleasure to put my experience and knowledge of cluster analysis into a course setting. R contains scores of packages and hundreds of functions devoted to cluster analysis (clustering algorithms, validation, visualization, analysis of results, etc.), and the ease of creating the course with the RStudio IDE and R Markdown for reproducible web authoring made accepting the offer to teach the course that much more exciting.


Cluster Analysis with R is a comprehensive course covering both the applications and the theory of clustering. The R language and environment is uniquely positioned to provide, effectively and economically, the full range of clustering methods along with the visualization graphics needed to present clustering concepts and applications. R is very flexible, which will allow the students to investigate methods and data more fully. Real-world and simulated data will be used throughout the course.

Data will be analyzed from fields as diverse as marketing and ecology, finance and drug discovery, bioinformatics and psychometrics. Customer segmentation for marketing or studying the impact of churning, species dispersal or ordination, designing diverse compound libraries, studying gene expression, determining cliques and cohorts in a population, finding social communities in social media, all involve the application of cluster analysis.

This course will reward those with a very limited understanding of clustering and those who have some experience with clustering but want a broader and deeper understanding of cluster methods and analysis. Participants will come away with an understanding of a full set of clustering tools and the theory behind them, so that they will know the subset of tools to use for their applications … and why. They will learn that clustering is not a simple cookbook exercise, and that they will have to bring a good many methods and creativity to bear on each clustering problem. In the R coding labs, the participants will work with extensive R code examples, and, having used and explored them, they will develop the fluency necessary to modify and extend them to their specific projects.

After attending the course, participants will be able to use R to perform exploratory data analysis, feature selection, dimension reduction, and further pre-processing necessary for the use of their data for clustering. They will be able to find and use the appropriate (dis)similarity measure(s) (symmetric and asymmetric), choose the appropriate set of clustering algorithms to explore from among the different types (partitional, hierarchical, hybrid, online, graphical, asymmetric, co-clustering, etc.), given the size and type of their data. They will be able to employ the various validation and visualization R functions on their clustering results, and have the tools to explore cluster stability, plasticity, and ambiguity.
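As a rough illustration of what symmetric and asymmetric (dis)similarity measures look like, the sketch below computes Euclidean distance on numeric vectors, Jaccard dissimilarity on sets, and a dissimilarity derived from the Tversky index, which becomes asymmetric when its two weights differ. It is written in Python rather than R purely for illustration; the function names, the default weights, and the example baskets are assumptions of this sketch and are not taken from the course material.

```python
import math

def euclidean(x, y):
    """Symmetric distance between two numeric vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard_dissimilarity(a, b):
    """Symmetric dissimilarity between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0  # Two empty sets are treated as identical.
    return 1.0 - len(a & b) / len(union)

def tversky_dissimilarity(a, b, alpha=1.0, beta=0.0):
    """1 - Tversky index; asymmetric whenever alpha != beta.

    With alpha=1 and beta=0 it measures how much of A is missing from B.
    """
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    if denom == 0:
        return 0.0
    return 1.0 - common / denom

# Hypothetical shopping baskets used only for illustration.
small = {"coke", "detergent"}
large = {"coke", "detergent", "bread", "milk"}

print(euclidean((0.0, 0.0), (3.0, 4.0)))      # symmetric, numeric data
print(jaccard_dissimilarity(small, large))    # symmetric, set data
print(tversky_dissimilarity(small, large))    # small is fully contained in large
print(tversky_dissimilarity(large, small))    # large is only half covered by small
```

The last two calls give different values for the same pair of baskets, which is exactly what "asymmetric" means here: the answer depends on which item is treated as the reference.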

I’ve designed this course for people interested in developing practical skills in implementing clustering algorithms using R. Along with basic experience of programming in R and some knowledge of statistics, you’ll need an inquisitive mind and curiosity about analyzing data for insights and predictions, and about how best to group it. I look forward to meeting you in class, where together we will do some clustering!

About the Author

John D. MacCuish is the founder and president of Mesa Analytics & Computing, Inc. He has co-authored several software patents and has worked on many image processing, data mining, and statistical modeling applications, including IRS fraud detection, credit card fraud detection, and automated reasoning systems for drug discovery.
