Archive for the ‘Data Science Practices’ Category



Azure Machine Learning Guide from Scratch.

Many people are unfamiliar with one of Microsoft’s best predictive modeling tools: Azure Machine Learning Studio. In this post I give a basic hands-on introduction to Azure ML Studio by building a census income predictor. We will use a classification technique to predict income from a sample dataset that already ships with Azure Machine Learning Studio. We will learn the following in this tutorial:
  • How to transform data in Azure ML
  • How to make a predictive model
  • How to make a web service
  • How to consume the web service
Let’s go step by step.
  • Make a free Azure Machine Learning Studio account. The free trial version gives you enough space for experimentation.
  • Go to New Experiment and you will see your workspace.
This is how your workspace will look. On the left are the modules that you will use to transform the data and build the model. I won’t go into technical details; I will keep the tutorial as simple as possible. At the top right, you can see Saved Datasets. Click on it and you will see two drop-downs:
  1. Saved Datasets
  2. Samples
Saved Datasets are the datasets that you upload to your workspace. They remain available for use at any time with a click. Samples are the datasets that ship with Azure Machine Learning Studio by default. Let’s drag and drop the Adult Income Data onto the workspace.

The first thing we will do is visualize the dataset. This is the most important step before doing any transformation because, unless we know what’s in the data, we would be shooting in the dark. Right-click the module and click Visualize to see the contents of the dataset. You can click on each variable/column to see its histogram on the right-hand side. You will also come across descriptive statistics, which give you a quick peek into your data.

Let’s now move on to the next step. As you are well aware, the world is full of noise, dirt and mess. No data comes with flowers; most of it comes with tangible thorns. Missing values are a cancer to the data and we should always get rid of them. Azure ML provides a Missing Values Scrubber module for that. Drag the module onto the workspace and connect the dataset node to it.

Click on it and, on the right-hand side, you will see its properties.

Select Custom Substitution Value and replace all the missing values with 0.
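To make the substitution step concrete outside of Studio, here is a minimal sketch in plain Python. It is not what the Missing Values Scrubber runs internally; the rows, column names and the `replace_missing` helper are hypothetical stand-ins, and `None` is assumed to mark a missing value.

```python
# Hypothetical rows standing in for the Adult Income sample;
# None marks a missing value (an assumption for this sketch).
rows = [
    {"age": 39, "hours_per_week": 40, "capital_gain": None},
    {"age": None, "hours_per_week": 13, "capital_gain": 0},
    {"age": 38, "hours_per_week": None, "capital_gain": 2174},
]


def replace_missing(records, substitute=0):
    """Return copies of the records with every None replaced by `substitute`."""
    return [
        {key: (substitute if value is None else value) for key, value in record.items()}
        for record in records
    ]


cleaned = replace_missing(rows)
for record in cleaned:
    print(record)
```

Substituting 0 everywhere is the simplest choice and mirrors the tutorial; for a real project a per-column value such as the median may distort the data less.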

Let’s continue with the data transformation. One of the most important modules, one that you will be working with all your life, is Project Columns. This module is used to project, include, and exclude columns. For income prediction, we have to decide which columns are necessary for us and which are not. First we search for Project Columns, add it to the workspace and attach its node to Missing Values Scrubber. We click Launch Column Selector in the properties and a window pops up. We exclude the columns that we believe may not be good features for income. To learn more about feature engineering, I would recommend taking a detailed edX course on feature engineering. Now if you right-click on the Project Columns node and visualize the data, you will see that the excluded columns are gone from the dataset.

Modelling

Now that we are done with data transformation, it’s time to kick into modelling. We go back to the search pane and search for Split Data. Drag the module onto the workspace and connect its node to Project Columns. Splitting data into test and train sets is an important concept in machine learning. When building a predictive model, a common choice is to put 60% of the data in a partition for training the model and 40% in a partition for testing. The ratio can differ for different problems and data, but the basic idea remains the same. For more details, you can have a quick read about machine learning concepts on the internet. Here I have split the data 50-50.

It’s time to train the model. Click on Machine Learning > Train and drag the Train Model module onto the workspace. Remember, the Split Data module has two output nodes: one goes into Train Model, and I will explain the other one later in the post. Click on Train Model and go to the properties on the right-hand side, where you will see Launch Column Selector. Click it and a window will pop up. You have to specify your target variable here.
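The Split Data step described above can be sketched in plain Python as a shuffle followed by a cut. This is only an illustration of the idea, not the module’s implementation; the `split_rows` helper, the seed and the numbered rows are hypothetical.

```python
import random


def split_rows(records, train_fraction=0.5, seed=42):
    """Shuffle a copy of the records and cut it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the projected dataset: ten numbered rows.
rows = [{"row_id": i} for i in range(10)]

train, test = split_rows(rows, train_fraction=0.5)
print(len(train), len(test))  # prints "5 5"
```

Every row ends up in exactly one of the two partitions, which is what lets the test partition act as unseen data when the model is scored later.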
For our case, the target variable is income. Now that you have specified the target variable, it’s time to pick an algorithm for training. Ours is a classification problem: we want to see whether the income of an individual is greater than or less than 50k. So we will use a classification algorithm, and I will use Boosted Decision Trees. We connect the algorithm to the other input node of Train Model.

What are decision trees? In simple words, the algorithm estimates the probability of an event by building trees. More details about the algorithm can be read here.

The next step is to score the model results. Scoring the model gives us an output column with the scored prediction results. Now is the time to tell you about the other output node of Split Data: it carries the test data and goes into the right-hand input node of Score Model, while Train Model connects to the left-hand input node. Run the model and wait until green tick marks appear on the modules. Once it is done, right-click on Score Model and visualize it. You will find two more columns at the end, one named Scored Labels and the other Scored Probabilities.

Comparing the results with the original income column, we can see that our model has done pretty well, but is there a way to evaluate it? Let’s move on. Drag the Evaluate Model module and connect it to Score Model. Run the model once again and, when the tick is green, right-click on Evaluate Model and visualize the results. Here our model accuracy is about 91%, which means our model is doing pretty well. We can also see the precision, recall, F-measure and other results here.

Web Service

One of the best features of Azure Machine Learning Studio is that it lets you build a web service that can be accessed from .NET, Python or an R Shiny app. We are now going to build a web service on top of our predictive model. It is a piece of cake.
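Before deploying, it may help to see what the figures reported by Evaluate Model mean. The sketch below computes accuracy, precision, recall and F-measure from a list of actual and predicted labels using the standard definitions. The labels are made up for illustration and are not the tutorial’s results, and the `evaluate` helper is hypothetical.

```python
def evaluate(actual, predicted, positive=">50K"):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only (not the output of the experiment).
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K"]
predicted = [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"]

metrics = evaluate(actual, predicted)
print(metrics)
```

With these eight made-up rows, six predictions match, so the accuracy is 0.75. Accuracy alone can look good on an imbalanced income column, which is why precision and recall are worth reading alongside it.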
Click on Setup Web Service and then Deploy Web Service. You should then see a page showing your API key, which you will use to call the web service from a third-party application. We will explain that later in the tutorial. You can test your web service by clicking on Test. Click on Request/Response and you will see the parameters needed to use the service in a third-party environment. In the middle of the page, you will see all the parameters that your data uses, along with the JSON string format. At the end of the page, you will see the sample code generated by Azure ML Studio.

To make a real-time web application, we will be using Python in this tutorial. You can make use of .NET or R Shiny as per your skills. You can download the Python-based web app, which is built with the Django framework, via this link. To view the application running live, please click here. Most hosting websites don’t support Python, so I have hosted the application on my personal AWS instance.

If you have found this tutorial useful, please feel free to comment.

Usman is an aspiring data scientist. He tweets @rana_usman.
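As a rough illustration of the Request/Response step in the tutorial above, the sketch below builds (but does not send) a request with Python’s standard library. The payload shape, the `input1` name, the column names, the URL and the key are all assumptions or placeholders; the authoritative schema and sample code are the ones shown on your own service’s Request/Response page.

```python
import json
import urllib.request


def build_request(url, api_key, column_names, values):
    """Build a JSON POST request carrying one or more rows to score."""
    # Assumed payload shape; check it against your Request/Response page.
    payload = {
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": values}},
        "GlobalParameters": {},
    }
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,  # the API key from the deploy page
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder URL, key and columns: replace with the values from your service.
request = build_request(
    url="https://example.invalid/execute",
    api_key="YOUR_API_KEY",
    column_names=["age", "education", "hours-per-week"],
    values=[["39", "Bachelors", "40"]],
)
print(request.get_method(), request.full_url)

# To actually call the service, uncomment the lines below:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Keeping request construction separate from the network call makes it easy to inspect the JSON before sending it, and to reuse the same function inside a Django view.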



Data science is emerging as a hot new profession and academic discipline.


Yes, because data science has the potential to revolutionize the way business, government, science, research, and healthcare are carried out. Data science is emerging as a hot new profession and academic discipline. In my opinion, 2015 will see data science becoming a mainstream career choice. Data science has come a long way – but the evolution is only beginning.

The focus of this post is not to convince cynics. In fact, I’m not interested in them at all. I don’t want to waste a second of my time on cynics, wimps, and haters. Their cynicism leads to mediocrity. Umair Haque, one of my favorite thinkers and writers, eloquently said that “there’s nothing more poisonous to self-belief than people who tell you what you cannot do.”

Today, I want to share a couple of interesting updates that highlight the significance of data literacy and why it is a fundamental skill for all professionals. In a massively connected digital world, it is imperative that the workforce of today and tomorrow is able to understand what data is available and use scientific methods to analyze and interpret it.

Hal Varian, Google’s chief economist, said in his 2009 interview for The McKinsey Quarterly:

“The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it. I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills – of being able to access, understand, and communicate the insights you get from data analysis – are going to be extremely important. Managers need to be able to access and understand the data themselves.” – Hal Varian

Wait a minute. Is this a pitch for Persontyle? I think so…


Persontyle is a social enterprise dedicated to data literacy. It was born from the idea of creating a platform for the people, by the people, to share the passion, knowledge, theory and practice of analyzing data scientifically. Most people are focused on making machines smarter. We are focused on making people smarter. Through the various educational programs offered by its School of Data Science, you can learn the practical skills required to be a player in this fascinating field and have a good time while doing so. You don’t have to start from the very basics if you are already knowledgeable about some things. The courses, workshops, and bootcamps it offers cover various levels and are all self-contained. All the learning experiences also have a practical aspect, making sure that you get some hands-on experience. The best part is that they are all very short, so it is easy to fit them into your schedule. If you are unable to attend on the days that they are held, there are also customized solutions, so that you can get training at a time and place of your convenience. This option is particularly good if you work with a data science project team and everyone needs the training.

Data science is scorching hot right now, in headlines, board rooms, universities and philanthropy. As the world becomes more and more aware of the benefits of data science, more and more organizations are looking for ways to harness the value that lingers in the big data they have access to. It has become clear that big data is not just a buzzword; it is a whole new world of potential, waiting to be explored.

Develop the most demanded skill for the 21st century.

We believe data literacy will be the fundamental skill for the 21st century. In 2015, we are launching some exciting new data science and big data engineering learning opportunities. We are also collaborating with our partners to offer data science training programs in the US and the Middle East.


Data science is all about building teams and culture.

It is crazy to think that a doctor must know everything, and it is just as crazy to think a data scientist should be an expert in machine learning, statistics, hacking, programming, application development, production deployment, etc. Data science is a team sport. All players have a role to play, and everyone contributes. Somebody has to bring the data together, somebody has to build the data workflows and do the data engineering work, someone needs to apply the machine learning models, someone needs to be there to challenge the results, and so on and so forth. The important thing to note here is that everyone involved in the data science project lifecycle should be aware of the limitations of their expertise and knowledge, and be willing to call in help when required.

“People make a mistake by forgetting that Data Science is a team sport. People might point to people like me or (Jeff) Hammerbacher or Hilary (Mason) or Peter Norvig and they say, oh look at these people! It’s false, it’s totally false, there’s not one single data scientist that does it all on their own.” – DJ Patil

As data science evolves to become a business necessity, the importance of assembling strong and innovative data teams grows. Assembling a group of talented people with diverse skills is the best way to meet your data science needs. Engage Persontyle data science and data engineering experts to conduct a free Data Science Talent Strategy Workshop for your organization. Book now!

Beyond data science: Advancing data literacy.

The first piece I want to share is by Leslie Bradshaw. It is an excellent post for understanding the concept of data literacy and why contextualizing, storytelling, and visualizing are equally important tools for better, smarter, faster, and more reliably predictive decision making.

“From public policy to sports to finance to health to economics to businesses to citizens to elected officials to education… so many aspects of our individual lives can and will be made better through including more disciplines in the science of data as it evolves to become a literacy of data.” – Leslie Bradshaw

Read the full post here

The 25 hottest skills that got people hired in 2014.

The second post I’m sharing is from LinkedIn, listing the 25 hottest skills that got people hired in 2014. To determine which skills were most in demand in 2014, LinkedIn data specialists analyzed more than 330 million profiles. Notice that 5 of the skills mentioned are directly related to data literacy and data science.


Read the full post here

Become data literate in 2015.

The idea that data scientists are as mythical as unicorns is simply false. Only wannabes, protectors of mediocrity and industrial-age pundits call the data scientist a unicorn. You should not believe the fantasy that data scientists are mythical, and you should definitely not neglect developing data science capabilities.

I’ve said it before and I’m saying it again: we can all learn and apply data science, and make a lot of mistakes along the way. In a complex world, the process of trial and error is essential. We need to promote a culture of experimentation. Understand and embrace the concept of trial and error.

Data Science = Ask questions and challenge the status quo to deliver meaningful value. To me, the essence of data science is to break the rules and challenge the status quo by building new models that make the existing models obsolete. Because greatness doesn’t stem from merely counting what can be counted.

“There are two rules I’ve always tried to live by: turn left, if you’re supposed to turn right; go through any door that you’re not supposed to enter. It’s the only way to fight your way through to any kind of authentic feeling in a world beset by fakery.” – Malcolm McLaren

Happy 2015! A new year. A new beginning. Learn something new. Become data literate in 2015. Break the rules. And never give up on your dreams!

All the best!

Ali Syed




Why Data Science Needs Social Science. And Vice Versa!


By Dr. Pawel Kobylinski

Upon joining the Persontyle team, I have decided to introduce myself to the readers of this blog by giving you a rationale behind the connection between social science and data science. Persontyle wants to help in transferring both ideas and strict technical know-how between the two disciplines mentioned. Whether you are a social scientist willing to learn R or a data scientist eager to grasp Design of Experiments and psychometrics, we are prepared to give you a hand.

Last year Harris, Murphy, and Vaisman (2013) surveyed over 250 data scientists from around the globe. The authors wanted to answer a question fundamental to data science: what are the educational and professional backgrounds of people who in recent years ended up as data scientists? The authors report that the data favours a scientific rather than a tool-based education for data scientists. And what were the dominant fields among the scientifically educated? Let me cite the “Analyzing the Analyzers” report: “social or physical sciences, but not math, computer science, statistics, or engineering”.


Surprised? I was a bit. The results made me wonder whether the surveyed sample was representative of the data science population. Whether the “Self-Identification” part of the survey questionnaire was reliable and valid (in the strict psychometric sense). And, last but not least, whether the questionnaire was missing something crucial, namely the measurement of latent personality variables which possibly mediate or moderate the reported overt effects… Forgive me the dense language of the last sentences; I have been trained by experts in how to care about the quality and meaning of data… By academic social scientists.

Quite in line with the reported results – not by statisticians, mathematicians, and programmers, to whom I am grateful for teaching me how to deal with numbers. Math people have a great privilege to explore a beautiful, pure, abstract universe in which numbers are disconnected from everyday meaning. Computer science experts are obviously focused on the fascinating and rapidly developing technological tools for processing digitized data. Social science on the other hand has developed a very strict methodological apparatus allowing for measurement and quantification of usually messy social and psychological reality. Furthermore, social scientists are familiar with basic statistical notions: sampling, measurement error, correlation, causation, prediction, statistical inference, etc. Many are acquainted with a bunch of quite complex methods, like repeated measures ANOVA, factor analysis, or regression, the latter often considered as a machine learning method.

If so, there should be no surprise at all: a solid social science preparation is a great starting point for a data science career. Within the data science field we tend to algorithmize and automate analytics. What social science people do in SPSS or Statistica, we do by means of coding. Why coding? Just because we process data on an everyday basis and it turns out to be convenient to have scripted procedures and programs that can be used over and over again, tweaked and combined into larger ensembles. So, if a social scientist dreams of data science, the first step (no leap at all) is obvious: learn statistical programming. Which programming language to choose as the first one? The most established, widespread, comprehensive and free at the same time: R. Having mastered R, you will have to put some effort into learning the basic technicalities of databases (all those SQLs and Hadoops are just fancy data containers). And then, voilà, you have become a data scientist. Of course, this is not the end of the highway. After gaining some experience you will find yourself ready for further considerations, for example: are there algorithms that can conduct my analysis faster and on big data sets? Yes there are, you can google them. Or maybe you will find yourself in a data science team in which computer science experts take care of the algorithmic efficiency of big data processing tools, and you are the one responsible for research strategy and analytics prototyping in R. After all, you come from a social science background; you are prepared to design a research process.

As the use of big data attracts more and more attention, the significance of social science becomes increasingly apparent. Dr. Rebecca Eynon’s presentation at the HEA summit earlier this year highlights the challenges and opportunities of using big data for social science.

We know from our experience that when disciplines crash into each other great stuff can emerge. I honestly think it is hard to overstate the significance of social science to perform meaningful data science and vice versa. It is time to say goodbye to academic silos as we enter into a new age of cross/multi/and interdisciplinary work.

I encourage you to read this excellent post by Dr. Emma Uprichard from the University of Warwick to explore the reasons why, without social science, big data cannot deal with big questions. And check out the work Dr. Patrick Dunleavy, Dr. Simon Bastow and Jane Tinkler are doing to understand the impact on the social sciences of the new methods coming in from the STEM sciences, and the opportunities they present.

Data science needs to care more about data quality and meaning. The field needs to learn more about topics like research design, sampling design, Design of Experiments, psychometrics, latent variables and constructs, to name a few. Without social science know-how, all the fancy data science technology may find itself in danger of processing trash. GIGO: garbage in, garbage out. The other way round, social science needs to incorporate the new tools developed by data scientists. Not being able to lay its hands on the large amount of digitized social data flowing all around us, academia loses a heck of a lot of research opportunities. The connection between data science and social science will inevitably tighten. The latest (in)famous Facebook experiment, whatever its ethical implications, proves it is already happening. Here at Persontyle we are aware of the convergence process. Don’t let yourself be left behind.




Learn Powerful Machine Learning Methods for Gaining Insight into Data


Machine Learning is without a doubt the core aspect of data science and predictive analytics in general. Its intuitive, versatile and robust approach to finding patterns in the available data makes it a priceless asset for anyone who wants to turn data into insights. What’s more, today it is more accessible than ever before, thanks to the variety of libraries in the programming languages used for Machine Learning and predictive modelling. This is particularly true for R, the open-source programming platform that specializes in this kind of task.


Up until relatively recently, R was thought of as a tool for statisticians, mainly because it handles statistical models proficiently through its various statistical tools. However, it has since been upgraded through the introduction of a variety of libraries that contain efficient implementations of several Machine Learning algorithms. In addition, the introduction of parallelization libraries has enabled R to run on computer clusters, where big data dwells. What’s more, due to its open-source license, R has attracted several practitioners who developed workshops, tutorials, etc., making it easier to learn than anything else in the Data Science field.

Machine Learning is also quite accessible today, due to the variety of books on it. However, most of them have underlying assumptions about what you know, and they put a lot of emphasis either on the programming aspect or on the mathematical dimension of the methods covered. It’s very difficult to find a resource that explains the ideas behind the algorithms and walks you through their implementation and the interpretation of their results, without getting overly technical.

What many people tend to forget is that Machine Learning can be quite enjoyable too. This is because it allows for a great deal of creativity in both the development of new algorithms and the implementation (and tweaking) of existing ones. Plus, all that results in getting a computer to do something intelligent that can provide value for you and your organization, bringing about a sense of accomplishment. What’s more, Machine Learning can hone your problem-solving skills and turn difficult problems into intriguing challenges that can be very educational too.

Naturally, learning Machine Learning is also about employability. Today, as more and more organizations become aware of the value of data analytics (esp. in a Data Science setting), the demand for Machine Learning practitioners has exceeded the supply. This is why there are so many books on the topic as well as a variety of university courses. However, unless you are very methodical and have lots of time, reading books won’t cut it, especially if you are looking for a job in the field sometime soon. Besides, you can’t put books on a resume. As for university courses, these too take time, and they are often quite pricey. For this and all the other aforementioned reasons, the School of Data Science has put forward a 2-day Machine Learning workshop, an efficient way to learn the essential aspects of the field, using the R platform, at a very reasonable price.


The idea of this and all other similar learning programs developed by School of Data Science is to make Machine Learning accessible to everyone who wants to get into it, without spending months on it (you can go in more depth afterwards, on your own if you want). Familiarizing yourself with the basic concepts and getting some experience on how they are applied will enable you to get into the field faster plus it will spur your enthusiasm about this fascinating field.

This workshop aims to develop basic understanding of Machine Learning based on supervised learning methods, through the use of the R programming platform. It describes the different types of learning and the two main categories of their applications: Classification and Regression. With a focus on the former, it takes a close look at typical Machine Learning techniques and how they apply on datasets akin to those encountered in the real world.

Our goal is to give you the basic skills that you need to understand supervised machine learning algorithms and models, and to interpret their output, which is important for solving a range of data science problems. This is an applied Machine Learning course, and we focus on the intuitions and practical know-how needed to get Machine Learning algorithms to work in practice, rather than on the mathematical equations and derivations.

Great opportunity for programmers, business analysts, technology consultants and all mortals interested in Machine Learning to learn several methods for building Machine Learning applications that solve different real-world tasks. Lots of hands-on labs to step through real-world applications of Machine Learning.

Read and download the workshop brochure.

Special Offer – 30% Discount!

Please take a moment to register now and avail the special 30% discount offered. Visit the event page and use the promo code MLBR30 to get 30% off.

So, join us for a packed, holistic, and enjoyable workshop this August and embark on an educational adventure in the world of Machine Learning. If you have any questions about the workshop or registration please feel free to contact me or email at

Happy Machine Learning!

Dr. Zacharias Voulgaris




The real innovation in big data is human innovation.


By Ali Syed

The digital world is continuously churning out vast amounts of data, and the volume is growing ever vaster, ever more rapidly. Some analysts are saying that we are producing more than 200 exabytes of data each year. We’ve heard many times that, managed well, this (big) data can be used to unlock new sources of economic value, provide fresh insights into science, hold governments to account, spot business trends, prevent diseases, combat crime and so on.

Over the past decade (the noughties), we have witnessed the benefits of data, from personalized movie recommendations to smarter drug discovery – the list goes on and on. Joe Hellerstein, a computer scientist from the University of California, Berkeley, called it “the industrial revolution of data”. The effects are being felt everywhere, from business to science, from government to society.

“You are thus right to note that one of the impetuses is that social as well as cultural, economic and political consequences are not being attended to as the focus is primarily on analytic and storage issues.” Evelyn Ruppert, Editor Big Data and Society

At the same time, this data deluge is resulting in deep social, political and economic consequences. What we are seeing is the ability to build economies around data, and that to me is the big change at a societal and even macroeconomic level. Data has become the new raw material: an economic input almost on a par with capital and labour.

Organizations need data from multiple systems to make decisions. They need data in an easy-to-understand, consistent format to enable fast understanding and reaction. They are now trying to capture every click because storage is cheap. The customer base is harder to define and constantly changing. While all this is happening, the expectation is to be able to answer questions quickly. Everyone is saying that “Reports” don’t satisfy the need any more.

The global economy has entered the age of volatility and uncertainty: a faster-paced economic environment that shifts gears suddenly and unexpectedly. Product life cycles are shorter and time to market is shorter. Ours is an instant-gratification society, one that expects quick answers and more flexibility than ever. Consequently, the world of business is always in the midst of a shift, required to deal with changing economic and social realities.

The combination of dealing with the complexities of the volatile digital world, the data deluge, and the pressing need to stay competitive and relevant has sharpened the focus on using data science within organisations. At organisations in every industry, in every part of the world, business leaders wonder whether they are getting true value from the massive amounts of data they already have within and outside their organisations. New technologies, sensors and devices are collecting more data than ever before, yet many organisations are still looking for better ways to obtain value from their data.

The strategic ability to analyse, predict and generate meaningful and valuable insights from data is becoming the topmost priority of information leaders, a.k.a. CIOs. Organisations need to know what is happening now, what is likely to happen next, and what actions should be taken to get the optimal results. Behind rising expectations for deeper insights and performance is a flood of data that has created an entirely new set of assets just waiting to be applied. Businesses want deeper insights into the choices, buying behaviours and patterns of their customers. They desire an up-to-date understanding of their operations, processes, functions and controls, and seek information about the financial health of their entire value chain, as well as the socio-economic and environmental consequences of both near-term and distant events.

“Every day I wake up and ask, ‘how can I flow data better, manage data better, analyse data better?’” – Rollin Ford, CIO of Wal-Mart


Although business leaders have realized there’s value in data, getting to that value has remained a big challenge in most businesses. Friends in industry have cited many challenges, and none can be discounted or minimized: executive sponsorship of data science projects, combining disparate data sets, data quality and access, governance, analytic talent and culture all matter and need to be addressed in time. In my discussions with business executives, I have repeatedly heard that data science initiatives aligned to a specific organisational challenge makes it easier to overcome a wide range of obstacles.

Data promises so much to organisations that embrace it as an essential element of their strategy. Above all, it gives them the insights they need to make faster, smarter and more relevant decisions – in a connected world where understanding and acting in time means survival. To derive value from data, organisations need an integrated insight ecosystem of people, process, technology and governance to capture and organize a wide variety of data types from different sources, and to be able to easily analyse it within the context of all the data.

We are all convinced that data, as the fabric of the digital age, underpins everything we do. It’s part and parcel of our digital existence; there is no escape from it. What is required is that we focus on converting big data into useful data. We now have the tools and capabilities to ask questions, challenge the status quo and deliver meaningful value using data. In my opinion, organizations and business leaders should focus more on how to minimise the growing divide between those that realise the potential of data, and those with the skills to process, analyse and predict from it. It’s not about data, it’s about people. The real innovation in big data is human innovation.

“The truth is, that we need more, not less, data interpretation to deal with the onslaught of information that constitutes big data. The bottleneck in making sense of the world’s most intractable problems is not a lack of data, it is our inability to analyse and interpret it all.” – Christian Madsbjerg


Whether you are looking to amplify value and impact using data or you want to analyse data scientifically for insights and to ask relevant questions, Persontyle Services is the place to start to ensure you have what you need to strategize, design, implement, and fulfil your analytic needs. We will collaboratively work with you from data to insight and on to impact and value.




Machine Learning: The key to transforming big data into big insights and big opportunities


You’ll probably have heard the term many times before. Nowadays it’s hard to come by an article on trending technologies (particularly information-related ones) without some reference to Machine Learning in it. It seems that suddenly the world has become aware of the immense practical aspects of this evergreen field of Artificial Intelligence that constitutes the heart of data science. Machine learning is a computer’s way of learning from data and examples. It’s a type of machine intelligence, and will be one of the technological disruptions of the coming years.

Many people see Machine Learning as a high-tech field that only the selected few can understand and practice, while others see it as merely glorified programming. As in many other cases, the truth lies somewhere in between. Machine Learning is not an esoteric discipline as it once was, in its earlier stages of development. It has grown very popular and therefore accessible, with a variety of open-source libraries in R, Python, and other programming languages. Also, its theory has become more structured and easier to understand, while most of the methods it entails have been tested over many years in a variety of datasets. Still, Machine Learning is not trivial and involves more than just writing code. It requires a lot of work to learn, though doing a specialized degree on it is unnecessary, unless you are really into research. Yet, despite the variety of literature out there, it is very hard to learn it properly on your own.

Machine Learning is used widely today for all kinds of tasks, from churn prediction in large companies, to web search, to medical diagnostics, to robotics (this in particular would have been next to impossible without Machine Learning). It’s hard to find a field that cannot benefit from Machine Learning in one way or another. The reason is simple: data abundance. With all kinds of data floating around, it is natural to gather meaningful combinations of it (creating what is known in Machine Learning as “features”), and use them to make useful predictions about the world, particularly aspects of it that pose some value to us. You can think of it as cooking skills in an environment where there is easy access to a large variety of cooking ingredients and everyone there has quite an appetite. What’s more, Machine Learning is getting better all the time, so it is quite unlikely that it will run out of methods that can turn data into valuable information more efficiently and more effectively.

But don’t take anyone’s word for all this. Look around at Machine Learning practitioners and their lives. Few of them are sitting idle. Most of them, particularly the more adept ones, earn a decent living and often win prizes at Machine Learning competitions. What’s even more important is that they usually have a good time doing what they do, because this is a line of work which is both manageable and challenging at the same time. If you are into programming, it makes it so much more interesting as it allows for the development of better quality applications, some of which can be marketed as intelligent or predictive applications.


As mentioned earlier, Machine Learning takes some effort to learn, but the whole process becomes much easier when it is done in a systematic and engaging way, with an experienced professional as your guide. This is why School of Data Science has created a series of courses, the Machine Learning Smackdown, that provide you all the help you need to learn Machine Learning properly, gaining some hands-on experience in the process. Completing the Machine Learning Smackdown will turn you into a competent Machine Learning practitioner, able to tackle real-world challenges, turning big data into big insights and big opportunities. Are you ready?

Register now for the 2nd Round, a five day bootcamp starting on the 21st of July to learn basic building blocks of practical Machine Learning using Python Scikit-Learn.

Machine Learning is the best way to exploit the opportunity presented by Big Data.




Harnessing the Power of Machine Learning

As a species we are creatures of habit: everything we do has a pattern, and our unique signature is imprinted on it. Phenomena that are beyond our conscious control also exhibit some micro and macro order. The sequence of normal heart beats has unique characteristics that help in distinguishing it from pathology. Examples are not limited to our physiology or biological makeup; they extend well into our collective behavior. Our preferences for various products and services in the market have a repetitive structure as well. For example, there are groups of consumers who always prefer Coke over Pepsi and another group that buys Surf over any other brand of detergent. It would be extremely useful for brands to profile these customers and channel their marketing towards individuals who fit this profile.

Data generated through our activities captures a plethora of information about our identity, likes and dislikes etc. This information has tremendous value in every aspect of human life. Programming computers to unravel this hidden information is what Machine Learning is all about. It is the art and science of systematically deriving insights, patterns and predictions from data. What’s more, it is the core of data science and often the difference between a superficial understanding of the data and a deep insight into it.

Tom Mitchell, in his book Machine Learning, provides a short and simple definition of what Machine Learning is:

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”

In general, there are two aspects of Machine Learning: the algorithms (theory) and their implementation (application). Both of these aspects are equally important as they cover different facets of the process of learning from data and examples, unbound by assumptions of the structure of the data involved. The field of Machine Learning provides tools to automatically make decisions from data in order to achieve some goal or solution for a problem. In order to perform Machine Learning, one needs to know both of these aspects well. It is for this purpose that the School of Data Science has put together a structured learning program for business and technology professionals to learn the Machine Learning theory and practical skills required to predict from data. This exclusive program, called Machine Learning Smackdown, starts on the 12th of June in London. A limited number of seats are available, so I encourage you to apply as soon as you can.


Some Machine Learning experts may try to convince you that it is a discipline that requires years of studying in order to master. This, however, applies to every discipline but has never stopped anyone learning it in sufficient depth in a shorter time. What most people usually lack, which makes it a daunting task, is proper instruction, something that School of Data Science has considered while meticulously working on the Machine Learning Smackdown courses. Courses are designed for professionals from industry, services and public sectors interested in developing capabilities of turning data into meaningful insights and intelligent predictions. These courses (Smackdown rounds) are:

Introduction to Machine Learning

The first round is all about building a solid foundation upon which more can be built. The course will cover the structure of the field, the fundamental skills required to successfully perform it, and the current “hot” topics of Machine Learning. Full of examples and applications, this course will trigger thinking about how Machine Learning can be useful to you. It is a great course for business and technology professionals to learn the basics of Machine Learning and its use cases.

Get Started in Machine Learning

A foundation level course which aims to help you learn the basic principles around the models and methods of Machine Learning, using a series of hands-on examples to step through real-world applications. Attending this course will enable you to understand the basic concepts, become confident in applying the tools and techniques, and provide a firm foundation from which to explore more advanced methods. All of its lab sessions are done using IPython and Scikit-learn.

Machine Learning Basics with R

A thorough introduction to Machine Learning, with emphasis on classification methods, through the use of the R programming platform. This course has several examples of datasets similar to those in the real world. Along with the hands-on learning of various methods for classification and regression, this course offers an understanding of validation and a selection strategy for the various techniques presented.

Fundamentals of Machine Learning

An advanced course offering a detailed view of a wide selection of Machine Learning methods and paradigms. It also covers Machine Learning theory and several applications of the field. The hands-on part of the course is in the R programming language.

The bottom line is that through these courses you will gain a solid understanding of Machine Learning and lay the foundations of the mindset of a practitioner. You will be able to comprehend a new method’s function and assess its performance, making further learning in the field easier and more interesting. Communication with stakeholders of all sorts will be easier for you and you will have a better appreciation of the signals that lurk in the data available. Most importantly, you will be able to do something useful with it and participate in the big data revolution that is taking place these days. Don’t miss the great opportunity to learn practical Machine Learning to solve real business and social problems.




Data Science is a way of thinking to challenge yesterday’s way of everything.


By Ali Syed and Dr. Zacharias Voulgaris

Science is one of humanity’s many attempts to make sense of this world and understand the principles behind its various phenomena. It is the methodical way of processing data (collected using natural sensors and instrumentation), extrapolating the rules we derive from them, validating our conclusions, and drawing a path towards new understanding and its application, broadly defined as knowledge. Knowledge is the psychological result of perception and learning and reasoning. As knowledge is quite diverse, humankind decided to classify it into disciplines and domains. Hence we have astronomy, geology, physics, chemistry, etc.

The amount and complexity of data produced in science, business, and everyday human activity is increasing at staggering rates. The world of today is changing fast, leaving behind the ethos of knowledge as we know it. Ever-changing data and empiricism on steroids give knowledge a new meaning, one based on our creativity and deductive reasoning using data.

“What is the object of knowledge?” asks young Grasshopper. “There is no object of knowledge,” replies the old Shaman, “To know is to be able to operate adequately in an individual or cooperative situation.” “So which is more important, to know or to do?” asks young Grasshopper. “All doing is knowing, and all knowing is doing,” replies the Sage, and then continues, “Knowing is an effective action, that is, knowledge operates effectively in the domain of existence of all living creatures.” (Paraphrased from Maturana & Varela, 1992)

Quantification (the act of discovering or expressing the quantity of something) and being empirical have always been core parts of the scientific method. What has changed in the information age, as part of human interactions and perception, is that quantification has become more about the infusion of technology and data; some might call it neo-empiricism.

The world around us has moved on, and so should our methods and practices; our concepts need to evolve with it. That is the core of data science as we know it. Data Science starts with creative imagination. Our ability to imagine and think beyond the obvious is one of our extraordinary powers as humans. It is why we are different from other beings on the planet. We are creatures of action, we build and create things. We don’t live in the world directly, we live our lives through our imagination, which is full of ideas, concepts, theories and ideologies, grounded through systematic actions.

Everyone wants to be active in the field of data science and most of the people involved in it actually are. However, this is just the first step. If you really want to do data science it is important to be proactive, or to at least take steps towards that. This attitude is what is often referred to in Greek mythology as the “Promethean stance”, since Prometheus was a symbol of forethought and proactive behavior. When the other gods saw humanity as a lost cause, a species slightly more evolved than animals, Prometheus saw the potential in us and decided to sacrifice himself by stealing fire from the gods and giving it to us. Of course this is just a metaphor to show that humanity has a different kind of thinking (fire) which is closer to that of what we refer to as divine and is a godsend. For some people, however, it is a curse, since they don’t know how to use it properly.

Data Science gives us the broadness and complexity to get closer to the verification of reality. One can learn some tools and algorithms to deal with this data deluge and get hired based on a skill-set reflected on a resume, but you will only be able to add real and meaningful value, based on your ability to improvise, adapt, and create.  If we are prepared to give it the needed time and effort, we all can learn and apply data science, accepting the fact that we will make many mistakes along the way, and learn from our errors. Only cynics, brogrammers, pundits and wannabe tech experts benefit from making it the most complex discipline of our time. If you are not prepared to be wrong, you will never come up with anything original and useful. Data science empowers us to be more wrong rather than more accurate, in order to gradually build the means to create useful and meaningful data products.

We believe there is now a potential for data science to move to the next level, if we don’t cling to the status quo of viewing data science as the gatekeeper of yesterday’s scientific methods. This is a belief that is embraced by a number of data scientists, particularly those who shifted to the industry from academia, where challenging the scientific methods of the past is very rare and extremely difficult.

As the applications of data science are quite intriguing and with apparent financial benefits, it has grown to be a popular field and has attracted talented individuals in both academia and the industry, though it is mainly in the latter where data scientists generally build their careers. This accounts for its explosive growth in the past few years and the promise that the right application of such methods will become something really big in the years to come, so big that the demand for data scientists will not be met by the supply. What it boils down to is the generation of a new breed of data scientists that are too skills-centric and have not dwelled much on the mind-set. Soft skills are also imperative for a data scientist, perhaps even more important than the hard ones. Before we end up having an army of tech-savvy data scientists that are nothing more than clever calculators, it is time to take action and ensure that the next generations of data scientists have the mental discipline and maturity to carry the flame of Science to the future. We need to ensure that data science remains a science and does not degrade into some gibberish practice which is more a set of techniques than a discipline.

“The best minds of my generation are thinking about how to make people click ads… That sucks.” Jeff Hammerbacher


Work on stuff that matters. Data Activism for Social Change 

It’s all in the name: it is about attitude, and about learning and applying data science for a purpose. People talk about data science concepts and all sorts of wonderful things we can do, but they never talk about making it meaningful and valuable beyond the interests of a particular company. It sounds a bit naive, and also very self-centred, to live a meaningless life in which we are happy to spend more on the things that don’t really matter but are not willing to allocate time to the more important and real issues of life.

A change towards a more humanitarian world. 

The social change we collectively desire is not possible without addressing the issues of injustice, poverty, hunger, education for everyone and the freedom to live a free life; in short, without reducing the suffering of people, because most of the population of the world is suffering. As Martin Luther King eloquently expressed, “Injustice anywhere is a threat to justice everywhere.” Only a small portion of the total population gets to exploit and use most of the resources of the world, bringing about an imbalance that sows the seeds of all kinds of social and economic problems.

That is why we are promoting data science for humanity, and encourage data activism, the movement to focus the development and application of data science on more meaningful and real human challenges. The path that we have taken and are promoting is not going to be easy until it picks up momentum. Real change is never easy, the people involved have to struggle. But luckily for every one of us, the world is changing so radically that there is already significant change in the way we think, and a growing realization that we can move towards a better world only through application of meaningful data and more sustainable business models.  And that’s where Persontyle fits in. All meaningful changes take time to come to fruition, they don’t happen overnight, it’s a continuous struggle.

We are determined to focus on issues which matter and hope that more and more people will endorse and support doing what matters. Gradually, there will be a critical mass of people talking about data activism and the application of data science for humanity. All like-minded and determined people will join. It’s beyond ourselves; it’s like establishing virtually a new discipline, designed for our time and age. For the gatekeepers of the past, it is not going to be easy, and we might get a lot of push back and resistance. We are determined to overcome it against all odds.

Data activism is our humble attempt to give data science a human face. It’s enriched with meaning and purpose, encompassing an educational and revolutionary aspect. To put it simply, in our view, it is data science that is poised towards change, an organic part of a transition to a better society. The latter is not some flaky idea that philosophers like to talk about but a pragmatic and measurable approach that will result in improvement in our society, much like Jacque Fresco’s vision of a better world through the right use of technology. However, data science is not purely technical at its core, even though it makes use of technology for its purposes. Data science is a way of thinking and acting, employing state-of-the-art technology, in order to turn data into actionable and applied understanding that will be useful to the end-user i.e. society. Data activism attempts to ensure that the end-user is not just some corporate or for-profit organization, but its application extends to other places, such as charities and non-profit organizations.

All this is nice and idealistic, but how is it relevant to today’s value-driven society? Well, today’s world is all about creating value for people regardless of whether it is money, reputation, prosperity or other forms of benefit. The altruistic aim of data science may not yield improvement in the bottom line, but it may provide value to everyone involved in it (the organization that provides the data, the people that process it, and the people that make use of the data products as a result). All this will eventually manifest as shared prosperity for everyone.

The possibilities of this movement of sorts are limited only by our imagination. It is important to note that this does not attempt to deprive the industry of its valuable data scientists. It will just make it easier for new individuals in the field to acquire useful hands-on knowledge and experience through their involvement in volunteer work that will be useful to society through non-profits. In this scenario that we envision everyone wins. Isn’t that what Science is all about?

“We are living in the dawn of the big data era, a time in which the vast digitization of our world has created incalculable amounts of information that is now being used to drive our every decision, from what movie we decide to watch this weekend to how we navigate the globe next year. Though data can be immensely transformative, much of the efforts in data science are still focused on first-world gains, such as optimizing ad networks or recommending restaurants. As designers, developers, and scientists, it is not only incumbent upon us to understand how to analyze, understand, and tell stories with data, but also to think about its use in meaningful and socially conscious ways.” Jake Porway


A key challenge for you as a leader is to develop the imagination of your teams. Help them focus on both the hard and the soft stuff. The “hard stuff” is not just some tech skills, and the “soft stuff” is not just “keep calm and carry on innovating” or some such feel good mantra. Instead of issuing detailed instructions, allow your people the freedom to dream, challenge and create. We need institutions to promote a culture where team members can unleash their creativity and explore the possibilities, which will not only drive better results but also create teams that are motivated, happy, and actually having fun instead of being clock-punching drones.


“‘Innovation’ defined as moving the pieces around and adding more processing power is not some Big Idea that will disrupt a broken status quo: that precisely is the broken status quo. If we really want transformation, we have to slog through the hard stuff (history, economics, philosophy, art, ambiguities, contradictions). Bracketing it off to the side to focus just on technology, or just on innovation, actually prevents transformation.” Benjamin Bratton


The next time you are tempted to apply data science, try a different approach. Provide the building blocks and then let your team create their own masterpiece. You may just discover that game-changing innovation is already present. It may just be locked inside a team that needs to be instructed not to just follow the instructions. Always remember that business people don’t need to just ‘understand data scientists better’. Business people need to be data scientists, or at least integrate the data science philosophy in their thinking. If data science is not for thinking, challenging, questioning, and rebelling then it’s just a useless technology to protect mediocrity and status quo.

Credits: Illustration by Decourseyfx.



We’re all MAD here for Data Science

“No great mind has ever existed without a touch of madness.” ― Aristotle

The combination of an increasingly complex world, the vast proliferation of data, and the pressing need to stay one step ahead of the competition has sharpened focus on using data science within organizations. Everyone is trying to find the simple, effective and reliable methods and tools to analyze data.

In a field that’s brimming with a large variety of techniques and tools, it is easy to get lost and adopt the most popular approach to data science: learning a handful of stats and machine learning methods, along with Hadoop, and hoping that what you get from them satisfies your boss. That may be acceptable too for the time being but anything less than mad, passionate, extraordinary is just run of the mill. There are too many mediocre and boring things in life to deal with and data science shouldn’t be one of them.

“Imperfection is beauty, madness is genius and it’s better to be absolutely ridiculous than absolutely boring.” — Marilyn Monroe

Data scientists are like hackers of sorts, albeit ethical ones, because they are continually looking for ways to innovate, challenge and explore optimal options to overcome the limitations of their programs, methods, practices and norms. They employ extreme creativity in interesting ways to attain the results. 

That’s where MADlib enters the scene. MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data. In the context of data science, MADlib is actually an open source project for Magnetic, Agile, Deep (MAD) data analysis, an orthogonal approach to traditional Enterprise Data Warehouses. The primary goal of MADlib is to accelerate innovation in the Data Science community via a shared library of scalable in-database analytics.


‘Magnetic’ refers to a system that attracts data and people to it. Traditional data warehouses tend to “repel” new data sources until the data is clean and integrated. MADlib algorithms embrace the ubiquitous nature of data by helping data scientists make inferences about data even if the data source has not undergone rigorous validation.

‘Agile’ is about enabling Data Scientists to quickly and easily experiment with the data, derive insight, refine the model, and iterate again. This requires fast ad hoc analysis and sandboxing to be productive, which is made possible by the massive parallelism and scalability provided by many of the leading MPP databases.

‘Deep’ comes from the capability to analyze an entire data set at scale rather than being forced to take a sample of your entire data set that is small enough to fit in memory on a single machine.

MADlib involves the application of intuition and analytics, working in tandem, for the efficient extraction of (high quality) information from the data at hand. Note that it’s not just doing clever things with the data, but also doing so with an objective (usually the business case you need to prove) and with efficiency. The former means that you make your manager happy by providing something that an executive can understand and use. The latter means doing so without having to write a lot of code in R, Python, or whatever tool you are using for the particular application.


MAD skills are defined as Magnetic, Agile and Deep analytics, as well as governance methods for big data applications. Obviously they go beyond the overly complicated techniques that most data scientists use and are often found in data science books. That’s not to say that complex techniques don’t have merit. On the contrary, they are very useful.

An example of MAD skills is the development of SQL code. The MAD oriented data scientist will first of all create a sandbox to carry out various experiments before working with the whole dataset. Creating the sandbox is not as simple as it seems because it involves the intelligent selection of representative samples from the original database. The code developed when working on this sandbox has to be in the form of SQL subroutines, structured in such a way that it is easily reusable. Apart from the simple querying that usually takes place in an SQL setting, the MAD-centric analyst will create methods (like t-tests, likelihood ratios, etc.) in this setting (not in Python, R, etc.) so that he/she can do some analytics directly in the database environment. 
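To make the idea of doing analytics inside the database a little more concrete, here is a minimal sketch of a two-sample (Welch) t-statistic whose heavy lifting is done by SQL aggregates. It is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for an MPP database, the table and column names (sandbox, grp, value) and the helper welch_t_in_database are hypothetical, and this is not the MADlib API. The final square root is taken in Python because not every SQLite build ships math functions.

```python
import math
import sqlite3


def welch_t_in_database(conn, table, group_col, value_col, group_a, group_b):
    """Welch t-statistic for two groups, using per-group SQL aggregates.

    Hypothetical helper for illustration; identifiers are interpolated
    directly into the query, so only pass trusted table/column names.
    """
    query = f"""
        SELECT {group_col},
               COUNT(*)                        AS n,
               AVG({value_col})                AS mean,
               SUM({value_col} * {value_col})  AS sum_sq
        FROM {table}
        WHERE {group_col} IN (?, ?)
        GROUP BY {group_col}
    """
    stats = {}
    # The database scans the rows; only one summary row per group comes back.
    for group, n, mean, sum_sq in conn.execute(query, (group_a, group_b)):
        variance = (sum_sq - n * mean * mean) / (n - 1)  # sample variance
        stats[group] = (n, mean, variance)

    n_a, mean_a, var_a = stats[group_a]
    n_b, mean_b, var_b = stats[group_b]
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)


# A tiny in-memory "sandbox" table with made-up values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sandbox (grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sandbox VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 2.0), ("b", 4.0), ("b", 6.0)],
)

t = welch_t_in_database(conn, "sandbox", "grp", "value", "a", "b")
print(f"Welch t-statistic: {t:.4f}")
```

The point of the design is that the raw rows never leave the database: the query returns a handful of aggregates per group, and the same pattern can be rerun against the full table once it has been tried on the sandbox.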

MAD skills are not only a novel approach to data science, but also something sustainable and useful for honing the data scientist mindset. This means transmuting this knowledge and experience into lasting know-how, inside you. When people who are recruiting data scientists require X years of experience, in essence they are going for the know-how that these years usually bring about. That doesn’t mean that you need to have X years of data science experience in your resume though. MAD skills can bring about this valuable know-how sooner, given that you are conscious of your professional development and have a learner’s attitude.

If you don’t have MAD skills in your work, fret not. Things are not black and white but have several shades of grey in-between (some people believe they are 50 or so, but we have no data supporting this hypothesis!). MAD skills exist in many data scientists today in various degrees. You may just employ the DBMS skill set because you feel more comfortable with databases. If that’s the case, just develop the MAD skills you are missing specific to your task. You don’t have to have all of them to see improvement in your efficiency and quality of work.

If you are serious about MAD skills and wish to employ them in your data science practice, you should look into the MADlib library, as well as some all-in-one data science suites, such as the one from Alpine Data Labs (Alpine).

Alpine’s software differentiation is rooted in four key components:

  • No Data Movement: Alpine software sends instructions to customers’ existing databases or Hadoop clusters. No new hardware is required. No Data Extract needs to be moved.

  • No Script: Alpine is geared toward business analysts. Its interface is visual and all functions can be run via simple drag-and-drop. While the functions sent to the database can be sophisticated, users work with an interface that focuses them on the math, not the code. No knowledge of the underlying database language is required and data can be mashed up and transformed without having to write a single line of code.

  • All Data, All the Time:  Alpine software was built for the “post-Hadoop and post-Internet” era. Alpine can easily run instructions on a company’s entire dataset, giving teams a better and more complete view of reality.

  • No Download. Ever: With Alpine, results are seen in hours and days, not weeks or months. Business users don’t need to go through extraneous set up processes or lengthy update cycles. Models, predictions and analytics can be built, deployed and used directly from the web or from any mobile device such as Apple iPad or Google Android tablets.

Source :

We’d recommend you try Alpine, as well as any other ones you find that may be applicable to your needs. Nowadays there are several alternatives to Hadoop and if you are more interested in the science part of data science, you can focus on that and get MAD rather than frustrated.




Principles of man-machine framework

According to Dan Woods, systems (transactional or analytical) will fail if not designed with the view of people and machines working in harmony. To achieve the practical balance, a man-machine framework must be adopted, where man is in charge and the algorithm is an extension. In his conversation with Arnab Gupta, CEO of Opera Solutions, he explores principles of application design using machine learning. Continue to read this interesting piece to learn more about man-machine harmony here.

The principles behind man-machine framework are as follows:

  • The machine is a prosthetic of the human mind.
  • The computer interface supports the human thought process, not the other way around.
  • The man-machine interface’s purpose is to facilitate frontline productivity for humans in business.
  • The best processes separate tasks into those appropriate for machines and those appropriate for humans.
In the comments section below, suggest principles you think should be included in the man-machine framework.