Archive for December, 2013



We’re all MAD here for Data Science

“No great mind has ever existed without a touch of madness.” ― Aristotle

The combination of an increasingly complex world, the vast proliferation of data, and the pressing need to stay one step ahead of the competition has sharpened organizations’ focus on data science. Everyone is trying to find simple, effective, and reliable methods and tools for analyzing data.

In a field brimming with techniques and tools, it is easy to get lost and adopt the most popular approach to data science: learning a handful of stats and machine learning methods along with Hadoop, and hoping that what you get from them satisfies your boss. That may be acceptable for the time being, but anything less than mad, passionate, and extraordinary is just run-of-the-mill. There are too many mediocre and boring things in life to deal with, and data science shouldn’t be one of them.

“Imperfection is beauty, madness is genius and it’s better to be absolutely ridiculous than absolutely boring.” — Marilyn Monroe

Data scientists are hackers of sorts, albeit ethical ones: they continually look for ways to innovate, challenge assumptions, and explore better options to overcome the limitations of their programs, methods, practices, and norms. They employ extreme creativity in interesting ways to attain results.

That’s where MADlib enters the scene. MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical, and machine-learning methods for structured and unstructured data. In the context of data science, MADlib is an open-source project for Magnetic, Agile, Deep (MAD) data analysis, an approach orthogonal to traditional Enterprise Data Warehouses. The primary goal of MADlib is to accelerate innovation in the data science community via a shared library of scalable in-database analytics.


‘Magnetic’ refers to a system that attracts data and people to it. Traditional data warehouses tend to “repel” new data sources until the data is clean and integrated. MADlib algorithms embrace the ubiquitous nature of data by helping data scientists make inferences even when a data source has not undergone rigorous validation.

‘Agile’ is about enabling data scientists to quickly and easily experiment with the data, derive insight, refine the model, and iterate again. Being productive here requires fast ad hoc analysis and sandboxing, which the massive parallelism and scalability of many leading MPP databases make possible.

‘Deep’ comes from the capability to analyze an entire data set at scale rather than being forced to take a sample of your entire data set that is small enough to fit in memory on a single machine.
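To make the ‘Deep’ point concrete, here is a minimal, purely illustrative sketch — the `events` table and its numbers are hypothetical, and sqlite3 merely stands in for an MPP warehouse — contrasting a full-table aggregate pushed down into the database with a small sample pulled into application memory:

```python
import random
import sqlite3

# Hypothetical table standing in for a large warehouse fact table.
random.seed(1)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (revenue REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [(random.paretovariate(1.5),) for _ in range(100_000)],  # heavy-tailed values
)

# 'Deep': a single SQL aggregate scans every row inside the database.
(full_mean,) = conn.execute("SELECT AVG(revenue) FROM events").fetchone()

# The sampling alternative: copy 100 rows into application memory and average them.
sample = [r for (r,) in conn.execute(
    "SELECT revenue FROM events ORDER BY RANDOM() LIMIT 100")]
sample_mean = sum(sample) / len(sample)

print(f"full-data mean: {full_mean:.2f}, 100-row sample mean: {sample_mean:.2f}")
```

On heavy-tailed data like this, a small sample can easily miss the rare large values that dominate the mean, which is exactly why full-data, in-database aggregates matter.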

MADlib involves the application of intuition and analytics, working in tandem, for the efficient extraction of (high quality) information from the data at hand. Note that it’s not just doing clever things with the data, but also doing so with an objective (usually the business case you need to prove) and with efficiency. The former means that you make your manager happy by providing something that an executive can understand and use. The latter means doing so without having to write a lot of code in R, Python, or whatever tool you are using for the particular application.


MAD skills are defined as Magnetic, Agile and Deep analytics, together with governance methods for big data applications. They go beyond the overly complicated techniques that most data scientists use and that fill data science books. That’s not to say that complex techniques don’t have merit; on the contrary, they are very useful.

An example of MAD skills is the development of SQL code. The MAD-oriented data scientist will first create a sandbox for experiments before working with the whole dataset. Creating the sandbox is not as simple as it seems, because it involves the intelligent selection of representative samples from the original database. The code developed in this sandbox should take the form of SQL subroutines, structured so that it is easily reusable. Beyond the simple querying that usually takes place in an SQL setting, the MAD-centric analyst will implement statistical methods (t-tests, likelihood ratios, etc.) in SQL itself, rather than in Python or R, so that the analytics run directly in the database environment.
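As a toy illustration of pushing a statistical test into SQL — here sqlite3 stands in for a real MPP database, and the two-group `measurements` table is hypothetical — a Welch t-statistic can be assembled from a single aggregate query, so only six summary numbers ever leave the database:

```python
import math
import random
import sqlite3

# Hypothetical two-group table; sqlite3 is a stand-in for the warehouse.
random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
rows = [("a", random.gauss(10.0, 2.0)) for _ in range(500)] + \
       [("b", random.gauss(10.5, 2.0)) for _ in range(500)]
conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

# One SQL pass returns the sufficient statistics (count, mean, mean of
# squares) per group; COUNT/AVG ignore the NULLs produced by CASE WHEN.
na, ma, qa, nb, mb, qb = conn.execute("""
    SELECT
      COUNT(CASE WHEN grp = 'a' THEN 1 END),
      AVG(CASE WHEN grp = 'a' THEN value END),
      AVG(CASE WHEN grp = 'a' THEN value * value END),
      COUNT(CASE WHEN grp = 'b' THEN 1 END),
      AVG(CASE WHEN grp = 'b' THEN value END),
      AVG(CASE WHEN grp = 'b' THEN value * value END)
    FROM measurements
""").fetchone()

# Welch's t-statistic assembled from the six aggregates.
var_a = (qa - ma * ma) * na / (na - 1)  # unbiased sample variance
var_b = (qb - mb * mb) * nb / (nb - 1)
t = (ma - mb) / math.sqrt(var_a / na + var_b / nb)
print(f"Welch t-statistic: {t:.3f}")
```

The raw rows never travel to the application layer; in a MADlib setting the same idea applies at much larger scale, with the library providing such routines ready-made.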

MAD skills are not only a novel approach to data science but also something sustainable and useful for honing the data scientist mindset. This means transmuting knowledge and experience into lasting know-how. When recruiters of data scientists require X years of experience, they are in essence after the know-how those years usually bring. That doesn’t mean you need X years of data science experience on your resume, though. MAD skills can bring about this valuable know-how sooner, provided you are conscious of your professional development and keep a learner’s attitude.

If you don’t have MAD skills in your work, fret not. Things are not black and white but have several shades of grey in-between (some people believe they are 50 or so, but we have no data supporting this hypothesis!). MAD skills exist in many data scientists today in various degrees. You may just employ the DBMS skill set because you feel more comfortable with databases. If that’s the case, just develop the MAD skills you are missing specific to your task. You don’t have to have all of them to see improvement in your efficiency and quality of work.

If you are serious about MAD skills and wish to employ them in your data science practice, you should look into the MADlib library, as well as some all-in-one data science suites, such as the one from Alpine Data Labs (Alpine).

Alpine’s software differentiation is rooted in four key components:

  • No Data Movement: Alpine’s software sends instructions to customers’ existing databases or Hadoop clusters. No new hardware is required, and no data extracts need to be moved.

  • No Script: Alpine is geared toward business analysts. Its interface is visual and all functions can be run via simple drag-and-drop. While the functions sent to the database can be sophisticated, users work with an interface that focuses them on the math, not the code. No knowledge of the underlying database language is required and data can be mashed up and transformed without having to write a single line of code.

  • All Data, All the Time:  Alpine software was built for the “post-Hadoop and post-Internet” era. Alpine can easily run instructions on a company’s entire dataset, giving teams a better and more complete view of reality.

  • No Download. Ever: With Alpine, results are seen in hours and days, not weeks or months. Business users don’t need to go through extraneous setup processes or lengthy update cycles. Models, predictions, and analytics can be built, deployed, and used directly from the web or from any mobile device, such as an Apple iPad or a Google Android tablet.


We’d recommend you try Alpine, as well as any other ones you find that may be applicable to your needs. Nowadays there are several alternatives to Hadoop and if you are more interested in the science part of data science, you can focus on that and get MAD rather than frustrated.




Our Final Invention: Why we need to be intelligent about Artificial Intelligence

Artificial Intelligence and the end of the human era.
Artificial Intelligence helps choose what books you buy, what movies you see, and even who you date. It’s in your smartphone and your car, and it has the run of your house. It makes most of the trades on Wall Street, and controls our transportation, energy, and water infrastructure. Artificial Intelligence is for the 21st century what electricity was for the 20th and steam power for the 19th.

But there’s one critical difference — electricity and steam will never outthink you.

The Hollywood cliché that artificial intelligence will take over the world could soon become scientific reality as AI matches then surpasses human intelligence. Each year AI’s cognitive speed and power doubles — ours does not. Corporations and government agencies are pouring billions into achieving AI’s Holy Grail — human-level intelligence. Scientists argue that AI that advanced will have survival drives much like our own. Can we share the planet with it and survive?

Our Final Invention explores how the pursuit of Artificial Intelligence challenges our existence with machines that won’t love us or hate us, but whose indifference could spell our doom. Until now, intelligence has been constrained by the physical limits of its human hosts. What will happen when the brakes come off the most powerful force in the universe?


James Barrat’s book “Our Final Invention” is every sci-fi fan’s dream: a non-fiction book that puts forward the viable possibility of a premise that, until recently, was considered science fiction. Namely, it discusses how advances in the field of Artificial Intelligence (AI) will impact the world, and how the indifference of the super-intelligent machines we build “could spell our doom.”

Although AI is quite varied, ranging between what researchers refer to as Narrow and General AI (ANI and AGI, respectively), it still presents a potential threat to our way of life. Regardless of whether applying AI algorithms results in a super-intelligent specialist or a super-intelligent generalist, the fact remains the same: it is very likely that AI machines will get out of control. Super-intelligence is by definition an intelligence so advanced that our human intelligence is no match for it. Our brain, no matter how evolved, will be like the brain of an ape compared to the digital brains we are developing.

Don’t get me wrong; I am all for the development of this advanced tech, and I have accepted that some of our best scientists are working towards it. However, my concern is not so much the how but rather the why of such an advent. Specifically, I find that the applications driving this research carry unreasonable risk and may bring about undesirable scenarios. If a super-intelligent computer is in charge of our defense systems, how can we be sure that its values will be aligned with ours and that it won’t decide to start a war through a preemptive strike against its country’s enemies? How do we know that an AI machine will exhibit the level of understanding a human manager would aim for when in charge of a process that involves the safety of millions? Can a supercomputer be held accountable for its actions, and what forms of justice can be effective on it if it goes awry? Read this superb article by Gary Marcus, “Why We Should Think About the Threat of Artificial Intelligence”.

These are questions that need to be addressed before critical operations are entrusted to AI machines, in order to avoid unnecessary risks. If something goes wrong, it won’t be fixable by a lawsuit or any of our conventional forms of justice. I don’t foresee a world war, but some disorder in society is bound to take place. So, does that mean we should cease our efforts in AI? Far from it. I believe that AI can be used in a safe and effective way to aid data science for the whole species. Most companies are already capable of solving their own problems, but no one is tackling the greater problems that plague humanity. Some of us, particularly at Persontyle, believe that data science can be used for humanitarian objectives as well. The “Data Science for Humanity” project is all about that; it aims to bring charities and data scientists together to work out solutions to some quite interesting problems, interesting not just in terms of technical difficulty but also in terms of application. These problems may involve poverty, social justice, human rights, water, climate change, healthcare in third-world countries, and so on. If super-intelligent data science software is employed for such a project, the risk is minimal. The software can employ an ANI system that focuses on one thing only: finding intelligent ways to use the available data so that the objective function (e.g. the number of people in good health) is optimized.

[Image: Marvin Minsky]

If this sounds a bit futuristic to you, keep in mind that AI is already employed to some extent in data science. Many machine learning systems used by data scientists are in essence applications of AI algorithms. Even the quite promising field of deep learning, which aims to revolutionize data science, is in fact an AI system, albeit a sophisticated one. Perhaps the future that sci-fi writers have envisioned is not as far off as we might think. Super-intelligent machines are bound to arrive before society looks anything like that of “I, Robot” or “Gattaca”. The question that naturally arises is: “Are we mature enough to make good use of this technology, or will it be the blueprint of a prison for us?” I’d like to believe that we can confidently answer “the former.”


Watch this 70-minute conversation with Barrat to understand how his initial optimism about AI turned into pessimism, why he sees artificial intelligence as more like ballistic missiles than video games, and why true intelligence is an inherently unpredictable “black box”.
