What Is Data Science

The picture below gives an idea of how Data Science relates to Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL):

Data Science is the practical application of all those fields (AI, ML, DL) in a business context. “Business” is a flexible term here, since it also covers cases such as scientific research, where your “business” is science. That is truer than you might want to think.

But whatever the context of your application is, the goals are always the same:

  • extracting insights from data,
  • predicting developments,
  • deriving the best actions for an optimal outcome,
  • or sometimes even performing those actions in an automated fashion.

As you can also see in the diagram above, Data Science covers more than the application of those techniques. It also covers related fields like traditional statistics and the visualization of data or results. Finally, Data Science includes the data preparation needed to get the analysis done. In fact, this is where you will spend most of your time as a data scientist.

A more traditional definition describes a data scientist as somebody with programming skills, statistical knowledge, and business understanding. While this skill mix does allow you to do the job of a data scientist, the definition falls a bit short. Others realized this as well, which led to a battle of Venn diagrams.

The problem is that people can be good data scientists even if they do not write a single line of code. Other data scientists can create great predictive models with the help of the right tools, but without a deeper understanding of statistics. So the “unicorn” data scientist (who masters all the skills at the same time) is not only overpaid and hard to find, but might also be unnecessary.

For this reason, I prefer the definition above, which focuses on the “what” rather than the “how”. Data scientists are people who apply all those analytical techniques, and the necessary data preparation, in the context of a business application. The tools do not matter to me as long as the results are correct and reliable.


What is Artificial Intelligence, Machine Learning, and Deep Learning?

This post should help you understand the differences and relationships between these fields. Let’s get started with the following picture. It explains the three terms artificial intelligence, machine learning, and deep learning:

Artificial Intelligence covers anything that enables computers to behave like a human. Think of the famous – although a bit outdated – Turing test, which determines whether this is the case. If you talk to Siri on your phone and get an answer, this is already close. Automatic trading systems that use machine learning to be more adaptive would also fall into this category.

Machine Learning is the subset of Artificial Intelligence that deals with the extraction of patterns from data sets. This means that the machine can find rules for optimal behavior, but can also adapt to changes in the world. Many of the algorithms involved have been known for decades, sometimes even centuries. But thanks to advances in computer science and parallel computing, they can now scale up to massive data volumes.
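To make “extracting patterns from data” a bit more concrete, here is a minimal sketch in Python. It uses ordinary least squares, a technique that is roughly two centuries old, to learn a straight-line rule from a handful of data points. The data, the scenario (machine hours versus units produced), and the function names fit_line and predict are all made up for illustration. This is not how any particular product or library does it.

```python
from statistics import mean


def fit_line(xs, ys):
    """Learn slope and intercept of y = slope * x + intercept by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply the learned rule to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical observations: machine hours and units produced.
hours = [1, 2, 3, 4, 5]
units = [12, 14, 16, 18, 20]

model = fit_line(hours, units)       # the "pattern" found in the data
forecast = predict(model, 6)         # a prediction for an unseen input

# The world changes: the same hours now yield more units.
# Refitting on the new data adapts the rule without changing the code.
units_after_change = [12, 15, 18, 21, 24]
adapted_model = fit_line(hours, units_after_change)

print(model, forecast, adapted_model)
```

The point of the sketch is the division of labor: nobody wrote the rule “two units per hour plus ten” by hand. The algorithm derived it from the data, and derived a different rule once the data changed.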

Deep Learning is a specific class of Machine Learning algorithms that use complex neural networks. In a sense, it is a group of related techniques, like the group of “decision trees” or “support vector machines”. But thanks to advances in parallel computing, these techniques have received quite a bit of hype recently, which is why I broke them out here. As you can see, deep learning is a subset of methods from machine learning. When somebody explains that deep learning is “radically different from machine learning”, they are wrong. But if you would like to get a BS-free view on deep learning, check out this webinar I did some time ago.

But if Machine Learning is only a subset of Artificial Intelligence, what else is part of this field?  Below is a summary of the most important research areas and methods for each of the three groups:

  • Artificial Intelligence: Machine Learning (duh!), planning, natural language understanding, language synthesis, computer vision, robotics, sensor analysis, optimization & simulation, among others.
  • Machine Learning: Deep Learning (another duh!), support vector machines, decision trees, Bayes learning, k-means clustering, association rule learning, regression, and many more.
  • Deep Learning: artificial neural networks, convolutional neural networks, recursive neural networks, long short-term memory, deep belief networks, and many more.

As you can see, there are dozens of techniques in each of those fields. And researchers generate new algorithms on a weekly basis.  Those algorithms might be complex.


BMW Group (BMW) is a German luxury vehicle, motorcycle, and engine manufacturing company founded in 1916. It is one of the best-selling luxury automakers in the world and is using deep learning with HDP to cut costs in manufacturing and other areas.

Three weeks ago, at the DataWorks Summit in Munich, we announced the Data Hero winners for the EMEA region. The winner in the Data Architect category was Tobias Bürger, Lead Big Data Platform & Architecture at BMW Group. You can read the announcement here.

BMW manages structured, sensor, and server log data. From that data, BMW produces batch, interactive SQL, streaming, and AI/Deep Learning analysis. Hortonworks Data Platform (HDP®) is one of the enabling technologies for BMW.

The team at BMW, under Chief Architect Tobias, has implemented over 100 HDP use cases, including the generation of autonomous driving insights from sensor data, cost savings in research and development, streamlining the manufacturing process, and improving after-sales customer care. Once HDP was brought into BMW, it spread quickly and far beyond the original central users.

Additional BMW HDP use cases have brought architectural improvements around its technology stack, bringing in analytical capabilities never before possible.



You have heard about Big Data for a long time, and how companies that use Big Data as part of their business decision-making process experience significantly higher profitability than their competition.

Now that your company is ready to embark on its first Apache Hadoop® journey, there are important lessons to be learned. Read on and learn how to avoid the pitfalls and missteps so many companies fall into.

Pitfall 1: We Are Going To Start Small

It is natural for companies in general, and IT organizations in particular, to start their Big Data journeys under conditions where they can manage the risk by determining the viability of the technology. However, we have learned that the more data you have, the higher the likelihood of finding new and exciting insights.

In case after case, the size of the initial cluster is a good predictor of the success of the first Hadoop project. In other words, businesses that start out with clusters of ten nodes or fewer generally do not have sufficient data in their Hadoop environment to uncover significant insights.

Best Practice: Start out with a cluster of at least 30 nodes. Outline business objectives and then bring in as much data as your infrastructure can comfortably store and process to meet them.

Pitfall 2: Build It And They Shall Come

Another common mistake that companies make is to build their Hadoop cluster without having a clear objective that is connected to deriving real business value. It is true that a number of companies start out with the objective to reduce the operational cost of their existing data infrastructure by moving that data into Hadoop. However, the cost benefits of such projects are largely limited to IT organizations.

To make a positive impact on your company’s revenues, profitability, or competitive leverage through Big Data, you must partner with the business to come up with concrete use cases that will drive such results. These use cases must outline the key business metrics and identify the data sources and processing steps required to achieve the desired business results.

Best Practice: Start out with a use case built around achieving concrete business results. Even if you are building a prototype, keep an eye on rolling it out to production. Succeed or fail quickly and communicate success to the broader organization.

Pitfall 3: We Need To Hire A Team Of People With Hadoop Background

Many companies at the start of their Hadoop journeys hire an architect simply to install and configure their Hadoop cluster. A Hadoop architect is an expensive resource whose expertise is better used down the road, when security architecture, governance procedures, and IT processes need to be operationalized.

Hadoop is a unique technology that cuts across infrastructure, applications, and business transformation. It is ideal to have a Hadoop-centric practice that is part of the broader analytics organization; however, finding personnel with a background in Hadoop infrastructure and its various components is a tall order. Hadoop requires a unique set of skills that few companies have in place at the outset of their journey.


With the San Jose DataWorks Summit (June 13-15) just two months away, we’re busy finalizing the lineup of an impressive array of speakers and business use cases. This year our Enterprise Adoption Track will include Nick Evans and Kevin Brown from ExxonMobil with Wade Salazar from Hortonworks.

Big Data is driving major advances in the oil and gas industry, resulting in increased productivity and cost savings throughout the extraction and production cycles. Advances in instrumentation, process automation, and collaboration are generating data from myriad new sources, including sensors, geolocation, weather, and seismic data. Combined with human-generated data, such as market feeds, social media, email, text, and images, a wealth of new analytical insights is transforming the industry as a whole.

Join Nick, Kevin, and Wade as they present:

The Evolution of Streaming and Data Lake Shared Services at ExxonMobil: Lessons from a Fortune 10 Adoption

Abstract: Analytics applications grow more powerful as they leverage new types of data from sensors, machines, server logs, clickstreams, and social media. The Hadoop-based Data Lake enables that analytic potential, but the shared service supporting it must scale efficiently and enable deep insight across a large, broad, diverse data set to a variety of consumers. Come learn how ExxonMobil created its first Big Data shared service across an enormous enterprise – from data ingestion at the edge using Hortonworks DataFlow to long-term storage in Hortonworks Data Platform, culminating in data exploration and analysis with business intelligence tools.

About the Speakers

Nick Evans, ExxonMobil
Nick Evans is the Big Data Service manager for Data & Analytics at ExxonMobil with a team of developers and engineers focused on embedding world class analytics to solve big data opportunities across the corporation.  Mr. Evans has 15 years of experience at ExxonMobil in a variety of roles in information technology.  He is very excited to work in Data & Analytics and Big Data due to its broad application and impact across all business lines.  Mr. Evans holds a BBA in Management Information Systems from Texas Tech University. 

Kevin Brown, ExxonMobil
Kevin Brown is the Big Data Service Platform Engineer for Data & Analytics at ExxonMobil with a team of data architects and engineers focused on embedding world class analytics to solve big data opportunities across the corporation.  Kevin’s previous experience in software development and Linux administration played a critical role in helping pioneer a Big Data platform at ExxonMobil.  Kevin holds an Information Technology degree from Brigham Young University.

Wade Salazar, Hortonworks
Wade Salazar serves Hortonworks as a Solutions Engineer in Houston, TX. Educated as an electrical engineer, fluent in many programming languages, and having worked in the control systems trade for over ten years before joining Hortonworks, Wade enables those looking to apply big data tools to industrial processes and equipment. Outside of work, Wade is passionate about technology, the outdoors, cooking, dogs, horses, and Texas lore.