6 Worst Mistakes for Data Scientists, and How to Avoid Them. (Explained with Quotable Quotes)

From Roopam Upadhyay

Over the years of my career in data science and predictive analytics, I have noticed some awful practices that young, and sometimes seasoned, analysts follow. These bad practices, I believe, put the careers of these data scientists on a collision course, much like the Titanic. I will present the six worst mistakes that I feel are at the root of all these bad practices. Additionally, I will try to suggest strategies to avoid these mistakes using some memorable quotes. To begin with, let me present the purpose of being a data scientist, which in my opinion is similar to being a detective. The following quote by Sherlock Holmes sums up that purpose:

My name is Sherlock Holmes. It is my business to know what other people don’t know.

― Sherlock Holmes

Now, coming back to the six worst mistakes for data scientists, here is my list:

  1. Focus on tools rather than business problems
  2. Planning communication last
  3. Data analysis without a question / plan
  4. Don’t read enough
  5. Fail to simplify
  6. Don’t sell well

1) Focus on Tools rather than Business Problems

The expectations of life depend upon diligence; the mechanic that would perfect his work must first sharpen his tools.

― Confucius

In addition to programming languages such as SAS, R, and Python, tools for data scientists include statistical and machine learning methods and algorithms. I am certainly not trying to undermine the importance of these tools when I ask data scientists to shift their focus away from them. Mastering tools, as Confucius suggested, is at the core of being a good craftsman. However, to make my point, imagine going to a doctor who is much more confident in her skill with a stethoscope than in diagnosing patients. Some data scientists likewise focus too much on tools rather than on the problems these tools are meant to solve. In my opinion, a good practice for data scientists is to always question the purpose of using a tool and how it will help solve the problem at hand.

It is the old experience that a rude instrument in the hand of a master craftsman will achieve more than the finest tool wielded by the uninspired journeyman.

— Karl Pearson

2) Planning Communication Last

The most important things are the hardest to say, because words diminish them. 

― Stephen King

Trust me, in your career as a data scientist you will communicate some really important things: communications that will challenge the status quo and change the way organizations do business. Hence, you can’t leave the task of planning communication until the end of the analysis. On the contrary, I believe that planning communication alongside your investigation and analysis actually enhances the quality of your analysis. Good communication flows like a tightly knit and gripping story. When you plan your communication along with the analysis, your analysis also flows like a story. In my opinion, a good practice for data scientists is to take time away from their analysis on a daily basis and structure their results and thoughts in the form of a story.

Think like a wise man but communicate in the language of the people. 

― William Butler Yeats

3) Data Analysis without a Question / Plan

If you don’t know what you want, you end up with a lot you don’t.

― Chuck Palahniuk

Easy availability of data often makes data scientists jump directly into the data without well-defined questions. This is suicidal for any data science project. Data science is a structured process that starts with well-defined questions and objectives. Then comes the part of setting a few hypotheses that serve the grand objective.

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.

― Sherlock Holmes

Let me draw the distinction between theorising and hypothesising. Hypotheses are testable: facts either support or dispel them. As data scientists, our job is to be dispassionate about our hypotheses. The idea is to be truth seekers rather than to do self-serving analysis. Additionally, during your analysis you will come across several clues that were not part of the hypotheses. You build your story on top of these clues like a true detective. However, having clearly defined questions before the analysis is the most important thing for data scientists.

Judge a man by his questions rather than by his answers.

― Voltaire

4) Don’t Read Enough

A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.

― George R.R. Martin, A Dance with Dragons

I have found reading extremely helpful throughout my career in data science. The most powerful aspect of reading is the way it helps us generate ideas and also communicate those ideas. Data scientists across the globe are doing some really cool work, and reading is our gateway to that work. In addition to books, there are many other resources for data scientists to gain knowledge, including academic articles, research papers, white papers, blogs, and LinkedIn articles. Reading is a highly disciplined activity, and it is easy to slip out of it when the workload is excessive. However, I believe daily reading should be part of the job description for every data scientist. For a successful career in data science, I recommend that you set aside at least an hour of your working day to read.

It is what you read when you don’t have to that determines what you will be when you can’t help it.

― Oscar Wilde

5) Fail to Simplify

Everything should be made as simple as possible, but not simpler 

― Albert Einstein

At the core of any data science activity, which is often surrounded by complicated mathematics, hacking, and analysis, lies a simple idea. Simplification is getting at the core of that idea. It is often believed that you must simplify things for others, i.e. your business users and audience. On the contrary, I believe simplification is an activity you must do for yourself. It helps you develop a deeper relationship with your work.

Simplicity is the ultimate sophistication.

― Leonardo da Vinci

6) Don’t Sell Well

The story of the human race is the story of men and women selling themselves short.

― Abraham Maslow

Many data scientists believe that selling is not a part of their job and, trust me, they couldn’t be more wrong. Whether you are working with internal or external customers, selling is an integral part of your job. To make my point: even the greatest scientists had to sell their science. Einstein sold Relativity, Darwin had to sell Evolution, and Newton sold Gravity. These greatest creations of the human mind would have stayed in oblivion had it not been for the great salesmanship of their creators.

I am an artisan. I only became an artist when people watch what I do. That is when it becomes art.

― Rhys Ifans

The most important thing for data scientists is to ensure that their work gets integrated with business processes. Trust me, this requires some hard selling. If you believe your solution has value, you need to sell it well to show its promise.

Salesmanship is limitless. Our very living is selling. We are all salespeople.

―  James Cash Penney

Sign-off Note

These are some of the important lessons I have learnt in my career in data science. I must say I didn’t know them at the beginning and I hope they will help you with your career.

‘Come, Watson, come!’ [Sherlock Holmes] cried. ‘The game is afoot. Not a word! Into your clothes and come!’

― Sherlock Holmes

http://ucanalytics.com/blogs/6-worst-mistakes-for-data-scientists-and-how-to-avoid-them-explained-with-quotable-quotes/

Can RDBMS do Analytics?


RDBMS are relational database management systems, such as SQL Server, MySQL, and the Oracle flavors, used for enterprise data management. Every website that dynamically stores and retrieves data requires an RDBMS. All product-line businesses in fact require much more space for data storage and much more processing power for retrieval than a traditional data-enabled web application. Think about Teradata, IBM and the MSDN library. Their consultancy requires ever-increasing processing power and storage capacity for relational database requirements. But the big players are thinking beyond that.

We are in the age of analytics. All the big data sprawling over the Internet cannot be crunched by traditional methods. This is where data analytics comes in. Clients need to get in-depth knowledge from the vast rows and columns of multidimensional stores of product sales. Who needs mere updates of size and color information? We are talking about business sense: marginal and rational thinking. What styles of shoes are in for summer, what kinds of silk are low in cost for upcoming events, what strategies will cut costs under the latest production demands, where in the segmented markets do we find demand for hard disks? These are the questions market leaders are getting answered daily by their data analytics software. Yes, these software toolkit extensions are now increasingly available in all major RDBMS.

Traditional RDBMS were designed to be operational, reading and updating only the current transactions. But market forces have created a niche for analytical databases. These add-ons hold a large historical database and process the random, heuristic queries of business analysts: ask-answer-ask-again patterns of query processing. All updates to the historical database are kept on a timeline. But this does not mean we are saving junk. Only the relevant data from all seasons is filtered, extracted and loaded into a powerful decision support system. There you have the functionality to ask and answer all kinds of who-what-why questions.
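To make the operational-versus-historical split concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (current_sales, sales_history), the "units > 0" relevance filter and the sample rows are all invented for illustration; a real warehouse load would be far more involved.

```python
import sqlite3

# Hypothetical schema: an operational table that only holds the current
# state, and a history table that keeps every snapshot on a timeline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE current_sales (product TEXT, region TEXT, units INTEGER)")
conn.execute(
    "CREATE TABLE sales_history "
    "(snapshot_date TEXT, product TEXT, region TEXT, units INTEGER)"
)


def replace_current(conn, rows):
    """Operational update: overwrite the current state, keeping no history."""
    conn.execute("DELETE FROM current_sales")
    conn.executemany("INSERT INTO current_sales VALUES (?, ?, ?)", rows)
    conn.commit()


def load_history(conn, snapshot_date):
    """Filter, extract and load: copy only relevant rows into the history."""
    # Assumed relevance rule for this sketch: skip rows with zero units.
    conn.execute(
        "INSERT INTO sales_history (snapshot_date, product, region, units) "
        "SELECT ?, product, region, units FROM current_sales WHERE units > 0",
        (snapshot_date,),
    )
    conn.commit()


# Two made-up operational states, each snapshotted into the history table.
replace_current(conn, [("hard disk", "north", 5), ("hard disk", "south", 0)])
load_history(conn, "2015-06-01")
replace_current(conn, [("hard disk", "north", 7), ("hard disk", "south", 3)])
load_history(conn, "2015-07-01")

history = conn.execute(
    "SELECT snapshot_date, region, units FROM sales_history "
    "ORDER BY snapshot_date, region"
).fetchall()
current_count = conn.execute("SELECT COUNT(*) FROM current_sales").fetchone()[0]
print(history)
```

The operational table ends up holding only the latest two rows, while the history table retains the filtered snapshots from both dates, which is what lets an analyst ask questions across time.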

The typical features of data analytics are provided by a data warehouse layer on top of a collection of integrated databases. Here it is not a requirement to have one piece of software handle all data storage and processing. A number of standard RDBMS extract mechanisms support the data warehouse vendor for OLAP queries. OLAP, or online analytical processing, queries include business-driven inquiries like:

“What are the top brands of hard disks in the market?”

“Where do hard disk failures cause the highest levels of client dissatisfaction?”

“Why do hard disks fail especially in particular shop arcades?”

“Are other shopping areas particularly relevant in this scenario?”

“Where will a hard disk recycle program reduce costs remarkably?”

“Where did the last recycle program not work?”
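As a rough illustration, the first two questions above could be phrased as aggregate queries. The sketch below uses Python's built-in sqlite3 module; the disk_sales table, its columns and every figure in it are made up for the example and do not come from any real dataset or vendor product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table: one row per brand and shop arcade.
conn.execute(
    "CREATE TABLE disk_sales "
    "(brand TEXT, arcade TEXT, units_sold INTEGER, failures INTEGER)"
)
conn.executemany(
    "INSERT INTO disk_sales VALUES (?, ?, ?, ?)",
    [
        ("BrandA", "Arcade East", 120, 6),
        ("BrandA", "Arcade West", 80, 12),
        ("BrandB", "Arcade East", 150, 3),
        ("BrandB", "Arcade West", 90, 10),
    ],
)

# "What are the top brands of hard disks?" -> total units per brand.
top_brands = conn.execute(
    "SELECT brand, SUM(units_sold) AS total "
    "FROM disk_sales GROUP BY brand ORDER BY total DESC"
).fetchall()

# "Where do failures hurt most?" -> failure rate per arcade, worst first.
failure_rates = conn.execute(
    "SELECT arcade, 1.0 * SUM(failures) / SUM(units_sold) AS rate "
    "FROM disk_sales GROUP BY arcade ORDER BY rate DESC"
).fetchall()

print(top_brands)
print(failure_rates[0][0])
```

The "why" questions usually need a follow-up query that slices the same data along another dimension, which is the ask-answer-ask-again pattern in practice.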

Aster Getting Started in Predictive Analytics

Introducing Teradata Aster Discovery Platform: Getting Started
Ahsan Nabi Khan, September 25th, 2015

When You Need Aster Discovery Platform?

  1. SCALABLE ANALYTICS: Vast array of analytic algorithms run on commodity hardware as an Integrated Analytics Engine
  2. DIG DEEP AND FAST: Ad-hoc, interactive exploration of all data within seconds/minutes

Advanced Analytic Applications: Use Cases

  • Credit, Risk and Fraud
  • Packaging and Advertising
  • Buying Patterns
  • Cyber Defense
  • Fraud and Crime
  • Citizen’s Feedback
  • Call Data Records
  • Service Personalization
  • Friends Graphs
  • Click Stream
  • Opinion, Sentiment, Stars

Areas covered: Social Media, Telecom, Commerce Analysis, Federal Analysis

Discovery Process Model
