When you read the marketing spin on Big Data and the tools available today, you may conclude that there is much upside and little downside to implementing a Big Data project. You will quickly find that this is not the case. It’s not as simple as typing “apt-get install Hadoop” into a Linux command window, watching everything install, asking it complex questions, and getting sage advice back. It is tough to get a Big Data project working.
This post is not trying to dissuade you from attempting a Big Data project. If done correctly, it can give you critical information about your customers and products that can separate you from the pack and make your company a leader in the marketplace.
However, you need to go into the project understanding what you are getting into and having the right resources assigned to give you the highest chance of success.
Here are four reasons why I think Big Data projects typically fail:
1. Underestimating how complex it is to get data, clean it, store it, and then analyze it.
One of the top reasons a big data project fails is that companies don’t know where all their data is.
Consequently, one of the first things you will need to do is understand where all of your data lives, develop connectors to extract that data from its source, clean the data, and put it in a form in which it can be analyzed. If you are using Hadoop for this, then once you have identified the data sources, you will need to build connectors to those sources and extract the data. You then have to make sure the data is relatively free of inconsistencies such as multiple records for the same customer, misspellings, and so on. Then, that data must be put into a standard format (JSON, XML, …) and stored in a distributed file system such as HDFS. All of this requires some sophisticated programming in a language such as Java. You will also have to build out a large cluster of servers to host the MapReduce nodes, and this infrastructure must be kept configured, maintained, and monitored.
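To give a feel for the cleaning step, here is a minimal sketch in plain Python (chosen for brevity, not because a Hadoop pipeline would be written this way). It assumes duplicate customers can be matched on a normalized email address, which is a simplification; the sample records and the function names normalize, deduplicate, and to_json_lines are made up for illustration.

```python
import json

# Hypothetical raw customer rows, as they might arrive from two source systems.
RAW_RECORDS = [
    {"email": "Ann@Example.com ", "name": "ann  smith", "zip": "30301"},
    {"email": "ann@example.com", "name": "Ann Smith", "zip": "30301"},
    {"email": "bob@example.com", "name": "Bob Jones", "zip": "73301"},
]


def normalize(record):
    """Trim whitespace, lowercase the email, and tidy the name."""
    return {
        "email": record["email"].strip().lower(),
        "name": " ".join(record["name"].split()).title(),
        "zip": record["zip"].strip(),
    }


def deduplicate(records):
    """Keep the first record seen for each normalized email address."""
    unique = {}
    for record in map(normalize, records):
        unique.setdefault(record["email"], record)
    return list(unique.values())


def to_json_lines(records):
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


clean = deduplicate(RAW_RECORDS)
print(to_json_lines(clean))
```

Real customer matching is usually fuzzier than an exact key comparison, which is part of why this step gets underestimated.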
2. You don’t know where the data is and what condition it is in.
As mentioned, Big Data is about “Big Data,” up to petabytes of data all stored in one place where analytics operations can be performed against it. But to gather and store this data, you need to know where it is in the first place, and it lurks in places you may not always think of. For example, almost every company has some CRM or point-of-sale system that keeps track of its customer data. It is usually a relational database and is used daily to transact business with the company’s customers. And, because you cannot know in advance what data you will need for analytics, you will need to extract all of this data and put it in the analytics “Data Lake.” But there is much more data available to be analyzed: web log files, spreadsheets, text messages with customers, feedback from Facebook, Instagram posts and comments, and much more. All of this data needs to go into the data lake in a standard form, and extraction APIs need to be built to get it.
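As a rough illustration of what “a standard form” can mean, the sketch below converts two very different inputs, a CRM row and a web server log line, into the same record shape. Everything here is an assumption made for the example: the field names, the source/timestamp/payload envelope, and the log layout (a common Apache-style access log line).

```python
import re
from datetime import datetime, timezone

# Hypothetical inputs: one CRM row and one web server access log line.
CRM_ROW = {"customer_id": 42, "updated": "2020-03-01", "city": "Atlanta"}
LOG_LINE = (
    '203.0.113.7 - - [01/Mar/2020:10:15:32 +0000] '
    '"GET /caps/blue HTTP/1.1" 200 512'
)

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def from_crm(row):
    """Wrap a CRM row in the shared envelope (date assumed to be UTC)."""
    ts = datetime.strptime(row["updated"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return {
        "source": "crm",
        "timestamp": ts.isoformat(),
        "payload": {"customer_id": row["customer_id"], "city": row["city"]},
    }


def from_web_log(line):
    """Parse one access log line into the shared envelope."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognized log line: " + line)
    ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return {
        "source": "web_log",
        "timestamp": ts.isoformat(),
        "payload": {
            "ip": match["ip"],
            "path": match["path"],
            "status": int(match["status"]),
        },
    }


records = [from_crm(CRM_ROW), from_web_log(LOG_LINE)]
for record in records:
    print(record)
```

Each new source (spreadsheets, text messages, social media comments) needs its own converter like these, which is where much of the extraction effort goes.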
3. You don’t know what questions to ask (the data).
For the big data project to have any value to a company, it needs to provide the team with actionable information. But it’s still a dumb computer, so it must be told explicitly which questions you want answered. Because of the amount of data available in a big data ecosystem, you can ask some pretty interesting questions, from something as broad as “what is the age of most of my current customers” to something as specific as “what zip code do most customers that buy blue caps live in.” You can also look for patterns of behavior and lots of other exciting things, and the computer can learn as more and more data is analyzed and results are stored. Still, the company’s operations and marketing teams need to be able to articulate to the data analysts and programmers what questions they want to study.
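To show how precisely a question has to be stated before a computer can answer it, here is a small sketch of the blue caps question in plain Python. The purchase records and the top_zip_for function are hypothetical, and the sketch makes one explicit choice the business question leaves open: it counts distinct customers, not individual purchases.

```python
from collections import Counter

# Hypothetical purchase records; customer c1 bought a blue cap twice.
PURCHASES = [
    {"customer": "c1", "zip": "30301", "item": "cap", "color": "blue"},
    {"customer": "c2", "zip": "30301", "item": "cap", "color": "blue"},
    {"customer": "c3", "zip": "73301", "item": "cap", "color": "blue"},
    {"customer": "c4", "zip": "73301", "item": "cap", "color": "red"},
    {"customer": "c1", "zip": "30301", "item": "cap", "color": "blue"},
]


def top_zip_for(purchases, item, color):
    """Return the zip code with the most distinct buyers of the given item."""
    # A set removes repeat purchases by the same customer.
    buyers = {
        (p["customer"], p["zip"])
        for p in purchases
        if p["item"] == item and p["color"] == color
    }
    counts = Counter(zip_code for _, zip_code in buyers)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(top_zip_for(PURCHASES, "cap", "blue"))
```

Whether to count customers or purchases, and what to do about ties, are exactly the kinds of decisions the operations and marketing teams have to spell out for the analysts.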
4. It is expensive and takes time (Java programmers, lots of hardware and complex administration, expensive analysts).
I hope you have gathered from the first three challenges that it’s not easy to get a big data project off the ground, much less keep it running. So, the last reason a project like this fails is the lack of a skilled team that is given enough time and budget. If you have never seen what a MapReduce job looks like, you may want to take a look at the classic word count example, and that example only counts words and returns the results. To add to the complexity, you need to develop in Java (although other programming languages will work, Java is the de facto standard environment), and you probably want to use an IDE such as Eclipse to manage things. All of this requires some pretty skilled developers who understand how to write MapReduce jobs and can work with analysts to know what they need to program. These are not junior programmers. For the actual analysis, your data analytics team will need to understand how to develop the complex study algorithms and be able to articulate these to the developers to implement. These are senior analysts with a deep understanding of the data. And finally, the team needs to be led by a strong project manager who either is a senior executive or has the sponsorship of one.
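For readers who have not seen the word count pattern, the sketch below mimics its three stages (map, shuffle, reduce) in plain Python running in a single process. It is not Hadoop code and uses no Hadoop API; the function names and sample documents are invented for illustration. A real Hadoop job in Java adds mapper and reducer classes, job configuration, and a cluster to run on.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input documents.
DOCUMENTS = ["big data is big", "data is hard"]
counts = word_count(DOCUMENTS)
print(counts)
```

Even this toy version has three moving parts for a trivial question; a production job adds distribution, failure handling, and data formats on top.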
All of this costs both time and money. The good news is that once the initial project is finished, subsequent studies and changes to data sources can become somewhat routine, and the costs will decrease. But you still need to understand that maintenance alone will not be without cost or effort.
For a big data project to be successful, you must have the right team in place before you start, clearly defined expectations, and the money and patience to let the project play out. If done correctly, a big data project can pay off big time. It’s just quite a climb to reach the top.