"In God we trust; all others bring data."
-- W. Edwards Deming, American statistician
The market for Big Data software is expected to grow sixfold between 2015 and 2019 [1]. Large enterprises that once leaned almost exclusively on the business judgment of their managers are now supplementing or supplanting it with data analysis. In this post, I lay out the process of going from raw data to decisions, and describe the challenges and opportunities enterprises face as they move towards a culture of data-driven decision-making.
What is the Data Analysis Lifecycle?
The Data Analysis Lifecycle (DALC) is the process of extracting insight from data and translating it into business decisions. In rough order of proximity to raw data, the most common functional roles in the DALC are:
- The data engineer ensures that raw data in all relevant forms is collected in an enterprise-wide repository as a strategic asset. Data engineers maintain a robust data lake and put data analyses into production.
- The data scientist executes the core of the data analysis and produces the first glimmers of insight from raw data. Data scientists iteratively create and test hypotheses by building mathematical models and visualizations.
- The business analyst is a domain expert who sits between the business sponsor and the data scientist, translating business objectives into analytical questions and analysis results back into business terms.
- The business sponsor is a key stakeholder in decisions flowing from the data analysis. They set objectives for the data analysis, define parameters such as timeline and budget, and consume dashboards and reports.
Why Data Analysis Is Like A Basketball Game
In a 1984 paper on organizational behavior [2], Robert Keidel outlined three types of task interdependence, i.e. the patterns of interaction and task dependence between participants in a business process, and drew comparisons to American professional sports:
- The loosest form of interdependence is pooled interdependence, where each player contributes individually but doesn't depend directly on others to accomplish a larger task. This pattern is seen in baseball, where team member contributions are relatively independent, and in sales departments, where individual salespeople pool their respective outputs into the department's output.
- Sequential interdependence occurs when one stage in a process produces an output used by the next stage. This pattern is slightly more complex to coordinate than pooled interdependence. It’s seen in football, where the ball moves towards the end zone in a series of downs, and in assembly lines, where a product is put together sequentially across a factory floor.
- The most complex form of task interdependence is reciprocal interdependence, where one department’s output feeds into another department’s work, with the possibility that the task moves back and forth in cycles. One sees this pattern in a game of basketball, where the frenetic progress of the ball towards the hoop looks very different from the relatively orderly progress of a football to the end zone.
The DALC exhibits quintessential reciprocal interdependence. A business analyst may receive a completed analysis from a data scientist only to find that the business sponsor has changed its parameters in the meantime. Or, a data engineer may push back on a requested analysis because data to support the hypothesis is unavailable. Perhaps most unsettlingly, an analysis may turn out to be fruitless even after the functional roles pass it back and forth for several weeks.
Comparisons to Software Development
The Software Development Lifecycle (SDLC) also applies technology-driven solutions to complex problems with diverse stakeholders. Early versions of the SDLC followed a rigid waterfall model with sequential interdependence between stages: requirements discovery, development, testing and deployment. New releases took months, and if requirements changed, it was hard to course-correct. The SDLC eventually shifted to agile development, in which core concepts such as user stories, unit tests and release branches were expressed explicitly in tools rather than living as tacit knowledge in participants' heads. Improved communication and collaboration through developer tools helped shrink release cycles to weeks and helped instrument and improve the SDLC itself. Now, quality developer tools along with elastically scalable infrastructure have brought us to continuous integration and continuous deployment (aka "DevOps"), which is beginning to look like reciprocal interdependence.
Unlike software development, data analysis is inherently built around reciprocal interdependence. The DALC is also less standardized than the SDLC, owing to varying practices across industries. Finally, data analysis is more experimental than software development: whereas a development sprint reliably yields a software release, a data analysis may not yield a decision at all. Despite these differences, practitioners of the DALC can learn from the SDLC. The DALC today is largely powered by tacit knowledge residing in its practitioners. Although tools like Jupyter notebooks [3] help data scientists collaborate, few tools yet span the entire DALC and capture handoffs across functional roles. Without quality tools, data analyses largely proceed at the speed of the old waterfall models of the SDLC. I’m starting to see startups take inspiration from SDLC tooling to build high-quality tools for the DALC and realize its full potential.
Challenges within the Data Analysis Lifecycle
Here are some common DALC-related challenges for enterprises:
- The multi-language challenge. Data scientists come from different backgrounds, each with its own "native programming language" for data analysis. Python and R dominate modern analysis, with significant continuing usage of SAS and MATLAB. Although it is common (and valuable) to have diverse programming-language backgrounds on a team, tools for collaborating across those languages are still immature, leading to communication issues (a minimal sketch of the translation burden follows this list).
- Deploying analyses into production. The analysis environments above were designed for an individual data scientist's workstation and are significantly slower than the Java and C++ systems common in production IT environments. After an analysis in R or Python is completed, data engineers frequently rewrite the code in Java or C++ to ‘productionize’ it, adding time, technical complexity and the possibility of error (the second sketch below shows one way to shrink this step).
- The half-life of data. The process complexity of the DALC can be a significant liability in a world of rapidly changing business conditions. An analysis conducted over several weeks may be obsolete the moment it is completed because newer data has superseded it. Recognizing the short half-life of data helps enterprises make profitable, timely business decisions (the third sketch below makes the metaphor concrete).
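As a minimal sketch of the multi-language challenge, here is the same trivial analysis, an ordinary least-squares line fit, written in Python with its R equivalent noted in a comment. The dataset and variable names are purely illustrative, not drawn from any real engagement.

```python
# A minimal sketch of the multi-language problem: the same one-line analysis
# must be re-expressed per language, and nothing in the handoff records
# that the two snippets are equivalent.
import numpy as np

# Hypothetical data: monthly ad spend vs. revenue (illustrative numbers).
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
revenue = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Python: fit revenue ~ spend with ordinary least squares.
slope, intercept = np.polyfit(spend, revenue, deg=1)
print(f"revenue ~ {slope:.2f} * spend + {intercept:.2f}")

# The equivalent analysis a collaborator might hand over in R instead:
#   fit <- lm(revenue ~ spend)
#   coef(fit)
```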
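For the productionizing challenge, one common mitigation is to reduce a fitted model to its learned parameters plus a tiny, dependency-free scoring function, which is far easier to port to Java or C++ than an entire analysis environment. The sketch below assumes scikit-learn and synthetic data; it is one possible approach, not a universal recipe.

```python
# A sketch of easing the 'productionize' step: train in Python, then ship
# plain numbers plus a few lines of arithmetic instead of a Python runtime.
import math
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data (two features, binary label) for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Extract the learned parameters; these are what actually ship.
coefs = model.coef_[0].tolist()
intercept = float(model.intercept_[0])

def score(features):
    """Dependency-free scorer: trivial to re-implement in Java or C++."""
    z = intercept + sum(c * f for c, f in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

print(score([1.0, -0.5]))  # probability of the positive class
```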
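Finally, taking the half-life of data literally suggests a simple mitigation: weight observations by their age so that stale records decay out of the analysis rather than silently distorting it. The 30-day half-life below is an assumed, illustrative parameter, not a recommendation.

```python
# A sketch of recency weighting: each observation's weight halves every
# `half_life_days`, so recent data dominates and stale data fades out.
import numpy as np

half_life_days = 30.0  # illustrative choice; tune per business domain

age_days = np.array([0, 7, 30, 90, 365])      # how old each record is
weights = 0.5 ** (age_days / half_life_days)  # 1.0, ~0.85, 0.5, 0.125, ...

for age, w in zip(age_days, weights):
    print(f"{age:>4} days old -> weight {w:.3f}")
```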
Benefits of a Well-Run Data Analysis Lifecycle
As the DALC comes into its own as a well-understood business process, I’m excited about the following emerging trends:
- Democratization. A well-governed DALC combined with high-quality tools opens up analysis beyond data scientists to other stakeholders. The more people who have access to clean data and good tools, the more effective a business can be at making data-driven decisions.
- Automation of Insight. The questions that can be asked of data are growing faster than the pool of people with the skills to answer them. The solution is to augment the process of data analysis with automation, including machine learning, which can shorten the time to insight by orders of magnitude (a toy sketch follows this list).
- Faster Time to Innovation. Enterprises are looking at new kinds of data, as well as data they already have, to extract insights and innovate faster. Tapping novel data sources and the data exhaust of existing operations can both help speed innovation.
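As a toy sketch of such augmentation, the snippet below evaluates several candidate models by cross-validation and ranks them, so a scarce expert reviews a shortlist rather than hand-building every model. The library choice (scikit-learn) and the synthetic dataset are illustrative assumptions.

```python
# A toy sketch of automated model selection: score several candidate
# models by cross-validation and surface the best ones for human review.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a business dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

candidates = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(max_depth=4),
    "random_forest": RandomForestClassifier(n_estimators=50),
}

# Rank candidates by mean 5-fold accuracy instead of building each by hand.
results = {name: cross_val_score(est, X, y, cv=5).mean()
           for name, est in candidates.items()}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```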
Large enterprises are exiting the experimental phase of Big Data and deploying these technologies in production, a shift that opens opportunities for innovative startups. I’m eagerly watching this massive opportunity unfold and can't wait to work alongside the talented entrepreneurs who are bringing it about!
References
[1] Ovum Research, "Ovum forecasts big data software to grow by 50%", July 2015, https://www.ovum.com/press_releases/ovum-forecasts-big-data-software-to-grow-by-50/
[2] R. W. Keidel, "Baseball, football, and basketball: Models for business", Organizational Dynamics, Volume 12, Issue 3, Winter 1984, Pages 5-18
[3] Project Jupyter, http://jupyter.org/