25 October 2016

"In God we trust; all others bring data."
    -- W. Edwards Deming, American statistician
 

The market for Big Data software is expected to grow sixfold between 2015 and 2019 [1]. Large enterprises that once leaned almost exclusively on the business judgment of their managers are now supplementing or supplanting it with data analysis. I will lay out the process of going from raw data to decisions, along with the challenges and opportunities enterprises face as they move towards a culture of data-driven decisions.

 

What is the Data Analysis Lifecycle?
 

The Data Analysis Lifecycle (DALC) is the process of extracting insight from data and translating it into business decisions. In rough order of proximity to raw data, the most common functional roles in the DALC are:


 

Why Data Analysis Is Like A Basketball Game

In a 1984 paper on organizational behavior [2], Robert Keidel outlined three types of task interdependence, i.e. the patterns of interaction and task dependence between participants in a business process, and drew comparisons to American professional sports:

 
The DALC exhibits quintessential reciprocal interdependence. A business analyst may receive a completed analysis from a data scientist only to find that the business sponsor has changed its parameters in the meantime. Or a data engineer may push back on a requested analysis because the data needed to test the hypothesis is unavailable. Perhaps most unsettlingly, an analysis may turn out to be fruitless even after the functional roles have passed it back and forth for several weeks.
 
Comparisons to Software Development

The Software Development Lifecycle (SDLC) also employs technology-driven solutions to complex problems with diverse stakeholders. Early versions of the SDLC followed a rigid waterfall model with sequential interdependence between stages: requirements discovery, development, testing and deployment. New releases took months, and if requirements changed, it was hard to course-correct. The SDLC eventually shifted to agile development where core concepts such as user stories, unit tests and release branches were expressed explicitly in tools rather than living as tacit knowledge within its participants. Improved communication and collaboration using developer tools helped shrink release cycles to weeks and helped instrument and improve the SDLC. Now, quality developer tools along with elastically scalable infrastructure have brought us to continuous integration and continuous deployment (aka "DevOps"), which is beginning to look like reciprocal interdependence.
 
Unlike software development, data analysis is inherently built around reciprocal interdependence. Due to varying industry practices, the DALC is also less standardized than the SDLC is. Finally, data analysis is inherently more experimental than software development; whereas a development sprint reliably yields a software release, data analysis may not necessarily yield a decision. Despite these differences, practitioners of the DALC can learn from the SDLC. The DALC today is largely powered by tacit knowledge that resides within its practitioners. Although tools like Jupyter notebooks [3] help data scientists collaborate, there aren't many tools yet that span the entire DALC and capture handoffs across functional roles. Without quality tools, data analyses largely proceed at the speed of the old waterfall models of the SDLC. I’m starting to see startups take inspiration from SDLC tooling to build high-quality tools for the DALC and realize its full potential.

The questions that can be asked of data are growing faster than the number of people with the skills to answer them.

 

Challenges within the Data Analysis Lifecycle

Here are some common DALC-related challenges for enterprises:

 

Benefits of a Well-Run Data Analysis Lifecycle

As the DALC comes into its own as a well-understood business process, I’m excited about the following emerging trends:

Large enterprises are exiting the experimental phase of Big Data and are deploying those technologies in production – this phase opens opportunities for several innovative startups. I’m eagerly watching this massive opportunity unfold and can't wait to work alongside talented entrepreneurs who are bringing it about!


References

[1] Ovum Research, “Ovum forecasts big data software to grow by 50%”, Jul 2015, https://www.ovum.com/press_releases/ovum-forecasts-big-data-software-to-grow-by-50/
[2] R. W. Keidel, “Baseball, football, and basketball: Models for business”, Organizational Dynamics, Volume 12, Issue 3, Winter 1984, Pages 5-18
[3] Project Jupyter, http://jupyter.org/

 