"In God we trust; all others bring data."
-- W. Edwards Deming, American statistician
The market for Big Data software is expected to grow sixfold between 2015 and 2019 [1]. Large enterprises that once leaned almost exclusively on the business judgment of their managers are now supplementing or supplanting it with data analysis. In this post, I lay out the process of going from raw data to decisions, and describe the challenges and opportunities enterprises face as they move towards a culture of data-driven decision-making.
What is the Data Analysis Lifecycle?
The Data Analysis Lifecycle (DALC) is the process of extracting insight from data and translating it into business decisions. In rough order of proximity to raw data, the most common functional roles in the DALC are:
- The data engineer ensures that raw data in all relevant forms is collected in an enterprise-wide repository as a strategic asset. Data engineers maintain a robust data lake and put data analyses into production.
- The data scientist executes the core of the data analysis and produces the first glimmers of insight from raw data. Data scientists iteratively create and test hypotheses by building mathematical models and visualizations.
- The business analyst is a domain expert who sits between the business sponsor and the data scientist, translating business objectives into analytical questions and analysis results back into business terms.
- The business sponsor is a key stakeholder in decisions flowing from the data analysis. They set objectives for the data analysis, define parameters such as timeline and budget, and consume dashboards and reports.
Why Data Analysis Is Like A Basketball Game
In a 1984 paper on organizational behavior [2], Robert Keidel outlined three types of task interdependence, i.e. the patterns of interaction and task dependence between participants in a business process, and drew comparisons to American professional sports:
- The loosest form of interdependence is pooled interdependence, where each player contributes individually but doesn't depend directly on others to accomplish a larger task. This pattern is seen in baseball, where team member contributions are relatively independent, and in sales departments, where individual salespeople pool their respective outputs into the department's output.
- Sequential interdependence occurs when one stage in a process produces an output used by the next stage. This pattern is slightly more complex to coordinate than pooled interdependence. It’s seen in football, where the ball moves towards the end zone in a series of downs, and in assembly lines, where a product is put together sequentially across a factory floor.
- The most complex form of task interdependence is reciprocal interdependence, where one department’s output feeds into another department’s work, with the possibility that the task moves back and forth in cycles. One sees this pattern in a game of basketball, where the frenetic progress of the ball towards the hoop looks very different from the relatively orderly progress of a football to the end zone.
The DALC exhibits quintessential reciprocal interdependence. A business analyst may receive a completed analysis from a data scientist only to find that the business sponsor has changed its parameters in the meantime. Or, a data engineer may push back on a requested analysis because data to support the hypothesis is unavailable. Perhaps most unsettlingly, an analysis may turn out to be fruitless even after the functional roles pass it back and forth for several weeks.
Comparisons to Software Development
The Software Development Lifecycle (SDLC) also applies technology-driven solutions to complex problems with diverse stakeholders. Early versions of the SDLC followed a rigid waterfall model with sequential interdependence between stages: requirements discovery, development, testing and deployment. New releases took months, and if requirements changed, it was hard to course-correct. The SDLC eventually shifted to agile development, in which core concepts such as user stories, unit tests and release branches were expressed explicitly in tools rather than living as tacit knowledge in participants' heads. Improved communication and collaboration through developer tools helped shrink release cycles to weeks and helped instrument and improve the SDLC itself. Now, quality developer tools along with elastically scalable infrastructure have brought us to continuous integration and continuous deployment (aka "DevOps"), which is beginning to look like reciprocal interdependence.
Unlike software development, data analysis is inherently built around reciprocal interdependence. The DALC is also less standardized than the SDLC, owing to varying practices across industries. Finally, data analysis is more experimental than software development: whereas a development sprint reliably yields a software release, a data analysis may not yield a decision at all. Despite these differences, practitioners of the DALC can learn from the SDLC. The DALC today is largely powered by tacit knowledge residing in its practitioners. Although tools like Jupyter notebooks [3] help data scientists collaborate, few tools yet span the entire DALC and capture handoffs across functional roles. Without quality tools, data analyses largely proceed at the speed of the old waterfall models of the SDLC. I’m starting to see startups take inspiration from SDLC tooling to build high-quality tools for the DALC and realize its full potential.
Challenges within the Data Analysis Lifecycle
Here are some common DALC-related challenges for enterprises:
- The multi-language challenge. Data scientists come from different backgrounds, each with its own "native programming language" for data analysis. Python and R dominate modern analysis, with significant continuing usage of SAS and MATLAB. Although it is common (and valuable) to have diverse programming-language backgrounds on a team, tools for collaborating across those languages are still immature, leading to communication issues (a minimal sketch of the translation burden follows this list).
- Deploying analyses into production. The analysis environments above were designed for an individual data scientist's workstation and are significantly slower than the Java and C++ systems common in production IT environments. After an analysis in R or Python is completed, data engineers frequently rewrite the code in Java or C++ to ‘productionize’ it, adding time, technical complexity and the possibility of error (the second sketch below shows one way to shrink this step).
- The half-life of data. The process complexity of the DALC can be a significant liability in a world of rapidly changing business conditions. An analysis conducted over several weeks may be obsolete the moment it is completed because newer data has superseded it. Recognizing the short half-life of data helps enterprises make profitable, timely business decisions (the third sketch below makes the metaphor concrete).
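As a minimal sketch of the multi-language challenge, here is the same trivial analysis, an ordinary least-squares line fit, written in Python with its R equivalent noted in a comment. The dataset and variable names are purely illustrative, not drawn from any real engagement.

```python
# A minimal sketch of the multi-language problem: the same one-line analysis
# must be re-expressed per language, and nothing in the handoff records
# that the two snippets are equivalent.
import numpy as np

# Hypothetical data: monthly ad spend vs. revenue (illustrative numbers).
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
revenue = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Python: fit revenue ~ spend with ordinary least squares.
slope, intercept = np.polyfit(spend, revenue, deg=1)
print(f"revenue ~ {slope:.2f} * spend + {intercept:.2f}")

# The equivalent analysis a collaborator might hand over in R instead:
#   fit <- lm(revenue ~ spend)
#   coef(fit)
```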
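For the productionizing challenge, one common mitigation is to reduce a fitted model to its learned parameters plus a tiny, dependency-free scoring function, which is far easier to port to Java or C++ than an entire analysis environment. The sketch below assumes scikit-learn and synthetic data; it is one possible approach, not a universal recipe.

```python
# A sketch of easing the 'productionize' step: train in Python, then ship
# plain numbers plus a few lines of arithmetic instead of a Python runtime.
import math
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data (two features, binary label) for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Extract the learned parameters; these are what actually ship.
coefs = model.coef_[0].tolist()
intercept = float(model.intercept_[0])

def score(features):
    """Dependency-free scorer: trivial to re-implement in Java or C++."""
    z = intercept + sum(c * f for c, f in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

print(score([1.0, -0.5]))  # probability of the positive class
```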
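Finally, taking the half-life of data literally suggests a simple mitigation: weight observations by their age so that stale records decay out of the analysis rather than silently distorting it. The 30-day half-life below is an assumed, illustrative parameter, not a recommendation.

```python
# A sketch of recency weighting: each observation's weight halves every
# `half_life_days`, so recent data dominates and stale data fades out.
import numpy as np

half_life_days = 30.0  # illustrative choice; tune per business domain

age_days = np.array([0, 7, 30, 90, 365])      # how old each record is
weights = 0.5 ** (age_days / half_life_days)  # 1.0, ~0.85, 0.5, 0.125, ...

for age, w in zip(age_days, weights):
    print(f"{age:>4} days old -> weight {w:.3f}")
```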
Benefits of a Well-Run Data Analysis Lifecycle
As the DALC comes into its own as a well-understood business process, I’m excited about the following emerging trends:
- Democratization. A well-governed DALC combined with high-quality tools opens up analysis beyond data scientists to other stakeholders. The more people who have access to clean data and good tools, the more effective a business can be at making data-driven decisions.
- Automation of Insight. The questions that can be asked of data are growing faster than the pool of people with the skills to answer them. The solution is to augment the process of data analysis with automation, including machine learning, which can shorten the time to insight by orders of magnitude (a toy sketch follows this list).
- Faster Time to Innovation. Enterprises are looking at new kinds of data, as well as data they already have, to extract insights and innovate faster. Tapping novel data sources and the data exhaust of existing operations can both help speed innovation.
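As a toy sketch of such augmentation, the snippet below evaluates several candidate models by cross-validation and ranks them, so a scarce expert reviews a shortlist rather than hand-building every model. The library choice (scikit-learn) and the synthetic dataset are illustrative assumptions.

```python
# A toy sketch of automated model selection: score several candidate
# models by cross-validation and surface the best ones for human review.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a business dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

candidates = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(max_depth=4),
    "random_forest": RandomForestClassifier(n_estimators=50),
}

# Rank candidates by mean 5-fold accuracy instead of building each by hand.
results = {name: cross_val_score(est, X, y, cv=5).mean()
           for name, est in candidates.items()}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```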
Large enterprises are exiting the experimental phase of Big Data and deploying these technologies in production, a shift that opens opportunities for innovative startups. I’m eagerly watching this massive opportunity unfold and can't wait to work alongside the talented entrepreneurs who are bringing it about!
References
[1] Ovum Research, "Ovum forecasts big data software to grow by 50%", July 2015, https://www.ovum.com/press_releases/ovum-forecasts-big-data-software-to-grow-by-50/
[2] R. W. Keidel, "Baseball, football, and basketball: Models for business", Organizational Dynamics, Volume 12, Issue 3, Winter 1984, Pages 5-18
[3] Project Jupyter, http://jupyter.org/