A data science life cycle is an iterative set of data science steps you take to deliver a project or analysis. Because every data science project and team are different, every specific data science life cycle is different. However, most data science projects tend to flow through the same general life cycle of data science steps.
Problem Definition
Generally, the project lead or product manager oversees this phase. Regardless, this initial phase should:
- State clearly the problem to be solved and why.
- Motivate everyone involved to push toward this why.
- Identify the project risks including ethical considerations.
- Identify the key stakeholders.
- Define the potential value of the forthcoming project.
- Align the stakeholders with the data science team.
- Research related high-level information.
- Assess the resources (people and infrastructure) you’ll likely need.
- Develop and communicate a high-level, flexible project plan.
- Identify the type of problem being solved.
- Get buy-in for the project.
Data Investigation and Cleaning
Without data, you’ve got nothing. Therefore, the team needs to identify what data is needed to solve the underlying problem. Then determine how to get the data:
- Is the data internally available? -> Get access to it
- Is the data readily collectable? -> Start capturing it
- Is the data available for purchase? -> Buy it
Once you have the data, start exploring it. Your data scientists or business/data analysts will lead several activities (sketched in code after this list) such as:
- Clean the data.
- Document the data quality.
- Combine various data sets to create new views.
- Visualize the data.
- Load the data into the target location (often to a cloud platform).
- Present initial findings to stakeholders and solicit feedback.
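To make these activities concrete, here is a minimal pandas sketch. The file names, columns, and cleaning rules are hypothetical placeholders, not a prescription for your data:

```python
import pandas as pd

# Load the raw extracts (hypothetical file and column names).
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
customers = pd.read_csv("customers.csv")

# Document data quality: missing values and duplicate records.
print(orders.isna().sum())
print("duplicate orders:", orders.duplicated(subset="order_id").sum())

# Clean: drop duplicates, fill a missing categorical with a sentinel value.
orders = orders.drop_duplicates(subset="order_id")
orders["channel"] = orders["channel"].fillna("unknown")

# Combine data sets to create a new view: orders enriched with customer info.
view = orders.merge(customers, on="customer_id", how="left")

# Visualize: monthly order volume (plotting requires matplotlib installed).
view.groupby(view["order_date"].dt.to_period("M"))["order_id"].count().plot(
    title="Orders per month"
)
```

Findings like the monthly volume chart above are exactly the sort of initial results worth presenting back to stakeholders for feedback.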
Minimal Viable Model
All data science life cycle frameworks have some sort of modeling phase. However, I want to emphasize the importance of getting something useful out as quickly as possible. This concept borrows from the idea of a Minimal Viable Product.
Breaking this down (a short modeling sketch follows the list):
- Minimal: The model is narrowly focused. It is not the best possible model, but it is good enough to make a measurable impact on a subset of the overall problem.
- Maximum validated learning about the model’s effectiveness: Develop a hypothesis and test it. This validated learning confirms or denies your team’s initial hypotheses, and it centers on a key question: is the model able to make a meaningful impact on the underlying business problem?
- Least effort: Full-fledged deployments are typically costly and time-consuming. Therefore, find the simplest way to get the model out.
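As a sketch of what this can look like in practice, the example below trains a deliberately simple baseline on a hypothetical churn data set and tests one hypothesis against a hold-out set. The file name, target column, and AUC threshold are all illustrative assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical data set: numeric customer features plus a binary churn label.
df = pd.read_csv("churn.csv")
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Minimal: a simple, explainable baseline, not the best possible model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Validated learning: test the hypothesis that the model predicts churn well
# enough to justify a pilot (the 0.70 threshold is an illustrative cut-off).
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Hold-out AUC: {auc:.3f} (hypothesis: AUC >= 0.70 justifies a pilot)")
```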
Deployment and Enhancements
Deployment
Many data science life cycles include “Deployment” or a similar term. This step creates the delivery mechanism you need to get the model out to the users or to another system.
Deployment means a lot of different things for different projects. It could be as simple as getting your model output into a Tableau dashboard, or as complex as scaling it in the cloud to millions of users. It’s also the step where any short-cuts taken in the earlier Minimal Viable Model phase are upgraded to production-grade systems.
Typically, the more “engineering-focused” team members, such as data engineers, cloud engineers, machine learning engineers, application developers, and quality assurance engineers, execute this phase.
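To give a feel for the simple end of that spectrum, here is one possible delivery mechanism: a small Flask service wrapping a pickled model. The artifact name, route, and payload schema are assumptions for illustration only:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once at startup (hypothetical artifact name).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON payload like {"features": [[0.1, 0.2, 0.3]]}.
    payload = request.get_json()
    scores = model.predict_proba(payload["features"])[:, 1]
    return jsonify({"churn_probability": scores.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A production deployment would add authentication, input validation, monitoring, and more; hardening those short-cuts is exactly what this phase is for.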
Enhancements
The “Enhancements” portion is less common in other data science life cycle frameworks, but I want to emphasize the importance of getting something basic out and then improving it.
Recall that the previous step delivered the “Minimal Viable Model”. While great as a starting point, the model probably isn’t as good as it should be. So while the engineers work on delivering the model, use that time to improve it. Conceptually, this “Enhancements” phase means to:
- Extend the model to similar use cases (i.e. a new “Problem Definition” phase)
- Add and clean data sets (i.e. a new “Data Investigation and Cleaning” phase)
- Try new modeling techniques (i.e. developing the next “Viable Model”; an example follows this list)
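For example, trying a new modeling technique can be as simple as fitting a more expressive candidate on the same hypothetical churn data and split as the earlier baseline sketch, and only promoting it if it clearly wins:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Same hypothetical churn data and split as the baseline sketch above.
df = pd.read_csv("churn.csv")
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate for the next "Viable Model": a more expressive technique.
candidate = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
auc = roc_auc_score(y_test, candidate.predict_proba(X_test)[:, 1])

# Promote the candidate only if it beats the deployed baseline by a margin
# that matters to the business, not by statistical noise.
print(f"Candidate hold-out AUC: {auc:.3f}")
```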
Data Science Ops
Most other data science life cycles end with a Deployment phase, or even earlier with an Assessment phase.
However, as data science matures into mainstream operations, companies need to take a stronger product focus that includes plans to maintain the deployed systems long-term. This management has three major, overlapping facets.
Software Management
A productized data science solution ultimately sits as part of a broader software system. And like all software systems, the solution needs to be maintained. Common practices include:
- Managing access control.
- Maintaining the various system environments.
- Triggering alert notifications for serious incidents.
- Meeting service level agreements (SLAs).
- Executing test scripts with every new deployment (a sample test follows this list).
- Implementing security patches.
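As one concrete example of such test scripts, here is a minimal pytest smoke test that could run against the prediction service sketched earlier. The URL and payload are assumptions:

```python
import requests

PREDICT_URL = "http://localhost:8080/predict"  # assumed service address

def test_predict_endpoint_returns_probabilities():
    # Smoke test: the service answers and returns values in [0, 1].
    resp = requests.post(
        PREDICT_URL, json={"features": [[0.1, 0.2, 0.3]]}, timeout=5
    )
    assert resp.status_code == 200
    probs = resp.json()["churn_probability"]
    assert all(0.0 <= p <= 1.0 for p in probs)
```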