Thoughts on Data Science, ML and Startups

A Case For Agile Data Science

This article was first published in TowardsDataScience

tl;dr;

  • I have encountered a lot of resistance in the data science community to agile methodology, and to the Scrum framework in particular.
  • I see it differently and claim that most disciplines would improve by adopting an agile mindset.
  • We will go through a typical Scrum sprint to highlight how compatible the data science process is with the agile development process.
  • Finally, we discuss when Scrum is not an appropriate process to follow: if you are a consultant working on many projects at a time, or if your work requires deep concentration on a single, narrow issue (narrow enough that you alone can solve it).

I recently came across a Medium post claiming that Scrum is awful for data science. I’m afraid I have to disagree and would like to make a case for Agile Data Science.

Ideas for this post are significantly influenced by the Agile Data Science 2.0 book (which I highly recommend) and by personal experience. I am eager to hear about other people's experiences, so please share them in the comments.


First, we need to agree on what data science is and how it solves business problems so we can investigate the process of data science and how agile (and specifically Scrum) can improve it.

What is Data Science?

There are countless definitions online. For example, Wikipedia offers this one:

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.

In my opinion, it is quite an accurate definition of what data science tries to accomplish. But I would simplify this definition further.

Data Science solves business problems by combining business understanding, data and algorithms.

Compared to the definition in Wikipedia, I would like to stress that data scientists should aim to solve business problems rather than “extract knowledge and insights.”

How Does Data Science Solve Business Problems?

So data science is here to solve business problems. We need to accomplish a few things along the way:

  1. Understand the business problem;
  2. Identify and acquire available data;
  3. Clean / transform / prepare data;
  4. Select and fit an appropriate “model” for the given data;
  5. Deploy the model to “production”; this is our attempt at solving the given problem;
  6. Monitor performance;

As with everything, there are countless ways to go about implementing those steps, but I will try to persuade you that the agile (incremental and iterative) approach brings the most value to the company and the most joy to data scientists.

Agile Data Science Manifesto

I took this from page 6 in the Agile Data Science 2.0 book, so you are encouraged to read the original, but here it is:

  • Iterate, iterate, iterate — tables, charts, reports, predictions.
  • Ship intermediate output. Even failed experiments have output.
  • Prototype experiments over implementing tasks.
  • Integrate the tyrannical opinion of data in product management.
  • Climb up and down the data-value pyramid as you work.
  • Discover and pursue the critical path to a killer product.
  • Get meta. Describe the process, not just the end state.

Not all the steps are self-explanatory, and I encourage you to go and read what Russell Jurney had to say, but I hope that the main idea is clear: we share intermediate output, and we iterate to achieve value.

Given the above preliminaries, let us go over a standard week for a Scrum team. We will assume a one-week sprint.


Scrum Team Sprint

Day 1

There are many variations of sprint structure, but I will assume that planning is done on Monday morning. The team decides which user stories from the product backlog will be moved to the sprint backlog. The most pressing issue for our business, as evident from the backlog ranking, is customer fraud: fraudulent transactions are driving our valuable customers off our platform. During the previous backlog refinement session, the team already discussed this task, and the product owner got additional information from the Fraud Investigation team. So during the meeting, the team decides to start with a simple experiment (and is already thinking of interesting iterations further down the road): an initial model based on simple features of the transaction and the participating users. Work is split so that the data scientist can go and look at the data the team identified for this problem. The data engineer will set up the pipeline for integrating model output into the DWH systems, and the full-stack engineer starts to set up a transaction review page and an alert system for the Fraud Investigation team.

Day 2

At the start of Tuesday, the whole team gathers and shares progress. The data scientist shows a few graphs which indicate that, even with limited features, we will have a decent model. At the same time, the data engineer is already halfway through setting up the system to score incoming transactions with the new model. The full-stack engineer is also progressing nicely, and after just a few minutes, everyone is back at their desks working on the agreed tasks.

Day 3

As on Tuesday, the team starts Wednesday with a standup meeting to share their progress. There is already a simple model built, along with some accuracy and error rate numbers. The data engineer shows the infrastructure for transaction scoring, and the team discusses how the features arrive at the system and what needs to be done for them to be ready for the algorithm. The full-stack engineer shows the admin panel, where transaction metadata is displayed, and the triggering mechanism. Another discussion follows on the threshold value at which the model output triggers a message to a fraud analyst. The team agrees that this value needs to be adjustable, since different models might have different output distributions and, depending on other variables, we might want to increase or decrease the number of approved transactions.
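To make the adjustable threshold concrete, here is a minimal sketch of how it could be kept outside the model. Everything in it is an assumption made for illustration: the `AlertConfig` name, the default of 0.8, and the toy scores are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class AlertConfig:
    """Hypothetical alert settings; the name and default are illustrative."""
    threshold: float = 0.8  # scores at or above this value trigger a review


def transactions_to_review(scores, config):
    """Return the IDs of transactions whose fraud score reaches the threshold."""
    return [tx_id for tx_id, score in scores.items() if score >= config.threshold]


# Made-up scores standing in for model output (higher means more suspicious).
scores = {"tx-1": 0.15, "tx-2": 0.85, "tx-3": 0.55, "tx-4": 0.95}

# The same scores produce different workloads for the analysts,
# depending only on the configured threshold.
strict = transactions_to_review(scores, AlertConfig(threshold=0.9))
lenient = transactions_to_review(scores, AlertConfig(threshold=0.5))
```

Because the threshold lives in configuration rather than in the model, swapping in a model with a different score distribution only requires a new setting, not a code change.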

Day 4

On Thursday, the team already has all the pieces, and during the standup they discuss how to integrate them. The team also outlines how best to monitor models in production, so that model performance can be evaluated and degradation can be detected before it causes any real damage. They agree that a simple dashboard for monitoring accuracy and error rates will suffice for now.
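The numbers behind such a dashboard could be as simple as the sketch below. It assumes that analysts eventually confirm each transaction as fraud or not fraud, so that predictions can be compared with outcomes; the function name and the sample labels are made up for illustration.

```python
def monitoring_metrics(predicted, actual):
    """Compute accuracy and error rates from paired boolean labels.

    predicted[i] is True if the model flagged transaction i as fraud;
    actual[i] is True if it was later confirmed as fraud.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)              # fraud, flagged
    tn = sum((not p) and (not a) for p, a in pairs)  # legitimate, not flagged
    fp = sum(p and (not a) for p, a in pairs)        # legitimate, flagged
    fn = sum((not p) and a for p, a in pairs)        # fraud, missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up labels for eight transactions.
predicted = [True, False, True, False, False, True, False, False]
actual = [True, False, False, False, True, True, False, False]
metrics = monitoring_metrics(predicted, actual)
```

Tracking the two error rates separately matters for fraud, where missed fraud and falsely blocked customers carry very different costs.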

Day 5

Friday is demo day. During standup, the team discusses the last remaining issues with the first iteration of the transaction fraud detection. Team members prepare for the meeting with the fraud analysts who will be using this solution.

During the demo, the team shows the fraud analysts what they have built. The team presents performance metrics and their implications for the analysts' work. All feedback is converted to tasks for future sprints. Another vital part of the sprint is the retrospective, a meeting where the team discusses three things:

  1. What went well in the sprint;
  2. What could be improved;
  3. What we will commit to improving in the next sprint;

Further down the road

During the next sprint, the team works on the next most important item from the product backlog. It might be feedback from the fraud analysts, or it might be something else that the product owner thinks will improve the overall business the most. However, the team closely monitors the performance of the initial version of the solution. It will continue to do so because ML solutions are sensitive to changes in the assumptions the model made about the data distribution.
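One way to watch for such changes is to compare the distribution of a feature, or of the model score, in production against the distribution seen at training time. The sketch below uses the Population Stability Index (PSI), which is my choice for illustration and not something the team in this story is said to use; the bin count, the clipping constant, and the sample data are all assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """Compare two samples of one numeric feature using PSI.

    Bins are equal-width over the range of `expected`; values in `actual`
    that fall outside that range are counted in the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Clip empty bins to eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Made-up data: a training-time sample and a shifted production sample.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
```

Identical samples give a PSI of zero, and the value grows as the samples diverge. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, but any alerting cut-off should be validated against your own data.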

Discussion

Above is a relatively “clean” exposition of the Scrum process for data science solutions. The real world is rarely that tidy, but I wanted to convey a few points:

  1. Data science cannot stand on its own. If we are going to impact the real world, data science has to be part of a wider, cross-functional team;
  2. Iteration is critical in data science, and we should expose artifacts of those iterations to our stakeholders to receive feedback as fast as possible;
  3. Scrum is a framework designed for iterative progress, which makes it a perfect fit for data science work;

However, it is not a framework for every endeavor. If your job requires you to think deeply for days, then Scrum and agile would probably be very disruptive and counterproductive. Also, if your work requires you to handle many small, unrelated data science tasks, following Scrum would be inappropriate, and Kanban might be worth considering instead. However, typical product data science work is not like that. Iteration is king, and getting feedback fast is key to providing the right solutions to business problems.

In summary

Data science is a perfect fit for Scrum with a single modification: we do not expect to ship finished models. Instead, we ship artifacts of our work and solicit feedback from our stakeholders so we can make progress faster. Project managers might not like data science for the unpredictability of its progress, but iteration is not at fault; it is the only way forward.