Managing Big Data Projects in an Agile Way

As if Agile weren’t a big enough buzzword, let’s mix it with big data. Joking aside, working in an Agile way on big data projects is almost a must. Other approaches are available, of course, but Agile practices and techniques are much better suited to these sorts of projects.

We’re going to focus here on two sides of the story: why doing Agile is the right thing, and what the challenges are in doing Agile in big data projects.

Why Doing Agile is the Right Thing

To qualify as big, data has to share the following three characteristics:

Volume: usually measured in units like Petabytes (PB), Exabytes (EB) or even Zettabytes (ZB). And no, managing a few Terabytes (TB) is not really big data.

Velocity: data flows very fast, making it impossible to keep up without specialised tools or automation. Imagine hundreds of camera data streams flowing in real time.

Variety: consider all sorts of data: databases, emails, documents (all formats), voice or video recordings or even scanned paper notes.

If the word madness comes to your mind, don’t worry. It is equally challenging for everybody. And yet some people decided to take the jump and try to make some sense of all this big data.

Let’s see how Agile can help with Big Data, specifically when it comes to the three attributes mentioned above.


Volume
There are always a considerable number of unknowns and a lot of complexity when it comes to data volumes. The only way to deal with that is to break it down. Reduce the inventory, limit the WIP (work-in-progress), iterate; call it whatever you want, but you have to find a way to slice the work so you can manage it and understand it. Going for short iterations usually helps.

A big unknown is much harder to manage than a small unknown. Not to mention the obvious: if it’s small enough, it is less of an unknown. At the beginning of the iteration, spend some time trying to understand the data and to come up with a clear definition of done (DoD). Another thing that is very important when working with big data is choosing the right tools and technologies.

That’s why retrospectives are often a moment of reflection on how you can become more efficient by using better tools. And that obviously leads to frequent spikes (research, investigations) that might take quite a lot of time.


Velocity
No, not Sprint Velocity! We mean the incredible speed at which data flows into our systems. Just imagine a CCTV camera sending images that need to be captured, stored and analysed at very high speed. Having timely information is the whole point, isn’t it? And it’s not just data that’s flowing in but also… requirements!

The team will have to develop a special skill: getting requirements very fast, discussing and understanding them, and getting to work as soon as possible. Another key element is change. Not only are data and requirements flowing in, but so are changes: very frequent, very fast and maybe even conflicting change requests. All these arguments are mostly focused on the input, but let’s also talk about the output. It’s not just fast in but also fast out.

Whatever you get by analysing the data needs to be made available to your clients very fast. That’s why it is critical to have a very simple, clear and non-negotiable definition of done (DoD). In other environments you might still have time to discuss and debate; when dealing with data at this speed, that time just isn’t there anymore. Obviously, a good-quality output depends on a good-quality input. So, data quality and data-cleansing tasks should be considered either at the beginning, as part of the “definition of ready”, or as part of the “work in progress” (WIP).

Otherwise, garbage in, garbage out! Unfortunately, many teams don’t deal very well with time pressure, and they might end up compromising quality or, even worse, security.
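To make the “definition of ready” idea a bit more concrete, here is a minimal sketch of a data-quality gate a team might run before pulling a batch into an iteration. The field names (camera_id, timestamp, payload) and the 95% threshold are purely illustrative assumptions; a real team would agree on its own rules.

```python
from dataclasses import dataclass

# Hypothetical rule set: the field names and the 95% threshold are
# illustrative assumptions, not values taken from the article.
REQUIRED_FIELDS = ("camera_id", "timestamp", "payload")
MIN_VALID_RATIO = 0.95


@dataclass
class ReadinessReport:
    total: int
    valid: int

    @property
    def valid_ratio(self) -> float:
        # An empty batch is treated as not ready rather than as perfect.
        return self.valid / self.total if self.total else 0.0

    @property
    def ready(self) -> bool:
        return self.valid_ratio >= MIN_VALID_RATIO


def is_valid(record: dict) -> bool:
    """A record is valid when every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def check_ready(batch: list[dict]) -> ReadinessReport:
    """Count valid records so the team can decide whether to start work."""
    valid = sum(1 for record in batch if is_valid(record))
    return ReadinessReport(total=len(batch), valid=valid)


batch = [
    {"camera_id": "cam-1", "timestamp": 1700000000, "payload": "frame-a"},
    {"camera_id": "cam-2", "timestamp": None, "payload": "frame-b"},
    {"camera_id": "", "timestamp": 1700000002, "payload": "frame-c"},
    {"camera_id": "cam-3", "timestamp": 1700000003, "payload": "frame-d"},
]
report = check_ready(batch)
print(f"{report.valid}/{report.total} valid, ready={report.ready}")
```

In this example only two of the four records pass, so the batch would go back for cleansing instead of entering the iteration.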


Variety
Variety also brings a few very specific issues. Some of the data might be hard to understand. Having a truly cross-functional team should help here. Just imagine that some of the data is handwritten notes from a sales team. A sales consultant will probably be able to explain the specific jargon or acronyms.

One of the biggest issues is correctly identifying and managing correlations and external dependencies. Again, think about flow, a central idea of doing Agile. Your work flows through different operations and is eventually completed. Any external dependency will create a bottleneck, causing delays or blockages. So, securing the right sponsorship (and involving the sponsors in removing blockages) is extremely important, as is proper stakeholder management. Generally speaking, the flow will always depend on how you manage your external dependencies, which basically means stakeholder management in its purest form.

And it is not just the Product Owner (or equivalent role) who’s responsible for this but also the team. They need to understand that different people use different formats of data for their day-to-day jobs, and most of it is unstructured. It could be notes, emails or maybe even recordings of phone conversations. To put it differently, when you have to deal with the variety of data, you’ll never be bored.

Agile Challenges in Big Data Projects

Inventory
We all know inventory should be minimised, but quite often that’s very hard. Most teams complain about not being able to operate with a small inventory, which translates into long sprints or even longer releases; so long that you could even question the need for agility. Two things to consider here: how sliceable your product is and how sliceable the technology you’re using is.

Volume and velocity are usually what contribute the most to your inventory. The best way to deal with that is to reduce the volume you are operating with (limit the “work-in-progress”) and to achieve flow to keep velocity under control. Just remember, the more sliceable your product and technology are, the more agile you can be. We use the term “sliceable” as a more visual equivalent of “small inventory”.
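As a rough illustration of limiting work-in-progress, the sketch below models a single stage that only pulls new slices of work while it is under its limit. The class name, the stage name and the backlog items are hypothetical, and the limit of two is an arbitrary choice for the example.

```python
from collections import deque


class WipLimitedStage:
    """A work stage that refuses new items once its WIP limit is reached."""

    def __init__(self, name: str, wip_limit: int) -> None:
        if wip_limit < 1:
            raise ValueError("wip_limit must be at least 1")
        self.name = name
        self.wip_limit = wip_limit
        self.in_progress: list[str] = []

    def has_capacity(self) -> bool:
        return len(self.in_progress) < self.wip_limit

    def pull(self, backlog: deque) -> list[str]:
        """Pull items from the front of the backlog until the limit is hit."""
        pulled = []
        while backlog and self.has_capacity():
            item = backlog.popleft()
            self.in_progress.append(item)
            pulled.append(item)
        return pulled

    def finish(self, item: str) -> None:
        """Completing an item frees capacity for the next pull."""
        self.in_progress.remove(item)


backlog = deque(["ingest logs", "clean emails", "index scans", "tag videos"])
analysis = WipLimitedStage("analysis", wip_limit=2)

first = analysis.pull(backlog)      # takes two slices, leaves two waiting
analysis.finish("ingest logs")
second = analysis.pull(backlog)     # one slot freed, so one more slice moves
print(first, second, list(backlog))
```

The point of the design is that nothing new starts until something finishes, which keeps the inventory small no matter how large the backlog grows.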

Flow
We have already mentioned flow. Generally speaking, it is not an easy thing to achieve. It’s even harder when the velocity of your inputs is not constant or when you have to deal with lots of dependencies, both of which are very common in big data projects. Remember, “Mura” (unevenness) is one of the most damaging types of waste in lean theory.

A key objective for the team is to achieve evenness and then develop some agility to deal with all the highs and lows that will invariably occur. As far as dependencies are concerned, we already talked about stakeholder management and securing sponsorship as being very helpful.
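One possible way to make unevenness visible is to measure it. The sketch below uses the coefficient of variation (standard deviation divided by the mean) of work arriving per interval; this metric is our own suggestion rather than something prescribed by lean theory, and the sample numbers are invented.

```python
from statistics import mean, pstdev


def coefficient_of_variation(arrivals_per_interval: list[int]) -> float:
    """Population standard deviation divided by the mean; 0.0 means perfectly even.

    The input must be non-empty, because statistics.mean raises on empty data.
    """
    average = mean(arrivals_per_interval)
    if average == 0:
        return 0.0
    return pstdev(arrivals_per_interval) / average


# Invented sample data: the same 40 requests arriving over four intervals.
steady = [10, 10, 10, 10]
bursty = [0, 0, 40, 0]

steady_cv = coefficient_of_variation(steady)
bursty_cv = coefficient_of_variation(bursty)
print(f"steady: {steady_cv:.2f}, bursty: {bursty_cv:.2f}")
```

Both teams receive the same total amount of work, but the bursty one scores far higher, which is a hint that it needs either smoothing at the input or spare capacity to absorb the peaks.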

Cross-functional teams
We all know the theory and the obvious benefits of having such a team. In the context of a big data project that’s not so obvious. To be able to handle all the requests and perform all the operations, some level of specialisation will be needed. That means having as part of the team some experts whose job it is to just perform one specific task without being too concerned about the overall result. They don’t care about flow, user stories or any of the ceremonies of your framework.

Their only job is to do whatever they’re asked to do. It helps a lot if their task is described clearly so they can just do the work and not lose time on questions and discussions. Securing these resources should also be a top priority. If you need, let’s say, a security expert at a certain moment and they’re not available, that will cause a blockage in your flow.

Reviews and Demos
Have you ever seen a big data project demo? Few things are more boring than that! Not to mention that quite often there’s really not much to show. You could be spending a few weeks of work and at the end have nothing relevant to show. That’s the way some teams do it. Which is a mistake. Think of demos as validation of assumptions. One assumption might be speed.

You might think the data is displayed fast enough but your users might think differently. Also, think about data quality. Users will always find a way to mess up your data. You might as well learn that early enough. Another aspect which is quite valuable about demos is that even if you don’t get much feedback, the act of preparation and presentation will help the team to better understand the product.

It goes without saying that when you do such a demo, you have to have the right people in the audience. They might not be the typical users, but they should be people who understand what is shown and can give you some meaningful feedback.

Retrospectives
Considering how dynamic the big data field is in general, retros should be treated with the utmost importance. That is not so much because of traditional process improvement ideas but mostly because of the new tools and technologies that pop up and could potentially make our lives much easier. Considering that lots of these tools are open source, the pace of innovation here is very fast.

So keep an eye on what’s new and do a lot of spikes.

Non-functional stories
A not-so-cool part of doing big data is that most of the work will actually be on user stories that don’t deliver immediate or direct value. So be careful about how you consider and estimate those stories, and about how you demonstrate value or even do a demo to potential customers who might not be able to understand what they actually get.

Related to that, another topic that should be treated very carefully is backlog prioritisation and grooming. The Product Owner (or equivalent role) should demonstrate business acumen, an understanding of the big data environment and technological savviness. Yes, a cool but tough job!

Technical debt
Since we mentioned technology, another problem that tends to pop up a lot is technical debt. We all want results immediately, and because of that technical teams might be tempted (or forced!) to create a fast solution that later requires much more work or might even break. Again, with proper stakeholder management, an empowered Product Owner should be able to make the right decisions.

Most people would jump to Scrum immediately, but this kind of project calls for a deliberate decision about which methodology is appropriate. Some people go for iterations (and use Scrum or something related); others choose flow (maybe Kanban). Some important considerations here: do you focus more on the product or on the process? Is the product sliceable? Can you do a demo? Do you have to? Are you under market pressure to deliver fast?

A project is supposed to have a clear vision with a clear outcome and benefit. Except sometimes, in this field, you don’t really know what you’re going to get until you actually do some work. Depending on the data you have available and how it’s processed and interpreted, the end result might look very different. Ensuring people have the right expectations is crucial to the success of such a project.

Big data projects tend to deal with much more complexity and many more unknowns than other projects, which is why securing the right sponsorship is very important. In addition to the typical duties of a sponsor, other aspects might be equally important here: dynamic funding of the project, reconciliation, killing the project, special empowerment, domain understanding, commitment, etc.

It’s obvious from the elements mentioned above that big data projects are somewhat special, and that doing Agile here might be a bit different from other projects. But it should be equally obvious that if you have a big data project, Agile is the right way to go.

© Rolf Consulting