The tyranny of the feedback loop
Intro
Any software product is a projection of the ideas of product owners, designers, developers, and other team members into reality. As we've discussed in the previous article, the projection can have a very different efficiency depending on the approach taken.
Before we begin: if you like my posts, please consider subscribing, and if you have a comment, please let me know! These two channels help me understand whether there is interest in writing like this.
Let's take a different angle today: can this projection have an impact on the organization's processes and on the organization itself and, if so, how?
It's not a lame question, because whenever team members come up with tech improvements and have a hard time defending their idea, it's often because there is no explicit concept to attach these improvements to. While there is a lot of discussion about bigger processes in a team like scrum or kanban, the smaller ones are often overlooked or are perceived as a fact of life and tend to develop organically depending on the team dynamics and experience. While being small, these processes can change the team and organization dynamics on a big scale.
To understand if that's true we need to define and decompose the process of product development and examine its parts.
Let's begin by defining the concepts of a context switch and a feedback loop.
A context switch is a period of time when an individual is planning to work on a different task.
A context switch is a time when we're the most vulnerable to distractions like checking emails or going for yet another cup of coffee.
A feedback loop is a very simple concept since it consists of just two steps run on repeat:
Make a change
Observe the effect
Any human activity can be defined in terms of feedback loops, be it developing a rocket or getting better at making a sandwich. Usually, the shorter the feedback loop the better.
Every feedback loop has to be executed at least once, and any imperfections add up quickly with each repetition, which means that the number of iterations of a given feedback loop is as important as the duration of an individual iteration.
Product development can be defined as a sequence of potentially nested feedback loops with context switches sprinkled in between. Neither feedback loops nor context switches happen in isolation; there is a complex interplay between them and other time-bound activities in the company. As a result, the chances of an action getting completed go down if the (perceived) time needed to take the action is bigger than the time between external interruptions.1
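To put rough numbers on that last claim, here is a minimal sketch. It assumes, purely for illustration, that interruptions arrive at random with a known average gap (a Poisson process); the function name and the figures are made up and not measured anywhere.

```python
import math


def chance_of_uninterrupted_block(task_minutes: float, mean_gap_minutes: float) -> float:
    """Probability that no interruption arrives while the task is in progress.

    Assumption: interruptions follow a Poisson process, so the gap between
    them is exponentially distributed with the given mean.
    """
    return math.exp(-task_minutes / mean_gap_minutes)


# Hypothetical team where someone pokes you once an hour on average.
for task_minutes in (15, 60, 240):
    chance = chance_of_uninterrupted_block(task_minutes, mean_gap_minutes=60)
    print(f"{task_minutes:>3} min task: {chance:.0%} chance of finishing in one go")
```

Under these assumptions a task that needs four quiet hours almost never gets them, which matches the intuition above: the longer the perceived action, the less likely it is to happen at all.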
All this explains why certain tasks simply never get through. People like to get things done in general and will automatically prioritize the actions they have time for.
Product development loop
Any big system can be extended in many different directions and all the efforts are usually split into tasks to introduce some structure to the development, prioritize certain tasks above others, and distribute them across the team. Implementing the tasks one after another makes a product development loop.
The usual steps for a single developer are:
Pick up a task
Implement
Get it live
It's not a feedback loop; however, it's still the loop that matters most to the team, since they usually want to churn out as many tasks as possible to push the product forward. The actual change is the only reason for the task's existence; everything else is overhead. The implementation step is also the most visible to management.
A fast product development loop allows a compact team to build a lot. Teams with slow product development loops may push management to abandon promising products or hire more developers to "fix" the velocity problem, which means more managers, more office space, often even slower velocity and, of course, less runway.
Let's go through the steps one by one.
Pick up a new task
Picking up a new task is one giant context switch: a person has to not only come up with a solution but, first of all, figure out the actual requirements. That is very often not trivial, may require several iterations to get right, and often slips into the implementation step.
Vaguely described tasks can make this step arbitrarily long.2
Implement
The implementation step is the core of any task and is sometimes the quickest one. We can apply the concept of a feedback loop there. For any given feature you will need to:
Replicate the state, observe the old behavior
Make a change
Replicate the state, observe the new behavior
Product decisions have a direct impact on the steps above, with all the consequences for the product development loop. In many situations, the change itself takes almost negligible time compared to the rest; we will cover some of these situations below.
In an ideal situation, no time is spent on anything except the actual change3; everything else goes downhill from there. There are only so many hours in a work day, and if one iteration of this loop takes one hour, a developer can do at most eight iterations a day, probably fewer because of all the distractions.
One may argue that if one has to wait an hour between making a change and observing the results, that's a perfect time to do something else. While we all like to multitask from time to time and some hacks make the issue slightly less painful4, the fact is that most people usually suck at doing that. In this particular case, the observation won't happen after exactly one hour because of various distractions, and none of the tasks run in parallel will receive full attention, which in turn increases the probability of bugs and stretches the feedback loops even more.
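The arithmetic above is simple enough to write down. This is only a back-of-the-envelope sketch: the eight-hour day comes from the text, while the per-iteration distraction overhead is an assumed parameter added for illustration.

```python
def iterations_per_day(loop_minutes: float,
                       workday_hours: float = 8.0,
                       distraction_minutes: float = 0.0) -> int:
    """Whole feedback-loop iterations that fit into one work day.

    distraction_minutes is a hypothetical overhead added to every iteration
    (chat, coffee, getting back into the task).
    """
    available_minutes = workday_hours * 60
    return int(available_minutes // (loop_minutes + distraction_minutes))


print(iterations_per_day(60))                          # one-hour loop, no distractions
print(iterations_per_day(60, distraction_minutes=15))  # same loop with distractions
print(iterations_per_day(1))                           # one-minute loop
```

The point is the ratio rather than the exact numbers: shrinking the loop from an hour to a minute changes the number of daily attempts by two orders of magnitude.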
Get it live
No change is done until it takes effect somewhere. While this step ideally should take no time from the team, getting changes live is in practice yet another feedback loop in action. Bugs happen, deploys fail, and the product change is done when it's visible to customers and does what's expected.
What's the deal there? If the operation is slow or unreliable (see the section below), developers tend to avoid it and focus on writing more code instead, even though that code eventually has to go live the same way. As a result, the frequency of deploys decreases and the amount of code per deploy increases, which leads to an even bigger chance of errors, even slower deploys, and a downward spiral. Once this part is stuck, product development hardly moves either, with all the consequences for the product development loop.
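The spiral can be illustrated with a toy model. It assumes that every change in a batch has the same, independent chance of breaking the deploy, which is a simplification; the 5% figure is invented for the example.

```python
def batch_failure_chance(changes_in_batch: int, per_change_failure: float) -> float:
    """Chance that at least one change in the batch breaks the deploy.

    Assumption: changes fail independently with the same probability.
    """
    return 1 - (1 - per_change_failure) ** changes_in_batch


# Hypothetical 5% chance that any single change breaks something.
for batch_size in (1, 10, 30):
    chance = batch_failure_chance(batch_size, per_change_failure=0.05)
    print(f"{batch_size:>2} changes per deploy: {chance:.0%} chance of a failed deploy")
```

Even with a modest per-change risk, a large batch fails more often than not, and a failed large batch is also harder to debug, which is exactly what pushes the next deploy even further away.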
Let's get to some concrete areas that sit in the loops and can be affected by product and technical decisions.
Examples
Feat environment
In the majority of situations, software systems are developed in two environments: feat and production. More advanced setups also have a staging one.
The difference is that the feat environment has a completely separate set of data sources and no changes there have any chance of affecting real customers.
Both staging and live environments operate on production data, with the difference that staging does not serve real customers by default; the rest is identical.
The feat environment is used during feature development and is often key to the velocity of the development feedback loop. A separate data source allows us to replicate the state as many times as needed, and no bug there leads to downtime.
While state replication is the topic of the next section, a broken feat environment forces developers to test changes on live, and since some changes are just too sensitive or too hard to test there, it's quite possible to end up in a surprising scenario where changes are made blindly5.
Code that has not been executed is most probably broken. Usually, it's enough to get burned just a few times to make the whole area almost untouchable, with direct consequences for the product itself. Both developers and product managers will try hard to avoid doing anything there, regardless of its importance to the business.
While read-only changes can be tested against production data with limited risk, the risk is still there, because a lot of apps write even on reads (watch history on YouTube, for example). You also probably never want to log in on behalf of a user without their direct consent, since that can have legal consequences.
The working feat environment is never a given, but rather a continuous effort to make every change work both on feat and on live. Any product changes that break the feat environment will stretch the feedback loops.
Data
A product is not merely a bunch of code, it's also a pile of data of different proportions. If you have a system you have some actors there as well. In the case of a hotel booking website, it could be a customer, an accommodation, an internal user, a hotel employee, and all the entities they produce - messages, bookings, etc. Every entity has a bunch of attributes: a hotel has an address, an employee has an email and a customer can have a shiny golden badge given to them because of reasons. Most of the business logic is defined in terms of entities and their interactions, like
Give a golden badge to a customer with more than 100 bookings
Send a welcome email after successful signup
Disable customer after a failed payment
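To make the idea of "business logic in terms of entities" concrete, here is a minimal sketch of the three rules above. The `Customer` and `Outbox` types, their fields, and the event function names are all hypothetical and not taken from any real booking system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Customer:
    email: str
    bookings: int = 0
    active: bool = True
    golden_badge: bool = False


@dataclass
class Outbox:
    """Stand-in for an email service; it only records what would be sent."""
    sent: List[str] = field(default_factory=list)


def on_booking_completed(customer: Customer) -> None:
    # Rule: give a golden badge to a customer with more than 100 bookings.
    customer.bookings += 1
    if customer.bookings > 100:
        customer.golden_badge = True


def on_signup(customer: Customer, outbox: Outbox) -> None:
    # Rule: send a welcome email after successful signup.
    outbox.sent.append(f"welcome:{customer.email}")


def on_payment_failed(customer: Customer) -> None:
    # Rule: disable the customer after a failed payment.
    customer.active = False


guest = Customer("guest@example.com", bookings=100)
on_booking_completed(guest)
print(guest.golden_badge)
```

Note that even this tiny example needs a customer with exactly 100 bookings before the badge rule can be observed, which is the data problem discussed next.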
Any big system is built through iterations by many teams that fix or enhance it, and the data tends to grow and take shape organically unless specifically designed.
There are a few conclusions to draw here:
data does not come out of nowhere, it has to be generated, either automatically or manually;
to test any given feature one needs to get the data to a state that would trigger the desired code;
it should be possible to understand whether the system is in the desired state or how to get it there;
in many situations data comes from multiple sources.
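One common way to act on these points is a small data factory plus a single place to inspect the resulting state. The sketch below is an assumption about how that could look; `make_customer`, `describe_state`, and the entity fields are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Booking:
    hotel: str
    paid: bool = True


@dataclass
class Customer:
    email: str
    bookings: List[Booking] = field(default_factory=list)


def make_customer(email: str = "test@example.com",
                  booking_count: int = 0,
                  hotel: str = "Test Hotel") -> Customer:
    """Generate a customer that is already in the state a feature needs."""
    return Customer(email=email,
                    bookings=[Booking(hotel) for _ in range(booking_count)])


def describe_state(customer: Customer) -> Dict[str, object]:
    """Collect everything relevant in one place instead of hunting for it."""
    return {
        "email": customer.email,
        "bookings": len(customer.bookings),
        "unpaid": sum(1 for booking in customer.bookings if not booking.paid),
    }


# One call away from the state that triggers the golden badge code path.
almost_golden = make_customer(booking_count=100)
print(describe_state(almost_golden))
```

With helpers like these, replicating a scenario costs one function call instead of a hundred manual bookings, and checking the state does not require knowing where each value is stored.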
Unless specifically acted upon, the complexity of reproducing a scenario increases with the amount of data that needs to be generated. Complexity also increases when the data is hard to observe: it is one thing to be able to check for a badge in one place and another to hunt through several data sources for a random-looking column holding the value.
Product decisions are often made in the MVP style at the cost of data observability or proper structure, with the hope that the proper place will be found later. Unless specifically planned for, this moment usually never comes and any further development has to rely on hidden knowledge about the implementation. In other cases, the only way to observe the data or execute the code is at the very end of a very long workflow, which makes it hard to understand if the state is correct.
There is also an aspect related to the feat environment: the live environment always has more data, and that data is more varied. This is helpful at times because instead of replicating data on feat from scratch one can find the required state on live and use it as a reference. On the other hand, if the state is impossible to replicate, the feat environment becomes useless and as good as nonexistent.
Opaque and hard-to-replicate data has a direct impact on development feedback loops, specifically on the observation step.
Invariants
An invariant is a statement that always holds true.
As an example, we can imagine a system where an active user always has a verified email.
This invariant means that:
We have high confidence that the email address belongs to this particular user
Putting deliverability issues aside we can be confident that we can reach the user by email.
Invariants are indispensable because they are the building blocks for any new feature. The fact is that for any product or software system, all invariants are artificially made up to fit the use case. What that means is that any decision should be made in a way to keep invariants true or the whole system should be updated to reflect the new reality, otherwise, bad things happen.
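A minimal sketch of what keeping the example invariant true could look like in code. The `User` class, the `activate` method, and the audit helper are assumptions made for illustration, not a description of any particular system.

```python
from dataclasses import dataclass
from typing import Iterable, List


class InvariantViolation(Exception):
    """Raised when an operation would break a system invariant."""


@dataclass
class User:
    email: str
    email_verified: bool = False
    active: bool = False

    def activate(self) -> None:
        # Invariant: an active user always has a verified email.
        if not self.email_verified:
            raise InvariantViolation("an active user must have a verified email")
        self.active = True


def find_violations(users: Iterable[User]) -> List[User]:
    """Audit existing data: users that are active without a verified email."""
    return [user for user in users if user.active and not user.email_verified]


user = User("guest@example.com")
try:
    user.activate()
except InvariantViolation as error:
    print("refused:", error)

user.email_verified = True
user.activate()
print(user.active)
```

The guard protects new writes, while the audit function checks the data that already exists; a product decision that relaxes the rule has to update both, along with every feature that relied on it.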
Let's say we've decided that email verification is too much of a hassle for users and made it optional. The direct consequence would be users with fake addresses or addresses they don't own and that not only reduces the chance for an email to be delivered but also compromises data, since some of the sensitive emails might be sent to random destinations.
Broken invariants also have direct implications for development feedback loops, since developers now have to spend much more time trying to find stable ground to build the feature on. They affect company operations in general as well, because the data is no longer trusted.
Shared resources
A shared resource in development is something that's used by more than one developer at the same time. A database is usually a shared resource, for example, but there can be many more. A shared resource becomes a point of contention as soon as more than one developer makes changes to it.
A realistic example would be a shared search service used by a web app in development. In case developers depend on the search for the app development and the search team deploys an unlucky change that breaks the search, all developers working on the app are immediately blocked till the problem is resolved.
A deploy target is a shared resource as well. Let's say two teams work on different parts of the same app and one of them merges a bad change. Once that happens, neither team can get its changes live until the problem is resolved.
Shared resources also suffer from the tragedy of the commons. In the example above both teams deploy the same web app, but even though a stable deployment pipeline benefits both of them, it's not a priority for either of them. Moreover, if the deployment is painful enough, there is an even bigger urge to get a free ride by waiting for the other team to get changes live, and that stretches development times for both teams.
The introduction of any shared resources has a direct impact on the development feedback loops.
Local dev environment
The local dev environment is the set of steps and tools needed to run the product locally6. The local environment is really important: not only is it much faster to iterate there, but it also lets developers skip the deployment step and avoid potential shared-resource problems.
Any changes that are not possible to run locally immediately increase the time needed to do and test a change.
The quality of the local environment also matters a lot. A slow, unreliable, or hard-to-set-up environment kills productivity.
Deployments
A deployment is the process of putting your changes into a production environment, where the code will be executed for real customers. It sounds simple, but since a deployment is usually required to complete any task, its properties matter. There are a few aspects to deployments:
How long does it take to deploy new code?
How long does it take to spot the errors?
What are the chances of success?
Is it possible to roll back quickly?
Technically, the process of getting something live consists of the first two points, but the last two can make deploys very problematic.
A quick deployment with immediate feedback can afford a low success ratio, and a bullet-proof deployment can take its time. The worst case is, of course, a deployment that takes ages to complete, takes even longer to reveal an error, and is impossible to roll back.
The product development loop is meaningless without changes hitting live, and bad deployments will definitely affect it. Nobody dreams of having bad deployments, so let's briefly look at the possible reasons for ending up with them.
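The trade-off between the four aspects can be sketched as an expected time to get one change live. The model below is an assumption: every attempt succeeds independently with the same probability, each attempt costs the deploy time plus the time needed to confirm or spot an error, and every failed attempt also costs a rollback. All numbers are invented.

```python
def expected_minutes_to_live(deploy_minutes: float,
                             detect_minutes: float,
                             success_chance: float,
                             rollback_minutes: float) -> float:
    """Expected time until a change is live and confirmed working."""
    if not 0 < success_chance <= 1:
        raise ValueError("success_chance must be in (0, 1]")
    attempts = 1 / success_chance   # mean of a geometric distribution
    failures = attempts - 1         # every attempt but the last one fails
    return attempts * (deploy_minutes + detect_minutes) + failures * rollback_minutes


# Hypothetical deployment profiles.
profiles = {
    "quick but flaky": expected_minutes_to_live(2, 1, 0.5, 1),
    "slow but bullet-proof": expected_minutes_to_live(45, 5, 1.0, 0),
    "slow, opaque, and flaky": expected_minutes_to_live(60, 120, 0.5, 60),
}
for name, minutes in profiles.items():
    print(f"{name}: about {minutes:.0f} minutes")
```

Under these assumptions the flaky but fast pipeline still wins by a wide margin, while the last profile shows how slowness, late error detection, and expensive rollbacks multiply each other.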
Technical decisions
There are ways to influence the chances of success from the technical side. Any static checks and tests help here. It may not be practical to cover the whole codebase with tests, but any complex business logic with lots of arbitrary rules should be covered; otherwise, it'll break with every new condition added.
Typed languages help here too, since type information conveys at least part of the information about the code and allows catching silly mistakes.
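As an illustration of covering rule-heavy logic, here is a small table of cases for a badge rule. The extra conditions (the customer must be active and have no failed payments) are made up to show how arbitrary rules pile up; they are not part of any rule stated earlier.

```python
from typing import List, NamedTuple, Tuple


class Customer(NamedTuple):
    bookings: int
    active: bool
    failed_payments: int


def earns_golden_badge(customer: Customer) -> bool:
    """Hypothetical rule: over 100 bookings, active, and no failed payments."""
    return (customer.bookings > 100
            and customer.active
            and customer.failed_payments == 0)


# One row per condition: adding a new rule means adding a new row.
CASES: List[Tuple[Customer, bool]] = [
    (Customer(bookings=101, active=True, failed_payments=0), True),
    (Customer(bookings=100, active=True, failed_payments=0), False),
    (Customer(bookings=500, active=False, failed_payments=0), False),
    (Customer(bookings=500, active=True, failed_payments=1), False),
]


def run_cases() -> int:
    for customer, expected in CASES:
        assert earns_golden_badge(customer) is expected, customer
    return len(CASES)


print(run_cases(), "cases passed")
```

A table like this runs in milliseconds on a laptop, so a broken condition is caught long before a deployment is involved.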
Apart from that there are linters for every language and they're instrumental for spotting the usual foot guns that are easy to miss otherwise.
With the growth of the team the benefits of this tooling only increase since more code and more developers new to the company inevitably mean more mistakes.
Any decision that requires the code to be deployed just to find out whether it works harms deployments.
Product decisions
In the context of deployments, product decisions can have a profound effect as well, specifically on the question of the time needed to spot the errors.
Whenever a feature is not implemented with testing in mind and the data is hard to replicate (see above), it's hard to understand whether it is broken or not. Every new feature that falls under this scenario makes the chances of a successful deployment a bit smaller, and with time the product can degenerate into an untestable beast that's close to impossible to deploy.
Long feedback loops
Can it happen that a long feedback loop is a good thing? Yeah, of course!
If an action is quick to do, it will be done more often, and in some situations that's not desirable. For example:
after several failed attempts to enter the PIN code, iOS inserts an increasing artificial delay to make brute-force attempts to unlock the device impractical;
if the price of an action is high, you may want to add more steps before it to catch as many errors as possible, trading feedback speed for cost savings.
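The first example can be sketched as a deliberately stretched feedback loop. The schedule below (three free attempts, then a delay that doubles up to a cap) is an assumption chosen for illustration and is not the actual iOS schedule.

```python
def lockout_delay_seconds(failed_attempts: int,
                          free_attempts: int = 3,
                          base_delay: int = 60,
                          max_delay: int = 3600) -> int:
    """Artificial delay before the next attempt is allowed.

    All parameters are hypothetical: the first few failures are free,
    after that the delay doubles with every failure up to a cap.
    """
    if failed_attempts <= free_attempts:
        return 0
    delay = base_delay * 2 ** (failed_attempts - free_attempts - 1)
    return min(delay, max_delay)


for attempts in (1, 4, 5, 6, 10):
    print(attempts, "failed attempts ->", lockout_delay_seconds(attempts), "seconds")
```

A legitimate user who mistypes once or twice never notices the delay, while an attacker iterating through every possible code sees the feedback loop stretch with each guess.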
Outro
You may wonder why we didn't use the term "tech debt" here. I think it's not insightful enough to explain why one should act on it. Everybody knows that killing tech debt is generally a good thing, but what exactly will become better because of that? The answer is the feedback loops in your team. Also, if some tech debt sits outside of them, fixing it will not allow the team to implement changes faster, although it may address deficiencies somewhere else.7
Product development is a team sport and developers are as responsible for the way the systems evolve by implementing changes as product managers or designers are by requesting them. The awareness about feedback loops is another tool in the toolbox to keep the team running smoothly since even though feedback loops are a fact of life, their state is a consequence of specific decisions.
One of the most efficient ways to speed up the product development loop is to only work on tasks that matter for the product; the hardest bit, of course, is to identify them among all the possibilities.
Big thanks to Pavel, Oleg, Artur, and Jacob for the valuable feedback on this article!
1. Classic case: remember a task you wanted to finish for weeks but were not able to because of external events: people poking you in the chat, meetings, company events, etc.
2. The tasks that show the worst performance in this sense are the ones that are not only vaguely defined ("make invitation email better"), but also require multiple decisions just to get going ("send a weekly reminder email to bad customers only": when should we send it? Who's a bad customer? What should the reminder consist of?) and, in the worst case, communication with an especially hard-to-reach external person.
3. E.g. there is a URL to a web page that a developer can load to see the old behavior, or a unit test that can be run, which means that testing the change is as simple as reloading the page in the local dev environment.
4. Or give us an excuse to give in and not do anything to reduce the number of interruptions.
5. No one wants to drain their real credit card to make test charges again and again.
6. The actual location can be remote; the important bit is that it's an environment used exclusively by a single developer.
7. E.g. reduce the number of manual changes needed, fix noisy alarms, etc.