It’s bigger on the inside. Some thoughts and observations on high-load software development

Today let's talk about software development! Yeah, that subject sounds very extensive, so let's narrow it down a bit. Let's talk about high-load, business-critical software development. I develop web services, so let's talk about those.

What makes a business-critical high-load service? Clever design, neat algorithms, high-quality, high-performance, highly optimized code, micro-benchmarks, and (preferably) lots of unit and integration tests. Is that it? Can you implement all of those and just deploy to production?

The first answer would be: yes, you can. You are sure that you've made no critical mistakes, communicated with all the domains, and integrated your service into the company infrastructure and the overall system architecture, so what can possibly go wrong? Well, as always, it depends on exactly how high your high load is.

For illustrative purposes, let’s add some assumptions.

Let's say that all our services are built from the same technological building blocks that the company uses for the sake of development unification and simplification, i.e. the same libraries, frameworks, database systems, etc.

Then let's take an average load scenario for the simplified system we are building the service for. Say the average load during standard business-logic execution is X RPS, all the services are connected in a chain, and each service calls the next one exactly once.

So every service along the execution path should withstand at most X RPS. If a new service is designed and developed with N RPS in mind, we have four cases.

  1. N < X
  2. N = X
  3. N > X
  4. N >> X

With cases 1 and 2, we've already developed a handful of services for such a load, we know how our system behaves under it, and we are pretty much sure that everything will hold together, since all the building blocks have already been working under the same load scenarios.

The third case is a bit trickier – we need to be sure that the new service holds the load. This means an additional testing stage should be added to the development pipeline: load testing. That stage should be common to all the services, so if the load is not much higher than X – just test it, and we are good.
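For illustration, here is what a minimal scenario for such a stage might look like. This is a sketch built on Locust, a popular Python load-testing framework; the host and endpoint are hypothetical placeholders, not anything from a real system:

```python
# A minimal Locust scenario: each simulated user repeatedly calls one
# endpoint. Run headless with, e.g.:
#   locust -f loadtest.py --headless -u 100 -r 10
from locust import HttpUser, task, between

class OrderServiceUser(HttpUser):
    host = "http://service.internal"  # hypothetical service under test
    wait_time = between(0.1, 0.5)     # per-user pause between tasks

    @task
    def get_order(self):
        # Hypothetical endpoint; swap in the real one.
        self.client.get("/api/orders/42")
```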

The fourth one is even more complex – the load is so high that none of our existing systems has been tested against that many RPS, and maybe we can't even generate that much load using our conventional load-testing tools. Let's call this case the extreme one; it will be the context for the rest of the narration.

Now we are facing the problem of generating the load itself, which is a little bit off the beaten software-engineering track. How do we generate load that has never been generated before?

First, we take our day-to-day load-testing equipment and try to push it to the limit. If the desired load can be reached – hooray – we are lucky. If it can't, though, that's where things get hairy.
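If the day-to-day tooling tops out, one way to squeeze more RPS out of a single generator box is a thin asynchronous client instead of thread-per-request tooling. A rough sketch with Python's aiohttp; the target URL and the concurrency numbers are assumptions, and the real ceiling depends on cores, sockets, and bandwidth:

```python
# A bare-bones async load generator: N concurrent workers hammering one URL.
# Trades Locust's ergonomics for raw single-machine throughput.
import asyncio
import aiohttp

TARGET = "http://service.internal/api/orders/42"  # hypothetical target
CONCURRENCY = 500        # in-flight requests at any moment (assumed)
TOTAL_REQUESTS = 100_000

async def worker(session: aiohttp.ClientSession, counter: list) -> None:
    while counter[0] < TOTAL_REQUESTS:
        counter[0] += 1
        async with session.get(TARGET) as resp:
            await resp.read()  # drain the body so the connection is reused

async def main() -> None:
    connector = aiohttp.TCPConnector(limit=CONCURRENCY)
    async with aiohttp.ClientSession(connector=connector) as session:
        counter = [0]  # shared counter; safe, asyncio runs on one thread
        await asyncio.gather(*(worker(session, counter) for _ in range(CONCURRENCY)))

asyncio.run(main())
```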

The problem of load testing applications is itself a very deep and complicated one, but I'll try to bring it down to some key points.

  • Tools – the applications we use for load generation
  • Assertions – how we decide whether a load test yielded positive or negative results (a sketch follows this list)
  • Resources – VMs, network, and other infrastructure utilized by the load-testing tools
  • Data – the actual requests we should use to create the load
  • Insights – the logs and monitoring our load-test subject is instrumented with
  • Scenarios – load-generation rules for the tools
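To make the assertions point concrete: one common way to grade a run is to compare latency percentiles and error rates against a service-level objective. A minimal sketch in plain Python; the SLO numbers here are purely illustrative:

```python
# Grade a load-test run: fail if the 99th-percentile latency or the error
# rate exceeds the (illustrative) service-level objectives.
import statistics

P99_SLO_MS = 200.0       # illustrative latency SLO, not a real requirement
ERROR_RATE_SLO = 0.001   # at most 0.1% failed requests (also illustrative)

def assess(latencies_ms: list[float], errors: int, total: int) -> bool:
    # quantiles(n=100) yields the 1st..99th percentiles; index -1 is p99.
    p99 = statistics.quantiles(latencies_ms, n=100)[-1]
    error_rate = errors / total
    print(f"p99={p99:.1f} ms, error rate={error_rate:.4%}")
    return p99 <= P99_SLO_MS and error_rate <= ERROR_RATE_SLO
```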

Where do these points stand in our extreme case?

  • Tools – the load generators might simply be unable to produce the required RPS.
  • Assertions – our pass/fail criteria must now be based on cases we have never observed before.
  • Resources – they might get exhausted: we can run out of VM cores or network bandwidth, or bump into load-balancing issues.
  • Data – it still has to be representative, i.e. we can't reuse the same parameters while generating hundreds of thousands of requests (see the sketch after this list).
  • Insights – the monitoring infrastructure might get overwhelmed by the flood of logs and metrics.
  • Scenarios – if they are code-based, which they usually are, they might require refactoring.
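As for representative data, the usual first step is to parameterize the requests so that hundreds of thousands of them don't collapse onto a single hot key or cached response. A minimal sketch; the endpoint and the size of the ID range are assumptions for illustration:

```python
# Spread generated requests over many distinct keys so the load doesn't
# degenerate into hammering one cached row; IDs and endpoint are made up.
import random

USER_IDS = range(1, 1_000_000)  # assumed size of the user population

def next_request_path() -> str:
    user_id = random.choice(USER_IDS)
    return f"/api/users/{user_id}/orders"
```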

As we can see, oddly enough, our extreme case brings up the problem of load testing the load-testing process and infrastructure itself.

Now, where does our fellow developer stand here? None of the points that make up the load-testing process can (or at least should) be handled without direct developer involvement, so the developer becomes something bigger than a mere code-writing, algorithm-savvy tech nerd. She becomes, at least a little bit, an infrastructure, data, network, and monitoring specialist. That's where the programmer becomes a software engineer.

And ideally, we should all invest our time in understanding subjects like those listed above.

As we can see, it’s bigger on the inside.
