cybernetics and today
2015-07-31
Regulation, resilience, resistance - can the system you control handle all of these? How do you design your software system to encompass them? It's something I've been thinking about for a long time. Cybernetics was an attempt, from the early 1950s into the 1960s[1, 2], to address these questions. Correctness in a software module has this model:
correctness(Module) -> Boolean
In a larger program, the equation looks like this:
correctness(BigProgram) -> And(correctness(Module))
But when interconnected with other programs (or a sufficiently large single program), the equation is similar to this:
Sum(correctness(Programs)) / Count(Programs)
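The averaged form can be sketched in a few lines of Python. This is only an illustration: the function names `system_correctness` and `within_tolerance`, and the idea of comparing the result against a numeric `tolerance`, are assumptions made for this sketch rather than anything standard.

```python
from typing import Iterable


def system_correctness(program_results: Iterable[bool]) -> float:
    """Sum(correctness(Programs)) / Count(Programs), as a value in [0.0, 1.0]."""
    results = list(program_results)
    if not results:
        raise ValueError("need at least one program")
    # True counts as 1 and False as 0, so the sum is the number of correct programs.
    return sum(results) / len(results)


def within_tolerance(program_results: Iterable[bool], tolerance: float) -> bool:
    """Hypothetical check: is the system at least `tolerance` correct?"""
    return system_correctness(program_results) >= tolerance


# Four programs, one of which is currently misbehaving.
results = [True, True, False, True]
print(system_correctness(results))     # 0.75
print(within_tolerance(results, 0.7))  # True
print(within_tolerance(results, 0.9))  # False
```

Each individual program still reports a Boolean, but the system-level answer is a fraction that can sit anywhere between 0 and 1.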
The effect is that as the number of programs in the system increases, the correctness of the total system becomes closer to analog than digital. Given that all programs have bugs, we can make a general statement: your large-scale system is partially correct (not news to anyone). The question then becomes: how tolerant of error can you be, and if the state of the system moves beyond that error limit, how do you bring it back in line?
One model that has particularly struck me as instructive is the Viable System Model (reference 1 above), as it neatly lays out the different command, control, and monitoring portions of a system. Let's go through the VSM and discuss how we might apply it to a modern distributed software/hardware cloud system. Note that, after the fashion of the times, each system has a number, not a name. Quoting from Wikipedia:
System 1 in a viable system contains several primary activities. Each System 1 primary activity is itself a viable system due to the recursive nature of systems as described above. These are concerned with performing a function that implements at least part of the key transformation of the organization.
System 2 represents the information channels and bodies that allow the primary activities in System 1 to communicate between each other and which allow System 3 to monitor and co-ordinate the activities within System 1. Represents the scheduling function of shared resources to be used by System 1.
System 3 represents the structures and controls that are put into place to establish the rules, resources, rights and responsibilities of System 1 and to provide an interface with Systems 4/5. Represents the big picture view of the processes inside of System 1.
System 4 – The bodies that make up System 4 are responsible for looking outwards to the environment to monitor how the organization needs to adapt to remain viable.
System 5 is responsible for policy decisions within the organization as a whole to balance demands from different parts of the organization and steer the organization as a whole.
Take a little time to digest these concepts and their formulation. System 1 is concerned with the key concrete aspects of the system: the worker nodes, combined with the actual program code. System 2 can be reified as RabbitMQ communications, files, or APIs. System 3 is your system map and the control systems that keep your systems online; Jenkins, Chef, etc., form System 3 in the cloud, along with autoscaling events. System 4 maps to the monitoring system: Riemann, Prometheus, New Relic, etc. These feed into System 3 to describe how the world works. System 5 corresponds to your SREs and other people serving as policy executives.
Note that a key aspect of this system is that, as described, it is self-sustaining and self-correcting: resilient. While the VSM was developed to describe organizational approaches, it also maps cleanly onto a distributed system infrastructure.
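To make the mapping a little more concrete, here is a toy feedback loop in Python. It is a sketch under loud assumptions: `Worker`, `monitor`, and `regulate` are hypothetical names invented for this example, the "restart" is just a flag flip standing in for whatever your real tooling does, and plain function calls stand in for the communication channels of System 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    """System 1: a node doing the actual work (hypothetical stand-in)."""
    name: str
    healthy: bool = True


def monitor(workers: List[Worker]) -> float:
    """Monitoring role (System 4 in the mapping above): fraction of healthy workers."""
    if not workers:
        raise ValueError("need at least one worker")
    return sum(w.healthy for w in workers) / len(workers)


def regulate(workers: List[Worker], min_healthy: float) -> List[str]:
    """Control role (System 3): act only when monitoring reports a breach.

    `min_healthy` is the policy threshold, chosen by people (System 5).
    Returns the names of the workers that were restarted.
    """
    if monitor(workers) >= min_healthy:
        return []
    restarted = []
    for w in workers:
        if not w.healthy:
            w.healthy = True  # stand-in for a real restart or replacement
            restarted.append(w.name)
    return restarted


workers = [Worker("a"), Worker("b", healthy=False),
           Worker("c"), Worker("d", healthy=False)]
print(monitor(workers))         # 0.5
print(regulate(workers, 0.75))  # ['b', 'd']
print(monitor(workers))         # 1.0
```

The point of the sketch is the loop itself: the system tolerates some failure, and only when the measured state drops below the policy limit does the control layer step in to pull it back.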