When applications grow, it may become necessary to refactor your logic or data structures.
In a distributed system like Darlean, which provides zero-downtime state and logic migration, special care must be taken to ensure that old and new code (which can be active at the same moment) keep working together as intended, both during and after the migration.
State migrations
When the data structures that store and persist actor state have to be modified in a non-backwards-compatible manner, special care must be taken to ensure that older versions of your actors (which may still be active in your cluster while performing a rolling update to a newer version of your code) still work properly on the new version of the data.
Even though Darlean provides a built-in migration mechanism that handles the associated complexities of state migrations for you, it is better to avoid the need for migrations whenever you can.
Using objects instead of primitives
Using objects instead of primitives is a simple way to avoid the need for future migrations. When defining a data structure, it is tempting to take a shortcut and to define fields as primitives:
// Discouraged: Use a primitive (string) to represent the credit card number
interface ICheckOutState {
    creditcard: string;
}
But when your software evolves, you may find out that you not only need the credit card number, but also the expiry date:
// Discouraged: Use primitives at the highest level to represent credit card data
interface ICheckOutState {
    creditcardNumber: string;
    creditcardExpiryDate: string;
}
We had to rename `creditcard` to `creditcardNumber` (for consistency) and add `creditcardExpiryDate`. Of course, we also adjust our business logic to use `creditcardNumber` instead of just `creditcard`.
The issue is that during a rolling deployment of the new software version, some processes already run the new version while others still run the old version. When an actor stores its state in the new format and later reincarnates on an old process, the old code loads that state, does not find the `creditcard` field, and your business logic is likely to fail.
A technique that would have saved you here is to define `creditcard` not as a primitive value (a `string` in this case) but as an object (or interface), even when you do not yet know which additional fields will be required in the future:
// Recommended: use an object to represent the credit card data
interface ICheckOutState {
    creditcard: {
        cardNumber: string;
    }
}
Now, it is easy to add fields to the state without breaking old business logic:
// Recommended: It's easy to add new credit card fields to a credit card object.
interface ICheckOutState {
    creditcard: {
        cardNumber: string;
        expiryDate: string;
    }
}
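To illustrate why the object-based layout survives a rolling update, the sketch below simulates an old process reading state that a new process has written. It uses a JSON round trip in place of Darlean persistence, and the reader function `oldReadCardNumber` is hypothetical, not part of Darlean.

```typescript
// State shape known to the OLD software version.
interface CheckOutStateV1 {
  creditcard: { cardNumber: string };
}
// State shape written by the NEW software version: one extra nested field.
interface CheckOutStateV2 {
  creditcard: { cardNumber: string; expiryDate: string };
}
// Hypothetical business logic of the OLD version: it only knows about `cardNumber`.
function oldReadCardNumber(raw: string): string {
  const state = JSON.parse(raw) as CheckOutStateV1;
  return state.creditcard.cardNumber;
}
// A NEW process persists its state (simulated here with JSON serialization).
const newState: CheckOutStateV2 = {
  creditcard: { cardNumber: '4111111111111111', expiryDate: '12/30' },
};
const persisted = JSON.stringify(newState);
// An OLD process can still read it: the extra `expiryDate` field is simply ignored.
console.log(oldReadCardNumber(persisted)); // prints "4111111111111111"
// Contrast with the primitive layout: the new version renamed `creditcard` to
// `creditcardNumber`, so the old lookup of `creditcard` finds nothing.
const primitivePersisted = JSON.stringify({ creditcardNumber: '4111111111111111', creditcardExpiryDate: '12/30' });
const oldPrimitiveRead = (JSON.parse(primitivePersisted) as { creditcard?: string }).creditcard;
console.log(oldPrimitiveRead); // prints "undefined"
```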
Performing state migrations
Often, following our recommendation to use objects instead of primitives is enough: it keeps data structures backwards-compatible, so that old versions of the software can still read them.
But there are also times when this is not possible. For example:
- When the refactoring is too large to be done by just adding fields. For example, when fields are renamed, moved to a different place, or change type.
- When you do not want old versions of the software to use your new data format, because their lack of new functionality would corrupt data or make the logic misbehave.
For these scenarios, Darlean provides an easy-to-use state migration mechanism.
How state migration works
- Migrations work on a per-actor basis. This makes Darlean different from other systems, where migrations typically impact entire tables at once. Because Darlean provides an actor lock (which guarantees that, for one actor, only one instance is active at any moment within the entire cluster), and because state is stored per actor (and is schema-less, so the structure of the state can differ between actors of the same type), Darlean can safely perform migrations on a per-actor basis. This keeps individual migration actions small and the system responsive, as actors are migrated one by one, just before they are activated.
- Migrations are typically registered with Darlean in the suite function. A migration has a version number and code that performs the migration. The version numbers follow the principles of semantic versioning: the first (major) number indicates a non-backwards-compatible change, the second (minor) number indicates new backwards-compatible functionality, and the third (patch) number indicates a backwards-compatible fix without new functionality.
- When a migration is performed (typically during the activation of an actor), the migration mechanism automatically adds (or updates) a migration-info structure in the state object that contains the last applied migration version.
- When an actor loads its state (typically during activation), Darlean automatically compares the version in the migration-info with the version supported by the software, which is the largest version number of all registered migrations. When the major (first) number in the migration-info is greater than the major number of the supported version, an exception is thrown. This prevents old actors from handling new data. When this happens during activation, the framework activates the actor on another process that already supports the new version.
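The version check from the last bullet can be sketched as follows. This is an illustration of the described rule, not Darlean's implementation: the functions `parseVersion`, `supportedVersion` and `checkCompatible` are hypothetical names.

```typescript
// Parse "major.minor.patch" into three numbers (missing parts default to 0).
function parseVersion(v: string): [number, number, number] {
  const [major = 0, minor = 0, patch = 0] = v.split('.').map(Number);
  return [major, minor, patch];
}
// Compare two semantic versions: negative, zero or positive.
function compareVersions(a: string, b: string): number {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return 0;
}
// The version supported by this software: the largest registered migration version.
function supportedVersion(migrationVersions: string[]): string {
  return migrationVersions.reduce((max, v) => (compareVersions(v, max) > 0 ? v : max), '0.0.0');
}
// Throws when the stored state has a higher MAJOR version than this software supports.
// Newer minor or patch versions are accepted because they are backwards-compatible.
function checkCompatible(storedVersion: string, migrationVersions: string[]): void {
  const supported = supportedVersion(migrationVersions);
  if (parseVersion(storedVersion)[0] > parseVersion(supported)[0]) {
    throw new Error(`State version ${storedVersion} is newer than supported version ${supported}`);
  }
}
checkCompatible('1.5.0', ['1.0.0', '1.1.0']); // passes: same major version
// checkCompatible('2.0.0', ['1.0.0', '1.1.0']) would throw: newer major version
```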
Triggering state migrations
The version-checking mechanism as described in the last bullet of the previous section is always enabled to prevent old actors from accidentally processing new data.
The actual triggering of the migration, however, requires manual work. Instead of the typical `await this.state.load()` line in the `activate` actor method, you trigger the migration (which, as a side effect, also loads the state):
constructor(private state: IPersistable<MyState>, private mc: IMigrationContext) {}
@activator()
public async activate() {
    // Replaces `await this.state.load()`: runs pending migrations and, as a side effect, loads the state
    await this.mc.perform(this.state, async () => this.state.value?.name, { ... });
    // Now do something with the migrated `this.state.value`
}
The provided migration context takes care of the migration. It loads the provided `state`, runs all registered migrations that have not yet been run for this state, updates the migration-info in the state, and stores the state so that the migration changes are persisted.
After that, the software can directly access and use `state.value`, which then contains the result of the migration.
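The steps performed by the migration context can be sketched with plain functions. This is a simplified model under stated assumptions: the `Migration` and `StoredState` shapes and the `perform` signature below are hypothetical, the migrator receives no context, and the major-version check is left out. Darlean's real `IMigrationContext` may differ.

```typescript
// Hypothetical shapes; Darlean's real types and signatures may differ.
interface Migration<T> {
  version: string;
  name: string;
  migrator: (state: T) => Promise<void>;
}
interface StoredState<T> {
  value: T;
  migrationInfo?: { version: string }; // last applied migration version
}
// True when semantic version `a` is newer than `b`.
function isNewer(a: string, b: string): boolean {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    const x = pa[i] ?? 0;
    const y = pb[i] ?? 0;
    if (x !== y) return x > y;
  }
  return false;
}
// Sketch of `perform`: load the state, run the pending migrations in registration
// order, record the last applied version, and store the result.
async function perform<T>(
  load: () => Promise<StoredState<T>>,
  store: (state: StoredState<T>) => Promise<void>,
  migrations: Migration<T>[],
): Promise<StoredState<T>> {
  const state = await load();
  let current = state.migrationInfo?.version ?? '0.0.0';
  let changed = false;
  for (const m of migrations) {
    if (isNewer(m.version, current)) {
      await m.migrator(state.value);
      current = m.version;
      changed = true;
    }
  }
  if (changed) {
    state.migrationInfo = { version: current };
    await store(state); // persist the migrated value together with the new migration-info
  }
  return state;
}
```

Because only migrations newer than the recorded version run, calling `perform` on every activation is cheap once an actor is up to date.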
Registering state migrations
Migrations can be registered as part of a suite where the individual actors are also defined.
const suite = new ActorSuite();
suite.addActor<MyState>({
    type: 'MyActor',
    kind: 'singular',
    creator: (context) => {
        const state = context.persistence<MyState>('state');
        return new MyActor(state, context.migrationContext());
    },
    migrations: [
        { version: '1.0.0', name: 'From km/h to m/s', migrator: async (state, _context) => {
            state.velocity /= 3.6;
        } },
        { version: '1.1.0', name: 'Added driver', migrator: async (state, _context) => {
            state.driver = state.owner;
        } },
        { version: '2.0.0', name: 'Multiple drivers', migrator: async (state, _context) => {
            state.drivers = state.driver ? [state.driver] : [];
            state.driver = undefined;
        } }
    ]
});
In this code snippet, the migration context is assigned to the actor via the constructor, and a list of three migrations is defined for an imaginary vehicle state.
- The first migration transforms an existing velocity value from km/h to m/s by dividing it by 3.6. Because this change is not backwards-compatible (old code that checks for speed limits will fail when the velocity is suddenly much smaller), the major (first) number of the version is incremented (from 0, the default, to 1 in this case).
- The second migration adds a `driver` field to the vehicle state and initializes it with the current value of the `owner` field. We assume (for the sake of this example) that old software can work with this change, so we keep the major (first) version number at 1 and increment the minor (second) version number.
- The third migration adds support for multiple vehicle drivers. It adds a new `drivers` field and populates it with the current `driver` value (if any). Then it removes the old `driver` field. Because this change is not backwards-compatible (old software expects a `driver` field that is no longer there), we increment the major version number and reset the minor and patch numbers to 0.
Note: In the above example, the migrations are defined ‘inline’ in the same file that defines the suite. That is of course not required. Especially for larger or more complicated migrations, it is perfectly fine to place the migrations in their own files and just reference them here.
As this example illustrates, migrations build on each other. They are processed one by one, in the order in which they are defined. Migration `2.0.0` depends on migration `1.1.0`: it assumes that migration `1.1.0` already created a `driver` field and assigned the `owner` to it.
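To make the ordering concrete, the sketch below applies the three vehicle migrators from the suite above, one by one, to a plain object that has never been migrated. The `VehicleState` shape and the `migrateAll` helper are assumptions for illustration. Darlean's own mechanism also tracks versions and skips migrations that were already applied.

```typescript
// Hypothetical vehicle state; fields are optional because older versions lack some of them.
interface VehicleState {
  velocity: number;   // km/h before 1.0.0, m/s afterwards
  owner?: string;
  driver?: string;    // introduced in 1.1.0, removed in 2.0.0
  drivers?: string[]; // introduced in 2.0.0
}
type Migrator = (state: VehicleState) => Promise<void>;
// The same three migrators as in the suite, in registration order.
const vehicleMigrations: { version: string; migrator: Migrator }[] = [
  { version: '1.0.0', migrator: async (s) => { s.velocity /= 3.6; } },
  { version: '1.1.0', migrator: async (s) => { s.driver = s.owner; } },
  { version: '2.0.0', migrator: async (s) => { s.drivers = s.driver ? [s.driver] : []; s.driver = undefined; } },
];
// Apply every migration in the order in which it is defined; 2.0.0 relies on the
// `driver` field that 1.1.0 created.
async function migrateAll(state: VehicleState): Promise<VehicleState> {
  for (const m of vehicleMigrations) {
    await m.migrator(state);
  }
  return state;
}
migrateAll({ velocity: 36, owner: 'alice' }).then((s) => console.log(s));
```

Running the migrators in a different order (for example, `2.0.0` before `1.1.0`) would leave `drivers` empty, which is why the registration order matters.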
Migrations are purposely defined in the suite logic, outside of the actor itself. This allows the actor to focus solely on its business logic for the current version of the state, which helps to keep the code clean and maintainable.
Multiple states
Some (complex) actors may have more than one state object. For example, a `User` actor that stores static (long-term) information such as name and birthday in one state object, and dynamic (short-term) information (like the password or the moment of last login) in another state object.
Such actors are assumed to have one of those state objects as their main state object, which stores the migration-info. The actor as a whole hence has a single list of registered migrations and a single latest version number. The individual migrations are not limited to changing the main state; they can also change the other state objects. The `context` object can be used to pass these additional states to the migration code.
The first state on which `load` is invoked is considered to be the main state object for the actor. So the order of the `load` calls in the `activator` actor method is important.
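As an illustration of a migration that touches a secondary state, the sketch below assumes a hypothetical older `User` version that wrongly kept `lastLogin` in the static (main) state. The migration moves the field into the dynamic state, which it receives through an application-specific context. All type and function names are hypothetical. Only the main state would carry the migration-info.

```typescript
// Main (static, long-term) state: carries the migration-info in Darlean.
interface UserStatic {
  name: string;
  birthday: string;
  lastLogin?: string; // hypothetical: wrongly placed here by an older version
}
// Secondary (dynamic, short-term) state.
interface UserDynamic {
  lastLogin?: string;
}
// Hypothetical application-specific context that carries the non-main states.
interface UserMigrationContext {
  dynamic: UserDynamic;
}
// Migrator in the `(state, context)` style: it changes the main state and,
// via the context, the secondary state.
async function moveLastLogin(main: UserStatic, context: UserMigrationContext): Promise<void> {
  if (main.lastLogin !== undefined) {
    context.dynamic.lastLogin = main.lastLogin;
    delete main.lastLogin;
  }
}
```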
The context object
For complex migrations, it may not be sufficient to operate only on the (main) state of the actor. For example, a `LogBook` actor may store its individual log entries in a separate (table) persistence. Or a `ShoppingCart` may store volatile information (like the items the user is currently selecting) in one persistable, and more static information (like the definitive list of items ordered once the order is completed) in a different persistable. Or a successful migration may require querying additional data sources (like, when migrating prices from dollars to euros, the conversion rate at the moment the order was placed).
All those additional resources can be assigned to an application-specific context object that is automatically passed to all migration handlers that can use the context to perform their migrations.
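The dollars-to-euros case could look roughly like this. The sketch assumes a hypothetical context that exposes a `getUsdToEurRate(date)` lookup. In a real application this would query an external rate source. The `OrderState` shape is also hypothetical.

```typescript
// Hypothetical order state.
interface OrderState {
  placedAt: string;  // ISO date on which the order was placed
  priceUsd?: number; // old field
  priceEur?: number; // new field
}
// Hypothetical application-specific context: gives migrators access to an external data source.
interface PriceMigrationContext {
  getUsdToEurRate: (date: string) => Promise<number>;
}
// Converts the price using the rate that applied when the order was placed.
// The original `priceUsd` field is left intact.
async function migrateToEuros(state: OrderState, context: PriceMigrationContext): Promise<void> {
  if (state.priceUsd === undefined) return;
  const rate = await context.getUsdToEurRate(state.placedAt);
  state.priceEur = Math.round(state.priceUsd * rate * 100) / 100; // round to cents
}
```

Passing the rate lookup through the context keeps the migrator free of hard-wired dependencies, so it can be tested with a stub, as in `{ getUsdToEurRate: async () => 0.9 }`.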
When are state migrations applied?
In other systems, migrations are typically applied at a fixed moment, like when the application starts or the operator presses a button. Migrations are often ‘global’, in that they impact entire database tables (like adding or removing a column), which often is a stop-the-world action during which the application experiences downtime.
This is different in Darlean. In Darlean, state migrations are typically performed per actor; more precisely, during the activation phase of an individual actor.
Because Darlean uses schema-less persistence, changes in structure can be applied actor by actor. Because actors typically have an actor lock (no more than one instance of the same actor is active within the cluster) and the activation phase is also locked (no other actions on the same actor can take place), the activation phase is a natural place to perform a migration. The actor logic itself is not yet invoked, so the migration code has full freedom to manipulate the actor state however it likes.
The main advantage: migrations are not stop-the-world but are performed actor-by-actor when the actor is first accessed after the migration has been registered. Since most migrations are very fast (because they operate only on the state of a single actor), the invoking code does not even notice the slight migration delay.
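The lazy, per-actor behaviour can be simulated with an in-memory map. This is a toy model under simplifying assumptions: versions are plain integers, `activate` is a hypothetical stand-in for Darlean's activation phase, and the migration step itself is invented.

```typescript
// Hypothetical in-memory "persistence": one state object per actor id.
interface CounterState { count: number; version: number }
const store = new Map<string, CounterState>([
  ['a', { count: 1, version: 0 }],
  ['b', { count: 2, version: 0 }],
  ['c', { count: 3, version: 0 }],
]);
const CURRENT_VERSION = 1;
// Called when an actor is activated: only this actor's state is migrated.
function activate(id: string): CounterState {
  const state = store.get(id);
  if (!state) throw new Error(`Unknown actor ${id}`);
  if (state.version < CURRENT_VERSION) {
    state.count *= 10; // hypothetical migration step
    state.version = CURRENT_VERSION;
  }
  return state;
}
activate('b');
// Only actor 'b' has been migrated; 'a' and 'c' keep their old state until first accessed.
console.log(Array.from(store.entries()).map(([id, s]) => `${id}:${s.version}`).join(' ')); // prints "a:0 b:1 c:0"
```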
Rolling back state migrations
Rolling back state migrations is an interesting topic. Many frameworks provide a mechanism for rolling back migrations that did not behave as expected. However, this approach assumes that all migrations can be rolled back, and in practice that assumption does not hold. Part of a migration can be to update data in external systems or to send notifications. Other migrations merge two fields (like first name and last name) in a way that cannot always be split up again afterwards. So, not all migration actions can be rolled back.
Therefore, Darlean does not provide a rollback mechanism. Migrations should be tested thoroughly up front. And before performing a migration, it is wise to create a backup of the persistence data (which Darlean typically stores on a local or shared disk). In case of a failed migration, the entire cluster can be stopped and the data restored, but we see this only as a last resort.
So, when your migration fails unexpectedly, what can you do?
- Improve your testing procedures for next time.
- Create a new migration that fixes the issues, and deploy it as soon as you can.
- Make your migrations backwards-compatible, so that you can switch back to an older version of your software. In the example of combining `firstName` and `lastName` into a new field `fullName`, consider keeping the original fields intact, so that you can revert to an older version of your software that still understands the old values.
- Stop the cluster and restore the persistence data from a backup that you made just prior to deploying the migration.
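A backwards-compatible version of the `fullName` migration from the third bullet might look like this. It is a minimal sketch: the `PersonState` shape and the `addFullName` name are hypothetical, and the key point is that `firstName` and `lastName` are deliberately not removed.

```typescript
// Hypothetical person state.
interface PersonState {
  firstName?: string;
  lastName?: string;
  fullName?: string; // added by the migration
}
// Combine first and last name, but keep the original fields so that an older
// software version that still reads `firstName`/`lastName` keeps working.
async function addFullName(state: PersonState): Promise<void> {
  state.fullName = [state.firstName, state.lastName].filter((part) => part).join(' ');
}
```

Because the old fields survive, this change can be versioned as a minor (backwards-compatible) migration, and switching back to the previous software version remains possible.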
Performing table migrations
Most actors have a very simple state, but some actors act as a container for a lot of other data. For example, a `LoggingActor` that contains many log entries. When the combined size of these entries can exceed what should reasonably fit into one state blob (like 100 KB or 1 MB), it is good practice to store the items in table persistence.
Like regular persistence, table persistence is scoped to the actor that uses this table persistence. The actor is the only one within the cluster that owns and can access this table data.
A migration that is needed because the format of the underlying table entries has changed is performed much like a regular migration. The only difference is that the migration handler iterates over all table entries, converts the data, and stores the updated entries.
Because the actor lock still applies, no other parallel calls are made to these table entries, so it is safe to perform the migration. To speed up the migration, parts of it can be run in parallel. It is up to the migration handler to remember which items have already been converted, so that after an error the migration can be resumed without corrupting entries that were already converted.
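A resumable table migration might look like the sketch below. Several parts are assumptions: an in-memory `Map` stands in for Darlean's table persistence (whose API is not shown here), the entry formats are invented, and the entries are processed sequentially rather than in parallel for clarity. Progress is recorded in a hypothetical `migratedUpTo` marker in the main state.

```typescript
// Hypothetical table entry formats.
interface LogEntryV1 { text: string; time: number }          // seconds since epoch
interface LogEntryV2 { message: string; timestampMs: number } // milliseconds since epoch
// Stand-in for table persistence: key -> entry.
type Table = Map<string, LogEntryV1 | LogEntryV2>;
// Progress marker kept in the actor's main state so that a failed migration can resume.
interface MainState { migratedUpTo?: string }
function isV1(e: LogEntryV1 | LogEntryV2): e is LogEntryV1 {
  return 'text' in e;
}
// Converts entries in key order and records progress after every entry, so a restart
// skips keys that were already converted instead of converting them twice.
async function migrateTable(table: Table, main: MainState, saveMain: () => Promise<void>): Promise<void> {
  const keys = Array.from(table.keys()).sort();
  for (const key of keys) {
    if (main.migratedUpTo !== undefined && key <= main.migratedUpTo) continue; // already done
    const entry = table.get(key);
    if (entry && isV1(entry)) {
      table.set(key, { message: entry.text, timestampMs: entry.time * 1000 });
    }
    main.migratedUpTo = key;
    await saveMain(); // persist progress
  }
}
```

Saving progress after every entry is the simplest correct choice. A real handler would likely save after each batch instead.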
Logic migrations
When it comes to implementing and deploying new or changed logic (functionality), we must distinguish between situations where downtime is acceptable and situations where it is not.
Migration with downtime
When downtime is acceptable, logic migration is straightforward. Because we can temporarily stop the entire cluster, install new versions of all applications, and then restart the cluster, there are no backwards-compatibility issues: all running applications are either stopped or run the same logic version.
So we just adjust our software to contain and invoke the new logic as we would do with regular (monolithic) software, and then deploy that all together.
Migration without downtime
When downtime is not acceptable, the stop-the-world approach as described in the previous section is obviously not a good solution.
For a zero-downtime migration, it is a prerequisite to have a cluster with redundant applications (so more than one instance of each application type). The trick here is to implement the logic changes in a backwards-compatible manner so that components can be migrated one-by-one, and old components still work as expected, even when they invoke new logic components.
Examples of how to implement logic changes in a backwards-compatible way are:
- Adding additional parameters/fields to existing action methods.
- Adding additional fields to the data returned by existing action methods.
- Adding a new action method with a different name (like `greetV1` next to an existing `greet` method) that contains the new logic.
Note: When using the third approach of creating a new action method with a different name, it is possible to get rid of the `greetV1` action method names after the migration has succeeded. When the migration is complete and all applications use the new logic, copy the new action methods to their original names (so that you have both `greetV1` and `greet` with the same functionality) and deploy this application by application. Then switch callers back to `greet`, remove the `greetV1`-style methods everywhere, and deploy again, application by application.
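The coexistence phase of the third approach can be sketched as a plain class. The class name `GreeterActor` is hypothetical, and any decorators Darlean requires to expose action methods are omitted. Old callers keep using `greet` with its original behaviour, while updated callers switch to `greetV1`.

```typescript
// Hypothetical actor class during a zero-downtime logic migration.
class GreeterActor {
  // Old logic, unchanged, so that callers that have not been updated keep working.
  public async greet(message: string): Promise<string> {
    return message;
  }
  // New logic under a new name; only updated callers invoke it.
  public async greetV1(options: { message: string; receiver?: string }): Promise<string> {
    return options.receiver ? `${options.message} ${options.receiver}` : options.message;
  }
}
```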
What not to do
Let’s start with some words on what not to do when it comes to logic migrations. It might be tempting to duplicate your actor (create a copy of your actor under a different name) and add the new logic to that new actor type. Or to register your current actor under 2 different names.
There are two reasons why this is not a good idea:
- Using a different actor type means that actors of the two types cannot access each other's state. So, you would have to find a way to give the new actor type access to the state of the old actor type with the same id.
- Using a different actor type breaks the global actor lock that guarantees that for a single combination of actor type and actor id, only a single instance is active within the whole cluster. When you create a duplicate actor type, suddenly 2 instances for the same id can be active at the same time: one for the old actor type and one for the new actor type. This is typically not what you want.
Below, we provide two better approaches (tips) for performing logic migrations.
Tip 1: Using objects instead of primitives
To make it easier to add new functionality later on in a backwards-compatible manner, it is important to consider using objects instead of primitives as action method arguments and return types. So, instead of `greet(message: string): Promise<string>`, take the effort to define `IGreetOptions` and `IGreetResult` interfaces, and use those in the `greet` action method:
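// Recommended: both the options and the result can easily be extended
export interface IGreetOptions {
    message: string;
    receiver?: string; // New field
}
export interface IGreetResult {
    greeting: string;
}
// Action method inside the actor class
public async greet(options: IGreetOptions): Promise<IGreetResult> {
    // `receiver` is optional, so callers that still send only `message` keep working
    const greeting = options.receiver ? `${options.message} ${options.receiver}` : options.message;
    return { greeting };
}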
It’s a bit more work, that’s for sure. But you will be grateful to yourself when you find out that adding new functionality at a later moment is a breeze because of this.
Tip 2: Shield implementations via service actors
Very much like you hide implementations (in the form of objects) from the rest of your logic via interfaces in object oriented programming, it is also possible (and good practice) to shield actors from the rest of your logic by means of service actors.
A service actor is like a regular actor, but it acts like a wrapper around the actors that implement your logic. Your application does not directly invoke the underlying actors, but it invokes the corresponding action method on the service actor, which knows exactly which ‘internal’ actor to invoke with which parameters.
The advantage of using service actors is that they hide most of the implementation details from the rest of your application. When you make proper use of service actors, they allow you to:
- Freely rename the underlying actors. For example, because new insights have taught you that the old name no longer covers what the actor is doing.
- Freely merge or split actors. For example, because you realize that one actor has grown so much that it now actually does multiple things.
- Freely move functionality between actors
- Freely rename internal action methods or change their parameters.
Using the service actor pattern gives much more freedom to make backwards-compatible refactorings that may even eliminate the need for complex migration scenarios. At the very least, it allows you to keep your application code unchanged, while keeping any necessary migrations inside your service actor and the underlying implementation actors.
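A minimal sketch of the pattern, under explicit assumptions: the class names are hypothetical, and the service holds a direct reference to the internal actor instead of invoking it through Darlean's actor-invocation mechanism. The application depends only on `QuoteServiceActor` and its option/result types, so the internal actor can be renamed, split or merged by changing the body of `quote` alone.

```typescript
// Hypothetical internal implementation actor; free to be renamed, merged or split.
class PriceCalculatorActor {
  public async calculate(items: { price: number; quantity: number }[]): Promise<number> {
    return items.reduce((sum, i) => sum + i.price * i.quantity, 0);
  }
}
// Public contract of the service: the only types the application depends on.
interface IQuoteOptions { items: { price: number; quantity: number }[] }
interface IQuoteResult { total: number }
// Hypothetical service actor: the application calls this, never the internal actor.
class QuoteServiceActor {
  private readonly calculator: PriceCalculatorActor;
  constructor(calculator: PriceCalculatorActor) {
    this.calculator = calculator;
  }
  public async quote(options: IQuoteOptions): Promise<IQuoteResult> {
    // Only this mapping changes when the internal actors are refactored.
    const total = await this.calculator.calculate(options.items);
    return { total };
  }
}
```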