Say B is slightly out of date (replication-wise): the service modifies something, then A comes along with a change that modifies the same data you just wrote.
How do you ensure that B is up to date without stopping writes to A (no downtime)?
Have the old database be the master and let the new one be a slave. Load in the latest db dump; that can take as long as it wants.
Then start replication and let the slave catch up on the delay.
You would need, depending on the db type, a load balancer/failover manager. PgBouncer and Pgpool-II come to mind, but MySQL has equivalents as well. Let that connect to the master and the slave, and connect the application to the database through that layer.
Then trigger a failover. That should be it.
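To make the catch-up-then-failover step concrete, here is a minimal Python sketch, assuming PostgreSQL 12+ on both sides and psycopg2. The connection strings, lag threshold, and polling loop are illustrative assumptions, and repointing the pooler (PgBouncer/Pgpool-II) afterwards is left out:

    # Minimal sketch: wait for the new slave to catch up, then promote it.
    # Assumes PostgreSQL 12+; DSNs and the lag threshold are hypothetical.
    import time
    import psycopg2

    OLD_MASTER_DSN = "host=old-db dbname=app user=repl_admin"  # hypothetical
    NEW_SLAVE_DSN = "host=new-db dbname=app user=repl_admin"   # hypothetical
    MAX_LAG_BYTES = 1024  # treat the slave as "caught up" below this

    def replication_lag_bytes() -> int:
        """Ask the current master how far behind the slave's replay position is."""
        with psycopg2.connect(OLD_MASTER_DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) "
                "FROM pg_stat_replication"
            )
            row = cur.fetchone()
            return int(row[0]) if row and row[0] is not None else -1

    # Wait until the new database has essentially caught up on the delay...
    while True:
        lag = replication_lag_bytes()
        if 0 <= lag <= MAX_LAG_BYTES:
            break
        time.sleep(1)

    # ...then promote it to be the new master (pg_promote is PostgreSQL 12+).
    with psycopg2.connect(NEW_SLAVE_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT pg_promote(true)")

In practice you would also briefly pause or drain writes at the pooler while the promotion happens, so the last few transactions don't land on the old master.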
VDiff (v2) only compares the source and destination at a specific point in time, and a resume only compares rows with a PK higher than the last one compared before the pause. I assume this means:
1. VDiff doesn't catch updates to rows with a PK lower than the point where it was paused, which could have become corrupted, and
2. VDiff doesn't continuously validate CDC changes, meaning (unless you enforce extra downtime to run/resume a VDiff) you can never be 100% sure your data is valid before SwitchTraffic.
I'm curious: is this something customers even care about, or is point-in-time data validation sufficient to catch any issues that could occur during migrations?
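For what it's worth, point 1 can be illustrated with a rough Python sketch. This is not VDiff and not a Vitess API; it just re-hashes rows that changed after the point-in-time pass and compares source against target. The table, columns, and updated_at column are assumptions, and it relies on MySQL's CRC32/CONCAT_WS over plain DB-API connections:

    # Hypothetical re-check of rows touched after a point-in-time comparison.
    # Not VDiff; table/column names and the updated_at column are assumptions.
    TABLE = "customers"             # hypothetical
    CHECK_COLS = "id, name, email"  # hypothetical

    def row_checksums(conn, since):
        """Return {pk: checksum} for rows modified after `since` (MySQL CRC32)."""
        sql = (
            f"SELECT id, CRC32(CONCAT_WS('|', {CHECK_COLS})) "
            f"FROM {TABLE} WHERE updated_at > %s"
        )
        with conn.cursor() as cur:
            cur.execute(sql, (since,))
            return dict(cur.fetchall())

    def mismatched_pks(source_conn, target_conn, since):
        """PKs whose source and target checksums disagree after the point-in-time diff."""
        src = row_checksums(source_conn, since)
        dst = row_checksums(target_conn, since)
        return [pk for pk, crc in src.items() if dst.get(pk) != crc]

Of course this only catches rows you know changed; it doesn't answer whether continuously validating the CDC stream itself is worth the cost.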
These larger ones are fully using the PlanetScale SaaS, but they are using Managed -- meaning that there are resources dedicated to and owned by them. You can read more about that here: https://planetscale.com/docs/vitess/managed
All of the PlanetScale features, including imports and online schema migrations or deployment requests (https://planetscale.com/docs/vitess/schema-changes/deploy-re...), are fully supported with PlanetScale Managed.
4.0 Freeze the old system.
4.1 Cut over application traffic to the new system.
4.2 Merge any diff that happened between the snapshot in step 1 and the cutover in 4.1.
4.3 Go live.
To me, the above reduces the pressure on downtime because the merge between freeze and go-live is significantly smaller than trying to go live with the entire environment in one shot. If timed well, the diff could be minuscule.
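If it helps make 4.0 and 4.2 concrete, here is a rough Python sketch under some assumptions: MySQL-style databases reached through DB-API connections, a snapshot timestamp recorded in step 1, and an updated_at column on every table (all illustrative, and it ignores deletes and schema changes):

    # Rough sketch of 4.0 (freeze) and 4.2 (merge the diff since the snapshot).
    # Assumes MySQL-style DBs, an updated_at column, and a recorded snapshot time;
    # repointing traffic (4.1) and going live (4.3) happen at the app/config layer.
    TABLES = ["orders", "customers"]  # hypothetical

    def freeze_old_system(old_conn):
        # 4.0: stop accepting writes on the old system.
        with old_conn.cursor() as cur:
            cur.execute("SET GLOBAL read_only = ON")

    def merge_diff(old_conn, new_conn, snapshot_taken_at):
        # 4.2: copy over only the rows touched since the snapshot in step 1.
        # Deletes and schema changes are ignored; this is only the happy path.
        for table in TABLES:
            with old_conn.cursor() as cur:
                cur.execute(
                    f"SELECT * FROM {table} WHERE updated_at > %s",
                    (snapshot_taken_at,),
                )
                changed_rows = cur.fetchall()
            with new_conn.cursor() as cur:
                for row in changed_rows:
                    placeholders = ", ".join(["%s"] * len(row))
                    cur.execute(f"REPLACE INTO {table} VALUES ({placeholders})", row)
            new_conn.commit()

The smaller the window between freeze and go-live, the less there is to merge during the outage.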
What they are describing is basically live-mirroring the resource. Okay, that is fancy, nice. I'd love to be able to do that. Some of us have a mildly chewed piece of bubble gum, a foot of duct tape, and a shoestring.
Lots of systems can tolerate a lot more downtime than the armchair VPs want them to have.
If people can't access Instagram for 6 hours, the world won't end. Gmail or AWS S3 is a different story. Therefore Instagram should give their engineers a break and permit a migration with downtime. It makes the job a lot easier, requires fewer engineers and less cost, and is much less likely to have bugs.