Percona XtraDB Cluster Transaction Replay Anomaly
If we wanted to, we could even do a Point-In-Time Recovery, but in our case it does not really matter: once replication is configured, all required transactions from the binlogs will be applied on the new cluster.
InnoDB: A transaction-safe (ACID compliant) storage engine for MySQL that has commit, rollback, and crash-recovery capabilities to protect user data. InnoDB row-level locking (without escalation to coarser granularity locks) and Oracle-style consistent nonlocking reads increase multi-user concurrency and performance. InnoDB stores user data in clustered indexes to reduce I/O for common queries based on primary keys. To maintain data integrity, InnoDB also supports FOREIGN KEY referential-integrity constraints. InnoDB is the default storage engine as of MySQL 5.5.5.
As long as the SQL statements execute in the same order when replayed from the binary log (when using statement-based replication, or in recovery scenarios), the results will be the same as they were when Tx1 and Tx2 first ran. Thus, table-level locks held until the end of a statement make INSERT statements using auto-increment safe for use with statement-based replication. However, those locks limit concurrency and scalability when multiple transactions are executing insert statements at the same time.
A consistent read means that InnoDB uses multi-versioning to present to a query a snapshot of the database at a point in time. The query sees the changes made by transactions that committed before that point of time, and no changes made by later or uncommitted transactions. The exception to this rule is that the query sees the changes made by earlier statements within the same transaction. This exception causes the following anomaly: If you update some rows in a table, a SELECT sees the latest version of the updated rows, but it might also see older versions of any rows. If other sessions simultaneously update the same table, the anomaly means that you might see the table in a state that never existed in the database.
Internally, InnoDB adds three fields to each row stored in the database. A 6-byte DB_TRX_ID field indicates the transaction identifier for the last transaction that inserted or updated the row. Also, a deletion is treated internally as an update where a special bit in the row is set to mark it as deleted. Each row also contains a 7-byte DB_ROLL_PTR field called the roll pointer. The roll pointer points to an undo log record written to the rollback segment. If the row was updated, the undo log record contains the information necessary to rebuild the content of the row before it was updated. A 6-byte DB_ROW_ID field contains a row ID that increases monotonically as new rows are inserted. If InnoDB generates a clustered index automatically, the index contains row ID values. Otherwise, the DB_ROW_ID column does not appear in any index.
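To make the role of DB_TRX_ID and DB_ROLL_PTR more concrete, here is a minimal Python sketch of how a consistent read could walk back through older row versions. It is a toy model, not InnoDB's implementation: the `RowVersion` and `ReadView` classes, their fields, and the visibility rule are simplifying assumptions made for illustration, and the roll pointer is modelled as a direct reference to the previous version rather than to an undo log record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    """One version of a row; field names mirror the hidden InnoDB columns."""
    value: str
    db_trx_id: int                              # transaction that wrote this version
    db_roll_ptr: Optional["RowVersion"] = None  # previous version (stands in for the undo record)
    deleted: bool = False                       # delete-mark bit


@dataclass
class ReadView:
    """Hypothetical, simplified snapshot: which transactions count as committed."""
    next_trx_id: int   # IDs >= this started after the snapshot was taken
    active: frozenset  # IDs still uncommitted when the snapshot was taken
    own_trx_id: int    # the reader's own transaction

    def sees(self, trx_id: int) -> bool:
        if trx_id == self.own_trx_id:
            return True  # a transaction always sees its own changes
        return trx_id < self.next_trx_id and trx_id not in self.active


def consistent_read(latest: RowVersion, view: ReadView) -> Optional[str]:
    """Follow roll pointers until a version visible to the view is found."""
    version = latest
    while version is not None and not view.sees(version.db_trx_id):
        version = version.db_roll_ptr
    if version is None or version.deleted:
        return None
    return version.value


# Row inserted by trx 10, updated by trx 20, then updated again by trx 30.
v1 = RowVersion("a", db_trx_id=10)
v2 = RowVersion("b", db_trx_id=20, db_roll_ptr=v1)
v3 = RowVersion("c", db_trx_id=30, db_roll_ptr=v2)

# Reader 25 took its snapshot while trx 20 was still active and before trx 30 started.
view = ReadView(next_trx_id=26, active=frozenset({20}), own_trx_id=25)
result = consistent_read(v3, view)
```

In this example the reader skips the versions written by transactions 30 and 20 and ends up with the version written by transaction 10, which is the only one that had committed when the snapshot was taken.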
Before it can start executing, a CREATE INDEX or ALTER TABLE statement must wait for currently executing transactions that are accessing the table to commit or roll back. In addition, ALTER TABLE statements that create a new clustered index must wait for all SELECT statements that access the table to complete (or their containing transactions to commit). Even though the original index exists throughout the creation of the new clustered index, no transactions whose execution spans the creation of the index can be accessing the table, because the original table must be dropped when the clustered index is restructured.
In this case we are talking about transactions that are long timewise, which are definitely problematic in Galera. The main thing to understand is that Galera replicates transactions as writesets. Those writesets are certified on the members of the cluster, ensuring that all nodes can apply a given writeset. The problem is that locks are created on the local node and are not replicated across the cluster. Therefore, if your transaction takes several minutes to complete and you are writing to more than one Galera node, it becomes more and more likely over time that transactions on one of the remaining nodes will modify some of the rows updated in your long-running transaction. This will cause certification to fail, and the long-running transaction will have to be rolled back. In short, if you send writes to more than one node in the cluster, the longer the transaction, the more likely it is to fail certification due to a conflict.
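The conflict described above can be illustrated with a small Python sketch of first-committer-wins certification. This is a toy model under stated assumptions, not Galera's actual certification code: the `WriteSet` and `Certifier` names, the `snapshot_seqno` field, and the single in-memory index are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    """A transaction's changes as replicated at commit time (simplified)."""
    name: str
    keys: frozenset      # row keys modified by the transaction
    snapshot_seqno: int  # last cluster seqno the transaction had seen when it started


class Certifier:
    """Toy first-committer-wins certification; every node would run the same logic."""

    def __init__(self) -> None:
        self.seqno = 0
        self.last_writer = {}  # row key -> seqno of the last certified writeset touching it

    def certify(self, ws: WriteSet) -> bool:
        self.seqno += 1  # every writeset gets a position in the total order
        conflict = any(self.last_writer.get(k, 0) > ws.snapshot_seqno for k in ws.keys)
        if conflict:
            return False  # the originating node must roll the transaction back
        for k in ws.keys:
            self.last_writer[k] = self.seqno
        return True


cert = Certifier()
# Both transactions start from the same cluster state, on different nodes.
long_trx = WriteSet("long", frozenset({1, 2, 3, 4, 5}), snapshot_seqno=cert.seqno)
short_trx = WriteSet("short", frozenset({3}), snapshot_seqno=cert.seqno)

short_ok = cert.certify(short_trx)  # commits first, while the long one is still running
long_ok = cert.certify(long_trx)    # row 3 changed after its snapshot, so it fails
```

The longer a transaction runs and the more rows it touches, the more chances other writesets have to update one of its keys first, which is why the long transaction is the one that loses here.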
Galera 4 comes with Streaming Replication, which can be used to mitigate these problems. The main difference is that the writeset can now be split into parts: it is no longer necessary to wait for the whole transaction to finish before data is replicated. This may make you wonder what certification looks like in such a case. In short, certification happens on the fly: each fragment is certified and all involved rows are locked on all of the nodes in the cluster. This is a serious change in how Galera works. Until now locks were created locally; with streaming replication, locks are created on all of the nodes. This helps in the cases we discussed above: locking rows as transaction fragments come in reduces the probability that the transaction will have to be rolled back. Conflicting transactions executed locally will not be able to get the locks they need and will have to wait for the replicating transaction to complete and release the row locks.
Of course, there are drawbacks to running streaming replication, mainly because locks are now taken on all nodes in the cluster. If you have seen a large transaction rolling back for ages, such a transaction will now have to roll back on all of the nodes. Obviously, the best practice is to reduce the size of a transaction as much as possible to avoid rollbacks taking hours to complete. Another drawback is that, for crash recovery reasons, writesets created from each fragment are stored in the wsrep_schema.SR table on all nodes, which in effect acts as a double-write buffer and increases the load on the cluster. Therefore you should carefully decide which transactions should be replicated using streaming replication and, as long as it is feasible, still stick to the best practice of having small, short transactions or splitting large transactions into smaller batches.
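As an illustration of splitting a large transaction into smaller batches, the following Python sketch breaks one big delete into short statements over primary-key ranges. The table name `events`, the integer key column `id`, and both helper functions are hypothetical; the sketch only builds the SQL text and assumes each statement would then be run and committed as its own transaction.

```python
from typing import Iterator, List, Tuple


def id_batches(first_id: int, last_id: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) primary-key ranges covering first_id..last_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = first_id
    while start <= last_id:
        end = min(start + batch_size - 1, last_id)
        yield start, end
        start = end + 1


def batched_deletes(table: str, first_id: int, last_id: int, batch_size: int) -> List[str]:
    """Build one short DELETE per range; each is meant to run in its own transaction."""
    return [
        f"DELETE FROM {table} WHERE id BETWEEN {start} AND {end}"
        for start, end in id_batches(first_id, last_id, batch_size)
    ]


# Hypothetical table and key range, for illustration only.
statements = batched_deletes("events", 1, 10, 4)
```

Each resulting writeset covers only a few rows and is held open for a short time, which keeps both the certification conflict window and any rollback small.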
Finally, MariaDB users will be able to benefit from backup locks for SST. The idea behind SST executed using (for MariaDB) mariabackup is that the whole dataset has to be transferred on the fly, with redo logs being collected in the background. Then a global lock has to be acquired to ensure that no writes happen, and the final position of the redo log has to be collected and stored. Historically, for MariaDB, the locking part was performed using FLUSH TABLES WITH READ LOCK, which did its job but was quite hard to acquire under heavy load. It is also pretty heavy: not only do transactions have to wait for the lock to be released, but the data also has to be flushed to disk. Now, with MariaDB 10.4, it will be possible to use the less intrusive BACKUP LOCK, which does not require data to be flushed; only commits are blocked for the duration of the lock. This should mean less intrusive SST operations, which is definitely great to hear. Everyone who has had to run their Galera Cluster in emergency mode, on one node, keeping their fingers crossed that SST would not impact cluster operations, should be more than happy to hear about this improvement.
Save the file and we are good to go. A bit of explanation: the list above contains the requirements stated in the Camunda docs, especially on the supported transaction isolation for Galera. The variable wsrep_sync_wait is set to 7 to perform cluster-wide causality checks for READ (including SELECT, SHOW, and BEGIN or START TRANSACTION), UPDATE, DELETE, INSERT, and REPLACE statements, ensuring that the statement is executed on a fully synced node. Keep in mind that any value other than 0 can result in increased latency. Enabling Performance Schema is optional; it is used by the ClusterControl query monitoring feature.
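The value 7 is a bitmask, which a short Python sketch can make explicit. The helper name `describe_sync_wait` is hypothetical, and the mapping covers only the three bits mentioned above (1, 2 and 4); other bits that some server versions define are deliberately left out.

```python
# Bits of wsrep_sync_wait covered in the text above (assumed mapping: 1, 2, 4).
SYNC_WAIT_BITS = {
    1: "READ (SELECT, SHOW, BEGIN / START TRANSACTION)",
    2: "UPDATE and DELETE",
    4: "INSERT and REPLACE",
}


def describe_sync_wait(value: int) -> list:
    """Return the statement classes that a wsrep_sync_wait value checks."""
    return [label for bit, label in SYNC_WAIT_BITS.items() if value & bit]


checked = describe_sync_wait(7)  # 7 = 1 | 2 | 4, so all three classes are listed
```

A value of 0 returns an empty list, which corresponds to no causality checks and therefore no added latency.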
The problem was related to the fact that the instructions did not mention any file except /etc/mysql/my.cnf when, in fact, we should have been modifying /etc/mysql/percona-xtradb-cluster.conf.d/wsrep.cnf. That file contained an empty variable:
By default, PostgreSQL has its own built-in replication mode for Point In Time Recovery (PITR). This can be set up using either file-based log shipping, where Write Ahead Log files are shipped to a standby server where they are read and replayed, or Streaming Replication, where a read-only standby server fetches transaction logs over a database connection to replay them.