Performance Guide to Qcow2

= Introduction

The Qcow2 image format as used by Qemu supports handling of sparse images (even when the underlying file system used to store the Qcow2 file doesn't), compression, encryption, base images and in-file snapshots.

Base images allow multiple images to be chained, so that writes go to the top level image file only, while reading returns the data from the layer which is nearest to the top and contains the data. This allows fast and efficient cloning of nearly identical instances, since for each instance only the delta has to be stored in an instance-specific image file, which re-uses the common content from a base image. This can also be used to create external snapshots: on creation, all IO to the image file is stopped and the file is renamed to a unique name before a new Qcow2 image file is created on top of the just renamed file, which is then used to store any changes from there on. On revert, the top level image file is simply replaced by a new one using the same underlying image again. On delete, the content of the renamed image file either has to be pushed into all image files which are based on it, or all subsequent snapshots and their files have to be deleted together.

On the other hand the Qcow2 image format also supports internal snapshots, which use an internal reference counting scheme to implement up to 64K snapshots. This has the benefit of simplifying the handling of snapshots, but also has several drawbacks:
- the implementation is complex and error-prone. Up to and including Qemu version 0.15.0 there is a bug which can corrupt the file.
- there is no easy and officially supported way to back up one snapshot while a domain using the image is still running.
- there are some performance problems, because the implementation must be very careful not to corrupt the file on errors.

This text is here to better understand the complexity involved.

= Structure

The file format contains some global and some per-snapshot data structures. A recurring data structure is a two level table: the top level table consists of multiple consecutive clusters, where each entry is a 64 bit offset into the file at which a second level table is stored, which has the size of exactly one cluster. In case of the reference count table, this second level table directly consists of 16 bit counters, while for the so-called L2 table each entry consists of another 64 bit file offset pointing to the cluster containing the actual data.

To efficiently handle large files, the Qcow2 format does not work on sectors (512 B), but on clusters, which default to 64 KiB. The cluster size can be chosen between 512 B and 2 MiB. Larger sizes reduce the number of meta data entries, but can waste more space and IO bandwidth, especially when only small parts need to be updated.

For each cluster a reference counter is used to record the number of uses. Snapshots re-use each other's data to make creating a snapshot fast. But on modification, shared clusters must not be modified, else all snapshots using that data would be modified as well. Therefore on modification the content of a cluster needs to be copied to a new location, which from then on is used for this and further modifications. When new clusters are allocated or old clusters are freed, all other meta data structures must be updated as well, which might result in additional cluster allocations. For unshared clusters (reference counter = 1) the most significant bit of the table entry is used to optimize the case that no cluster copying needs to be done.
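As a rough illustration (not Qemu code), the following sketch shows the geometry implied by the two level tables for a given cluster_bits setting; the function name qcow2_geometry is made up for this example.

    # Illustrative sketch: sizes implied by the two level tables.
    def qcow2_geometry(cluster_bits=16):              # 16 -> 64 KiB default clusters
        cluster_size = 1 << cluster_bits              # bytes per cluster
        l2_entries = cluster_size // 8                # 64 bit offsets per L2 table
        rb_entries = cluster_size // 2                # 16 bit counters per refcount block
        return {
            "cluster_size": cluster_size,
            "l2_entries": l2_entries,
            "data_per_l2_table": l2_entries * cluster_size,   # guest data one L2 table maps
            "clusters_per_refcount_block": rb_entries,
        }

    for bits in (9, 16, 21):                          # 512 B, 64 KiB and 2 MiB clusters
        print(bits, qcow2_geometry(bits))

With the default 64 KiB clusters one L2 table maps 512 MiB of guest data and one reference count block tracks 32768 clusters.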
To be crash resilient it must be guaranteed that after a crash of the host the file structure is still valid. It might happen that the latest data has not reached the persistent storage, but especially the meta data structures must not point to un-allocated or invalid data. This forces Qemu to do a lot of calls to fdatasync(), which is used to flush the data to disc and guarantee ordering constraints.

= Basic access

Compression and encryption are ignored to simplify the following analysis. Also ignored is the explicit masking of the two most significant bits, which are used to indicate non-shared and compressed clusters.

For reading, the offset is split into three parts:
- The most significant bits are used as an index into the L1 table to find the file offset at which the corresponding L2 table is stored. The number of bits is whatever remains after the following two parts.
- The middle bits are then used to index into that L2 table to find the file offset of the data cluster. The number of bits is calculated from the cluster size, i.e. from how many 64 bit entries fit into one cluster.
- The least significant bits are finally used to access the specified byte inside that cluster. The number of bits is explicitly given in the global header of the Qcow2 file.

For a normal read access, up to 4 read accesses are needed:
1. opening the file and reading cluster_size and l1_table_offset,
2. reading the relevant part of the L1 table to get the file offset of the L2 table,
3. reading part of the L2 table to get the file offset of the data cluster,
4. and finally one access to read the requested data.

This should make it clear that it is important to cache some data to reduce the number of disc accesses. While the cluster_size is fixed on image creation, the l1_table_offset might get updated when snapshots are created or when the image grows beyond some specific size. The L1 table is best cached in memory, but might get very large for huge files, so implementations should even work when the L1 table doesn't fit into memory. The L2 tables need even more memory, while the data clusters need the most. Keeping them all in memory is normally out of the question, but keeping at least some (recently used) subset should speed up most implementations considerably.

Write access is more complicated, especially if only part of a cluster is written, since this might require reading the old data before filling in the new data:
- Get cluster_size and l1_table_offset from the header.
- Read part of the L1 table to get the file offset of the L2 table.
-- If the MSB indicates the L2 table to be shared: allocate a new cluster, copy the L2 table by reading it from the old position and writing it to the newly allocated cluster, then update the L1 table to point to the new unshared L2 table.
- Read part of the L2 table to get the file offset of the data cluster.
-- If the MSB indicates the data cluster to be shared: allocate a new cluster, copy the data cluster by reading it from the old position and writing it to the newly allocated cluster, then update the L2 table to point to the new unshared data cluster.

The MSB can only be used as a hint: if it is set, the pointed-to cluster is unshared and can be updated without any need for copying. If the bit is unset, the reference count table must be consulted, which requires up to two accesses: one for reading the reference count table to get the file offset of the reference count block, and a second one for reading that block. Only if the indicated 16 bit counter is greater than one is the cluster actually shared and must be un-shared by copying it.
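The following sketch (not Qemu code) illustrates the offset split from the read path and the copy-on-write decision from the write path. read_u64() and read_u16() are hypothetical helpers that read one big-endian table entry at a file position; compression and encryption are ignored, as in the text above.

    import struct

    OFLAG_COPIED = 1 << 63          # MSB: refcount known to be 1, write in place

    def read_u64(f, pos):
        f.seek(pos)
        return struct.unpack(">Q", f.read(8))[0]

    def read_u16(f, pos):
        f.seek(pos)
        return struct.unpack(">H", f.read(2))[0]

    def split_offset(guest_offset, cluster_bits):
        l2_bits = cluster_bits - 3                  # 8 byte entries per cluster
        in_cluster = guest_offset & ((1 << cluster_bits) - 1)
        l2_index = (guest_offset >> cluster_bits) & ((1 << l2_bits) - 1)
        l1_index = guest_offset >> (cluster_bits + l2_bits)
        return l1_index, l2_index, in_cluster

    def needs_copy(f, l2_entry, cluster_bits, refcount_table_offset):
        """True if the data cluster behind l2_entry is shared and must be copied."""
        if l2_entry & OFLAG_COPIED:
            return False                            # hint: unshared, modify in place
        cluster_index = (l2_entry & ~(3 << 62)) >> cluster_bits
        rb_entries = (1 << cluster_bits) // 2       # 16 bit counters per block
        rt_index, rb_index = divmod(cluster_index, rb_entries)
        rb_offset = read_u64(f, refcount_table_offset + 8 * rt_index)
        return read_u16(f, rb_offset + 2 * rb_index) > 1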
Allocating a new cluster might overflow the current reference count table (the first level uses consecutive clusters), in which case a new reference count table must be allocated at the end of the file or in some free place big enough.

= Snapshot

== Create
1. increment the reference counter of all data and L2 clusters referenced by the current L1/L2 tables
2. update the current L2 tables to mark data clusters as shared, i.e. clear the MSB
3. update the current L1 table to mark L2 tables as shared, i.e. clear the MSB
4. create a new snapshot and associate the current L1 table with it
5. clone a new L1 table from the current L1 table

== Goto AKA rollback
1. decrement the reference counter of all data and L2 clusters referenced by the current L1/L2 tables
2. clone a new L1 table from the snapshot L1 table
3. increment the reference counter of all data and L2 clusters referenced by the now current L1/L2 tables

== Delete
1. decrement the reference counter of all data and L2 clusters referenced by the snapshot L1 table
2. free the snapshot L1 table
3. free the snapshot data

= Caching modes

Qemu supports several caching modes, which guarantee different levels of data security even in the case of host crashes.
- off|none: uses O_DIRECT to bypass the host cache.
- writeback: uses the host cache by not using O_DSYNC, but uses fdatasync() for critical writes.
- writethrough: uses O_DSYNC and fdatasync() to force every write to be synchronous. This is the default, because it is the safest mode, but it cripples Qcow2 performance-wise, since many writes require meta data updates as well, which are all force-synced.
- unsafe: uses the host cache by not using O_DSYNC and also disables fdatasync(), making it risky in case of host crashes.

= Ordering constraints

Rule of thumb: always point to stable data.

== On allocation
1. increment the reference counter to reserve the space and prevent double use
2. write the new data
3. update the L2 table
4. update the L1 table
5. update the header

== On free
TODO: check this.
1. update the L1 table
2. update the L2 table
3. update the header
4. decrement the reference counter

= Formulas

cluster_size      = 1 << cluster_bits
data_clusters     = ceil(file_size / cluster_size)
l2_bits           = cluster_bits - 3
l2_len            = 1 << l2_bits = cluster_size / 8
l2_clusters       = ceil(data_clusters / l2_len) = ceil(data_clusters * 8 / cluster_size)
l1_bits           = 62 - l2_bits - cluster_bits = 65 - 2*cluster_bits
l1_len            = l2_clusters
l1_clusters       = ceil(l2_clusters * 8 / cluster_size)
rb_len            = cluster_size / 2
rb_clusters       = ceil(clusters / rb_len)
rt_len            = rb_clusters
rt_clusters       = ceil(rb_clusters * 8 / cluster_size)
header_clusters   = 1
snapshot_clusters = sum_i=0^n(snapshot_data_i + l1_clusters_i + l2_clusters_i)
clusters          = header_clusters + snapshot_clusters + l1_clusters + l2_clusters
                    + rt_clusters + rb_clusters + data_clusters
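As a rough illustration, the formulas above can be evaluated for a concrete image size. The helper qcow2_overhead below is made up for this sketch and ignores snapshots (snapshot_clusters = 0); because rb_clusters depends on the total cluster count, which itself includes rb_clusters and rt_clusters, the sketch simply iterates until the values stabilize.

    from math import ceil

    def qcow2_overhead(file_size, cluster_bits=16):
        cluster_size = 1 << cluster_bits
        data_clusters = ceil(file_size / cluster_size)
        l2_clusters = ceil(data_clusters * 8 / cluster_size)
        l1_clusters = ceil(l2_clusters * 8 / cluster_size)
        header_clusters = 1

        rb_clusters = rt_clusters = 0
        while True:
            clusters = (header_clusters + l1_clusters + l2_clusters
                        + rt_clusters + rb_clusters + data_clusters)
            new_rb = ceil(clusters / (cluster_size / 2))      # refcount blocks needed
            new_rt = ceil(new_rb * 8 / cluster_size)          # refcount table clusters
            if (new_rb, new_rt) == (rb_clusters, rt_clusters):
                break
            rb_clusters, rt_clusters = new_rb, new_rt

        meta_clusters = clusters - data_clusters
        return clusters, meta_clusters * cluster_size         # total clusters, meta data bytes

    total, meta_bytes = qcow2_overhead(10 * 2**30)            # 10 GiB image, 64 KiB clusters
    print(total, meta_bytes)

For a 10 GiB image with the default 64 KiB clusters this yields 29 meta data clusters, i.e. just under 2 MiB of overhead on top of the data clusters.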