UBI - Unsorted Block Images

Reference: http://www.linux-mtd.infradead.org/doc/ubi.html

Big red note
Overview
Source code
Mailing list
User-space tools
UBI headers
UBI volume table
- Implementation details
Minimum flash input/output unit
NAND flash sub-pages
UBI headers position
Flash space overhead
Saving erase counters
How UBI flasher should work
Marking eraseblocks as bad
Scalability issues
- Implementation details
Reserved blocks for bad block handling (only for NAND chips)
Volume auto-resize
UBI operations
1. LEB un-map
  - Implementation details
2. LEB map
3. Volume update
4. Atomic LEB change
  - Implementation details
Fastmap
More documentation

Big red note

People are often confused about what UBI is, which is why this section exists. Please realize that:

UBI is not a Flash Translation Layer (FTL), and it has nothing to do with FTL;
UBI works with bare flashes, and it does not work with consumer flashes like MMC, RS-MMC, eMMC, SD, mini-SD, micro-SD, CompactFlash, MemoryStick, USB flash drive, etc; instead, UBI works with raw flash devices, which are mostly found in embedded devices like mobile phones, etc.

Please do not be confused. See the linux-mtd documentation referenced above for more information about how raw flash devices differ from FTL devices.

Overview

UBI (Latin: "where?") stands for "Unsorted Block Images". It is a volume management system for raw flash devices which manages multiple logical volumes on a single physical flash device and spreads the I/O load (i.e., wear-leveling) across the whole flash chip.

In a sense, UBI may be compared to the Logical Volume Manager (LVM). Whereas LVM maps logical sectors to physical sectors, UBI maps logical eraseblocks to physical eraseblocks. But besides the mapping, UBI implements global wear-leveling and transparent I/O error handling.

An UBI volume is a set of consecutive logical eraseblocks (LEBs). Each logical eraseblock may be mapped to any physical eraseblock (PEB). This mapping is managed by UBI, it is hidden from users and it is the base mechanism to provide global wear-leveling (along with per-physical eraseblock erase counters and the ability to transparently move data from more worn-out physical eraseblocks to less worn-out ones).

UBI volume size is specified when the volume is created and may later be changed (volumes are dynamically re-sizable). There are user-space tools which may be used to manipulate UBI volumes.

There are 2 types of UBI volumes - dynamic volumes and static volumes. Static volumes are read-only and their contents are protected by CRC-32 checksums, while dynamic volumes are read-write and the upper layers (e.g., a file-system) are responsible for ensuring data integrity.

UBI is aware of bad eraseblocks (e.g., NAND flash may have them) and frees the upper layers from any bad block handling. UBI has a pool of reserved physical eraseblocks, and when a physical eraseblock becomes bad, it transparently substitutes it with a good physical eraseblock. UBI moves good data from the newly appeared bad physical eraseblocks to good ones. The result is that users of UBI volumes do not notice I/O errors as UBI takes care of them.

NAND flashes may have bit-flips which occur on read and write operations. Bit-flips are corrected by ECC checksums, but they may accumulate over time and cause data loss. UBI handles this by moving data from physical eraseblocks which have bit-flips to other physical eraseblocks. This process is called scrubbing. Scrubbing is done transparently in background and is hidden from upper layers.

Here is a short list of the main UBI features:

UBI provides volumes which may be dynamically created, removed, or re-sized;
UBI implements wear-leveling across the whole flash device (i.e., you may continuously write/erase only one logical eraseblock of an UBI volume, but UBI will spread this to all physical eraseblocks of the flash chip);
UBI transparently handles bad physical eraseblocks;
UBI minimizes chances to lose data by means of scrubbing.

Here is a comparison of MTD partitions and UBI volumes. They are somewhat similar because:

both consist of eraseblocks - logical eraseblocks in case of UBI volumes, and physical eraseblocks in case of MTD partitions;
both support three basic operations - read, write, and erase.

But UBI volumes have the following advantages over MTD partitions:

UBI volumes have no eraseblock wear-leveling constraints, so users do not have to care about this at all, which means the upper level software may be simpler;
UBI volumes have no bad eraseblocks, which also leads to simpler upper level software;
UBI volumes are dynamic in a sense that they may be created, removed or re-sized dynamically, while MTD partitions are static;
UBI handles bit-flips which again makes the upper level software simpler;
UBI provides a volume update operation which makes it easy to detect interrupted software updates and recover;
UBI provides an atomic logical eraseblock change operation which allows changing the contents of a logical eraseblock without losing the data if an unclean reboot happens during the operation; this might be very useful for the upper-level software (e.g., for a file-system);
UBI has an un-map operation, which just un-maps a logical eraseblock from the physical eraseblock, schedules the physical eraseblock for erasure and returns; this is very quick and frees upper level software from implementing its own mechanisms to defer erasures (e.g., JFFS2 has to implement such mechanisms).

There is an additional driver called gluebi which emulates MTD devices on top of UBI volumes. This looks a little strange, because UBI works on top of an MTD device, then gluebi emulates other MTD devices on top, but this actually works and makes it possible for existing software (e.g., JFFS2) to run on top of UBI volumes. However, new software may benefit from the advanced UBI features and let UBI solve many issues which the flash technology imposes.

Source code

UBI is in the main-line Linux kernel starting from version 2.6.22. But it is recommended to use the latest UBI, because we have fixed many bugs since that time, made many improvements and added new features. The UBI git tree may be found at:

git://git.infradead.org/ubi-2.6.git

Here is the corresponding Git-web view.

The git tree has 2 branches - the master branch and the linux-next branch. The master branch contains the most recent changes, which are often incomplete, buggy, or not well tested. This branch is re-based from time to time. Please do not use it unless you are an UBI developer. The linux-next branch contains stable UBI changes which are going to be merged upstream soon. This branch is included in the linux-next git tree. Please use this branch unless you are an UBI developer.

Mailing list

You are welcome to send feed-back, bug-reports, patches, etc to the MTD mailing list.

User-space tools

UBI user-space tools, as well as other MTD user-space tools, are available from the following git repository:

git://git.infradead.org/mtd-utils.git

Please, clone it and compile using make from the root mtd-utils directory. This section provides information about how to compile the whole mtd-utils repository tree. You should find the UBI tools under the ubi-utils sub-directory.

The repository contains the following UBI tools:

ubinfo - provides information about UBI devices and volumes found in the system;
ubiattach - attaches MTD devices (which describe raw flash) to UBI and creates corresponding UBI devices;
ubidetach - detaches MTD devices from UBI devices (the opposite to what ubiattach does);
ubimkvol - creates UBI volumes on UBI devices;
ubirmvol - removes UBI volumes from UBI devices;
ubiupdatevol - updates UBI volumes; this tool uses the UBI volume update feature which leaves the volume in "corrupted" state if the update was interrupted; additionally, this tool may be used to wipe out UBI volumes;
ubicrc32 - calculates CRC-32 checksum of a file with the same initial seed as UBI would use;
ubinize - generates UBI images;
ubiformat - formats empty flash, erases flash and preserves erase counters, flashes UBI images to MTD devices;
mtdinfo - reports information about MTD devices found in the system.

All UBI tools support "-h" option and print sufficient usage information.

Note, the ubiattach and ubidetach tools won't work if the kernel version is less than 2.6.25, because corresponding UBI features did not exist in the older kernels.

UBI headers

UBI stores 2 small 64-byte headers at the beginning of each non-bad physical eraseblock:

erase counter header (or EC header) which contains the erase counter of the physical eraseblock (PEB) plus some other not so important information;
volume identifier header (or VID header) which stores volume ID and logical eraseblock (LEB) number this PEB belongs to (plus some other not so important information).

This is why logical eraseblocks are smaller than physical eraseblocks - the headers take some flash space.

All UBI headers are protected by a CRC-32 checksum. Please refer to the drivers/mtd/ubi/ubi-media.h file in the Linux kernel for more information about the headers' contents.

When UBI attaches an MTD device, it has to scan it, read all headers, check the CRC-32 checksums, and store erase counters and the logical-to-physical eraseblock mapping information in RAM. Please refer to the "Scalability issues" section for information about scalability issues related to this.

After UBI has erased a PEB, it writes the EC header with increased erase counter value. This means that PEBs always have the EC header, except for the short period of time after the erasure and before the EC header is written. Should an unclean reboot happen during this short period of time, the EC header is lost or becomes corrupted. In this case UBI writes new EC header with an average erase counter just after the MTD device scanning is done.

The VID header is written to the PEB when UBI associates it with an LEB. Let's consider what happens to the headers in case of some UBI operations.

The LEB un-map operation just un-maps the LEB from the PEB and schedules the PEB for erasure. When the PEB is erased, the EC header is written straight away. The VID header is not written.
The LEB map operation or a write operation to an un-mapped LEB makes UBI find an appropriate PEB and write the VID header to it (the EC header must already be there). Note, the write operation to an already mapped LEB just writes the data straight to PEB and does not change the UBI headers.

UBI maintains two per-PEB headers because it needs to write different information on flash at different moments of time:

after a PEB is erased, the EC header is written straight away, which minimizes the probability of losing the erase counter due to unclean reboots;
when UBI associates a PEB with an LEB, the VID header is written to the PEB.

When the EC header is written to a PEB, UBI does not yet know the volume ID and LEB number this PEB will be associated with. This is why UBI needs to do two separate write operations and to have two separate headers.

UBI volume table

Volume table is an on-flash data structure which contains information about each volume on this UBI device. The volume table is an array of volume table records. Each record contains the following information:

volume size;
volume name;
volume type (dynamic or static);
volume alignment;
update marker (set for volumes which had interrupted updates);
auto-resize flag;
CRC-32 checksum for this record.

Each record describes one UBI volume and record index in the volume table array corresponds to the volume ID. I.e, UBI volume 0 is described by record 0 in the volume table, and so on. Count of records in the volume table is limited by the LEB size, but cannot be greater than 128. This means that UBI devices cannot have more than 128 volumes.

Every time an UBI volume is created, removed, re-sized, re-named or updated, the corresponding volume table record is changed. UBI maintains two copies of the volume table for reliability and power-cut tolerance reasons.

Implementation details

Internally, the volume table resides in a special-purpose UBI volume which is called layout volume. This volume consists of 2 LEBs - one for each copy of the volume table. The layout volume is an "internal" UBI volume, and the users do not see it and cannot access it. When reading or writing the layout volume, UBI uses the same mechanisms which are used for normal user volumes.

UBI uses the following algorithm when updating a volume table record.

Prepare in-memory buffer with the new volume table contents.
Un-map LEB0 of the layout volume.
Write the new volume table to LEB0.
Un-map LEB1 of the layout volume.
Write the new volume table to LEB1.
Flush the UBI work queue to make sure the PEBs corresponding to the un-mapped LEBs are erased.

When attaching the MTD device, UBI makes sure that the 2 volume table copies are equivalent. If they are not equivalent, which may be caused by an unclean reboot, UBI picks the one from LEB0 and copies it to LEB1 of the layout volume (because it is newer). If one of the volume table copies is corrupted, UBI restores it from the other volume table copy.
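The attach-time check described above can be sketched roughly as follows. This is an illustration only, not the kernel code: the `vtbl_copy` structure, the `valid` flag (standing in for the per-record CRC-32 checks), the buffer size and all names are assumptions made for the example.

```c
#include <stdbool.h>
#include <string.h>

#define VTBL_BYTES 64 /* hypothetical size of one volume table copy */

/* Hypothetical in-memory image of one LEB of the layout volume. */
struct vtbl_copy {
    bool valid;                     /* false if the CRC-32 checks failed */
    unsigned char data[VTBL_BYTES]; /* raw volume table contents */
};

enum vtbl_action {
    VTBL_OK,            /* both copies valid and equivalent */
    VTBL_RESTORED_LEB1, /* LEB1 was rewritten from LEB0 */
    VTBL_RESTORED_LEB0, /* LEB0 was rewritten from LEB1 */
    VTBL_FATAL          /* both copies are corrupted */
};

/* Make the two volume table copies consistent, preferring LEB0. */
enum vtbl_action vtbl_recover(struct vtbl_copy *leb0, struct vtbl_copy *leb1)
{
    if (!leb0->valid && !leb1->valid)
        return VTBL_FATAL;

    if (!leb0->valid) {
        /* LEB0 is corrupted: restore it from the other copy. */
        *leb0 = *leb1;
        return VTBL_RESTORED_LEB0;
    }

    if (!leb1->valid || memcmp(leb0->data, leb1->data, VTBL_BYTES) != 0) {
        /* LEB0 is written first during updates, so it is the newer copy. */
        *leb1 = *leb0;
        return VTBL_RESTORED_LEB1;
    }

    return VTBL_OK;
}
```

LEB0 wins whenever both copies are valid but differ, because the update algorithm always writes LEB0 before touching LEB1.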

Minimum flash input/output unit

UBI uses an abstract model of flash. In short, from UBI's point of view the flash (or MTD device) consists of eraseblocks, which may be good or bad. Each good eraseblock may be read from, written to, or erased. Good eraseblocks may also be marked as bad.

Flash reads and writes may only be done in portions of minimum input/output unit size, which depends on flash type.

NOR flashes usually have min. I/O unit size of 1 byte, because NOR flashes usually allow reading and writing single bytes (in fact, it is even possible to change individual bits).
Some NOR flashes may have other min. I/O unit sizes, e.g. 16 or 32 bytes in case of ECC'd NOR flashes.
NAND flashes usually have 512, 2048 or 4096 byte min. I/O unit size, which corresponds to NAND page size. NAND flashes store per-NAND page ECC codes in the OOB area, which means that the whole NAND page has to be written at once to calculate the ECC code, and the whole NAND page has to be read at once to check the ECC code.

The min. I/O unit size is a very important characteristic of the MTD device. It affects many things, e.g.:

physical position of the VID header depends on the min. I/O unit size, which means that LEB size also depends on it; generally, the larger the min. I/O unit size, the smaller the LEB size and the greater the UBI flash space overhead;
all writes to LEBs should be aligned to min. I/O unit size, and should be multiple of the min. I/O unit size; this does not apply to reads, but bear in mind that on the MTD level all reads are done in fractions of min. I/O unit size anyway; this is just hidden from users by buffering the read data and copying only the requested amount of bytes to the user buffer.
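The write alignment rule above can be checked with a couple of small helpers. This is a minimal sketch; the function names are made up for illustration and are not part of the UBI API.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if a write of @len bytes at @offset within an LEB respects
 * the rule from the text: both values are multiples of the min. I/O unit.
 */
bool leb_write_is_aligned(size_t offset, size_t len, size_t min_io_size)
{
    if (min_io_size == 0)
        return false;
    return offset % min_io_size == 0 && len % min_io_size == 0;
}

/* Round @len up to the next multiple of the min. I/O unit size. */
size_t round_up_to_min_io(size_t len, size_t min_io_size)
{
    return (len + min_io_size - 1) / min_io_size * min_io_size;
}
```

For a NAND flash with 2048-byte pages, a 3000-byte payload would have to be padded to 4096 bytes before it can be written.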

NAND flash sub-pages

As stated in the previous section, all UBI I/O should be done in units of the min. I/O unit size, which is equivalent to NAND page size in case of NAND flash. However, some SLC NAND flashes allow for smaller I/O units, which are called sub-pages in MTD terminology. Not all NANDs have sub-pages.

MLC NANDs do not have sub-pages, at least to the date of writing of this piece of documentation (April 2009).
SLC NANDs usually do have sub-pages. E.g., 512-byte NAND pages usually consist of 2x256-byte sub-pages, and 2048-byte NAND pages consist of 4x512-byte sub-pages.
SLC OneNAND chips with 2048 bytes NAND page size have 4x512-byte sub-pages.

If the NAND flash supports sub-pages, then ECC codes can be calculated on a per-sub-page basis instead of a per-NAND page basis. In this case it becomes possible to read and write sub-pages independently.

But obviously, even though the NAND chip may support sub-pages, the NAND controller may disallow them. Indeed, if the flash is managed by a controller which calculates ECC codes on a per-NAND page basis, then it is impossible to do I/O in sub-page units. E.g., this is the case for the OLPC XO-1 laptop - its NAND chip supports sub-pages, but the NAND controller does not.

Note, sub-page is an MTD term, but this is also referred to as "NOP" which stands for "number of partial programs". NOP1 NAND flashes have no sub-pages - UBI treats them as NANDs with sub-page size equivalent to NAND page size. NOP2 NAND flashes have 2 sub-pages (half a NAND page each), NOP4 flashes have 4 sub-pages (quarter of a NAND page each).

UBI utilizes sub-pages to lessen flash space overhead. The overhead is less if NAND flash supports sub-pages (see the "Flash space overhead" section). Indeed, let's consider a NAND flash with 128KiB eraseblocks and 2048-byte pages. If it does not have sub-pages, UBI puts the VID header at physical offset 2048, so LEB size becomes 124KiB (128KiB minus one NAND page which stores the EC header and minus another NAND page which stores the VID header). In contrast, if the NAND flash does have sub-pages, UBI puts the VID header at physical offset 512 (the second sub-page), so LEB size becomes 126KiB (128KiB minus one NAND page which is used for storing both UBI headers). See the "UBI headers position" section for more information about where the UBI headers are stored.

Sub-pages are used by UBI only internally, and only for storing the headers. The UBI API does not allow users to do I/O in sub-page units. One of the reasons for this is that sub-page writes may be slow. To write a sub-page, the driver may actually write the whole NAND page, but put 0xFF bytes to the sub-pages which are not relevant to this operation. E.g., this means that writing 4 sub-pages may be 4 times slower than writing the whole NAND page at once. Thus, UBI does use sub-pages for the headers, but this notion does not exist in the UBI API.

UBI headers position

The EC header always resides at offset 0 and takes 64 bytes, the VID header resides at the next available min. I/O unit or sub-page, and also takes 64 bytes. For example:

in case of NOR flash which has 1 byte min. I/O unit, the VID header resides at offset 64;
in case of NAND flash which does not have sub-pages, the VID header resides at the second NAND page;
in case of NAND flash which has sub-pages, the VID header resides at the second sub-page.
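The placement rules above can be expressed as a short calculation. The following is a sketch under stated assumptions: the structure and function names are hypothetical, and `hdr_io_size` stands for the unit used for the headers (the sub-page size if sub-pages are usable, otherwise the min. I/O unit size).

```c
#include <stddef.h>

#define UBI_HDR_SIZE 64 /* both headers take 64 bytes, as stated above */

/* Hypothetical result type for this example. */
struct ubi_layout {
    size_t vid_hdr_offset; /* where the VID header starts in the PEB */
    size_t data_offset;    /* where LEB data starts in the PEB */
    size_t leb_size;       /* PEB size minus the header area */
};

static size_t align_up(size_t x, size_t a)
{
    return (x + a - 1) / a * a;
}

struct ubi_layout ubi_compute_layout(size_t peb_size, size_t min_io_size,
                                     size_t hdr_io_size)
{
    struct ubi_layout l;

    /* The EC header sits at offset 0; the VID header goes to the next unit. */
    l.vid_hdr_offset = align_up(UBI_HDR_SIZE, hdr_io_size);
    /* Data starts at the first min. I/O unit after the VID header. */
    l.data_offset = align_up(l.vid_hdr_offset + UBI_HDR_SIZE, min_io_size);
    l.leb_size = peb_size - l.data_offset;
    return l;
}
```

With 128KiB eraseblocks and 2048-byte pages this gives a 124KiB LEB without sub-pages (`hdr_io_size` of 2048) and a 126KiB LEB with 512-byte sub-pages, matching the figures in the "NAND flash sub-pages" section.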

Flash space overhead

UBI uses some amount of flash space for its own purposes, thus reducing the amount of flash space available for UBI users. Namely:

2 PEBs are used to store the volume table;
1 PEB is reserved for wear-leveling purposes;
1 PEB is reserved for the atomic LEB change operation;
some amount of PEBs is reserved for bad PEB handling; this is applicable for NAND flash, but not for NOR flash; the amount of reserved PEBs is configurable and is equal to 20 blocks per 1024 blocks by default;
UBI stores the EC and VID headers at the beginning of each PEB; the amount of bytes used for these purposes depends on the flash type and is explained below.

Let's introduce symbols:

W - total number of physical eraseblocks on the flash chip (NB: the entire chip, not the MTD partition);
P - total number of physical eraseblocks on the MTD partition;
S_P - physical eraseblock size;
S_L - logical eraseblock size;
B_B - number of bad blocks on the MTD partition;
B_R - number of PEBs reserved for bad PEB handling; it is 20 * W/1024 for NAND by default, and 0 for NOR and other flash types which do not have bad PEBs;
B - MAX(B_R,B_B);
O - the overhead related to storing EC and VID headers in bytes, i.e. O = S_P - S_L.

The UBI overhead is (B + 4) * S_P + O * (P - B - 4) bytes, i.e., this amount of flash space will not be accessible to users. O is different for different flashes:

in case of NOR flash which has 1 byte minimum input/output unit, O is 128 bytes;
in case of NAND flash which does not have sub-pages (e.g., MLC NAND), O is 2 NAND pages, i.e. 4KiB in case of 2KiB NAND page and 1KiB in case of 512 bytes NAND page;
in case of NAND flash which has sub-pages, UBI optimizes its on-flash layout and puts the EC and VID headers at the same NAND page, but different sub-pages; in this case O is only one NAND page;
for other flashes the overhead should be 2 min. I/O units if the min. I/O unit size is greater than or equal to 64 bytes, and 2 times 64 bytes aligned to the min. I/O unit size if the min. I/O unit size is less than 64 bytes.

N.B.: the formula above counts bad blocks as a UBI overhead. The real UBI overhead is: (B - B_B + 4) * S_P + O * (P - B - 4).
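As a worked example, both formulas can be evaluated in a few lines of C. This is only a sketch: the structure and function names are invented for illustration, the field names mirror the symbols defined above, and the default of 20 reserved PEBs per 1024 is assumed.

```c
#include <stdint.h>

/* Hypothetical container for the symbols introduced above. */
struct ubi_geometry {
    uint64_t w;   /* W: PEBs on the whole flash chip */
    uint64_t p;   /* P: PEBs on the MTD partition */
    uint64_t s_p; /* S_P: physical eraseblock size in bytes */
    uint64_t s_l; /* S_L: logical eraseblock size in bytes */
    uint64_t b_b; /* B_B: bad blocks on the MTD partition */
    int is_nand;  /* non-zero for NAND, zero for NOR and similar */
};

/* B = MAX(B_R, B_B), with B_R = 20 * W / 1024 on NAND and 0 otherwise. */
static uint64_t reserved_or_bad(const struct ubi_geometry *g)
{
    uint64_t b_r = g->is_nand ? 20 * g->w / 1024 : 0;

    return b_r > g->b_b ? b_r : g->b_b;
}

/* (B + 4) * S_P + O * (P - B - 4): counts bad blocks as overhead. */
uint64_t ubi_overhead(const struct ubi_geometry *g)
{
    uint64_t b = reserved_or_bad(g);
    uint64_t o = g->s_p - g->s_l;

    return (b + 4) * g->s_p + o * (g->p - b - 4);
}

/* (B - B_B + 4) * S_P + O * (P - B - 4): excludes existing bad blocks. */
uint64_t ubi_real_overhead(const struct ubi_geometry *g)
{
    uint64_t b = reserved_or_bad(g);
    uint64_t o = g->s_p - g->s_l;

    return (b - g->b_b + 4) * g->s_p + o * (g->p - b - 4);
}
```

For a 256MiB NAND without sub-pages (W = P = 2048, 128KiB PEBs, 124KiB LEBs, no bad blocks), B is 40 and O is 4096, so the overhead is 44 * 131072 + 4096 * 2004 = 13975552 bytes.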

Saving erase counters

When working with UBI, it is important to realize that UBI stores erase counters on the flash media. Namely, each physical eraseblock has a so-called erase counter header which stores the number of times this physical eraseblock has been erased (see the "UBI headers" section). And of course, it is important not to lose the erase counters, which means that the tools you use to erase the flash and to write UBI images have to be UBI-aware. The mtd-utils repository contains the ubiformat utility which does this correctly.

How UBI flasher should work

The following is a list of what the UBI flasher program has to do when erasing the flash or when flashing UBI images.

First of all, scan the flash and collect the erase counters. Namely, read the EC header from each PEB, check the CRC-32 checksum of the header, and save the erase counter in RAM. It is not necessary to read VID headers. Bad PEBs should be skipped.
Calculate the average erase counter. It should be used for PEBs with corrupted or missing EC headers. Such PEBs may be there because of unclean reboots, but there shouldn't be too many of them.
If the intention is to just erase the flash, then each PEB has to be erased and a proper EC header has to be written at the beginning of the PEB. The EC header should contain the incremented erase counter. Bad PEBs should be just skipped. For NAND flashes, in case of I/O errors while erasing or writing, the PEB should be marked as bad (see the "Marking eraseblocks as bad" section for more information about how UBI marks PEBs as bad).
If the intention is to flash an UBI image, then the flasher should do the following for each non-bad PEB.
- Read the contents of this PEB from the UBI image (PEB size bytes) into a buffer.
- Strip min. I/O units full of 0xFF bytes from the end of the buffer (the details are given below in this section).
- Erase the PEB.
- Change the EC header in the buffer - put the new erase counter value there and re-calculate the CRC-32 checksum.
- Write the buffer to the physical eraseblock.
As usual, bad PEBs should be just skipped. And for NAND flashes, in case of I/O errors while erasing or writing, the PEB should be marked as bad.
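The erase counter bookkeeping from the first steps of this list could look roughly like the sketch below. The `peb_scan` structure and the function names are hypothetical. Using the plain average (without incrementing it) for PEBs with missing EC headers is an assumption of this example; the text only says that the average should be used for them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-PEB scan result collected by the flasher. */
struct peb_scan {
    bool bad;      /* marked bad: skipped entirely */
    bool ec_valid; /* EC header present and its CRC-32 is correct */
    uint64_t ec;   /* erase counter read from the EC header */
};

/* Average erase counter over good PEBs which have a valid EC header. */
uint64_t average_ec(const struct peb_scan *pebs, size_t count)
{
    uint64_t sum = 0, valid = 0;

    for (size_t i = 0; i < count; i++) {
        if (pebs[i].bad || !pebs[i].ec_valid)
            continue;
        sum += pebs[i].ec;
        valid++;
    }
    return valid ? sum / valid : 0;
}

/* Erase counter to put into the new EC header after erasing the PEB. */
uint64_t next_ec(const struct peb_scan *peb, uint64_t avg)
{
    /* A known counter is incremented; an unknown one falls back to avg. */
    return peb->ec_valid ? peb->ec + 1 : avg;
}
```

A flasher would call `average_ec()` once after scanning, then `next_ec()` for every non-bad PEB right before writing the new EC header.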

In practice the input UBI image is usually shorter than the flash, so the flasher has to flash the used PEBs properly, and erase the unused PEBs properly.

Note, when writing an UBI image, it does not matter where eraseblocks from the input UBI image will be written. For example, the first input eraseblock may be written to the first PEB, or to the second one, or to the last one.

Also note, if you implement a flasher which writes UBI images at the production line, i.e., only once, then the flasher does not have to change EC headers of the input UBI image, because this is new flash and each PEB has zero erase counter anyway. This means the production line flasher may be simpler.

If your UBI image contains a UBIFS file system, and your flash is NAND, you may have to drop 0xFF bytes at the end of input PEB data. This is very important, although not required for all NAND flashes. Sometimes a failure to do this may result in very unpleasant problems which might be difficult to debug later. So we recommend always doing this.

The reason for this is that UBIFS treats NAND pages which contain only 0xFF bytes (let's refer to them as empty NAND pages) as free. For example, suppose the first NAND page of a PEB has some data, the second one is empty, the third one also has some data, the fourth one and the rest of NAND pages are empty as well. In this case UBIFS will treat all NAND pages starting from the fourth one as free, and will write data there. However, if the flasher program has already written 0xFF's to these pages, they will be written to twice! And many NAND flashes require NAND pages to be written only once, even if the data contains only 0xFF bytes.

To put it differently, writing 0xFF bytes may have side-effects. What the flasher has to do is to drop all empty NAND pages from the end of the PEB buffer before writing it. It is not necessary to drop all empty NAND pages, just the last ones. This means that the flasher does not have to scan the whole buffer for 0xFF's. It is enough to scan the buffer from the end and stop on the first non-0xFF byte. This is much faster. Here is the code from UBI which does the right thing.

/**
 * ubi_calc_data_len - calculate how much real data is stored in a buffer.
 * @ubi: UBI device description object
 * @buf: a buffer with the contents of the physical eraseblock
 * @length: the buffer length
 *
 * This function calculates how much "real data" is stored in @buf and returns
 * the length. Continuous 0xFF bytes at the end of the buffer are not
 * considered as "real data".
 */
int ubi_calc_data_len(const struct ubi_device *ubi, const void *buf,
                      int length)
{
        int i;

        for (i = length - 1; i >= 0; i--)
                if (((const uint8_t *)buf)[i] != 0xFF)
                        break;

        /* The resulting length must be aligned to the minimum flash I/O size */
        length = ALIGN(i + 1, ubi->min_io_size);
        return length;
}

This function is called before writing the buf buffer to the PEB. The purpose of this function is to drop 0xFF's from the end and prevent the situation described above. The ubi->min_io_size is the minimal input/output unit size which is equivalent to NAND page size.
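For a user-space flasher, the same idea can be written without any kernel types. The following is a minimal sketch, not code from mtd-utils: the function name and signature are chosen for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many bytes of @buf have to be written to the PEB: trailing
 * 0xFF bytes are dropped and the result is rounded up to a multiple of
 * @min_io_size (the NAND page size).
 */
size_t calc_data_len(const uint8_t *buf, size_t length, size_t min_io_size)
{
    size_t used = length;

    /* Scan from the end and stop at the first non-0xFF byte. */
    while (used > 0 && buf[used - 1] == 0xFF)
        used--;

    /* Partially used pages still have to be written as a whole. */
    return (used + min_io_size - 1) / min_io_size * min_io_size;
}
```

Applied to the example from the text (data in the first and third NAND pages, everything else empty), the function returns a length of three pages: the empty second page is kept because it is not at the end, while the fourth and all later pages are dropped.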

By the way, we experienced similar problems with JFFS2. The JFFS2 images generated by the mkfs.jffs2 program were padded to the physical eraseblock size and were later flashed to our NAND. The flasher did not bother skipping empty NAND pages. When JFFS2 was mounted, it wrote to those NAND pages, and the writes did not fail. But later we observed weird ECC errors. It took a while to find the cause. In other words, this is also relevant to JFFS2 images.

An alternative to this approach is to enable the "free space fixup" option when generating the UBIFS file system using mkfs.ubifs. This will allow your flasher to not have to worry about 0xFF bytes at the end of PEBs, which is particularly useful if you need to use an industrial flash programmer to write a UBI image. More information is available here.

Marking eraseblocks as bad

This section is relevant for NAND flashes and other flashes which admit of bad eraseblocks. UBI marks physical eraseblocks as bad on 2 occasions:

eraseblock write operation failed, in which case UBI moves the data from this PEB to some other PEB (data recovery) and schedules this PEB for torturing;
erase operation failed with EIO error, in which case the eraseblock is marked as bad straight away.

The torturing is done in background with the purpose of detecting whether the physical eraseblock is really bad. The write failure might have happened because of many reasons, including bugs in the driver or in the upper level stuff like the file system (e.g., the FS mistakenly writes many times to the same NAND page). During the torturing UBI does the following:

erase the eraseblock;
read it back and make sure it contains only 0xFF bytes;
write test pattern bytes;
read the eraseblock back and check the pattern;
and so on for several patterns (0xA5, 0x5A, 0x00).

The eraseblock is not marked as bad if it survives the torture test. Note, a bit-flip during the torture test is treated as a good reason to mark the eraseblock bad as well. Please refer to the torture_peb() function for detailed information.
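The torture sequence listed above can be sketched against a tiny simulated eraseblock. This is an illustration only and not the kernel's torture_peb(): the `peb_ops` callbacks, the `sim_peb` simulator and its "stuck bit" fault model are all assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flash access callbacks; each returns 0 on success. */
struct peb_ops {
    int (*erase)(void *ctx);
    int (*write)(void *ctx, const uint8_t *buf, size_t len);
    int (*read)(void *ctx, uint8_t *buf, size_t len);
};

static bool all_bytes_are(const uint8_t *buf, size_t len, uint8_t value)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != value)
            return false;
    return true;
}

/* Return true if the eraseblock survives, false if it should be marked bad. */
bool torture_peb_sketch(const struct peb_ops *ops, void *ctx,
                        uint8_t *scratch, size_t len)
{
    static const uint8_t patterns[] = { 0xA5, 0x5A, 0x00 };

    for (size_t p = 0; p < sizeof(patterns); p++) {
        if (ops->erase(ctx) != 0)
            return false;
        /* After an erase every byte must read back as 0xFF. */
        if (ops->read(ctx, scratch, len) != 0 ||
            !all_bytes_are(scratch, len, 0xFF))
            return false;
        memset(scratch, patterns[p], len);
        if (ops->write(ctx, scratch, len) != 0)
            return false;
        /* Clobber the buffer so that the check really tests the read. */
        memset(scratch, (uint8_t)~patterns[p], len);
        if (ops->read(ctx, scratch, len) != 0 ||
            !all_bytes_are(scratch, len, patterns[p]))
            return false;
    }
    return true;
}

/* Simulated 64-byte eraseblock; callers must use len <= 64. */
struct sim_peb {
    uint8_t data[64];
    int stuck_index; /* byte whose bit 0 always reads as 1, or -1 */
};

static int sim_erase(void *ctx)
{
    struct sim_peb *s = ctx;

    memset(s->data, 0xFF, sizeof(s->data));
    return 0;
}

static int sim_write(void *ctx, const uint8_t *buf, size_t len)
{
    struct sim_peb *s = ctx;

    for (size_t i = 0; i < len; i++)
        s->data[i] &= buf[i]; /* programming can only clear bits */
    return 0;
}

static int sim_read(void *ctx, uint8_t *buf, size_t len)
{
    struct sim_peb *s = ctx;

    memcpy(buf, s->data, len);
    if (s->stuck_index >= 0)
        buf[s->stuck_index] |= 0x01;
    return 0;
}

const struct peb_ops sim_ops = { sim_erase, sim_write, sim_read };
```

In this simulation a stuck bit goes unnoticed with the 0xA5 pattern (bit 0 is already set) and is only caught by 0x5A, which hints at why several complementary patterns are used.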

Scalability issues

Unfortunately, UBI scales linearly in terms of flash size. UBI initialization time linearly depends on the number of physical eraseblocks on the flash. This means that the larger is the flash, the more time it takes for UBI to initialize (i.e., to attach the MTD device). Note: Starting with Linux v3.7 UBI offers an optional and experimental feature, called "fastmap", which allows attaching in nearly constant time, see Fastmap. The initialization time depends on the flash I/O speed and (slightly) on the CPU speed, because:

UBI scans the MTD device when attaching - it reads the EC and VID headers from every single PEB; the headers are small (64 bytes each), so this means reading 128 bytes from each PEB in case of NOR flash or one or two NAND pages in case of NAND flash (this depends on whether the NAND flash supports sub-pages or not); this is anyway much less than JFFS2 needs to read when it mounts MTD devices, so UBI attaches MTD devices many times faster than JFFS2 would mount a file system on the same MTD device;
UBI calculates the CRC-32 checksum of each EC and VID header, which consumes CPU, although this is usually minor compared to the flash I/O overhead.

Here are some figures:

a 256MiB OneNAND flash found in Nokia N800 devices attaches in less than 1 second; the flash does support sub-pages so UBI has to read only the first 2KiB NAND page of each PEB while scanning;
a 1GiB NAND flash found in OLPC XO-1 devices attaches in about 2 seconds; the flash is an SLC NAND and supports sub-pages, but the Cafe controller which is used in the laptop does not allow sub-page writes, so UBI has to read two 2KiB NAND pages from each PEB.

Unfortunately we do not have more data and the reader is welcome to send it to us via the MTD mailing list.

Implementation details

In general, UBI needs three tables to operate:

volume table which contains per-volume information, like volume size, type, etc;
eraseblock association (EBA) table which contains the logical-to-physical eraseblock mapping information; for example, when reading an LEB, UBI first looks up the table to find the corresponding PEB number, then reads from this PEB;
erase counters (EC) table which contains the erase counter value for each physical eraseblock; UBI wear-leveling sub-system uses this table when it needs to find, for example, a highly worn-out PEB.

The volume table is maintained on flash. It changes only when UBI volumes are created, deleted and re-sized, which are rare and not time-critical operations, and UBI can afford a slow and simple method of the volume table management.

The EBA and EC tables are changed every time an LEB is mapped to a PEB or a PEB is erased, which happens quite often and means that the table management methods should be fast and efficient.

UBI could maintain the EBA and EC tables on the flash media, but this would inevitably involve journaling, journal replay, journal commit, etc. In other words, this would introduce a lot of complexity. But UBI would be logarithmically scalable in this case.

One of the UBI requirements was simplicity of the on-flash format, because UBI authors had to read UBI volumes from the boot-loader and they had very tough constraints on the boot-loader code size. It was basically impossible to add complex journal scanning and replay code to the boot-loader.

So UBI does not maintain the EBA and EC tables on the flash media. Instead, it builds them in RAM each time it attaches the MTD device. This means that UBI has to scan whole flash and read the EC and VID headers from each PEB in order to build in-RAM EC and EBA tables.
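Conceptually, the scan boils down to one pass over all PEBs which fills both tables. The sketch below shows the idea under simplifying assumptions: the `peb_headers` structure and all names are hypothetical, the headers are assumed to be already read and CRC-checked, and conflicts between several PEBs claiming the same LEB (which a real implementation has to resolve) are ignored, so the last PEB scanned simply wins.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LEB_UNMAPPED (-1)

/* Hypothetical decoded view of one PEB's headers after the CRC checks. */
struct peb_headers {
    bool bad;        /* bad PEB: no headers are read */
    bool has_ec;     /* valid EC header found */
    uint64_t ec;     /* erase counter from the EC header */
    bool has_vid;    /* valid VID header found (PEB is mapped) */
    uint32_t vol_id; /* volume ID from the VID header */
    uint32_t lnum;   /* LEB number from the VID header */
};

/*
 * Build the in-RAM tables from the scan results.
 * eba[vol_id * lebs_per_vol + lnum] receives the PEB number or LEB_UNMAPPED.
 * ec[pnum] receives the erase counter (0 if unknown in this sketch).
 */
void build_tables(const struct peb_headers *pebs, size_t peb_count,
                  int *eba, size_t vol_count, size_t lebs_per_vol,
                  uint64_t *ec)
{
    for (size_t i = 0; i < vol_count * lebs_per_vol; i++)
        eba[i] = LEB_UNMAPPED;

    for (size_t pnum = 0; pnum < peb_count; pnum++) {
        const struct peb_headers *h = &pebs[pnum];

        ec[pnum] = (!h->bad && h->has_ec) ? h->ec : 0;

        if (h->bad || !h->has_vid)
            continue; /* bad or free PEB: nothing to map */
        if (h->vol_id >= vol_count || h->lnum >= lebs_per_vol)
            continue; /* out of range for this simplified table */

        eba[h->vol_id * lebs_per_vol + h->lnum] = (int)pnum;
    }
}
```

The loop visits every PEB exactly once, which is where the linear dependency of the attach time on the flash size comes from.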

The drawbacks of this design are poor scalability and relatively high overhead on NAND flashes (e.g., the overhead is 1.5%-3% of flash space in case of a NAND flash with 2KiB NAND page and 128KiB eraseblock). The advantages are simple binary format and robustness, as the result of simplicity.

Nonetheless, it is always possible to create UBI2 which would maintain the tables in separate flash areas. UBI2 would not be compatible with UBI because of completely different on-flash formats, but the user interfaces would stay the same, which would guarantee compatibility of all the software built on top of UBI.

Reserved blocks for bad block handling (only for NAND chips)

It is well-known that NAND chips have some amount of physical eraseblocks marked as bad by the manufacturer. During the lifetime of the NAND device, other bad blocks may appear. However, manufacturers usually guarantee that the first few physical eraseblocks are not bad and that the total number of bad PEBs will not exceed a certain number. For example, a 256MiB (2048 128KiB PEBs) Samsung OneNAND chip is guaranteed to have no more than 40 bad 128KiB PEBs during its endurance lifetime. This ratio is very common for NAND devices: 20 bad PEBs per 1024 PEBs, which is about 2% of the flash size.

This ratio of 20/1024 is the default number of blocks that UBI reserves for a UBI device. It means that if there are 2 UBI devices on a 4096 PEB NAND, 80 PEBs will be reserved for each UBI device. This may appear to be a waste of space, but since bad blocks can appear anywhere on the NAND flash and are not evenly distributed over the whole device, it is the safer way. So instead of using several UBI devices on a NAND flash, it is more space-efficient to use only one UBI device and several UBI volumes.

The default value of 20 PEBs reserved per 1024 PEBs is a kernel config option. For each UBI device, this value can be adjusted via a kernel parameter or a ubiattach parameter (since kernel 3.7).
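As a worked example of the arithmetic above, the reservation for a chip can be computed from its PEB count and the per-1024 limit. The helper name `reserved_bad_pebs()` is hypothetical, and rounding up is an assumption of this sketch; the kernel's exact rounding may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of PEBs to reserve for bad block handling, given the total
 * number of PEBs on the chip and the "bad PEBs per 1024" limit.
 * The result is rounded up (an assumption made for this sketch).
 */
uint32_t reserved_bad_pebs(uint32_t total_pebs, uint32_t per1024)
{
    assert(per1024 <= 1024);
    return (uint32_t)(((uint64_t)total_pebs * per1024 + 1023) / 1024);
}
```

With the default limit of 20, a 4096 PEB chip gives 4096 * 20 / 1024 = 80 reserved PEBs, and a 2048 PEB chip gives 40, which matches the figures quoted above.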

Volume auto-resize

When you create a UBI image which will be flashed to end-user devices on the production line, you have to define the exact sizes of all volumes (the sizes are stored in the UBI volume table). But usually, in the embedded world, we like to have one (read-only) volume for the root file system and one read-write volume for the rest (logs, user data, etc.). While the size of the root file system is fixed, the size of the second volume can vary from one product to another (different flash sizes), and we just want it to take all the space that is left.

That is what auto-resize is about. If a volume has the auto-resize mark, its size is enlarged when UBI is run for the first time. After the volume size is adjusted, UBI removes the auto-resize mark and the volume is not re-sized anymore. The auto-resize flag is stored in the volume table, and only one volume may be marked as auto-resize.

UBI operations

LEB un-map

The LEB un-map operation is implemented by the ubi_leb_unmap() UBI kernel API function. Starting from kernel version 2.6.29, the un-map operation is also available to user-space programs via the UBI_IOCEBUNMAP ioctl command. The ioctl should be called for UBI volume character devices.

The LEB un-map operation:

first un-maps the LEB from the corresponding PEB;
then schedules the PEB for erasure and returns; it does not wait for the erasure to finish, because the PEB is erased in the context of the UBI background thread.

UBI returns all 0xFF bytes when an un-mapped LEB is read, so the un-map operation may be considered as a very fast erase operation. But there is one aspect UBI programmers have to be well aware of.

Suppose you un-map LEB L which is mapped to PEB P. Since P is not synchronously erased, but just scheduled for erasure, there might be "surprises" in case of unclean reboots: if the reboot happens before P has been physically erased, L will be mapped to P again when UBI attaches the MTD device after the unclean reboot. Indeed, UBI will scan the MTD device, find P which refers to L, and add this mapping information to the EBA table.

But once you write any data to L, or map it using the LEB map operation, it gets mapped to a new PEB and the old contents are gone forever, because even in case of an unclean reboot UBI would pick the newer mapping for L.

Implementation details

This section describes how UBI distinguishes between older and newer versions of an LEB in case of an unclean reboot. Suppose we un-map LEB L which is mapped to PEB P₁, which means UBI schedules P₁ for erasure. Then we write some data to L, which means that UBI finds another PEB P₂, maps L to P₂, and writes the data to P₂. If an unclean reboot happens before P₁ is physically erased, but after the write operation, we end up with 2 PEBs (P₁ and P₂) mapped to the same LEB L.

To handle situations like this, UBI maintains a global 64-bit sequence number variable. The sequence number variable is increased each time a PEB is mapped to an LEB, and its value is stored in the VID header of the PEB. So each VID header has a unique sequence number, and the larger the sequence number, the "younger" the VID header. When UBI attaches an MTD device, it initializes the global sequence number variable to the highest value found in the existing VID headers plus one.

In the above situation, UBI just selects the PEB with the higher sequence number (P₂) and drops the PEB with the lower sequence number (P₁).
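The selection rule can be illustrated with a minimal C sketch. The `vid_info` structure and both function names are hypothetical; only the rule itself (the higher sequence number wins, and the counter restarts at the highest value seen plus one) comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the information stored in a VID header. */
struct vid_info {
    int32_t pnum;     /* PEB that carries this VID header */
    int32_t lnum;     /* LEB the PEB claims to hold */
    uint64_t sqnum;   /* global sequence number at mapping time */
};

/* Of two PEBs that refer to the same LEB, return the newer one. */
const struct vid_info *pick_newer(const struct vid_info *a,
                                  const struct vid_info *b)
{
    assert(a->lnum == b->lnum);
    return (b->sqnum > a->sqnum) ? b : a;
}

/* Value for the global sequence number after attach: highest seen + 1. */
uint64_t next_sqnum(const struct vid_info *hdrs, size_t count)
{
    uint64_t max = 0;

    for (size_t i = 0; i < count; i++)
        if (hdrs[i].sqnum > max)
            max = hdrs[i].sqnum;
    return max + 1;
}
```

Because sequence numbers are unique, the comparison never has to break a tie between two valid VID headers.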

Note that the situation is more difficult if an unclean reboot happens when UBI moves the contents of one PEB to another for wear-leveling purposes, or when it happens during the atomic LEB change operation. In this case it is not enough to just pick the newer PEB; it is also necessary to make sure the data reached the new PEB.

LEB map

The LEB map operation maps a previously un-mapped logical eraseblock to a physical eraseblock. For example, if the operation is run for LEB A, UBI will find an appropriate PEB, write a VID header to the PEB, and amend the in-memory EBA table. The VID header will refer to LEB A. After this operation all I/O to LEB A will actually go to the mapped PEB.

The LEB map operation is available via the ubi_leb_map() UBI kernel API function, or via the UBI_IOCEBMAP volume character device ioctl command. However, the ioctl interface is available only starting from kernel version 2.6.29.

One of the possible use-cases of the LEB map operation is making sure the old LEB contents go away forever. As explained in the LEB un-map section, when an LEB is un-mapped, the corresponding PEB is not erased straight away. And if an unclean reboot happens, the LEB may become mapped to the same PEB again after UBI attaches the MTD device. So, if you map the LEB just after un-mapping it, you are guaranteed that the old LEB contents never come back. In other words, the LEB is guaranteed to contain only 0xFF bytes after the map operation returns, even in case of an unclean reboot.

Please use the LEB map operation carefully. Do not use it unless it is really needed, because mapped LEBs add more overhead to the UBI wear-leveling sub-system compared to un-mapped LEBs. Indeed, if an LEB is un-mapped, there is no PEB which contains the LEB's data, and the wear-leveling sub-system does not have to move any data to maintain wear-leveling. Conversely, if the LEB is mapped to a PEB, there is one more PEB for the wear-leveling sub-system to care about, and one more LEB to re-map to another PEB if the erase counter of the current PEB becomes too low compared to other PEBs (then the LEB is re-mapped to a PEB with a higher erase counter and the old PEB is used for other operations).

Volume update

The volume update operation is useful for device software updates. The operation replaces the contents of the whole UBI volume with new contents. But if it gets interrupted in the middle of the update, the volume goes into the "corrupted" state and further I/O on the volume ends up with an EBADF error. The only way to get the volume back to the normal state is to start a new volume update operation and finish it.

The volume update operation makes it possible to detect interrupted updates and to restart them, for example with the help of a "mirror" volume which has the same contents, or by showing a dialog window which informs the user about the problem and requests re-flashing. In contrast, it is difficult to detect interrupted updates in case of raw MTD partitions.

The volume update operation is available via the user-space UBI interface and is not available via the UBI kernel API. To update a volume, you first have to call the UBI_IOCVOLUP ioctl of the corresponding UBI volume character device and pass it a pointer to a 64-bit value containing the length of the new volume contents in bytes. Then this number of bytes has to be written to the volume character device. Once the last byte has been sent to the character device, the update operation is finished. Schematically, the sequence is:

fd = open("/dev/my_volume", O_RDWR);
ioctl(fd, UBI_IOCVOLUP, &image_size);   /* image_size: 64-bit length in bytes */
write(fd, buf, image_size);             /* may be split across several calls */
close(fd);

See include/mtd/ubi-user.h for more details. Bear in mind that the old contents of the volume are not preserved in case of an interrupted update. Also, you do not have to write all the new data in one go. It is OK to call the write() function an arbitrary number of times and pass an arbitrary amount of data each time. The operation will be finished after all the data have been written. If the last write operation contains more bytes than UBI expects, the extra data are just ignored.

A special case of the volume update operation is what we call volume truncation, which is done by the same ioctl command if the data length is zero. In this case the volume is just wiped out and will contain all 0xFF bytes (all LEBs will be un-mapped).

Note, the /sys/class/ubi/ubiX_X/corrupted sysfs file reflects the "corrupted" state of the volume: it contains ASCII "0\n" if the volume is OK and "1\n" if it is corrupted (because volume update had started but was not finished).

The volume update operation does not preserve the old volume contents if it is interrupted, so it is not atomic. However, UBI also provides atomic volume updates by means of the volume re-name operation.

The volume update is implemented with the help of a so-called update marker. Once the user has issued the UBI_IOCVOLUP ioctl, UBI sets the update marker flag for the volume in the corresponding record of the UBI volume table. Then the volume is wiped out and UBI waits for the user to pass the data. Once all the data have arrived and have been written to the flash, the update marker is cleared. But in case of an interruption (e.g., unclean reboot, crash of the update application, etc.), the update marker is not cleared and the volume is treated as "corrupted". Only a new successful update operation may clear the update marker.
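The life cycle of the update marker can be modelled as a tiny state machine. The sketch below is a user-space illustration with hypothetical names (`upd_volume`, `start_update()`, `feed_data()`, `is_corrupted()`); it does not touch a real volume table and only mirrors the rules stated above, including truncation and the handling of extra bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory model of one volume's update state. */
struct upd_volume {
    int upd_marker;     /* 1 while an update is started but unfinished */
    int64_t expected;   /* length announced when the update was started */
    int64_t received;   /* bytes accepted so far */
};

/* Models the ioctl that starts an update: set the marker, wipe the volume. */
void start_update(struct upd_volume *v, int64_t bytes)
{
    assert(bytes >= 0);
    v->expected = bytes;
    v->received = 0;
    /* Length zero is truncation: nothing to wait for, so no marker stays. */
    v->upd_marker = (bytes > 0);
}

/* Models write(): accept data; bytes beyond the announced length are ignored. */
void feed_data(struct upd_volume *v, int64_t n)
{
    assert(n >= 0);
    if (!v->upd_marker)
        return;
    if (n > v->expected - v->received)
        n = v->expected - v->received;
    v->received += n;
    if (v->received == v->expected)
        v->upd_marker = 0;   /* all data arrived: clear the marker */
}

/* What a later attach would conclude if the process stopped right now. */
int is_corrupted(const struct upd_volume *v)
{
    return v->upd_marker;
}
```

If the writer stops between start_update() and the final feed_data() call, the marker stays set, which corresponds to the "corrupted" state described above.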

Atomic LEB change

The atomic LEB change operation changes the contents of an LEB atomically, so that the old contents is preserved if the operation is interrupted. In other words, the result of the operation is that the LEB either has the old contents or the new contents.

The operation is available via the ubi_leb_change() kernel API call. The user-space interface for this operation exists starting from kernel version 2.6.25.

The user-space atomic LEB change operation is run via the UBI_IOCEBCH ioctl command. You have to pass a pointer to a properly filled request object of the struct ubi_leb_change_req type. The object stores the LEB number to change and the length of the new contents. Then you have to write the specified number of bytes to the volume character device. Notice some similarity to the user-space interface of the volume update operation. Schematically, the sequence is:

struct ubi_leb_change_req req;

req.lnum = lnum_to_change;
req.bytes = data_len;                   /* length of the new contents */
fd = open("/dev/my_volume", O_RDWR);
ioctl(fd, UBI_IOCEBCH, &req);
write(fd, data_buf, data_len);
close(fd);

If for some reason the user does not write the declared amount of bytes and closes the file, the operation is canceled and the old contents of the LEB is preserved.

Similarly to the volume update operation, it does not matter how many times the write() function is called and how much data it passes to the UBI volume each time. The atomic LEB change operation finishes once the last data byte has arrived.

The atomic LEB change operation might be very useful for file-systems; for example, UBIFS uses this operation as the last resort when it commits the file-system index. This operation may also be exploited to create an FTL layer on top of UBI (see here for the description of the idea).

Keep in mind that the atomic LEB change operation calculates the CRC-32 checksum of the new data, so it has some overhead compared to the LEB erase plus LEB write sequence. The volume update operation does not calculate the data CRC-32, so it is faster to update the volume than to atomically change all its eraseblocks. Remember this additional overhead, and do not use the operation if atomicity is not really needed.

Implementation details

Suppose UBI has to change a logical eraseblock L which is mapped to a physical eraseblock P₁. First of all, UBI always has one free PEB reserved for the atomic LEB change operation; let it be P₂. Before the operation, P₁ stores the contents of the LEB L and P₂ is free (it contains only the EC header and 0xFF bytes). The new data are written to P₂, not to P₁, so should anything go wrong, the old contents of the LEB are always there.

When the operation finishes, UBI un-maps L from P₁, maps it to P₂, and schedules P₁ for erasure. If the operation is interrupted, L stays mapped to P₁ and P₂ is scheduled for erasure.

If an unclean reboot happens half way through the atomic LEB change operation, UBI obviously has to preserve the L -> P₁ mapping and erase P₂ when it attaches the MTD device next time. But if the unclean reboot happens just after the atomic LEB change operation finishes, but before P₁ is physically erased, UBI has to preserve the L -> P₂ mapping and erase P₁.

To resolve situations like that, UBI calculates the CRC-32 checksum of the new contents of the LEB before it is written to flash, and stores it in the VID header (together with the data length). When UBI finds 2 PEBs P₁ and P₂ mapped to the same LEB L during the initialization, it selects the one with the higher sequence number (P₂) only if the data CRC-32 is correct (which means that all data have been written to the flash media); otherwise it selects the PEB with the lower sequence number (P₁). Of course, UBI has to read the LEB contents in order to check the CRC-32 checksum.
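The decision rule can be sketched as follows. The `leb_copy` structure and the `pick_copy()` function are hypothetical, and the checksum is the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF); the exact CRC parameters used by UBI on flash are not stated here and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain bit-by-bit CRC-32 (reflected, polynomial 0xEDB88320). */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical view of one PEB that claims to hold a given LEB. */
struct leb_copy {
    uint64_t sqnum;        /* sequence number from the VID header */
    uint32_t data_crc;     /* data CRC-32 stored in the VID header */
    const uint8_t *data;   /* LEB contents as read back from flash */
    size_t data_len;       /* data length stored in the VID header */
};

/*
 * Choose between two copies of the same LEB: take the newer one only
 * if its data match the stored checksum, otherwise keep the older one.
 */
const struct leb_copy *pick_copy(const struct leb_copy *older,
                                 const struct leb_copy *newer)
{
    assert(newer->sqnum > older->sqnum);
    if (crc32_bitwise(newer->data, newer->data_len) == newer->data_crc)
        return newer;
    return older;
}
```

The full read of the newer copy is what makes this check comparatively expensive, which is consistent with the overhead noted for the atomic LEB change operation.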

Fastmap

Fastmap is an experimental and optional UBI feature, which can be enabled by setting CONFIG_MTD_UBI_FASTMAP to 'y'. Once it is enabled, UBI evaluates the module parameter "fm_autoconvert". If it is set to 1 (the default is 0), UBI automatically enables fastmap for any attached image. This means UBI creates a new internal volume with the fastmap data, so that the fast attach mode can be used next time. In the default configuration UBI will use the information stored in this fastmap volume to accelerate the attach procedure. If you want to test fastmap, set fm_autoconvert to 1 and attach a volume.

The following settings are possible:

CONFIG_MTD_UBI_FASTMAP	fm_autoconvert	Result
n	0	fastmap is completely disabled
y	0	UBI will attach by fastmap if one exists on an image, but no fastmap will be installed on images without a fastmap
y	1	UBI will attach by fastmap if one exists on an image, a fastmap is automatically installed on all attached images

Backwards compatibility

The fastmap on-disk data structure makes use of delete-compatible volumes, therefore fastmap-enabled images are fully backwards compatible with UBI implementations which do not support fastmap. The kernel will remove the fastmap volumes and continue with scanning. This applies not only to kernels up to v3.6, but also to v3.7 and later with this option disabled.

Technical design

An on-disk fastmap contains all information needed to attach the whole image, namely all erase counter values, a list of all PEBs and their state, a list of all volumes and their current EBA, and so on. To avoid too many writes of the fastmap, it also contains a list of PEBs which may have changed and need a full scan while attaching. This list is called the "fastmap pool" and has a fixed size, 5% of the total number of PEBs. Using this technique, UBI needs to write the fastmap only if the pool contains no free PEBs. Otherwise it would have to write the fastmap each time the EBA of a volume changed.

A fastmap consists of a super block (also known as the anchor PEB) and payload data which can live on any PEB. The anchor PEB has to be located within the first 64 PEBs of the MTD device. It contains pointers to the remaining PEBs which carry the actual fastmap data. On modern NAND chips the whole fastmap fits into a single PEB, so the anchor PEB points to itself. After loading the fastmap data, the UBI attach information structure is created from it. The attach process works as follows:

UBI tries to find the fastmap anchor PEB; if no anchor PEB is found, UBI performs a traditional full scan.
It follows the pointers stored in the anchor PEB and reads the fastmap payload data.
Then it performs a traditional scan only on the PEBs in the pool instead of on all PEBs.

If UBI detects that the used fastmap is invalid or corrupted it automatically falls back to scanning mode and performs a full scan. Using a CRC32 checksum and consistency checks of the internal UBI structures UBI is able to detect whether a fastmap is invalid or not.
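The attach decision described above can be condensed into a short C sketch. The function names and the `is_anchor` flag array are hypothetical, the validity check is reduced to a single flag, and the pool size is taken as a plain 5% of the PEB count, ignoring any limits a real implementation may apply.

```c
#include <assert.h>
#include <stdint.h>

#define FM_ANCHOR_SEARCH_PEBS 64   /* anchor must be within the first 64 PEBs */

/*
 * Look for the anchor PEB. is_anchor[p] is non-zero if PEB p carries a
 * fastmap super block. Returns the PEB number, or -1 if none is found.
 */
int find_anchor(const uint8_t *is_anchor, uint32_t total_pebs)
{
    uint32_t limit = total_pebs < FM_ANCHOR_SEARCH_PEBS
                         ? total_pebs : FM_ANCHOR_SEARCH_PEBS;

    assert(is_anchor);
    for (uint32_t p = 0; p < limit; p++)
        if (is_anchor[p])
            return (int)p;
    return -1;
}

/*
 * How many PEBs have to be scanned during attach: only the pool when a
 * valid fastmap exists, otherwise every PEB (fallback to a full scan).
 */
uint32_t pebs_to_scan(uint32_t total_pebs, int anchor_pnum, int fastmap_valid)
{
    if (anchor_pnum < 0 || !fastmap_valid)
        return total_pebs;
    return (uint32_t)((uint64_t)total_pebs * 5 / 100);
}
```

For a 4096 PEB device this model scans about 204 PEBs on a fast attach instead of 4096, which is where the shorter attach time comes from.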

A fastmap is written to the device each time the fastmap pool becomes full (no free PEBs are available), the volume layout changes, or the image is detached. One may wonder why writing at detach time is needed. If UBI did not write a new fastmap at detach time, all erase counter modifications since the last fastmap write would be lost.

Overhead

If fastmap is enabled, UBI reserves enough PEBs to carry two complete fastmaps. In practice, on modern NAND chips two PEBs are reserved for fastmap.

There is also some runtime overhead: to guarantee that the new fastmap is valid and consistent, UBI has to make sure that all I/O which would cause EBA changes is blocked while attaching. Depending on the flash chip, this can take up to one second. Therefore, fastmap only makes sense on fast and large flash devices where a full scan takes too long. For example, on 4GiB NAND chips a full scan takes several seconds, whereas a fast attach needs less than one second.

dolinux

Linux kernel engineer and low-level computing enthusiast
