On fstrim, btrfs, and SSDs

While browsing the btrfs mailing list, I came across a discussion about fstrim on Btrfs.


      wear leveling: http://en.wikipedia.org/wiki/Wear_levelling
      fstrim on BTRFS: http://marc.info/?l=linux-btrfs&m=132509156511214&w=2

1.  The thread analyzes how fstrim differs from btrfs fi defrag/balance, and how certain SSD characteristics affect performance under these operations

      Q:    From: "Fajar A. Nugraha" <list@fajar.net>
    How useful would trim be for btrfs when using newer SSD which have their 
    own garbage collection and wear leveling (e.g. sandforce-based)?
    I'm trying fstrim and my disk is now pegged at write IOPS. Just wondering 
    if maybe a "btrfs fi defrag" would be more useful, since:
        - with trim, used space will remain used. Thus future writes will only
          utilized space marked as "free", making them wear faster
        - with "btrfs fi defrag", btrfs will move the data around so (to some
           degree) the currently-unused space will be used, and  currently-used
           space will be unused, which will improve wear leveling.

      A:    From: Roman Mamedov <rm@romanrm.ru>
    Modern controllers (like the SandForce you mentioned) do their own wear leveling 
    'under the hood', i.e. the same user-visible sectors DO NOT necessarily map to the
    same locations on the flash at all times; and introducing 'manual' wear leveling by 
    additional rewriting is not a good idea, it's just going to wear it out more.

      Q:    From: "Fajar A. Nugraha" <list@fajar.net>
    I know that modern controllers have their own wear leveling, but AFAIK they basically:
        (1) have reserved a certain size for wear leveling purposes
        (2) when a write request comes, they basically use new sectors from
            the pool, and put the "old" sectors to the pool (doing garbage
            collection like trim/rewrite in the process)
        (3) they can't re-use sectors that are currently being used and not
            rewritten (e.g. sectors used by OS files)
    If (3) is still valid, then the only way to reuse the sectors is by forcing a rewrite 
    (e.g. using "btrfs fi defrag"). So the question is, is (3) still valid?

      A:    From: cwillu <cwillu@cwillu.com>

    Erase blocks are generally much larger than logical sectors. There's nothing stopping 
    an SSD from shuffling around logical sectors as much as it wants, at any time, and 
    virtually all SSDs do this behind the scenes already, sufficient to maintain adequate 
    wear levelling.

    The problem isn't levelling, but rather that once the pool of erase blocks with remaining 
    clear space is gone, any further writes require the SSD to do a read/erase/rewrite shuffle 
    of the valid data in an erase block to reclaim and compact the scattered overwritten sectors.  
    Early SSDs ended up operating in this mode continuously, which is why their performance would 
    drop off over time:  every little 512 byte write would require reading several hundred
    kilobytes (if not megabytes) first, so that it could be rewritten with the new data after 
    erasing the whole block (cutting the power during this process would often cause additional 
    hilarity; SD cards have been especially bad for this).
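
As a rough illustration of the arithmetic in the paragraph above (not part of the quoted message): if every small write forces a full read/erase/rewrite of one erase block, the amplification is simply the erase block size divided by the write size. The erase block sizes below are assumed example values, not figures from the thread, and `write_amplification` is a hypothetical helper name.

```python
def write_amplification(erase_block_bytes: int, host_write_bytes: int) -> float:
    """Bytes physically rewritten per byte the host asked to write, assuming
    each write triggers a read/erase/rewrite of one whole erase block."""
    if host_write_bytes <= 0 or host_write_bytes > erase_block_bytes:
        raise ValueError("write must be non-empty and fit in one erase block")
    return erase_block_bytes / host_write_bytes


SECTOR_BYTES = 512  # the "little 512 byte write" from the quoted text

# Assumed erase block sizes, spanning "several hundred kilobytes (if not megabytes)".
for block_kib in (128, 512, 2048):
    factor = write_amplification(block_kib * 1024, SECTOR_BYTES)
    print(f"{block_kib:>5} KiB erase block: {factor:.0f}x")
```

Under these assumptions a single 512-byte write costs between 256 and 4096 times its own size in flash traffic.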

    The later controllers gained some intelligence, such that they would set aside some erase blocks 
    to perform that compaction in the background, allowing them to maintain a pool of free erase 
    blocks.  Note that it's trivial at that point for the drive to move the data from a relatively 
    unworn erase block to one from the pool if necessary, although I don't know that this is actually 
    used, as wear levelling really isn't a big deal in practice.

    What TRIM does in this mix is tell the SSD that various logical blocks can be considered to be 
    overwritten (so to speak), and as such, don't need (and shouldn't!) be rewritten if and when the 
    erase block that holds them is compacted.  This allows the SSD to compact those sectors into the 
    pool earlier than it might have been able to otherwise (in the best case), and in the worst
    case can prevent that data from being needlessly copied again and again.

    Consider if you filled a somewhat naive SSD (specifically, one which held no spare erase blocks 
    for compaction) to capacity, deleted everything, and then overwrote the same logical sector 
    repeatedly: without trim, the ssd has no way of knowing that the rest of the blocks are garbage 
    that can be reused, and so it'll be stuck  reading an entire erase block's worth of garbage, 
    clearing the erase block, and writing that garbage back out with the changed 512 bytes.
    Even with wear-levelling, you'll still suffer a horrendous write-performance loss, and will wear
    through the drive far faster than one might otherwise expect.
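
The thought experiment above can be sketched as a toy model (not part of the quoted message). It assumes a single erase block, no spare blocks, and a drive that writes back every sector it still believes is live; `sectors_programmed` and its parameters are hypothetical names chosen for illustration, and real firmware is far more complicated.

```python
def sectors_programmed(sectors_per_block: int, overwrites: int, trimmed: bool) -> int:
    """Count flash sectors programmed while logical sector 0 is overwritten
    repeatedly on a naive drive with no spare erase blocks."""
    # State after "fill to capacity, then delete everything": the drive still
    # considers every sector live unless the filesystem sent TRIM for it.
    live = [not trimmed] * sectors_per_block
    programmed = 0
    for _ in range(overwrites):
        live[0] = True  # the host writes logical sector 0 again
        # Read/erase/rewrite: every sector the drive believes is live is written back.
        programmed += sum(live)
    return programmed


print("without TRIM:", sectors_programmed(1024, 100, trimmed=False))
print("with TRIM:   ", sectors_programmed(1024, 100, trimmed=True))
```

With 1024 sectors per block and 100 overwrites, the model programs 102400 sectors without TRIM and 100 with it, which is the "garbage written back out" cost the paragraph describes.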

    This is why some have said that TRIM support is just a crutch for poor firmware, and is why many
    devices (all, the last time I checked :p) have poorly performing TRIM commands: with a couple 
    erase blocks set aside, that pathological case won't occur; instead you'll have a couple erase 
    blocks that gradually get filled up with old copies of the only logical sector that's changing,
    which can be efficiently erased and returned to the pool.  Add in some transparent compression 
    (e.g., OCZ's), and you can probably get away with very few erase blocks in the free pool and still 
    maintain acceptable write performance.
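
For contrast, here is a minimal sketch of the "couple of erase blocks set aside" case (again not part of the quoted message). It assumes new copies of the one changing logical sector are appended to an active block, and that a full block holds only stale copies once the next write lands in a spare block, so it can be erased without copying anything. `log_structured_overwrites` is a hypothetical name.

```python
def log_structured_overwrites(sectors_per_block: int, overwrites: int) -> tuple[int, int]:
    """Return (sectors_programmed, blocks_erased) when one logical sector is
    overwritten repeatedly and each new copy is appended to an active block."""
    programmed = 0
    erased = 0
    used_in_active = 0
    for _ in range(overwrites):
        if used_in_active == sectors_per_block:
            # The active block is full. The incoming write supersedes the only
            # live copy it holds, so the block is erased with nothing to copy
            # and a spare block becomes the new active block.
            erased += 1
            used_in_active = 0
        used_in_active += 1
        programmed += 1
    return programmed, erased


print(log_structured_overwrites(1024, 4096))
```

In this model each host write programs exactly one sector, and an erase happens only once per filled block, rather than once per write.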

    In light of this, the problem with just using btrfs's defrag/balance as currently implemented 
    becomes more apparent:  we're not actually freeing up any space, we're just overwriting logical 
    sectors with data that was already stored elsewhere.  In the mythical best case, a magical SSD will 
    notice the duplicated blocks and just store a reference; in the common case of a half-decent
    firmware, the SSD will still get along okay (it's basically the same situation as the previous 
    example); in the worst case of a naive or misguided SSD, you're pretty much guaranteeing the worst 
    case behaviour: filling up the drive with garbage, at which point the writes from the balance/defrag 
    will likely hit the wear-amplification case described above.

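2.  A design discussion: whether to change the on-disk metadata format, or to add a flag to an in-memory data structure to reflect the state change. Chris Mason prefers the latter: keep on-disk format changes to a minimum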

      Q:
    Whether we want to store TRIMMED information on disk? ext4 doesn't do this, so the first fstrim
    will be slow though you've done fstrim in previous mount.
    For btrfs this issue can't be solved without disk format change that will break older kernels,

      A:    From: Chris Mason
    I'd rather not store the trim status on disk.  The extra trims don't have a huge cost, 
    and since some devices have a large granularity for trims, they may ignore the trim 
    until it tosses a larger contiguous area of the disk.
    I'd be fine with a flag to the in-memory free extent struct that indicates if it has 
    been trimmed down to the device.
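
A minimal sketch of what such an in-memory flag could look like (not part of the quoted message, and not btrfs's actual data structure). `FreeExtent`, `trim_free_space`, and the `discard` callback are hypothetical names; the point is only that the flag lives in memory, so it suppresses repeated trims within one mount and is lost on unmount.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class FreeExtent:
    start: int             # byte offset of the free range
    length: int            # size of the free range in bytes
    trimmed: bool = False  # in-memory only; never written to disk


def trim_free_space(extents: Iterable[FreeExtent],
                    discard: Callable[[int, int], None]) -> int:
    """Discard every free extent not yet trimmed; return the bytes discarded."""
    total = 0
    for ext in extents:
        if ext.trimmed:
            continue  # already trimmed during this mount, skip it
        discard(ext.start, ext.length)
        ext.trimmed = True
        total += ext.length
    return total


issued = []
free = [FreeExtent(0, 4096), FreeExtent(8192, 4096)]
print(trim_free_space(free, lambda s, n: issued.append((s, n))))  # first pass
print(trim_free_space(free, lambda s, n: issued.append((s, n))))  # second pass
```

The first pass discards both extents and the second discards nothing. Rebuilding the list after a remount would reset every flag, so the first fstrim after mounting repeats the trims, which is the cost Chris Mason describes as acceptable.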




posted on 2012-11-28 20:33 by refrag
