
Email-reading: filemap: Correct the conditions for marking a folio as accessed

Original patch: [PATCH 1/3] filemap: Correct the conditions for marking a folio as accessed - Matthew Wilcox (Oracle)

We had an off-by-one error which meant that we never marked the first page in a read as accessed. This was visible as a slowdown when re-reading a file as pages were being evicted from cache too soon.
In reviewing this code, we noticed a second bug where a multi-page folio would be marked as accessed multiple times when doing reads that were less than the size of the folio.

Question 1: What is an off-by-one error?

An off-by-one error or off-by-one bug (known by acronyms OBOE, OBO, OB1 and OBOB) is a logic error involving the discrete equivalent of a boundary condition. [1] In other words, a loop runs one iteration too few or one too many.
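As a minimal illustration (a made-up user-space example, not from the kernel), this is the classic shape of the bug in C: the loop bound is off by one, so the last element is never visited, much like the first folio of a read never being marked.

#include <stdio.h>

int main(void)
{
	int a[5] = { 1, 2, 3, 4, 5 };
	int n = 5, sum = 0;

	/* Intended: visit a[0]..a[4]. The off-by-one bound "i < n - 1"
	 * stops one element early, so a[4] is never counted. */
	for (int i = 0; i < n - 1; i++)		/* should be: i < n */
		sum += a[i];

	printf("sum = %d, expected 15\n", sum);	/* prints sum = 10 */
	return 0;
}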

Question 2: What does marking a page as accessed do?

/*
 * Mark a page as having seen activity.
 *
 * inactive,unreferenced	->	inactive,referenced
 * inactive,referenced		->	active,unreferenced
 * active,unreferenced		->	active,referenced
 *
 * When a newly allocated page is not yet visible, so safe for non-atomic ops,
 * __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
 */
void folio_mark_accessed(struct folio *folio)

In the comment, "active" refers to the PG_active flag and "referenced" to the PG_referenced flag; the state is stored in these flags, and "->" shows the transitions. The result of each transition is reflected in the page's position on the LRU lists [2] (a simplified sketch follows the quote below).

A core part of the kernel's memory management subsystem is a pair of lists called the "active" and "inactive" lists.
The active list contains anonymous and file-backed pages that are thought (by the kernel) to be in active use by some process on the system.
The inactive list, instead, contains pages that the kernel thinks might not be in use. When active pages are considered for eviction, they are first moved to the inactive list and unmapped from the address space of the process(es) using them.
Thus, once a page moves to the inactive list, any attempt to reference it will generate a page fault; this "soft fault" will cause the page to be moved back to the active list. Pages that sit in the inactive list for long enough are eventually removed from the list and evicted from memory entirely.
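The state machine in the comment can be condensed into a short sketch. This is a simplification for illustration only (folio_mark_accessed_sketch is a made-up name); the real implementation in mm/swap.c additionally handles unevictable folios, folios not yet on an LRU list, per-CPU batching and idle-page tracking.

/* Simplified sketch of the transitions listed in the comment above. */
static void folio_mark_accessed_sketch(struct folio *folio)
{
	if (!folio_test_referenced(folio)) {
		/* inactive,unreferenced -> inactive,referenced
		 * active,unreferenced   -> active,referenced   */
		folio_set_referenced(folio);
	} else if (!folio_test_active(folio)) {
		/* inactive,referenced -> active,unreferenced:
		 * promote the folio to the active LRU list.    */
		folio_activate(folio);
		folio_clear_referenced(folio);
	}
	/* active,referenced: nothing more to do until reclaim ages it. */
}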

Further reading:

Question 3: How were the bugs fixed?

First, the purpose of this code: every page/folio touched by a read should be marked as accessed, one after another. Because two reads may hit the same page, and the first read has already marked it, the code relies on ra->prev_pos to decide when the mark can be skipped.

/*
 * When a read accesses the same folio several times, only
 * mark it as accessed the first time.
 */
if (iocb->ki_pos >> PAGE_SHIFT != ra->prev_pos >> PAGE_SHIFT)
	folio_mark_accessed(fbatch.folios[0]);

for (i = 0; i < folio_batch_count(&fbatch); i++) {
	...
	if (i > 0)
		folio_mark_accessed(folio);

Bug 1: We never marked the first page in a read as accessed. This was visible as a slowdown when re-reading a file as pages were being evicted from cache too soon.

ra->prev_pos was mistakenly taken to be the last page/folio of the previous read. But looking at the filemap_read code, ra->prev_pos is really the end of the previous read, i.e. the start of the next read; the page/folio that ra->prev_pos points into was never actually read:

			copied = copy_folio_to_iter(folio, offset, bytes, iter);
			iocb->ki_pos += copied;
			ra->prev_pos = iocb->ki_pos;

This was fixed by subtracting one from ra->prev_pos; a worked example of the arithmetic follows the diff below.

-               if (iocb->ki_pos >> PAGE_SHIFT !=
-                   ra->prev_pos >> PAGE_SHIFT)
+               if (!pos_same_folio(iocb->ki_pos, ra->prev_pos - 1,
+                                                       fbatch.folios[0]))
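To see why the "- 1" matters, here is a minimal user-space sketch of the arithmetic, assuming 4 KiB pages (PAGE_SHIFT = 12). For a sequential read, iocb->ki_pos starts exactly where ra->prev_pos left off, so the old comparison always reports "same page", even when the previous read stopped at a page boundary and never touched that page; comparing against prev_pos - 1 (the last byte actually copied) gives the right answer.

#include <stdio.h>

#define PAGE_SHIFT 12	/* assumed 4 KiB pages, for illustration only */

int main(void)
{
	/* The previous read copied bytes 0..4095, so ra->prev_pos == 4096:
	 * the first byte of a page that read never touched. */
	long long prev_pos = 4096;
	long long ki_pos   = 4096;	/* the next sequential read starts here */

	/* Old check: indices match, so the mark is (wrongly) skipped. */
	printf("old check says same page:   %d\n",
	       (ki_pos >> PAGE_SHIFT) == (prev_pos >> PAGE_SHIFT));		/* 1 */

	/* Fixed check: compare against the last byte actually read. */
	printf("fixed check says same page: %d\n",
	       (ki_pos >> PAGE_SHIFT) == ((prev_pos - 1) >> PAGE_SHIFT));	/* 0 */
	return 0;
}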

Bug 2: A multi-page folio would be marked as accessed multiple times when doing reads that were less than the size of the folio.

Even with ra->prev_pos decremented by one, another problem remains. Since commit 25d6a23e8d28 ("filemap: Convert filemap_get_read_batch() to use a folio_batch"), filemap_get_pages reads in units of folios rather than pages. A folio may be larger than one page, so even though two reads compute different indices at PAGE_SIZE granularity, they may still fall into the same folio, which then gets marked more than once.

This was fixed by shifting by folio_shift(folio) instead of PAGE_SHIFT, via a new pos_same_folio() helper; a small demonstration follows the diff below.

+static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
+{
+       unsigned int shift = folio_shift(folio);
+
+       return (pos1 >> shift == pos2 >> shift);
+}
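And a similar user-space sketch for Bug 2, assuming 4 KiB pages and an order-2 (16 KiB, four-page) folio; in the kernel, folio_shift() is PAGE_SHIFT plus the folio order. Two short reads that land in different pages of the same folio get different page indices but the same folio index, so pos_same_folio() marks the folio only once.

#include <stdio.h>

#define PAGE_SHIFT 12			/* assumed 4 KiB pages */

int main(void)
{
	unsigned int folio_order = 2;	/* assumed: a 16 KiB (four-page) folio */
	unsigned int folio_shift = PAGE_SHIFT + folio_order;

	long long pos1 = 0;	/* first page of the folio */
	long long pos2 = 8192;	/* third page of the same folio */

	/* Per-page comparison: different indices, folio marked again. */
	printf("same page:  %d\n", (pos1 >> PAGE_SHIFT) == (pos2 >> PAGE_SHIFT));	/* 0 */

	/* pos_same_folio()-style comparison: same folio, marked only once. */
	printf("same folio: %d\n", (pos1 >> folio_shift) == (pos2 >> folio_shift));	/* 1 */
	return 0;
}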

Slowdown on Btrfs: "Major btrfs fiemap slowdown on file with many extents once in cache (RCU stalls?)"

Note 1: When reporting a git bisect [3] result, include the commit id.

> > I've taken a moment to bisect this and came down to this patch.
> I think you may have forgotten to include the commit-id that was
> the results of your bisect.... ?
Sorry, this is the patch I replied to and it was recent enough that
I assumed it'd still be in mailboxes, but you're right it's better
with a commit id. This is was merged as 5ccc944dce3d ("filemap: Correct
the conditions for marking a folio as accessed")

Note 2: Learning to use perf tools

Here's what perf has to say about it on top of this patch when running
`cp bigfile /dev/null` the first time:

and second time:
99.90%     0.00%  cp       [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe
 entry_SYSCALL_64_after_hwframe
 do_syscall_64
  - 94.62% __x64_sys_ioctl
       do_vfs_ioctl
       btrfs_fiemap
     - extent_fiemap
        - 50.01% get_extent_skip_holes
           - 50.00% btrfs_get_extent_fiemap
              - 49.97% count_range_bits
                   rb_next
        + 28.72% lock_extent_bits
        + 15.55% __clear_extent_bit
  - 5.21% ksys_read
     + 5.21% vfs_read

(if this isn't readable, 95% of the time is spent on fiemap the second
time around)

Resources:

Question 4: What are RCU stalls?

I've also been observing RCU stalls on my laptop with the same workload (cp to /dev/null), but unfortunately I could not reproduce in qemu so I could not take traces to confirm they are caused by the same commit but given the workload I'd say that is it?

RCU (read-copy update) is a kernel synchronization mechanism that increases a Linux system parallelism by enabling the concurrent access of readers and writers to a given shared data. Although RCU readers and writers are always allowed to access a shared data, writers are not allowed to free dynamically allocated data that was modified before the end of the grace-period. The end of a grace period ensures that no readers are accessing the old version of dynamically allocated shared data, allowing writers to return the memory to the system safely. Hence, a drawback of RCU is that a long wait for the end of a grace period can lead the system to run out-of-memory.

To warn that a grace-period is taking too long to occur, RCU Stalls messages are printed to the kernel log, notifying that the wait for the end of the grace period is taking more than the defined timeout. By default, the timeout is 60 seconds. [4]
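A minimal kernel-style sketch of the reader/writer pattern described above (struct gp_data, reader() and writer() are made up for illustration, and writers are assumed to be serialized elsewhere): the grace period that synchronize_rcu() waits for is exactly what an RCU stall warning complains about when it takes too long.

#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Hypothetical shared data protected by RCU. */
struct gp_data {
	int value;
};

static struct gp_data __rcu *shared;

/* Reader: runs concurrently with writers; the read-side critical
 * section is what a grace period must wait for. */
static int reader(void)
{
	struct gp_data *p;
	int val = 0;

	rcu_read_lock();
	p = rcu_dereference(shared);
	if (p)
		val = p->value;
	rcu_read_unlock();
	return val;
}

/* Writer: publish a new copy, then wait for a grace period before
 * freeing the old one.  If the grace period takes too long (e.g. a
 * CPU loops in the kernel without reaching a quiescent state), the
 * kernel prints an RCU stall warning. */
static void writer(int new_value)
{
	struct gp_data *newp, *oldp;

	newp = kmalloc(sizeof(*newp), GFP_KERNEL);
	if (!newp)
		return;
	newp->value = new_value;

	oldp = rcu_dereference_protected(shared, 1);	/* writers serialized elsewhere */
	rcu_assign_pointer(shared, newp);
	synchronize_rcu();		/* wait for all pre-existing readers */
	kfree(oldp);
}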

Note: there is a great deal of material on RCU; the LWN RCU documentation has been updated from 2008 (or earlier) up to the present, and every change is worth studying. RCU is a crowning achievement of lock-free parallel programming, where academia meets industry. Some learning resources:


Follow-up:

  • Yu Kuai (the reporter) could not reproduce such an insane performance loss on ext4.
  • Dominique MARTINET could not reproduce it in qemu, and did not reply again afterwards.

  1. Off-by-one error - Wikipedia
  2. Better active/inactive list balancing [LWN.net]
  3. Git - git-bisect Documentation
  4. Avoiding RCU Stalls in the real-time kernel - Red Hat Customer Portal
