Readahead in Linux
Summary
- Very short summary
- Readahead does actually read.
- Short summary
- Readahead in Linux is quite similar to the readv implementation in PostgreSQL (at least the streaming I/O Thomas Munro implemented for pg17).
- It may be easier to understand how readahead works by reading that code first.
- Key points
- Readahead is not asynchronous.
- Readahead is filesystem-specific; see the bottom of the call tree below.
In Linux, readahead has evolved over time; the last known review was done by Neil Brown around April 8, 2022.
He wrote an article summarizing his findings: "Readahead: the documentation I wanted to read".
The documentation that resulted from that work (apparently merged in 5.18): https://www.kernel.org/doc/html/latest/core-api/mm-api.html#readahead
It is a really good summary and covers the two main entry functions of readahead: page_cache_sync_ra() and page_cache_async_ra().
Both functions have up-to-date documentation for their underlying wrappers: https://www.kernel.org/doc/html/latest/core-api/mm-api.html?highlight=page_cache_async_readahead#c.page_cache_sync_readahead
Take a moment to read the documentation (excerpted here):
- page_cache_sync_readahead()
- Should be called when a cache miss happened: it will submit the read. The readahead logic may decide to piggyback more pages onto the read request if access patterns suggest it will improve performance.
- page_cache_async_readahead()
- Should be called when a page is used which is marked as PageReadahead; this is a marker to suggest that the application has used up enough of the readahead window that we should start bringing in more pages.
OK, so far so good.
But when running posix_fadvise the code path is different, and the entry point leads to:
- page_cache_ra_unbounded()
- This function is for filesystems to call when they want to start readahead beyond a file's stated i_size. This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead. File is referenced by caller. Mutexes may be held by caller. May sleep, but will not reenter filesystem to reclaim memory.
Today Linux works with "folios", and the readahead flag is set in this function via folio_set_readahead(folio);
The other super-important part is: "May sleep" — that's right, it is not asynchronous. The only partly asynchronous case is when the storage is really congested, so the process bails out partway through the call that issues the readahead range. And as Neil mentions, "congestion" is not universally honored and may not work as expected.
Side note: Linux splits the range into 2 MB chunks to manage memory and reduce locking. Hard-coded.
From a comment in the readahead code:
/*
 * Each readahead request is partly synchronous read, and partly async
 * readahead. This is reflected in the struct file_ra_state which
 * contains ->size being the total number of pages, and ->async_size
 * which is the number of pages in the async section. The readahead
 * flag will be set on the first folio in this async section to trigger
 * a subsequent readahead. Once a series of sequential reads has been
 * established, there should be no need for a synchronous component and
 * all readahead request will be fully asynchronous.
 */
Call tree - the functions Linux actually uses for posix_fadvise
https://elixir.bootlin.com/linux/latest/source/mm/fadvise.c#L31
/*
 * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
 * deactivate the pages and clear PG_Referenced.
 */
int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
https://elixir.bootlin.com/linux/latest/source/mm/internal.h#L126
inline wrapper
https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L306
/*
 * Chunk the readahead into 2 megabyte units, so that we don't pin too much
 * memory at once.
 */
void force_page_cache_ra(struct readahead_control *ractl, unsigned long nr_to_read)
https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L281
/*
 * do_page_cache_ra() actually reads a chunk of disk. It allocates
 * the pages first, then submits them for I/O. This avoids the very bad
 * behaviour which would occur if page allocations are causing VM writeback.
 * We really don't want to intermingle reads and writes like that.
 */
https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L205
/**
 * page_cache_ra_unbounded - Start unchecked readahead.
 * @ractl: Readahead control.
 * @nr_to_read: The number of pages to read.
 * @lookahead_size: Where to start the next readahead.
 *
 * This function is for filesystems to call when they want to start
 * readahead beyond a file's stated i_size. This is almost certainly
 * not the function you want to call. Use page_cache_async_readahead()
 * or page_cache_sync_readahead() instead.
 *
 * Context: File is referenced by caller. Mutexes may be held by caller.
 * May sleep, but will not reenter filesystem to reclaim memory.
 */
Some interesting comments during preallocation:
/*
 * Partway through the readahead operation, we will have added
 * locked pages to the page cache, but will not yet have submitted
 * them for I/O. Adding another page may need to allocate memory,
 * which can trigger memory reclaim. Telling the VM we're in
 * the middle of a filesystem operation will cause it to not
 * touch file-backed pages, preventing a deadlock. Most (all?)
 * filesystems already specify __GFP_NOFS in their mapping's
 * gfp_mask, but let's be explicit here.
 */
https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L146
static void read_pages(struct readahead_control *rac)
Then it goes filesystem-specific.
EXT4
https://elixir.bootlin.com/linux/latest/source/fs/ext4/inode.c#L3124
static int ext4_read_folio(struct file *file, struct folio *folio)
If not found in memory:
https://elixir.bootlin.com/linux/latest/source/fs/ext4/readpage.c#L211
int ext4_mpage_readpages(struct inode *inode, struct readahead_control *rac, struct folio *folio)
After a lot of logic about ordering, holes, etc.:
https://elixir.bootlin.com/linux/latest/source/block/blk-core.c#L833
/**
 * submit_bio - submit a bio to the block device layer for I/O
 * @bio: The &struct bio which describes the I/O
 *
 * submit_bio() is used to submit I/O requests to block devices. It is passed a
 * fully set up &struct bio that describes the I/O that needs to be done. The
 * bio will be send to the device described by the bi_bdev field.
 *
 * The success/failure status of the request, along with notification of
 * completion, is delivered asynchronously through the ->bi_end_io() callback
 * in @bio. The bio must NOT be touched by the caller until ->bi_end_io() has
 * been called.
 */
First thoughts
- WILLNEED on a sequential pattern will only compete with Linux's own readahead.
- If PG_readahead is set, Linux interprets it as past readahead success and keeps doing more readahead from the point this block is read (apparently mitigated by some hole detection).
- When a page is read without PG_readahead set, I did not check how the readahead logic handles it.
- DONTNEED after reads is apparently well optimized in the Linux code.
- Setting the RANDOM or SEQUENTIAL flags effectively influences Linux's default readahead. And it is free.