Readahead

From PostgreSQL wiki

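In Linux

Summary

Very short summary
Readahead does read.
Short summary
Readahead in Linux is closer to a readv implementation in PostgreSQL (at least the streaming IO implemented by Thomas Munro for pg17).
Reading it with that in mind may make it easier to understand how it works.
Important points
Readahead is not asynchronous.
Readahead is filesystem specific; see the bottom of the call tree below.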

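Readahead has evolved over time in Linux; the last known review was done by Neil Brown around April 8, 2022.
He wrote an article summarizing his findings: Readahead: the documentation I wanted to read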

The documentation update that came out of it (apparently merged in 5.18): https://www.kernel.org/doc/html/latest/core-api/mm-api.html#readahead

It really is a very good summary, and it covers the 2 main entry functions of readahead: page_cache_sync_ra() and page_cache_async_ra()

Both have related, up-to-date documentation for the functions they are built on: https://www.kernel.org/doc/html/latest/core-api/mm-api.html?highlight=page_cache_async_readahead#c.page_cache_sync_readahead


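Take the time to read the documentation (excerpt here)

page_cache_sync_readahead()
Should be called when a cache miss happened: it will submit a read. The readahead logic may decide to piggyback more pages onto the read request if access patterns suggest it will improve performance.
page_cache_async_readahead()
Should be called when a page is used which is marked as PageReadahead; this is a marker to suggest that the application has used up enough of the readahead window that we should start pulling in more pages.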

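OK, so far so good.

But when doing posix_fadvise the code is different, and the entry point leads to

page_cache_ra_unbounded()
This function is for filesystems to call when they want to start readahead beyond a file's stated i_size. This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead. File is referenced by caller. Mutexes may be held by caller. May sleep, but will not reenter filesystem to reclaim memory.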
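
For reference, the user-space side of this path is a single posix_fadvise() call with POSIX_FADV_WILLNEED. The sketch below is only an illustration: prefetch_path() is a hypothetical helper name, not something from PostgreSQL or the kernel, and it assumes a Linux system where the advice is routed through the call tree described on this page.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_fadvise() under strict -std */
#endif

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * prefetch_path() is a hypothetical helper.  It asks the kernel to pull
 * [offset, offset + len) of a file into the page cache.
 * Returns 0 on success or an errno value on failure.
 */
int prefetch_path(const char *path, off_t offset, off_t len)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;

    /*
     * posix_fadvise() returns the error number directly instead of
     * setting errno.  Per the notes on this page, the call may sleep
     * while the reads are being submitted.
     */
    int rc = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);

    close(fd);
    return rc;
}
```

Because the call may sleep, a caller that wants to overlap prefetching with other work cannot assume it returns immediately.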

Today, Linux uses "folios", and the readahead flag is set in this function via folio_set_readahead(folio).

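Another very important part is "may sleep": yes, it is not asynchronous. The only case where it is partially asynchronous is when the storage is really congested, in which case the operation is abandoned partway through the call that supplied the readahead range. As Neil mentions, "congestion" is not universally honoured and might not work as expected.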

Side note: Linux splits the range into 2 MB chunks to manage memory and reduce locking. This is hardcoded.
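
The chunking can be modelled in user space with a few lines of arithmetic. This is a sketch under stated assumptions, not the kernel code: ra_split_chunks() and RA_CHUNK_BYTES are hypothetical names, and the page size is passed in as a parameter rather than taken from PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define RA_CHUNK_BYTES (2UL * 1024 * 1024)   /* the hardcoded 2 MB unit */

/*
 * Split a request of nr_to_read pages into chunks of at most 2 MB.
 * The size (in pages) of each chunk is stored in chunk_pages[] for the
 * first max_chunks chunks.  Returns the total number of chunks needed.
 */
size_t ra_split_chunks(unsigned long nr_to_read, unsigned long page_size,
                       unsigned long *chunk_pages, size_t max_chunks)
{
    assert(page_size > 0 && page_size <= RA_CHUNK_BYTES);

    size_t count = 0;

    while (nr_to_read > 0) {
        unsigned long this_chunk = RA_CHUNK_BYTES / page_size;

        if (this_chunk > nr_to_read)
            this_chunk = nr_to_read;       /* last, partial chunk */
        if (chunk_pages != NULL && count < max_chunks)
            chunk_pages[count] = this_chunk;

        nr_to_read -= this_chunk;
        count++;
    }
    return count;
}
```

With 4 KiB pages a chunk is 512 pages, so a 1280-page request is handled as three chunks of 512, 512 and 256 pages.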

Excerpt from the comments in the readahead code

   /*
    * Each readahead request is partly synchronous read, and partly async
    * readahead.  This is reflected in the struct file_ra_state which
    * contains ->size being the total number of pages, and ->async_size
    * which is the number of pages in the async section.  The readahead
    * flag will be set on the first folio in this async section to trigger
    * a subsequent readahead.  Once a series of sequential reads has been
    * established, there should be no need for a synchronous component and
    * all readahead request will be fully asynchronous.
    */
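
The relationship between ->size, ->async_size and the position of the readahead flag can be written out as a small model. This is a simplified user-space sketch: struct ra_window and the three helpers are hypothetical names, and the kernel's struct file_ra_state carries more fields than the three shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of one readahead request (illustration only). */
struct ra_window {
    unsigned long start;       /* index of the first page in the request */
    unsigned long size;        /* total number of pages in the request */
    unsigned long async_size;  /* pages in the trailing async section */
};

/* Number of pages in the leading, synchronous part of the request. */
unsigned long ra_sync_pages(const struct ra_window *w)
{
    assert(w->async_size <= w->size);
    return w->size - w->async_size;
}

/* Page index that would carry the readahead flag: first async page. */
unsigned long ra_marker_index(const struct ra_window *w)
{
    return w->start + ra_sync_pages(w);
}

/* True once the request no longer has a synchronous component. */
bool ra_fully_async(const struct ra_window *w)
{
    return w->size > 0 && w->async_size == w->size;
}
```

For a request starting at page 100 with size 32 and async_size 16, the first 16 pages form the synchronous part and the marker lands on page 116.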

Call tree - the functions Linux posix_fadvise actually uses

https://elixir.bootlin.com/linux/latest/source/mm/fadvise.c#L31

   /*
   * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
   * deactivate the pages and clear PG_Referenced.
   */
   int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)

https://elixir.bootlin.com/linux/latest/source/mm/internal.h#L126

   inline wrapper

https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L306

   /*
   * Chunk the readahead into 2 megabyte units, so that we don't pin too much
   * memory at once.
   */
   void force_page_cache_ra(struct readahead_control *ractl,
           unsigned long nr_to_read)

https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L281

   /*
   * do_page_cache_ra() actually reads a chunk of disk.  It allocates
   * the pages first, then submits them for I/O. This avoids the very bad
   * behaviour which would occur if page allocations are causing VM writeback.
   * We really don't want to intermingle reads and writes like that.
   */
   

https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L205

   /**
   * page_cache_ra_unbounded - Start unchecked readahead.
   * @ractl: Readahead control.
   * @nr_to_read: The number of pages to read.
   * @lookahead_size: Where to start the next readahead.
   *
   * This function is for filesystems to call when they want to start
   * readahead beyond a file's stated i_size.  This is almost certainly
   * not the function you want to call.  Use page_cache_async_readahead()
   * or page_cache_sync_readahead() instead.
   *
   * Context: File is referenced by caller.  Mutexes may be held by caller.
   * May sleep, but will not reenter filesystem to reclaim memory.
   */ 

A few interesting comments from the preallocation step

   /*
    * Partway through the readahead operation, we will have added
    * locked pages to the page cache, but will not yet have submitted
    * them for I/O.  Adding another page may need to allocate memory,
    * which can trigger memory reclaim.  Telling the VM we're in
    * the middle of a filesystem operation will cause it to not
    * touch file-backed pages, preventing a deadlock.  Most (all?)
    * filesystems already specify __GFP_NOFS in their mapping's
    * gfp_mask, but let's be explicit here.
    */

https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L146

   static void read_pages(struct readahead_control *rac)

Then it becomes filesystem specific.

EXT4

https://elixir.bootlin.com/linux/latest/source/fs/ext4/inode.c#L3124

   static int ext4_read_folio(struct file *file, struct folio *folio)

If not found in memory

https://elixir.bootlin.com/linux/latest/source/fs/ext4/readpage.c#L211

   int ext4_mpage_readpages(struct inode *inode,
           struct readahead_control *rac, struct folio *folio)

After a lot of logic about ordering, holes, and so on

https://elixir.bootlin.com/linux/latest/source/block/blk-core.c#L833


   /**
    * submit_bio - submit a bio to the block device layer for I/O
    * @bio: The &struct bio which describes the I/O
    *
    * submit_bio() is used to submit I/O requests to block devices.  It is passed a
    * fully set up &struct bio that describes the I/O that needs to be done.  The
    * bio will be send to the device described by the bi_bdev field.
    *
    * The success/failure status of the request, along with notification of
    * completion, is delivered asynchronously through the ->bi_end_io() callback
    * in @bio.  The bio must NOT be touched by the caller until ->bi_end_io() has
    * been called.
    */

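First thoughts

  • Using WILLNEED on a sequential pattern will just compete with Linux's own ra.
  • If PG_readahead is set, Linux interprets it as the past readahead having succeeded and keeps doing more ra once this block is read (apparently mitigated by some hole detection).
  • When PG_readahead is not set and the block is read: not yet checked how the ra logic handles that case.
  • Using DONTNEED after the read is apparently optimized in the Linux code.
  • Setting the RANDOM or SEQUENTIAL flag effectively influences Linux's default ra, and at no cost.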
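
The last point can be illustrated from user space. A minimal sketch, assuming a Linux system: open_with_hint() is a hypothetical helper name, and passing offset 0 with length 0 applies the advice to the whole file.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_fadvise() under strict -std */
#endif

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * open_with_hint() is a hypothetical helper.  It opens a file read-only
 * and declares the expected access pattern for the whole file, e.g.
 * POSIX_FADV_SEQUENTIAL or POSIX_FADV_RANDOM.
 * Returns the file descriptor, or -1 with errno set on failure.
 */
int open_with_hint(const char *path, int advice)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0 and len 0 mean "from the start to the end of the file". */
    int rc = posix_fadvise(fd, 0, 0, advice);
    if (rc != 0) {
        close(fd);
        errno = rc;    /* posix_fadvise() returns the error number */
        return -1;
    }
    return fd;
}
```

A sequential scan would call open_with_hint(path, POSIX_FADV_SEQUENTIAL) once and then read normally, leaving the prefetching to the kernel's own ra instead of issuing WILLNEED per block.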