
Problem:
   When Linux needs to allocate memory it may find that there is
   insufficient free memory so it needs to reclaim space that is in
   use but not needed at the moment.  There are several options:

   1/ Shrink a kernel cache such as the inode or dentry cache.  This
      is fairly easy but provides limited returns.
   2/ Discard 'clean' pages from the page cache.  This is easy, and
      works well as long as there are clean pages in the page cache.
      Similarly clean 'anonymous' pages can be discarded - if there
      are any.
   3/ Write out some dirty page-cache pages so that they become clean.
      The VM limits the number of dirty page-cache pages to e.g. 40%
      of available memory so that (among other reasons) a "sync" will
      not take excessively long.  So there should never be excessive
      amounts of dirty pagecache.
      Writing out dirty page-cache pages involves work by the
      filesystem which may need to allocate memory itself.  To avoid
      deadlock, filesystems use GFP_NOFS when allocating memory on the
      write-out path.  When this is used, cleaning dirty page-cache
      pages is not an option so if the filesystem finds that  memory
      is tight, another option must be found.
   4/ Write out dirty anonymous pages to the "Swap" partition/file.
      This is the most interesting for a couple of reasons.
      a/ Unlike dirty page-cache pages, there is no need to write anon
         pages out unless we are actually short of memory.  Thus they
         tend to be left to last.
      b/ Anon pages tend to be updated randomly and unpredictably, and
         flushing them out of memory can have a very significant
         performance impact on the process using them.  This contrasts
         with page-cache pages which are often written sequentially
         and often treated as "write-once, read-many".
      So anon pages tend to be left until last to be cleaned, and may
      be the only cleanable pages while there are still some dirty
      page-cache pages (which are waiting on a GFP_NOFS allocation).

[I don't find the above wholly satisfying.  There seems to be too much
 hand-waving.  If someone can provide better text explaining why
 swapout is a special case, that would be great.]

So we need to be able to write to the swap file/partition without
needing to allocate any memory ... or only a small well controlled
amount.

The VM reserves a small amount of memory that can only be allocated
for use as part of the swap-out procedure.  It is only available to
processes with the PF_MEMALLOC flag set, which is typically just the
memory cleaner.

Traditionally swap-out is performed directly to block devices (swap
files on block-device filesystems are supported by examining the
mapping from file offset to device offset in advance, and then using
the device offsets to write directly to the device).  Block devices
are (required to be) written to pre-allocate any memory that might be
needed during write-out, and to block when the pre-allocated memory is
exhausted and no other memory is available.  They can be sure not to
block forever as the pre-allocated memory will be returned as soon as
the data it is being used for has been written out.  The primary
mechanism for pre-allocating memory is called "mempools".

This approach does not work for writing anonymous pages
(i.e. swapping) over a network, using e.g NFS or NBD or iSCSI.


The main reason that it does not work is that when data from an anon
page is written to the network, we must wait for a reply to confirm
the data is safe.  Receiving that reply will consume memory and,
significantly, we need to allocate memory to an incoming packet before
we can tell if it is the reply we are waiting for or not.

The secondary reason is that the network code is not written to use
mempools and in most cases does not need to use them.  Changing all
allocations in the networking layer to use mempools would be quite
intrusive, and would waste memory, and probably cause a slow-down in
the common case of not swapping over the network.

These problems are addressed by enhancing the system of memory
reserves used by PF_MEMALLOC and requiring any in-kernel networking
client that is used for swap-out to indicate which sockets are used
for swapout so they can be handled specially in low memory situations.

There are several major parts to this enhancement:

1/ page->reserve, GFP_MEMALLOC

  To handle low memory conditions we need to know when those
  conditions exist.  Having a global "low on memory" flag seems easy,
  but its implementation is problematic.  Instead we make it possible
  to tell if a recent memory allocation required use of the emergency
  memory pool.
  For pages returned by alloc_page, the new page->reserve flag
  can be tested.  If this is set, then a low memory condition was
  current when the page was allocated, so the memory should be used
  carefully. (Because low memory conditions are transient, this
  state is kept in an overloaded member instead of in page flags, which
  would suggest a more permanent state.)

  For memory allocated using slab/slub: If a page that is added to a
  kmem_cache is found to have page->reserve set, then a  s->reserve
  flag is set for the whole kmem_cache.  Further allocations will only
  be returned from that page (or any other page in the cache) if they
  are emergency allocation (i.e. PF_MEMALLOC or GFP_MEMALLOC is set).
  Non-emergency allocations will block in alloc_page until a
  non-reserve page is available.  Once a non-reserve page has been
  added to the cache, the s->reserve flag on the cache is removed.

  Because slab objects have no individual state its hard to pass
  reserve state along, the current code relies on a regular alloc
  failing. There are various allocation wrappers help here.

  This allows us to
   a/ request use of the emergency pool when allocating memory
     (GFP_MEMALLOC), and
   b/ to find out if the emergency pool was used.

2/ SK_MEMALLOC, sk_buff->emergency.

  When memory from the reserve is used to store incoming network
  packets, the memory must be freed (and the packet dropped) as soon
  as we find out that the packet is not for a socket that is used for
  swap-out.
  To achieve this we have an ->emergency flag for skbs, and an
  SK_MEMALLOC flag for sockets.
  When memory is allocated for an skb, it is allocated with
  GFP_MEMALLOC (if we are currently swapping over the network at
  all).  If a subsequent test shows that the emergency pool was used,
  ->emergency is set.
  When the skb is finally attached to its destination socket, the
  SK_MEMALLOC flag on the socket is tested.  If the skb has
  ->emergency set, but the socket does not have SK_MEMALLOC set, then
  the skb is immediately freed and the packet is dropped.
  This ensures that reserve memory is never queued on a socket that is
  not used for swapout.

  Similarly, if an skb is ever queued for delivery to user-space for
  example by netfilter, the ->emergency flag is tested and the skb is
  released if ->emergency is set. (so obviously the storage route may
  not pass through a userspace helper, otherwise the packets will never
  arrive and we'll deadlock)

  This ensures that memory from the emergency reserve can be used to
  allow swapout to proceed, but will not get caught up in any other
  network queue.


3/ pages_emergency

  The above would be sufficient if the total memory below the lowest
  memory watermark (i.e the size of the emergency reserve) were known
  to be enough to hold all transient allocations needed for writeout.
  I'm a little blurry on how big the current emergency pool is, but it
  isn't big and certainly hasn't been sized to allow network traffic
  to consume any.

  We could simply make the size of the reserve bigger. However in the
  common case that we are not swapping over the network, that would be
  a waste of memory.

  So a new "watermark" is defined: pages_emergency.  This is
  effectively added to the current low water marks, so that pages from
  this emergency pool can only be allocated if one of PF_MEMALLOC or
  GFP_MEMALLOC are set.

  pages_emergency can be changed dynamically based on need.  When
  swapout over the network is required, pages_emergency is increased
  to cover the maximum expected load.  When network swapout is
  disabled, pages_emergency is decreased.

  To determine how much to increase it by, we introduce reservation
  groups....

3a/ reservation groups

  The memory used transiently for swapout can be in a number of
  different places.  e.g. the network route cache, the network
  fragment cache, in transit between network card and socket, or (in
  the case of NFS) in sunrpc data structures awaiting a reply.
  We need to ensure each of these is limited in the amount of memory
  they use, and that the maximum is included in the reserve.

  The memory required by the network layer only needs to be reserved
  once, even if there are multiple swapout paths using the network
  (e.g. NFS and NDB and iSCSI, though using all three for swapout at
  the same time would be unusual).

  So we create a tree of reservation groups.  The network might
  register a collection of reservations, but not mark them as being in
  use.  NFS and sunrpc might similarly register a collection of
  reservations, and attach it to the network reservations as it
  depends on them.
  When swapout over NFS is requested, the NFS/sunrpc reservations are
  activated which implicitly activates the network reservations.

  The total new reservation is added to pages_emergency.

  Provided each memory usage stays beneath the registered limit (at
  least when allocating memory from reserves), the system will never
  run out of emergency memory, and swapout will not deadlock.

  It is worth noting here that it is not critical that each usage
  stays beneath the limit 100% of the time.  Occasional excess is
  acceptable provided that the memory will be freed  again within a
  short amount of time that does *not* require waiting for any event
  that itself might require memory.
  This is because, at all stages of transmit and receive, it is
  acceptable to discard all transient memory associated with a
  particular writeout and try again later.  On transmit, the page can
  be re-queued for later transmission.  On receive, the packet can be
  dropped assuming that the peer will resend after a timeout.

  Thus allocations that are truly transient and will be freed without
  blocking do not strictly need to be reserved for.  Doing so might
  still be a good idea to ensure forward progress doesn't take too
  long.

4/ low-mem accounting

  Most places that might hold on to emergency memory (e.g. route
  cache, fragment cache etc) already place a limit on the amount of
  memory that they can use.  This limit can simply be reserved using
  the above mechanism and no more needs to be done.

  However some memory usage might not be accounted with sufficient
  firmness to allow an appropriate emergency reservation.  The
  in-flight skbs for incoming packets is one such example.

  To support this, a low-overhead mechanism for accounting memory
  usage against the reserves is provided.  This mechanism uses the
  same data structure that is used to store the emergency memory
  reservations through the addition of a 'usage' field.

  Before we attempt allocation from the memory reserves, we much check
  if the resulting 'usage' is below the reservation. If so, we increase
  the usage and attempt the allocation (which should succeed). If
  the projected 'usage' exceeds the reservation we'll either fail the
  allocation, or wait for 'usage' to decrease enough so that it would
  succeed, depending on __GFP_WAIT.

  When memory that was allocated for that purpose is freed, the
  'usage' field is checked again.  If it is non-zero, then the size of
  the freed memory is subtracted from the usage, making sure the usage
  never becomes less than zero.

  This provides adequate accounting with minimal overheads when not in
  a low memory condition.  When a low memory condition is encountered
  it does add the cost of a spin lock necessary to serialise updates
  to 'usage'.



5/ swapon/swapoff/swap_out/swap_in

  So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on
  any network socket that it uses, and can know when to account
  reserve memory carefully, new address_space_operations are
  available.
  "swapon" requests that an address space (i.e a file) be make ready
  for swapout.  swap_out and swap_in request the actual IO.  They
  together must ensure that each swap_out request can succeed without
  allocating more emergency memory that was reserved by swapon. swapoff
  is used to reverse the state changes caused by swapon when we disable
  the swap file.


Thanks for reading this far.  I hope it made sense :-)

Neil Brown (with updates from Peter Zijlstra)


