Empathizing the Mammoth's Brain: Determining Watermarks in Linux

Watermarks determine the scan rate policy of memory management.

As described in my previous writing, the Linux kernel has three major watermarks: High, Low and Min. This blog is a brief view of how these watermarks are calculated in the Linux system.

1. Read the administrator-tunable value from /proc/sys/vm/min_free_kbytes.

Convert the kbytes unit to pages:

pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
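For example, with 4 KiB pages (PAGE_SHIFT = 12) a page is 2^(PAGE_SHIFT - 10) = 4 KiB, so the conversion is a right shift by 2, i.e., pages_min = min_free_kbytes / 4.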

2. Calculate the total managed pages across all zones, excluding highmem zones (if any exist):

for_each_zone(zone) {
    if (!is_highmem(zone))
        lowmem_pages += zone->managed_pages;
}

For each zone, calculate the following.

3. Calculate the zone's proportional share of pages_min with respect to lowmem_pages:

tmp = (pages_min * zone->managed_pages) / lowmem_pages;

(The kernel multiplies before dividing, in 64-bit arithmetic, so the integer division does not truncate the fraction to zero.)

  • managed_pages is the total count of pages in a zone that are available for allocation.

4. For a HIGHMEM zone, calculate WMARK_MIN using the following calculation:
min_pages = zone->managed_pages / 1024;
min_pages = MIN(MAX(min_pages, SWAP_CLUSTER_MAX), 128);  /* i.e., clamped between SWAP_CLUSTER_MAX and 128 */

5. Otherwise (for non-highmem zones), simply use tmp as the min_pages watermark count.

6. Calculate the distance between the remaining watermarks. It is the fraction managed_pages / 10000, scaled by watermark_scale_factor (default 10):
fraction = (zone->managed_pages / 10000) * watermark_scale_factor;
distance = MAX(tmp >> 2, fraction);

7. Assign watermarks
zone->watermark[WMARK_MIN] = min_pages;
zone->watermark[WMARK_LOW] = min_pages + distance;
zone->watermark[WMARK_HIGH] = min_pages + distance * 2;
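Putting steps 1–7 together, here is a condensed C sketch, loosely modelled on the kernel's __setup_per_zone_wmarks(). The helpers used (for_each_zone, do_div, clamp, max_t, mult_frac) are real kernel macros, but locking and version-specific details are omitted, so treat this as a sketch rather than the exact kernel code:

/* Condensed sketch of steps 1-7, loosely modelled on the kernel's
 * __setup_per_zone_wmarks(); locking and version-specific details
 * are omitted. */
static void setup_per_zone_wmarks_sketch(void)
{
    unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); /* step 1 */
    unsigned long lowmem_pages = 0;
    struct zone *zone;

    for_each_zone(zone)                                    /* step 2 */
        if (!is_highmem(zone))
            lowmem_pages += zone->managed_pages;

    for_each_zone(zone) {
        u64 tmp = (u64)pages_min * zone->managed_pages;    /* step 3 */
        do_div(tmp, lowmem_pages);

        if (is_highmem(zone)) {                            /* step 4 */
            unsigned long min_pages = zone->managed_pages / 1024;
            zone->watermark[WMARK_MIN] =
                clamp(min_pages, (unsigned long)SWAP_CLUSTER_MAX, 128UL);
        } else {                                           /* step 5 */
            zone->watermark[WMARK_MIN] = tmp;
        }

        /* step 6: spacing between successive watermarks */
        u64 distance = max_t(u64, tmp >> 2,
                mult_frac(zone->managed_pages, watermark_scale_factor, 10000));

        /* step 7 */
        zone->watermark[WMARK_LOW]  = zone->watermark[WMARK_MIN] + distance;
        zone->watermark[WMARK_HIGH] = zone->watermark[WMARK_MIN] + distance * 2;
    }
}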

Empathising the Mammoth's Mentality: A Study on Linux Memory Management.

This series of articles attempts a side-by-side analysis of the memory management policies of acclaimed Unix-like operating systems: OpenSolaris, FreeBSD and Linux.

Memory management conventionally has three major policies:

  1. Fetch Policy: When to fetch a page from auxiliary memory.
  2. Placement Policy: Where to place the fetched page in main memory.
  3. Replacement Policy: Which page to replace in case of memory shortage.

The mentioned policies are well studied and reasoned (though I will write on them as well); nevertheless, another policy that directly impacts the performance of a system is the scan rate policy.

The rest of this work will focus on this policy and its approach in different operating systems.

What is a scan rate policy?

Operating systems keep updating the position of allocated pages in their corresponding lists based on access frequency and other parameters. This enables the replacement policy to pick pages for replacement based on their position. Scanning the entire list of allocated pages and updating every page's position each time is an expensive operation. To avoid such a colossal expense, operating systems scan only a portion of the list at a given moment and update the corresponding pages' positions. The policy that decides what fraction to scan at a given instant is termed the scan rate policy.

Generalities in the model. Habitually, on nearing a particular distance to exhaustion, the system starts a kernel-level daemon that runs concurrently with other processes, termed either the pageout daemon (OpenSolaris and OpenBSD) or kswapd (Linux). Albeit the name of the process changes with the platform, its intention remains the same, i.e., scan a certain number of pages, apply the desired replacement policy and evict a set of pages to make leeway for the current memory need. The free list of pages may have several waypoints along the way, called watermarks. In common, a watermark decides when to start the scan, the rate of scan, the number of times to scan, the amount of processor utilisation the page scanner may take (called the CPU cap), the amount of I/O utilisation the page scanner may use (called the I/O cap), etc., but the exact set varies in both implementation and inclusion across platforms. Extensively, many operating systems quantify watermarks in numbers of pages.

Implementation In OpenSolaris:

OpenSolaris uses a watermark-based policy, i.e., the scan rate policy changes based on the watermark.

Model. OpenSolaris has three major watermarks, namely lotsfree, desfree and minfree. The system has two different scan policies, called slow scan and fast scan, and two different rates at which scanning happens, i.e., 4 times/second and 100 times/second. Further, Solaris restricts the CPU utilisation of the page scanner to between 4 (min_percent_cpu) and 80 (max_percent_cpu) percent (used as a scaling factor), and I/O is capped at either 40 or 60 I/Os per second (maxpgio). This restriction is based on the formula \frac{diskRPM \times 2}{3}; intuitively, it means that keeping the disk arm at most \frac{2}{3} busy is all that is tolerable for paging.

By default, lotsfree is \frac{1}{64} of total memory, desfree is \frac{1}{2} of lotsfree and minfree is \frac{1}{2} of desfree. Fast scan is scaled to \frac{TotalMemory}{2} and slow scan is hard-coded to 100. All of this value initialisation is done in the function setupclock.
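For a concrete feel (my arithmetic, not from the sources): on a machine with 4 GiB of memory, lotsfree = 4 GiB / 64 = 64 MiB, desfree = 32 MiB and minfree = 16 MiB, while fastscan would be scaled to 2 GiB worth of pages and slowscan stays at 100.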

Another watermark worth mentioning is throttlefree. This watermark is generally equated with minfree.

[Fig 1: OpenSolaris scan rate with respect to the watermarks]

 

Policy. Once free memory reaches the lotsfree watermark, the pageout daemon kickstarts. The number of pages to scan is determined by the formula:

scanrate = \frac{lotsfree - freeMem}{lotsfree} \times fastscan + \frac{freeMem}{lotsfree} \times slowscan

where freeMem is the current amount of free memory, in pages. Intuitively, the initial scan rate is contributed mostly by slowscan, but as memory nears exhaustion, fastscan's contribution to the scan rate grows.
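As a rough C sketch of this interpolation (the function and parameter names are mine, not OpenSolaris identifiers):

/* Sketch of the OpenSolaris scan-rate interpolation; all values in pages.
 * fastscan, slowscan and lotsfree would be initialised in setupclock(). */
static unsigned long scan_rate(unsigned long free_mem, unsigned long lotsfree,
                               unsigned long slowscan, unsigned long fastscan)
{
    /* all slowscan when freeMem == lotsfree, all fastscan when freeMem == 0 */
    return ((lotsfree - free_mem) * fastscan) / lotsfree +
           (free_mem * slowscan) / lotsfree;
}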

CPU utilisation is interpolated with the scan rate by sporadically checking the time taken to scan n pages (n is governed by PAGES_POLL_MASK, hard-coded to 1023) and comparing it with the allowed utilisation.

  • CPU utilisation is calculated as follows (a C sketch follows this list):
  1. min_pageout_ticks = \frac{hz \times min_percent_cpu}{100}
  2. max_pageout_ticks = \frac{hz \times max_percent_cpu}{100}
  3. ScalingFactor = max_pageout_ticks - min_pageout_ticks
  4. pageout_ticks = min_pageout_ticks + \frac{lotsfree - freeMem}{lotsfree} \times ScalingFactor
  • Basically, the deficit of free memory relative to lotsfree is mapped onto the CPU ScalingFactor.
  • Between lotsfree and desfree, memory is scanned 4 times/second. Once free memory drops below desfree, it is scanned 100 times/second.
  • The throttlefree watermark suspends any page allocation that has the wait flag set until the scanner brings free memory back above minfree.
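A minimal C sketch of this CPU-cap interpolation (the names are illustrative, not the exact OpenSolaris identifiers; the percentages are the defaults quoted above):

/* Minimal sketch of the CPU-cap interpolation described above.
 * hz is clock ticks per second; all memory values are in pages. */
static unsigned long pageout_ticks(unsigned long free_mem,
                                   unsigned long lotsfree,
                                   unsigned long hz)
{
    unsigned long min_ticks = (hz * 4) / 100;   /* min_percent_cpu = 4  */
    unsigned long max_ticks = (hz * 80) / 100;  /* max_percent_cpu = 80 */
    unsigned long scaling = max_ticks - min_ticks;

    /* The deeper the deficit below lotsfree, the more CPU the scanner gets. */
    return min_ticks + ((lotsfree - free_mem) * scaling) / lotsfree;
}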

P.S. The same logic is implemented in 4.3BSD, but lotsfree is \frac{1}{4} of total memory, desfree is \frac{1}{2} of lotsfree and minfree is \frac{1}{2} of desfree.

Implementation in FreeBSD:

FreeBSD has a much simpler scan policy compared to Solaris and 4.3BSD. The watermarks are replaced by a minimum percentage of memory that needs to be maintained. If memory falls below this threshold, the pageout daemon is started.

Model. FreeBSD divides pages into five pools, namely wired, active, inactive, cache and free. Pages that are unmovable from physical memory are wired, pages currently part of active execution are active, pages no longer part of active execution are inactive, pages that are not used anymore are cached, and pages that have not been allocated sit on the free list.

Policy. The system has two thresholds, namely a minimum and a target threshold. The system's goal is to maintain at least the minimum threshold of pages in the cache, free and inactive lists; once the counts fall below the minimum threshold, the pageout daemon kickstarts and pushes the pages back up to the target threshold. By default the (minimum, target) values are:

  • Free + Cache (free_target): 3.7% and 9%
  • Inactive (inactive_target): 0% and 4.5%

These values are, however, modifiable through user-level interfaces.

Procedure:

  1. The pageout daemon calculates the minimum number of pages needed to get back above the threshold.
  2. It tries to acquire the required pages in up to three passes (see the sketch after this list):
    1. page_shortage = free_target - (free_count + cache_count). Try to reclaim this many pages; if the achieved count does not satisfy the target, move to pass 2.
    2. page_shortage = (free_target + inactive_target) - (free_count + cache_count + inactive_count). Try to reclaim this many pages; if the achieved count still falls short, move to pass 3.
    3. Kickstart the swap-out daemon; if memory is still exhausted, kill the process consuming the most memory.
  3. After the first pass, check the time elapsed since the previous pass. If it is more than lowmem_period (10 seconds), trigger the vm_lowmem event, which asks other registered caches to shrink and reclaims the freed memory.
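A condensed C sketch of the pass-1/pass-2 shortage arithmetic (variable names follow the description above rather than the exact FreeBSD sources):

/* Shortage computation per pass; all counts are in pages. Names follow
 * the text above, not the exact kernel variables. */
long shortage_for_pass(int pass,
                       long free_target, long inactive_target,
                       long free_count, long cache_count, long inactive_count)
{
    if (pass == 1)
        return free_target - (free_count + cache_count);
    /* pass 2 additionally counts the inactive pool against its target */
    return (free_target + inactive_target) -
           (free_count + cache_count + inactive_count);
}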

Implementation in Linux:

Linux uses watermarks amalgamated with demand-based scanning, i.e., on allocating a page, the watermark is tested; if the free page count before the allocation is less than desired, the zone is reclaimed directly and kswapd is also started if required. The scan amount is based on a user-defined parameter called swappiness. The reclamation policy is described below.

Model. Memory (a node, on NUMA-based machines) is basically divided into zones. Each zone has High, Low and Min watermarks. The ratio of the watermarks can be set by the user. The scan rate is determined by a variable called swappiness; the default value is 60, it can also be set by the user, and the maximum recommended value is 100.

[Fig 2: Linux scan rate with respect to the watermarks]

Policy.

  1. Before allocation, a watermark test is made by adding the request's allocation order to the free page count. If the result is below the Low watermark and the GFP flags do not include __GFP_MEMALLOC, then direct reclamation of the zone is tried with a count of SWAP_CLUSTER_MAX, provided __GFP_DIRECT_RECLAIM is set for the allocation request.
    • If the GFP flags include __GFP_KSWAPD_RECLAIM, then kswapd is woken to reclaim MAX(SWAP_CLUSTER_MAX, high watermark) pages.
  2. If free pages fall below the Min watermark, synchronous direct reclamation is attempted, i.e., any allocation request with __GFP_WAIT set is suspended on the wait queue. Concurrently, kswapd tries to bring the free pages back above the High watermark.
  3. In both cases, reclaim uses the following mechanism to decide how many pages to scan (a sketch follows this list).
    • The priority field of struct scan_control decides the portion of the zone to scan.
    • Anonymous page priority: ap = swappiness.
    • File page priority: fp = 200 - ap.
    • Pressure is applied to each zone's file or anon LRU list, derived as:
      • pressure \propto \frac{1}{\frac{rotated}{scanned}}, where rotated and scanned count the pages rotated and scanned on that list.
      • pressure = respective priority \times \frac{scanned}{rotated}, where the respective priority is either the anonymous (ap) or file (fp) page priority.
    • Fraction-per-list computation:
      • denominator = ap + fp.
      • numerator = ap (or) fp.
      • fraction to scan = \frac{numerator}{denominator}.
    • Scaling: the fraction to scan is applied to scan_portion:
      • scan_amount = fraction to scan \times scan_portion.
      • Basically, the scan portion is scaled by the fraction we derived.
    • If the priority is zero and swappiness is non-zero, the scan count is set to SCAN_EQUAL, meaning the fraction calculation is ignored and the scan count is derived directly from the priority for each inactive list.
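A toy C sketch of the fraction computation above, loosely inspired by the kernel's get_scan_count() (the names and the arithmetic here are simplifications, not the kernel's exact fixed-point code):

/* Toy sketch of the swappiness-based anon/file split described above.
 * Loosely inspired by get_scan_count(); simplified, not kernel code. */
struct lru_stats {
    unsigned long scanned;  /* pages scanned on this LRU list           */
    unsigned long rotated;  /* pages rotated back (recently referenced) */
};

void split_scan(unsigned long scan_portion, unsigned int swappiness,
                const struct lru_stats *anon, const struct lru_stats *file,
                unsigned long *scan_anon, unsigned long *scan_file)
{
    unsigned long ap = swappiness;        /* anonymous page priority */
    unsigned long fp = 200 - swappiness;  /* file page priority      */

    /* pressure = priority * scanned / rotated: a list that rotates most
     * of what it scans keeps low pressure; light rotation boosts it. */
    if (anon->rotated)
        ap = ap * anon->scanned / anon->rotated;
    if (file->rotated)
        fp = fp * file->scanned / file->rotated;

    unsigned long denom = (ap + fp) ? (ap + fp) : 1;  /* degenerate-case guard */
    *scan_anon = scan_portion * ap / denom;   /* fraction = ap / (ap + fp) */
    *scan_file = scan_portion * fp / denom;
}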

P.S. The statistical variables rotated and scanned are halved once they reach a threshold.

Intuitively, this is a heuristic built on the belief that the more pages are rotated (see the second-chance algorithm) in a particular LRU list, the lower the possibility of reclaiming pages from it. Finally, it is worth noting that the described policies are simplified to provide clarity on the core idea.

Detailed Explanation of Linux Scan Count Determination.

Understanding Malloc – Part 3 (SLOB)

This issue is a continuation of the previous instalment on the Simple List Of Blocks (SLOB). In the previous section we saw the logical perspective of SLOB; in this section we look into its implementation.

Data Structure. To start with, let's derive the data structures necessary to make the implementation work. Usually, the correct selection of data structures is the key to designing a good algorithm. For SLOB we primarily need two data structures: one is the list holding chunks of the same size, and the other holds the chunk frame itself.

The List. Each element in the list needs to be a collection holding at least two members: a variable holding the size of the chunks, and the head of the chunk frames. The list we intend to design could be of two kinds, contiguous or disjoint. Each has its own advantages and disadvantages: a contiguous list saves the space of holding a reference to the next element, but prevents the list from further expansion. In this implementation we design the blocks as a disjoint linked list and the list itself as contiguous. This approach imitates the implementation of many real libraries.


struct list {
    unsigned int size; /// holds the size of the blocks.
    void *head;  /// head of list of blocks.
};

The Blocks. From an implementation perspective, the blocks are simply a list that grows and shrinks horizontally, whereas the list holding the blocks is represented vertically and remains constant. The data structure that defines a block is basically a collection holding a header, the data itself and a reference to the next element in the list (NULL if there is none). Creating this structure is quite tricky, which we discuss in the coming sections. For now we just stick with the declaration:

struct block {
     struct list *lst; ///the reference to the list the block belongs.
     struct block *nxt; ///reference to nxt block in the list.
     void *block;  ///The block or payload itself.
};

The beginning of the structure holds the reference to the list to which the block belongs; this makes returning a block to its list (free) a constant-time operation. It is followed by the reference to the next block, and finally the chunk itself, which is returned to the user on malloc.

Construction of the data structure. Before we start addressing the construction, we need an approach for calling an init function before main(). Calling init before main() provides an abstract view of the library calls; moreover, this is the approach almost all library-level implementations follow.

To make this work, GCC has a set of compiler directives named attributes. Attributes extend the language syntax, i.e., they instruct the compiler on how a particular function or variable should be treated during the compilation or linking phase. To define an attribute:


__attribute__((attribute_list)) <function or variable>.

 

Of the many possible attributes, our area of interest is constructor and destructor.

From the horse's mouth: "The constructor attribute causes the function to be called automatically before execution enters main (). Similarly, the destructor attribute causes the function to be called automatically after main () has completed or exit () has been called".

To understand how the mentioned attributes work, let me make a confession about main(): main is not the first function to execute when a process starts. Rather, there are many other preparation routines that execute before and after main().

A rough view of where main() sits in the whole picture:

_start     // First symbol to start execution of a process
.        // stored at location 0x080482d0
.
.
call constructors
ret = main();
call destructors
.
.
copy the ret value into parent
signal parent with sigchld 

Though the real implementation is much more complex, I presume this level of explanation suffices for a pictorial understanding.
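To see the mechanism in isolation, here is a minimal, compilable demonstration (my example, not from the article's sources):

#include <stdio.h>

/* Runs before execution enters main(). */
__attribute__((constructor)) static void before_main(void)
{
    puts("constructor");
}

/* Runs after main() returns or exit() is called. */
__attribute__((destructor)) static void after_main(void)
{
    puts("destructor");
}

int main(void)
{
    puts("main");
    return 0;   /* output order: constructor, main, destructor */
}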

To make this work, the shared object file contains special sections (.ctors and .dtors on ELF) which hold references to the functions marked with the constructor and destructor attributes, respectively. When the library is loaded/unloaded, the dynamic loader (ld.so) checks whether such sections exist and, if so, calls the functions referenced therein.

Now let's get back to our implementation of the data structure construction. A rough sketch of the implementation:

  • Create a contiguous list, i.e., an array of struct list, sized by the number of heterogeneous block sizes the process needs.
  • For each set of homogeneous blocks, call sbrk() with size (sizeof(header) + BLOCK_SIZE) * number of blocks in the set.
  • Dissect each allocated region into individual chunk frames.

Let's make this into a small snippet, where initialisation is done with a single element on the list.


#include <unistd.h>  /* sbrk() */

#define SLOB_LEN 1
struct list slob[SLOB_LEN];

/* make_blocks() is defined below; declared here because init() uses it */
void* make_blocks(const int block_size, const int no_of_blks, struct list* list);

__attribute__((constructor)) void init(void)
{
    int i = 0;
    do {
        slob[i].size = 8; /// hard coded for simplicity
        slob[i].head = make_blocks(8, 2, &slob[i]);
        i++;
    } while(i < SLOB_LEN);
}

/*
  @ FUNCTION NAME: make_blocks
  @ DESCRIPTION: makes the blocks with given specifications
  @ INPUT: block_size: size of the block
           no_of_blks: count of the blocks
           list      : list to which the block belongs
  @ OUTPUT: returns the starting address of the blocks created
*/
void* make_blocks(const int block_size, const int no_of_blks, struct list* list)
{
    /* each chunk frame = header (struct block) + payload */
    void *chnk = sbrk((block_size + sizeof(struct block)) * no_of_blks);
    int i = 0;
    struct block *blks = NULL;

    do {
        struct block *tmp = (struct block*) chnk;  /* type cast chnk into header */

        /* Header construction */
        tmp->lst = list;
        tmp->nxt = blks;     /* prepend to the list */
        /* The block (payload) itself starts right after the header */
        tmp->block = tmp + 1;
        blks = tmp;

        chnk = (char *)chnk + (sizeof(struct block) + block_size);
        i++;
    } while(i < no_of_blks);
    return blks;
}

This might look a little complicated at first sight, but on a second glance I can guarantee it's pretty straightforward.

Malloc. This is pretty straightforward: just iterate through the list to find the size that fits best. SLOB is effectively best-fit, because the list is arranged in ascending order of size.

void* Malloc(const int size)
{
    int i = 0;
    void *p = NULL;
    do {
        if(slob[i].size >= (unsigned int)size && slob[i].head) {
            struct block *head = slob[i].head;
            p = head->block;           /* hand out the payload     */
            slob[i].head = head->nxt;  /* pop it off the free list */
            break;  /* ascending sizes: the first fit is the best fit */
        }
        i++;
    } while(i < SLOB_LEN);
    return p;
}

 

Free. The field in the struct block holding the reference to the list enables return to the list in constant time.


void free(void* blk)
{
    struct block *p = blk;
    p--;                    /* step back from the payload to its header */
    p->nxt = p->lst->head;  /* prepend: the old head becomes our next   */
    p->lst->head = p;
}
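A tiny usage sketch, assuming the snippets above are compiled together in one file (the 8-byte block size is the value hard-coded in init()):

#include <string.h>

int main(void)
{
    char *s = Malloc(8);      /* served from the 8-byte list */
    if (s) {
        strcpy(s, "slob!");
        free(s);              /* constant-time return to its list */
    }
    return 0;
}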

Conclusion.

  • To implement SLOB, we need to know the sizes and counts of the blocks in advance, which somewhat reduces the real dynamism.
  • SLOB is among the fastest allocation systems in this series.
  • The real implementation can be found at: SLOB

The next article will take us into true dynamism: the buddy system.

References

  1. https://en.wikipedia.org/wiki/Directive_(programming)
  2. https://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html#Attribute-Syntax
  3. https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Function-Attributes.html
  4. http://www.tldp.org/LDP/LG/issue84/hawk.html