Memory Management

What does memory management mainly do?#

  • Memory allocation and recovery: Allocates and releases memory required by processes. In C, for example, the malloc function requests memory and the free function releases it (see the sketch after this list).
  • Address translation: Converts virtual addresses in the program to physical addresses in memory.
  • Memory expansion: When the system does not have enough memory, it logically expands memory using virtual memory technology or automatic overlay technology.
  • Memory mapping: Maps a file directly into the process's address space, allowing file content to be accessed directly through memory pointers, which is faster.
  • Memory optimization: Optimizes memory usage efficiency by adjusting memory allocation strategies and recovery algorithms.
  • Memory safety: Ensures that processes do not interfere with each other's memory usage, preventing malicious programs from compromising system security by modifying memory.
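
As a minimal illustration of the allocation/release pair mentioned above (the buffer size and contents are arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Request 64 bytes of (virtual) memory from the allocator. */
    char *buf = malloc(64);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    strcpy(buf, "hello, memory management");
    printf("%s (at %p)\n", buf, (void *)buf);

    free(buf);  /* Release the block back to the allocator. */
    buf = NULL; /* Avoid a dangling pointer. */
    return 0;
}
```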

What is memory fragmentation?#

Memory fragmentation is caused by the allocation and release of memory, typically divided into the following two types:

  • Internal Memory Fragmentation: Memory that has been allocated to a process but is not used. The main cause of internal memory fragmentation is that when memory is allocated using fixed sizes, such as powers of 2, the allocated memory may be larger than what the process actually needs. For example, if a process only needs 65 bytes of memory but is allocated 128 bytes (2^7), then 63 bytes become internal memory fragmentation.
  • External Memory Fragmentation: Occurs when the unallocated contiguous memory regions are too small to satisfy any process's memory allocation request. These small, non-contiguous memory spaces are referred to as external fragmentation. In other words, external memory fragmentation refers to memory that is not allocated to processes but cannot be used. The segmentation mechanism discussed later will lead to external memory fragmentation.

What are common memory management methods?#

Contiguous Memory Management#

Block Management
Block management divides memory into several fixed-size blocks, each holding only one process. When a program needs memory to run, the operating system allocates a block to it; if the program needs only a small space, much of the allocated block is wasted. The unused space inside each block is internal memory fragmentation. Besides internal fragmentation, external memory fragmentation can also appear between two memory blocks: gaps too small to be allocated again.
In Linux systems, contiguous memory management is implemented with the Buddy System algorithm, a classic contiguous-allocation algorithm that effectively addresses external memory fragmentation. Its main idea is to divide memory into blocks whose sizes are powers of 2 (e.g., 2^6 = 64 KB) and to pair adjacent memory blocks as buddies (note: only adjacent blocks can be paired).
When allocating memory, the Buddy System tries to find the most suitable block size. If the found block is too large, it splits it into two equal-sized buddy blocks. If it is still too large, it continues to split until it reaches an appropriate size.
If two adjacent memory blocks are freed, the system will merge these two blocks to form a larger block for subsequent memory allocation. This reduces memory fragmentation and improves memory utilization.
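
A minimal sketch of the two core buddy-system calculations, assuming a smallest block of 64 KB as in the example above: rounding a request up to a power-of-2 block size, and locating a block's buddy by XOR-ing its offset with the block size (which flips exactly the bit that distinguishes the pair):

```c
#include <stddef.h>
#include <stdio.h>

#define MIN_BLOCK (64 * 1024)  /* smallest block: 2^16 = 64 KB */

/* Round a request up to the smallest power-of-2 block that fits. */
static size_t buddy_block_size(size_t request) {
    size_t size = MIN_BLOCK;
    while (size < request)
        size <<= 1;  /* keep doubling: 64 KB, 128 KB, 256 KB, ... */
    return size;
}

/* Two blocks are buddies iff they are adjacent, equal-sized, and were
 * split from the same parent; XOR-ing the offset with the block size
 * flips exactly the bit that distinguishes the pair. */
static size_t buddy_offset(size_t offset, size_t size) {
    return offset ^ size;
}

int main(void) {
    size_t size = buddy_block_size(100 * 1024);   /* -> 128 KB */
    printf("allocated block size: %zu KB\n", size / 1024);
    printf("buddy of offset 0: offset %zu KB\n",
           buddy_offset(0, size) / 1024);
    return 0;
}
```

Freeing reverses the process: if the block at the buddy offset is also free, the pair is merged into a block of twice the size, and the merge cascades upward.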

Non-contiguous Memory Management#

Non-contiguous memory management has the following three methods:

  • Segmented Management: Manages/allocates physical memory in segments (contiguous physical memory). The virtual address space of an application is divided into segments of varying sizes, which have actual significance. Each segment defines a set of logical information, such as the main program segment MAIN, subroutine segment X, data segment D, and stack segment S.
  • Paged Management: Divides physical memory into contiguous pages of equal length, and the virtual address space of the application is also divided into contiguous virtual pages of equal length. This is a widely used memory management method in modern operating systems.
  • Segmented Paging Management Mechanism: A memory management mechanism that combines segmented management and paged management, where physical memory is first divided into several segments, and each segment is further divided into several equal-sized pages.

Virtual Memory#

What is virtual memory? What is its use?#

Virtual Memory is a very important technology in computer system memory management. Essentially, it exists only logically, as an imagined memory space; it primarily serves as a bridge between processes and main memory (physical memory) and simplifies memory management.
In summary, virtual memory mainly provides the following capabilities:

  • Process Isolation: Physical memory is accessed through virtual address space, which corresponds one-to-one with processes. Each process believes it has the entire physical memory, and processes are isolated from each other; code in one process cannot modify the physical memory used by another process or the operating system.
  • Improved Physical Memory Utilization: With virtual address space, the operating system only needs to load the portion of data or instructions currently being used by the process into physical memory.
  • Simplified Memory Management: Each process has a consistent and private virtual address space, allowing programmers to access physical memory through virtual address space without dealing with actual physical memory, thus simplifying memory management.
  • Multiple Processes Sharing Physical Memory: During execution, processes load many dynamic libraries from the operating system. These libraries are shared among all processes, and only one copy is actually loaded into memory, referred to as shared memory.
  • Increased Memory Usage Security: Controls process access to physical memory, isolates access permissions between different processes, and enhances system security.
  • Provides a Larger Usable Memory Space: Allows programs to have usable memory space exceeding the size of the system's physical memory. This is because when physical memory is insufficient, the disk can be used as a substitute, saving physical memory pages (usually 4 KB in size) to disk files (which affects read and write speed), and data or code pages can be moved between physical memory and disk as needed.

What are virtual addresses and physical addresses?#

Physical Address is the address in actual physical memory, specifically the address in the memory address register. The memory addresses accessed in programs are not physical addresses but Virtual Addresses.
In other words, when we are developing programs, we are actually dealing with virtual addresses. For example, in C language, the value stored in a pointer can be understood as an address in memory, which is what we refer to as a virtual address.
The operating system generally uses an important component in the CPU chip called the MMU (Memory Management Unit) to convert virtual addresses into physical addresses, a process known as Address Translation.

How are virtual addresses mapped to physical memory addresses?#

The main mechanisms by which the MMU translates virtual addresses to physical addresses are:

  1. Segmentation mechanism
  2. Paging mechanism
  3. Segmented paging mechanism

Segmentation Mechanism#

Segmentation Mechanism manages/allocates physical memory in segments (contiguous physical memory). The virtual address space of an application is divided into segments of varying sizes, which have actual significance. Each segment defines a set of logical information, such as the main program segment MAIN, subroutine segment X, data segment D, and stack segment S.

What is the purpose of the segment table? What is the address translation process?#

In the segmentation mechanism, the virtual address consists of two parts: segment selector and offset within the segment.


Segment selector and offset within the segment:

  • The segment selector is stored in a segment register. Its most important field is the segment number, which indexes the segment table. Each segment table entry contains the segment's base address, the segment limit, the privilege level, and so on.
  • The offset within the segment must lie between 0 and the segment limit. If the offset is valid, the physical memory address is obtained by adding the segment base address to the offset, as sketched below.
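
A minimal sketch of that lookup, with a hypothetical four-entry segment table (the field names and addresses are illustrative, not a real hardware layout):

```c
#include <stdint.h>
#include <stdio.h>

struct segment_entry {
    uint32_t base;   /* segment base address   */
    uint32_t limit;  /* segment length (bytes) */
};

/* Illustrative segment table: MAIN, X, D, S. */
static struct segment_entry seg_table[] = {
    {0x10000, 0x4000},  /* 0: main program segment */
    {0x20000, 0x1000},  /* 1: subroutine segment   */
    {0x30000, 0x2000},  /* 2: data segment         */
    {0x40000, 0x1000},  /* 3: stack segment        */
};

/* Translate (segment number, offset) to a physical address;
 * returns 0 when the offset is out of bounds (a fault). */
static int translate(uint32_t seg, uint32_t offset, uint32_t *phys) {
    if (offset >= seg_table[seg].limit)
        return 0;
    *phys = seg_table[seg].base + offset;
    return 1;
}

int main(void) {
    uint32_t phys;
    if (translate(2, 0x42, &phys))  /* data segment, offset 0x42 */
        printf("physical address: 0x%x\n", phys);
    return 0;
}
```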

From the above, we know that a virtual address is mapped to a physical address through the segment table. The segmentation mechanism divides the program's virtual address space into segments (four in the example above), each with its own entry in the segment table; the segment base address found there, plus the offset, yields the address in physical memory.
Segmentation neatly solves the problem of the program having to care about specific physical memory addresses, but it also has some shortcomings:

  • The first is the issue of memory fragmentation.
  • The second is the problem of low memory-swapping efficiency.

Does memory segmentation lead to memory fragmentation?#

Memory segmentation management can allocate memory according to actual needs, so the size of the segment allocated corresponds to the demand, thus internal memory fragmentation does not occur.
However, since the length of each segment is not fixed, multiple segments may not perfectly use all available memory space, leading to several non-contiguous small physical memory spaces, resulting in the problem of external memory fragmentation.
The solution to the "external memory fragmentation" problem is memory swapping.
For example, the 256MB of memory occupied by a music program can be written out to the hard disk and then read back into memory. When it is read back, however, it cannot be loaded at its original position; it is placed immediately after an already occupied 512MB region. This frees up a contiguous 256MB hole, into which a new 200MB program can then be loaded.
This memory swapping space in Linux systems is what we commonly see as Swap space, which is allocated from the hard disk for swapping between memory and disk space.

Why does segmentation lead to low memory swapping efficiency?#

In a multi-process system, using segmentation makes it easy to generate external memory fragmentation. When external memory fragmentation occurs, it necessitates re-swapping the memory area, which creates a performance bottleneck.
This is because the access speed of the hard disk is much slower than that of memory, and each memory swap requires writing a large contiguous block of memory data to the hard disk.
Therefore, if the memory being swapped is a program occupying a large memory space, the entire machine will appear sluggish.
To address the "external memory fragmentation and low memory swapping efficiency" issues caused by segmentation, memory paging was introduced.

Paging Mechanism#

Paging divides the entire virtual and physical memory spaces into units of fixed size. Each such contiguous, fixed-size unit of memory is called a page; in Linux, each page is 4KB.
The mapping between virtual addresses and physical addresses is maintained in a page table.
The page table is stored in memory, and the Memory Management Unit (MMU) performs the task of converting virtual memory addresses into physical addresses.
When the virtual address accessed by the process cannot be found in the page table, the system generates a page fault exception, entering the kernel space to allocate physical memory, update the process's page table, and then return to user space to resume the process's execution.

What is the purpose of the page table? What is the address translation process?#

In the paging mechanism, each application has a corresponding page table.
The virtual address in the paging mechanism consists of two parts:

  • Page number: The virtual page number can be used to retrieve the corresponding physical page number from the page table;
  • Offset within the page: Physical page starting address + offset within the page = physical memory address.

The specific address translation process is as follows:

  1. The MMU first parses the virtual address to obtain the virtual page number;
  2. The virtual page number is used to retrieve the corresponding physical page number from the application's page table (find the corresponding page table entry);
  3. The physical page starting address (physical address) corresponding to the physical page number is added to the offset within the virtual address to obtain the final physical address.

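A minimal sketch of these three steps with 4 KB pages (the page table here is a toy array, not a real kernel structure):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 /* 4 KB pages: 2^12 */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Toy page table: index = virtual page number,
 * value = physical page number. */
static uint32_t page_table[] = {7, 3, 12, 5};

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* step 1: virtual page number  */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    uint32_t ppn    = page_table[vpn];       /* step 2: look up the phys page */
    return (ppn << PAGE_SHIFT) | offset;     /* step 3: page base + offset    */
}

int main(void) {
    uint32_t vaddr = (1 << PAGE_SHIFT) + 0x123;  /* page 1, offset 0x123 */
    printf("vaddr 0x%x -> paddr 0x%x\n", vaddr, translate(vaddr));
    return 0;
}
```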

How does paging solve the "external memory fragmentation and low memory swapping efficiency" issues of segmentation?#

Memory paging does not produce small gaps between pages like memory segmentation does, as memory space is pre-divided. Thus, paging ensures that pages are tightly arranged, eliminating external fragmentation.
However, because the minimum unit of memory allocation in the paging mechanism is a page, even if a program requires less than a page size, at least one page must be allocated, leading to internal memory fragmentation.
If memory space is insufficient, the operating system will release the "least recently used" memory pages from other running processes, temporarily writing them to the hard disk, referred to as swapping out. When needed again, they are loaded back in, referred to as swapping in. Therefore, only a few pages are written to disk at a time, which does not take much time, and the efficiency of memory swapping is relatively high.


Furthermore, paging allows us to load programs without needing to load the entire program into physical memory at once. We can perform the mapping between virtual memory and physical memory pages without actually loading the pages into physical memory, only loading them when the corresponding virtual memory pages are needed during program execution.

What issues does a single-level page table have? Why is a multi-level page table needed?#

Since the operating system can run many processes simultaneously, this means that the page table can become very large.
In a 32-bit environment, the virtual address space has a total of 4GB. Assuming a page size of 4KB (2^12), approximately 1 million (2^20) pages are needed. Each "page table entry" requires 4 bytes to store, so the entire 4GB space mapping would require 4MB of memory to store the page table.
This 4MB page table does not seem very large. However, it is important to note that each process has its own virtual address space, meaning each has its own page table.
Thus, for 100 processes, 400MB of memory would be needed to store the page tables, which is a significant amount of memory, not to mention in a 64-bit environment.

Multi-level Page Table#

To solve the above problem, a solution called Multi-Level Page Table is needed.
We can further page the single-level page table containing over 1 million page table entries: split it into 1024 second-level page tables, each containing 1024 page table entries, indexed by a first-level page table with 1024 entries. This forms two-level paging.
The first-level page table thus has 1024 entries, each associated with one second-level page table that also has 1024 entries, a one-to-many relationship. Because second-level page tables are loaded only on demand (only a small portion of them will ever be used), space is saved.
Assuming only 2 second-level page tables are needed, the memory usage for the two-level page table would be: 4KB (first-level page table) + 4KB * 2 (second-level page tables) = 12 KB.
Multi-level page tables are a typical scenario of trading time for space, using increased page table lookup times to reduce page table space usage.
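
A minimal sketch of the 32-bit two-level split: the top 10 bits index the first-level table, the next 10 bits index a second-level table, and the low 12 bits are the page offset (the tables here are toy arrays; a real kernel allocates second-level tables on demand in exactly this spirit):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SHIFT 12
#define PT_BITS    10   /* 1024 entries per table level */

/* First level: 1024 pointers to second-level tables,
 * created only when an entry is actually used. */
static uint32_t *pgd[1 << PT_BITS];

static void map_page(uint32_t vaddr, uint32_t ppn) {
    uint32_t i1 = vaddr >> (PAGE_SHIFT + PT_BITS);            /* top 10 bits  */
    uint32_t i2 = (vaddr >> PAGE_SHIFT) & ((1 << PT_BITS) - 1); /* next 10 bits */
    if (pgd[i1] == NULL)                                      /* on demand    */
        pgd[i1] = calloc(1 << PT_BITS, sizeof(uint32_t));
    pgd[i1][i2] = ppn;
}

static uint32_t translate(uint32_t vaddr) {
    uint32_t i1  = vaddr >> (PAGE_SHIFT + PT_BITS);
    uint32_t i2  = (vaddr >> PAGE_SHIFT) & ((1 << PT_BITS) - 1);
    uint32_t off = vaddr & ((1 << PAGE_SHIFT) - 1);
    return (pgd[i1][i2] << PAGE_SHIFT) | off;
}

int main(void) {
    map_page(0x00401000, 42);  /* map one virtual page to physical page 42 */
    printf("0x00401123 -> 0x%x\n", translate(0x00401123));
    return 0;
}
```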

You might ask, doesn't dividing into two-level tables require 4KB (first-level page table) + 4MB (second-level page tables) of memory to map a 4GB address space, which occupies even more space?#

Of course, if the entire 4GB virtual address is mapped to physical memory, the space occupied by two-level paging is indeed larger. However, we often do not allocate that much memory for a single process.
In fact, we should look at the problem from a different perspective. Remember the locality principle that is ubiquitous in computer organization?
Each process has a 4GB virtual address space, but clearly, for most programs, the space used is far from reaching 4GB, as there will be many corresponding page table entries that are empty and not allocated. For allocated page table entries, if there are page table entries that have not been accessed for a certain period, the operating system will swap them out to the hard disk when physical memory is tight, meaning they will not occupy physical memory.
If we use two-level paging, the first-level page table can cover the entire 4GB virtual address space, but if a page table entry in the first-level page table is not used, there is no need to create the corresponding second-level page table for that entry, and it can be created only when needed. A simple calculation shows that if only 20% of the first-level page table entries are used, the memory space occupied by the page table would be 4KB (first-level page table) + 20% * 4MB (second-level page tables) = 0.804MB, which is a huge saving compared to the 4MB of a single-level page table.

TLB#

Although multi-level page tables solve the space issue, the conversion from virtual addresses to physical addresses involves several more steps, which clearly reduces the speed of address conversion, introducing time overhead.
Programs exhibit locality, meaning that during a period of time, the execution of the entire program is limited to a certain part of the program. Accordingly, the storage space accessed during execution is also confined to a certain memory area.
We can utilize this feature by storing the most frequently accessed page table entries in faster access hardware. Thus, computer scientists have added a special cache in the CPU chip to store the most frequently accessed page table entries, known as the TLB (Translation Lookaside Buffer), commonly referred to as page table cache, address translation cache, or fast table.
With the TLB, the CPU will first check the TLB when addressing; if not found, it will continue to check the regular page table.
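
A minimal sketch of this lookup order, using a toy direct-mapped software TLB (real TLBs are hardware structures, typically set-associative):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16

struct tlb_entry { uint32_t vpn; uint32_t ppn; bool valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the (slow) multi-level page table walk. */
static uint32_t page_table_walk(uint32_t vpn) { return vpn + 100; }

static uint32_t lookup_ppn(uint32_t vpn) {
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {            /* fast path: TLB hit */
        printf("TLB hit for vpn %u\n", vpn);
        return e->ppn;
    }
    uint32_t ppn = page_table_walk(vpn);        /* slow path: TLB miss */
    *e = (struct tlb_entry){ vpn, ppn, true };  /* cache it for next time */
    printf("TLB miss for vpn %u, entry filled\n", vpn);
    return ppn;
}

int main(void) {
    lookup_ppn(5);  /* miss: walks the page tables, fills the TLB */
    lookup_ppn(5);  /* hit: answered straight from the TLB        */
    return 0;
}
```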

Segmented Paging Memory Management#

Memory segmentation and paging are not mutually exclusive; they can be combined in the same system, typically referred to as segmented paging memory management.

The implementation of segmented paging memory management:

  • First, the program is divided into multiple logically meaningful segments, as mentioned in the segmentation mechanism;
  • Next, each segment is divided into multiple pages, further dividing the contiguous space created by segmentation into fixed-size pages;

Thus, the address structure consists of segment number, page number within the segment, and offset within the page.
The data structure used for segmented-paging address translation is one segment table per program, with one page table per segment. The entries in the segment table hold the starting addresses of the page tables, while the entries in each page table hold the physical page number of a page.
To obtain the physical address in segmented paging address translation, three memory accesses are required:

  • The first access is to the segment table to obtain the starting address of the page table;
  • The second access is to the page table to obtain the physical page number;
  • The third access combines the physical page number with the offset within the page to obtain the physical address.

Segmented paging address translation can be implemented using a combination of software and hardware methods. Although this increases hardware costs and system overhead, it improves memory utilization.

How does malloc allocate memory?#

What does the memory distribution of a Linux process look like?#

In the Linux operating system, the virtual address space is divided into kernel space and user space, with ranges that depend on the bitness of the system. In the most common 32-bit and 64-bit systems:

  • The 32-bit system's kernel space occupies 1G, located at the highest address, while the remaining 3G is user space;
  • The 64-bit system's kernel space and user space both occupy 128T, respectively occupying the highest and lowest parts of the entire memory space, with the remaining middle part being undefined.

Now, let's discuss the differences between kernel space and user space:

  • When a process is in user mode, it can only access memory within user space;
  • Only after entering kernel mode can it access memory in kernel space;

Although each process has its own independent virtual memory, the kernel addresses in each virtual memory actually correspond to the same physical memory. Thus, when a process switches to kernel mode, it can easily access kernel space memory.
Next, let's look further at how the virtual space is divided. User space and kernel space are divided in different ways; the layout of kernel space will not be elaborated here.
Looking at the layout of user space, and taking a 32-bit system as an example, user-space memory is divided into 6 different memory segments, from low to high addresses:

  • Code segment, including binary executable code;
  • Data segment, including initialized static constants and global variables;
  • BSS segment, including uninitialized static variables and global variables;
  • Heap segment, including dynamically allocated memory, which grows upwards from low addresses;
  • File mapping segment, including dynamic libraries, shared memory, etc., which grows upwards from low addresses (related to hardware and kernel versions);
  • Stack segment, including local variables and function call contexts, etc. The stack size is fixed, generally 8 MB. Of course, the system also provides parameters for us to customize the size;

In these 6 memory segments, the memory of the heap and file mapping segments is dynamically allocated. For example, using the C standard library's malloc() or mmap() can dynamically allocate memory in the heap and file mapping segments, respectively.
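
A small C program can make this layout visible by printing one address from each segment (exact addresses vary between runs because of ASLR; the relative ordering is what matters):

```c
#include <stdio.h>
#include <stdlib.h>

int bss_var;        /* BSS segment: uninitialized global  */
int data_var = 42;  /* data segment: initialized global   */

int main(void) {                 /* code segment: main itself       */
    int stack_var = 0;           /* stack segment: local variable   */
    int *heap_var = malloc(4);   /* heap segment: dynamic allocation */

    printf("code : %p\n", (void *)main);
    printf("data : %p\n", (void *)&data_var);
    printf("bss  : %p\n", (void *)&bss_var);
    printf("heap : %p\n", (void *)heap_var);
    printf("stack: %p\n", (void *)&stack_var);

    free(heap_var);
    return 0;
}
```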

How does malloc allocate memory?#

When malloc requests memory, it uses one of two methods to obtain memory from the operating system.

  • Method 1: Allocates memory from the heap through the brk() system call.
  • Method 2: Allocates memory in the file mapping area through the mmap() system call.

Method 1 is straightforward: the brk() system call simply moves the "top of the heap" pointer to a higher address, and the space in between becomes the newly obtained memory.
Method 2 allocates a block of memory in the file mapping area through a "private anonymous mapping" made with the mmap() system call, essentially "stealing" a block of memory from the file mapping area.
The malloc() source code defines a default threshold:

  • If the user requests less than 128 KB, malloc obtains the memory through brk();
  • If the user requests more than 128 KB, malloc obtains the memory through mmap().
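
In glibc this threshold can even be tuned with mallopt; a minimal sketch (glibc-specific, and modern glibc may also adjust the threshold dynamically):

```c
#include <malloc.h>  /* glibc-specific: mallopt, M_MMAP_THRESHOLD */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Force allocations of 64 KB or more to use mmap() instead of brk(). */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);

    void *small = malloc(4 * 1024);    /* below threshold: typically via brk()  */
    void *large = malloc(256 * 1024);  /* above threshold: served via mmap()    */
    printf("small=%p large=%p\n", small, large);

    free(small);  /* kept in malloc's pool, not returned to the OS */
    free(large);  /* munmap()ed: returned to the OS                */
    return 0;
}
```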

Does malloc() allocate physical memory?#

No, malloc() allocates virtual memory.

How much virtual memory does malloc(1) allocate?#

When malloc() allocates memory, it does not strictly allocate the expected number of bytes as requested by the user but pre-allocates a larger space as a memory pool.
The specific size of the pre-allocated space depends on the memory manager used by malloc.
For example, on a glibc system, a request of less than 128 KB is served from heap space via brk(), so the allocated region carries the [heap] label on the far right of /proc/<pid>/maps.
There, the heap space occupies the address range 00d73000-00d94000, which is 132KB, indicating that malloc(1) actually pre-allocated 132KB of memory.

Does free return memory to the operating system?#

  • Memory allocated by malloc through the brk() method does not return memory to the operating system when freed; instead, it caches it in the memory pool of malloc for future use;
  • Memory allocated by malloc through the mmap() method does return memory to the operating system when freed, resulting in true memory release.

Why not use mmap for all memory allocation?#

If all memory were allocated using mmap, every allocation would require a system call.
Additionally, because memory allocated by mmap is returned to the operating system on release, each mmap allocation starts out unmapped, so the first access to each such virtual address triggers a page fault interrupt.
Frequently allocating memory through mmap therefore not only causes a context switch on every call but also a page fault interrupt on the first access to each allocation, leading to significant CPU consumption.
To improve these two issues, malloc pre-allocates larger memory as a memory pool when requesting memory from the heap space using the brk() system call, as the heap space is contiguous. When memory is released, it is cached in the memory pool.
The next time memory is requested, the corresponding memory block can be taken directly from the memory pool, and the mapping relationship between this memory block's virtual address and physical address may still exist, thus reducing both the number of system calls and the number of page fault interrupts, significantly lowering CPU consumption.

Why not use brk for all memory allocation?#

Memory allocated through brk from the heap space is not returned to the operating system. If we continuously allocate 10k, 20k, and 30k of memory, and then free the 10k and 20k blocks, they become free memory space. If the next memory request is less than 30k, this free memory space can be reused.
However, if the next memory request exceeds 30k and there is no available free memory space, the operating system must be requested, causing actual memory usage to continue to increase.
Therefore, with frequent malloc and free operations in the system, especially for small memory blocks, the heap will generate increasingly more unusable fragments, leading to "memory leaks." This type of "leak" phenomenon cannot be detected using valgrind.
Thus, the implementation of malloc fully weighs the differences and trade-offs between brk and mmap behavior, defaulting to mmap only for large blocks of memory (larger than 128KB).

How does the free() function know how much memory to release with just one memory address?#

The memory address that malloc returns to user space is 16 bytes past the actual start of the allocated memory block.
Those extra 16 bytes store the block's bookkeeping information, such as the size of the memory block.
Thus, when the free() function is executed, it steps back 16 bytes from the passed-in address to read the current block's size, and therefore knows exactly how much memory to release, as sketched below.
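
A minimal sketch of the idea (the 16-byte header layout here is illustrative; glibc's real chunk header differs in its details):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative 16-byte header stored just before the user pointer. */
struct block_header {
    size_t   size;   /* usable size of this block                */
    uint64_t flags;  /* illustrative: in-use / mmap'd flag bits   */
};

static void *my_malloc(size_t size) {
    struct block_header *h = malloc(sizeof(*h) + size);
    if (h == NULL) return NULL;
    h->size  = size;
    h->flags = 1;
    return (char *)h + sizeof(*h);  /* user sees memory after the header */
}

static void my_free(void *ptr) {
    /* Step back 16 bytes to recover the header and learn the size. */
    struct block_header *h =
        (struct block_header *)((char *)ptr - sizeof(*h));
    printf("freeing block of %zu bytes\n", h->size);
    free(h);
}

int main(void) {
    void *p = my_malloc(100);
    my_free(p);  /* knows the size from the header alone */
    return 0;
}
```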

What happens when memory is full?#

What is the process of memory allocation and recovery?#

When an application reads or writes virtual memory it has requested (for example, via malloc), the CPU attempts to access that virtual memory. If it finds that the virtual memory is not mapped to physical memory, the CPU generates a page fault interrupt: the process switches from user mode to kernel mode, and the page fault is handed to the kernel's Page Fault Handler for processing.
The page fault interrupt handler checks for free physical memory. If available, it directly allocates physical memory and establishes a mapping relationship between virtual memory and physical memory.
If there is no free physical memory, the kernel will begin the work of recovering memory, primarily through two methods: direct memory recovery and background memory recovery.

  • Background Memory Recovery (kswapd): When physical memory is tight, the kswapd kernel thread is awakened to recover memory. This memory recovery process is asynchronous and does not block the execution of processes.
  • Direct Memory Recovery (direct reclaim): If the asynchronous recovery does not keep up with the speed of memory requests from processes, direct recovery will begin. This memory recovery process is synchronous and will block the execution of processes.
  • If direct memory recovery still does not free enough physical memory to satisfy the current request, the kernel will resort to its last resort—triggering the OOM (Out of Memory) mechanism.

The OOM Killer mechanism selects a process that occupies a high amount of physical memory based on an algorithm and kills it to free up memory resources. If physical memory remains insufficient, the OOM Killer will continue to kill processes that occupy high amounts of physical memory until enough memory is freed.

What memory can be reclaimed?#

There are mainly two types of memory that can be reclaimed, and their recovery methods differ.

  • File Pages: Kernel caches of disk data (Buffer) and kernel caches of file data (Cache) are referred to as file pages. Most file pages can be directly released from memory, and if needed later, they can be re-read from the disk. However, those pages modified by applications that have not yet been written to disk (dirty pages) must be written to disk before memory can be released. Therefore, the method for reclaiming clean pages is to directly release memory, while the method for reclaiming dirty pages is to write them back to disk before releasing memory.
  • Anonymous Pages: This portion of memory has no actual carrier, unlike file caches that have hard disk files as a carrier, such as heap and stack data. This memory may need to be accessed again, so it cannot be directly released. The recovery method for this memory is through Linux's Swap mechanism, which writes infrequently accessed memory to disk and then releases this memory for use by other processes that need it. When these memory pages are accessed again, they can be read back from the disk into memory.

The recovery of file pages and anonymous pages is based on the LRU algorithm, which prioritizes reclaiming less frequently accessed memory. The LRU recovery algorithm maintains two doubly linked lists:

  • active_list: This list contains memory pages that have been accessed recently (active memory pages);
  • inactive_list: This list contains memory pages that have been accessed infrequently (inactive memory pages);
    The closer a page is to the tail of the list, the less frequently it has been accessed. Thus, when reclaiming memory, the system can prioritize reclaiming inactive memory based on access frequency.

Performance impact of memory recovery#

The operation of reclaiming memory generally involves disk I/O. If memory recovery operations are frequent, it means that the number of disk I/O operations will be high, which will inevitably affect system performance, making the entire system feel sluggish.
There are two main ways to mitigate this:

Adjusting the recovery tendency of file pages and anonymous pages#

From the perspective of the recovery operations of file pages and anonymous pages, the recovery operation of file pages has less impact on the system compared to that of anonymous pages, as the recovery of clean file pages does not involve disk I/O, while both the swapping in and out of anonymous pages involve disk I/O.
Linux provides a /proc/sys/vm/swappiness option to adjust the recovery tendency of file pages and anonymous pages.
The range of swappiness is 0-100; a higher value indicates a more aggressive use of Swap, meaning a greater tendency to reclaim anonymous pages; a lower value indicates a more passive use of Swap, meaning a greater tendency to reclaim file pages.
It is generally recommended to set swappiness to 0 (the default value is 60), which means that when reclaiming memory, there will be a greater tendency to reclaim file pages, but this does not mean that anonymous pages will not be reclaimed.
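
The current value can be inspected programmatically (or simply with cat); a minimal sketch that reads it from procfs (writing a new value, e.g., via sysctl, requires root):

```c
#include <stdio.h>

int main(void) {
    /* Read the current swappiness value from procfs. */
    FILE *f = fopen("/proc/sys/vm/swappiness", "r");
    if (f == NULL) { perror("fopen"); return 1; }
    int value;
    if (fscanf(f, "%d", &value) == 1)
        printf("vm.swappiness = %d\n", value);
    fclose(f);
    return 0;
}
```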

Triggering the kswapd kernel thread for asynchronous memory recovery as early as possible#

The kernel defines three memory thresholds (watermarks) to measure whether the remaining memory (pages_free) is sufficient or tight:

  • Minimum page threshold (pages_min);
  • Low page threshold (pages_low);
  • High page threshold (pages_high);

These three memory thresholds divide memory usage into four situations:

image

  • In the orange section of the figure: If the remaining memory (pages_free) is between the low page threshold (pages_low) and the minimum page threshold (pages_min), it indicates that memory pressure is high and there is little remaining memory. At this time, kswapd0 will perform memory recovery until the remaining memory exceeds the high threshold (pages_high). Although memory recovery will be triggered, it will not block application execution, as the two processes are asynchronous.
  • In the red section of the figure: If the remaining memory (pages_free) is less than the minimum page threshold (pages_min), it indicates that user-available memory has been exhausted. At this point, direct memory recovery will be triggered, blocking the application because the two processes are synchronous.

The low page threshold (pages_low) can be indirectly set through the kernel option /proc/sys/vm/min_free_kbytes (this parameter represents the minimum amount of free memory the system reserves).
Although min_free_kbytes sets the minimum page threshold (pages_min), the high and low page thresholds (pages_high and pages_low) are calculated based on the minimum page threshold (pages_min), and their relationship is as follows:

pages_min = min_free_kbytes
pages_low = pages_min*5/4
pages_high = pages_min*3/2

How to protect a process from being killed by OOM?#

How does Linux choose which process to kill? This involves an oom_badness() function in the Linux kernel that scans through all processes that can be killed and scores each one. The process with the highest score will be killed first.
The score is calculated by taking the number of physical pages the process is currently using and adding the system's total number of available pages multiplied by the OOM adjustment value oom_score_adj and divided by 1000; roughly:

points = process_used_pages + total_pages * oom_score_adj / 1000

The larger the calculated value, the higher the chance that the process will be killed by OOM.

  • If you do not want a certain process to be killed first, you can adjust its oom_score_adj to change its score and reduce the likelihood of it being killed by OOM.
  • If you want a certain process to never be killed, you can set its oom_score_adj to -1000.
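
A minimal sketch of adjusting a process's own score through procfs (writing a negative value such as -1000 typically requires root or CAP_SYS_RESOURCE):

```c
#include <stdio.h>

int main(void) {
    /* Lower this process's OOM score; -1000 disables OOM kill for it
     * entirely (usually requires the right privileges). */
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (f == NULL) { perror("fopen"); return 1; }
    fprintf(f, "%d\n", -1000);
    fclose(f);
    return 0;
}
```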

How to avoid issues of prefetch failure and cache pollution?#

Linux and MySQL Caching#

Linux Operating System Cache#

When an application reads data from a file, the Linux operating system caches the read file data in the Page Cache.
Page Cache is data in memory space, and since memory access is much faster than disk access, the next time the same data is accessed, there is no need for disk I/O; the cached data can be returned directly.
Thus, Page Cache accelerates data access.

MySQL Cache#

MySQL data is stored on disk, and to improve the read and write performance of the database, the InnoDB storage engine has designed a Buffer Pool, which is also data in memory space.
With the Buffer Pool:

  • When reading data, if the data exists in the Buffer Pool, the client will read the data directly from the Buffer Pool; otherwise, it will read from disk.
  • When modifying data, the data in the Buffer Pool is modified first, then the page is marked as dirty, and finally, a background thread writes the dirty page to disk.

How does traditional LRU manage memory data?#

The traditional LRU algorithm has not been used by Linux and MySQL because it cannot avoid the following two problems:

  • Prefetch failure leads to a decrease in cache hit rate;
  • Cache pollution leads to a decrease in cache hit rate.

What is prefetch failure, and how to address it?#

What is the prefetch mechanism?#

The Linux operating system provides a prefetch mechanism for the read cache mechanism based on Page Cache. An example is:

  • If an application wants to read data from file A on disk within the offset range of 0-3KB, since the basic read and write unit of the disk is a block (4KB), the operating system will read at least 0-4KB of content, which can fit within one page.
  • However, due to the principle of spatial locality (data close to currently accessed data is likely to be accessed in the future), the operating system will choose to load the disk blocks offset [4KB,8KB), [8KB,12KB), and [12KB,16KB) into memory, thus requesting 3 additional pages in memory.
Thus, the benefit of the prefetch mechanism is to reduce the number of disk I/O operations and improve the system's disk I/O throughput.
The MySQL InnoDB storage engine also has a similar prefetch mechanism, where MySQL loads pages from disk and preemptively loads adjacent pages to reduce disk I/O.
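
Applications can also steer this prefetching explicitly. A minimal sketch using posix_fadvise (the file path is a placeholder; the kernel treats these calls as hints and may ignore them):

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/data.bin", O_RDONLY);  /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    /* Hint sequential access: the kernel may enlarge its
     * readahead window for this file. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* Ask the kernel to start pulling the first 16 KB into the
     * Page Cache ahead of our actual read() calls. */
    posix_fadvise(fd, 0, 16 * 1024, POSIX_FADV_WILLNEED);

    char buf[4096];
    read(fd, buf, sizeof(buf));  /* likely served from Page Cache */
    close(fd);
    return 0;
}
```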

What problems does prefetch failure cause?#

If the pages that were preloaded are not accessed, it means that the prefetch work was in vain, which is termed prefetch failure.
If the traditional LRU algorithm is used, the "prefetched pages" will be placed at the head of the LRU linked list, and when memory space is insufficient, pages at the tail will need to be evicted.
If these "prefetched pages" are never accessed, a strange problem arises: the prefetch pages that are not accessed occupy the front positions of the LRU linked list, while pages at the tail that may contain hot data are evicted, significantly reducing the cache hit rate.

How to avoid the impact of prefetch failure?#

Both the Linux operating system and MySQL InnoDB have improved the traditional LRU linked list to avoid the impact of prefetch failure, with specific improvements as follows:

  • The Linux operating system implements two LRU linked lists: an active LRU list (active_list) and an inactive LRU list (inactive_list);
  • MySQL's InnoDB storage engine divides a single LRU linked list into two areas: a young area and an old area.

The design philosophy of both improvements is the same: divide data into cold and hot, and apply the LRU algorithm to each separately, rather than managing all data with a single LRU list as the traditional algorithm does.
For example, suppose there is an LRU linked list of length 10, with the young area occupying 70% and the old area occupying 30%.
Now, if a page numbered 20 is prefetched, it is inserted only at the head of the old area, and the page at the tail of the old area (page 10) is evicted.
If page 20 is never accessed, it never occupies a position in the young area and is evicted earlier than the data in the young area.
If page 20 is accessed immediately after being prefetched, it is inserted at the head of the young area, pushing the page at the tail of the young area (page 7) into the old area as its new head; this process evicts no pages.

What is cache pollution, and how to address it?#

What is cache pollution?#

When we read data in bulk, since the data is accessed once, all this large amount of data will be added to the "active LRU list," causing all the hot data previously cached in the active LRU list (or young area) to be evicted. If this large amount of data is not accessed for a long time, the entire active LRU list (or young area) becomes polluted.

How to avoid the impact of cache pollution?#

The previous LRU algorithm adds data to the active LRU list (or young area) as soon as it is accessed, making the threshold for entering the active LRU list too low! It is precisely because the threshold is too low that cache pollution easily leads to the eviction of hot data originally in the active LRU list.
Thus, if we raise the threshold for entering the active LRU list (or young area), we can effectively ensure that hot data in the active LRU list (or young area) is not easily replaced.
The Linux operating system and MySQL InnoDB storage engine improve the threshold as follows:

  • Linux Operating System: A memory page is promoted from the inactive list to the active list only after it has been accessed a second time.
  • MySQL InnoDB: When a page in the old area is accessed a second time, it is not immediately promoted to the young area; instead, InnoDB judges how long the page has stayed in the old area:
    • If the time between the second access and the first access is within 1 second (the default), the page is not promoted from the old area to the young area;
    • If the time exceeds 1 second, the page is promoted from the old area to the young area.

By raising the threshold for entering the active LRU list (or young area), the impact of cache pollution can be effectively avoided.
When data is read in bulk and accessed only once, it never enters the active LRU list (or young area) and therefore evicts no hot data; it stays on the inactive LRU list (or in the old area), from which it is soon evicted.
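
A minimal sketch of the Linux-style rule, promote only on a second access, using a toy per-page referenced bit (the real kernel tracks this with page flags and per-zone linked lists):

```c
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 8

struct page { int id; bool referenced; bool active; };
static struct page pages[NPAGES];

/* First access: the page stays on the inactive list, with its
 * referenced bit set. Second access: promote it to the active list. */
static void access_page(struct page *p) {
    if (!p->active && p->referenced) {
        p->active = true;      /* second access: promote */
        printf("page %d -> active list\n", p->id);
    } else if (!p->active) {
        p->referenced = true;  /* first access: just mark */
        printf("page %d stays inactive (marked referenced)\n", p->id);
    }
}

int main(void) {
    for (int i = 0; i < NPAGES; i++)
        pages[i] = (struct page){i, false, false};

    /* A bulk scan touches every page once: nothing is promoted,
     * so the active list (hot data) is untouched. */
    for (int i = 0; i < NPAGES; i++)
        access_page(&pages[i]);

    /* Only page 3 is touched a second time, so only it becomes active. */
    access_page(&pages[3]);
    return 0;
}
```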