ShuWa

TCP

Basic Concepts of TCP#

TCP Header Format#

[Figure: TCP header format]

  • Sequence Number: a random value chosen by the host when the connection is established, passed to the receiving host in the SYN packet. As data is sent, it advances by the number of data bytes transmitted. It is used to solve the problem of out-of-order network packets.
  • Acknowledgment Number: the sequence number of the next byte of data the receiver "expects." After receiving this acknowledgment, the sender can consider all data before this sequence number to have been received normally. It is used to solve the problem of packet loss.
  • Control Flags:
    ACK: When this bit is 1, the "Acknowledgment Number" field becomes valid. TCP stipulates that this bit must be set to 1 in every segment except the initial SYN packet when establishing a connection.
    RST: When this bit is 1, it indicates that an exception has occurred on the TCP connection and the connection must be forcibly terminated, skipping the normal four-way handshake.
    SYN: When this bit is 1, it indicates a desire to establish a connection, with the initial sequence number carried in the "Sequence Number" field.
    FIN: When this bit is 1, it indicates that no more data will be sent and the sender wishes to disconnect. When communication ends and disconnection is desired, the hosts on both sides exchange TCP segments with the FIN bit set to 1.
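As a quick illustration of where these fields live, the fixed 20-byte TCP header can be packed and unpacked with Python's struct module (a minimal sketch, not a real TCP implementation; all field values below are arbitrary examples):

```python
import struct

# Fixed TCP header layout: src port (16), dst port (16), seq (32), ack (32),
# data offset + reserved + flags (16), window (16), checksum (16), urgent (16).
src_port, dst_port = 12345, 80
seq, ack = 1000, 2001
data_offset = 5                       # header length in 32-bit words (5 -> 20 bytes)
flags = 0b010010                      # SYN (0x02) and ACK (0x10) bits set
offset_and_flags = (data_offset << 12) | flags
window, checksum, urg_ptr = 65535, 0, 0

header = struct.pack('!HHIIHHHH', src_port, dst_port, seq, ack,
                     offset_and_flags, window, checksum, urg_ptr)

# Unpack it again and pull out the control flags
fields = struct.unpack('!HHIIHHHH', header)
got_flags = fields[4] & 0x3F
is_syn = bool(got_flags & 0x02)
is_ack = bool(got_flags & 0x10)
```

The `!` prefix selects network (big-endian) byte order, which is how the header actually travels on the wire.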

How to Uniquely Identify a TCP Connection?#

The TCP four-tuple can uniquely identify a connection, which includes:
Source Address, Source Port, Destination Address, Destination Port

  • The fields for source and destination addresses (32 bits) are in the IP header, serving to send packets to the other host via the IP protocol.
  • The fields for source and destination ports (16 bits) are in the TCP header, serving to inform the TCP protocol which process the packet should be sent to.
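A hedged sketch of what the four-tuple looks like in practice, using Python sockets on the loopback interface (the addresses and ports here are whatever the OS assigns):

```python
import socket

# Set up a listening socket and connect to it over loopback.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))            # port 0: let the OS pick a free port
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, peer = srv.accept()

# (source address, source port, destination address, destination port)
four_tuple = cli.getsockname() + cli.getpeername()
print(four_tuple)

cli.close(); conn.close(); srv.close()
```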

What is the maximum number of connections a server listening on one IP and port can have?

The server typically listens on a fixed local port, waiting for connection requests from clients.

Therefore, the client IP and port are variable, and the theoretical calculation formula is as follows:

Maximum TCP Connections = Number of Client IPs * Number of Client Ports

For IPv4, the maximum number of client IPs is 2 to the power of 32, and the maximum number of client ports is 2 to the power of 16, which means the maximum TCP connections on a single server is approximately 2 to the power of 48.
Of course, the maximum concurrent TCP connections on the server cannot reach the theoretical limit and will be affected by the following factors:

  1. File Descriptor Limit: Each TCP connection is a file, and if the file descriptors are exhausted, it will result in Too many open files. Linux has three types of limits on the number of open file descriptors:

    • System-level: The maximum number of files that can be opened by the current system, viewable via cat /proc/sys/fs/file-max;
    • User-level: The maximum number of files that a specified user can open, configured in /etc/security/limits.conf;
    • Process-level: The maximum number of files that a single process can open, viewable via cat /proc/sys/fs/nr_open;
  2. Memory Limit: Each TCP connection occupies a certain amount of memory, and the operating system's memory is limited. If memory resources are exhausted, it will result in OOM.
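For illustration, the process-level limit mentioned above can also be read from inside a program; a minimal Python sketch using the standard resource module (Unix-only):

```python
import resource

# Process-level limit as seen from inside the process (the `ulimit -n` value).
# The system- and user-level limits live in /proc and limits.conf, as above.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"this process may open up to {soft} files (hard cap {hard})")
```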

Since the IP layer can fragment, why does the TCP layer still need MSS?#

[Figure: MTU and MSS]

  • MTU: The maximum length of a network packet, generally 1500 bytes in Ethernet;
  • MSS: The maximum length of TCP data that can be accommodated in a network packet after excluding the IP and TCP headers;

When the IP layer has data (TCP header + TCP data) that exceeds the MTU size to send, the IP layer must fragment the data into several pieces, ensuring that each fragment is smaller than the MTU. After fragmenting an IP datagram, the target host's IP layer will reassemble it and then pass it to the upper TCP transport layer.
This seems orderly, but there is a hidden danger: if an IP fragment is lost, all fragments of the entire IP packet must be retransmitted.

Because the IP layer itself does not have a timeout retransmission mechanism; it is the TCP transport layer that is responsible for timeouts and retransmissions.
When a certain IP fragment is lost, the receiving IP layer cannot assemble a complete TCP packet (header + data) and thus cannot deliver the data packet to the TCP layer. Therefore, the receiver will not respond with an ACK to the sender. Since the sender does not receive the ACK confirmation packet for a long time, it will trigger a timeout retransmission, leading to the retransmission of the "entire TCP packet (header + data)."

Thus, to achieve optimal transmission efficiency, TCP usually negotiates the MSS values of both parties when establishing a connection. When the TCP layer finds that the data exceeds the MSS, it splits the data into MSS-sized segments first, so the resulting IP packets naturally will not exceed the MTU and no IP fragmentation is needed.
After TCP-layer segmentation, if a TCP segment is lost, retransmission is done at MSS granularity: only the lost segment is retransmitted, not all of them, greatly improving retransmission efficiency.
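A back-of-the-envelope sketch of the arithmetic, assuming standard Ethernet (MTU 1500) and no IP or TCP options:

```python
import math

# MSS = MTU minus the minimal IP and TCP headers (20 bytes each, no options).
MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20
MSS = MTU - IP_HEADER - TCP_HEADER    # 1460 bytes of payload per segment

data = 4000                           # a hypothetical application write
segments = math.ceil(data / MSS)      # how many TCP segments it is split into
print(MSS, segments)
```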

What are the differences between UDP and TCP? What are their respective application scenarios?#

[Figure: TCP and UDP header comparison]

Differences between TCP and UDP:#

  1. Connection
    TCP is a connection-oriented transport layer protocol that requires establishing a connection before transmitting data.
    UDP does not require a connection and transmits data immediately.
  2. Service Object
    TCP is a one-to-one service, meaning a connection has only two endpoints.
    UDP supports one-to-one, one-to-many, and many-to-many interactive communication.
  3. Reliability
    TCP reliably delivers data, ensuring that data is error-free, not lost, not duplicated, and arrives in order.
    UDP makes a best-effort delivery without guaranteeing reliable data delivery. However, a reliable transport protocol can be implemented on top of UDP, such as the QUIC protocol. For more details, refer to the article: How to Implement Reliable Transmission Based on UDP Protocol?
  4. Congestion Control and Flow Control
    TCP has congestion control and flow control mechanisms to ensure the safety of data transmission.
    UDP does not have these mechanisms; even if the network is congested, it will not affect the sending rate of UDP.
  5. Header Overhead
    TCP has a longer header, which incurs some overhead. The header is 20 bytes when the "options" field is not used, and it will be longer if the "options" field is used.
    UDP has a fixed header of only 8 bytes, resulting in less overhead.
  6. Transmission Method
    TCP is stream-oriented, with no boundaries, but guarantees order and reliability.
    UDP sends data in packets, with boundaries, but may lose packets or have them arrive out of order.
  7. Fragmentation Differences
    If TCP data size exceeds MSS, it will fragment at the transport layer. The target host will also reassemble the TCP packet at the transport layer. If a fragment is lost, only that fragment needs to be retransmitted.
    If UDP data size exceeds MTU, it will fragment at the IP layer. The target host will reassemble the data at the IP layer before passing it to the transport layer.
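The "transmission method" difference above can be observed directly with sockets: each UDP recvfrom() returns exactly one datagram, preserving message boundaries, whereas a TCP recv() just returns the next chunk of the byte stream. A minimal loopback sketch in Python:

```python
import socket

# Two datagrams queued on a UDP socket come back as two separate messages,
# no matter how recvfrom() is called: UDP preserves boundaries.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(('127.0.0.1', 0))

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b'first', rx.getsockname())
tx.sendto(b'second', rx.getsockname())

msg1, _ = rx.recvfrom(1024)   # exactly one datagram per call
msg2, _ = rx.recvfrom(1024)
tx.close(); rx.close()
```

Loopback UDP does not lose or reorder packets, which is why this toy example is deterministic; over a real network the two datagrams could be lost or swapped.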

Application Scenarios for TCP and UDP#

Because TCP is connection-oriented and guarantees reliable data delivery, it is often used for:

  • FTP file transfer;
  • HTTP / HTTPS;

Because UDP is connectionless, can send data at any time, and is simple and efficient to process, it is often used for:

  • Communication with a small number of packets, such as DNS, SNMP, etc.;
  • Multimedia communication such as video and audio;
  • Broadcast communication;

TCP Connection Establishment#

What is the process of the TCP three-way handshake?#

[Figure: TCP three-way handshake]

  • Initially, both the client and server are in the CLOSE state. The server actively listens on a certain port, in the LISTEN state.
  • The client randomly initializes a sequence number (client_isn), places this number in the "Sequence Number" field of the TCP header, and sets the SYN flag to 1, indicating a SYN packet. It then sends the first SYN packet to the server, indicating a connection request. This packet does not contain application layer data, and the client then enters the SYN-SENT state.
  • After receiving the client's SYN packet, the server also randomly initializes its own sequence number (server_isn), fills this number in the "Sequence Number" field of the TCP header, and fills the "Acknowledgment Number" field with client_isn + 1. It then sets both the SYN and ACK flags to 1. Finally, it sends this packet to the client, which also does not contain application layer data, and the server then enters the SYN-RCVD state.
  • After the client receives the server's packet, it must respond with the final acknowledgment packet. First, the ACK flag in the TCP header of this acknowledgment packet is set to 1, and the "Acknowledgment Number" field is filled with server_isn + 1. Finally, the packet is sent to the server, and this packet can carry data from the client to the server. The client then enters the ESTABLISHED state.
  • After the server receives the client's acknowledgment packet, it also enters the ESTABLISHED state.

From the above process, it can be seen that the third handshake can carry data, while the first two handshakes cannot carry data, which is a common interview question.
Once the three-way handshake is completed, both parties are in the ESTABLISHED state, and the connection is established, allowing the client and server to send data to each other.

How to check TCP status in a Linux system?

In Linux, you can check it using the command netstat -napt.

What happens if the first handshake is lost?

If the client does not receive the server's SYN-ACK packet (the second handshake) for a long time, it will trigger the "timeout retransmission" mechanism and retransmit the SYN packet. The sequence number of the retransmitted SYN packet remains the same.
In Linux, the maximum retransmission count for the client's SYN packet is controlled by the tcp_syn_retries kernel parameter, which can be customized. The default value is generally 5. Each timeout duration is twice the previous one.
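A small sketch of the resulting retry timeline, assuming Linux's typical 1-second initial SYN timeout and the default tcp_syn_retries = 5 (both values are configuration-dependent):

```python
# Each retry waits twice as long as the previous one: 1s, 2s, 4s, 8s, 16s, 32s.
initial_timeout = 1   # seconds; Linux's usual initial SYN timeout
retries = 5           # default tcp_syn_retries

timeouts = [initial_timeout * 2 ** i for i in range(retries + 1)]
total = sum(timeouts)  # how long connect() keeps trying before giving up
print(timeouts, total)
```

So with the defaults, a client gives up on an unanswered SYN after roughly a minute.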

What happens if the second handshake is lost?

  • Since the client receives no response to its SYN, it will retransmit the SYN packet, i.e., the first handshake. The maximum retransmission count is determined by the tcp_syn_retries kernel parameter.
  • Since the server receives no ACK for its SYN-ACK, it will retransmit the SYN-ACK packet, i.e., the second handshake. The maximum retransmission count is determined by the tcp_synack_retries kernel parameter, with a default value of 5.

What happens if the third handshake is lost?

If the third handshake is lost, and the server does not receive this acknowledgment packet for a long time, it will trigger the timeout retransmission mechanism and retransmit the SYN-ACK packet until it receives the third handshake or reaches the maximum retransmission count.
Note: ACK packets are not retransmitted. If an ACK is lost, the other party will retransmit the corresponding packet.

Why is three-way handshake necessary?#

  1. Avoid Historical Connections
    The primary reason for the three-way handshake is to prevent confusion caused by old duplicate connection initializations.
    Consider a scenario where the client first sends a SYN (seq = 90) packet, then the client crashes, and this SYN packet is blocked by the network, so the server does not receive it. After the client restarts, it attempts to establish a connection with the server again by sending a SYN (seq = 100) packet (note! This is not a retransmission of SYN; the retransmitted SYN has the same sequence number).
    Let's see how the three-way handshake prevents historical connections:
    If the client continuously sends multiple SYN packets (all with the same four-tuple) to establish a connection under network congestion:
    • An "old SYN packet" arrives at the server before the "latest SYN" packet, causing the server to respond with a SYN + ACK packet to the client, with an acknowledgment number of 91 (90+1).
    • The client receives this and realizes that the expected acknowledgment number should be 100 + 1, not 90 + 1, so it responds with a RST packet.
    • The server receives the RST packet and releases the connection.
    • Once the latest SYN reaches the server, the client and server can complete the three-way handshake normally.

The "old SYN packet" mentioned above is referred to as a historical connection. The main reason TCP uses a three-way handshake to establish a connection is to prevent the initialization of "historical connections."
In the case of two-way handshakes, the server has no intermediate state to prevent historical connections from the client, potentially leading to the establishment of a historical connection and wasting resources.
  2. Synchronize Initial Sequence Numbers
    The sequence number is a key element of reliable transmission, serving the following purposes:
    • The receiver can discard duplicate data;
    • The receiver can reorder received data by sequence number;
    • The sender can identify which of its packets the other party has received (via the acknowledgment number in the ACK packet);

    Thus, when the client sends a SYN packet carrying its "initial sequence number," the server must respond with an ACK packet indicating that the client's SYN was successfully received. Likewise, when the server sends its "initial sequence number" to the client, it must also get a response from the client. This round trip in each direction ensures that both parties' initial sequence numbers are reliably synchronized.
  3. Avoid Resource Waste
    With only a "two-way handshake," if the client's SYN is blocked in the network and the client, receiving no ACK, resends the SYN, then without a third handshake the server cannot know whether the client received its ACK, so it must establish a new connection every time it receives a SYN.
    If the client's SYN packets are blocked in the network and multiple SYNs are sent, the server will establish multiple redundant, invalid connections, wasting resources unnecessarily.

Summary: The reasons for not using "two-way handshakes" and "four-way handshakes":
"Two-way handshake": Cannot prevent the establishment of historical connections, leading to resource waste on both sides, and cannot reliably synchronize the sequence numbers of both parties;
"Four-way handshake": The three-way handshake is already theoretically the minimum reliable connection establishment, so there is no need for more communication rounds.

Why must the initial sequence numbers be different each time a TCP connection is established?#

  • To prevent historical packets from being received by the next connection with the same four-tuple (the main reason);
  • For security, to prevent forged TCP packets with guessable sequence numbers from being accepted by the other party;

[Figure: a historical packet being received by a new connection with the same four-tuple]
The process is as follows:

  • The client and server establish a TCP connection, and when the client's data packet is blocked in the network, it times out and retransmits this data packet. Meanwhile, the server device loses power and restarts, causing the previously established connection with the client to disappear. When it receives the client's data packet, it sends a RST packet.
  • Immediately afterward, the client establishes a connection with the server with the same four-tuple as the previous connection.
  • After the new connection is established, the previously blocked data packet from the old connection arrives at the server, and the sequence number of this packet happens to be within the server's receiving window, so the packet will be received normally by the server, causing data confusion.

It can be seen that if the initial sequence numbers are the same each time a connection is established, it is easy to encounter the problem of historical packets being received by the next connection with the same four-tuple.

How is the initial sequence number ISN randomly generated?

The initial ISN is based on a clock, incrementing by 1 every 4 microseconds, and it takes 4.55 hours to complete a cycle.
RFC793 mentions the random generation algorithm for the initial sequence number ISN: ISN = M + F(localhost, localport, remotehost, remoteport).

  • M is a timer that increments by 1 every 4 microseconds.
  • F is a hash algorithm that generates a random value based on the source IP, destination IP, source port, and destination port. It is essential to ensure that the hash algorithm cannot be easily deduced from the outside, and using the MD5 algorithm is a good choice.

It can be seen that the random number is incremented based on the clock timer, making it virtually impossible to randomly generate the same initial sequence number.
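A toy Python sketch of the RFC 793 scheme (not the actual kernel implementation; the secret key and the MD5 choice are illustrative assumptions):

```python
import hashlib
import struct

# Illustrative per-boot secret; a real implementation keeps this hidden.
SECRET = b'some-per-boot-secret'

def isn(localhost, localport, remotehost, remoteport, microseconds):
    # M: a timer that ticks once every 4 microseconds, wrapping at 2^32
    m = (microseconds // 4) % 2 ** 32
    # F: a keyed hash of the four-tuple, truncated to 32 bits
    material = f'{localhost}:{localport}:{remotehost}:{remoteport}'.encode() + SECRET
    f = struct.unpack('!I', hashlib.md5(material).digest()[:4])[0]
    return (m + f) % 2 ** 32

a = isn('10.0.0.1', 40000, '10.0.0.2', 80, microseconds=0)
b = isn('10.0.0.1', 40000, '10.0.0.2', 80, microseconds=4_000_000)  # 4 s later
```

Note how the same four-tuple yields a different ISN 4 seconds later: the clock component M has advanced by 1,000,000 ticks.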

What is a SYN attack? How to prevent SYN attacks?#

We know that establishing a TCP connection requires a three-way handshake. Suppose an attacker forges SYN packets with different source IP addresses in a short time. Each time the server receives a SYN packet, it enters the SYN_RCVD state, but the SYN + ACK packets it sends will never receive ACK responses from the forged, unreachable hosts. Over time, this fills the server's half-connection queue, preventing the server from serving legitimate users.

Half-connection queue and full-connection queue?
Also known as SYN queue and accept queue;
[Figure: SYN queue and accept queue]
Normal process:

  • When the server receives a client's SYN packet, it creates a half-connection object and adds it to the kernel's "SYN queue";
  • It then sends a SYN + ACK to the client and waits for the client to respond with an ACK packet;
  • After the server receives the ACK packet from the client, it removes the half-connection object from the "SYN queue," creates a full connection object, and places it in the "Accept queue";
  • The application calls the accept() socket interface to retrieve the connection object from the "Accept queue."

Regardless of whether it is a half-connection queue or a full-connection queue, there is a maximum length limit, and packets will be discarded by default if the limit is exceeded.

The most direct manifestation of a SYN attack is to fill the TCP half-connection queue, causing subsequent SYN packets to be discarded when the TCP half-connection queue is full, preventing clients from establishing connections with the server.
To prevent SYN attacks, the following four methods can be employed:
1. Increase netdev_max_backlog
When the network card receives packets faster than the kernel can process them, a queue holds these packets. The maximum length of this queue is controlled by net.core.netdev_max_backlog, which defaults to 1000. We should increase this parameter appropriately, for example setting net.core.netdev_max_backlog = 10000.
2. Increase TCP Half-Connection Queue
To increase the TCP half-connection queue, the following three parameters must be increased simultaneously:

  1. Increase net.ipv4.tcp_max_syn_backlog
  2. Increase the backlog in the listen() function
  3. Increase net.core.somaxconn

3. Enable net.ipv4.tcp_syncookies
Enabling the syncookies feature allows connections to be established successfully without using the SYN half-connection queue, effectively bypassing the half-connection queue.
[Figure: connection establishment with syncookies enabled]
It can be seen that when tcp_syncookies is enabled, even if a SYN attack causes the SYN queue to be full, normal connections can still be successfully established.
The net.ipv4.tcp_syncookies parameter mainly has the following three values:

  • 0: disable the feature;
  • 1: enable it only when the SYN half-connection queue is full;
  • 2: enable it unconditionally;

Thus, when dealing with SYN attacks, setting it to 1 is sufficient.
4. Reduce SYN+ACK Retransmission Count
When the server is under a SYN attack, there will be many TCP connections in the SYN_RCVD state. Connections in this state retransmit SYN+ACK, and once the retransmission count exceeds the maximum, the connection is terminated.
Therefore, under a SYN attack, we can reduce the retransmission count of SYN+ACK to speed up the teardown of TCP connections in the SYN_RCVD state.
The maximum retransmission count for SYN-ACK packets is controlled by the tcp_synack_retries kernel parameter (default value is 5).

TCP Connection Termination#

TCP Four-Way Handshake Process#

[Figure: TCP four-way handshake]

  • The client intends to close the connection and sends a packet with the TCP header's FIN flag set to 1, known as the FIN packet. The client then enters the FIN_WAIT_1 state.
  • After the server receives this packet, it sends an ACK acknowledgment packet to the client, then enters the CLOSE_WAIT state.
  • After the client receives the server's ACK acknowledgment packet, it enters the FIN_WAIT_2 state.
  • After the server finishes processing the data, it sends a FIN packet to the client, then enters the LAST_ACK state.
  • After the client receives the server's FIN packet, it sends an ACK acknowledgment packet back to the server, then enters the TIME_WAIT state.
  • After the server receives the ACK acknowledgment packet, it enters the CLOSE state, completing the connection closure on the server side.
  • After a period of 2MSL, the client automatically enters the CLOSE state, completing the connection closure on the client side.

You can see that each direction requires one FIN and one ACK, hence it is commonly referred to as a four-way handshake.
One point to note is that only the party actively closing the connection will have the TIME_WAIT state.

Why does closing a connection require four steps?

The server usually needs to wait for the completion of data sending and processing, so the server's ACK and FIN are generally sent separately, thus requiring four steps for the handshake.

What happens if the first handshake is lost?

If the first handshake is lost, and the client does not receive the passive party's ACK for a long time, it will trigger the timeout retransmission mechanism and retransmit the FIN packet. The number of retransmissions is controlled by the tcp_orphan_retries parameter.
When the number of FIN retransmissions exceeds tcp_orphan_retries, it stops sending FIN packets and waits for a period (twice the previous timeout). If it still does not receive the second handshake, it enters the closed state directly.

What happens if the second handshake is lost?

ACK packets are not retransmitted, so if the server's second handshake is lost, the client will trigger the timeout retransmission mechanism and retransmit the FIN packet until it receives the server's second handshake or reaches the maximum retransmission count.

What happens if the third handshake is lost?

When the client receives the second handshake, which is the server's ACK packet, it enters the FIN_WAIT2 state. In this state, it needs to wait for the server to send the third handshake, which is the server's FIN packet.
For connections closed by the close function, since no further data can be sent or received, the FIN_WAIT2 state cannot last too long. The tcp_fin_timeout controls the duration of this state, with a default value of 60 seconds.
This means that for connections closed by calling close, if the FIN packet is not received within 60 seconds, the client's (active closing party) connection will be closed directly.
However, note that if the active closing party uses the shutdown function to close the connection, specifying only to close the sending direction while not closing the receiving direction, it means that the active closing party can still receive data.
In this case, if the active closing party does not receive the third handshake for a long time, its connection will remain in the FIN_WAIT2 state.
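The half-close behavior described above can be sketched with Python's shutdown(SHUT_WR): the caller sends its FIN but can still read the peer's reply (a loopback sketch under illustrative names, not production code):

```python
import socket
import threading

state = {}

def server(srv):
    conn, _ = srv.accept()
    data = conn.recv(1024)            # the client's request
    state['eof'] = conn.recv(1024)    # b'' here means we saw the client's FIN
    conn.sendall(b'reply:' + data)    # still allowed: only their side is closed
    conn.close()

srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(1)
t = threading.Thread(target=server, args=(srv,))
t.start()

cli = socket.socket()
cli.connect(srv.getsockname())
cli.sendall(b'hello')
cli.shutdown(socket.SHUT_WR)          # FIN: "no more data from me"
reply = cli.recv(1024)                # the receive direction still works
t.join()
cli.close(); srv.close()
```

Contrast with close(), which gives up both directions at once: after a plain close(), the client could no longer read the reply.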
When the server (passive closing party) receives the client's (active closing party) FIN packet, the kernel will automatically reply with an ACK, and the connection will enter the LAST_ACK state, waiting for the client to return an ACK to confirm the connection closure.
If this ACK is not received for a long time, the server will retransmit the FIN packet, with the number of retransmissions controlled by the tcp_orphan_retries parameter, which is the same as the way the client retransmits the FIN packet.

What happens if the fourth handshake is lost?

When the client receives the server's third handshake, which is the FIN packet, it will send an ACK packet, which is the fourth handshake. At this point, the client's connection enters the TIME_WAIT state.
In a Linux system, the TIME_WAIT state will last for 2MSL before entering the closed state.
Then, the server (passive closing party) remains in the LAST_ACK state until it receives the ACK packet.
If the fourth handshake's ACK packet does not reach the server, the server will retransmit the FIN packet, with the number of retransmissions still controlled by the tcp_orphan_retries parameter introduced earlier.

Why is the waiting time for TIME_WAIT 2MSL?#

MSL is Maximum Segment Lifetime, the maximum lifetime of a packet, which is the longest time any packet can exist on the network. If it exceeds this time, the packet will be discarded. Since TCP packets are based on the IP protocol, and the IP header has a TTL field, which indicates the maximum number of hops a packet can take through routers, the value decreases by 1 for each router it passes. When this value reaches 0, the packet will be discarded, and an ICMP message will be sent to notify the source host.
The difference between MSL and TTL: MSL is measured in time, while TTL is measured in hops. Therefore, MSL should be greater than or equal to the time it takes for TTL to reach 0 to ensure that the packet has naturally disappeared.
The TTL value is generally 64, and Linux sets MSL to 30 seconds, meaning that Linux believes that the time taken for a packet to pass through 64 routers will not exceed 30 seconds. If it exceeds this, it is assumed that the packet has disappeared from the network.
The reason TIME_WAIT waits for twice the MSL: data packets from the sender may still exist in the network, and when the receiver processes them it may send responses back; allowing for this round trip requires waiting twice the MSL.

Why is the TIME_WAIT state necessary?#

There are two main reasons:

  1. To prevent data from historical connections from being incorrectly received by subsequent connections with the same four-tuple;
  2. To ensure that the party "closing the connection passively" can be closed correctly;

What are the dangers of excessive TIME_WAIT?#

On the server side, it occupies system resources such as file descriptors, memory resources, CPU resources, thread resources, etc.;
On the client side, it occupies port resources, which are also limited. The generally available port range is 32768 to 61000, which can also be specified through the net.ipv4.ip_local_port_range parameter.

How to optimize TIME_WAIT?#

  • Enable the net.ipv4.tcp_tw_reuse and net.ipv4.tcp_timestamps options;
    This allows sockets in the TIME_WAIT state to be reused for new connections. One point to note is that the tcp_tw_reuse feature can only be used by the client (the connection initiator), because when this feature is enabled, the kernel can take a connection that has been in the TIME_WAIT state for more than 1 second and reuse it for a new connection.
  • net.ipv4.tcp_max_tw_buckets
    This value defaults to 18000. When the number of connections in the TIME_WAIT state exceeds this value, the system will reset the subsequent TIME_WAIT connection states, which is a more aggressive method.
  • Use SO_LINGER in the program to forcefully close with RST.
    If l_onoff is non-zero and l_linger is set to 0, then calling close will send a RST flag to the other party, and the TCP connection will skip the four-way handshake and the TIME_WAIT state, closing directly.
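A minimal Python sketch of that abortive close, assuming Linux-style struct linger semantics (l_onoff = 1, l_linger = 0 makes close() send RST; the peer sees ECONNRESET):

```python
import socket
import struct
import threading
import time

srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(1)

result = {}
def server():
    conn, _ = srv.accept()
    time.sleep(0.2)               # give the client's RST time to arrive
    try:
        conn.recv(1024)
        result['reset'] = False
    except ConnectionResetError:
        result['reset'] = True    # the abortive close was seen as ECONNRESET
    conn.close()

t = threading.Thread(target=server)
t.start()

cli = socket.socket()
cli.connect(srv.getsockname())
# struct linger { int l_onoff; int l_linger; } -> onoff = 1, linger = 0
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
cli.close()                       # sends RST; skips FIN exchange and TIME_WAIT

t.join()
srv.close()
```

Because data in flight can be discarded, this option should be used with care and only when losing unsent data is acceptable.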

What are the reasons for a server to have a large number of TIME_WAIT states?#

The TIME_WAIT state only appears on the active closing party, so if the server has a large number of TCP connections in the TIME_WAIT state, it indicates that the server has actively closed many TCP connections.
In what scenarios will the server actively close connections?

  • First scenario: HTTP does not use long connections.
    As long as either party's HTTP header carries Connection: close, the HTTP long-connection (keep-alive) mechanism cannot be used. Thus, after the HTTP request is completed and processed, the connection will be closed.
    According to most web service implementations, regardless of which party disables HTTP Keep-Alive, the server actively closes the connection.
  • Second scenario: HTTP long connection timeout.
    Assuming the timeout for the HTTP long connection is set to 60 seconds, nginx will start a "timer." If the client does not initiate a new request within 60 seconds after completing the last HTTP request, when the timer expires, nginx will trigger a callback function to close the connection, resulting in TIME_WAIT state connections on the server.
  • Third scenario: The number of requests for the HTTP long connection reaches the limit.
    The nginx parameter keepalive_requests indicates the number of client requests that have been received and processed on a single HTTP long connection. If this reaches the maximum value set by this parameter, nginx will actively close this long connection, resulting in TIME_WAIT state connections on the server.

What are the reasons for a server to have a large number of CLOSE_WAIT states?#

When a server has a large number of connections in the CLOSE_WAIT state, it means the server's program has not called the close function to close the connection, which is usually a code-level bug.

Socket Programming#

How should Socket programming be done for TCP?#

[Figure: TCP socket programming workflow]

  • The server and client initialize the socket and obtain file descriptors;
  • The server calls bind to bind the socket to a specified IP address and port;
  • The server calls listen to start listening;
  • The server calls accept to wait for client connections;
  • The client calls connect to initiate a connection request to the server's address and port;
  • The server's accept returns the file descriptor for the socket used for transmission;
  • The client calls write to send data; the server calls read to receive data;
  • When the client disconnects, it calls close, and when the server reads data, it will read EOF. After processing the data, the server calls close to indicate that the connection is closed.

It is important to note that when the server calls accept, it will return a socket for the completed connection, which is used for data transmission.
Thus, there are "two" sockets: one is the listening socket, and the other is the completed connection socket.
Once the connection is successfully established, both parties start to read and write data using the read and write functions, just like writing to a file stream.
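The steps above can be sketched as a minimal loopback echo exchange in Python (a toy example, not production code):

```python
import socket
import threading

def run_server(srv):
    conn, addr = srv.accept()         # blocks until a handshake completes;
                                      # returns the connected socket
    data = conn.recv(1024)            # read the client's data
    conn.sendall(data)                # echo it back
    if conn.recv(1024) == b'':        # EOF: the client called close()
        conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))            # bind to a free local port
srv.listen(8)                         # this is the listening socket
t = threading.Thread(target=run_server, args=(srv,))
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())        # three-way handshake happens here
cli.sendall(b'ping')
echoed = cli.recv(1024)
cli.close()                           # four-way close begins here
t.join()
srv.close()
```

Note the two sockets on the server side: `srv` (listening) and `conn` (the connected socket returned by accept), matching the description above.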

What is the significance of the backlog parameter when listening?#

The Linux kernel maintains two queues:

  • Half-connection queue (SYN queue): Receives a SYN connection request and is in the SYN_RCVD state;
  • Full-connection queue (Accept queue): Completed the TCP three-way handshake process and is in the ESTABLISHED state;

In early Linux kernels, backlog referred to the size of the SYN queue, which is the size of the incomplete queue.
After Linux kernel 2.2, backlog became the length of the accept queue, which is the queue for established connections. Therefore, it is now generally considered that backlog refers to the accept queue.
However, the upper limit is the size of the kernel parameter somaxconn, meaning that the accept queue length = min(backlog, somaxconn).

What happens when the TCP half-connection queue and full-connection queue are full?
TCP Full Connection Queue Overflow
When the maximum full connection queue is exceeded, the server will drop subsequent incoming TCP connections.
TCP Half Connection Queue Overflow
If the half-connection queue is full and tcp_syncookies is not enabled, the SYN packet is discarded;
If the full-connection queue is full and more than one connection request has not yet had its SYN+ACK retransmitted, the SYN packet is discarded;
If tcp_syncookies is not enabled and max_syn_backlog minus the current half-connection queue length is less than (max_syn_backlog >> 2), the SYN packet is discarded.
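These drop conditions can be collected into one decision function. This is a hypothetical simplification for illustration, not the kernel's actual code; `young_reqs` stands for SYN-queue entries that have not yet had their SYN+ACK retransmitted:

```python
def should_drop_syn(syn_qlen, syn_queue_full, accept_queue_full,
                    syncookies_enabled, young_reqs, max_syn_backlog):
    """Decide whether an incoming SYN is dropped (simplified sketch)."""
    # 1. SYN queue full and no syncookies to fall back on
    if syn_queue_full and not syncookies_enabled:
        return True
    # 2. accept queue full while more than one "young" request exists
    if accept_queue_full and young_reqs > 1:
        return True
    # 3. remaining SYN-queue headroom below a quarter of max_syn_backlog
    if not syncookies_enabled and \
            (max_syn_backlog - syn_qlen) < (max_syn_backlog >> 2):
        return True
    return False

# accept queue full with 2 young requests -> dropped by condition 2
print(should_drop_syn(100, False, True, True, 2, 128))   # True
```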

At which step does accept occur in the three-way handshake?#

image
The client successfully returns from connect during the second handshake, while the server successfully returns from accept after the three-way handshake is completed.

What is the process of connection disconnection when the client calls close?#

image

  • The client calls close, indicating that there is no data to send, and sends a FIN packet to the server, entering the FIN_WAIT_1 state;
  • The server receives the FIN packet, and the TCP protocol stack will insert an EOF file end marker into the receive buffer for the FIN packet. The application can perceive this FIN packet through the read call. This EOF will be placed after other received data that is queued. This means the server needs to handle this exceptional situation, as EOF indicates no additional data will arrive on this connection. At this point, the server enters the CLOSE_WAIT state;
  • After processing the data, the server will naturally read EOF and then call close to close its socket, sending a FIN packet and entering the LAST_ACK state;
  • The client receives the server's FIN packet and sends an ACK confirmation packet back to the server, entering the TIME_WAIT state;
  • After the server receives the ACK confirmation packet, it enters the final CLOSED state;
  • After 2MSL, the client also enters the CLOSED state.
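The EOF behavior described above can be observed directly: after the peer calls close, the next read returns an empty byte string. A loopback sketch (port 0 again lets the OS pick a free port):

```python
import socket
import threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

def client():
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect(("127.0.0.1", port))
    c.sendall(b"bye")
    c.close()                # sends FIN; the client enters FIN_WAIT_1

t = threading.Thread(target=client)
t.start()
conn, _ = srv.accept()
data = conn.recv(1024)       # b'bye': data queued ahead of the FIN
eof = conn.recv(1024)        # b'': EOF, the peer has closed its side
conn.close()                 # server's FIN; it enters LAST_ACK
t.join()
srv.close()
print(data, eof)
```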

Can a TCP connection be established without accept?#

Yes.
The accept system call does not participate in the TCP three-way handshake process; it is only responsible for retrieving a socket from the TCP full connection queue that has already been established. The user layer can perform read and write operations on the socket obtained through the accept system call.

Can a TCP connection be established without listen?#

Yes.
A client can connect to itself to form a connection (TCP self-connection) or two clients can simultaneously request to establish connections with each other (TCP simultaneous open). In both cases, there is a common point: no server is involved, meaning that a TCP connection can be established without listen.
If a server is involved and has not called the listen function, it cannot find a socket listening on that port, resulting in a RST to terminate the connection.

TCP Reliability Mechanisms#

Retransmission Mechanism#

Timeout Retransmission#

One of the ways to implement the retransmission mechanism is to set a timer when sending data. If the specified time elapses without receiving the other party's ACK confirmation packet, the data will be retransmitted, which is commonly referred to as timeout retransmission.
TCP will perform timeout retransmission in the following two situations:

  • Packet loss
  • Acknowledgment loss

If the data that is retransmitted due to timeout experiences another timeout, TCP's strategy is to double the timeout interval.
This means that each time a timeout retransmission occurs, the next timeout interval will be set to twice the previous value. Two timeouts indicate a poor network environment, making frequent retransmissions inadvisable.
A problem with timeout-triggered retransmissions is that the timeout period may be relatively long.
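The doubling strategy can be sketched as a tiny backoff helper. The initial RTO of 1 second and the 120-second cap are illustrative assumptions, not the RFC 6298 RTO computation:

```python
def next_rto(rto, max_rto=120.0):
    """Double the retransmission timeout after each timeout event,
    capped at an upper bound (many stacks cap around 120 s)."""
    return min(rto * 2, max_rto)

rto = 1.0                    # assumed initial RTO of 1 second
timeline = []
for _ in range(4):           # four consecutive timeouts
    rto = next_rto(rto)
    timeline.append(rto)
print(timeline)              # [2.0, 4.0, 8.0, 16.0]
```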

Fast Retransmission#

TCP also has another mechanism called fast retransmission, which is driven by data rather than time.
image
In the diagram, the sender sends data packets 1, 2, 3, 4, and 5:

  • The first packet Seq1 arrives first, so it acknowledges 2;
  • Seq2 is not received for some reason, but Seq3 arrives, so it still acknowledges 2;
  • The subsequent Seq4 and Seq5 arrive, but it still acknowledges 2 because Seq2 has not been received;
  • The sender receives three ACK = 2 confirmations and knows that Seq2 has not been received, so it retransmits the lost Seq2 before the timer expires.
  • Finally, Seq2 is received, and since Seq3, Seq4, and Seq5 have all been received, it acknowledges 6.

Thus, the fast retransmission mechanism works by retransmitting lost packets when three identical ACK packets are received before the timer expires.
Fast retransmission solves the problem of waiting out a long timeout, but it still leaves another question open: should the sender retransmit only the one lost packet, or every packet sent after it?
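The sender-side trigger can be sketched as a duplicate-ACK counter: the original ACK plus three duplicates fires a retransmission. This is a hypothetical helper, not a real stack's code:

```python
def fast_retransmit_trigger(acks):
    """Return the sequence numbers retransmitted by the 3-dup-ACK rule."""
    retransmitted = []
    dup_count = 0
    last_ack = None
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:       # third duplicate of the same ACK
                retransmitted.append(ack)
        else:
            last_ack, dup_count = ack, 0
    return retransmitted

# ACK stream from the diagram: Seq2 is lost, so ACK 2 keeps repeating
print(fast_retransmit_trigger([2, 2, 2, 2, 6]))  # [2]
```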

SACK Method#

SACK (Selective Acknowledgment) is a method that requires adding a SACK option in the TCP header. It allows the receiver to send information about the data that has been received to the sender, enabling the sender to know which data has been received and which has not. With this information, the sender can retransmit only the lost data.
In the example below, when the sender receives three identical ACK confirmation packets, fast retransmission is triggered. The SACK information shows that only the data in the range 200 to 299 is missing, so the sender retransmits just that TCP segment.
image
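From the cumulative ACK and the reported SACK blocks, the sender can compute exactly which holes to retransmit. A sketch using the 200 to 299 example, with half-open `[start, end)` byte ranges as an assumed convention:

```python
def sack_holes(cum_ack, snd_nxt, sack_blocks):
    """Return [start, end) ranges not covered by cum_ack or any SACK block."""
    holes = []
    pos = cum_ack                        # everything below cum_ack is received
    for start, end in sorted(sack_blocks):
        if start > pos:
            holes.append((pos, start))   # a gap the receiver never got
        pos = max(pos, end)
    if pos < snd_nxt:
        holes.append((pos, snd_nxt))     # tail not yet acknowledged at all
    return holes

# receiver ACKed up to 200 and SACKed 300-600: only 200-299 is missing
print(sack_holes(200, 600, [(300, 600)]))  # [(200, 300)]
```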

Duplicate SACK#

Duplicate SACK, also known as D-SACK, primarily uses SACK to inform the sender which data has been received multiple times.
ACK Packet Loss:
image

  • Both ACK confirmations sent from the receiver to the sender are lost, so the sender times out and retransmits the first data packet (3000 to 3499).
  • The receiver finds that the data has been received multiple times and sends a SACK = 3000 to 3500, informing the sender that the data from 3000 to 3500 has already been received. Since the ACK has reached 4000, it indicates that all data before 4000 has been received, so this SACK represents D-SACK.
  • Thus, the sender knows that the data has not been lost; rather, the receiver's ACK confirmation packets have been lost.

Network Delay:
image

  • The data packet (1000 to 1499) is delayed in the network, causing the sender not to receive the ACK for 1500.
  • The subsequent three identical ACK confirmation packets trigger the fast retransmission mechanism, but after retransmission, the delayed data packet (1000 to 1499) arrives at the receiver.
  • Therefore, the receiver sends a SACK = 1000 to 1500, indicating that it has received duplicate packets since the ACK has reached 3000, making this SACK a D-SACK, indicating that duplicate packets have been received.
  • Thus, the sender knows that the reason for triggering fast retransmission is not due to lost packets sent out or lost ACK packets, but rather due to network delays.

It can be seen that D-SACK has several benefits:

  • It allows the sender to know whether the sent packets were lost or if the receiver's ACK packets were lost;
  • It indicates whether the sender's data packets were delayed in the network;
  • It shows whether the sender's data packets were duplicated in the network.
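The sender can recognize a D-SACK because the first SACK block lies entirely below the cumulative ACK. This is a simplified check; the full rule in RFC 2883 also counts a first block contained inside the second block as a D-SACK:

```python
def is_dsack(cum_ack, sack_blocks):
    """First SACK block below the cumulative ACK => duplicate data (D-SACK)."""
    if not sack_blocks:
        return False
    start, end = sack_blocks[0]
    return end <= cum_ack          # block is entirely already-ACKed data

# ACK = 4000 with SACK 3000-3500: a duplicate segment, so a D-SACK
print(is_dsack(4000, [(3000, 3500)]))  # True
# ACK = 3000 with SACK 3000-3500: a normal SACK above the cumulative ACK
print(is_dsack(3000, [(3000, 3500)]))  # False
```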

Sliding Window#

TCP introduces the concept of a window to address the issue that the longer the round-trip time of data packets, the lower the communication efficiency. The window size refers to the maximum amount of data that can be sent without waiting for an acknowledgment.
The implementation of the window is essentially a buffer space allocated by the operating system. The sender must retain the sent data in the buffer until the acknowledgment is received. If the acknowledgment is received on time, the data can be cleared from the buffer.

image
In the diagram, the ACK 600 acknowledgment packet is lost, but it does not matter, because the next acknowledgment covers it: as long as the sender receives ACK 700, it knows that all data before 700 has been received. This mode is called cumulative acknowledgment.
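Cumulative acknowledgment also tells the sender which buffered segments it may release. A minimal sketch of that cleanup, with segments represented as assumed `(seq, length)` pairs:

```python
def on_ack(unacked_segments, ack):
    """Drop every buffered segment that ends at or before the ACK number.

    unacked_segments: list of (seq, length) still held for retransmission.
    """
    return [(seq, ln) for seq, ln in unacked_segments if seq + ln > ack]

# segments 500-599 and 600-699 in flight; ACK 600 was lost, ACK 700 arrives
buf = [(500, 100), (600, 100)]
print(on_ack(buf, 700))   # []: both segments confirmed by the single ACK 700
```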

Who determines the window size?

There is a field in the TCP header called Window, which indicates the window size.
This field informs the sender how much buffer space the receiver has available to receive data. Therefore, the sender can send data based on the receiver's processing capacity without overwhelming it.
Thus, the window size is usually determined by the receiver.
The amount of data sent by the sender cannot exceed the receiver's window size; otherwise, the receiver will not be able to receive the data properly.

Sending Window
The data stream sent can be divided into the following four parts: sent and acknowledged | sent but not acknowledged | not sent but can be sent | not sendable, where the sending window = sent but not acknowledged + not sent but can be sent.

Receiving Window
The received data stream can be divided into: received | not received but ready to receive | not received and not ready to receive. The receiving window = the part that is not received but ready to receive.
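On the send side, the four parts follow from just three values, called snd_una (oldest unacknowledged byte), snd_nxt (next byte to send), and snd_wnd (window size) in this hypothetical sketch with half-open byte ranges:

```python
def send_window_parts(snd_una, snd_nxt, snd_wnd, total_bytes):
    """Partition the byte stream [0, total_bytes) from the sender's view."""
    return {
        "sent_and_acked": (0, snd_una),
        "sent_not_acked": (snd_una, snd_nxt),
        "usable":         (snd_nxt, snd_una + snd_wnd),  # may still be sent
        "not_sendable":   (snd_una + snd_wnd, total_bytes),
    }

parts = send_window_parts(snd_una=32, snd_nxt=52, snd_wnd=20, total_bytes=100)
print(parts["usable"])    # (52, 52): the window is full, nothing more can go
```

The empty "usable" range shows the invariant from the text: bytes in flight (sent but not acknowledged) plus bytes still sendable together make up the sending window.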

Are the sizes of the receiving window and sending window equal?

They are not completely equal; the size of the receiving window is approximately equal to the size of the sending window.
This is because the sliding window is not fixed. For example, if the receiving application process reads data very quickly, the receiving window can quickly become available. The new receiving window size is communicated to the sender through the Window field in the TCP packet. Since this transmission process has latency, the receiving window and sending window are approximately equal.

Flow Control#

The sender cannot blindly send data to the receiver without considering the receiver's processing capacity.
If data is continuously sent to the receiver without regard for its ability to process, it may trigger the retransmission mechanism, leading to unnecessary waste of network traffic.
To address this phenomenon, TCP provides a mechanism that allows the sender to control the amount of data sent based on the receiver's actual receiving capacity, which is known as flow control.
Relationship between Operating System Buffer and Sliding Window
The number of bytes stored in the sending window and receiving window is kept in the operating system's memory buffer, which can be adjusted by the operating system.
When the application process does not read the contents of the buffer in time, the buffer, and therefore the advertised window, is affected as well.
If the operating system reduces the buffer first and only then shrinks the window, packet loss may occur.
To prevent this, TCP stipulates that the buffer must not be reduced at the same moment the window is shrunk; instead, the window is shrunk first and the buffer is reduced after a delay, which avoids the packet loss.
Window Closure
If the window size is 0, it will prevent the sender from transmitting data to the receiver until the window becomes non-zero. This is known as window closure.
When the receiver informs the sender of the window size, it does so through the ACK packet.
When window closure occurs, after the receiver processes the data, it will send a non-zero window notification ACK packet to the sender. If this notification ACK packet is lost in the network, it will cause the sender to wait indefinitely for the non-zero window notification from the receiver, while the receiver will also wait for data from the sender. If no measures are taken, this mutual waiting process will lead to a deadlock.
To resolve this issue, TCP sets a persistent timer for each connection. As soon as one side of the TCP connection receives a zero window notification from the other side, it starts the persistent timer.
If the persistent timer times out, it will send a window probe packet, and the other party will provide its current receiving window size when acknowledging this probe packet.
The number of window probes is generally 3, with each probe occurring approximately every 30-60 seconds (this may vary by implementation). If after 3 attempts the receiving window is still 0, some TCP implementations will send a RST packet to terminate the connection.
Silly Window Syndrome
If the receiver is too busy to take data out of the receive window, the sender's sending window becomes smaller and smaller.
Eventually, whenever the receiver frees up a few bytes and advertises that tiny window, the sender dutifully sends those few bytes, producing the silly window syndrome.
Note that the TCP and IP headers together already take 40 bytes; paying that overhead to transmit a few bytes of data is highly inefficient.
To resolve the silly window syndrome, two problems must be addressed:

  1. Prevent the receiver from notifying the sender of a small window.
    The receiver's usual strategy is as follows:
    When the "window size" is less than min(MSS, buffer space/2), meaning it is less than the minimum of MSS and half the buffer size, it will notify the sender that the window is 0, thus preventing the sender from sending more data.
    Once the receiver has processed some data and the window size is >= MSS or the receiver's buffer space has half available, it can open the window to allow the sender to send data.
  2. Prevent the sender from sending small data.
    The sender's usual strategy is:
    Use the Nagle algorithm, which is based on delayed processing. The sender can only send data when either of the following two conditions is met:
    Condition 1: Wait until the window size >= MSS and data size >= MSS;
    Condition 2: Receive the ACK packet for previously sent data;
    As long as neither of the above conditions is met, the sender will continue to accumulate data until the sending conditions are satisfied.
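The two conditions above can be condensed into a single predicate. This is a simplification of the real Nagle algorithm, with made-up parameter names:

```python
def nagle_can_send(pending_bytes, window, mss, unacked_inflight):
    """Send now only if a full-sized segment fits, or nothing is in flight."""
    if pending_bytes >= mss and window >= mss:   # condition 1: full segment
        return True
    if not unacked_inflight:                     # condition 2: all data ACKed
        return True
    return False

MSS = 1460
# 1 pending byte with data still in flight: accumulate instead of sending
print(nagle_can_send(1, 65535, MSS, unacked_inflight=True))    # False
# 1 pending byte but nothing outstanding: send immediately
print(nagle_can_send(1, 65535, MSS, unacked_inflight=False))   # True
```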

Congestion Control#

When the network becomes congested, continuing to send a large number of packets may lead to packet delays, losses, etc. In this case, TCP will retransmit the data, but retransmission will further burden the network, leading to greater delays and more packet losses, creating a vicious cycle that amplifies the issue.
Thus, congestion control was introduced to prevent the sender's data from filling the entire network.
To adjust the amount of data the sender sends, a concept called "congestion window" is defined. It dynamically changes based on the level of network congestion.
Congestion control algorithms:
1. Slow Start
Each time the sender receives an ACK, the congestion window cwnd increases by 1. Slow start works together with a threshold, the slow start threshold ssthresh:
- When cwnd < ssthresh, the slow start algorithm is used.
- When cwnd >= ssthresh, the "congestion avoidance algorithm" is used.
2. Congestion Avoidance
Every time an ACK is received, cwnd increases by 1/cwnd.
3. Congestion Occurrence
When network congestion occurs, meaning packet retransmissions happen, the congestion occurrence algorithm varies based on the different retransmission mechanisms.

Congestion occurrence algorithm for timeout retransmission

  • Set ssthresh to cwnd/2
  • Reset cwnd to 1 (restore cwnd to its initial value, assumed here to be 1)

image

Congestion occurrence algorithm for fast retransmission

  • Set cwnd = cwnd/2, meaning set it to half of the original value;
  • Set ssthresh = cwnd;
  • Enter fast recovery algorithm.

4. Fast Recovery

  • Set the congestion window cwnd = ssthresh + 3
  • Retransmit the lost packets;
  • If another duplicate ACK is received, increase cwnd by 1;
  • If an ACK for new data is received, set cwnd to the ssthresh value from the first step. This ACK confirms that all the data that was outstanding when the duplicate ACKs arrived has now been received, so the recovery process is over and the connection returns to the congestion avoidance state.

image
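The four phases above can be tied together in a toy Reno-style model. This works at per-RTT granularity with illustrative constants; it is a sketch of the state transitions, not a faithful stack implementation:

```python
def on_rtt(cwnd, ssthresh):
    """One RTT of growth: slow start below ssthresh, else congestion avoidance."""
    if cwnd < ssthresh:
        return cwnd * 2, ssthresh    # slow start: doubles once per RTT
    return cwnd + 1, ssthresh        # congestion avoidance: +1 per RTT

def on_timeout(cwnd, ssthresh):
    """RTO fired: ssthresh = cwnd/2, restart from cwnd = 1."""
    return 1, max(cwnd // 2, 2)

def on_fast_retransmit(cwnd, ssthresh):
    """Three duplicate ACKs: halve cwnd, then enter fast recovery."""
    new_cwnd = max(cwnd // 2, 2)
    return new_cwnd + 3, new_cwnd    # fast recovery: cwnd = ssthresh + 3

cwnd, ssthresh = 1, 8
for _ in range(5):                   # five uncongested RTTs
    cwnd, ssthresh = on_rtt(cwnd, ssthresh)
print(cwnd)   # 10: grows 1->2->4->8 (slow start), then 8->9->10 (linear)
```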
