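1. Problem overview
Although the software's low-level module reconnects to the server automatically once the network recovers, by then the conference has already been exited because of the network problem, and the user has to rejoin it. The customer's network environment is unusual and suffers frequent jitter and instability, so the customer requires that if the network recovers within 60 seconds the client must still be in the conference, so that the meeting is not interrupted.
The customer insists on this feature. The project is nearly finished and is in the customer trial phase; without the feature the project will not pass acceptance and the customer will not pay.
Colleagues on site reported the problem and the project status to the R&D managers, and R&D called an urgent meeting to work out how to stay in the conference for 60 seconds. Two kinds of network connection are involved: the TCP connection that carries control signaling, and the UDP connection that carries the audio/video streams. UDP is not much of a problem; the difficulty is TCP disconnection and reconnection, which is what the rest of this article discusses.
When the client drops out of a conference because of an unstable network, either the system's TCP/IP stack detected the failure and tore the connection down at the protocol layer, or the application-layer heartbeat detected the failure and closed the connection to the server. When the TCP/IP stack itself detects the failure, there are two cases: detection by the stack's own heartbeat (keepalive) mechanism, or detection by TCP's retransmission mechanism.
For the application-layer heartbeat, we can simply lengthen the timeout. This article mainly covers the TCP/IP stack's keepalive, retransmission and connect-timeout mechanisms for TCP connections. After a failure is detected, our lower layer can reconnect automatically, or a signaling send can trigger the reconnect; the business module keeps the conference resources instead of releasing them, so that once the network recovers the client is still in the conference, keeps receiving the audio/video streams, and can keep performing in-conference operations.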
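To make the idea of "keep the conference resources and resume if the link returns within 60 seconds" concrete, here is a minimal sketch in C++. The class name ConferenceGraceWindow and its methods are hypothetical and are not taken from the product's code; the caller is assumed to supply the current time and to do the actual reconnecting and resource release.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: tracks how long the signaling link has been down and
// tells the business module whether the conference may still be resumed.
class ConferenceGraceWindow {
public:
    using Clock = std::chrono::steady_clock;

    explicit ConferenceGraceWindow(std::chrono::seconds grace) : grace_(grace) {}

    // Call when the stack or the application heartbeat reports a failure.
    // Only the first report of an outage starts the timer.
    void OnLinkLost(Clock::time_point now) {
        if (!down_) {
            down_ = true;
            lostAt_ = now;
        }
    }

    // Call when the signaling connection is re-established.
    // Returns true if the outage stayed within the grace period,
    // i.e. the kept conference resources can be reused.
    bool OnLinkRestored(Clock::time_point now) {
        if (!down_) {
            return true;
        }
        down_ = false;
        return (now - lostAt_) <= grace_;
    }

    // Poll from a timer: true once the outage has outlasted the grace period
    // and the conference resources should finally be released.
    bool ShouldReleaseResources(Clock::time_point now) const {
        return down_ && (now - lostAt_) > grace_;
    }

private:
    std::chrono::seconds grace_;
    bool down_ = false;
    Clock::time_point lostAt_{};
};
```

In this sketch the business module would call OnLinkLost when a failure is reported, poll ShouldReleaseResources from a timer, and call OnLinkRestored after the reconnect succeeds. Passing the time in as a parameter keeps the logic easy to test without real waiting.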
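2. The TCP/IP stack's heartbeat (keepalive) mechanism
2.1 The ACK mechanism in TCP
TCP establishes a connection with a three-way handshake: the client sends SYN, the server answers with SYN+ACK, and the client completes the handshake with an ACK.
TCP is called reliable first because a connection must be established before data is sent, and second because received data is acknowledged with an ACK that tells the sender the data arrived. If the sender does not receive the ACK for data it has sent, retransmission is triggered.
ACKs are present both during connection setup and during data transfer, and the stack's heartbeat packets are no exception.
2.2 How the TCP/IP stack's heartbeat mechanism works
The TCP/IP stack has a built-in TCP heartbeat mechanism that is bound to an individual TCP socket, so it can be enabled for a specific socket. It is off by default and has to be turned on explicitly.
On Windows the default is one heartbeat every 2 hours. After the client sends a heartbeat to the server, one of two things happens:
1) Network healthy: the server receives the heartbeat and immediately replies with an ACK; after receiving the ACK the client waits another 2 hours before sending the next heartbeat. The interval between heartbeats is keepalivetime, which defaults to 2 hours on Windows and is configurable. If the client and server exchange data during those 2 hours, the ACKs the client receives for that data also count as heartbeat replies, and the 2-hour timer restarts.
2) Network broken: the server never receives the heartbeat and so cannot reply with an ACK. By default Windows waits 1 second and then resends the heartbeat. If the ACK still does not arrive, it resends again after another second. If no ACK ever arrives, the system limit is reached after 10 heartbeats; the stack concludes that the network has failed and drops the connection. The timeout for an unacknowledged heartbeat is keepaliveinterval, 1 second by default on Windows and configurable. The number of resends (probes) is fixed at 10 on Windows and cannot be configured.
So the stack's keepalive can also detect a network failure, but with the default settings detection may take a very long time, unless the failure happens just after a heartbeat has been sent and while its reply is awaited. In that case, after several resends go unanswered, the stack decides the network is down and closes the connection itself.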
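The arithmetic behind "may take a very long time" can be sketched as follows. This is only an illustration of the timing model described above (idle time, then a fixed number of probes spaced by the interval); the struct and function names are hypothetical, and a real stack's timing may differ slightly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the three keepalive parameters, in milliseconds.
struct KeepaliveTimings {
    std::int64_t idleMs;      // keepalivetime: idle time before the first probe
    std::int64_t intervalMs;  // keepaliveinterval: wait after each unanswered probe
    int probes;               // number of probes sent before giving up
};

// Worst case: the link dies right after the last data exchange, so the full
// idle period passes first, and then every probe times out in turn.
inline std::int64_t WorstCaseDetectMs(const KeepaliveTimings& t) {
    assert(t.idleMs >= 0 && t.intervalMs >= 0 && t.probes >= 0);
    return t.idleMs + static_cast<std::int64_t>(t.probes) * t.intervalMs;
}
```

With the Windows defaults described above (2 hours, 1 second, 10 probes) this gives 7,210,000 ms, a little over two hours. With an idle time of 10 seconds and an interval of 2 seconds it gives 30,000 ms.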
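2.3 Changing the stack's default heartbeat parameters
Enabling the stack's heartbeat does not turn on monitoring for the whole protocol stack; it is enabled for one particular socket.
Once it is enabled, the timing parameters can also be changed. In code, first call setsockopt to enable keepalive on the target socket, then call WSAIoctl to change the default timings: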
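SOCKET sock;
// ...... (intermediate code omitted)
// Turn on keepalive for this socket; the system default timings apply for now
int optval = 1;
int nRet = setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, (const char *)&optval,
                      sizeof(optval));
if (nRet != 0)
    return;
// Override the default timings (both values are in milliseconds);
// tcp_keepalive and SIO_KEEPALIVE_VALS are declared in <mstcpip.h>
tcp_keepalive alive;
alive.onoff = TRUE;
alive.keepalivetime = 10 * 1000;     // idle time before the first probe
alive.keepaliveinterval = 2 * 1000;  // wait between unanswered probes
DWORD dwBytesRet = 0;
nRet = WSAIoctl(sock, SIO_KEEPALIVE_VALS, &alive, sizeof(alive), NULL, 0,
                &dwBytesRet, NULL, NULL);
if (nRet != 0)
    return;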
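As the code shows, setsockopt is called first with SO_KEEPALIVE to switch on keepalive for the TCP connection; at this point the system default timings apply. Then WSAIoctl is called with SIO_KEEPALIVE_VALS, passing in the tcp_keepalive structure filled with the desired times.
The fields of the tcp_keepalive structure in detail (using Windows as the example):
1) keepalivetime: by default a keepalive probe is sent every 2 hours; after the first probe, the next one is sent 2 hours later. Data exchanged during that period counts as a valid keepalive, so no probe is sent, and the timer for the next probe restarts from zero at the moment the last data was sent or received.
2) keepaliveinterval: the timeout for a probe that gets no ACK from the peer, 1 second by default. If the network to the peer is broken, the first probe is sent; with no ACK within 1 second the second probe is sent; with still no ACK within 1 second the next probe is sent, and so on. If the 10th probe also gets no ACK within 1 second, the limit of 10 probes has been reached and the network is considered broken.
3) Probe count: fixed at 10 on Windows and cannot be changed.
MSDN describes a network failure detected by keepalive as follows: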
If a connection is dropped as the result of keep-alives the error code WSAENETRESET is returned to any calls in progress on the socket, and any subsequent calls will fail with WSAENOTCONN.
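In other words, when the connection is dropped because the probe limit was reached, any socket call in progress returns the error code WSAENETRESET, and all subsequent socket API calls fail with WSAENOTCONN.
3. The heartbeat in the libwebsockets open-source library is the TCP/IP stack's keepalive
Our product previously hit a problem with websockets where, with no heartbeat configured, a long-lived TCP connection was silently released by network equipment.
At login our client connects to the registration server of one business module over a long-lived websocket connection. The connection stays open, but it is only used, and only carries data, when that module's features are used. If the user never touches that module after login, the connection sits idle with no traffic at all.
During one test run a problem showed up, and investigation found that this connection had been closed by an intermediate network device because it had carried no data for a long time. To fix this we set the heartbeat parameters when initializing the websocket library, so that the connection carries heartbeat packets while idle and is no longer closed for being silent too long.
When we call lws_create_context to create the websockets context, its parameter structure lws_context_creation_info has fields for the heartbeat settings: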
/**
 * struct lws_context_creation_info - parameters to create context with
 * This is also used to create vhosts.... if LWS_SERVER_OPTION_EXPLICIT_VHOSTS
 * is not given, then for backwards compatibility one vhost is created at
 * context-creation time using the info from this struct.
 * If LWS_SERVER_OPTION_EXPLICIT_VHOSTS is given, then no vhosts are created
 * at the same time as the context, they are expected to be created afterwards.
 * @port: VHOST: Port to listen on... you can use CONTEXT_PORT_NO_LISTEN to
 *     suppress listening on any port, that's what you want if you are
 *     not running a websocket server at all but just using it as a
 *     client
 * @iface: VHOST: NULL to bind the listen socket to all interfaces, or the
 *     interface name, eg, "eth2"
 *     If options specifies LWS_SERVER_OPTION_UNIX_SOCK, this member is
 *     the pathname of a UNIX domain socket. you can use the UNIX domain
 *     sockets in abstract namespace, by prepending an @ symbole to the
 *     socket name.
 * @protocols: VHOST: Array of structures listing supported protocols and a protocol-
 *     specific callback for each one. The list is ended with an
 *     entry that has a NULL callback pointer.
 *     It's not const because we write the owning_server member
 * @extensions: VHOST: NULL or array of lws_extension structs listing the
 *     extensions this context supports. If you configured with
 *     --without-extensions, you should give NULL here.
 * @token_limits: CONTEXT: NULL or struct lws_token_limits pointer which is initialized
 *     with a token length limit for each possible WSI_TOKEN_***
 * @ssl_cert_filepath: VHOST: If libwebsockets was compiled to use ssl, and you want
 *     to listen using SSL, set to the filepath to fetch the
 *     server cert from, otherwise NULL for unencrypted
 * @ssl_private_key_filepath: VHOST: filepath to private key if wanting SSL mode;
 *     if this is set to NULL but sll_cert_filepath is set, the
 *     OPENSSL_CONTEXT_REQUIRES_PRIVATE_KEY callback is called
 *     to allow setting of the private key directly via openSSL
 *     library calls
 * @ssl_ca_filepath: VHOST: CA certificate filepath or NULL
 * @ssl_cipher_list: VHOST: List of valid ciphers to use (eg,
 *     "RC4-MD5:RC4-SHA:AES128-SHA:AES256-SHA:HIGH:!DSS:!aNULL"
 *     or you can leave it as NULL to get "DEFAULT"
 * @http_proxy_address: VHOST: If non-NULL, attempts to proxy via the given address.
 *     If proxy auth is required, use format
 *     "username:password@server:port"
 * @http_proxy_port: VHOST: If http_proxy_address was non-NULL, uses this port at
 *     the address
 * @gid: CONTEXT: group id to change to after setting listen socket, or -1.
 * @uid: CONTEXT: user id to change to after setting listen socket, or -1.
 * @options: VHOST + CONTEXT: 0, or LWS_SERVER_OPTION_... bitfields
 * @user: CONTEXT: optional user pointer that can be recovered via the context
 *     pointer using lws_context_user
 * @ka_time: CONTEXT: 0 for no keepalive, otherwise apply this keepalive timeout to
 *     all libwebsocket sockets, client or server
 * @ka_probes: CONTEXT: if ka_time was nonzero, after the timeout expires how many
 *     times to try to get a response from the peer before giving up
 *     and killing the connection
 * @ka_interval: CONTEXT: if ka_time was nonzero, how long to wait before each ka_probes
 *     attempt
 * @provided_client_ssl_ctx: CONTEXT: If non-null, swap out libwebsockets ssl
 *     implementation for the one provided by provided_ssl_ctx.
 *     Libwebsockets no longer is responsible for freeing the context
 *     if this option is selected.
 * @max_http_header_data: CONTEXT: The max amount of header payload that can be handled
 *     in an http request (unrecognized header payload is dropped)
 * @max_http_header_pool: CONTEXT: The max number of connections with http headers that
 *     can be processed simultaneously (the corresponding memory is
 *     allocated for the lifetime of the context). If the pool is
 *     busy new incoming connections must wait for accept until one
 *     becomes free.
 * @count_threads: CONTEXT: how many contexts to create in an array, 0 = 1
 * @fd_limit_per_thread: CONTEXT: nonzero means restrict each service thread to this
 *     many fds, 0 means the default which is divide the process fd
 *     limit by the number of threads.
 * @timeout_secs: VHOST: various processes involving network roundtrips in the
 *     library are protected from hanging forever by timeouts. If
 *     nonzero, this member lets you set the timeout used in seconds.
 *     Otherwise a default timeout is used.
 * @ecdh_curve: VHOST: if NULL, defaults to initializing server with "prime256v1"
 * @vhost_name: VHOST: name of vhost, must match external DNS name used to
 *     access the site, like "warmcat.com" as it's used to match
 *     Host: header and / or SNI name for SSL.
 * @plugin_dirs: CONTEXT: NULL, or NULL-terminated array of directories to
 *     scan for lws protocol plugins at context creation time
 * @pvo: VHOST: pointer to optional linked list of per-vhost
 *     options made accessible to protocols
 * @keepalive_timeout: VHOST: (default = 0 = 60s) seconds to allow remote
 *     client to hold on to an idle HTTP/1.1 connection
 * @log_filepath: VHOST: filepath to append logs to... this is opened before
 *     any dropping of initial privileges
 * @mounts: VHOST: optional linked list of mounts for this vhost
 * @server_string: CONTEXT: string used in HTTP headers to identify server
 *     software, if NULL, "libwebsockets".
 */
struct lws_context_creation_info {
    int port;                                       /* VH */
    const char *iface;                              /* VH */
    const struct lws_protocols *protocols;          /* VH */
    const struct lws_extension *extensions;         /* VH */
    const struct lws_token_limits *token_limits;    /* context */
    const char *ssl_private_key_password;           /* VH */
    const char *ssl_cert_filepath;                  /* VH */
    const char *ssl_private_key_filepath;           /* VH */
    const char *ssl_ca_filepath;                    /* VH */
    const char *ssl_cipher_list;                    /* VH */
    const char *http_proxy_address;                 /* VH */
    unsigned int http_proxy_port;                   /* VH */
    int gid;                                        /* context */
    int uid;                                        /* context */
    unsigned int options;                           /* VH + context */
    void *user;                                     /* context */
    int ka_time;                                    /* context */
    int ka_probes;                                  /* context */
    int ka_interval;                                /* context */
#ifdef LWS_OPENSSL_SUPPORT
    SSL_CTX *provided_client_ssl_ctx;               /* context */
#else /* maintain structure layout either way */
    void *provided_client_ssl_ctx;
#endif
    short max_http_header_data;                     /* context */
    short max_http_header_pool;                     /* context */
    unsigned int count_threads;                     /* context */
    unsigned int fd_limit_per_thread;               /* context */
    unsigned int timeout_secs;                      /* VH */
    const char *ecdh_curve;                         /* VH */
    const char *vhost_name;                         /* VH */
    const char * const *plugin_dirs;                /* context */
    const struct lws_protocol_vhost_options *pvo;   /* VH */
    int keepalive_timeout;                          /* VH */
    const char *log_filepath;                       /* VH */
    const struct lws_http_mount *mounts;            /* VH */
    const char *server_string;                      /* context */
    /* Add new things just above here ---^
     * This is part of the ABI, don't needlessly break compatibility
     * The below is to ensure later library versions with new
     * members added above will see 0 (default) even if the app
     * was not built against the newer headers.
     */
    void *_unused[8];
};
The three fields ka_time, ka_probes and ka_interval are the heartbeat settings. Our code for initializing the websockets context is:
static lws_context* CreateContext()
{
lws_set_log_level( 0xFF, NULL );
lws_context* plcContext = NULL;
lws_context_creation_info tCreateinfo;
memset(&tCreateinfo, 0, sizeof tCreateinfo);
tCreateinfo.port = CONTEXT_PORT_NO_LISTEN;
tCreateinfo.protocols = protocols;
tCreateinfo.ka_time = LWS_TCP_KEEPALIVE_TIME;
tCreateinfo.ka_interval = LWS_TCP_KEEPALIVE_INTERVAL;
tCreateinfo.ka_probes = LWS_TCP_KEEPALIVE_PROBES;
tCreateinfo.options = LWS_SERVER_OPTION_DISABLE_IPV6;
plcContext = lws_create_context(&tCreateinfo);
return plcContext;
}
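Reading the libwebsockets source shows that the heartbeat configured here is the TCP/IP stack's keepalive. Note that the Windows implementation below passes only ka_time and ka_interval to the stack and does not use ka_probes: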
LWS_VISIBLE int
lws_plat_set_socket_options(struct lws_vhost *vhost, lws_sockfd_type fd)
{
int optval = 1;
int optlen = sizeof(optval);
u_long optl = 1;
DWORD dwBytesRet;
struct tcp_keepalive alive;
int protonbr;
#ifndef _WIN32_WCE
struct protoent *tcp_proto;
#endif
if (vhost->ka_time) {
/* enable keepalive on this socket */
// First call setsockopt to turn on the keepalive (heartbeat) option
optval = 1;
if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE,
(const char *)&optval, optlen) < 0)
return 1;
alive.onoff = TRUE;
alive.keepalivetime = vhost->ka_time*1000;
alive.keepaliveinterval = vhost->ka_interval*1000;
if (WSAIoctl(fd, SIO_KEEPALIVE_VALS, &alive, sizeof(alive),
NULL, 0, &dwBytesRet, NULL, NULL))
return 1;
}
/* Disable Nagle */
optval = 1;
#ifndef _WIN32_WCE
tcp_proto = getprotobyname("TCP");
if (!tcp_proto) {
lwsl_err("getprotobyname() failed with error %d\n", LWS_ERRNO);
return 1;
}
protonbr = tcp_proto->p_proto;
#else
protonbr = 6;
#endif
setsockopt(fd, protonbr, TCP_NODELAY, (const char *)&optval, optlen);
/* We are nonblocking... */
ioctlsocket(fd, FIONBIO, &optl);
return 0;
}
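4. TCP/IP retransmission mechanism
If the client and server are exchanging TCP data when the network fails, the client gets no ACK from the server for the data it sent, which triggers TCP retransmission on the client. The retransmission mechanism can therefore also detect a network failure.
On a TCP connection, data that is not acknowledged is retransmitted. The interval doubles with each retransmission, and when the retransmission count reaches the system limit (5 by default on Windows, 15 by default on Linux) the stack concludes that the network has failed and closes the connection.
So if data is being exchanged when the network fails, the stack (with the Windows default limit) detects the failure within tens of seconds and closes the connection.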
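The doubling schedule can be turned into a rough estimate of how long the stack keeps trying before it gives up. The sketch below is a simplified model, not the stack's actual algorithm: the initial timeout and the upper cap on a single timeout are assumptions passed in by the caller (real stacks derive the timeout from the measured round-trip time), and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Estimated time in milliseconds from the first transmission until the stack
// gives up: the original send waits one timeout, and each of the
// maxRetransmits retransmissions waits a timeout that doubles every time,
// never exceeding maxRtoMs.
inline std::int64_t RetransmitGiveUpMs(std::int64_t initialRtoMs,
                                       int maxRetransmits,
                                       std::int64_t maxRtoMs) {
    assert(initialRtoMs > 0 && maxRetransmits >= 0 && maxRtoMs > 0);
    std::int64_t total = 0;
    std::int64_t rto = initialRtoMs;
    for (int k = 0; k <= maxRetransmits; ++k) {
        total += std::min(rto, maxRtoMs);
        if (rto < maxRtoMs) {
            rto *= 2;  // exponential backoff
        }
    }
    return total;
}
```

Under this model, 5 retransmissions starting from an assumed 1 second timeout add up to 63 seconds, and starting from an assumed 300 ms they add up to 18.9 seconds, which is in line with detection within tens of seconds. With 15 retransmissions, a 200 ms initial timeout and a 120 second cap, the same sum is 924.6 seconds, so a limit of 15 means minutes rather than seconds.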
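The retransmission mechanism can be observed by unplugging and replugging a PC's network cable while capturing packets with Wireshark. If the cable is replugged quickly (unplug it, wait a few seconds, plug it back in), a command sent to the server during the outage still arrives thanks to retransmission.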
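5. Implementing a connect timeout with a non-blocking socket and select
5.1 What MSDN says about connect and select
For a TCP socket we call the socket function connect to establish the TCP connection. Here is how MSDN describes connect: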
On a blocking socket, the return value indicates success or failure of the connection attempt.
That is, for a blocking socket the return value of connect tells you whether the connection succeeded; 0 means success.
With a nonblocking socket, the connection attempt cannot be completed immediately. In this case, connect will return SOCKET_ERROR, and WSAGetLastError will return WSAEWOULDBLOCK. In this case, there are three possible scenarios:
Use the select function to determine the completion of the connection request by checking to see if the socket is writeable.
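For a non-blocking socket, connect returns immediately while the connection attempt is still in progress. It returns SOCKET_ERROR, but on a non-blocking socket that does not by itself mean failure: call WSAGetLastError to get the error code, which in this situation is normally WSAEWOULDBLOCK, meaning the connection is in progress. select can then be used to check whether the socket has become writable (whether it is in the writefds set); if it has, the connection succeeded. If the socket is in the exceptfds set, the connection attempt failed.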
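5.2 Controlling the connect timeout with a non-blocking socket and select
With a blocking socket on Windows, if the remote IP and port are unreachable, connect blocks for a long time before returning SOCKET_ERROR to report the failure (75 s is the figure usually quoted; the actual wait depends on the system's SYN retransmission settings). So when we want to test whether a remote IP and port can be reached, we do not use a blocking socket. We use a non-blocking socket and call select with a timeout, which gives us control over the connect timeout.
select returns 0 when it times out and SOCKET_ERROR when an error occurs, so check its return value: if it is less than or equal to 0 the connection failed, and the socket should be closed immediately. A return value greater than 0 is the number of sockets that are ready, for example a socket whose connection has completed. We then check whether the socket is in the writable set writefds; if it is, the connection succeeded.
With the MSDN description in mind, the connect timeout can be implemented roughly like this: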
bool ConnectDevice( char* pszIP, int nPort )
{
    // Create a TCP socket
    SOCKET connSock = socket(AF_INET, SOCK_STREAM, 0);
    if (connSock == INVALID_SOCKET)
    {
        return false;
    }
    // Fill in the IP address and port
    SOCKADDR_IN devAddr;
    memset(&devAddr, 0, sizeof(SOCKADDR_IN));
    devAddr.sin_family = AF_INET;
    devAddr.sin_port = htons((u_short)nPort);
    devAddr.sin_addr.s_addr = inet_addr(pszIP);
    // Switch the socket to non-blocking mode in preparation for select
    unsigned long ulnoblock = 1;
    ioctlsocket(connSock, FIONBIO, &ulnoblock);
    // Start the connect; on a non-blocking socket it returns immediately,
    // normally with SOCKET_ERROR and WSAEWOULDBLOCK (connection in progress)
    if (connect(connSock, (sockaddr*)&devAddr, sizeof(devAddr)) == SOCKET_ERROR
        && WSAGetLastError() != WSAEWOULDBLOCK)
    {
        closesocket(connSock);
        return false;
    }
    // writefds reports success, exceptfds reports a failed attempt
    fd_set writefds, exceptfds;
    FD_ZERO(&writefds);
    FD_ZERO(&exceptfds);
    FD_SET(connSock, &writefds);
    FD_SET(connSock, &exceptfds);
    // Connection timeout: 1 second
    timeval tv;
    tv.tv_sec = 1;
    tv.tv_usec = 0;
    // The select function returns the total number of socket handles that are ready and contained
    // in the fd_set structures, zero if the time limit expired, or SOCKET_ERROR(-1) if an error occurred.
    bool bConnected = select(0, NULL, &writefds, &exceptfds, &tv) > 0
        && FD_ISSET(connSock, &writefds);
    // This function only probes reachability, so the socket is closed either way
    closesocket(connSock);
    return bConnected;
}
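The decision rule that sections 5.1 and 5.2 describe for the result of select can also be written as a small pure function, which makes the three outcomes explicit. This is only an illustrative sketch: the enum and function names are hypothetical, and the caller is assumed to pass in the return value of select together with the FD_ISSET results for writefds and exceptfds.

```cpp
#include <cassert>

// Hypothetical names for the three possible results of a timed connect.
enum class ConnectOutcome { Connected, Failed, TimedOut };

// selectResult: value returned by select()
// inWriteSet:   FD_ISSET(sock, &writefds) after select returned
// inExceptSet:  FD_ISSET(sock, &exceptfds) after select returned
inline ConnectOutcome ClassifyConnect(int selectResult,
                                      bool inWriteSet,
                                      bool inExceptSet) {
    if (selectResult == 0) {
        return ConnectOutcome::TimedOut;  // time limit expired
    }
    if (selectResult < 0) {
        return ConnectOutcome::Failed;    // select itself reported an error
    }
    if (inExceptSet) {
        return ConnectOutcome::Failed;    // the connection attempt failed
    }
    return inWriteSet ? ConnectOutcome::Connected : ConnectOutcome::Failed;
}
```

Separating a timeout from an outright failure lets the caller decide, for instance, to retry after a timeout but report an error at once when the attempt was rejected.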