Timeout Connecting to APNs

I have a personal Countly Community Edition deployment running on the latest `master`.

When adding an Apple authentication key to Countly to enable push notifications, this appears in the logs:

2020-01-13T01:31:18.928Z: WARN  [jobs:resource] Opening underlying resource res:validate:5e1bc866b4c98bb6fddd08e0:ql909er80gb from façade
DBG: no certificate, using bearer eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ijk3UVBBUTZWOVIifQ.eyJpc3MiOiJITDNRNDVMWDlOIiwiaWF0IjoxNTc4ODc5MDc4fQ.rW2KGricBUXg5F3sYIQ
80XMQsNCzdDpRTwo9tLGr_PvSwibwhqmuio04SZCFFKjauSj4LZrA3bKA9kSvj_c7mw
DBG: CONN 0x80fb9cd00: going into conn_loop in 0x80fb9cd00
DBG: CONN 0x80fb9cd00: conn_loop returned DBG: proxy @ 0: 0
DBG: no certificate, using bearer bearer eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ijk3UVBBUTZWOVIifQ.eyJpc3MiOiJITDNRNDVMWDlOIiwiaWF0IjoxNTc4ODc5MDc4fQ.rW2KGricBUXg
5F3sYIQ80XMQsNCzdDpRTwo9tLGr_PvSwibwhqmuio04SZCFFKjauSj4LZrA3bKA9kSvj_c7mw
INF: resolving
INF: resolving api.push.apple.com
INF: dns resolved: 17.188.166.201, has next? 1
INF: init_connection to api.push.apple.com in 0x802d2a000
INF: CONN 0x80365dd00: looping adresses of api.push.apple.com in 0x80365dd00
ERR: looping addresses
DBG: CONN 0x80fb9cd00: Starting connect timer
DBG: CONN 0x80fb9cd00: connecting to api.push.apple.com (17.188.166.201: 2 (2), 6) in 0x80fb9cd00
DBG: CONN 0x80fb9cd00: socket 26
DBG: CONN 0x80fb9cd00: setsockopt
DBG: CONN 0x80fb9cd00: ssl 0x80fdf2000
DBG: CONN 0x80fb9cd00: CONN 0x80fb9cd00: connect -1
DBG: CONN 0x80fb9cd00: CONN 0x80fb9cd00: uv_tcp_init 0
DBG: CONN 0x80fb9cd00: uv_tcp_open 0
DBG: CONN 0x80fb9cd00: uv_tcp_nodelay 0
DBG: CONN 0x80fb9cd00: uv_read_start 0
DBG: CONN 0x80fb9cd00: SSL_do_handshake -1
DBG: CONN 0x80fb9cd00: conn_thread_ssl_handle_error
DBG: CONN 0x80fb9cd00: conn_thread_ssl_handle_error wants read
DBG: CONN 0x80fb9cd00: conn_thread_ssl_flush_read_bio 517
DBG: CONN 0x80fb9cd00: conn_thread_uv_write_to_socket: 517
DBG: CONN 0x80fb9cd00: conn_thread_uv_write_to_socket result: 0
DBG: CONN 0x80fb9cd00: outing data for SSL finished ? 0 buffer size 0 flush? 1
INF: CONN 0x80fb9cd00: waiting for SSL handshake to complete for 17.188.166.201 in 0x80fb9cd00
0
1
2
WRN: CONN 0x80fb9cd00: -------------------- TIMEOUT -------------------------
WRN: CONN 0x80fb9cd00: ------------------ NOT CONNECTED -------------------------

The server is perfectly able to resolve and connect to Apple's APNs servers using eg. curl:

$ curl -v 'https://api.push.apple.com'
* Trying 17.188.166.91:443...
* TCP_NODELAY set
* Connected to api.push.apple.com (17.188.166.91) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /usr/local/share/certs/ca-root-nss.crt
CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Request CERT (13):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-CHACHA20-POLY1305
* ALPN, server accepted to use h2
* Server certificate:
* subject: CN=api.push.apple.com; OU=management:idms.group.533599; O=Apple Inc.; ST=California; C=US
* start date: Mar 14 17:50:10 2019 GMT
* expire date: Apr 12 17:50:10 2021 GMT
* subjectAltName: host "api.push.apple.com" matched cert's "api.push.apple.com"
* issuer: CN=Apple IST CA 2 - G1; OU=Certification Authority; O=Apple Inc.; C=US
* SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x800de5800)
> GET / HTTP/2
> Host: api.push.apple.com
> User-Agent: curl/7.66.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 1)!
< HTTP/2 405
< apns-id: 06B8C341-1F46-A061-6F43-DE02F31FBEB1
<
* Connection #0 to host api.push.apple.com left intact

It appears there may be some issue with Countly's HTTP/2 handling code.

 

I would like to add that prior to this, I've already had to fix a bug in Countly's HTTP/2 server code.

The issue this bug causes is that when submitting the key for Push Notifications, I get this error:

Couldn't update plugin: EAI_BADHINTS
And these logs:
2020-01-20T15:33:35.115Z: WARN [jobs:resource] Opening underlying resource res:validate:5e25c84edd2d113cacd20c13:r3mbqywove from façade
DBG: no certificate, using bearer eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ijk3UVBBUTZWOVIifQ.eyJpc3MiOiJITDNRNDVMWDlOIiwiaWF0IjoxNTc5NTM0NDE1fQ.u6d_6DZKvnZegYJVD5dPSobTsmGdb7d0vpovmt4-DCN4gPHe06pGL4nWr8pofHca8aeuNr0ivaClG478Bqkvzg
DBG: CONN 0x80fb7c800: going into conn_loop in 0x80fb7c800
DBG: CONN 0x80fb7c800: conn_loop returned DBG: proxy @ 0: 0
DBG: no certificate, using bearer bearer eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ijk3UVBBUTZWOVIifQ.eyJpc3MiOiJITDNRNDVMWDlOIiwiaWF0IjoxNTc5NTM0NDE1fQ.u6d_6DZKvnZegYJVD5dPSobTsmGdb7d0vpovmt4-DCN4gPHe06pGL4nWr8pofHca8aeuNr0ivaClG478Bqkvzg
INF: resolving
INF: resolving api.push.apple.com
WRN: dns resolve error, let's retry once EAI_BADHINTS
ERR: dns resolve error, no more retrying after EAI_BADHINTS
2020-01-20T15:33:35.168Z: WARN [jobs:resource] [15627]: Error while opening res:validate:5e25c84edd2d113cacd20c13:r3mbqywove by command of façade: "EAI_BADHINTS"
2020-01-20T15:33:35.169Z: WARN [jobs:resource] [façade]: Rejecting 5e25c84edd2d113cacd20c13
2020-01-20T15:33:35.169Z: WARN [jobs:manager] all pool resources done
2020-01-20T15:33:35.169Z: ERROR [jobs:job] [5e25c84edd2d113cacd20c13] Error in IPCFaçadeJob res:validate:5e25c84edd2d113cacd20c13:r3mbqywove: "EAI_BADHINTS" / undefined
2020-01-20T15:33:35.169Z: ERROR [jobs:resource] Error in job resource façade .run() promise EAI_BADHINTS undefined
EAI_BADHINTS
2020-01-20T15:33:35.171Z: ERROR [mgmt:apps] Error during plugin config updates for app 5ddd71ce7d4d3aa855c6316b: "EAI_BADHINTS" string, 12
DBG: closing api.push.apple.com
DBG: CONN 0x80fb7c800: stopping conn_loop
INF: CONN 0x80fb7c800: Disconnecting: 199
INF: wating for conn_loop to stop
The issue is resolved by this patch:
diff --git a/plugins/push/api/parts/apn/h2.cc b/plugins/push/api/parts/apn/h2.cc
index 0b7b6c06a..980fb9bc7 100644
--- a/plugins/push/api/parts/apn/h2.cc
+++ b/plugins/push/api/parts/apn/h2.cc
@@ -329,11 +329,7 @@ namespace apns {
obj->stats.state |= ST_RESOLVED | ST_ERROR_RECOVERABLE;
LOG_WARNING("dns resolve error, let's retry once " << uv_err_name(status));

- struct addrinfo hints;
- hints.ai_family = AF_INET;
- hints.ai_socktype = SOCK_STREAM;
- hints.ai_protocol = 0;
- hints.ai_flags = AI_ADDRCONFIG;
+ struct addrinfo hints = { .ai_family = AF_INET, .ai_socktype = SOCK_STREAM, .ai_flags = AI_ADDRCONFIG | AI_V4MAPPED };
uv_getaddrinfo(uv_default_loop(), handle, resolve_cb, obj->hostname.c_str(), "443", &hints);
}
} else {
@@ -375,12 +371,7 @@ namespace apns {
obj->handle_resolve = new uv_getaddrinfo_t;
obj->handle_resolve->data = persistentHandle;

- struct addrinfo hints;
- hints.ai_family = AF_UNSPEC;
- hints.ai_socktype = SOCK_STREAM;
- hints.ai_protocol = 0;
- hints.ai_flags = AI_ADDRCONFIG;
-
+ struct addrinfo hints = { .ai_family = AF_UNSPEC, .ai_socktype = SOCK_STREAM, .ai_flags = AI_ADDRCONFIG | AI_V4MAPPED };
uv_getaddrinfo(uv_default_loop(), obj->handle_resolve, resolve_cb, host.c_str(), port.c_str(), &hints);

info.GetReturnValue().Set(promise);

The cause is the fact that the `hints` object was not properly initialized and so was left with undefined values in its members.  `getaddrinfo` is defined to return `EAI_BADHINTS` when any of its fields are non-0 with exception to the family, sockettype, protocol and flags.

This bug is *very bad* and it makes me very suspicious of all other server code, on a buggy level as well as on a security level.  This type of bad code tends to allow arbitrary memory access and is very much inexcusable for production level code.

Consequently, I am suspicious that the rest of the server code is also malfunctioning and would appreciate any assistance with diagnosing and debugging to root out any other native code issues that may be tripping up here.

I would also strongly recommend that all native code is reviewed and audited by someone with an expert level knowledge in C / C++ & security.

0

Comments

2 comments
  • The issue is caused by the fact that when the Countly server is not set up to connect to the APNs through a HTTP proxy, it performs a `connect()` but does not follow that up with a `select()` before it starts writing encrypted SSL state to the socket.  As a result, the network writes SSL data before having finished the TCP handshake.

    I really do not see how this code even exists in this state, it is far from production ready, never mind quality.

    commit c4566717fa86d582614f4b7cab04646a42c8298d
    Author: Countly <countly@satura.lyndir.com>
    Date:   Wed Jan 22 13:30:47 2020 -0500
    
        connect() must select() when EINPROGRESS before writing.
    
    diff --git a/plugins/push/api/parts/apn/conn.cc b/plugins/push/api/parts/apn/conn.cc
    index 66a88046d..48642adf4 100644
    --- a/plugins/push/api/parts/apn/conn.cc
    +++ b/plugins/push/api/parts/apn/conn.cc
    @@ -513,8 +513,40 @@ namespace apns {
     
     		if (obj->proxyport == 0) {
     			val = connect(obj->fd, obj->address->ai_addr, obj->address->ai_addrlen);
    -			LOG_DEBUG("CONN " << uv_thread_self() << ": CONN " << uv_thread_self() << ": connect " << val);
    -			if (val != 0 && errno != EINPROGRESS) {
    +			LOG_DEBUG("CONN " << uv_thread_self() << ": CONN " << uv_thread_self() << ": connect " << val << ", errno: " << errno);
    +			if (errno == EINPROGRESS) {
    +				fd_set set;
    +				struct timeval timeout = { .tv_sec = 4 };
    +	
    +				FD_ZERO (&set);
    +				FD_SET (obj->fd, &set);
    +	
    +				/* select returns 0 if timeout, 1 if input available, -1 if error. */
    +				val = select(FD_SETSIZE, NULL, &set, NULL, &timeout);
    +				if (val == 0) {
    +					obj->send_error("Connect timeout");
    +					if (obj->tcp_init_sem) {
    +						uv_sem_post(obj->tcp_init_sem);
    +					}
    +					SSL_free(obj->ssl);
    +					obj->ssl = nullptr;
    +					close(obj->fd);
    +					obj->fd = 0;
    +					return;
    +				} else if (val < 0) {
    +					std::ostringstream out;
    +					out << "Cannot connect to server: " << strerror(val);
    +					obj->send_error(out.str());
    +					if (obj->tcp_init_sem) {
    +						uv_sem_post(obj->tcp_init_sem);
    +					}
    +					SSL_free(obj->ssl);
    +					obj->ssl = nullptr;
    +					close(obj->fd);
    +					obj->fd = 0;
    +					return;
    +				}
    +			} else if (val != 0) {
     				std::ostringstream out;
     				out << "connect() failed: " << strerror(val);
     				obj->send_error(out.str());
    @@ -610,8 +642,6 @@ namespace apns {
     
     		stats.reading = true;
     
    -		char buf[1024 * 10];
    -
     		if(!SSL_is_init_finished(ssl)) {
     			int r = SSL_connect(ssl);
     			if(r < 0) {
    @@ -621,6 +651,7 @@ namespace apns {
     		} else {
     			// connect, check if there is encrypted data, or we need to send app data
     			int s;
    +			char buf[1024 * 10];
     			do {
     				int ng = 0;
     
    0
    Comment actions Permalink
  • Please note that this patch is just a quick fix by copying code from the PROXY section.  I am not taking ownership over it.

    I stand by the fact that this native code needs to be thoroughly audited, reviewed and/or rewritten.

    0
    Comment actions Permalink

Please sign in to leave a comment.