SQLNET and TCP Keepalive Settings

In this article I would like talk about TNS connection issues. One of the application was facing connection issues to database. Application was configured on windows machine and database server was hosted on Linux machine. Application team was continuously complaining about connection drop issues.

Once application connection gets established, after sometime it was getting connection drop issue and it has impacted their transactions.

Part of initial investigation, below found in alert log and listener log,

Fatal NI connect error 12170.

VERSION INFORMATION:

TNS for Linux: Version 11.2.0.4.0 – Production

Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 – Production

TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 – Production

Tracing not turned on.

Tns error struct:

ns main err code: 12535

TNS-12535: TNS:operation timed out

ns secondary err code: 12560

nt main err code: 505

TNS-00505: Operation timed out

nt secondary err code: 110

nt OS err code: 0

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=XX.XX.XX.XX)(PORT=XXXX))

Key observation from alert log is, once the connection gets established from some host, it was getting dropped reporting above errors. It was failing with same IP Address, username and port.

Post seeing above errors in alert log, started analyzing different ORACLE TNS related parameters and OS/Firewall settings.

Let’s understand more around these parameters.

Listener.ora: INBOUND_CONNECT_TIMEOUT_<listener_name>=240

sqlnet.ora: SQLNET.INBOUND_CONNECT_TIMEOUT=600

SQLNET.EXPIRE_TIME= 5

TCP Keepalive at Operating system level

SQLNET.INBOUND_CONNECT_TIMEOUT

From Oracle DB Server 10.2.0.1 onwards the default value of parameter SQLNET.INBOUND_CONNECT_TIMEOUT is 60 seconds, hence if the client is not able authenticate within 60 secs, the warning would appear in the alert log and the client connection will be terminated.

To understand what is causing this issue, following checks can be done

The default value of 60 seconds is good enough in most conditions for the database server to authenticate a client connection. If it is taking longer, then it’s worth checking the following items before implementing the workaround:

1. Check whether local connection on the database server is successful & quick.

2. If local connections are quick ,then check for underlying network delay with the help of your network administrator.

3. Check whether your Database performance has degraded in anyway.

4. Check alert log for any critical errors for eg, ORA-600 or ORA-7445 and get them resolved first.

These critical errors might have triggered the slowness of the database server.

It is often necessary to increase the values for INBOUND CONNECT TIMEOUT at both the listener and the database in order to resolve this issue. It is usually advisable to set the database (sqlnet.ora) value slightly higher than the listener (listener.ora). The authentication process is more demanding for the database than the listener.

How to check whether inbound timeout is active for the listener and database server:

For example, INBOUND_CONNECT_TIMEOUT_<listener_name> =110

You can check whether the parameter is active or not by simply doing telnet to the listener port.

$ telnet <database server IP> <listener port>

for eg.

$ telnet XXX.XX.XX.XX XXXX

The telnet session should disconnect after 110 seconds which indicates that the inbound connection timeout for the listener is active.

SQLNET.EXPIRE_TIME=n – Dead Connection Detection

Where n is some number of minutes. Preferably below 10.

e.g.

SQLNET.EXPIRE_TIME=5

When this parameter is enabled, a small 10 byte probe packet will be sent from node to node at specific intervals. In our example, at roughly 5 minute intervals after the timer is armed, the probe packet will be sent until the timer is disarmed. This should prevent an idle time termination by the firewall.

Set this parameter in the sqlnet.ora file.

When the timer expires, SQL*Net on the server sends a “probe” packet to the client. (In the case of a database link, the destination of the link constitutes the server side of the connection.) The probe is essentially an empty SQL*Net packet and does not represent any form of SQL*Net level data, but it creates data traffic on the underlying protocol.

If the client end of the connection is still active, the probe is discarded, and the timer mechanism is reset. If the client has terminated abnormally, the server will receive an error from the send call issued for the probe, and SQL*Net on the server will signal the operating system to release the connection’s resources.

Sometimes, connections are being dropped by the firewall due to inactivity and we look for solution without removing firewall. I’ve faced sometimes even setting up tcp keepalive parameters does not come in effect, so that could happen because of some internal bug.

There are two benefits with DCD

1. If the SQLNET.EXPIRE_TIME is less than the FW connection idle time-out, then the firewall will consider this packet as activity, and the idle time-out (firewall blackout) will never happen until both the client and the server processes are alive.

2. If the SQLNET.EXPIRE_TIME (let’s say a little bit higher) than the FW idle limit, then , as soon as the blackout happens , the RDBMS will know and will close the connection.

TCP KeepAlive

There are below 3 parameters associated with OS TCP Keepalive,

net.ipv4.tcp_keepalive_time = 300

net.ipv4.tcp_keepalive_probes = 3

net.ipv4.tcp_keepalive_intvl = 20

You can enable TCP KeepAlive option at the Operating System(OS) level. Once TCP keepalive option is enabled and configured, a small probe packet will be sent to the other end at every predefined in-activity interval and it expects an ACK from the other end. And ACK will be returned only when the other end is alive and is reachable. If ACK is not returned, then after some retry, the OS will close the end points and will release the resources allocated for that. The application which is listening on that particular socket will recieve the error, so that application can take necessary action upon receiving the error signal from the OS.

When a communication is blacked out by the firewall, the probe will not reach its other end, and then the OS will close the socket end points and the application will be notified of the exception.

Steps to configure TCP KeepAlive depends on a specific Operating Systems. You will have to refer the appropriate OS documentation for it.

TCP KeepAlive is applicable for all network applications running on that particular Operating System.

For 12c,

There are 3 parameters associated with an operating system’s TCP keepalive (These are Linux parameters, but other operating system have similarly named parameters)

TCP_KEEPIDLE (the amount of time until the first keepalive packet is sent)

TCP_KEEPCNT (the number of probes to send)

TCP_KEEPINTVL (the interval between keepalive packets)

The sqlnet.ora parameter SQLNET.EXPIRE_TIME now sets the socket option TCP_KEEPALIVE (the same as the kernel parameter TCP_KEEPIDLE).

If sqlnet.expire_time=1 then TCP_KEEPALIVE will be set to 60 seconds. The other parameters KEEPINTVL and KEEPCNT get set to 6 and 10 respectively (which is very

reasonable). This means that, at minimum, DCD can only be set to 2 minutes.

The cause of the problem may be that the connection is actually idle or it is running a very long piece of SQL commands. During research, it was found that sometimes, servers do not send syn/ack packet even after client sent too many syn packets. Some firewalls do not work properly with TCP Window Scaling and cause the network activity to be lost.

Solution

Set below parameters, you have to come up with right number based on your application behavior and requirement.

listener.ora: INBOUND_CONNECT_TIMEOUT_<listener_name>=240

sqlnet.ora: SQLNET.INBOUND_CONNECT_TIMEOUT=600

SQLNET.EXPIRE_TIME= 5

In addition to above, also set TCS Keep alive parameter at OS level to appropriate value both on application and database server. For example, in this scenario setting it to 4hrs worked effectively.

Above solution worked successfully and application team is no longer having connection issues.

引用：

https://prashantatridba.wordpress.com/tag/sqlnet-expire_time/