Kafka Producer NetworkException and Timeout Exceptions

2021-04-12 19:26

We faced a similar problem: many NetworkExceptions in the logs and, from time to time, a TimeoutException.

Cause

Once we gathered TCP logs from production, it turned out that some of the TCP connections to the Kafka brokers (we have 3 broker nodes) were being dropped after about 5 minutes of being idle, without the clients being notified (no FIN flag at the TCP layer). When a client tried to reuse such a connection after that time, an RST flag was returned. We could easily match those connection resets in the TCP logs with the NetworkExceptions in the application logs.

As for the TimeoutException, we could not do the same matching because, by the time we found the cause, this type of error was no longer occurring. However, we confirmed in a separate test that dropping a TCP connection can also result in a TimeoutException. I guess this is because the Java Kafka client uses a Java NIO SocketChannel under the hood: messages are buffered and then dispatched once the connection is ready. If the connection is not ready within the timeout (30 seconds), the messages expire, resulting in a TimeoutException.

Solution

For us, the fix was to reduce connections.max.idle.ms on our clients to 4 minutes. Once we applied it, the NetworkExceptions were gone from our logs.
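
As a rough illustration of that setting, here is a minimal sketch of producer properties with connections.max.idle.ms set to 4 minutes. The class name, the broker addresses, and the choice of StringSerializer are assumptions made for the example, not our actual configuration; only the connections.max.idle.ms value comes from the fix described above. The sketch uses plain java.util.Properties so it runs without the Kafka client on the classpath.

```java
import java.util.Properties;

public class ProducerIdleConfig {
    // 4 minutes in milliseconds: shorter than the ~5 minute idle drop seen on the network.
    static final long CONNECTIONS_MAX_IDLE_MS = 4 * 60 * 1000L;

    // Builds producer properties; bootstrapServers is supplied by the caller.
    static Properties producerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Serializers are an assumption for this sketch; use whatever your producer needs.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The client closes connections that have been idle longer than this,
        // so it never tries to reuse one that was silently dropped.
        props.put("connections.max.idle.ms", Long.toString(CONNECTIONS_MAX_IDLE_MS));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker addresses, for illustration only.
        Properties props = producerProps("broker1:9092,broker2:9092,broker3:9092");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting Properties object can be passed to the KafkaProducer constructor. The same key applies to consumers, which can hit the same idle drop.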

We are still investigating what is dropping the connections.

Edit

The cause of the problem was the AWS NAT Gateway, which drops outgoing connections after 350 seconds of being idle.
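
To make the arithmetic explicit, here is a small sketch that checks whether a given connections.max.idle.ms value closes a connection before the NAT Gateway's 350-second idle timeout would drop it. The class and method names are made up for this example. The 540000 ms figure is the Kafka client default for connections.max.idle.ms as far as I know, so check it against the documentation for your client version.

```java
import java.time.Duration;

public class IdleTimeoutCheck {
    // Idle timeout of the AWS NAT Gateway, per the AWS troubleshooting page linked below.
    static final Duration NAT_IDLE_TIMEOUT = Duration.ofSeconds(350);

    // True if the client would close an idle connection before the NAT Gateway drops it.
    static boolean closesBeforeNat(long connectionsMaxIdleMs) {
        return Duration.ofMillis(connectionsMaxIdleMs).compareTo(NAT_IDLE_TIMEOUT) < 0;
    }

    public static void main(String[] args) {
        long fixedValueMs = 4 * 60 * 1000L; // 240000 ms, the value we switched to
        long clientDefaultMs = 540_000L;    // assumed Kafka client default (9 minutes)

        System.out.println("4 minutes safe: " + closesBeforeNat(fixedValueMs));
        System.out.println("default safe:   " + closesBeforeNat(clientDefaultMs));
    }
}
```

With 240 seconds against a 350-second timeout, the client closes idle connections first. With the assumed 9-minute default, the NAT Gateway drops them first, which matches the resets we saw.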

https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-troubleshooting.html#nat-gateway-troubleshooting-timeout

