The Eureka client does not unregister properly from the registry on application shutdown but stays registered with a DOWN status.

The Eureka client takes the following actions on application shutdown: (1) it first detects a local status change and triggers an asynchronous replication to the registry; (2) the Spring shutdown sequence continues and the EurekaClient is ultimately closed, which triggers a delete call on the registry to unregister the application.

However, it appears the delete call (2) reaches the registry before the asynchronous replication (1). Although the delete properly unregisters the application, the replication re-registers it with a DOWN status.

This happens with Brixton.SR7, but Camden should suffer from the same problem too.

Comment From: brenuart

We could temporarily solve the issue as follows:
- apply the fixes related to https://github.com/spring-cloud/spring-cloud-netflix/issues/1534
- make sure EurekaDiscoveryClientConfiguration.stop() is not invoked by the Spring shutdown sequence.

Side questions I'm wondering about:
- Why does EurekaDiscoveryClientConfiguration need to react to ContextClosedEvent to stop the client, in addition to its stop() lifecycle method?
- The EurekaClient is produced as a bean in the Spring context like any other. Why can't we just rely on its stop/shutdown method and let Spring close it when needed?

Comment From: spencergibb

I've seen this before; it's hit or miss because of the timing. Are you concerned that the ContextClosedEvent causes the out-of-order behavior?

Comment From: brenuart

Note: the explanation below is based on Brixton.SR7; I haven't checked Camden yet.

ContextClosedEvent doesn't seem to be the root cause, but EurekaDiscoveryClientConfiguration.stop() is a good candidate. The only purpose of that method is to change the application status to DOWN before setting the stopped flag to true. This status change triggers the async replication that later conflicts with the unregistration when the client's shutdown method is actually invoked.

However, one should note that com.netflix.discovery.DiscoveryClient also sets the status to DOWN on shutdown, but it unregisters itself from the ApplicationInfoManager first: this way the new status is not replicated before the client unregisters from the registry.
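To make that ordering concrete, here is a minimal standalone sketch of the sequence described above. It is paraphrased, not the actual Netflix source; the listener id and the unregister callback are placeholders.

```java
import com.netflix.appinfo.ApplicationInfoManager;
import com.netflix.appinfo.InstanceInfo;

// Illustrative only: the ordering that avoids the race described above.
public final class ShutdownOrderingSketch {

    private ShutdownOrderingSketch() {
    }

    /**
     * @param infoManager    holds the local InstanceInfo and its status listeners
     * @param listenerId     id of the status-change listener that triggers replication
     * @param unregisterCall the action that sends the DELETE to the registry
     */
    public static void shutdownInOrder(ApplicationInfoManager infoManager,
                                       String listenerId,
                                       Runnable unregisterCall) {
        // 1. Stop listening to local status changes: the DOWN flip below can no
        //    longer be replicated asynchronously.
        infoManager.unregisterStatusChangeListener(listenerId);

        // 2. The status change is now purely local.
        infoManager.setInstanceStatus(InstanceInfo.InstanceStatus.DOWN);

        // 3. The DELETE sent here is the last call the registry sees.
        unregisterCall.run();
    }
}
```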

About the ContextClosedEvent, I wonder why it is needed at all. We could simply rely on Spring calling DiscoveryClient.shutdown() on the client instance when the context is closed.

This would also solve another issue: bean dependencies are currently such that the DiscoveryClient may be shut down before beans that hold a reference to it. This is so because shutdown is handled by the configuration class and not by the bean itself.
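For illustration, the kind of wiring this suggestion implies could look like the following. This is a sketch only; the real auto-configuration builds the client differently and the constructor arguments are simplified.

```java
import com.netflix.appinfo.ApplicationInfoManager;
import com.netflix.discovery.DiscoveryClient;
import com.netflix.discovery.EurekaClient;
import com.netflix.discovery.EurekaClientConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EurekaClientLifecycleConfig {

    // Sketch: let Spring own the client's lifecycle by declaring the destroy
    // method on the bean itself, so DiscoveryClient.shutdown() runs when the
    // context closes and no separate ContextClosedEvent handling is needed.
    @Bean(destroyMethod = "shutdown")
    public EurekaClient eurekaClient(ApplicationInfoManager infoManager, EurekaClientConfig config) {
        // Spring destroys beans in reverse dependency order, so beans that hold
        // a reference to the client are destroyed before shutdown() is called.
        return new DiscoveryClient(infoManager, config);
    }
}
```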

Let me submit a PR to illustrate the proposed changes.

Comment From: brenuart

Update: many things have changed in Camden compared to Brixton.SR7. The explanations and proposed fixes above are not relevant anymore (and I'm not able to provide a PR against the current code base).

For those who are interested, here is what we did to solve the issue under Brixton.SR7. As explained above, the Eureka client sets the app status to DOWN before it unregisters itself from the registry. Our solution consists of an AspectJ aspect that closes the Eureka client ourselves first before letting the call proceed to the original implementation.

An example of this aspect can be found at https://gist.github.com/brenuart/bd90075f94226cd55fd36d3f6058a353
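For readers who don't want to follow the link, a minimal sketch of what such an aspect could look like is below. The pointcut on EurekaDiscoveryClientConfiguration.stop() and the wiring are assumptions on my part; the actual workaround is the one in the gist.

```java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

import com.netflix.discovery.EurekaClient;

// Illustrative sketch: shut the Eureka client down before the configuration's
// stop() method flips the status to DOWN, so the unregister (DELETE) reaches
// the registry before any DOWN replication can be scheduled.
// Assumes AspectJ weaving (or a proxying setup) that lets the advice apply to
// the lifecycle method.
@Aspect
@Component
public class EagerEurekaShutdownAspect {

    private final EurekaClient eurekaClient;

    public EagerEurekaShutdownAspect(EurekaClient eurekaClient) {
        this.eurekaClient = eurekaClient;
    }

    @Around("execution(* org.springframework.cloud.netflix.eureka.EurekaDiscoveryClientConfiguration.stop(..))")
    public Object shutdownClientFirst(ProceedingJoinPoint pjp) throws Throwable {
        // shutdown() removes the status-change listener and sends the DELETE,
        // so the DOWN flip performed by stop() stays local.
        eurekaClient.shutdown();
        return pjp.proceed();
    }
}
```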

Comment From: brenuart

The problem is still there in Dalston.SR1. Although the code has been refactored since this issue was created, the same root causes are still present: the replication of the DOWN status hits the server after the unregistration.

There is not much that can be done to get a 100% guarantee: the Eureka client itself does not guarantee that there are no in-flight requests when unregister is invoked.

To (temporarily) solve that issue, we used to register a custom Jersey ClientFilter in the Eureka client to intercept outgoing requests. When it is time to shut down the client, our filter blocks all requests except the unregister. If needed I can share the workaround.
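The idea is roughly the following (a sketch, not our actual code; the method-based matching and the names are simplifications):

```java
import com.sun.jersey.api.client.ClientHandlerException;
import com.sun.jersey.api.client.ClientRequest;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.filter.ClientFilter;

import java.util.concurrent.atomic.AtomicBoolean;

// Jersey 1.x filter sketch: once shutdown starts, reject every outgoing Eureka
// request except the unregistration, so a late DOWN-status replication can no
// longer re-register the instance.
public class ShutdownGuardFilter extends ClientFilter {

    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    /** Call this just before EurekaClient.shutdown(). */
    public void startShutdown() {
        shuttingDown.set(true);
    }

    @Override
    public ClientResponse handle(ClientRequest request) throws ClientHandlerException {
        if (shuttingDown.get() && !isUnregister(request)) {
            // Drop anything that is not the unregister call (e.g. the DOWN replication).
            throw new ClientHandlerException("Eureka client is shutting down; request blocked");
        }
        return getNext().handle(request);
    }

    private boolean isUnregister(ClientRequest request) {
        // Unregistration is the DELETE on /apps/{appName}/{instanceId}; matching
        // on the HTTP method alone keeps the sketch simple.
        return "DELETE".equalsIgnoreCase(request.getMethod());
    }
}
```

If I remember correctly, the filter can be handed to the client through the DiscoveryClientOptionalArgs additional-filters hook; how you wire it in depends on the Eureka client version you are on.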

Comment From: ryanjbaxter

If this is a timing issue we can't really predict, I'm not sure what we can do.

Comment From: IsNull

@brenuart Could this issue be the reason for significantly longer down times when replacing an instance? I.e., when deploying a new version of a service, the old instances are shut down by the infrastructure as soon as the new service is alive on the web port. Even though there would always be a service available, due to Eureka's distributed nature the clients take quite a while before routing traffic to the new instance.

Would a cleaner handling, i.e. removing the old instance, help in this scenario? If so, I would be interested in your ClientFilter workaround, if you don't mind sharing.

Comment From: brenuart

The problem may be gone since https://github.com/Netflix/eureka/pull/876. I haven't tested it yet though.

Comment From: abracadv8

No, I believe it is still present in https://github.com/Netflix/eureka/issues/973

The PR to fix this has gone unacknowledged since September - https://github.com/Netflix/eureka/pull/974

Comment From: spencergibb

@abracadv8 the PR is waiting for the author, not Netflix, BTW.

Comment From: abracadv8

Correct, thanks for the clarification. I submitted a PR based on the original user's fork - https://github.com/Netflix/eureka/pull/1023

Comment From: cdprete

How was this solved? I seem to have the same issue in the latest version. See https://github.com/spring-cloud/spring-cloud-netflix/issues/4488