My application was working fine with Spring Boot 3.5.0 + Spring Cloud Gateway + Java 24. After I upgraded to Spring Boot 3.5.1 it started failing, and I see that 3.5.2 was released to fix the regression bug.
However, it is still failing with Spring Boot 3.5.2.
Sample project: https://github.com/feature-tracker/api-gateway/tree/feature/sb-3.5.2
GitHub Action pipeline: https://github.com/feature-tracker/api-gateway/actions/runs/15769850227/job/44452706183
Comment From: philwebb
Sorry about the mess of the releases today @sivaprasadreddy. This is a duplicate of #46040 and I'm trying to get another unscheduled patch out.
Comment From: sivaprasadreddy
No worries, thanks a lot for all your hard work.
Comment From: chrisgleissner
Just wondering, will there be an investigation why this regression was not found before releasing 3.5.1?
I have been using Spring, and then Spring Boot, for over 20 years now. I cannot recall a similar quick succession of patch releases.
I appreciate that bugs sometimes slip through, but I am curious how it happened in this case and which lessons can be learned.
Comment From: philwebb
@chrisgleissner I've done quite a bit of introspection as to why this one slipped through the net. In the history of the project I think we've only ever made one unscheduled release, and never had to do two in quick succession. I think there were a few issues that came together causing a perfect storm, but probably the most critical was that we didn't give enough time to allow the fixes to be validated.
For a bit more background, we generally would only make a change like the one that caused this issue in a milestone or RC. This gives us time to get feedback and removes any pressure because milestones aren't intended to be production ready. In this instance, we needed to make the change in a patch release because we found performance optimizations added in 3.5 actually caused slowdowns in some instances. Issue #44867 covers the optimizations we made in 3.5 RC1 and #45994 along with #45968 show the problems it caused.
In hindsight, at this point I should have restored the 3.4 code. However, from the analysis done for #44867 I knew I had the opportunity to still improve performance, so I made changes that went further than they perhaps should have. I was honestly pretty keen to keep at least some of the earlier work I'd done.
Anyway, part of the fix in 8f14dca164 includes some logic here that attempts to speed things up by using raw ConfigurationPropertyNames and caching the result. The idea is to save repeated calls to the stream() method, which is hopefully quicker and produces less garbage. A second micro-optimization was to reuse already calculated arrays from the parent, and that's where the NPE crept in. I missed adding a test to catch that because you need to call the ConfigurationPropertySource filter method twice, from a SpringConfigurationPropertySource, and have the first call apply to at least one element.
Without that specific test to catch the issue, and compounded by the fact that only folks using 3.5.1-SNAPSHOT with a certain setup would hit it, the bug just wasn't found in time.
Once it was discovered, it put us in quite a difficult situation. We know that this bug is likely to occur and folks are already reporting it. There's no workaround that we can recommend and it feels bad to expect folks to wait until 3.5.2, especially given the Spring Framework CVE we're picking up. At this point, it seems like making an unscheduled 3.5.x release is our best option.
I fix the issue in #46032 and add a test and I'm confident that I've fixed the issue.
Probably my biggest regret here is not pausing, waiting for a SNAPSHOT build and getting Spring Cloud to retest. We don't really have a good way to signal to folks that a release shouldn't be used. Once something is on Maven Central it's consumed, and not everyone reads the blog or looks at the release notes. This adds a lot of pressure to get the fix out quickly, which is exactly what I tried to do.
Unfortunately, 3.5.2 uncovered another, even worse bug in #46039. This one is really, really nasty because it doesn't fail hard, it just results in unbound data. It turns out this bug had actually been lurking since the original Binder code was written. We just never had a code path to trigger it before. The reason 3.5.2 triggers it is because of this line in IndexedElementsBinder. This was another small optimization designed to prevent a full scan of every name being needed when binding objects in a list. This update causes Binder.Context.withSource(...) to be called multiple times and with a different source. Because the old source wasn't restored, filtering remained in place when it shouldn't have, resulting in null bound elements.
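As a rough illustration of the "old source wasn't restored" problem described above, here is a minimal, self-contained sketch. It is not the actual Binder.Context code: BindContext, withSourceLeaky, withSource and WithSourceDemo are hypothetical names, and a plain String stands in for a property source. It only shows why a nested call that does not put the previous source back leaves the outer caller looking at the wrong source.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a binding context; NOT Spring Boot's Binder.Context.
final class BindContext {

    // The source that lookups are currently restricted to (null means "no restriction").
    private String source;

    String currentSource() {
        return this.source;
    }

    // Leaky variant: switches the source but never puts the previous one back.
    <T> T withSourceLeaky(String source, Supplier<T> action) {
        this.source = source;
        return action.get();
    }

    // Restoring variant: remembers the previous source and restores it in finally.
    <T> T withSource(String source, Supplier<T> action) {
        String previous = this.source;
        this.source = source;
        try {
            return action.get();
        }
        finally {
            this.source = previous;
        }
    }
}

final class WithSourceDemo {

    // Returns the source the outer action sees after a nested call has returned (leaky variant).
    static String afterInnerLeaky() {
        BindContext context = new BindContext();
        return context.withSourceLeaky("outer", () -> {
            context.withSourceLeaky("inner", () -> "ignored");
            return context.currentSource();
        });
    }

    // Same scenario using the restoring variant.
    static String afterInnerRestoring() {
        BindContext context = new BindContext();
        return context.withSource("outer", () -> {
            context.withSource("inner", () -> "ignored");
            return context.currentSource();
        });
    }

    public static void main(String[] args) {
        System.out.println("leaky:     " + afterInnerLeaky());
        System.out.println("restoring: " + afterInnerRestoring());
    }
}
```

In the leaky variant the outer action continues with the inner source still active. That is the same general shape of failure as described above, where lookups stay filtered to a source that no longer applies and elements can silently come back unbound rather than failing with an exception.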
We have a lot of tests for the Binder, but we didn't have a test that bound a list of objects where the object itself contains a further list of items to be bound. None of our smoketests has such a setup either.
In terms of lessons, perhaps there are a few that we can take away:
1) Reverting something that only improves performance in a patch release should probably just go back to original code and take the hit
2) Trying to get Spring projects onto patch SNAPSHOTs early would help catch these issues
3) Taking the time to do a full round of retests is probably worthwhile when the original problem causes hard failures. Those issues are actually better than errors that don't cause an exception.
Hopefully it will be another few years before we need to do this again :)