IGNITE-28806 Fix dynamic cache group affinity init during first local join #13263
oleg-vlsk wants to merge 3 commits into
    private boolean skipNotStartedDynamicGroupOnFirstLocalJoin(
        GridDhtPartitionsExchangeFuture fut,
        CacheGroupDescriptor desc,
        @Nullable CacheGroupContext grp,
        boolean newAff
    ) {
        if (grp != null)
            return false;

        if (newAff)
            return false;

        if (!firstLocalJoinExchange(fut))
            return false;

        AffinityTopologyVersion grpStartTopVer = desc.startTopologyVersion();

        if (grpStartTopVer == null)
            return false;

        return grpStartTopVer.after(fut.initialVersion());
    }
Can we do something like this, so that the check is based on the cache group start version instead of the first join?

    /** */
    private boolean skipNotStartedDynamicGroup(
        GridDhtPartitionsExchangeFuture fut,
        CacheGroupDescriptor desc,
        @Nullable CacheGroupContext grp,
        boolean newAff
    ) {
        if (grp != null || newAff)
            return false;

        AffinityTopologyVersion grpStartTopVer = desc.startTopologyVersion();

        return grpStartTopVer != null && grpStartTopVer.after(fut.initialVersion());
    }
Yes, makes sense. If we remove the first-join-check, the fix will be broader. Corrections added.
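For reference, the simplified check can be exercised in isolation. The sketch below is not Ignite code: `TopVer` is a hypothetical stand-in for `AffinityTopologyVersion` (assumed here to order by major, then minor version), and the boolean `grpStarted` stands in for `grp != null`. It only illustrates which combinations lead to a skip once the first-join condition is gone.

```java
/** Simplified, self-contained model of the skip check discussed above (not Ignite code). */
public class SkipCheckSketch {
    /** Hypothetical stand-in for AffinityTopologyVersion: ordered by major, then minor. */
    static final class TopVer implements Comparable<TopVer> {
        final long major;
        final int minor;

        TopVer(long major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        @Override public int compareTo(TopVer o) {
            int cmp = Long.compare(major, o.major);

            return cmp != 0 ? cmp : Integer.compare(minor, o.minor);
        }

        /** @return true if this version is strictly greater than the given one. */
        boolean after(TopVer o) {
            return compareTo(o) > 0;
        }
    }

    /**
     * @param grpStarted Stands in for {@code grp != null} (group context already exists locally).
     * @param newAff Same flag as in the reviewed method.
     * @param grpStartTopVer Group start version, may be null.
     * @param exchInitVer Initial version of the exchange being processed.
     * @return true if the group should be skipped for this exchange.
     */
    static boolean skipNotStartedDynamicGroup(
        boolean grpStarted,
        boolean newAff,
        TopVer grpStartTopVer,
        TopVer exchInitVer
    ) {
        if (grpStarted || newAff)
            return false;

        return grpStartTopVer != null && grpStartTopVer.after(exchInitVer);
    }

    public static void main(String[] args) {
        TopVer exch = new TopVer(5, 0);

        // Group starts later than the exchange: skipped.
        System.out.println(skipNotStartedDynamicGroup(false, false, new TopVer(5, 1), exch)); // true
        // Group started before the exchange: not skipped.
        System.out.println(skipNotStartedDynamicGroup(false, false, new TopVer(4, 0), exch)); // false
        // Unknown start version: not skipped.
        System.out.println(skipNotStartedDynamicGroup(false, false, null, exch)); // false
        // Group context already exists: not skipped, whatever the versions are.
        System.out.println(skipNotStartedDynamicGroup(true, false, new TopVer(5, 1), exch)); // false
    }
}
```

The only case that returns `true` is a group that has no local context, is not getting new affinity, and has a known start version later than the exchange's initial version.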
    private static final String ERR_MSG = "Invalid exchange futures state";

    /** */
    private static final int SRV_NODES = 3;

    /** */
    private static final int CLIENT_THREADS = 32;

    /** */
    private static final int RESTART_CNT = 10;

    /** */
    private static final int FIRST_CLIENT_PORT = 10800;

    /** */
    private static final int CACHE_CFG_CNT = 30;

    /** */
    private final AtomicBoolean stopClients = new AtomicBoolean();

    /** */
    private final AtomicInteger clientCreateSuccesses = new AtomicInteger();

    /** */
    private final AtomicReference<Throwable> caughtErr = new AtomicReference<>();

    /** */
    private final CountDownLatch caughtLatch = new CountDownLatch(1);

    /** */
    private final Map<Integer, ClientCacheConfiguration> cacheConfs = new HashMap<>();

    /** */
    private Thread.UncaughtExceptionHandler oldUncaughtHnd;

    /** */
    private IgniteInternalFuture<?> startFut;

    /** */
    private IgniteInternalFuture<?> clientFut;
I think this test is too heavy for this fix. It looks more like a stress/probabilistic test than a deterministic regression test: even if the bug reappears, the test can still pass if the race does not happen during the run.
I suggest reverting this test in a separate commit, in case another reviewer asks to restore it later. Instead, we can add a smaller happy-path test to check that we do not skip extra cache groups by mistake.
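To illustrate the difference between a probabilistic and a deterministic test, here is a minimal sketch of one common technique: pinning the interleaving with a latch instead of hoping that many threads and restarts hit the race. It uses only the JDK; the class name and the "group-started" / "join-processed" event names are hypothetical and do not correspond to any Ignite test hook. It is not necessarily the test shape suggested above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Sketch: force a fixed interleaving of two concurrent steps with a latch (not Ignite code). */
public class DeterministicInterleavingSketch {
    /** Runs two steps on separate threads and returns the order in which they were recorded. */
    static List<String> runOrdered() {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch grpStarted = new CountDownLatch(1);

        // The "join" step is parked at a hook point until the "start" step has happened.
        Thread joiner = new Thread(() -> {
            try {
                if (!grpStarted.await(5, TimeUnit.SECONDS)) {
                    events.add("timeout");

                    return;
                }

                events.add("join-processed");
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The "start" step records its event and only then releases the joiner.
        Thread starter = new Thread(() -> {
            events.add("group-started");

            grpStarted.countDown();
        });

        joiner.start();
        starter.start();

        try {
            joiner.join();
            starter.join();
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            throw new IllegalStateException("Interrupted while waiting for worker threads", e);
        }

        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        // The latch makes the order the same on every run.
        System.out.println(runOrdered()); // [group-started, join-processed]
    }
}
```

With the ordering fixed this way, the test either reproduces the problematic interleaving on every run or it does not, so a regression cannot slip through because the race happened not to occur.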
…rtedDynamicGroup and add DynamicCacheStartExchangeTest (as per review)
… join