Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-2062] Ensure the Orchestrator always has DagManager remove FlowSpec - even when orchestration has failed #3944

Merged

Conversation

phet
Copy link
Contributor

@phet phet commented May 6, 2024

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

the Orchestrator + DagManager MUST remove adhoc flows from the FlowCatalog that fail compilation or violate concurrent execs. otherwise gaas will continue to return '409 Conflict' to each subsequent attempt to subsequently create an adhoc flow with the same flowGroup+flowName. this is despite the fact that the flow (still remaining in the FlowCatalog, when it shouldn't be) already has the status of FAILED, which is a "final status".

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

added test cases (herein)

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@@ -45,6 +45,7 @@ dependencies {
}

compile externalDependency.gson
compile externalDependency.lombok
Copy link
Contributor Author

@phet phet May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unclear how this was missed... but not being there resulted in MANY compilation warnings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, how did you find that this was missing

@@ -57,6 +58,8 @@ public static class NoLongerLeasingStatus extends LeaseAttemptStatus {}
current LeaseObtainedStatus via the completeLease method from a caller without access to the {@link MultiActiveLeaseArbiter}.
*/
@Data
// avoid - warning: Generating equals/hashCode implementation but without a call to superclass, even though this class does not extend java.lang.Object
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another preventable source of compilation warnings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is to prevent using any future fields added to LeaseAttemptStatus, then why not add callSuper=false to NoLongerLeasingStatus?

Copy link
Contributor Author

@phet phet May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warnings at least don't indicate that it's necessary over there and I believe that's because warning arose since LeaseObtainedStatus and LeasedToAnotherStatus are @Data, whereas NoLongerLeasingStatus is not.

since not @Data, lombok is not synthesizing .equals for NoLongerLeasingStatus

Comment on lines +261 to +264
} finally {
// remove from the flow catalog, regardless of whether the flow was successfully validated and permitted to exec (concurrently)
this.dagManager.removeFlowSpecIfAdhoc(flowSpec);
}
Copy link
Contributor Author

@phet phet May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too bad the diff above doesn't clearly indicate it was solely an indentation change to add the try ... finally here. the purpose of which is to ensure FlowCatalog cleanup, come whatever may

Comment on lines +226 to +235
// WARNING: this `spiedDagManager` WILL NOT BE the one used by the `Orchestrator`: its DM has apparently already been
// provided to the `Orchestrator` ctor, prior to this replacement here of `GobblinServiceManager.dagManager`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documenting for posterity, since unfortunately, this was quite troublesome to debug

private static final String TEST_USER = "testUser";
private static final String TEST_PASSWORD = "testPassword";
private static final String TEST_TABLE = "quotas";

@BeforeClass
@BeforeMethod
Copy link
Contributor Author

@phet phet May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately lots of "testing smells" here.

first off, class-level test init doesn't play well w/ verifying mock interactions, since we want a fresh count for the exec of each test method. relatedly it also complicates - if not outright foreclosing on - parallel test method execution.

secondly, the @Test(dependsOnMethod = ...) structuring is a testing anti-pattern that here obscured that setup-style init had been formulated instead as test methods; see - createTopologySpec(). a major clue of something wrong might have been that that particular "@test" method does not even exercise Orchestrator, our class-under-test!

for now, I've fixed this suite to use per-method setup/teardown best practices, but left it as a TODO to figure out where createTopologySpec actually belongs. createFlowSpec and deleteFlowSpec also deserve attention in this same regard.

Copy link
Contributor

@Will-Lo Will-Lo May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's anything wrong with @BeforeClass explicitly, it has its uses when doing expensive setup e.g. DB or Kafka setup and this is done in some other classes. I would argue that you can have both BeforeClass and BeforeMethod depending on your needs, though for this class specifically BeforeMethod makes more sense for mock setup.

Copy link
Contributor Author

@phet phet May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree and now realize, after you pointed it out, that I didn't contextualize overall test init best practice, so much as denounce what's broken about using solely @BeforeClass init.

to encapsulate a fuller picture now, I'd advise

Prefer @{Before,After}Method test init to @{Before,After}Class. when mocks are in use, the latter w/o the former MOST LIKELY indicates the wrong choice, whereas having only method, but not class-level init is quite fine. In fact the most common acceptable rationale for class-level init is to amortize payment of initialization tax just once per class, rather than incurring once per test case.

truly, at this moment, I can't think of any non-performance-motivated argument for using @BeforeClass over just adding everything into @BeforeMethod. can you?

Copy link
Contributor

@Will-Lo Will-Lo May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah its better to avoid state between tests when possible so beforeMethod works much better here

Copy link
Contributor Author

@phet phet May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting pt. of comparison: after the PR check test failures gave:

java.io.IOException: Failure creation table TestDagStateStore

due to:

java.sql.SQLNonTransientConnectionException: Too many connections

I realized the ITestMetastoreDatabase needed to be closed.

When I first put .close() within @AfterMethod, the 7 OrchestratorTest cases took 1m49s.

I then defined an @AfterClass for .close() and those same 7 tests (all passing with either definition) took only 33s. so definitely sticking w/ @BeforeClass + @AfterClass for that one!

this.topologyCatalog.addListener(orchestrator);
this.flowCatalog.addListener(orchestrator);
FlowCompilationValidationHelper flowCompilationValidationHelper = new FlowCompilationValidationHelper(config, sharedFlowMetricsSingleton, mock(UserQuotaManager.class), mockFlowStatusGenerator);
this.dagMgrNotFlowLaunchHandlerBasedOrchestrator = new Orchestrator(ConfigUtils.propertiesToConfig(orchestratorProperties),
Copy link
Contributor Author

@phet phet May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to avoid confusion, as I now pass Optional.absent() for the FlowLaunchHandler. nothing about the previously-existing tests below strictly validated that code path, so I made this entire OrchestratorTest class specific to the legacy DagManager version of Orchestrator.

when I say:

nothing about the previously-existing tests below...

I found only one test, doNotRegisterMetricsAdhocFlows(), that actually belongs in this class. while it certainly is helpful, it's also quite tangential to the core operation of Orchestrator::orchestrate.

the three new tests I added mostly cover the DagManager-based orchestrate. as for the lack of validation for the other code path, that suggests a missing test: e.g. to verify that orchestrate invokes FlowLaunchHandler::handleFlowLaunchTriggerEvent. at that time we should STILL ALSO verify the equivalent of doNotRegisterMetricsAdhocFlows()

cc: @umustafi and @arjun4084346

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make a note to add the other code path for FlowLaunchHandler baed orchestrate. If those reside in this test class, then following the methodology discussed about we should have a separate setupClass method used for that test.

@codecov-commenter
Copy link

codecov-commenter commented May 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 38.72%. Comparing base (a6f648b) to head (3b38d07).
Report is 8 commits behind head on master.

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3944       +/-   ##
=============================================
- Coverage     50.18%   38.72%   -11.47%     
+ Complexity     5908     1595     -4313     
=============================================
  Files          1075      388      -687     
  Lines         41226    15952    -25274     
  Branches       4625     1577     -3048     
=============================================
- Hits          20691     6177    -14514     
+ Misses        18732     9282     -9450     
+ Partials       1803      493     -1310     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

public synchronized void addDagAndRemoveAdhocFlowSpec(FlowSpec flowSpec, Dag<JobExecutionPlan> dag, boolean persist, boolean setStatus)
throws IOException {
addDag(dag, persist, setStatus);
public synchronized void removeFlowSpecIfAdhoc(FlowSpec flowSpec) throws IOException {
// Only the active dagManager should delete the flowSpec
if (isActive) {
deleteSpecFromCatalogIfAdhoc(flowSpec);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be in GSJS because all hosts in multi-active scheduler state needed access to the flowSpec in catalog see this PR desc: #3846. We were running into no spec found in the DagActionStoreChangeMonitor error if an inactive host deleted the flowSpec before active host processed CDC stream message. For MA execution we want the spec in catalog until all hosts have received the CDC stream message and retrieved spec -> need to think about how to achieve this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, we are not removing it from flow catalog there, just removing it from scheduledFlowSpecs

Copy link
Contributor

@umustafi umustafi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch of this bug and thanks for fixing these tests! Looks like a pain. These changes work for the single active DM for now i think we should merge it in and then discuss if it makes sense for multi-active dagProcEngine

@phet phet force-pushed the multi-leader-adhoc-cleanup-from-catalog branch from 3b38d07 to 5593a86 Compare June 12, 2024 01:17
@arjun4084346 arjun4084346 merged commit 8433913 into apache:master Jun 12, 2024
6 checks passed
Blazer-007 pushed a commit to Blazer-007/gobblin that referenced this pull request Jun 17, 2024
…ve `FlowSpec` - even when orchestration has failed (apache#3944)

* Ensure the `Orchestrator` always has `DagManager` remove `FlowSpec` - even when orchestration has failed
* Address several compilation warnings related to lombok
* Ensure `OrchestratorTest` cleans up its `ITestMetastoreDatabase`
* Avoid NPE in `GobblinServiceManagerTest::cleanUp`
* Remove out-dated and confusing comment in `GobblinServiceJobScheduler`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants