
preserving allocation domain transform on aliased output #3970

Merged (37 commits into main, Mar 5, 2025)

Conversation

jjsjann123 (Collaborator) commented Feb 26, 2025

Fixes #3957

The issue was that when we specify an output alias, we need to update the allocation domain and contiguity of the alias to be identical to the source's.
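
To make that constraint concrete, here is a minimal PyTorch-only sketch (no nvFuser call involved; the shape and strides mirror the Python test added below, with the device dropped so it runs anywhere):

    import torch

    # Non-contiguous (2, 2) view with strides (1, 3) over a 5-element buffer,
    # matching the layout used in the Python test added by this PR.
    buf = torch.randn(5, dtype=torch.float32)
    inp = buf.as_strided((2, 2), (1, 3))
    assert inp.stride() == (1, 3)

    # An output aliased to `inp` shares this buffer, so the kernel has to write
    # it back using the same allocation order and contiguity. If the alias
    # instead assumed a contiguous (2, 2) layout with strides (2, 1), the same
    # logical values would land in different buffer slots, which surfaces as the
    # stride mismatch error reported in #3957.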

jjsjann123 changed the title from "Fix 3957" to "preserving allocation domain transform on aliased output" on Feb 26, 2025
jjsjann123 (Collaborator, Author): !test

github-actions bot commented Feb 26, 2025

Review updated until commit 0fe131d

Description

  • Added selfAllocationReplay to preserve allocation domain and contiguity.

  • Ensured memory layout consistency for aliased outputs.

  • Added a Python test for non-contiguous input handling.


Changes walkthrough 📝

Relevant files

Enhancement

  csrc/fusion.cpp (+11/-0): Preserve allocation domain in aliasing
  • Included transform_replay.h.
  • Added selfAllocationReplay call in aliasOutputToInput.

  csrc/preseg_passes/remove_bcast_squeeze.cpp (+9/-0): Preserve allocation domain in replacement
  • Included transform_replay.h.
  • Added selfAllocationReplay call in maybeDoReplacement.

  csrc/transform_replay.cpp (+104/-1): Implement and enhance self allocation replay
  • Added selfAllocationReplay function.
  • Enhanced error messages in fullSelfReplay.

  csrc/transform_replay.h (+6/-0): Declare self allocation replay function
  • Declared selfAllocationReplay function.

Tests

  tests/python/test_python_frontend.py (+41/-0): Add test for non-contiguous input handling
  • Added a test case for inplace update on non-contiguous inputs.

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Possible Issue

    Ensure that the TransformReplay::selfAllocationReplay function correctly handles all edge cases, especially when dealing with symbolic IDs and reduction IDs.

    // We need to replay the allocation domain and contiguity of input on output,
    // this is because these two share the data buffer and should interpret it
    // identically. Technically these two don't have to be identical, but rather
    // compatible.
    NVF_CHECK(
        output->isA<TensorView>() && input->isA<TensorView>(),
        "aliasing output to input is only supported for TensorView");
    TransformReplay::selfAllocationReplay(
        input->as<TensorView>()->domain(), output->as<TensorView>()->domain());
    
    Code Complexity

    The selfAllocationReplay function is quite complex. Consider breaking it down into smaller, more manageable functions to improve readability and maintainability.

    TensorDomain* TransformReplay::selfAllocationReplay(
        const TensorDomain* self,
        TensorDomain* new_self) {
    FUSER_PERF_SCOPE("TransformReplay::selfAllocationReplay");
    
    // NOTE: We could also have reduction IDs involved in the transformations
    // that lead to the allocation domain, so technically we should have included
    // reduction IDs in the replay as well. The reason we skip them here is that
    // this function is used by `RemoveBcastSqueeze`, where we could have
    // mismatched reduction IDs on the logical domain between `self` and
    // `new_self`.
    auto new_self_logical = TensorDomain::noReductions(new_self->logical());
    auto self_logical = TensorDomain::noReductions(self->logical());
    
    NVF_ERROR(
        new_self_logical.size() == self_logical.size(),
        "Invalid number of IterDomains provided: ",
        new_self_logical.size(),
        " vs ",
        self_logical.size());
    
    // Map for replay
    id_map axis_map;
    {
      int64_t i = 0;
      for (IterDomain* id : self_logical) {
        // Note: we don't want to check for equal `isRFactorProduct`, since we
        // could replay Allocation of the output of a reduction to a later
        // consumer tensor, which would not have the rfactor flag on.
        IterDomain* new_id = new_self_logical[i];
        // Note: this function can be used prior to concretization, where we might
        // have an unresolved symbolic ID whose broadcast flag might mismatch.
        // We skip the check if either id or new_id is symbolic and expect a
        // correct user program.
        NVF_ERROR(
            new_id->isSymbolic() || id->isSymbolic() ||
                new_id->isBroadcast() == id->isBroadcast(),
            "Axes ",
            id,
            " and ",
            new_id,
            " do not match for self replay.");
        axis_map[id] = new_id;
        i++;
      }
    }
    
    // Replay producer dimensions.
    const std::vector<IterDomain*>& self_allocation = self->maybeAllocation();
    const std::vector<std::optional<bool>>& self_contiguity = self->contiguity();
    const std::vector<IterDomain*>& self_allocation_no_reduction =
        TensorDomain::noReductions(self_allocation);
    
    // We replay only non-reduction IDs. The reason is that we might have
    // non-mapping reduction IDs between self and new_self. This is used in
    // `RemoveBcastSqueeze`.
    ReplaySelf replay(self_allocation_no_reduction, axis_map);
    std::vector<IterDomain*> new_alloc_domain;
    std::vector<std::optional<bool>> new_contiguity;
    new_alloc_domain.reserve(self_allocation.size());
    new_contiguity.reserve(self_allocation.size());
    
    {
      // Push back the reduction IDs that are not mapped
      for (auto id : new_self->logical()) {
        if (id->isReduction()) {
          new_alloc_domain.push_back(id);
          // NOLINTNEXTLINE (modernize-use-emplace)
          new_contiguity.push_back(std::nullopt);
        }
      }
    
      // Pushing the mapped IDs and corresponding contiguity flags
      for (size_t i : c10::irange(self_allocation.size())) {
        IterDomain* id = self_allocation[i];
        if (id->isReduction()) {
          continue;
        }
        auto it = replay.getReplay().find(id);
        NVF_ERROR(
            it != replay.getReplay().end(), "failed to replay IterDomain: ", id);
        // The possibility of a mismatch is when one of the IDs is symbolic. We
        // need to ensure that new_contiguity is consistent with new_alloc_domain,
        // otherwise the later setAllocationDomain would fail checks.
        if (it->second->isBroadcast() == self_contiguity[i].has_value()) {
          // whether we resolve to true or false shouldn't matter since it's going
          // to be concretized as a broadcast dimension
          new_contiguity.push_back(
              it->second->isBroadcast() ? std::nullopt
                                        : std::make_optional(true));
        } else {
          new_contiguity.push_back(self_contiguity[i]);
        }
        new_alloc_domain.push_back(it->second);
      }
    }
    
    return new_self->setAllocationDomain(new_alloc_domain, new_contiguity);
    }

    Test Coverage

    Ensure that the new test case covers all relevant scenarios and edge cases, especially those involving non-contiguous inputs and aliased outputs.

    # See https://github.com/NVIDIA/Fuser/issues/3957
    # This test checks that our alias update keeps the output layout consistent
    def test_inplace_update_on_non_contiguous_inputs(self):
        inputs = [
            torch.randn(5, dtype=torch.float32, device="cuda:0").as_strided(
                (2, 2), (1, 3)
            ),
        ]
    
        def fusion_func(fd: FusionDefinition) -> None:
            T0 = fd.define_tensor(
                shape=[2, 2],
                contiguity=[False, True],
                dtype=DataType.Float,
                is_cpu=False,
                stride_order=[0, 1],
            )
            S1 = fd.define_scalar(0.00000, dtype=DataType.Double)
            T2 = fd.ops.gt(T0, S1)
            S3 = fd.define_scalar(0.00000, dtype=DataType.Double)
            T4 = fd.ops.where(T2, T0, S3)
            T5 = fd.ops.cast(T4, dtype=DataType.Float)
            T6 = fd.ops.set(T5)
            fd.add_output(T6, T0)
            fd.add_output(T6)
    
        ref_inp = inputs[0].clone()
    
        # is_clonable is not supported yet, because the print out would explicitly mark output
    # with stride_order and overwrite its contiguity flag. This would violate the memory
        # layout required by the alias.
        nvf_out, _ = self.exec_nvfuser(
            fusion_func,
            inputs,
            is_clonable=False,
        )
    
        assert len(nvf_out) == 1
        self.assertEqual(nvf_out[0], inputs[0])
        self.assertEqual(nvf_out[0], ref_inp.relu())
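
    A couple of hypothetical follow-up assertions (not part of the PR, sketched here only for review) would make the in-place nature of the alias explicit; they rely only on `inputs` and `ref_inp` as defined in the test above:

        # Hypothetical extra assertions: the aliased write should update the
        # original non-contiguous input buffer in place, without changing its
        # strides.
        assert inputs[0].stride() == (1, 3)
        self.assertEqual(inputs[0], ref_inp.relu())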

    csrc/fusion.cpp Outdated
    NVF_CHECK(output->isA<TensorView>() && input->isA<TensorView>(),
    "aliasing output to input is only supported for TensorView");
    auto new_domain = TransformReplay::selfAllocationReplay(output->as<TensorView>()->domain(), input->as<TensorView>()->domain());
    output->as<TensorView>()->setAllocationDomain(new_domain->allocation(), new_domain->contiguity());

    jjsjann123 (Collaborator, Author) commented Feb 26, 2025:

    When we are aliasing an output to an input, the two need to have identical allocation domains and contiguity (technically we only need a compatible allocation domain, but it's easier to have a fixed target).

    jjsjann123 (Collaborator, Author): !test

    jjsjann123 (Collaborator, Author): keeping my 🤞

    jjsjann123 (Collaborator, Author): !test --diff

    jjsjann123 (Collaborator, Author): !test --diff

    jjsjann123 (Collaborator, Author): !test --diff

    jjsjann123 (Collaborator, Author): !test --diff

    @@ -233,6 +233,97 @@ TensorDomain* TransformReplay::fullSelfReplay(
    new_self_root->contiguity());
    }

    TensorDomain* TransformReplay::selfAllocationReplay(
    jjsjann123 (Collaborator, Author):

    I tagged @wujingyue because I think you asked for something similar to this?

    Though I'm not totally sure this does what you want, because I'm addressing a requirement from remove_bcast_squeeze. I think we could also refactor this and put some of the logic on the user side if we need a more generic replay later.

    wujingyue (Collaborator) left a comment:

    LGTM overall -- there are some details that I want to make sure I understand to have a good mental model.

    @@ -233,6 +233,97 @@ TensorDomain* TransformReplay::fullSelfReplay(
    new_self_root->contiguity());
    }

    TensorDomain* TransformReplay::selfAllocationReplay(
    wujingyue (Collaborator):

    Note: this can probably be consolidated with the existing `TensorDomain* fullReplay(` and with #3950.

    @@ -389,6 +390,16 @@ TensorView* maybeDoReplacement(TensorView* orig) {
    }
    }

    // If we are replacing an output, we need to preserve the memory layout by
    wujingyue (Collaborator):

    Can we do this check in Fusion::replaceOutput?

    jjsjann123 (Collaborator, Author):

    That check would need to be opt-in. We have places where we explicitly alter the memory layout, e.g. mark-alias-prepare and sharding propagation intentionally change the memory layout.

    But it would be nice to have a check on that.

    So the check would actually be checking for compatible transforms, i.e. the two transforms might not necessarily be the same, but they should be equivalent... This feels like something @wujingyue and @zasdfgbnm were discussing together.

    jjsjann123 (Collaborator, Author):

    Opened #3994 to track this.

    jjsjann123 requested a review from wujingyue on February 28, 2025 20:21
    jjsjann123 (Collaborator, Author): !test --diff

    jjsjann123 (Collaborator, Author):

    I'm surprised to see a segfault from the python_bc_advisory check. I didn't change any Python API, so I have no idea where that segfault is coming from.

    jjsjann123 (Collaborator, Author): !test --diff

    NVF_ERROR(
    it != replay.getReplay().end(),
    "Error during replay, didn't replay an axis.");
    if (it->second->isBroadcast() == self_contiguity[i].has_value()) {
    wujingyue (Collaborator):

    This feels like a bug or a misuse. Is there a good use case for is-broadcast to be inconsistent between self and new_self?

    jjsjann123 (Collaborator, Author):

    This is for the same reason as the comment above about symbolic IDs: if we are dealing with a symbolic ID, we just need to make sure that the contiguity flag on the given ID is consistent with itself.

    Later during concretization, the consistency will be fixed globally.
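
    For intuition, here is a hedged Python-frontend illustration (a hypothetical fusion, not code from this PR) of the invariant the replay maintains: broadcast dimensions carry no contiguity flag, so the replayed contiguity has to stay consistent with whatever the ID concretizes to:

        from nvfuser import FusionDefinition, DataType

        def fusion_func(fd: FusionDefinition) -> None:
            # Broadcast (size-1) dimensions are declared with contiguity None,
            # while concrete dimensions use True/False. selfAllocationReplay has
            # to keep the replayed contiguity consistent with the (possibly still
            # symbolic) ID; concretization later fixes consistency globally.
            T0 = fd.define_tensor(
                shape=[1, 4],
                contiguity=[None, True],  # None for the broadcast dimension
                dtype=DataType.Float,
                is_cpu=False,
            )
            T1 = fd.ops.set(T0)
            fd.add_output(T1)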

    jjsjann123 (Collaborator, Author): !test --diff

    jjsjann123 (Collaborator, Author):

    The failure doesn't seem to be relevant, and we have all tests cleared in 4e6d81e.

    I'm merging this PR as-is.

    jjsjann123 merged commit 37bc4fa into main on Mar 5, 2025 (55 of 59 checks passed).
    jjsjann123 deleted the fix_3957 branch on March 5, 2025 02:21.
    Successfully merging this pull request may close these issues.

    Stride mismatch error with correct contiguity info (#3957)