FEAT-#7523: Improve formal definition of the automatic switching algorithm (#7524)

sfc-gh-jkew · sfc-gh-mvashishtha · web-flow · commit a68dab4ad71e · 2025-04-24T13:23:51.000-07:00
We add the `move_to_me_cost` function as something to be consulted during automatic switching. This allows the /other/ query compiler to have more of a say in a potential data migration. It also helps to formalize the questions being asked of each participating query compiler: specifically, `move_to_cost` can be precisely defined as just the transmission and serialization cost of data movement. We also allow ourselves to disregard the transmission cost (the `move_to_cost`) when the current engine is simply unable to execute the current workload.

We also modify the Backend environment variable to allow setting and getting the choices, in order to constrain the set of engines considered during automatic switching.

In a future commit we will implement a default function similar to what is configured in the tests. A separate future commit will add a public method to set the active backends.

## What do these changes do?

- [x] first commit message and PR title follow format outlined [here](https://modin.readthedocs.io/en/latest/development/contributing.html#commit-message-formatting)
  > **_NOTE:_** If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
- [x] passes `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- [x] passes `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- [x] signed commit with `git commit -s`
- [x] Resolves #7523
- [x] tests added and passing
- [x] module layout described at `docs/development/architecture.rst` is up-to-date

---------

Co-authored-by: Mahesh Vashishtha <mahesh.vashishtha@snowflake.com>
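
For readers skimming the diff, the decision rule described above can be sketched in isolation. This is a simplified illustration rather than the code in `_get_backend_for_auto_switch`: the helper names `move_stay_delta` and `pick_backend` are hypothetical, costs are passed in as plain integers instead of being queried from query compilers, and the value 1000 for `COST_IMPOSSIBLE` is an assumption based on the 0-1000 range stated in the updated docs.

```python
from typing import Dict, Optional, Tuple

# Assumed stand-in for QCCoercionCost.COST_IMPOSSIBLE (docs give a 0-1000 range).
COST_IMPOSSIBLE = 1000


def move_stay_delta(
    move_to_cost: Optional[int],
    stay_cost: Optional[int],
    other_execute_cost: Optional[int],
) -> Optional[int]:
    """Return the move-vs-stay delta, or None if any cost is unknown."""
    if move_to_cost is None or stay_cost is None or other_execute_cost is None:
        return None
    if stay_cost >= COST_IMPOSSIBLE:
        # The current engine cannot run the workload, so the transfer cost
        # is disregarded and only the destination's execution cost counts.
        return other_execute_cost - stay_cost
    # The current engine could run the workload, so the transfer cost
    # is part of the price of moving.
    return (move_to_cost + other_execute_cost) - stay_cost


def pick_backend(
    starting_backend: str,
    stay_cost: Optional[int],
    candidates: Dict[str, Tuple[Optional[int], Optional[int]]],
) -> str:
    """Pick the backend with the most negative delta, else stay put.

    `candidates` maps a backend name to (move_to_cost, other_execute_cost).
    """
    best_backend = starting_backend
    min_delta: Optional[int] = None
    for backend, (move_to_cost, other_execute_cost) in candidates.items():
        delta = move_stay_delta(move_to_cost, stay_cost, other_execute_cost)
        if delta is not None and delta < 0 and (
            min_delta is None or delta < min_delta
        ):
            min_delta = delta
            best_backend = backend
    return best_backend
```

Under these assumptions, a move happens only when some candidate has a strictly negative delta, so a tie between moving and staying keeps the data where it is.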
diff --git a/docs/development/architecture.rst b/docs/development/architecture.rst
@@ -89,12 +89,40 @@ Dataframe.
 In the interest of reducing the pandas API, the Query Compiler layer closely follows the
 pandas API, but cuts out a large majority of the repetition.
 
+Automatic Engine Switching and Casting
+""""""""""""""""""""""""""""""""""""""
+
 QueryCompilers which are derived from QueryCompilerCaster can participate in automatic casting when
 different query compilers, representing different underlying engines, are used together in a
 function. A relative "cost" of casting is used to determine which query compiler everything should
 be moved to. Each query compiler must implement the functions, `move_to_cost`, `move_to_me_cost`, 
 `max_cost` and `stay_cost` to provide information and query costs associated with different decision
-points in cost opimization.
+points in cost optimization. With the exception of `max_cost`, these methods must return a
+QCCoercionCost in the range 0-1000.
+
+These functions have precise meanings:
+
+* `move_to_cost` is the transmission cost of moving the data, including known serialization costs
+  from the perspective of that particular compiler. Colloquially, the question being asked of the
+  query compiler is, "What is the normalized cost of moving my data to the other engine?"
+* `move_to_me_cost` is the execution cost for the data and operation on the proposed *destination*
+  query compiler. Since this method is called before the data has been migrated, it is a class
+  method, and the destination query compiler may have very limited information on the possible cost
+  after migration. Factors that may be considered here include available memory, cpu, and the
+  unique characteristics of the engine. The question being asked is, "If this data were moved to
+  me, what would be the normalized execution cost to perform that operation?"
+* `stay_cost` is the execution cost on the current query compiler (where the data is). The question
+  asked of the query compiler is, "If I were to keep this data on my engine, what would be the normalized
+  execution cost?"
+* `max_cost` is the maximum cost allowed by this query compiler across all data movements. This method
+  sets a normalized upper bound for situations where multiple data frames from different engines all
+  need to move to the same engine. The value returned by this method can exceed
+  QCCoercionCost.COST_IMPOSSIBLE.
+
+There are generally two places where automatic casting is considered: when two or more DataFrames on
+different engines participate in an operation (such as pd.concat), or at functions registered
+for particular engines through the `register_function_for_pre_op_switch` and
+`register_function_for_post_op_switch` methods.
 
 Core Modin Dataframe
 """"""""""""""""""""
diff --git a/modin/config/envvars.py b/modin/config/envvars.py
@@ -490,6 +490,42 @@ def add_option(cls, choice: str) -> NoReturn:
             "Cannot add an option to Backend directly. Use Backend.register_backend instead."
         )
 
+    @classmethod
+    def set_active_backends(cls, new_choices: tuple) -> None:
+        """
+        Set the active backends available for manual and automatic switching.
+
+        Other backends may have been registered, and those backends remain registered, but the
+        set of engines that can be used is dynamically modified.
+
+        Parameters
+        ----------
+        new_choices : tuple
+            Registered backend names that become the new set of active choices.
+
+        Raises
+        ------
+        ValueError
+            If any backend in new_choices is not already registered.
+        """
+        if not all(i in cls._BACKEND_TO_EXECUTION for i in new_choices):
+            raise ValueError(
+                f"Active backend choices {new_choices} are not all registered."
+            )
+        cls.choices = new_choices
+
+    @classmethod
+    def get_active_backends(cls) -> tuple[str, ...]:
+        """
+        Get the active backends available for manual and automatic switching.
+
+        Returns
+        -------
+        tuple[str, ...]
+            The active set of backends for switching.
+        """
+        return cls.choices
+
     @classmethod
     def get_backend_for_execution(cls, execution: Execution) -> str:
         """
diff --git a/modin/core/storage_formats/base/query_compiler.py b/modin/core/storage_formats/base/query_compiler.py
@@ -322,6 +322,9 @@ def move_to_cost(
         decision points. Values returned must be within the acceptable
         range of QCCoercionCost
 
+        The question is: What are the transfer costs associated with
+        moving this data to the other_qc_type?
+
         Parameters
         ----------
         other_qc_type : QueryCompiler Class
@@ -360,6 +363,9 @@ def stay_cost(
         the other engine, where as the cost returned by 'stay_cost'
         may be simply the cost of running the operation locally.
 
+        The question is: What is the cost of running this operation on
+        the current dataframe?
+
         Values returned must be within the acceptable range of
         QCCoercionCost
 
@@ -389,11 +395,17 @@ def move_to_me_cost(
         operation: Optional[str] = None,
     ) -> Optional[int]:
         """
-        Return the coercion costs from other_qc to this qc type.
+        Return the execution and hidden coercion costs from other_qc.
+
+        This can be implemented as a class method version of stay_cost, though
+        since this class is not yet instantiated it may have a different
+        implementation. It may also include hidden transport or serialization
+        costs.
+
+        Values returned must be within the acceptable range of QCCoercionCost.
 
-        This is called for forced casting decision points, where one or more
-        DataFrames from different engines must interoperate. Values returned
-        must be within the acceptable range of QCCoercionCost
+        The question is: What is the cost of executing this operation if it
+        were to move to this query compiler?
 
         Parameters
         ----------
diff --git a/modin/core/storage_formats/pandas/query_compiler_caster.py b/modin/core/storage_formats/pandas/query_compiler_caster.py
@@ -35,6 +35,7 @@
 from modin.config import context as config_context
 from modin.core.storage_formats.base.query_compiler import (
     BaseQueryCompiler,
+    QCCoercionCost,
 )
 from modin.core.storage_formats.base.query_compiler_calculator import (
     BackendCostCalculator,
@@ -484,7 +485,7 @@ def _get_backend_for_auto_switch(
         api_cls_name=class_of_wrapped_fn,
         operation=function_name,
     )
-    for backend in Backend._BACKEND_TO_EXECUTION:
+    for backend in Backend.get_active_backends():
         if backend in ("Ray", "Unidist", "Dask"):
             # Disable automatically switching to these engines for now, because
             # 1) _get_prepared_factory_for_backend() currently calls
@@ -502,17 +503,35 @@ def _get_backend_for_auto_switch(
             api_cls_name=class_of_wrapped_fn,
             operation=function_name,
         )
-        if move_to_cost is not None and stay_cost is not None:
-            move_stay_delta = move_to_cost - stay_cost
+        other_execute_cost = move_to_class.move_to_me_cost(
+            input_qc,
+            api_cls_name=class_of_wrapped_fn,
+            operation=function_name,
+        )
+        if (
+            move_to_cost is not None
+            and stay_cost is not None
+            and other_execute_cost is not None
+        ):
+            if stay_cost >= QCCoercionCost.COST_IMPOSSIBLE:
+                # We cannot execute the workload on the current engine
+                # disregard the move_to_cost and just consider whether
+                # the other engine can execute the workload
+                move_stay_delta = other_execute_cost - stay_cost
+            else:
+                # We can execute this workload if we need to, consider
+                # move_to_cost/transfer time in our decision
+                move_stay_delta = (move_to_cost + other_execute_cost) - stay_cost
             if move_stay_delta < 0 and (
                 min_move_stay_delta is None or move_stay_delta < min_move_stay_delta
             ):
                 min_move_stay_delta = move_stay_delta
                 best_backend = backend
             logging.info(
                 f"After {class_of_wrapped_fn} function {function_name}, "
-                + f"considered moving to backend {backend} with move_to_cost "
-                + f"{move_to_cost}, stay_cost {stay_cost}, and move-stay delta "
+                + f"considered moving to backend {backend} with "
+                + f"(transfer_cost {move_to_cost} + other_execution_cost {other_execute_cost})"
+                + f", stay_cost {stay_cost}, and move-stay delta "
                 + f"{move_stay_delta}"
             )
     if best_backend == starting_backend:
diff --git a/modin/tests/pandas/native_df_interoperability/test_compiler_caster.py b/modin/tests/pandas/native_df_interoperability/test_compiler_caster.py