Commit 5c00dba

FEAT-#7459: Add methods to get and set backend. (#7460)
Add `get_backend()` to get the backend for a dataframe or series. Add `set_backend()`, and its alias `move_to()`, to set the backend of a dataframe or series.

To implement `set_backend()`, extend `FactoryDispatcher` so that it can dispatch I/O operations to the backend that the user chooses instead of always using `modin.config.Backend`. `set_backend()` can then use `FactoryDispatcher.from_pandas(backend=new_backend)` to get a query compiler with the given backend.

This commit also updates the documentation for "native" execution mode to reflect the updated guidance of using `Backend` to control execution. It also adds examples of using `get_backend()` and `set_backend()`.

Signed-off-by: sfc-gh-mvashishtha <[email protected]>
1 parent 14589cd commit 5c00dba

16 files changed: +480 -85 lines changed

.github/actions/run-core-tests/group_2/action.yml

+2-1
Original file line number | Diff line number | Diff line change
@@ -18,6 +18,7 @@ runs:
1818
modin/tests/pandas/dataframe/test_udf.py \
1919
modin/tests/pandas/dataframe/test_window.py \
2020
modin/tests/pandas/dataframe/test_pickle.py \
21-
modin/tests/pandas/test_repartition.py
21+
modin/tests/pandas/test_repartition.py \
22+
modin/tests/pandas/test_backend.py
2223
echo "::endgroup::"
2324
shell: bash -l {0}

docs/usage_guide/advanced_usage/index.rst

+3
@@ -39,6 +39,9 @@ Additional APIs
3939
Modin also supports these additional APIs on top of pandas to improve user experience.
4040

4141
- :py:meth:`~modin.pandas.DataFrame.modin.to_pandas` -- convert a Modin DataFrame/Series to a pandas DataFrame/Series.
42+
- :py:meth:`~modin.pandas.DataFrame.get_backend` -- Get the ``Backend`` :doc:`configuration variable </flow/modin/config>` of this ``DataFrame``.
43+
- :py:meth:`~modin.pandas.DataFrame.move_to` -- Move data and execution for this ``DataFrame`` to the given ``Backend`` :doc:`configuration variable </flow/modin/config>`. This method is an alias for ``DataFrame.set_backend``.
44+
- :py:meth:`~modin.pandas.DataFrame.set_backend` -- Move data and execution for this ``DataFrame`` to the given ``Backend`` :doc:`configuration variable </flow/modin/config>`. This method is an alias for ``DataFrame.move_to``.
4245
- :py:func:`~modin.pandas.io.from_pandas` -- convert a pandas DataFrame to a Modin DataFrame.
4346
- :py:meth:`~modin.pandas.DataFrame.modin.to_ray` -- convert a Modin DataFrame/Series to a Ray Dataset.
4447
- :py:func:`~modin.pandas.io.from_ray` -- convert a Ray Dataset to a Modin DataFrame.

docs/usage_guide/optimization_notes/index.rst

+44-21
@@ -314,7 +314,7 @@ Copy-pastable example, showing how mixing pandas and Modin DataFrames in a singl
314314
# Possible output: TypeError
315315
316316
317-
Using pandas to execute queries in Modin's ``"native"`` execution mode
317+
Using pandas to execute queries with Modin's ``"Pandas"`` backend
318318
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
319319

320320
By default, Modin distributes the data in a dataframe (or series) and attempts
@@ -323,39 +323,62 @@ to process data for different partitions in parallel.
323323
However, for certain scenarios, such as handling small datasets, Modin's
324324
parallel execution may introduce unnecessary overhead. In such cases, it's more
325325
efficient to use serial execution with a single, unpartitioned pandas dataframe.
326-
You can enable this kind of "native" execution by setting Modin's
327-
``StorageFormat`` and ``Engine``
328-
:doc:`configuration variables </flow/modin/config>` to ``"Native"``.
326+
You can enable this kind of local pandas execution by setting Modin's
327+
``Backend``
328+
:doc:`configuration variable </flow/modin/config>` to ``"Pandas"``.
329329

330-
DataFrames created while Modin's global execution mode is set to ``"Native"``
331-
will continue to use native execution even if you switch the execution mode
330+
DataFrames created while Modin's global backend is set to ``"Pandas"``
331+
will continue to use the pandas backend even if you switch the global backend
332332
later. Modin supports interoperability between distributed Modin DataFrames
333-
and those using native execution.
333+
and those using the pandas backend.
334334

335-
Here is an example of using native execution:
335+
Here is an example of using the pandas backend.
336336

337337
.. code-block:: python
338338
339339
import modin.pandas as pd
340340
from modin import set_execution
341-
from modin.config import StorageFormat, Engine
341+
from modin.config import Backend
342342
343343
# This dataframe will use Modin's default, distributed execution.
344-
df_distributed_1 = pd.DataFrame([0])
345-
assert df_distributed_1._query_compiler.engine != "Native"
344+
original_backend = Backend.get()
345+
assert original_backend != "Pandas"
346+
distributed_df_1 = pd.DataFrame([0])
346347
347-
# Set execution to "Native" for native execution.
348-
original_engine, original_storage_format = set_execution(
349-
engine="Native",
350-
storage_format="Native"
351-
)
352-
native_df = pd.DataFrame([1])
353-
assert native_df._query_compiler.engine == "Native"
348+
# Set backend to "Pandas" for local pandas execution.
349+
Backend.put("Pandas")
350+
modin_on_pandas_df = pd.DataFrame([1])
351+
assert modin_on_pandas_df.get_backend() == "Pandas"
354352
355353
# Revert to default settings for distributed execution
356-
set_execution(engine=original_engine, storage_format=original_storage_format)
357-
df_distributed_2 = pd.DataFrame([2])
358-
assert df_distributed_2._query_compiler.engine == original_engine
354+
Backend.put(original_backend)
355+
distributed_df_2 = pd.DataFrame([2])
356+
assert distributed_df_2.get_backend() == original_backend
357+
358+
You can also use the pandas backend for some dataframes while using different
359+
backends for other dataframes. You can switch the backend of an individual
360+
dataframe or series with ``set_backend()`` or its synonym ``move_to()``.
361+
Here's an example of switching the backend for an individual dataframe.
362+
363+
.. code-block:: python
364+
365+
import modin.pandas as pd
from modin.config import Backend
366+
367+
# This dataframe will use Modin's default, distributed execution.
368+
original_backend = Backend.get()
369+
assert original_backend != "Pandas"
370+
distributed_df_1 = pd.DataFrame([0])
371+
372+
pandas_df_1 = distributed_df_1.move_to("Pandas")
373+
assert pandas_df_1.get_backend() == "Pandas"
374+
pandas_df_1 = pandas_df_1.sort_values(0)
375+
assert pandas_df_1.get_backend() == "Pandas"
376+
377+
new_df = pandas_df_1.move_to(original_backend)
378+
assert new_df.get_backend() == original_backend
379+
380+
new_df.set_backend("Pandas", inplace=True)
381+
assert new_df.get_backend() == "Pandas"
359382
360383
Operation-specific optimizations
361384
""""""""""""""""""""""""""""""""

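The first documentation example above saves the global backend, switches it, and then restores it by hand. That pattern can be wrapped in a context manager. The sketch below is only an illustration: `temporary_backend` is a hypothetical helper that is not part of this commit, and `FakeBackendParameter` is a stand-in with the same `get()`/`put()` shape as `modin.config.Backend`, so that the snippet runs without Modin installed.

```python
from contextlib import contextmanager


class FakeBackendParameter:
    """Stand-in for modin.config.Backend (assumption: only get/put are needed)."""

    _value = "Ray"

    @classmethod
    def get(cls):
        return cls._value

    @classmethod
    def put(cls, value):
        cls._value = value


@contextmanager
def temporary_backend(parameter, backend):
    """Hypothetical helper: switch the global backend inside a with-block."""
    original_backend = parameter.get()
    parameter.put(backend)
    try:
        yield original_backend
    finally:
        # Restore even if the body raises, mirroring the manual revert step.
        parameter.put(original_backend)


with temporary_backend(FakeBackendParameter, "Pandas") as previous:
    inside = FakeBackendParameter.get()
after = FakeBackendParameter.get()
```

With Modin installed, passing `modin.config.Backend` in place of the stand-in should behave the same way, since the helper relies only on `get()` and `put()`.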
environment-dev.yml

+2
@@ -12,6 +12,8 @@ dependencies:
1212
- psutil>=5.8.0
1313

1414
# optional dependencies
15+
# NOTE Keep the ray and dask dependencies in sync with the Linux and Windows
16+
# Unidist environment dependencies.
1517
# ray==2.5.0 broken: https://github.com/conda-forge/ray-packages-feedstock/issues/100
1618
- ray-core>=2.1.0,!=2.5.0
1719
- pyarrow>=10.0.1

modin/config/envvars.py

+15-8
@@ -429,12 +429,7 @@ def put(cls, value: str) -> None:
429429
value : str
430430
Backend value to set.
431431
"""
432-
value = cls.normalize(value)
433-
if value not in cls.choices:
434-
raise ValueError(
435-
f"Unknown backend '{value}'. Please register the backend with Backend.register_backend()"
436-
)
437-
execution = cls._BACKEND_TO_EXECUTION[value]
432+
execution = cls.get_execution_for_backend(value)
438433
set_execution(execution.engine, execution.storage_format)
439434

440435
@classmethod
@@ -532,12 +527,24 @@ def get_execution_for_backend(cls, backend: str) -> Execution:
532527
execution : Execution
533528
The execution for the given backend
534529
"""
535-
if backend not in cls._BACKEND_TO_EXECUTION:
530+
if not isinstance(backend, str):
531+
raise TypeError(
532+
"Backend value should be a string, but instead it is "
533+
+ f"{repr(backend)} of type {type(backend)}."
534+
)
535+
normalized_value = cls.normalize(backend)
536+
if normalized_value not in cls.choices:
537+
backend_choice_string = ", ".join(f"'{choice}'" for choice in cls.choices)
538+
raise ValueError(
539+
f"Unknown backend '{backend}'. Available backends are: "
540+
+ backend_choice_string
541+
)
542+
if normalized_value not in cls._BACKEND_TO_EXECUTION:
536543
raise ValueError(
537544
f"Backend '{backend}' has no known execution. Please "
538545
+ "register an execution for it with Backend.register_backend()."
539546
)
540-
return cls._BACKEND_TO_EXECUTION[backend]
547+
return cls._BACKEND_TO_EXECUTION[normalized_value]
541548

542549
@classmethod
543550
def get(cls) -> str:
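The reworked `get_execution_for_backend` checks its argument in three stages: type, membership in `choices` after normalization, and presence of a registered execution. The following self-contained sketch mimics that order with a toy class. `ToyBackend`, its `choices`, its registry contents, and the case-insensitive `normalize` are all assumptions made for illustration and may not match Modin's real values.

```python
from typing import NamedTuple


class Execution(NamedTuple):
    storage_format: str
    engine: str


class ToyBackend:
    # Illustrative values only; Modin's real choices and registry may differ.
    choices = ("Ray", "Dask", "Pandas")
    _BACKEND_TO_EXECUTION = {
        "Ray": Execution(storage_format="Pandas", engine="Ray"),
        "Pandas": Execution(storage_format="Native", engine="Native"),
    }

    @classmethod
    def normalize(cls, value: str) -> str:
        # Assumption: normalization is a case-insensitive match against choices.
        for choice in cls.choices:
            if choice.lower() == value.lower():
                return choice
        return value

    @classmethod
    def get_execution_for_backend(cls, backend) -> Execution:
        # Stage 1: reject non-string input with a TypeError.
        if not isinstance(backend, str):
            raise TypeError(
                f"Backend value should be a string, but instead it is "
                f"{backend!r} of type {type(backend)}."
            )
        # Stage 2: reject names that are not known choices.
        normalized = cls.normalize(backend)
        if normalized not in cls.choices:
            available = ", ".join(f"'{choice}'" for choice in cls.choices)
            raise ValueError(
                f"Unknown backend '{backend}'. Available backends are: {available}"
            )
        # Stage 3: reject known choices that have no registered execution.
        if normalized not in cls._BACKEND_TO_EXECUTION:
            raise ValueError(f"Backend '{backend}' has no known execution.")
        return cls._BACKEND_TO_EXECUTION[normalized]


ray_execution = ToyBackend.get_execution_for_backend("ray")
```

Because the lookup uses the normalized name, a differently cased input such as `"ray"` resolves to the same execution as `"Ray"` in this sketch.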

modin/core/execution/dispatching/factories/dispatcher.py

+70-26
@@ -17,9 +17,14 @@
1717
Dispatcher routes the work to execution-specific functions.
1818
"""
1919

20-
from modin.config import Engine, IsExperimental, StorageFormat
20+
from typing import Union
21+
22+
from pandas._libs.lib import NoDefault, no_default
23+
24+
from modin.config import Backend, Engine, IsExperimental, StorageFormat
2125
from modin.core.execution.dispatching.factories import factories
22-
from modin.utils import _inherit_docstrings, get_current_execution
26+
from modin.core.storage_formats.base import BaseQueryCompiler
27+
from modin.utils import _inherit_docstrings
2328

2429

2530
class FactoryNotFoundError(AttributeError):
@@ -110,28 +115,39 @@ class FactoryDispatcher(object):
110115
def get_factory(cls) -> factories.BaseFactory:
111116
"""Get current factory."""
112117
if cls.__factory is None:
113-
from modin.pandas import _update_engine
114118

115-
Engine.subscribe(_update_engine)
116-
Engine.subscribe(cls._update_factory)
117-
StorageFormat.subscribe(cls._update_factory)
118-
return cls.__factory
119+
from modin.pandas import _initialize_engine
120+
121+
Engine.subscribe(
122+
lambda engine_parameter: _initialize_engine(engine_parameter.get())
123+
)
124+
Backend.subscribe(cls._update_factory)
125+
return_value = cls.__factory
126+
return return_value
119127

120128
@classmethod
121-
def _update_factory(cls, *args):
129+
def _get_prepared_factory_for_backend(cls, backend) -> factories.BaseFactory:
122130
"""
123-
Update and prepare factory with a new one specified via Modin config.
131+
Get factory for the specified backend.
124132
125133
Parameters
126134
----------
127-
*args : iterable
128-
This parameters serves the compatibility purpose.
129-
Does not affect the result.
135+
backend : str
136+
Backend name.
137+
138+
Returns
139+
-------
140+
factories.BaseFactory
141+
Factory for the specified backend.
130142
"""
131-
factory_name = get_current_execution() + "Factory"
143+
execution = Backend.get_execution_for_backend(backend)
144+
from modin.pandas import _initialize_engine
145+
146+
_initialize_engine(execution.engine)
147+
factory_name = f"{execution.storage_format}On{execution.engine}Factory"
132148
experimental_factory_name = "Experimental" + factory_name
133149
try:
134-
cls.__factory = getattr(factories, factory_name, None) or getattr(
150+
factory = getattr(factories, factory_name, None) or getattr(
135151
factories, experimental_factory_name
136152
)
137153
except AttributeError:
@@ -145,26 +161,54 @@ def _update_factory(cls, *args):
145161
raise FactoryNotFoundError(
146162
msg.format(factory_name, experimental_factory_name)
147163
)
148-
cls.__factory = StubFactory.set_failing_name(factory_name)
164+
factory = StubFactory.set_failing_name(factory_name)
149165
else:
150166
try:
151-
cls.__factory.prepare()
167+
factory.prepare()
152168
except ModuleNotFoundError as err:
153-
# incorrectly initialized, should be reset to None again
154-
# so that an unobvious error does not appear in the following code:
155-
# "AttributeError: 'NoneType' object has no attribute 'from_non_pandas'"
156-
cls.__factory = None
157169
raise ModuleNotFoundError(
158170
f"Make sure all required packages are installed: {str(err)}"
159171
) from err
160-
except BaseException:
161-
cls.__factory = None
162-
raise
172+
return factory
163173

164174
@classmethod
165-
@_inherit_docstrings(factories.BaseFactory._from_pandas)
166-
def from_pandas(cls, df):
167-
return cls.get_factory()._from_pandas(df)
175+
def _update_factory(cls, *args):
176+
"""
177+
Update and prepare factory with a new one specified via Modin config.
178+
179+
Parameters
180+
----------
181+
*args : iterable
182+
This parameter exists for compatibility purposes.
183+
Does not affect the result.
184+
"""
185+
cls.__factory = cls._get_prepared_factory_for_backend(Backend.get())
186+
187+
@classmethod
188+
def from_pandas(
189+
cls, df, backend: Union[str, NoDefault] = no_default
190+
) -> BaseQueryCompiler:
191+
"""
192+
Create a Modin query compiler from a pandas DataFrame.
193+
194+
Parameters
195+
----------
196+
df : pandas.DataFrame
197+
The pandas DataFrame to convert.
198+
backend : str or NoDefault, default: NoDefault
199+
The backend to use for the resulting query compiler. If NoDefault,
200+
use the current global default ``Backend`` from the Modin config.
201+
202+
Returns
203+
-------
204+
BaseQueryCompiler
205+
A Modin query compiler that wraps the input pandas DataFrame.
206+
"""
207+
return (
208+
cls.get_factory()
209+
if backend is no_default
210+
else cls._get_prepared_factory_for_backend(backend)
211+
)._from_pandas(df)
168212

169213
@classmethod
170214
@_inherit_docstrings(factories.BaseFactory._from_arrow)
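The dispatcher change separates two paths: a cached factory for the global backend, and a one-off factory lookup when the caller passes an explicit `backend`. The sketch below models that split with toy objects. `ToyDispatcher`, `REGISTRY`, `EXECUTIONS`, `make_factory`, and the `"Lab"` backend are hypothetical names, and `_NO_DEFAULT` stands in for pandas' `no_default` sentinel; none of this is Modin's actual code.

```python
from types import SimpleNamespace

_NO_DEFAULT = object()  # stand-in for pandas' no_default sentinel


def make_factory(label):
    """Build a toy factory whose from_pandas tags the data with its label."""

    def from_pandas(df):
        return (label, df)

    return SimpleNamespace(from_pandas=from_pandas)


# Hypothetical registry of factories, looked up by attribute name.
REGISTRY = SimpleNamespace(
    PandasOnRayFactory=make_factory("PandasOnRay"),
    NativeOnNativeFactory=make_factory("NativeOnNative"),
    ExperimentalFooOnBarFactory=make_factory("ExperimentalFooOnBar"),
)

# Hypothetical backend -> (storage_format, engine) mapping.
EXECUTIONS = {
    "Ray": ("Pandas", "Ray"),
    "Pandas": ("Native", "Native"),
    "Lab": ("Foo", "Bar"),
}


class ToyDispatcher:
    _global_backend = "Ray"
    _cached_factory = None

    @classmethod
    def _factory_for_backend(cls, backend):
        storage_format, engine = EXECUTIONS[backend]
        factory_name = f"{storage_format}On{engine}Factory"
        # Fall back to the "Experimental"-prefixed name, as the diff does.
        factory = getattr(REGISTRY, factory_name, None) or getattr(
            REGISTRY, "Experimental" + factory_name, None
        )
        if factory is None:
            raise LookupError(f"No factory named {factory_name}")
        return factory

    @classmethod
    def get_factory(cls):
        # The global factory is resolved once and then cached.
        if cls._cached_factory is None:
            cls._cached_factory = cls._factory_for_backend(cls._global_backend)
        return cls._cached_factory

    @classmethod
    def from_pandas(cls, df, backend=_NO_DEFAULT):
        factory = (
            cls.get_factory()
            if backend is _NO_DEFAULT
            else cls._factory_for_backend(backend)
        )
        return factory.from_pandas(df)


default_result = ToyDispatcher.from_pandas([1])
explicit_result = ToyDispatcher.from_pandas([1], backend="Pandas")
experimental_result = ToyDispatcher.from_pandas([1], backend="Lab")
```

In this sketch an explicit `backend` never touches the cached global factory, which is the property that lets a per-dataframe backend switch avoid mutating global configuration.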

modin/pandas/__init__.py

+11-11
@@ -101,10 +101,10 @@
101101

102102
from modin.config import Parameter
103103

104-
_is_first_update = {}
104+
_engine_initialized = {}
105105

106106

107-
def _update_engine(publisher: Parameter):
107+
def _initialize_engine(engine_string: str):
108108
from modin.config import (
109109
CpuCount,
110110
Engine,
@@ -116,25 +116,25 @@ def _update_engine(publisher: Parameter):
116116
# Set this so that Pandas doesn't try to multithread by itself
117117
os.environ["OMP_NUM_THREADS"] = "1"
118118

119-
if publisher.get() == "Ray":
120-
if _is_first_update.get("Ray", True):
119+
if engine_string == "Ray":
120+
if not _engine_initialized.get("Ray", False):
121121
from modin.core.execution.ray.common import initialize_ray
122122

123123
initialize_ray()
124-
elif publisher.get() == "Dask":
125-
if _is_first_update.get("Dask", True):
124+
elif engine_string == "Dask":
125+
if not _engine_initialized.get("Dask", False):
126126
from modin.core.execution.dask.common import initialize_dask
127127

128128
initialize_dask()
129-
elif publisher.get() == "Unidist":
130-
if _is_first_update.get("Unidist", True):
129+
elif engine_string == "Unidist":
130+
if not _engine_initialized.get("Unidist", False):
131131
from modin.core.execution.unidist.common import initialize_unidist
132132

133133
initialize_unidist()
134-
elif publisher.get() not in Engine.NOINIT_ENGINES:
135-
raise ImportError("Unrecognized execution engine: {}.".format(publisher.get()))
134+
elif engine_string not in Engine.NOINIT_ENGINES:
135+
raise ImportError("Unrecognized execution engine: {}.".format(engine_string))
136136

137-
_is_first_update[publisher.get()] = False
137+
_engine_initialized[engine_string] = True
138138

139139

140140
from modin.pandas import arrays, errors

modin/pandas/base.py

+31-1
@@ -67,17 +67,19 @@
6767
)
6868
from pandas.core.indexes.api import ensure_index
6969
from pandas.core.methods.describe import _refine_percentiles
70+
from pandas.util._decorators import doc
7071
from pandas.util._validators import (
7172
validate_ascending,
7273
validate_bool_kwarg,
7374
validate_percentile,
7475
)
7576

7677
from modin import pandas as pd
78+
from modin.config import Backend, Execution
7779
from modin.error_message import ErrorMessage
7880
from modin.logging import ClassLogger, disable_logging
7981
from modin.pandas.accessor import CachedAccessor, ModinAPI
80-
from modin.pandas.utils import is_scalar
82+
from modin.pandas.utils import GET_BACKEND_DOC, SET_BACKEND_DOC, is_scalar
8183
from modin.utils import _inherit_docstrings, expanduser_path_arg, try_cast_to_pandas
8284

8385
from .utils import _doc_binary_op, is_full_grab_slice
@@ -4388,3 +4390,31 @@ def __array_ufunc__(
43884390

43894391
# namespace for additional Modin functions that are not available in Pandas
43904392
modin: ModinAPI = CachedAccessor("modin", ModinAPI)
4393+
4394+
@doc(SET_BACKEND_DOC, class_name=__qualname__)
4395+
def set_backend(self, backend: str, inplace: bool = False) -> Optional[Self]:
4396+
# TODO(https://github.com/modin-project/modin/issues/7467): refactor
4397+
# to avoid this cyclic import in most places we do I/O, e.g. in
4398+
# modin/pandas/io.py
4399+
from modin.core.execution.dispatching.factories.dispatcher import (
4400+
FactoryDispatcher,
4401+
)
4402+
4403+
pandas_self = self._query_compiler.to_pandas()
4404+
query_compiler = FactoryDispatcher.from_pandas(df=pandas_self, backend=backend)
4405+
if inplace:
4406+
self._update_inplace(query_compiler)
4407+
return None
4408+
else:
4409+
return self.__constructor__(query_compiler=query_compiler)
4410+
4411+
move_to = set_backend
4412+
4413+
@doc(GET_BACKEND_DOC, class_name=__qualname__)
4414+
def get_backend(self) -> str:
4415+
return Backend.get_backend_for_execution(
4416+
Execution(
4417+
engine=self._query_compiler.engine,
4418+
storage_format=self._query_compiler.storage_format,
4419+
)
4420+
)
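As written in the diff, `set_backend` first materializes the data with `to_pandas()` and then builds a new query compiler for the target backend, either updating the object in place or returning a new one. The toy class below sketches only those return semantics. `ToyFrame` is hypothetical, and copying a Python list stands in for the pandas round trip.

```python
class ToyFrame:
    """Hypothetical stand-in for a Modin DataFrame or Series."""

    def __init__(self, rows, backend):
        self._rows = rows
        self._backend = backend

    def get_backend(self):
        return self._backend

    def set_backend(self, backend, inplace=False):
        # Stand-in for to_pandas(): materialize a full local copy first.
        materialized = list(self._rows)
        if inplace:
            self._rows, self._backend = materialized, backend
            return None
        return ToyFrame(materialized, backend)

    # The alias is a second name bound to the same function object.
    move_to = set_backend


frame = ToyFrame([3, 1, 2], backend="Ray")
moved = frame.move_to("Pandas")
backend_after_copy = frame.get_backend()
in_place_result = frame.set_backend("Dask", inplace=True)
```

Since the real method goes through a full pandas materialization, switching backends on a large distributed dataframe is likely to cost about as much as calling `to_pandas()` on it.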
