[RFC] Introducing JupyterNotebook into OpenSearch Dashboards #9537

yyfamazon · 2025-03-13T09:06:17Z

BACKGROUND

OpenSearch currently supports several scripting languages, such as Painless, Mustache, and Expressions. While each has its merits, each also has a learning curve and may be unfamiliar to Python users. Python's ecosystem offers extensive data processing, machine learning, and analytical libraries. Leveraging that ecosystem would reduce adoption barriers and empower a broader segment of the community to write custom logic for tasks such as scoring documents, executing specialized aggregations, and customizing ingestion pipelines. To enhance OpenSearch's data querying and analytics experience, we'd like to provide a notebook-style way of working for data analytics in OpenSearch. Jupyter notebooks integrate code and markdown text into a single document, allowing users to analyze and interpret their results in one place.

BENEFITS

Python's syntax advantage
a. Widely used programming language with easy-to-understand syntax.
b. In-place calculation is efficient, especially when dealing with large data structures.
c. Linear regression for finding a relationship between data points and fitting a regression line, which helps predict the outcome of future events.
d. Built-in math functions, including an extensive math module, for mathematical tasks on numbers that a PPL/SQL query cannot perform.
Python's packages examples
a. Pandas: data manipulation, analysis, and aggregation.
b. NumPy: general-purpose array-processing package for scientific computing with Python.
c. Scikit-Learn: a wide range of machine learning models plus pre-processing, cross-validation, and visualization algorithms, all accessible through a simple interface.
d. Dask: scales up computations for handling large datasets.
PPL/SQL: run PPL/SQL in a Jupyter cell to fetch OpenSearch data, convert the data to pandas DataFrames, generate charts or tables, and, if necessary, import the result back into OpenSearch as a visualization.
Jupyter's ecosystem and visualization tools such as seaborn, matplotlib, pycharts, and others.

DEEP Integration

Merely embedding a Jupyter notebook does not provide much real value on its own. The two analytics frameworks should be integrated so that the whole is greater than the sum of its parts (1+1>2).

PPL and Pandas

We are considering supporting the execution of PPL inside notebook cells. To follow native Jupyter conventions, the format would be a line magic such as %ppl source=index | where field=value | stats count() by key. After the PPL query executes, the result is returned to the Python kernel and converted to a pandas DataFrame. The pandas DataFrame is a table-like data structure that already serves as a de facto standard in Python-based data analytics. It not only has a native tabular display in the notebook but is also compatible with diverse visualization libraries.
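The conversion step described here can be sketched as follows. This is only an illustration, not the proposed implementation: it assumes the query result arrives as JSON with a `schema` list of column descriptors and a `datarows` list of rows, and the helper names are hypothetical. pandas is imported lazily so the row-conversion helper also works where pandas is not installed.

```python
def ppl_response_to_records(response: dict) -> list:
    """Turn a PPL JSON response into one dict per row, keyed by column name."""
    columns = [col["name"] for col in response["schema"]]
    return [dict(zip(columns, row)) for row in response["datarows"]]


def ppl_response_to_dataframe(response: dict):
    """Build a pandas DataFrame from the same response (requires pandas)."""
    import pandas as pd  # imported lazily; only needed for this helper

    columns = [col["name"] for col in response["schema"]]
    return pd.DataFrame(response["datarows"], columns=columns)


# Hand-written sample in the assumed response shape, not real query output.
sample = {
    "schema": [
        {"name": "count()", "type": "integer"},
        {"name": "key", "type": "string"},
    ],
    "datarows": [[3, "a"], [5, "b"]],
}
records = ppl_response_to_records(sample)
```

In a notebook kernel, a helper like this could be called from a line magic registered through IPython, so that `%ppl ...` returns the DataFrame directly; that wiring is left out of the sketch.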

DEEPER Integration

With the growing popularity of the Python ecosystem, the pandas DataFrame is itself becoming another data analytics interface, or arguably a language. This suggests making a pandas DataFrame a front-end delegate for an OpenSearch index with a fixed mapping. Pandas operations could then be translated into PPL and executed inside the OpenSearch engine. PPL has an ambitious roadmap and plans to support advanced operations such as join, lookup, and subqueries, which could map onto pandas operators. Python also makes it possible to turn function calls into logical plan construction, so the translation should be straightforward to implement on top of the pandas library rather than by refactoring it.
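To make the idea of turning function calls into a logical plan concrete, here is a minimal sketch. `LazyIndexFrame` and `Col` are hypothetical names invented for this illustration and are not part of pandas or OpenSearch; the sketch only records operations and renders them as a PPL string, and it does not wrap the real pandas API.

```python
from dataclasses import dataclass


class Col:
    """Hypothetical column reference; comparing it yields a PPL condition string."""

    def __init__(self, name: str):
        self.name = name

    def __eq__(self, other):
        # repr() renders numbers as-is and strings in single quotes.
        return f"{self.name} = {other!r}"


@dataclass(frozen=True)
class LazyIndexFrame:
    """DataFrame-like handle that records operations instead of running them."""

    index: str
    steps: tuple = ()

    def _with(self, step: str) -> "LazyIndexFrame":
        # Each call returns a new plan, leaving the original untouched.
        return LazyIndexFrame(self.index, self.steps + (step,))

    def where(self, condition: str) -> "LazyIndexFrame":
        return self._with(f"where {condition}")

    def count_by(self, key: str) -> "LazyIndexFrame":
        return self._with(f"stats count() by {key}")

    def head(self, n: int = 10) -> "LazyIndexFrame":
        return self._with(f"head {n}")

    def to_ppl(self) -> str:
        """Render the recorded plan as a single PPL pipeline."""
        return " | ".join((f"source={self.index}",) + self.steps)


plan = LazyIndexFrame("logs").where(Col("status") == 500).count_by("host").head(5)
ppl = plan.to_ppl()
```

Nothing is sent to OpenSearch until the plan is rendered, which is what would let the engine, rather than the kernel, do the filtering and aggregation.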

APPROACHES

There are two approaches to Jupyter integration: a Python kernel in the browser and a Python kernel in the backend. Note that the two approaches do not conflict: Jupyter Notebook supports kernel selection in the UI, so users can switch to whichever Python kernel they want to use.

Approach 1 : Embedding JupyterLite (Python Kernel in Browsers) into OpenSearch

JupyterLite is a WebAssembly (Wasm)-based distribution of Jupyter that runs Python directly in the browser using Pyodide. It is backed by in-browser language kernels, so there is no need to start a Python Jupyter Server on the host machine or to provision an execution environment in the backend. Since the application is mostly a set of static files, it scales more easily and is easier to deploy.

PROs

No Backend Dependency – Python runs completely in the browser using WebAssembly, so no server or backend infrastructure is needed.
Easy Deployment – Since it's browser-based, it can be deployed automatically as a static app (just HTML/JS files), with no user action required.
Security – Code execution is sandboxed in the browser, reducing server-side security risks.
Offline Session Capability – Can work offline once the resources are cached locally.

CONs

Performance Limits – WebAssembly-based Python (Pyodide) is slower than native Python execution, especially for CPU-heavy or I/O-bound tasks. For simple analysis in the front end, this is unlikely to be a limiting factor.
Limited Package Support – Pyodide supports only a subset of Python packages: pure-Python packages generally work, but packages with C extensions or native libraries are usable only if they have been built for WebAssembly.
Memory and Resource Limits – Browsers have memory and execution limits, which can restrict complex computations.
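One practical consequence of running the kernel in the browser is that notebook code cannot open raw sockets and has to go through the browser's fetch API. The sketch below shows how a helper might pick the transport at runtime. It is an illustration under assumptions: `run_ppl` and `build_ppl_request` are hypothetical names, the `/_plugins/_ppl` path is assumed to be the PPL endpoint exposed to the notebook, and authentication and CORS are ignored.

```python
import json
import sys
import urllib.request


def in_browser_kernel() -> bool:
    """True under Pyodide, which reports the Emscripten platform."""
    return sys.platform == "emscripten"


def build_ppl_request(base_url: str, query: str) -> dict:
    """Describe the HTTP request for a PPL query (endpoint path is an assumption)."""
    return {
        "url": base_url.rstrip("/") + "/_plugins/_ppl",
        "method": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"query": query}),
    }


async def run_ppl(base_url: str, query: str) -> dict:
    """Send the query with whichever HTTP client the current kernel supports."""
    req = build_ppl_request(base_url, query)
    if in_browser_kernel():
        # In-browser kernel: delegate to the browser's fetch via Pyodide.
        from pyodide.http import pyfetch

        resp = await pyfetch(
            req["url"],
            method=req["method"],
            headers=req["headers"],
            body=req["body"],
        )
        return await resp.json()
    # Native kernel: a blocking stdlib call is enough for a sketch.
    http_req = urllib.request.Request(
        req["url"],
        data=req["body"].encode("utf-8"),
        headers=req["headers"],
        method=req["method"],
    )
    with urllib.request.urlopen(http_req) as resp:
        return json.load(resp)
```

Keeping the request description separate from the transport means the same notebook code could run unchanged under either kernel.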

Approach 2 : Integrating Python Kernels into OpenSearch

The idea behind integrating a Jupyter Python kernel into OpenSearch is to use Python as an interface layer for interacting with the OpenSearch system more efficiently. A Jupyter Python kernel is the backend process that executes Python code in a Jupyter environment. It is responsible for running the code, handling inputs and outputs, and communicating with the Jupyter frontend (such as Jupyter Notebook, JupyterLab, or JupyterLite).
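For readers unfamiliar with how a frontend talks to a kernel, the sketch below builds the kind of `execute_request` message a frontend sends when a cell is run. The field layout follows the Jupyter messaging protocol as I understand it; the function name, the default username, and the session value are placeholders, and how OpenSearch Dashboards would actually transport the message is not specified by this RFC.

```python
import uuid
from datetime import datetime, timezone


def make_execute_request(code: str, session: str, username: str = "dashboards") -> dict:
    """Build an execute_request message asking a kernel to run `code`."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,       # unique per message
            "session": session,               # identifies the client session
            "username": username,             # placeholder value
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.3",                 # messaging protocol version
        },
        "parent_header": {},                  # empty: not a reply to another message
        "metadata": {},
        "content": {
            "code": code,
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
            "stop_on_error": True,
        },
    }


msg = make_execute_request("print(1 + 1)", session="demo-session")
```

The kernel answers with its own messages (status updates, stream output, and an execute reply) whose `parent_header` echoes this message's header, which is how a frontend matches output to the cell that produced it.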

PROs

Full Python Support – You can use any Python package (including C extensions and compiled libraries).
Better Performance – Native Python execution is faster and more flexible than WebAssembly-based Python.
Flexible Resources – Backend servers typically have much more RAM, allowing more complex computations.

CONs

Complexity of Deployment – Although the OpenSearch backend could deploy the Python environment automatically, the environment itself may have requirements that need manual intervention, e.g. dependency installation.
Resource Usage – More load on the server since execution happens server-side.
Scalability Challenges – Increased demand on server resources may limit scalability.
Security Risks – Exposing Python execution on the server increases the attack surface (sandboxing and input validation are critical).

An early prototype DEMO

A simple demo: JupyterLite embedded into OpenSearch. (Note that the notebook cannot access OpenSearch data yet; this is for demonstration purposes only.)


yyfamazon · 2025-03-13T09:10:01Z

Also please see this if you are interested: opensearch-project/OpenSearch#17432

SuZhou-Joe assigned yyfamazon Mar 13, 2025

ruanyl added the RFC label Mar 13, 2025

opensearch-infra bot added this to OpenSearch Roadmap Mar 13, 2025

github-project-automation bot moved this to New in OpenSearch Roadmap Mar 13, 2025
