[RFC] Introducing JupyterNotebook into OpenSearch Dashboards #9537

Open
yyfamazon opened this issue Mar 13, 2025 · 1 comment

@yyfamazon

BACKGROUND

OpenSearch currently supports several scripting languages, such as Painless, Mustache, and Expressions. While each has its merits, each also comes with a learning curve and may be unfamiliar to Python users. Python’s ecosystem offers extensive data processing, machine learning, and analytical libraries. Leveraging that ecosystem would lower adoption barriers and empower a broader segment of the community to write custom logic for tasks such as scoring documents, executing specialized aggregations, and customizing ingestion pipelines. To enhance OpenSearch's data querying and analytics experience, we’d like to provide a notebook-style way of working for data analytics in OpenSearch. Jupyter notebooks integrate code and markdown text into a single document, allowing users to analyze and interpret their results all in one place.

BENEFITS

  • Python’s syntax advantage
    a. Widely used programming language with easy-to-understand syntax.
    b. In-place calculation is efficient, especially when dealing with large data structures.
    c. Linear regression for finding a relationship between data points and fitting a regression line, which helps predict the outcome of future events.
    d. Built-in math functions, including an extensive math module, for performing mathematical tasks on numbers beyond what a PPL/SQL query can express.
  • Python’s package examples
    a. Pandas – data manipulation, analysis, and aggregation.
    b. NumPy – general-purpose array-processing package for scientific computing with Python.
    c. Scikit-Learn – a wide range of machine learning models plus pre-processing, cross-validation, and visualization algorithms, all accessible through a simple interface.
    d. Dask – scales up computations for handling large datasets.
  • PPL/SQL – Run PPL/SQL in a Jupyter cell to fetch OpenSearch data, translate the data into pandas DataFrames, and generate charts/tables; then, if necessary, import the result back into OpenSearch as a visualization.
  • Jupyter’s ecosystem and visualization tools such as seaborn, matplotlib, pycharts, and others.


DEEP Integration

Merely embedding a Jupyter notebook provides little real value on its own. The two analytics frameworks should be integrated with each other so that the combination achieves 1+1>2.

PPL and Pandas

We are considering supporting the execution of PPL inside notebook cells. To follow the native Jupyter convention, the format should look like %ppl source=index | where field=value | stats count() by key . After the PPL query executes, the result is returned to the Python kernel and transformed into a pandas DataFrame. The pandas DataFrame is a table-like data structure that already serves as a standard in Python-based data analytics. It not only has a native display interface (table) in the notebook, but is also compatible with diverse visualization libraries.
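The flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: it assumes the SQL plugin's PPL endpoint at `/_plugins/_ppl` on a local, unauthenticated cluster, and the names `PPL_ENDPOINT`, `run_ppl`, `split_ppl_response`, and `register_ppl_magic` are hypothetical. IPython and pandas are imported lazily, so the parsing helper works without them.

```python
import json
import urllib.request

# Assumption: a local cluster without authentication; adjust for your setup.
PPL_ENDPOINT = "http://localhost:9200/_plugins/_ppl"


def run_ppl(query, endpoint=PPL_ENDPOINT):
    """POST a PPL query and return the decoded JSON response."""
    body = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def split_ppl_response(response):
    """Split a {"schema": [...], "datarows": [...]} response into columns and rows."""
    columns = [column["name"] for column in response["schema"]]
    rows = [list(row) for row in response["datarows"]]
    return columns, rows


def register_ppl_magic():
    """Register a hypothetical %ppl line magic in the running IPython kernel."""
    # Imported lazily so the helpers above work without IPython or pandas.
    import pandas as pd
    from IPython.core.magic import register_line_magic

    @register_line_magic
    def ppl(line):
        columns, rows = split_ppl_response(run_ppl(line))
        return pd.DataFrame(rows, columns=columns)

    return ppl
```

After calling `register_ppl_magic()` in a notebook cell, a line such as `%ppl source=index | head 5` would return a DataFrame that the notebook renders as a table.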

DEEPER Integration

With the increasing popularity of the Python ecosystem, the pandas DataFrame itself is becoming another data analytics interface, or, say, a language. That inspires us to make a pandas DataFrame a front-end delegate of an OpenSearch index with a fixed mapping. All pandas operations could then be interpreted into PPL and executed inside the OpenSearch engine. PPL now has an ambitious roadmap and plans to support advanced operations such as join/lookup/subqueries, which potentially map onto pandas operators. Also, Python makes it possible to abstract function calls into logical plan building, so it should be straightforward to implement the interpretation on top of the pandas library instead of refactoring it.
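To make the idea of turning function calls into plan building concrete, here is a minimal sketch. It does not wrap pandas itself; `LazyIndexFrame` is a hypothetical class that only records a few DataFrame-style calls and renders them as a PPL pipeline, using the existing `where`, `fields`, and `head` commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyIndexFrame:
    """Hypothetical DataFrame-like handle that records operations as PPL stages."""

    index: str
    stages: tuple = ()

    def _with(self, stage):
        # Each call returns a new frame, so earlier plans stay reusable.
        return LazyIndexFrame(self.index, self.stages + (stage,))

    def where(self, condition):
        return self._with(f"where {condition}")

    def __getitem__(self, columns):
        # frame[["a", "b"]] mirrors pandas column selection.
        return self._with("fields " + ", ".join(columns))

    def head(self, n=5):
        return self._with(f"head {n}")

    def to_ppl(self):
        # Nothing runs until the plan is rendered and sent to the engine.
        return " | ".join((f"source={self.index}",) + self.stages)
```

For example, `LazyIndexFrame("logs").where("status=500")[["host", "status"]].head(10).to_ppl()` yields `source=logs | where status=500 | fields host, status | head 10`. A real integration would need to cover far more of the pandas surface and decide when to fall back to local execution.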

APPROACHES

There are two approaches that can lead us to Jupyter integration: a Python kernel in the browser and a Python kernel in the backend. Note that the two approaches do not conflict: Jupyter Notebook supports kernel selection in the UI, so users can switch to whichever Python kernel they want to use.

Approach 1: Embedding JupyterLite (Python Kernel in the Browser) into OpenSearch

JupyterLite is a WebAssembly (Wasm)-based distribution of Jupyter that runs Python directly in the browser using Pyodide. It is backed by in-browser language kernels, so there is no need to start a Python Jupyter server on the host machine or to provision an execution environment in the backend. Since the application is mostly a set of static files, it scales more easily and is also easier to deploy.


PROs

  • No Backend Dependency – Python runs completely in the browser using WebAssembly, so no server or backend infrastructure is needed.
  • Easy Deployment – Since it’s browser-based, it can be deployed automatically as a static app (just HTML/JS files). No user action is required.
  • Security – Code execution is sandboxed in the browser, reducing server-side security risks.
  • Offline Session Capability – Can work offline once the resources are cached locally.

CONs

  • Performance Limits – WebAssembly-based Python (Pyodide) is slower than native Python execution, especially for CPU-heavy or I/O-bound tasks. If we only expect simple analysis in the front end, this won’t be a limiting factor.
  • Limited Package Support – Pyodide supports only a subset of Python packages (packages with C extensions must be built for WebAssembly, and support for native libraries is limited).
  • Memory and Resource Limits – Browsers have memory and execution limits, which can restrict complex computations.
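Because users may switch between a browser kernel and a backend kernel, any shared notebook helper would have to work in both. One way to tell them apart, sketched here with hypothetical function names, relies on the fact that Pyodide builds of CPython report `emscripten` as `sys.platform`.

```python
import sys


def in_browser_kernel(platform=None):
    """Return True when running under a Pyodide (WebAssembly) kernel."""
    # Pyodide builds of CPython report "emscripten" as sys.platform.
    return (platform or sys.platform) == "emscripten"


def pick_transport(platform=None):
    """Name the HTTP client a notebook helper could use in each kernel."""
    if in_browser_kernel(platform):
        # pyodide.http.pyfetch wraps the browser's fetch API and is async.
        return "pyodide.http.pyfetch"
    # A backend kernel can use any ordinary client, e.g. the stdlib one.
    return "urllib.request"
```

The `platform` argument exists only to make the branch easy to exercise; in a notebook the helpers would be called without arguments.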

Approach 2: Integrating Python Kernels into OpenSearch

The principle of integrating a Jupyter Python kernel into OpenSearch is mainly to use Python as an interface layer for interacting more efficiently with the OpenSearch system. A Jupyter Python kernel is the backend process that executes Python code in a Jupyter environment. It is responsible for running the code, handling inputs and outputs, and communicating with the Jupyter frontend (such as Jupyter Notebook, JupyterLab, or JupyterLite).
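To illustrate the frontend-to-kernel communication mentioned above, the sketch below builds the shape of an `execute_request` message from the Jupyter messaging protocol as a plain dict. The helper name `make_execute_request` and the default username are made up for illustration. On the wire, such messages are serialized, signed, and sent over ZeroMQ sockets, which a real integration would most likely delegate to an existing client library rather than hand-roll.

```python
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="notebook-user"):
    """Build an execute_request message as a plain dict."""
    header = {
        "msg_id": uuid.uuid4().hex,
        "session": session_id,
        "username": username,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",
    }
    content = {
        "code": code,
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
        "stop_on_error": True,
    }
    return {
        "header": header,
        "parent_header": {},
        "metadata": {},
        "content": content,
    }
```

The kernel answers with messages whose `parent_header` echoes this header, which is how a frontend matches outputs to the cell that produced them.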

PROs

  • Full Python Support – You can use any Python package (including C extensions and compiled libraries).
  • Better Performance – Native Python execution is faster and more flexible than WebAssembly-based Python.
  • Flexible Resources – Backend servers have much more RAM available for complex computation.

CONs

  • Complexity of Deployment – Although we can let the OS backend deploy the Python environment automatically, the environment itself may have requirements that need manual help, e.g. dependency installation.
  • Resource Usage – More load on the server since execution happens server-side.
  • Scalability Challenges – Increased demand on server resources may limit scalability.
  • Security Risks – Exposing Python execution on the server increases the attack surface (sandboxing and input validation are critical).

A very early prototype DEMO

A simple demo of embedding JupyterLite into OpenSearch (note that the notebook can’t access OpenSearch data yet; this is for demo purposes only):

[Image: demo]

@yyfamazon

Also please see this if you are interested: opensearch-project/OpenSearch#17432

@ruanyl ruanyl added the RFC Substantial changes or new features that require community input to garner consensus. label Mar 13, 2025