
Common or shared variables for different notebooks in a pipeline #3179

Open
arundeep78 opened this issue Aug 22, 2023 · 9 comments

@arundeep78

This is probably more of a "how-to" question than an issue, but I am not sure where else it belongs.

I have Elyra installed in Kubernetes with Enterprise Gateway, configured so that every interactive notebook execution starts a new pod in Kubernetes.

I want to know how I can share common variables across multiple notebooks in Elyra, both during development and when the notebooks are scheduled as a pipeline. I am talking about Python kernels here.

I will take this standard pipeline from the Elyra website as an example:
[image: example pipeline from the Elyra website]

In this case, let's assume there is a database table named "t_save_data_here" that is used in all the notebooks and scripts.

Currently, I am declaring a variable in every notebook to refer to it,
e.g. table_name = "t_save_data_here".

But how can I declare it just once, e.g. in a parameters.py that can be imported in the first line of every notebook so that all the shared variables are defined? That way, if the table name changes, I only need to change it in one place.
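
For example, something like this (the file and variable names are just placeholders for what I have in mind):

```python
# parameters.py - hypothetical shared configuration module
table_name = "t_save_data_here"
```

and then, as the first cell of every notebook:

```python
# import the shared definitions instead of re-declaring them in each notebook
from parameters import table_name

print(table_name)  # "t_save_data_here"
```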

I read through the documentation and some threads, but they mostly talk about pipeline parameters, environment variables, or exporting "output" from one stage to the next.
These are very specific to the pipeline configuration and won't be available while I am developing a given notebook or script...

Is there documentation that I could not find?
Or is this just the wrong way of approaching the problem? Is there a better way?

@lresende
Member

There are individual node properties and pipeline "global" properties... you can find them as different tabs in the properties panel.

@arundeep78
Author

@lresende thanks for the response. I believe I was not clear in my message when I mentioned pipeline parameters.

Yes, these parameters are available, but only when the notebooks are scheduled as a pipeline. That is the final stage, after someone has already developed all those notebooks. How would the notebooks get those variables when they are not running as a pipeline?

In the example above, let's say I am developing the Part 1 notebook interactively. In my case a new Kubernetes pod is started for that notebook, and it does not have access to any of the other files visible in the file explorer.

In a normal Python/Jupyter environment I can create a simple .py file with some configuration and access it from any other notebook. How can I do that in this case?

@kevin-bates
Member

hi @arundeep78. You might try looking at using volume mounts, either conditional or unconditional, by adjusting the kernel-pod template. This, in combination with KERNEL_WORKING_DIR (which points into the mount location) would allow you to simulate local pipeline invocations that don't leverage EG.

@lresende
Member

Yes, these parameters are available, but only when the notebooks are scheduled as a pipeline. That is the final stage, after someone has already developed all those notebooks. How would the notebooks get those variables when they are not running as a pipeline?

We would usually specify environment variables and, in the notebook, look up the env vars with default values for local runs.

os.getenv("MY_VAR", "default_local_value")

This way, you will get the correct value from MY_VAR when running as a pipeline and otherwise default to the local value.
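
A minimal sketch for your table example (TABLE_NAME here is just an example variable name you would set on the pipeline node, not something Elyra defines):

```python
import os

# When running as a pipeline node, TABLE_NAME is injected as an environment
# variable; when running interactively, the default development value is used.
table_name = os.getenv("TABLE_NAME", "t_save_data_here")
```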

@arundeep78
Author

@lresende Sorry, I probably do not understand the whole architecture completely.

I start an IBM Elyra environment with a JupyterLab interface. It has a directory structure with notebooks.
I open a given notebook and select a kernel (in this case a Python 3 kernel). When this kernel starts through Enterprise Gateway (EG), it picks up the kernel definition from the configuration, starts that kernel as a Kubernetes pod, and connects to it. I am referring to this diagram from Jupyter Enterprise Gateway:

[image: Jupyter Enterprise Gateway deployment diagram]

If I want to use environment variables, I would have to customize this image to get those variables into the interactive development phase, which would not make sense.

In the JupyterLab environment I may have different pipelines configured in different folders, e.g.
pipeline-finance and pipeline-news. In both cases I have multiple notebooks that refer to the variable table_name. If I set that value in the image, it will fail.

Either I am missing something, or people just do not use it this way and instead configure each notebook to run independently in its own environment.

@arundeep78
Author

hi @arundeep78. You might try looking at using volume mounts, either conditional or unconditional, by adjusting the kernel-pod template. This, in combination with KERNEL_WORKING_DIR (which points into the mount location) would allow you to simulate local pipeline invocations that don't leverage EG.

@kevin-bates thanks. I will read that documentation and come back. But just to clarify, we don't have 'local pipelines'.

We develop notebooks interactively using EG, which starts the kernel from its standard image, in our case a kernel named "python_kubernetes". Once a notebook is working, we schedule it through the Elyra interface on Airflow. Since the notebooks define all their variables internally, without depending on any environment variables, they run fine. The only challenge we have is that when functionality is split across multiple notebooks, we end up duplicating those variables in every notebook.

@kevin-bates
Member

@arundeep78 - Thanks for the additional information. So your notebook "nodes" must require resources that are not available locally yet are available in Airflow. And combining them onto one server (as would be the case if you, say, used JupyterHub to host each elyra-server) and using that server's kernels locally would still leave you with insufficient resources - is that correct? If that's the case, then, yeah, looking into the volume mounts approach is probably your best bet. It seems like you might be able to use unconditional volume mounts and enter the necessary instructions into the kernel-pod templates directly (rather than relying on KERNEL_ values).

@arundeep78
Author

@kevin-bates what do you mean by "insufficient resources" for nodes? Are you referring to CPU, memory, etc.?

Anyway, we do not lack compute power.

In the Elyra development environment we have a GitHub repo that syncs all notebooks to Elyra. In it we have multiple folders, each containing the notebooks relevant to a single pipeline, for example:
[images: repository folder structure with per-pipeline notebook folders]

During development of, let's say, notebook1, Elyra will start a kernel that does not have access to "common_parameters.py", which may contain common variables that are also needed by notebook2 and notebook3.

I assume that if I schedule it as a pipeline and add common_parameters.py as a file dependency for all notebooks in the pipeline, it will be available on the node where each notebook runs, and I should be able to import it as from common_parameters import * or similar, for example:
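
```python
# first cell of each notebook - assumes common_parameters.py was added as a
# file dependency of the node, so it sits next to the notebook at runtime
from common_parameters import *  # e.g. defines table_name = "t_db_table1"

print(table_name)
```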

But during interactive development of these notebooks, the kernel that has been started as a Kubernetes pod does not have access to this 'common_parameters.py' file. This means I cannot simply develop a notebook in interactive mode and then use the same notebook in a pipeline (unless there is a way).

I cannot define these parameters in the kernel images, as their values are different for each pipeline, e.g. for pipeline1 table_name is t_db_table1, and so on.

Also, I am not sure these are really environment variables. They are more like "application variables", where the application is defined by the pipeline; an environment variable is something like a database connection, which can differ between development and production while the table name variable stays the same.

I hope this makes the situation clearer as to what I am trying to achieve, so we can find a way to get there.

@kevin-bates
Member

My comment was essentially asking "why do you need to use EG to develop your notebooks?" If you have enough resources locally, you don't need EG, and the (local) files are available to all notebooks (nodes).
