Expanded readme setup steps. #6
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
INSTALL_COMMAND=$(cat << EOM
sudo apt update
sudo apt install -y nfs-common nfs-kernel-server nfs-server net-tools tmux python3-ipyparallel
python -m pip install --user virtualenv; virtualenv venv
Could we switch to uv, specifically `uv venv`? It's a great tool.
Will do, thanks! We can also test it out this upcoming week.
python -m pip install --user virtualenv; virtualenv venv
source venv/bin/activate
cd ~/
pip install -U "jax[tpu]" ipyparallel
uv pip install
git clone https://github.com/jax-ml/jax-llm-examples.git
fi
cd jax-llm-examples/deepseek_r1_jax
pip install -e .
uv pip install -e .
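Taken together, the uv suggestions in this thread might look roughly like the sketch below. It only builds and prints the command text; nothing is installed when it runs. It assumes uv is installed from PyPI with pip (mirroring how the PR installs virtualenv) and that the resulting `uv` executable ends up on `PATH`. The function name `build_install_command` is hypothetical, and the apt and `git clone` steps from the PR are left out for brevity.

```shell
# Hypothetical uv-based variant of the PR's install steps (a sketch, not the PR's code).
# The heredoc delimiter is quoted so nothing is expanded while the text is built.
build_install_command() {
  cat << 'EOM'
python -m pip install --user uv
uv venv venv
source venv/bin/activate
uv pip install -U "jax[tpu]" ipyparallel
cd jax-llm-examples/deepseek_r1_jax
uv pip install -e .
EOM
}

# Capture the text in the same variable name the PR uses, then show it.
INSTALL_COMMAND=$(build_install_command)
printf '%s\n' "$INSTALL_COMMAND"
```

Passing `venv` to `uv venv` keeps the directory name the rest of the README already activates; without an argument uv would create `.venv` instead.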

> After you've confirmed this setup works, you can utilize main.ipynb to run inference.

### Troubleshooting
That's an awesome section!
Worst case you may need to restart the engines via:\
tpu_exec 0 0 "$CONTROLLER_CMD".
- You've encountered OOM on a run that shouldn't have run out of memory.\
Solution: you have again likely not cleared pre-existing sessions and still have weights loaded in memory.\
`libtpu` is an exclusive process; you won't be able to run two JAX processes with TPU memory usage.
It'll actually fail at `jax.distributed.initialize`, I think.
Interesting, do you have any other thoughts on why I encountered OOMs but clearing the engines fixed the issue?
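One way to narrow this down might be to check each worker for leftover processes before starting a new run. The sketch below is only a diagnostic idea, not something from the PR: the helper name `list_stale` is hypothetical, and the default pattern `ipengine` assumes the engines were started with ipyparallel's `ipengine` command. It relies on `pgrep` from procps being available on the worker.

```shell
# List processes whose full command line matches a pattern (default: ipengine).
# pgrep exits non-zero when nothing matches, so print a short message instead.
list_stale() {
  pattern="${1:-ipengine}"
  pgrep -af "$pattern" || echo "no processes matching: $pattern"
}

# Example: look for leftover engine processes on the current machine.
list_stale ipengine
```

If a stale process shows up here, that would be consistent with the "pre-existing sessions" explanation in the troubleshooting text; if nothing shows up, the OOM likely has another cause.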
@@ -441,6 +465,12 @@ tpu_exec 1 15 "$ENGINE_CMD" # all workers except worker 0
```

#### Jupyter Notebook
```bash
# Start SSH Tunnel on Worker0.
tpu_exec 0 0 "tmux new -d -s engine 'source venv/bin/activate && jupyter notebook --no-browser --port=8888'"
I typically use VS Code to work with Jupyter notebooks; we probably shouldn't start `jupyter notebook` here in general.
We also probably shouldn't call the server "engine" on worker0, since that'd be 1 too many engines
Oh I see, maybe I haven't used that vscode feature/this is overkill.
Added a few missing explicit steps and reorganized the commands to run without error.