Expanded readme setup steps. #6

Open
wants to merge 1 commit into
base: main

gpolovets1

Added a few missing explicit steps and reorganized the commands to run without error.


google-cla bot commented Mar 14, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

INSTALL_COMMAND=$(cat << EOM
sudo apt update
sudo apt install -y nfs-common nfs-kernel-server nfs-server net-tools tmux python3-ipyparallel
python -m pip install --user virtualenv; virtualenv venv
Collaborator

Could we switch to uv, specifically `uv venv`? It's a great tool.

Author

Will do, thanks! We can also test it out this upcoming week.

python -m pip install --user virtualenv; virtualenv venv
source venv/bin/activate
cd ~/
pip install -U "jax[tpu]" ipyparallel
Collaborator

uv pip install

git clone https://github.com/jax-ml/jax-llm-examples.git
fi
cd jax-llm-examples/deepseek_r1_jax
pip install -e .
Collaborator

uv pip install -e .
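
Taken together, the three uv suggestions above might look like the following. This is only a sketch of how the README's `INSTALL_COMMAND` could change, not the final wording of the PR. Three things are assumptions here: uv is installed with its standalone installer script, the installer places the binary in `~/.local/bin`, and the `if` guard around `git clone` is a guess at the part of the script not shown in this diff.

```shell
# Sketch only: the install steps from the diff with uv swapped in for
# virtualenv and pip. The heredoc delimiter is quoted so that $HOME and $PATH
# stay literal and are expanded on the worker, not on the local machine.
INSTALL_COMMAND=$(cat << 'EOM'
sudo apt update
sudo apt install -y nfs-common nfs-kernel-server nfs-server net-tools tmux python3-ipyparallel
# Assumption: uv's standalone installer, which installs into ~/.local/bin.
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
uv venv venv
source venv/bin/activate
cd ~/
uv pip install -U "jax[tpu]" ipyparallel
# Assumption: a guard like this precedes the clone in the full script.
if [ ! -d jax-llm-examples ]; then
  git clone https://github.com/jax-ml/jax-llm-examples.git
fi
cd jax-llm-examples/deepseek_r1_jax
uv pip install -e .
EOM
)
```

Because `uv pip` installs into the currently activated environment, the `source venv/bin/activate` line still matters when the environment directory is named `venv` rather than uv's default `.venv`.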


> After you've confirmed this setup works, you can utilize main.ipynb to run inference.

### Troubleshooting
Collaborator

That's an awesome section!

Worst case you may need to restart the engines via:\
tpu_exec 0 0 "$CONTROLLER_CMD".
- You've encountered OOM on a run that shouldn't have run out of memory.\
Solution: you have again likely not cleared pre-existing sessions and still have weights loaded in memory.\
Collaborator

libtpu holds the TPU exclusively, so you won't be able to run two JAX processes that both use TPU memory.

Collaborator

I think it'll actually fail at `jax.distributed.initialize`.

Author

Interesting, do you have any other thoughts on why I encountered OOMs but clearing the engines fixed the issue?
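
One way to narrow this down would be to check, on a worker, whether any leftover process still holds the TPU device before starting a new run. The helper below is a hypothetical sketch, not part of the README: the name `tpu_holders` is made up, and the default device paths (`/dev/accel*` and `/dev/vfio/*`) are an assumption about the TPU VM image that may not match every TPU generation.

```shell
# Hypothetical helper: list processes that hold a TPU device file open.
# Returns 0 if at least one holder was found, 1 otherwise.
tpu_holders() {
  local dev found
  found=1
  # Assumption: the TPU is exposed under one of these device paths.
  [ "$#" -gt 0 ] || set -- /dev/accel* /dev/vfio/*
  for dev in "$@"; do
    [ -e "$dev" ] || continue        # skip globs that matched nothing
    if sudo lsof -w "$dev"; then     # lsof exits 0 only when it lists a holder
      found=0
    fi
  done
  return "$found"
}
```

If `tpu_holders` prints a stale Python process from an earlier session, that would be consistent with the fix of restarting the engines; if it prints nothing, the OOM likely has a different cause.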

@@ -441,6 +465,12 @@ tpu_exec 1 15 "$ENGINE_CMD" # all workers except worker 0
```

#### Jupyter Notebook
```bash
# Start SSH Tunnel on Worker0.
tpu_exec 0 0 "tmux new -d -s engine 'source venv/bin/activate && jupyter notebook --no-browser --port=8888'"
Collaborator

I typically use VS Code to work with Jupyter notebooks; we probably shouldn't start Jupyter Notebook here in general.

Collaborator

We also probably shouldn't call the server "engine" on worker0, since that'd be 1 too many engines

Author

Oh, I see. I haven't used that VS Code feature, so maybe this is overkill.
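
If the Jupyter step does stay in the README, the naming concern could be addressed roughly as follows. This is a sketch under assumptions: the variable name `JUPYTER_CMD` and the session name `jupyter` are hypothetical, chosen to mirror the `$ENGINE_CMD` and `$CONTROLLER_CMD` pattern the README already uses.

```shell
# Sketch: same Jupyter launch as in the diff, but in a tmux session named
# "jupyter" so that it is not mistaken for an ipyparallel engine on worker 0.
JUPYTER_CMD="tmux new -d -s jupyter 'source venv/bin/activate && jupyter notebook --no-browser --port=8888'"
```

It would then be sent to worker 0 the same way as the other commands, for example `tpu_exec 0 0 "$JUPYTER_CMD"`.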
