Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify distributed key-value store usage #450

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -167,8 +167,28 @@
"\n",
"Not only can we synchronize data within a machine, with the key-value store we can facilitate inter-machine communication. To use it, one can create a distributed kvstore by using the following command: (Note: distributed key-value store requires `MXNet` to be compiled with the flag `USE_DIST_KVSTORE=1`, e.g. `make USE_DIST_KVSTORE=1`.)\n",
"\n",
"In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n",
"In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the status of each node.\n",

"\n",
"To use the distributed key-value store, we must first start a scheduler process and at least one server process. When the MXNet library is imported in a process, it checks what the process's role is through the `DMLC_ROLE` environment variable. Starting a server or scheduler is as simple as importing MXNet with the appropriate environment variables set.\n",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"To use the distributed key-value store, we must first start a scheduler process and at least one server process. When the MXNet library is imported in a process, it checks what the process's role is through the `DMLC_ROLE` environment variable. Starting a server or scheduler is as simple as importing MXNet with the appropriate environment variables set.\n",
"To use the distributed key-value store, we must first start a scheduler process and at least one server process. When the MXNet library is imported in a process, it checks what the process role is through the `DMLC_ROLE` environment variable. Starting a server or scheduler is as simple as importing MXNet with the appropriate environment variables set.\n",

"\n",
"```python\n",
"# start scheduler\n",
"scheduler_env = os.environ.copy()\n",
"scheduler_env.update({\"DMLC_ROLE\": \"scheduler\",\"DMLC_PS_ROOT_PORT\": \"9090\",\"DMLC_PS_ROOT_URI\": \"<scheduler-ip>\",\"DMLC_NUM_SERVER\": \"1\",\"DMLC_NUM_WORKER\": \"2\",\"PS_VERBOSE\": \"2\"})\n",
"subprocess.Popen('python -c \"import mxnet\"', shell=True, env=scheduler_env)\n",
"\n",
"# start server\n",
"server_env = os.environ.copy()\n",
"server_env.update({\"DMLC_ROLE\": \"server\",\"DMLC_PS_ROOT_PORT\": \"9090\",\"DMLC_PS_ROOT_URI\": \"<scheduler-ip>\",\"DMLC_NUM_SERVER\": \"1\",\"DMLC_NUM_WORKER\": \"2\",\"PS_VERBOSE\": \"2\"})\n",
"subprocess.Popen('python -c \"import mxnet\"', shell=True, env=server_env)\n",
"```\n",
"\n",
"To use a distributed key-value store from a worker process, just create the store as follows:\n",
"\n",
"```python\n",
"store = kv.create('dist')\n",
"# setting the optimizer instructs the server on how to update weights that are pushed to it\n",
"store.set_optimizer(mxnet.optimizer.SGD())\n",
"```\n",
"\n",
"Now if we run the code from the previous section on two machines at the same time, then the store will aggregate the two ndarrays pushed from each machine, and after that, the pulled results will be: \n",
Expand All @@ -178,9 +198,7 @@
" [ 6. 6. 6.]]\n",
"```\n",
"\n",
"In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n",
"\n",
"It's up to users which machines to run these processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",
"It's up to users which machines to run the worker, scheduler, and server processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"It's up to users which machines to run the worker, scheduler, and server processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",
"It's up to users to decide which machines to run the worker, scheduler, and server processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",

"\n",
"Assume there are two machines, A and B. They are ssh-able, and their IPs are saved in a file named `hostfile`. Then we can start one worker in each machine through: \n",
"\n",
Expand Down