diff --git a/chapter07_distributed-learning/training-with-multiple-machines.ipynb b/chapter07_distributed-learning/training-with-multiple-machines.ipynb index 7ef7cbc..04b0a87 100644 --- a/chapter07_distributed-learning/training-with-multiple-machines.ipynb +++ b/chapter07_distributed-learning/training-with-multiple-machines.ipynb @@ -167,8 +167,28 @@ "\n", "Not only can we synchronize data within a machine, with the key-value store we can facilitate inter-machine communication. To use it, one can create a distributed kvstore by using the following command: (Note: distributed key-value store requires `MXNet` to be compiled with the flag `USE_DIST_KVSTORE=1`, e.g. `make USE_DIST_KVSTORE=1`.)\n", "\n", + "In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the liveness of each node.\n", + "\n", + "To use the distributed key-value store, we must first start a scheduler process and at least one server process. When the MXNet library is imported in a process, it determines the process's role from the `DMLC_ROLE` environment variable. 
Starting a server or scheduler is as simple as importing MXNet with the appropriate environment variables set.\n", + "\n", + "```python\n", + "import os\nimport subprocess\n\n# start scheduler\n", + "scheduler_env = os.environ.copy()\n", + "scheduler_env.update({\"DMLC_ROLE\": \"scheduler\", \"DMLC_PS_ROOT_PORT\": \"9090\", \"DMLC_PS_ROOT_URI\": \"127.0.0.1\", \"DMLC_NUM_SERVER\": \"1\", \"DMLC_NUM_WORKER\": \"2\", \"PS_VERBOSE\": \"2\"})\n", + "subprocess.Popen('python -c \"import mxnet\"', shell=True, env=scheduler_env)\n", + "\n", + "# start server\n", + "server_env = os.environ.copy()\n", + "server_env.update({\"DMLC_ROLE\": \"server\", \"DMLC_PS_ROOT_PORT\": \"9090\", \"DMLC_PS_ROOT_URI\": \"127.0.0.1\", \"DMLC_NUM_SERVER\": \"1\", \"DMLC_NUM_WORKER\": \"2\", \"PS_VERBOSE\": \"2\"})\n", + "subprocess.Popen('python -c \"import mxnet\"', shell=True, env=server_env)\n", + "```\n", + "\n", + "To use a distributed key-value store from a worker process, create the store and attach an optimizer as follows:\n", + "\n", "```python\n", "store = kv.create('dist')\n", + "# setting the optimizer instructs the server on how to update weights that are pushed to it\n", + "store.set_optimizer(mxnet.optimizer.SGD())\n", "```\n", "\n", "Now if we run the code from the previous section on two machines at the same time, then the store will aggregate the two ndarrays pushed from each machine, and after that, the pulled results will be: \n", @@ -178,9 +198,7 @@ " [ 6. 6. 6.]]\n", "```\n", "\n", - "In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n", "\n", - "It's up to users which machines to run these processes on. 
But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n", + "It's up to users to decide which machines run the worker, server, and scheduler processes. To simplify their placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n", "\n", "Assume there are two machines, A and B. They are ssh-able, and their IPs are saved in a file named `hostfile`. Then we can start one worker in each machine through: \n",
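A reviewer's note on the new scheduler/server cell: a worker process is configured the same way, via the `DMLC_*` environment variables shown above. The sketch below is an illustration only, not part of the diff; the `127.0.0.1` scheduler address and `myprog.py` script name are assumptions for a single-machine demo with one server and two workers.

```python
import os
import subprocess

# Environment for a worker process; DMLC_ROLE tells MXNet that this process is
# a worker. The root URI/port must point at the scheduler (localhost here, an
# assumption for a single-machine demo; values mirror the scheduler/server cell).
worker_env = os.environ.copy()
worker_env.update({
    "DMLC_ROLE": "worker",
    "DMLC_PS_ROOT_URI": "127.0.0.1",
    "DMLC_PS_ROOT_PORT": "9090",
    "DMLC_NUM_SERVER": "1",
    "DMLC_NUM_WORKER": "2",
})

# Launch the user program as a worker (myprog.py is the hypothetical script
# from the text); skipped if the script is not present.
if os.path.exists("myprog.py"):
    subprocess.Popen("python myprog.py", shell=True, env=worker_env)
```

Because `DMLC_NUM_WORKER` is set to `2`, the scheduler will wait until two such worker processes have connected before the distributed store becomes usable.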