
Bug: successor.local_gradient_for_argument(self) could be corrupted #52

@Banyc

Issue

`local_gradient_for_argument` might read a corrupted (stale) float value from the successor.

Say we have the computation graph:

    h
    |
    |
    f
  • $h$ has parameters $w$
  • $f$ has parameters $v$
  • $h : \mathbb{R}^m \to \mathbb{R}$
  • $f : \mathbb{R}^n \to \mathbb{R}$
  • $h$ is the successor of $f$
  • $E$ is the graph
  • $\frac{\partial E}{\partial w} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w}$
  • $\frac{\partial E}{\partial f} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial f}$
  • $\frac{\partial E}{\partial v} = \frac{\partial E}{\partial f} \cdot \frac{\partial f}{\partial v}$
  • when doing backpropagation, the steps will be
    1. $h$ computes $\frac{\partial E}{\partial w}$ and caches $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial w}$
    2. $h$ updates $w$ to $w'$
    3. $f$ computes $\frac{\partial E}{\partial f}$, which requires $\frac{\partial h}{\partial f}$ to be in the cache
      1. $\frac{\partial h}{\partial f}$ is not yet in cache, so $h$ will have to compute it now
      2. $\frac{\partial h}{\partial f}$ is computed based on the new parameter $w'$
        • This is the problem!
        • $\frac{\partial h}{\partial f}$ is corrupted
      3. $\frac{\partial h}{\partial f}$ is in cache now
      4. $\frac{\partial E}{\partial f}$ is computed by looking up both $\frac{\partial E}{\partial h}$ and $\frac{\partial h}{\partial f}$ in the cache
      5. $\frac{\partial E}{\partial f}$ is in cache now
    4. $f$ computes $\frac{\partial f}{\partial v}$ and caches it
    5. $f$ computes $\frac{\partial E}{\partial v}$ with $\frac{\partial f}{\partial v}$ and the corrupted $\frac{\partial h}{\partial f}$
    6. $f$ updates $v$ based on the corrupted $\frac{\partial E}{\partial v}$ (a minimal sketch of this ordering follows below)
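
Here is a small, runnable sketch of the ordering problem. Everything in it is hypothetical and only meant to make the staleness visible: the class `HNode`, the linear functions $h(x) = w \cdot x$ and $f(v_{in}) = v \cdot v_{in}$, the method signatures, and the step size are my own choices, not the repository's actual code.

```python
# Hypothetical two-node chain (not the repository's real classes):
#   E = h(f(input)),  h(x) = w * x,  f(v_in) = v * v_in
# so dh/dx = w, dh/dw = x, and dE/dh = 1.

class HNode:
    """Successor node h(x) = w * x with a lazily filled gradient cache."""

    def __init__(self, w):
        self.w = w
        self.cache = {}

    def local_gradient_for_argument(self, x):
        # dh/dx = w.  If w has already been updated, this value is stale.
        if "dh_dx" not in self.cache:
            self.cache["dh_dx"] = self.w
        return self.cache["dh_dx"]

    def do_gradient_descent_step(self, x, upstream, step_size=0.1):
        dE_dw = upstream * x          # dE/dw = dE/dh * dh/dw
        self.w -= step_size * dE_dw   # step 2: w becomes w'


# Forward pass: input -> f -> h.
v, v_in, upstream = 2.0, 1.0, 1.0
f_out = v * v_in                      # f's output, the argument passed to h

h = HNode(w=3.0)
correct_dh_df = h.w                   # dh/df evaluated with the original w = 3.0

# Steps 1-2: h takes its descent step first, so w is overwritten ...
h.do_gradient_descent_step(x=f_out, upstream=upstream)

# Step 3: ... and only now does f ask h for dh/df.  The cache is still empty,
# so the value is computed from the already-updated w', not the forward-pass w.
stale_dh_df = h.local_gradient_for_argument(f_out)

print(correct_dh_df)  # 3.0
print(stale_dh_df)    # 2.8  <- the "corrupted" value f will use for dE/df and dE/dv
```

Running it prints 3.0 for the pre-update $\frac{\partial h}{\partial f}$ and 2.8 for the value $f$ actually receives, which is the corruption described in steps 3-6 above.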

Solutions

I can think of two solutions:

  • compute local_gradient ($\frac{\partial h}{\partial f}$) at the beginning of do_gradient_descent_step, before the parameters ($w$) are modified (see the sketch after this list)
  • the successor $h$ distributes local_gradient ($\frac{\partial h}{\partial f}$) and global_gradient ($\frac{\partial E}{\partial h}$) to $f$ before the parameters ($w$) are modified
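
Continuing the hypothetical `HNode` sketch above (again, not the repository's actual code), the first approach could look roughly like this: `do_gradient_descent_step` fills the local-gradient cache before it touches $w$, so a later `local_gradient_for_argument` call from $f$ returns the pre-update value.

```python
# One possible shape of the first fix, still using the hypothetical HNode above:
# fill the local-gradient cache at the start of do_gradient_descent_step,
# before the parameters are modified.

class FixedHNode(HNode):
    def do_gradient_descent_step(self, x, upstream, step_size=0.1):
        # Cache dh/dx while w still holds the value used in the forward pass ...
        self.cache.setdefault("dh_dx", self.w)
        # ... and only then update the parameters.
        super().do_gradient_descent_step(x, upstream, step_size)


h = FixedHNode(w=3.0)
h.do_gradient_descent_step(x=2.0, upstream=1.0)
print(h.local_gradient_for_argument(2.0))  # 3.0 -- f now sees the pre-update dh/df
```

The second approach would instead have $h$ push $\frac{\partial h}{\partial f}$ and $\frac{\partial E}{\partial h}$ into $f$'s cache during its own step; either way, the invariant is that every gradient a predecessor will need is fixed before the successor's parameters change.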

PS

Thank you for your book and the sample code; they helped me understand neural networks more deeply!
