latex test
Si1w committed Dec 31, 2024
1 parent 6730cbb commit 0964652
Showing 3 changed files with 52 additions and 19 deletions.
2 changes: 1 addition & 1 deletion _config.yml
@@ -55,7 +55,7 @@ plugins:
markdown: kramdown

kramdown:
-math_engine: null
+math_engine: mathjax
input: GFM

permalink: /blog/:year/:month/:day/:title
2 changes: 1 addition & 1 deletion _includes/latex.html
@@ -12,4 +12,4 @@
}
});
</script>
-<script src="https://cdn.jsdelivr.net/npm/mathjax@2.7.9/MathJax.js?config=TeX-AMS_HTML" type="text/javascript"></script>
+<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-MML-AM_CHTML"></script>
67 changes: 50 additions & 17 deletions _posts/2024-12-31-LRGD.md
@@ -32,7 +32,9 @@ $$

So when we design a learning algorithm, we need to decide how to represent the hypothesis. In the case of linear regression, the hypothesis is represented as:

-$$h(x) = \theta_0 + \theta_1 x$$
+$$
+h(x) = \theta_0 + \theta_1 x
+$$

Suppose there is more than one feature, for example the number of bedrooms:

@@ -46,11 +48,15 @@ Size | Bedrooms | Price

Given that $x_1$ is the size of the house, $x_2$ is the number of bedrooms, the hypothesis is represented as:

-$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
+$$
+h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2
+$$

or, written more compactly:

-$$h(x) = \sum_{i=0}^{n} \theta_i x_i$$
+$$
+h(x) = \sum_{i=0}^{n} \theta_i x_i
+$$

where $n$ is the number of features and $x_i$ is the $i^{th}$ feature value; in this case $n = 2$, and $x_0 = 1$ by convention.
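The post contains no code, but the compact sum above is just a dot product once $x_0 = 1$ is prepended to the feature vector. A minimal NumPy sketch (the $\theta$ values and house features here are made-up numbers for illustration, not from the post):

```python
import numpy as np

# Hypothetical parameters [theta_0, theta_1, theta_2]: intercept, size, bedrooms
theta = np.array([50.0, 0.1, 20.0])

# One example with x_0 = 1 prepended: [1, size, bedrooms]
x = np.array([1.0, 2104.0, 3.0])

# h(x) = sum_{i=0}^{n} theta_i * x_i is a dot product
h = theta @ x
print(h)  # 50 + 0.1*2104 + 20*3 = 320.4
```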

@@ -81,7 +87,8 @@ What we will do is to choose $\theta$ s.t. $h(x)$ is close to $y$ for our traini
In linear regression, also called **ordinary least squares**, we want to minimize the squared difference between the hypothesis and the actual output:

$$
-\min_{\theta} J(\theta) = \min_{\theta} \frac{1}{2}\sum_{i}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$$
+\min_{\theta} J(\theta) = \min_{\theta} \frac{1}{2}\sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2
+$$

Here, the factor $1/2$ is included by convention to simplify the derivative. $J(\theta)$ is called the **cost function** or **squared error function**.
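As a concrete check, the cost can be written in a few lines of NumPy (the function name and toy data are mine, not from the post):

```python
import numpy as np

def cost(theta, X, y):
    # J(theta) = (1/2) * sum_i (h(x^(i)) - y^(i))^2
    residuals = X @ theta - y          # h(x^(i)) - y^(i) for all i at once
    return 0.5 * np.sum(residuals ** 2)

# Toy data: the first column of X is x_0 = 1
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

print(cost(np.array([0.0, 1.0]), X, y))  # perfect fit, so J = 0.0
print(cost(np.array([0.0, 0.0]), X, y))  # 0.5 * (1 + 4 + 9) = 7.0
```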

@@ -109,11 +116,15 @@ Start with some initial $\theta$ (Say $\theta = \vec{0}$)

Keep changing $\theta$ to reduce $J(\theta)$ by repeating the following update until convergence:

-$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
+$$
+\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
+$$

where $\alpha$ is the learning rate, a small positive constant, and $\frac{\partial}{\partial \theta_j} J(\theta)$ is the partial derivative of the cost function $J(\theta)$ with respect to the parameter $\theta_j$.

-$$\frac{\partial}{\partial \theta_j} J(\theta) = (h_{\theta}(x) - y)x_j = \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}$$
+$$
+\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}
+$$

The method looks at every example in the entire training set on every step, and is called **batch gradient descent**.
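A sketch of batch gradient descent under these definitions (the learning rate, iteration count, and toy data are assumptions for illustration):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, iters=1000):
    theta = np.zeros(X.shape[1])        # start with theta = 0
    for _ in range(iters):
        # Full-batch gradient sum_i (h(x^(i)) - y^(i)) x^(i), as one matrix product
        grad = X.T @ (X @ theta - y)
        theta -= alpha * grad           # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])           # data generated by y = 2x
print(batch_gradient_descent(X, y))     # approaches [0, 2]
```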

@@ -133,7 +144,8 @@ In stochastic gradient descent, it will never converge, but it will get close to
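A stochastic variant might look like the following sketch (again with assumed hyperparameters and data); each update uses a single training example instead of the whole batch:

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):      # visit examples in random order
            grad_i = (X[i] @ theta - y[i]) * X[i]  # gradient of one term of J
            theta -= alpha * grad_i
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(sgd(X, y))  # ends up close to [0, 2]
```

On noisy data and with a fixed $\alpha$, the iterates keep jittering around the minimum, which is the non-convergence the post mentions; decaying $\alpha$ over time makes them settle.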

# Normal Equation

Gradient descent minimizes $J(\theta)$ iteratively; for linear regression we can also solve for $\theta$ in closed form. First, collect the partial derivatives of $J(\theta)$ into the gradient vector:

-$$\nabla_{\theta} J(\theta) =
+$$
+\nabla_{\theta} J(\theta) =
\begin{bmatrix}
\frac{\partial}{\partial \theta_0} J(\theta) \\
\frac{\partial}{\partial \theta_1} J(\theta) \\
\vdots \\
\frac{\partial}{\partial \theta_n} J(\theta)
\end{bmatrix}

Given that $\theta \in \mathbb{R}^{n+1}$, $J(\theta)$ is the cost function.

-$$X =
+$$
+X =
\begin{bmatrix}
1 & (x^{(1)})^{T} \\
1 & (x^{(2)})^{T} \\
\vdots \\
1 & (x^{(m)})^{T}
\end{bmatrix}

Here, $X$ is called the **design matrix**. $(x^{(i)})^{T}$ represents all the feature values of the $i^{th}$ training example.
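Building the design matrix from raw features is one line of column stacking (the feature values below are illustrative):

```python
import numpy as np

# Raw features per example: [size, bedrooms] (made-up values)
features = np.array([[2104.0, 3.0],
                     [1600.0, 3.0],
                     [2400.0, 3.0]])

# Prepend the column of ones so theta_0 acts as the intercept term
X = np.hstack([np.ones((features.shape[0], 1)), features])
print(X)
```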

-$$X \theta =
+$$
+X \theta =
\begin{bmatrix}
1 & (x^{(1)})^{T} \\
1 & (x^{(2)})^{T} \\
\vdots \\
1 & (x^{(m)})^{T}
\end{bmatrix}
\theta
=
\begin{bmatrix}
h_{\theta}(x^{(1)}) \\
h_{\theta}(x^{(2)}) \\
\vdots \\
h_{\theta}(x^{(m)})
\end{bmatrix}
$$

@@ -197,26 +211,45 @@

To minimize the cost function $J(\theta)$, we set the gradient to zero:

-$$X^{T} X \theta = X^{T} \vec{y}$$
+$$
+X^{T} X \theta = X^{T} \vec{y}
+$$

which is the **normal equation**, so

-$$\theta = (X^{T} X)^{-1} X^{T} \vec{y}$$
+$$
+\theta = (X^{T} X)^{-1} X^{T} \vec{y}
+$$
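Numerically, one would solve the normal equation with a linear solver rather than forming the inverse explicitly; a sketch on assumed toy data:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # design matrix, x_0 = 1
y = np.array([2.0, 4.0, 6.0])                         # generated by y = 2x

# Solve (X^T X) theta = X^T y instead of inverting X^T X
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [0, 2]
```

`np.linalg.lstsq` does the same job and stays better behaved when $X^{T} X$ is ill-conditioned.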

**TIPS:** Given that $A \in \mathbb{R}^{n \times n}$

-$$tr(A) = \sum_{i=1}^{n} A_{ii} = tr(A^{T})$$
-$$\nabla_A tr(AA^{T}) = 2A$$
+$$
+tr(A) = \sum_{i=1}^{n} A_{ii} = tr(A^{T})
+$$
+
+$$
+\nabla_A tr(AA^{T}) = 2A
+$$

Given that $B \in \mathbb{R}^{n \times n}$ (so that each product below is square), then

-$$tr(AB) = tr(BA) $$
-$$\nabla_A tr(AB) = B^{T}$$
-$$\nabla_A tr(AA^{T}B) = BA + B^{T}A$$
+$$
+tr(AB) = tr(BA)
+$$
+
+$$
+\nabla_A tr(AB) = B^{T}
+$$
+
+$$
+\nabla_A tr(AA^{T}B) = BA + B^{T}A
+$$

Given that $B \in \mathbb{R}^{n \times m}$ and $C \in \mathbb{R}^{m \times n}$, then

-$$tr(ABC) = tr(CAB) = tr(BCA)$$
+$$
+tr(ABC) = tr(CAB) = tr(BCA)
+$$
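These identities are easy to sanity-check numerically; a small sketch (random matrices, with shapes chosen so every product is square):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# grad_A tr(AB) = B^T, checked on one entry by a finite difference
eps = 1e-6
A_pert = A.copy()
A_pert[0, 1] += eps
numeric = (np.trace(A_pert @ B) - np.trace(A @ B)) / eps
assert np.isclose(numeric, B.T[0, 1], atol=1e-4)

# tr(ABC) = tr(CAB) = tr(BCA): trace is invariant under cyclic shifts
B2 = rng.standard_normal((n, m))
C2 = rng.standard_normal((m, n))
t = np.trace(A @ B2 @ C2)
assert np.isclose(t, np.trace(C2 @ A @ B2))
assert np.isclose(t, np.trace(B2 @ C2 @ A))

print("all trace identities check out")
```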

Given that $f(A): \mathbb{R}^{n \times n} \to \mathbb{R}$, then

