latex test
Si1w committed Dec 31, 2024
1 parent 6730cbb commit 0964652
Showing 3 changed files with 52 additions and 19 deletions.
2 changes: 1 addition & 1 deletion _config.yml
@@ -55,7 +55,7 @@ plugins:
markdown: kramdown

kramdown:
-math_engine: null
+math_engine: mathjax
input: GFM

permalink: /blog/:year/:month/:day/:title
2 changes: 1 addition & 1 deletion _includes/latex.html
@@ -12,4 +12,4 @@
}
});
</script>
-<script src="https://cdn.jsdelivr.net/npm/mathjax@2.7.9/MathJax.js?config=TeX-AMS_HTML" type="text/javascript"></script>
+<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-MML-AM_CHTML"></script>
67 changes: 50 additions & 17 deletions _posts/2024-12-31-LRGD.md
@@ -32,7 +32,9 @@ $$

So when we design a learning algorithm, we need to decide how to represent the hypothesis. In the case of linear regression, the hypothesis is represented as:

-$$h(x) = \theta_0 + \theta_1 x$$
+$$
+h(x) = \theta_0 + \theta_1 x
+$$

Suppose there is more than one feature, for example the number of bedrooms:

@@ -46,11 +48,15 @@ Size | Bedrooms | Price

Given that $x_1$ is the size of the house, $x_2$ is the number of bedrooms, the hypothesis is represented as:

-$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
+$$
+h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2
+$$

or, written more compactly:

-$$h(x) = \sum_{i=0}^{n} \theta_i x_i$$
+$$
+h(x) = \sum_{i=0}^{n} \theta_i x_i
+$$

where $n$ is the number of features and $x_i$ is the $i^{th}$ feature value; in this case $n = 2$, and $x_0 = 1$ by convention.
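The post contains no code, but the compact sum above is just a dot product once $x_0 = 1$ is prepended to the feature vector. A minimal NumPy sketch (the $\theta$ values and house features here are made-up numbers for illustration, not from the post):

```python
import numpy as np

# Hypothetical parameters [theta_0, theta_1, theta_2]: intercept, size, bedrooms
theta = np.array([50.0, 0.1, 20.0])

# One example with x_0 = 1 prepended: [1, size, bedrooms]
x = np.array([1.0, 2104.0, 3.0])

# h(x) = sum_{i=0}^{n} theta_i * x_i is a dot product
h = theta @ x
print(h)  # 50 + 0.1*2104 + 20*3 = 320.4
```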

@@ -81,7 +87,8 @@ What we will do is to choose $\theta$ s.t. $h(x)$ is close to $y$ for our traini
In linear regression, also called **ordinary least squares**, we want to minimize the squared difference between the hypothesis and the actual output:

$$
-\min_{\theta} J(\theta) = \min_{\theta} \frac{1}{2}\sum_{i}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$$
+\min_{\theta} J(\theta) = \min_{\theta} \frac{1}{2}\sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2
+$$

Here, the factor $1/2$ is included by convention to simplify the derivative. $J(\theta)$ is called the **cost function** or **squared error function**.
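As a concrete check, the cost can be written in a few lines of NumPy (the function name and toy data are mine, not from the post):

```python
import numpy as np

def cost(theta, X, y):
    # J(theta) = (1/2) * sum_i (h(x^(i)) - y^(i))^2
    residuals = X @ theta - y          # h(x^(i)) - y^(i) for all i at once
    return 0.5 * np.sum(residuals ** 2)

# Toy data: the first column of X is x_0 = 1
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

print(cost(np.array([0.0, 1.0]), X, y))  # perfect fit, so J = 0.0
print(cost(np.array([0.0, 0.0]), X, y))  # 0.5 * (1 + 4 + 9) = 7.0
```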

@@ -109,11 +116,15 @@ Start with some initial $\theta$ (Say $\theta = \vec{0}$)

Keep changing $\theta$ to reduce $J(\theta)$ by repeating the following update until convergence:

-$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
+$$
+\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
+$$

where $\alpha$ is the learning rate, a small positive constant, and $\frac{\partial}{\partial \theta_j} J(\theta)$ is the partial derivative of the cost function $J(\theta)$ with respect to the parameter $\theta_j$.

-$$\frac{\partial}{\partial \theta_j} J(\theta) = (h_{\theta}(x) - y)x_j = \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}$$
+$$
+\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}
+$$

The method looks at every example in the entire training set on every step, and is called **batch gradient descent**.
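A sketch of batch gradient descent under these definitions (the learning rate, iteration count, and toy data are assumptions for illustration):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, iters=1000):
    theta = np.zeros(X.shape[1])        # start with theta = 0
    for _ in range(iters):
        # Full-batch gradient sum_i (h(x^(i)) - y^(i)) x^(i), as one matrix product
        grad = X.T @ (X @ theta - y)
        theta -= alpha * grad           # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])           # data generated by y = 2x
print(batch_gradient_descent(X, y))     # approaches [0, 2]
```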

@@ -133,7 +144,8 @@ In stochastic gradient descent, it will never converge, but it will get close to
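A stochastic variant might look like the following sketch (again with assumed hyperparameters and data); each update uses a single training example instead of the whole batch:

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):      # visit examples in random order
            grad_i = (X[i] @ theta - y[i]) * X[i]  # gradient of one term of J
            theta -= alpha * grad_i
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(sgd(X, y))  # ends up close to [0, 2]
```

On noisy data and with a fixed $\alpha$, the iterates keep jittering around the minimum, which is the non-convergence the post mentions; decaying $\alpha$ over time makes them settle.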

# Normal Equation

Gradient descent minimizes $J(\theta)$ iteratively; for linear regression we can also solve for $\theta$ in closed form. First, collect the partial derivatives of $J(\theta)$ into the gradient vector:

-$$\nabla_{\theta} J(\theta) =
+$$
+\nabla_{\theta} J(\theta) =
\begin{bmatrix}
\frac{\partial}{\partial \theta_0} J(\theta) \\
\frac{\partial}{\partial \theta_1} J(\theta) \\
\vdots \\
\frac{\partial}{\partial \theta_n} J(\theta)
\end{bmatrix}

Given that $\theta \in \mathbb{R}^{n+1}$, $J(\theta)$ is the cost function.

-$$X =
+$$
+X =
\begin{bmatrix}
1 & (x^{(1)})^{T} \\
1 & (x^{(2)})^{T} \\
\vdots \\
1 & (x^{(m)})^{T}
\end{bmatrix}

Here, $X$ is called the **design matrix**. $(x^{(i)})^{T}$ represents all the feature values of the $i^{th}$ training example.
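Building the design matrix from raw features is one line of column stacking (the feature values below are illustrative):

```python
import numpy as np

# Raw features per example: [size, bedrooms] (made-up values)
features = np.array([[2104.0, 3.0],
                     [1600.0, 3.0],
                     [2400.0, 3.0]])

# Prepend the column of ones so theta_0 acts as the intercept term
X = np.hstack([np.ones((features.shape[0], 1)), features])
print(X)
```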

-$$X \theta =
+$$
+X \theta =
\begin{bmatrix}
1 & (x^{(1)})^{T} \\
1 & (x^{(2)})^{T} \\
\vdots \\
1 & (x^{(m)})^{T}
\end{bmatrix}
\theta
=
\begin{bmatrix}
h_{\theta}(x^{(1)}) \\
h_{\theta}(x^{(2)}) \\
\vdots \\
h_{\theta}(x^{(m)})
\end{bmatrix}
$$

@@ -197,26 +211,45 @@

To minimize the cost function $J(\theta)$, we set the gradient to zero:

-$$X^{T} X \theta = X^{T} \vec{y}$$
+$$
+X^{T} X \theta = X^{T} \vec{y}
+$$

which is the **normal equation**, so

-$$\theta = (X^{T} X)^{-1} X^{T} \vec{y}$$
+$$
+\theta = (X^{T} X)^{-1} X^{T} \vec{y}
+$$
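Numerically, one would solve the normal equation with a linear solver rather than forming the inverse explicitly; a sketch on assumed toy data:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # design matrix, x_0 = 1
y = np.array([2.0, 4.0, 6.0])                         # generated by y = 2x

# Solve (X^T X) theta = X^T y instead of inverting X^T X
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [0, 2]
```

`np.linalg.lstsq` does the same job and stays better behaved when $X^{T} X$ is ill-conditioned.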

**TIPS:** Given that $A \in \mathbb{R}^{n \times n}$

-$$tr(A) = \sum_{i=1}^{n} A_{ii} = tr(A^{T})$$
-$$\nabla_A tr(AA^{T}) = 2A$$
+$$
+tr(A) = \sum_{i=1}^{n} A_{ii} = tr(A^{T})
+$$
+
+$$
+\nabla_A tr(AA^{T}) = 2A
+$$

Given that $B \in \mathbb{R}^{n \times n}$ (so that each product below is square), then

-$$tr(AB) = tr(BA) $$
-$$\nabla_A tr(AB) = B^{T}$$
-$$\nabla_A tr(AA^{T}B) = BA + B^{T}A$$
+$$
+tr(AB) = tr(BA)
+$$
+
+$$
+\nabla_A tr(AB) = B^{T}
+$$
+
+$$
+\nabla_A tr(AA^{T}B) = BA + B^{T}A
+$$

Given that $B \in \mathbb{R}^{n \times m}$ and $C \in \mathbb{R}^{m \times n}$, then

-$$tr(ABC) = tr(CAB) = tr(BCA)$$
+$$
+tr(ABC) = tr(CAB) = tr(BCA)
+$$
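These identities are easy to sanity-check numerically; a small sketch (random matrices, with shapes chosen so every product is square):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# grad_A tr(AB) = B^T, checked on one entry by a finite difference
eps = 1e-6
A_pert = A.copy()
A_pert[0, 1] += eps
numeric = (np.trace(A_pert @ B) - np.trace(A @ B)) / eps
assert np.isclose(numeric, B.T[0, 1], atol=1e-4)

# tr(ABC) = tr(CAB) = tr(BCA): trace is invariant under cyclic shifts
B2 = rng.standard_normal((n, m))
C2 = rng.standard_normal((m, n))
t = np.trace(A @ B2 @ C2)
assert np.isclose(t, np.trace(C2 @ A @ B2))
assert np.isclose(t, np.trace(B2 @ C2 @ A))

print("all trace identities check out")
```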

Given that $f(A): \mathbb{R}^{n \times n} \to \mathbb{R}$, then

