2 changes: 1 addition & 1 deletion R/manova.qmd
@@ -15,7 +15,7 @@ library(tidyverse)
 library(knitr)
 library(emmeans)
 
-knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
+knitr::opts_chunk$set(echo = TRUE)
 pottery <- read.csv("../data/manova1.csv")
 pottery
 ```
2 changes: 1 addition & 1 deletion R/wilcoxonsr_hodges_lehman.qmd
@@ -24,7 +24,7 @@ library(coin)
 library(DOS2)
 library(MASS)
 
-knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
+knitr::opts_chunk$set(echo = TRUE)
 
 blood_p <- read.csv("../data/blood_pressure.csv", dec = ".")[1:240,1:5]
 
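Both .qmd changes above drop `cache = TRUE` from the global `knitr::opts_chunk$set()` call, so every chunk is re-evaluated on each render. If caching is still wanted for a single long-running chunk, it can be requested per chunk rather than globally; a minimal sketch, with the chunk label `slow-step` as a hypothetical example:

```r
# Global chunk options after this change: show code, no document-wide caching
knitr::opts_chunk$set(echo = TRUE)

# A single expensive chunk can still opt in to caching through its own header,
# e.g. a chunk declared as {r slow-step, cache=TRUE} keeps its results on disk
# and only re-runs when its code changes, while every other chunk re-executes.
```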
17 changes: 17 additions & 0 deletions _freeze/Clustering_Knowhow/execute-results/html.json
@@ -0,0 +1,17 @@
{
"hash": "1ca6a8639bee36b0c97f44c1c8821342",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Clustering_knowhow\"\nauthor: \"Niladri Dasgupta\"\ndate: \"2024-08-12\"\noutput: html_document\n---\n\n\n\n\n\n## **What is clustering?**\n\nClustering is a method of segregating unlabeled data points into groups/clusters such that points within a cluster are more similar to one another than to points in other clusters. The similarity measures are calculated using distance-based metrics like Euclidean distance, Cosine similarity, Manhattan distance, etc.\n\nFor example, in the graph given below, we can clearly see that the data points can be grouped into 3 clusters.\n\n![](images/Clustering/clustering_ex.PNG)\n<br>\n\n## **Types of Clustering Algorithms**\n\nSome of the popular clustering algorithms are:\n\n1. Centroid-based Clustering (Partitioning methods)\n2. Density-based Clustering (Model-based methods)\n3. Connectivity-based Clustering (Hierarchical clustering)\n4. Distribution-based Clustering\n\n### 1. Centroid-based Clustering (Partitioning methods)\n\nPartitioning methods group data points on the basis of their closeness. The similarity measures chosen for these algorithms are Euclidean distance, Manhattan distance or Minkowski distance.\n\nThe primary drawback of these algorithms is that we need to predefine the number of clusters before allocating the data points to a group.\n\nOne of the most popular centroid-based clustering techniques is K-means clustering.\n<br>\n\n#### **K-Means Clustering**\n\nK-means is an iterative clustering algorithm that works in these 5 steps: \n\n1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-D space.\n\n ![](images/Clustering/kmeans_1.png)\n\n2. Randomly assign each data point to a cluster: Let’s assign three points to cluster 1, shown using orange color, and two points to cluster 2, shown using grey color.\n\n ![](images/Clustering/kmeans_2.png)\n\n3. Compute cluster centroids: Centroids correspond to the arithmetic mean of data points assigned to the cluster. The centroid of data points in the orange cluster is shown using the orange cross, and those in the grey cluster using a grey cross. \n\n ![](images/Clustering/kmeans_3.png)\n\n4. Assign each observation to its closest centroid, based on the Euclidean distance between the observation and the centroid. \n\n ![](images/Clustering/kmeans_4.png)\n\n5. 
Re-compute the centroids for both clusters.\n\n ![](images/Clustering/kmeans_5.png)\n\n\nWe repeat steps 4 and 5 until no data points switch between the two clusters for two successive repeats.\n<br>\n\n\n#### K-Means Clustering in R\n\n\n**Step 1: Load packages**\n\nFirst, we’ll load the packages below, which contain several useful functions for k-means clustering in R.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(cluster) #Contain cluster function\nlibrary(dplyr) #Data manipulation\nlibrary(ggplot2) #Plotting function\nlibrary(readr) #Read and write excel/csv files\nlibrary(factoextra) #Extract and Visualize the Results of Multivariate Data Analyses\n```\n:::\n\n\n\n**Step 2: Load Data**\n\nWe have used the “Mall_Customer” dataset in R for this case study.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#Loading the data\ndf <- read_csv(\"data/Mall_Customers.csv\")\n\n#Structure of the data\nstr(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nspc_tbl_ [200 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)\n $ CustomerID : chr [1:200] \"0001\" \"0002\" \"0003\" \"0004\" ...\n $ Genre : chr [1:200] \"Male\" \"Male\" \"Female\" \"Female\" ...\n $ Age : num [1:200] 19 21 20 23 31 22 35 23 64 30 ...\n $ Annual Income (k$) : num [1:200] 15 15 16 16 17 17 18 18 19 19 ...\n $ Spending Score (1-100): num [1:200] 39 81 6 77 40 76 6 94 3 72 ...\n - attr(*, \"spec\")=\n .. cols(\n .. CustomerID = col_character(),\n .. Genre = col_character(),\n .. Age = col_double(),\n .. `Annual Income (k$)` = col_double(),\n .. `Spending Score (1-100)` = col_double()\n .. )\n - attr(*, \"problems\")=<externalptr> \n```\n\n\n:::\n:::\n\n\n\nThe dataset consists of 200 customers’ data with their age, annual income and spending score. \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#Rename the columns\ndf <- df %>% \n rename(\"Annual_Income\"= `Annual Income (k$)`, \"Spending_score\"= `Spending Score (1-100)`)\n\n#remove rows with missing values\ndf <- na.omit(df)\n\n#scale each variable to have a mean of 0 and sd of 1\ndf1 <- df %>% \n mutate(across(where(is.numeric), scale))\n\n#view first six rows of dataset\nhead(df1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 5\n CustomerID Genre Age[,1] Annual_Income[,1] Spending_score[,1]\n <chr> <chr> <dbl> <dbl> <dbl>\n1 0001 Male -1.42 -1.73 -0.434\n2 0002 Male -1.28 -1.73 1.19 \n3 0003 Female -1.35 -1.70 -1.71 \n4 0004 Female -1.13 -1.70 1.04 \n5 0005 Female -0.562 -1.66 -0.395\n6 0006 Female -1.21 -1.66 0.999\n```\n\n\n:::\n:::\n\n\n<br>\n\nWe have separated CustomerID and Genre from the clustering dataset because k-means can handle only numerical variables. \nTo cluster on categorical or ordinal variables we can use k-medoid clustering instead.\n<br>\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf1 <- df1[,4:5]\n```\n:::\n\n\n\n**Step 3: Find the Optimal Number of Clusters**\n\nTo perform k-means clustering in R we can use the built-in kmeans() function, which uses the following syntax:\n\n \n kmeans(data, centers, iter.max, nstart)\n where:\n - data: Name of the dataset.\n - centers: The number of clusters, denoted k.\n - iter.max (optional): The maximum number of iterations allowed. Default value is 10.\n - nstart (optional): The number of initial configurations. Default value is 1.\n\n\n \n- centers is the k of k-means: centers = 5 would result in 5 clusters being created. We need to **predefine the k** before the clustering process starts. 
\n- iter.max is the number of times the algorithm will repeat the cluster assignment and update the centers / centroids. Iteration stops after this many iterations even if the convergence criterion is not satisfied.\n- nstart is the number of times the initial starting points are re-sampled. \nAt initialization you specify how many clusters you want, and the algorithm randomly finds that number of centroids to start from. nstart lets you re-sample those initial centroids. \nFor example, if the number of clusters is 3 and nstart=25, then 3 starting centroids are drawn 25 times; for each draw the algorithm is run (up to iter.max iterations), the cost function (total within sum of squares) is evaluated, and the set of 3 centroids with the lowest cost function is chosen to start the clustering process.\n\n\nTo find the best number of clusters/centroids there are two popular methods, as shown below.\n\n[**A. Elbow Method:**]{.underline}\n\nIt has two parts, as explained below:\n\n- WSS: The Within Sum of Squares (WSS) is the sum of distances between the centroid and every other data point within a cluster. A small WSS indicates that every data point is close to its nearest centroid.\n\n- Elbow rule/method: Here we plot the WSS score against the number of clusters k. As k increases, the WSS always decreases; however, the magnitude of the decrease diminishes with each additional cluster, so the plot looks like an arm that curls up. In this way, we can find the point that falls on the elbow.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nwss<- NULL\n\n#Feeding different centroid/cluster and record WSS\n\nfor (i in 1:10){\n fit = kmeans(df1,centers = i,nstart=25)\n wss = c(wss, fit$tot.withinss)\n}\n\n#Visualize the plot\nplot(1:10, wss, type = \"o\", xlab='Number of clusters(k)')\n```\n\n::: {.cell-output-display}\n![](Clustering_Knowhow_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n:::\n\n\n\nBased on the above plot, at k=5 we can see an “elbow” where the sum of squares begins to “bend” or level off, so the ideal number of clusters should be 5.\n\n\nThe above process to compute the “Elbow method” has been wrapped up in a single function (fviz_nbclust):\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfviz_nbclust(df1, kmeans, method = \"wss\",nstart=25)\n```\n\n::: {.cell-output-display}\n![](Clustering_Knowhow_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\n\n\n[**B. Silhouette Method:**]{.underline}\n\nThe silhouette coefficient or silhouette score is a measure of how similar a data point is to its own cluster (intra-cluster) compared to other clusters (inter-cluster). \nThe silhouette coefficient is calculated using the mean *intra-cluster distance (a)* and the *mean nearest-cluster distance (b)* for each sample. The silhouette coefficient for a sample is *(b - a) / max(a, b)*.\n\nHere we will plot the silhouette width/coefficient for different numbers of clusters and choose the point where the silhouette width is highest. \n\n**Points to Remember While Calculating the Silhouette Coefficient:**\n\nThe value of the silhouette coefficient is between [-1, 1].\nA score of 1 is best, meaning that the data points in a cluster are very compact and far away from the other clusters.\nThe worst value is -1. 
Values near 0 denote overlapping clusters.\n\nIn this demonstration, we are going to see how the silhouette method is used.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsilhouette_score <- function(k){\n km <- kmeans(df1, centers = k,nstart = 25)\n ss <- silhouette(km$cluster, dist(df1))\n mean(ss[, 3])\n}\nk <- 2:10\n\navg_sil <- sapply(k, silhouette_score)\nplot(k, type='b', avg_sil, xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)\n```\n\n::: {.cell-output-display}\n![](Clustering_Knowhow_files/figure-html/unnamed-chunk-7-1.png){width=672}\n:::\n:::\n\n\n\nFrom the above plot we can see that the silhouette width is highest at 5 clusters, so the optimal number of clusters should be 5.\n\nSimilar to the elbow method, this process to compute the “average silhouette method” has been wrapped up in a single function (fviz_nbclust):\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfviz_nbclust(df1, kmeans, method='silhouette',nstart=25)\n```\n\n::: {.cell-output-display}\n![](Clustering_Knowhow_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\n\nThe optimal number of clusters is 5.\n\n\n**Step 4: Perform K-Means Clustering with Optimal K**\n\nLastly, we can perform k-means clustering on the dataset using the optimal value for k of 5:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#make this example reproducible\nset.seed(1)\n\n#perform k-means clustering with k = 5 clusters\nfit <- kmeans(df1, 5, nstart=25)\n#view results\nfit\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nK-means clustering with 5 clusters of sizes 22, 35, 81, 39, 23\n\nCluster means:\n Annual_Income Spending_score\n1 -1.3262173 1.12934389\n2 1.0523622 -1.28122394\n3 -0.2004097 -0.02638995\n4 0.9891010 1.23640011\n5 -1.3042458 -1.13411939\n\nClustering vector:\n [1] 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5\n [38] 1 5 1 5 1 5 3 5 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3\n [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3\n[112] 3 3 3 3 3 3 3 3 3 3 3 3 4 2 4 3 4 2 4 2 4 3 4 2 4 2 4 2 4 2 4 3 4 2 4 2 4\n[149] 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4 2\n[186] 4 2 4 2 4 2 4 2 4 2 4 2 4 2 4\n\nWithin cluster sum of squares by cluster:\n[1] 5.217630 18.304646 14.485632 19.655252 7.577407\n (between_SS / total_SS = 83.6 %)\n\nAvailable components:\n\n[1] \"cluster\" \"centers\" \"totss\" \"withinss\" \"tot.withinss\"\n[6] \"betweenss\" \"size\" \"iter\" \"ifault\" \n```\n\n\n:::\n:::\n\n\n\nWe can visualize the clusters on a scatterplot that displays the first two principal components on the axes using the fviz_cluster() function:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#plot results of final k-means model\n\nfviz_cluster(fit, data = df1)\n```\n\n::: {.cell-output-display}\n![](Clustering_Knowhow_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n\n\n**Step 5: Exporting the data by adding generated clusters**\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#Adding the clusters in the main data\n\ndf_cluster <- df %>% \n mutate(cluster=fit$cluster)\n\n#Creating Summary of created clusters based on existing variables\n\ndf_summary <- df_cluster %>% \n group_by(cluster) %>% \n summarise(records=n(),avg_age=mean(Age),avg_annual_income=mean(Annual_Income),avg_spending_score=mean(Spending_score))\n\nprint(df_summary)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 5 × 5\n cluster records avg_age avg_annual_income avg_spending_score\n <int> <int> <dbl> <dbl> <dbl>\n1 1 22 25.3 
25.7 79.4\n2 2 35 41.1 88.2 17.1\n3 3 81 42.7 55.3 49.5\n4 4 39 32.7 86.5 82.1\n5 5 23 45.2 26.3 20.9\n```\n\n\n:::\n:::\n\n\n\nWe can create a group of potential customers to target based on their age, average annual income and average spending score.\n",
"supporting": [
"Clustering_Knowhow_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
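For reviewers who want to reproduce the frozen k-means result above without rendering the page, the sketch below condenses the workflow described in that document. It assumes the same `data/Mall_Customers.csv` file with the `Annual Income (k$)` and `Spending Score (1-100)` columns; any object names not taken from the document are illustrative.

```r
library(readr)      # read_csv()
library(dplyr)      # data manipulation
library(cluster)    # clustering utilities such as silhouette()
library(factoextra) # fviz_nbclust(), fviz_cluster()

# Read the customer data and keep the two numeric features used for clustering
df <- read_csv("data/Mall_Customers.csv") %>%
  rename(Annual_Income = `Annual Income (k$)`,
         Spending_score = `Spending Score (1-100)`) %>%
  na.omit()

# Standardise the two clustering variables (mean 0, sd 1)
df1 <- df %>%
  select(Annual_Income, Spending_score) %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))

# Choose k with the elbow (WSS) and silhouette heuristics; per the frozen
# output, both point to k = 5
fviz_nbclust(df1, kmeans, method = "wss", nstart = 25)
fviz_nbclust(df1, kmeans, method = "silhouette", nstart = 25)

# Fit the final model and visualise the clusters
set.seed(1)
fit <- kmeans(df1, centers = 5, nstart = 25)
fviz_cluster(fit, data = df1)

# Profile the clusters on the original (unscaled) variables
df %>%
  mutate(cluster = fit$cluster) %>%
  group_by(cluster) %>%
  summarise(records = n(),
            avg_annual_income = mean(Annual_Income),
            avg_spending_score = mean(Spending_score))
```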
19 changes: 19 additions & 0 deletions _freeze/Comp/r-east_gsd_tte/execute-results/html.json

Large diffs are not rendered by default.

15 changes: 15 additions & 0 deletions _freeze/Comp/r-sas-summary-stats/execute-results/html.json
@@ -0,0 +1,15 @@
{
"hash": "6a38a66f4a654055f72935602e97866e",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Deriving Quantiles or Percentiles in R vs SAS\"\n---\n\n\n\n### Data\n\nThe following data will be used to show the differences between the default percentile definitions used by SAS and R:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n10, 20, 30, 40, 150, 160, 170, 180, 190, 200\n```\n:::\n\n\n\n### SAS Code\n\nAssuming the data above is stored in the variable `aval` within the dataset `adlb`, the 25th and 40th percentiles could be calculated using the following code.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nproc univariate data=adlb;\n var aval;\n output out=stats pctlpts=25 40 pctlpre=p;\nrun;\n```\n:::\n\n\n\nThis procedure creates the dataset `stats` containing the variables `p25` and `p40`.\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](../images/summarystats/sas-percentiles-output.jpg){fig-align='center' width=15%}\n:::\n:::\n\n\n\nThe procedure has the option `PCTLDEF` which allows for five different percentile definitions to be used. The default is `PCTLDEF=5`.\n\n### R Code\n\nThe 25th and 40th percentiles of `aval` can be calculated using the `quantile` function.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquantile(adlb$aval, probs = c(0.25, 0.4))\n```\n:::\n\n\n\nThis gives the following output.\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n 25% 40% \n 32.5 106.0 \n```\n\n\n:::\n:::\n\n\n\nThe function has the argument `type` which allows for nine different percentile definitions to be used. The default is `type = 7`.\n\n### Comparison\n\nThe default percentile definition used by the UNIVARIATE procedure in SAS finds the 25th and 40th percentiles to be 30 and 95. The default definition used by R finds these percentiles to be 32.5 and 106.\n\nIt is possible to get the quantile function in R to use the same definition as the default used in SAS, by specifying `type=2`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquantile(adlb$aval, probs = c(0.25, 0.4), type=2)\n```\n:::\n\n\n\nThis gives the following output.\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\n25% 40% \n 30 95 \n```\n\n\n:::\n:::\n\n\n\nIt is not possible to get the UNIVARIATE procedure in SAS to use the same definition as the default used in R.\n\nRick Wicklin provided a [blog post](https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html) showing how SAS has built-in support for calculations using 5 of the 9 percentile definitions available in R, and also demonstrated how you can use a SAS/IML function to calculate percentiles using the other 4 definitions.\n\nMore information about quantile derivation can be found in the [SAS blog](https://blogs.sas.com/content/iml/2021/07/26/compare-quantiles-sas-r-python.html).\n\n### Key references:\n\n[Compare the default definitions for sample quantiles in SAS, R, and Python](https://blogs.sas.com/content/iml/2021/07/26/compare-quantiles-sas-r-python.html)\n\n[Sample quantiles: A comparison of 9 definitions](https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html)\n\n[Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50(4), 361-365.](https://www.jstor.org/stable/2684934)\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
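The percentile comparison in the frozen file above can be checked directly in R without SAS; a minimal sketch using the same ten values (the vector name `aval` mirrors the variable used in the document):

```r
# The ten observations used in the comparison
aval <- c(10, 20, 30, 40, 150, 160, 170, 180, 190, 200)

# R's default definition (type = 7) gives 32.5 and 106
quantile(aval, probs = c(0.25, 0.4))

# type = 2 matches the default definition of SAS PROC UNIVARIATE (PCTLDEF=5),
# giving 30 and 95
quantile(aval, probs = c(0.25, 0.4), type = 2)
```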