
# Machine-learning

This repository contains everything a beginner needs to get started with ML. Follow and star this repo for regular updates.

It covers the data-preprocessing steps every beginner should know.

Python is the recommended language for ML beginners, and this repo is full of ML algorithms implemented in Python.

## Contributing Guidelines for Hacktoberfest2022

  1. Star the repo.

  2. Fork the repo.

  3. Clone it onto your PC.

  4. Create a folder named after your GitHub username.

  5. Create separate files for each issue you are solving. Always open an issue with all the details of the process or method you will use to perform anomaly detection, and wait until it is assigned (it should take no more than 2-3 hours; we are passionate open-source developers).

  6. Open PRs for the issues you are solving. (You can open multiple PRs for different issues by branching.)

  7. Make sure the data comes only from the given categories (no repetitions of the same data):

    a. Healthcare - COVID, heart attack, cancer, etc.

    b. Finance - stocks, etc.

    c. Retail or CPG

    d. Image classification

    e. Time series

Plain .py files are not accepted; please push proper Jupyter (.ipynb) notebooks containing both the problem statement and the solution analysis.

Python packages used: numpy, pandas, matplotlib, sklearn, statsmodels, keras, nltk, with more to be added.

## Regression

  1. In simple linear regression we use a dataset containing employee salaries and years of experience; with this model we can predict an employee's salary from years of experience.

  2. In multiple linear regression we use a dataset containing startup expenditures and profits; with this model we can predict a startup's profit. We have also built a model using the backward-elimination technique.

  3. In polynomial regression we use a dataset of salary and years of experience; this could help an HR department detect whether a new joinee is giving correct information about his/her previous salary.

  4. SVR (Support Vector Regression) uses the same salary and years-of-experience dataset for the same HR use case.

  5. Decision Tree regression also uses the salary and years-of-experience dataset for the same HR use case.

  6. Random Forest regression uses the salary and years-of-experience dataset for the same HR use case; of these, it gives the best result, better than polynomial regression.
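To see what item 1 boils down to, here is a from-scratch sketch of simple linear regression (ordinary least squares) on a tiny made-up salary vs. years-of-experience dataset; the actual notebooks use sklearn's LinearRegression, and these numbers are purely illustrative:

```python
# Made-up data: years of experience vs. salary (in thousands)
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [40.0, 45.0, 50.0, 55.0, 60.0]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(salary) / n

# slope = covariance(x, y) / variance(x); intercept from the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, salary)) / \
        sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

def predict(x):
    """Predict salary from years of experience."""
    return intercept + slope * x

print(predict(6.0))  # extrapolates to 65.0 on this perfectly linear toy data
```

The same fit-then-predict workflow carries over directly to the sklearn versions used in the notebooks.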

## Classification

  1. In logistic regression we use a dataset containing customers' salary, age, and whether they bought a product; with this model we can predict whether a customer of a certain age and salary will buy the product.

  2. KNN (k-nearest neighbours) classification uses the same salary/age/purchase dataset for the same prediction task.

  3. SVM (Support Vector Machine) classification uses the same dataset for the same task.

  4. Random Forest classification uses the same dataset for the same task.

  5. Decision Tree classification uses the same dataset for the same task.

  6. Kernel SVM uses the same dataset for the same task. Kernel SVM is mostly used for complicated datasets where the data is not linearly separable.

  7. Naive Bayes, one of the most important classification algorithms, uses the same dataset for the same task. Naive Bayes works on Bayes' probability theorem; before getting into the code, you should understand how the formula decides which class a point belongs to.
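As a taste of item 2, here is a minimal k-nearest-neighbours classifier on made-up (age, salary) points labelled 1 = bought, 0 = did not buy; the notebooks use sklearn's KNeighborsClassifier, and the data here is invented for illustration:

```python
from collections import Counter
import math

# (age, salary) -> bought? (1 = yes, 0 = no); purely illustrative points
train = [((25, 30000), 0), ((30, 40000), 0),
         ((45, 90000), 1), ((50, 110000), 1)]

def knn_predict(point, k=3):
    # sort training points by Euclidean distance and take a majority vote
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((48, 100000)))  # nearest neighbours are buyers -> 1
```

Note that salary dwarfs age in the distance here; in the real notebooks the features are standardised first, which matters a lot for KNN.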

## Clustering

  1. In k-means clustering we use a dataset containing gender, age, spending score, etc.; with this model we can predict which cluster a customer belongs to. This is an example of customer segmentation.

  2. In Hierarchical_Clustering we use mall.csv to cluster people according to their income and spending habits. Here the interesting concept of a dendrogram is introduced, which helps determine how many clusters are needed for the segmentation.
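The k-means loop itself is short enough to write by hand. Below is a bare-bones 1-D sketch (k = 2) on made-up spending scores; the notebooks use sklearn's KMeans on the mall dataset with proper initialisation:

```python
def kmeans_1d(points, k=2, iters=10):
    # naive initialisation: smallest and largest point as starting centroids
    centroids = [min(points), max(points)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

scores = [10, 12, 14, 80, 85, 90]  # two obvious spending groups
centroids, clusters = kmeans_1d(scores)
print(centroids)  # [12.0, 85.0]
```

The dendrogram-based hierarchical approach in item 2 answers the question this sketch dodges: how to choose k in the first place.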

## Association Rule Learning

  1. Market Basket Analysis is a machine-learning technique for identifying buying patterns across numerous retail transactions, helping retailers increase sales. We use the Apriori algorithm, which takes a Bayes-rule-like approach to find relationships between the products customers buy.

  2. We also analyse the market basket with the Eclat algorithm, again identifying buying patterns across numerous retail transactions to help retailers increase sales.
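Both algorithms are built on two quantities, support and confidence. Here is a hand-rolled computation of them for one candidate rule on an invented transaction list (library runs typically use apyori or mlxtend):

```python
# Each transaction is the set of items in one basket (made-up data)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Candidate rule: bread -> milk
sup = support({"bread", "milk"})          # how often both appear together
conf = sup / support({"bread"})           # of baskets with bread, how many have milk
print(sup, conf)  # 0.5 and 2/3
```

Apriori's trick is pruning: any itemset whose support falls below a threshold cannot have a frequent superset, so whole branches of candidates are skipped.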

## Dimensionality Reduction

In statistics, machine learning, and information theory, dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. In these datasets we do it with the following algorithms:

  1. PCA
  2. Kernel PCA
  3. LDA
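To make PCA concrete, here is a from-scratch 2-D sketch on made-up points: centre the data, build the 2x2 covariance matrix, and take its leading eigenvector (computable in closed form for 2x2) as the first principal component. The notebooks use sklearn.decomposition.PCA instead:

```python
import math

data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centred = [(x - mx, y - my) for x, y in data]

# Sample covariance matrix [[a, b], [b, c]]
a = sum(x * x for x, _ in centred) / (n - 1)
b = sum(x * y for x, y in centred) / (n - 1)
c = sum(y * y for _, y in centred) / (n - 1)

# Larger eigenvalue of a symmetric 2x2 matrix, closed form
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
vx, vy = b, lam - a                      # corresponding eigenvector
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)             # unit first principal component

# Projecting onto pc1 reduces each 2-D point to a single coordinate
projected = [x * pc1[0] + y * pc1[1] for x, y in centred]
print(pc1)
```

The projection step is the "reduction": two correlated variables collapse into one coordinate that keeps most of the variance.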

## Deep Learning

  1. Artificial Neural Network

Artificial neural networks (or connectionist systems) are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Here we use a dataset describing bank customers who leave or stay, and how several factors affect customer churn and retention, modelled with an ANN in Python.
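The building block of any ANN is a single neuron: a weighted sum of inputs passed through an activation function. A plain-Python forward pass with invented weights looks like this (the churn notebook itself builds full layers with Keras):

```python
import math

def sigmoid(z):
    # squashes any real number into (0, 1), usable as a probability
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # weighted sum of inputs plus bias, then the activation
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Two illustrative scaled features and arbitrary example weights
out = neuron([0.5, 0.8], [1.2, -0.4], 0.1)
print(out)
```

A full network is just many such neurons arranged in layers, with the weights learned by backpropagation rather than chosen by hand.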

  2. Convolutional Neural Network

Convolutional neural networks sound like a weird combination of biology and math with a little CS sprinkled in, but these networks have been among the most influential innovations in computer vision. Here we build a CNN that classifies (or rather identifies) cat and dog images.

## Natural Language Processing

Natural Language Processing (NLP) is the application of machine-learning models to text and language. Teaching machines to understand what is said in spoken and written word is the focus of NLP. In this code I have done the following:

  1. Clean the texts to prepare them for the machine-learning models,
  2. Create a Bag of Words model,
  3. Apply machine-learning models to this Bag of Words model.
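Step 2 above, the Bag of Words model, can be sketched with only the standard library (the notebook builds the same structure with sklearn's CountVectorizer); the two example documents are made up:

```python
from collections import Counter

docs = ["good food good place", "bad food"]

# Vocabulary = sorted set of all words across the corpus
vocab = sorted({w for d in docs for w in d.split()})

# Each document becomes a vector of word counts over that vocabulary
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)    # ['bad', 'food', 'good', 'place']
print(vectors)  # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Once every document is a fixed-length count vector, any of the classifiers from the Classification section can be applied to the text.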

## Model Selection

  1. XGBoost

XGBoost is a very fast, scalable implementation of gradient boosting that has taken data science by storm: models using XGBoost regularly win online data-science competitions and are used at scale across many industries.

  2. K-fold cross-validation

Cross-validation is a resampling procedure used to evaluate machine-learning models on a limited data sample.

The procedure has a single parameter, k, that refers to the number of groups a given data sample is split into; hence the name k-fold cross-validation. When a specific value of k is chosen, it may be used in place of k when referring to the model, e.g. k=10 becomes 10-fold cross-validation.
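The splitting itself is simple enough to write by hand. Here is a k-fold index split (k = 3, 9 samples, assuming the sample count divides evenly by k) showing how each fold serves exactly once as the validation set; sklearn's KFold does this, plus shuffling and remainder handling, in the notebooks:

```python
def k_fold_indices(n_samples, k):
    # Assumes n_samples is divisible by k, for simplicity
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        val = list(range(start, stop))                       # held-out fold
        train = [j for j in range(n_samples) if j not in val]  # the rest
        folds.append((train, val))
    return folds

for train, val in k_fold_indices(9, 3):
    print(val)  # each third of the data is validated exactly once
```

The model is trained k times, once per split, and the k validation scores are averaged to estimate generalisation performance.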

  3. GridSearchCV

GridSearchCV implements "fit" and "score" methods. It also implements "predict", "predict_proba", "decision_function", "transform", and "inverse_transform" if they are implemented in the estimator used.

The parameters of the estimator are optimized by a cross-validated grid search over a parameter grid.
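Stripped of the cross-validation, a grid search is just "try every parameter combination, score each, keep the best". A sketch with a stand-in scoring function (the parameter names and scores here are invented, not a real estimator):

```python
from itertools import product

# Hypothetical hyperparameter grid for illustration
param_grid = {"k": [1, 3, 5], "weight": ["uniform", "distance"]}

def score(params):
    # Stand-in for "fit the model and return its validation score"
    return params["k"] * (2 if params["weight"] == "distance" else 1)

# Enumerate the Cartesian product of all parameter values, keep the best
best = max(
    (dict(zip(param_grid, values)) for values in product(*param_grid.values())),
    key=score,
)
print(best)  # {'k': 5, 'weight': 'distance'}
```

GridSearchCV replaces the stand-in score with a k-fold cross-validated score of the real estimator, which is why the two model-selection tools above pair naturally.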

## Reinforcement Learning

Reinforcement Learning is a branch of Machine Learning, also called Online Learning. It is used to solve interacting problems where the data observed up to time t is used to decide which action to take at time t + 1. It is also used in Artificial Intelligence when training machines to perform tasks such as walking: desired outcomes give the AI a reward, undesired ones a punishment, and the machine learns through trial and error.

  1. Upper Confidence Bound (UCB)
  2. Thompson Sampling
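A compact UCB sketch for a 3-armed bandit with fixed, made-up reward probabilities shows the reward-driven trial and error described above: each round, play the arm whose average reward plus exploration bonus is highest.

```python
import math
import random

random.seed(0)                    # reproducible simulation
probs = [0.2, 0.5, 0.8]           # hidden success rate of each arm (made up)
counts = [0, 0, 0]                # times each arm was played
rewards = [0, 0, 0]               # total reward per arm

for t in range(1, 1001):
    if 0 in counts:
        # play every arm once before the UCB formula applies
        arm = counts.index(0)
    else:
        # average reward + confidence bonus; rarely-played arms get a boost
        arm = max(range(3), key=lambda i: rewards[i] / counts[i]
                  + math.sqrt(2 * math.log(t) / counts[i]))
    counts[arm] += 1
    rewards[arm] += 1 if random.random() < probs[arm] else 0

print(counts)  # the best arm (index 2) should dominate the play counts
```

Thompson Sampling tackles the same explore/exploit trade-off, but by sampling from a posterior over each arm's success rate instead of adding a deterministic bonus.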

## Recommendation Systems

I have shown two ways of building a recommendation engine: one with traditional methods and one with a scalable algorithm using PySpark, both on an open-source book-review dataset.