
Data Science

Core: Programming Skills + Maths and Statistics + Subject Matter Expertise

Programming Skills: Python + SQL + JavaScript (D3.js for Data Presentation)

Maths and Statistics: Linear Algebra + Probability + Bayesian + Calculus






| Machine Learning Engineering (MLE) | Machine Learning Operations (MLOps) |
| --- | --- |
| Focuses on building products/services that are highly scalable and highly performant. | Focuses on delivering products/services in production and ensuring service quality is always maintained. |
| Focuses on providing permanent fixes in response to any incident/bug. | Focuses on ensuring the product/service is up and running. |
| Not an end-user-facing role. | End-user-facing role that requires strong communication skills. |

Outliers : Extreme Value Analysis | DBSCAN | 5 Number Summary | Algorithm ( KNN & Random Forest )
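A minimal sketch of the 5 Number Summary / IQR rule for flagging outliers, using numpy on a made-up array:

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17, 16, 107, 10, 13, 12, 14, 12])

# 5 number summary: min, Q1, median, Q3, max
q1, median, q3 = np.percentile(data, [25, 50, 75])
print(data.min(), q1, median, q3, data.max())

# IQR rule: values beyond 1.5 * IQR from the quartiles are treated as outliers
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # -> [102 107]
```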

Imbalanced : Up & Down Sampling | F1 Score | Stratified K Fold Cross Validation | Random Forest ( class_weight )
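A hedged sketch of the imbalance toolkit above (stratified K fold cross validation, F1 scoring, and class_weight in a random forest) on a synthetic 90/10 dataset with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced dataset: roughly 90% / 10% class split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# class_weight="balanced" up-weights the minority class during training
clf = RandomForestClassifier(class_weight="balanced", random_state=42)

# Stratified folds keep the class ratio in every split; score with F1 rather than accuracy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(clf, X, y, cv=cv, scoring="f1").mean())
```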

Overfitting : Apply Regularization | Apply Ensembles | Apply Cross Validation | Feature Selection
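One of the listed remedies, regularization, in a minimal scikit-learn sketch: Ridge adds an L2 penalty that shrinks coefficients, which typically reduces overfitting on small, noisy data (the dataset here is synthetic and illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Small, noisy dataset with many features, where plain least squares tends to overfit
X, y = make_regression(n_samples=60, n_features=30, noise=15.0, random_state=0)

# Compare unregularized and L2-regularized linear regression with cross validation
for model in (LinearRegression(), Ridge(alpha=1.0)):
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(type(model).__name__, round(score, 3))
```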

Time Complexity of Counting the Occurrence of Characters in a String : O(n)
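Counting character occurrences takes a single pass over the string, e.g. with collections.Counter:

```python
from collections import Counter

def char_counts(s: str) -> Counter:
    # Single pass over the string: O(n) time, O(k) space for k distinct characters
    return Counter(s)

print(char_counts("data science"))   # e.g. Counter({'a': 2, 'c': 2, 'e': 2, 'd': 1, ...})
```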

Log Function

  • Log is the inverse of the exponent.
  • e.g. If an investment grows 5× per year, a 125× return takes log₅ 125 = 3 years ( since 5³ = 125 ).
  • log₅ 5³ = 3 ( i.e. 3 · log₅ 5 = 3 · 1, because log₅ 5 = 1 )
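The same arithmetic checked in Python ( math.log takes the value and then the base ):

```python
import math

# log base 5 of 125: how many times must we multiply by 5 to reach 125?
print(math.log(125, 5))   # ~3.0, since 5 ** 3 == 125
print(math.log(5, 5))     # 1.0, the log of the base itself is always 1
```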

PyTest

  • A testing framework for Python that simplifies the process of writing and executing tests.
  • It provides an easy-to-use and expressive syntax for creating test cases, running tests, and reporting the results.
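A minimal illustration, assuming a file named test_math.py; pytest auto-discovers test_* functions and runs them via the pytest command:

```python
# test_math.py (run with: pytest test_math.py)

def add(a, b):
    return a + b

def test_add():
    # pytest uses plain assert statements; failures show both sides of the comparison
    assert add(2, 3) == 5

def test_add_negative():
    assert add(-1, 1) == 0
```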

Model = Algorithm ( Parameters ) + Data

Data Pipeline ( Where and how the data are collected, transformed and loaded )

  • A set of actions that extract data from various sources, transform it into the proper format, and load it for processing.
  • An automated process :
  1. Select columns from a database.
  2. Merge columns from two or more tables.
  3. Subset rows ( Sample ).
  4. Handle missing data.
  5. Load the result into another database.
  • The first time, the process is complicated, but if you build it right you only have to build it once.
  • To automate it you need to think, plan, and write it in simple language, and keep it reproducible.
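A minimal pandas sketch of those five steps, using an in-memory SQLite database with made-up tables so it runs end to end:

```python
import sqlite3
import pandas as pd

src = sqlite3.connect(":memory:")   # illustrative source database
dst = sqlite3.connect(":memory:")   # illustrative target database

# Seed a tiny source database so the sketch is self-contained
pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 11, 10],
              "amount": [250.0, None, 90.0]}).to_sql("orders", src, index=False)
pd.DataFrame({"customer_id": [10, 11],
              "region": ["North", "South"]}).to_sql("customers", src, index=False)

# 1. Select columns   2. Merge two tables
orders = pd.read_sql("SELECT order_id, customer_id, amount FROM orders", src)
customers = pd.read_sql("SELECT customer_id, region FROM customers", src)
df = orders.merge(customers, on="customer_id")

# 3. Subset rows (a sample; here the whole toy table)   4. Handle missing data
df = df.sample(frac=1.0, random_state=42)
df = df.dropna(subset=["amount"])

# 5. Load the result into another database
df.to_sql("orders_clean", dst, if_exists="replace", index=False)
print(pd.read_sql("SELECT * FROM orders_clean", dst))
```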

Data Lake

  • A storage repository where data is stored in its natural / raw format, without applying any transformation.
  • A data warehouse uses a file-and-folder structure; a data lake uses a flat architecture.

Important Disclaimer

  • We try to make our model more accurate by tuning and tweaking its parameters.
  • But we cannot make a 100% accurate model.
  • Prediction and classification models can never be error free.

Y = f ( x ) + e

Y : Response Variable | Dependent Variable

x : Independent variable

e : Irreducible error ( even if we make a 100% accurate estimate of f ( x ), our model still cannot be error free; this remaining error is known as the irreducible error )

Activation Function

  • A function that takes the weighted sum of all the inputs from the previous layer, adds a bias, and generates the output for the next layer.
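A small numpy sketch: two common activation functions applied to a weighted sum plus bias (weights and inputs are made up):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive values through, zeroes out negatives
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # inputs from the previous layer
w = np.array([0.4, 0.1, -0.6])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum plus bias
print(sigmoid(z), relu(z))       # outputs passed to the next layer
```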

Hyperparameter Optimization

  • Finding the ideal set of hyperparameters for a prediction algorithm so that it delivers optimum performance.
| Parameter | Hyperparameter |
| --- | --- |
| Learned automatically during training | Manually tuned by the developer to guide the training |
| Weights and biases are the model parameters | Learning rate, depth of tree, class weights |
| Internal configuration variables of the model | External configuration variables of the model |
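One common way to tune hyperparameters is a grid search with cross validation; a hedged scikit-learn sketch with an illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Hyperparameters (external knobs) to search over; the fitted trees hold the learned parameters
param_grid = {"max_depth": [3, 5, None], "n_estimators": [50, 100]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```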
| Data Warehouse | Data Lake |
| --- | --- |
| Structured + pre-processed | Unstructured + semi-structured + structured + raw |
| Organized before storing | Organized before using |
| Business professionals, analysts, BI and visualization | Data scientists, analytics and AI |
| DBMS | RDBMS |
| --- | --- |
| Stores data in the form of files | Stores data in the form of tables |
| Hierarchical arrangement of data | Rows and columns ( tables ) |
| Manages data on a computer | Maintains relationships between tables in a database |
| Classification | Clustering |
| --- | --- |
| Need prior knowledge of data | No prior knowledge of data |
| Classify new sample into known classes | Suggest groups based on patterns in data |
| Decision tree | K Means |
| Labelled samples | Unlabelled samples |
| LDA | PCA |
| --- | --- |
| Linear Discriminant Analysis | Principal Component Analysis |
| Supervised | Unsupervised |
| K Means | K Nearest Neighbors |
| --- | --- |
| Unsupervised | Supervised |
| K : number of clusters | K : number of nearest neighbors |
| Determines the distance of each data point to the centroids and assigns each point to the closest cluster centroid | Calculates the distance between the new data point and its K nearest neighbours |
| Variance ( s² ) | Standard Deviation ( s ) |
| --- | --- |
| Average squared deviation of the data points from the mean of the dataset | Square root of the variance; the typical distance of a data point from the mean, in the original units |
| Variance | Covariance |
| --- | --- |
| Magnitude only | Magnitude and direction |
| How data points vary from their mean | How two variables vary with respect to each other |

Which Algorithm Generates the Best Model ?

| Accuracy | Latency |
| --- | --- |
| How do they handle data of different sizes ? | How long will it take to train the model ? |
| How do they handle the complexity of feature relationships ? | How long will it take to predict the dependent variable ? |
| How do they handle messy data ( Missing Data + Outliers ) ? | |

Autocorrelation

  • The correlation of a data series with a delayed copy of itself.
  • e.g. Today's temperature vs. yesterday's or tomorrow's temperature.
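A minimal pandas check (the temperature series is made up); Series.autocorr correlates the series with a lagged copy of itself:

```python
import pandas as pd

# Illustrative daily temperatures
temps = pd.Series([21.0, 22.5, 23.1, 22.0, 24.3, 25.0, 24.1, 23.5, 22.8, 24.0])

# Correlation of the series with itself shifted by one day
print(temps.autocorr(lag=1))
```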

Multicollinearity

  • A phenomenon in which two or more independent variables are linearly correlated ( one can be predicted from the others ).
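One common check is the variance inflation factor (VIF); a hedged statsmodels sketch on synthetic columns where x2 is nearly a linear copy of x1, so x1 and x2 should show large VIFs:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)   # nearly a linear copy of x1
x3 = rng.normal(size=200)                       # independent column

# Add a constant column, the usual convention before computing VIFs
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```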

Cross Join | Cartesian Product

  • Generates every paired combination of each row of the first table with each row of the second table.
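A small pandas illustration of the Cartesian product (pandas 1.2+ supports how="cross"; the SQL equivalent is CROSS JOIN):

```python
import pandas as pd

sizes = pd.DataFrame({"size": ["S", "M", "L"]})
colors = pd.DataFrame({"color": ["red", "blue"]})

# Every size paired with every color: 3 x 2 = 6 rows
combos = sizes.merge(colors, how="cross")
print(combos)
```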

Data Scientist Steps

  1. Explore ( EDA ) and clean ( Data Cleaning ) the data.
  2. Split data into train + validate + test sets.
  3. Train with an initial model and evaluate.
  4. Tune hyperparameters + cross validation ( Assurance of accuracy )
  5. Evaluate on validation set ( Performance )
  6. Evaluate on test set ( Prediction )
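A hedged scikit-learn sketch of steps 2 to 6 (dataset, model, and parameter grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 2. Split into train + validate + test (60 / 20 / 20)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 3. Train and evaluate an initial model
base = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("baseline F1:", f1_score(y_val, base.predict(X_val)))

# 4. Tune hyperparameters with cross validation
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 5. Evaluate on the validation set, then 6. a final check on the test set
print("validation F1:", f1_score(y_val, search.predict(X_val)))
print("test F1:", f1_score(y_test, search.predict(X_test)))
```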