The course aims to develop both the math and programming skills required of a data scientist. It offers insight into data analysis problems that arise across business verticals and shows how to solve them using statistical and machine learning approaches. The course also focuses on understanding the fundamental math underlying those models. It is a practical, research-oriented course rather than a developer-oriented one. It focuses on the 6 most common data analysis problems that arise in most business verticals: Classification, Regression, Recommender Systems, Clustering, Association Analysis and Outlier Detection.

**Objectives**

Upon successful completion of the Data Science/Analytics course, participants will be able to:

- Understand how statistical data analysis techniques are used in business decision making, and apply them
- Understand and apply machine learning techniques in business data analysis
- Solve a data analysis use case, from inception to deployment, on their own
- Apply algorithms to build machine intelligence

Algorithmica, founded in 2008, is a world-class corporate training company that focuses on improving and expanding the engineering skills of developers and on enhancing the quality of the software they develop. Since 2008, Algorithmica has been helping IT professionals get better at what they do by providing an extensive range of training services on emerging technologies. Always pushing the envelope, Algorithmica constantly explores new fields of knowledge as well as new training methodologies to better serve clients. The company is led by a team of experts who are IIT alumni, with decades of accumulated experience in software development, architectural design and project management. The team has provided authentic, comprehensive and high-quality training services to a good number of companies in recent years, ranging from small start-ups to large enterprises. We pride ourselves on standing by our commitment to help IT professionals get to the next level, on being in tune with our customers' actual needs, and on always delivering what we promise, all while having fun doing it.

**Syllabus**

**1. Introduction to Data Science/Analytics**

- Why do companies care about Data Scientists/Analysts?
- Data Analytics: OLAP vs Data Mining
- What is Data Science? Why Data Science?
- Data driven product engineering
- Skill set of a Data Scientist and how to become a Data Scientist
- Who is hiring? Career Opportunities

**2. Data Analysis Problems/Use Cases in Business**

- Predictive Analytics Problems: Classification, Regression, Recommenders
- Descriptive Analytics Problems: Frequent Pattern Mining, Clustering, Outlier Detection
- Types of Data: Structured, Time-Series, Text, Image, Voice and Video data
- Business Verticals: Retail, Banking, Financial, Social, Web, Medical, Scientific, Logistics, Real Estate

**3. Tools for Data Science/Analytics**

- Data Life Cycle for Analysis
- Technologies for Data Science/Analytics
- Single Machine Analytic Platforms: R, Python
- Distributed Analytical Platforms: Hadoop, Spark, H2O
- Datasets for doing data science/analytics

**4. Mastering R/Python Language**

- IDE for R/Python
- basic data structures
- basic features
- advanced features
- packages required for data science in R/Python
- Lab Session

**5. Linear Algebra for Data Scientists**

- Ideas that need Linear Algebra
- Vector Algebra
- ideas that map to vectors
- understanding vector operations
- understanding linear independence
- applications of dot product
- Lab Session

- Matrix Algebra
- ideas that map to matrices
- understanding matrix operations
- understanding determinant
- understanding eigenvalues and eigenvectors
- understanding inverse
- understanding rank
- understanding positive definiteness & semi-definiteness
- concept of basis
- basis, orthogonal basis and orthonormal basis
- understanding basis change

- understanding factorization
- Spectral factorization
- Eigen factorization
- SVD factorization
- (Optional) LU factorization
- (Optional) QR factorization

- applications of matrices
- image processing
- solving systems of equations
- modelling discrete systems

- Lab Session

**6. Statistics for Data Scientists**

- Ideas that need statistics
- Descriptive stats for single variable
- mean, median, mode, quantiles, percentiles
- standard deviation, variance
- MAD, IQR

- Descriptive stats for two variables
- covariance
- correlation
- chi-squared Analysis

- Hypothesis Testing
- Inferential Statistics
- Lab Session

**7. Probability for Data Scientists**

- Ideas that need probabilistic analysis
- Basic Probability, Conditional Probability
- Bayes Rule/Reasoning
- MAP vs MLE Reasoning
- Mapping Random process to Random variable
- Properties of Random variables
- expectation
- variance
- entropy and cross-entropy
- covariance and correlation

- Estimating probability of Random variable
- Understanding standard random processes
- Probability Distributions: Normal, Gamma, Poisson, Dirichlet, Bernoulli, Binomial, Power Law, Log-normal, Multinomial
- Parameter Estimation in Distributions: MAP and MLE approaches
- Lab Session

**8. Calculus for Data Scientists**

- Ideas that need calculus
- Rate of change
- Concept of limit
- Concept of derivative
- Partial derivatives & gradient
- Significance of gradient
- Concept of integration
- Applications of calculus
- Lab Session

**9. Optimization Theory for Data Scientists**

- Ideas with optimization requirement
- Modelling ML problems with optimization requirements
- Solving unconstrained optimization problems
- Solving optimization problems with linear constraints
- Gradient descent ideas
- gradient descent, steepest descent ideas
- batch gradient descent
- stochastic gradient descent

- Lab Session
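
To make the difference between the batch and stochastic variants listed above concrete, here is a minimal sketch in plain Python. The one-parameter model `y ≈ w * x`, the learning rate, the epoch count and the toy data are all illustrative assumptions, not course material.

```python
import random


def batch_gradient_descent(xs, ys, lr=0.01, epochs=200):
    """Fit y ~ w * x by minimizing mean squared error with full-batch updates."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of (1/n) * sum((w*x - y)^2) with respect to w, over all samples.
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w


def stochastic_gradient_descent(xs, ys, lr=0.01, epochs=200, seed=0):
    """Same objective, but every update uses the gradient of a single sample."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a fresh random order each epoch
        for i in indices:
            grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]
            w -= lr * grad
    return w


# Toy data generated from y = 2x, so the minimizer is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_batch = batch_gradient_descent(xs, ys)
w_sgd = stochastic_gradient_descent(xs, ys)
print(w_batch, w_sgd)
```

On this noiseless data both variants approach the same `w`. On noisy data the stochastic version typically fluctuates around the minimizer unless the learning rate is decayed.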

**10. Classification Problem**

- What is classification?
- Classification Examples in Business Verticals
- Solution strategies for classification
- Finding pattern and Fixed Pattern Approach
- Limitations of Fixed Pattern Approach
- Machine Learning Approaches for classification
- KNN, Decision Trees, SVM, Naive Bayes
- Logistic Regression, Neural Network, Ensembles

- How do you handle overfitting?
- Evaluation Metrics for Classification Algorithms
- Confusion Matrix, Accuracy, Error Rate
- Precision, Recall and F-Score
- ROC curve, AUC
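
As an illustration of the evaluation metrics listed above, the following sketch computes the confusion-matrix counts, accuracy, precision, recall and F1 for a binary classifier in plain Python. The function names and the toy labels are hypothetical and chosen only for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem with the given positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (assumed for illustration): 3 TP, 1 FP, 1 FN, 3 TN.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts, metrics)
```

The error rate is simply `1 - accuracy`. The guards against zero denominators return 0.0 by convention here; other conventions exist.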

**11. Regression Problem**

- What is Regression?
- Regression Examples in Business Verticals
- Solution strategies for Regression
- Finding pattern and Fixed Pattern Approach
- Limitations of Fixed Pattern Approach
- Machine Learning Approaches for regression
- KNN, Linear Regression, Ridge and Lasso Regression
- Decision Trees, SVM, Neural Network, Ensembles

- How do you handle overfitting?
- Evaluation Metrics for Regression Algorithms
- RMSE (Root Mean Squared Error)
- Mean Absolute Deviation (MAD)

**12. Recommendation Problem**

- What is a Recommendation System?
- Top-N Recommender
- Rating Prediction

- Recommendations in Business Verticals
- Solution strategies for Recommender System
- Content based Recommenders
- Limitations of Content based recommenders
- Machine Learning Approaches for Recommenders
- User-User KNN model, Item-Item KNN model
- Factorization or latent factor model

- Hybrid Recommenders

- How do you handle overfitting?
- Evaluation Metrics for Recommendation Algorithms
- Top-N Recommender: Accuracy, Error Rate
- Rating Prediction: RMSE

**13. Frequent Pattern Mining**

- What is Frequent Pattern Mining?
- Frequent Pattern Mining in Business Verticals
- Solution strategies for Frequent Pattern Mining
- Finding pattern and Fixed Pattern Approach
- Limitations of Fixed Pattern Approach
- Machine Learning Approaches for Frequent Pattern Mining
- Apriori, Eclat, FP-Growth

- Evaluation Metrics for Frequent Pattern Mining
- Support, Confidence, Lift
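
The three metrics above can be computed directly from a list of transactions. The sketch below uses plain Python sets; the basket contents and the rule `bread -> milk` are made-up illustrations.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the baseline support of the consequent.

    A value above 1 suggests a positive association, below 1 a negative one.
    """
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Toy market baskets (assumed for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(transactions, {"bread", "milk"})
c = confidence(transactions, {"bread"}, {"milk"})
l = lift(transactions, {"bread"}, {"milk"})
print(s, c, l)
```

Algorithms such as Apriori use the support threshold to prune candidate itemsets before rules are scored with confidence and lift.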

**14. Clustering Problem**

- What is Clustering?
- Clustering Examples in Business Verticals
- Solution strategies for Clustering
- Finding pattern and Fixed Pattern Approach
- Limitations of Fixed Pattern Approach
- Machine Learning Approaches for Clustering
- Iterative based K-Means & K-Medoid Approaches
- Hierarchical Agglomerative Approaches
- Density based DB-SCAN Approach

- Evaluation Metrics for Clustering
- Cohesion, Coupling Metrics
- Correlation Metric

**15. Outlier Problem**

- What are Outliers?
- Outlier Examples in Business Verticals
- Solution strategies for Outlier Detection
- Finding pattern and Fixed Pattern Approach
- Limitations of Fixed Pattern Approach
- Machine Learning Approaches for Outliers
- Probabilistic Approach, KNN Approach
- Density based LOF Approach, Cluster Based Approach

**16. Overview of Machine Learning Algorithms**

- What is Machine Learning?
- Pipeline for ML Algorithms
- Pipeline Stages: Data Collection, Data Preparation, Feature Engineering, Model Building, Model Evaluation and Model Deployment
- Supervised, Unsupervised and Semi-supervised ML Algorithms

**17. Data Collection Techniques**

- Collecting data from Excel/csv/tsv files
- Collecting data from databases
- Collecting data from services
- Collecting data via scraping
- Lab Session

**18. Data Preparation Techniques**

- Structured Data Preparation
- Data Type Conversion
- Category to Numeric Conversion
- Numeric to Category Conversion

- Data Normalization: 0-1, Z-Score
- Handling Skewed Data: Box-Cox Idea
- Handling Missing Data

- Text Data Preparation
- Normalizing Text
- Stop word Removal
- Whitespace Removal
- Stemming
- Building Document Term Matrix

- Image Data Preparation
- Converting to gray scale
- Pixel Value Normalization
- Building Pixel Intensity Matrix

- (Optional) Voice Data Preparation
- (Optional) Video Data Preparation
- Lab Session
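
The two normalization schemes named in this section (0-1 and Z-score) can be sketched with the Python standard library as follows. The sample values are made up, and using the population standard deviation rather than the sample one is an assumption of this sketch.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Center values on their mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy feature column (assumed for illustration): mean 5, population std 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
scaled_01 = min_max_scale(values)
scaled_z = z_score_scale(values)
print(scaled_01)
print(scaled_z)
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range, while Z-scores keep the shape of the distribution but are not bounded.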

**19. EDA (Numerical + Graphical) and Feature Engineering**

- Exploring Individual Features
- Exploring Bi-Feature Relationships
- Exploring Multi-feature Relationships
- Feature/Dimension Reduction: PCA
- Intuition behind PCA
- Covariance & Correlation
- Relating PCA to Covariance/Correlation
- Intuition to math
- Applications of PCA: Dimensionality Reduction, Image Compression

- (Optional) Automatic Feature Extraction via Deep Learning
- Lab Session

**20. Classification and Regression: KNN Model**

- Intuitive idea of KNN classification
- KNN learning
- Limitations of KNN
- KNN Regression
- Applying KNN and parameter tuning
- Pros and Cons of the Model
- Lab Session

**21. Classification and Regression: Decision Tree Model**

- Intuitive Idea of Decision Tree for classification
- Decision Tree Learning
- Approaches for tree learning: Entropy, Information Gain, Information Gain Ratio, Gini Index, Misclassification Error
- How to control over-fitting in tree learning?
- Comparing ID3, CART, C4.5
- Decision Trees for Regression
- Applying Decision Tree and parameter tuning
- Pros and Cons of the Model
- Lab Session

**22. Classification and Regression: Naive Bayes Model**

- Intuitive idea of Naive Bayes classification
- Math of Naive Bayes Model
- Naive Bayes learning
- Limitations of Naive Bayes Learning
- Smoothing in Naive Bayes Learning
- Applying Naive Bayes model and parameter tuning
- Pros and Cons of the Model
- Lab Session

**23. Classification: Logistic Regression**

- Intuitive Idea of Logistic Regression
- Math of Logistic Regression
- Logistic Regression Learning
- Applying Logistic Regression and parameter tuning
- Pros and Cons of the Model
- Lab Session

**24. Classification and Regression: SVM Model**

- Intuitive Idea of SVM classification
- Transforming SVM idea to Math
- Hard-margin SVM Learning
- Limitations of Hard-margin SVM Learning
- Soft-margin SVM Learning
- Limitations of Soft-margin SVM Learning
- Kernel SVM Learning
- Generalizing SVM to multi-classes
- SVM Regression
- Applying SVM and parameter tuning
- Pros and Cons of the Model
- Lab Session

**25. Classification and Regression: Neural Network Model**

- Intuitive idea of Neural Network
- Perceptron model for classification and regression
- Perceptron Learning
- Limitations of Perceptron model
- Multi-layer FF NN model for classification and regression
- ML-FF-NN Learning with backpropagation
- Applying ML-FF-NN and parameter tuning
- Pros and Cons of the Model
- Lab Session

**26. Classification and Regression: Ensemble Model**

- Intuitive Idea of Ensemble for classification
- Understanding Weak Learners
- Approaches for Ensemble learning: Boosting, Bagging and Randomization
- Bagging Idea in depth and why it works?
- Bagged Tree Model Learning
- Boosting Idea in depth and why it works?
- Boosting variations: AdaBoost & GradientBoost
- Boosted Tree Model Learning
- Ensembles for Regression
- Applying Bagging and Boosting and parameter tuning
- Pros and Cons of the Model
- Lab Session

**27. Recommenders: Content based Recommendation**

- Building user/item profiles
- Recommendation Algorithm based on content
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**28. Recommenders: User-User KNN Model**

- Building user/user similarity matrix from rating matrix
- Recommendation Algorithm based on user-user similarity matrix
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**29. Recommenders: Item-Item KNN Model**

- Building Item/Item similarity matrix from rating matrix
- Recommendation Algorithm based on item-item similarity matrix
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**30. Recommenders: Latent Factor Model**

- Building factors of rating matrix
- Recommendation Algorithm based on matrix factors
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**31. Clustering: Iterative Models**

- Intuitive Idea of Iterative Model
- K-Means & K-Medoid Models
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**32. Clustering: Hierarchical Models**

- Intuitive Idea of Hierarchical Model
- Agglomerative Models: Single, Complete, Average Link
- Agglomerative Models: Centroid, Custom Link
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**33. Clustering: Density Models**

- Intuitive Idea of Density Model
- DB-SCAN Model
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**34. Outliers: Probabilistic Model**

- Intuitive Idea
- Probabilistic Model
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**35. Outliers: KNN Model**

- Intuitive Idea
- KNN Model
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**36. Outliers: Density Model**

- Intuitive Idea
- LOF Model
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**37. Association Analysis: Apriori Model**

- Intuitive Idea
- Apriori Model
- Applying the Algorithm and tuning
- Pros and Cons of the Model
- Lab Session

**38. Distributed/Big Data Analytics**

- Analytics at Scale
- Platforms for Distributed Analytics: Hadoop, Spark, H2O
- Lab Session

**39. (Optional) Data Visualization**

- Need of Data visualization in practice
- D3 basics + Lab Session

**40. Project (4-day Hackathon)**

- Hackathon (Day 1)
- Hackathon (Day 2)
- Hackathon (Day 3)
- Hackathon (Day 4)

Developers at all levels, BI professionals, DataWarehousing Professionals, Team Leads, Analytics Managers & Business Managers.

**Prerequisites**

Nothing but passion & interest towards data engineering

- AC Classroom
- Power Backup
- Lift
- Purified Water
- Four Wheeler Parking
- Two Wheeler Parking
- Hostel Support
- Girls Wash Room
- Female Staff
- Fire Alarm System
- Fire Extinguishers
- Manned Security Building
- Security Cams Facility

- Hours 90
- Online Query Support
- Online Tests
- Telephonic Query Support
- Video Classes
- Study Content
- Class Hand Outs