Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed
--Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E
--Tom Mitchell, 1997
Artificial Intelligence: human-like intelligence displayed by software and/or machines; the broader concept of machines being able to carry out tasks in a way that we would consider “smart”. AI concentrates on mimicking human decision-making processes and carrying out tasks in ever more human ways
Machine Learning: algorithms that can learn from data to make predictions; it focuses on the development of computer programs that can access data and use it to learn for themselves
Data Mining: the process of finding anomalies, patterns, and correlations within large data sets to predict outcomes; digging through data to discover hidden connections and predict future trends; also known as knowledge discovery in databases (KDD)
Pattern Recognition: the automated recognition of patterns and regularities in data
Supervised Learning
Starting from the analysis of a known training dataset with given "right answers" (labels), the trained model is able to make predictions about the output values
- Classification: classify new instances into known categories
- Regression: predict a target numeric value
- An attribute is a data type ("Mileage"), while a feature means an attribute plus its value ("Mileage = 15,000")
- Most important supervised learning algorithms: k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Decision Trees and Random Forests, Neural Networks
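A minimal sketch of the supervised setup, assuming scikit-learn is available and using its bundled iris dataset purely for illustration (any of the listed algorithms could be swapped in):

```python
# Supervised learning: labeled data in, trained classifier out, predictions on new cases.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                     # features X, "right answers" y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                             # learn from the labeled examples
print(clf.score(X_test, y_test))                      # accuracy on unseen instances
```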
Unsupervised Learning
- Group or label data without given "right answers" (labels)
- Most important unsupervised learning algorithms:
- Clustering: k-Means, DBSCAN, Hierarchical Cluster Analysis (HCA), Expectation Maximization
- Anomaly detection and novelty detection: One-Class SVM, Isolation Forest
- Visualization and dimensionality reduction: Principal Component Analysis (PCA), Kernel PCA, Locally Linear Embedding (LLE), t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning: Apriori, Eclat
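A minimal unsupervised sketch, assuming scikit-learn and a tiny made-up dataset: no labels are given, and k-Means groups the instances purely by similarity in feature space.

```python
# Unsupervised learning: cluster unlabeled points with k-Means.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9], [0.5, 9.0], [0.6, 8.8]])
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(km.labels_)           # cluster assignment for each instance
print(km.cluster_centers_)  # learned cluster centroids
```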
Semisupervised Learning
- Deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data
- Trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques
- Deep Belief Networks (DBNs) and Restricted Boltzmann Machines (RBMs)
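Not the DBN/RBM approach above, just a small sketch of the partially labeled setup itself, assuming scikit-learn: unlabeled instances are marked with -1 and LabelPropagation spreads the few known labels to them.

```python
# Semisupervised learning: a lot of unlabeled data (-1), a little labeled data.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.1], [4.9]])
y = np.array([0, -1, -1, 1, -1, -1])      # only two instances carry labels

model = LabelPropagation().fit(X, y)
print(model.transduction_)                # labels inferred for every instance
```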
Reinforcement Learning
The learning system, called an agent, observes the environment, selects and performs actions, and gets rewards in return, with the goal of maximizing the total reward over time. It learns by itself what the best strategy, called a policy, is. A policy defines what action the agent should choose when it is in a given situation
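A toy sketch of the agent/policy loop in plain Python, with a made-up one-dimensional corridor as the environment (not a standard RL benchmark): tabular Q-learning learns a policy that prefers moving toward the rewarded goal.

```python
# Reinforcement learning: the agent acts, observes rewards, and improves its policy.
import random

N_STATES, ACTIONS = 5, [0, 1]              # corridor of 5 cells; 0 = move left, 1 = move right
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # value estimate for each (state, action) pair
alpha, gamma, eps = 0.1, 0.9, 0.1          # learning rate, discount, exploration rate

for episode in range(500):
    s = 0
    while s != N_STATES - 1:               # rightmost cell is the goal
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[s][x])
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == N_STATES - 1 else 0.0          # reward only at the goal
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max(ACTIONS, key=lambda x: Q[s][x]) for s in range(N_STATES)]
print(policy)                              # learned policy: best action per state
```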
Batch Learning: offline learning, train the model using all the available data
Online learning: train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches
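A minimal online-learning sketch, assuming scikit-learn and a made-up stream of mini-batches: the model is updated incrementally with partial_fit instead of being retrained on the full dataset.

```python
# Online learning: feed the model mini-batches one at a time.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
model = SGDRegressor()

for _ in range(200):                        # pretend each iteration is a new mini-batch arriving
    X_batch = rng.uniform(0, 10, size=(32, 1))
    y_batch = 3.0 * X_batch.ravel() + rng.normal(0, 0.5, size=32)
    model.partial_fit(X_batch, y_batch)     # incremental update, no full retraining

print(model.coef_, model.intercept_)        # coefficient should drift toward the true slope of 3
```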
Instance-based learning: the system learns the examples by heart, then generalizes to new cases using a similarity measure
- Lazy Learners (K-Nearest Neighbors)
- Radial Basis Functions (RBF kernel)
- Case-Based Reasoning (CBR)
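A minimal instance-based sketch, assuming scikit-learn and a made-up dataset: k-NN simply stores the training examples and predicts a new case from its most similar stored neighbors.

```python
# Instance-based learning: memorize the examples, generalize by similarity.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # "training" = storing X and y
print(knn.predict([[1.7], [8.2]]))                    # the nearest stored neighbors decide
```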
Model-based learning: generalize by building a model from a set of examples, then use the built model to make predictions
Utility function (or fitness function): measures how good the model is
Cost function (or loss function): measures how bad the model is, e.g., the distance between the model's predictions and the training examples; the objective is to minimize this distance
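A minimal model-based sketch, assuming scikit-learn/NumPy and made-up numbers: fit a linear model, then compute the mean squared error by hand as one example of a cost function that training tries to minimize.

```python
# Model-based learning: build a model, then judge it with a cost function (here MSE).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])                 # roughly y = 2x

model = LinearRegression().fit(X, y)               # training picks the model parameters
predictions = model.predict(X)
mse = np.mean((predictions - y) ** 2)              # distance between predictions and examples
print(model.coef_, model.intercept_, mse)
```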
Unreasonable Effectiveness of Data
- Data matters more than algorithms
- Google paper: "The Unreasonable Effectiveness of Data" by Halevy, Norvig, and Pereira (2009)
Nonrepresentative Training Data
- It is crucial to use a training set that is representative of the new cases you want to generalize to
- If the sample is too small, there is sampling noise (nonrepresentative data as a result of chance)
- Even large samples can be nonrepresentative if the sampling method is flawed; this is called sampling bias
Poor-Quality Data
- Training data is full of errors, outliers, and noise
- If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually
- If some instances are missing a few features, decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values, or train one model with the feature and one model without it
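A minimal sketch of the missing-value options, assuming pandas/scikit-learn and a made-up toy table: drop the attribute, drop the incomplete instances, or fill the gaps in (here with the median).

```python
# Poor-quality data: three common ways to handle a feature with missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Mileage": [15000, np.nan, 32000, 8000],
                   "Price":   [9000, 12000, 7000, 14000]})

option_a = df.drop(columns=["Mileage"])          # ignore the attribute altogether
option_b = df.dropna(subset=["Mileage"])         # ignore the incomplete instances
option_c = df.copy()                             # fill in the missing values
option_c["Mileage"] = SimpleImputer(strategy="median").fit_transform(df[["Mileage"]]).ravel()
print(option_c)
```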
Irrelevant Features
The system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones
- Feature Engineering
- Feature selection: selecting the most useful features to train on among existing features
- Feature extraction: combining existing features to produce a more useful one, as in dimensionality reduction
- Creating new features by gathering new data
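A minimal feature-engineering sketch, assuming scikit-learn and its bundled iris data: selection keeps the most useful existing features, while extraction (here PCA) combines them into fewer new ones.

```python
# Feature selection vs. feature extraction.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)   # keep the 2 most useful features
X_extracted = PCA(n_components=2).fit_transform(X)             # combine features into 2 components
print(X_selected.shape, X_extracted.shape)
```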
Overfitting
The model performs well on the training data, but does not generalize well
Solutions:
- To simplify the model
- To gather more training data
- To reduce the noise in the training data
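A small sketch of the first solution, assuming scikit-learn/NumPy and synthetic data: the same high-degree polynomial is fit unconstrained and with Ridge regularization; the constrained model typically generalizes better to fresh data.

```python
# Overfitting: constraining (regularizing) a flexible model usually helps it generalize.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, size=30)
X_new = rng.uniform(0, 1, size=(200, 1))                         # fresh data the models never saw
y_new = np.sin(2 * np.pi * X_new).ravel() + rng.normal(0, 0.1, size=200)

overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
constrained = make_pipeline(PolynomialFeatures(15), Ridge(alpha=0.01)).fit(X, y)

print(overfit.score(X, y), overfit.score(X_new, y_new))          # train vs. fresh-data R²; gap is usually larger here
print(constrained.score(X, y), constrained.score(X_new, y_new))  # the regularized fit typically has a smaller gap
```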
Underfitting
The model is too simple to learn the underlying structure of the data
Solutions:
- Selecting a more powerful model
- Feeding better features to the learning algorithm
- Reducing the constraints on the model, e.g., reducing the regularization hyperparameter
- Bias refers to the error that is introduced by approximating a real-life problem with a model that is too simple (wrong assumptions)
- Variance is due to the model's excessive sensitivity to small variations in the training data; it refers to the amount by which the learned model would change if we estimated it using a different training data set
- Irreducible error is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data
- The higher the degrees of freedom, the greater the risk of overfitting: low bias, high variance
- The lower the degrees of freedom, the greater the risk of underfitting: high bias, low variance
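A small sketch of the variance side, assuming scikit-learn/NumPy and synthetic data (the helper fit_and_predict is made up for this note): the same two models are refit on several different training samples, and the spread of their predictions at a fixed point stands in for variance; the high-degree fit usually moves around far more than the straight line.

```python
# Degrees of freedom vs. variance: how much does the fitted model change across training sets?
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def fit_and_predict(degree, seed, x_query=0.5):
    """Fit a polynomial model of the given degree on a fresh sample, predict at x_query."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 1, size=(20, 1))
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, size=20)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    return model.predict([[x_query]])[0]

for degree in (1, 12):
    preds = [fit_and_predict(degree, seed) for seed in range(10)]
    print(degree, round(float(np.std(preds)), 3))   # larger spread across training sets = higher variance
```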
Cross Validation
- Overfitting, performs well on training data but generalizes poorly according to the cross-validation metrics
- Underfitting, performs poorly on both training data and the cross-validation metrics
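A minimal cross-validation sketch, assuming scikit-learn and its bundled iris data: 5-fold CV scores a decision tree on held-out folds, so overfitting shows up as a gap between training accuracy and the CV scores.

```python
# Cross-validation: evaluate on folds the model was not trained on.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=42)

scores = cross_val_score(tree, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())           # compare with the (near-perfect) training accuracy
```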
Learning Curves
- Plots of the model's performance on the training set and the validation set as a function of the training set size
- Underfitting, both curves have reached a plateau, they are close and fairly high
- Overfitting, there is a gap between the curves, the model performs significantly better on the training data than on the validation data
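A minimal learning-curves sketch, assuming scikit-learn and its bundled iris data: training and validation scores are computed for growing training-set sizes; curves that plateau close together suggest underfitting, while a persistent gap suggests overfitting.

```python
# Learning curves: performance as a function of training-set size.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

print(sizes)                           # training-set sizes used
print(train_scores.mean(axis=1))       # training-score curve
print(valid_scores.mean(axis=1))       # validation-score curve
```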
Regularization: constraining a model to make it simpler and reduce the risk of overfitting
Hyperparameters: parameters of the learning algorithm itself (set before training, not learned from the data) whose values control the learning process and thereby determine the values of the model parameters
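A minimal hyperparameter sketch, assuming scikit-learn and its bundled diabetes data: alpha (the Ridge regularization strength) is not learned from the data; it is chosen by grid search with cross-validation.

```python
# Hyperparameters: set before training, tuned with validation rather than learned from data.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)   # the regularization strength picked by cross-validation
```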
Reduction and Cleansing
- Dimensionality Reduction
- Sampling
- Cleaning
Data Wrangling
- Formatting
- Normalization
- Feature Engineering
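A small data-wrangling sketch, assuming pandas/scikit-learn and a made-up toy table: formatting a text column into numbers, normalizing a feature, and engineering a simple derived feature.

```python
# Data wrangling: formatting, normalization, and a simple engineered feature.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"price": ["9,000", "12,000", "7,500"],
                   "mileage_km": [15000, 64000, 32000]})

df["price"] = df["price"].str.replace(",", "").astype(float)                     # formatting: text -> number
df["mileage_scaled"] = MinMaxScaler().fit_transform(df[["mileage_km"]]).ravel()  # normalization to [0, 1]
df["price_per_km"] = df["price"] / df["mileage_km"]                              # feature engineering
print(df)
```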