Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed
--Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E
--Tom Mitchell, 1997
Artificial Intelligence: human-like intelligence displayed by software and/or machines; the broader concept of machines being able to carry out tasks in a way that we would consider “smart”. AI concentrates on mimicking human decision-making processes and carrying out tasks in ever more human ways
Machine Learning: algorithms that can learn from data to make predictions; it focuses on the development of computer programs that can access data and use it to learn for themselves
Data Mining: the process of finding anomalies, patterns, and correlations within large data sets to predict outcomes; digging through data to discover hidden connections and predict future trends; also known as knowledge discovery in databases (KDD)
Pattern Recognition: the automated recognition of patterns and regularities in data
Supervised Learning
Starting from the analysis of a known training dataset with given "right answers" (labels), the trained model is able to make predictions about the output values
- Classification: classify new instances into known categories
- Regression: predict a target numeric value
- An attribute is a data type ("Mileage"), while a feature means an attribute plus its value ("Mileage = 15,000")
- Most important supervised learning algorithms: k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Decision Trees and Random Forests, Neural Networks
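A minimal sketch of the supervised setup, assuming scikit-learn is available and using its bundled iris dataset purely for illustration (any of the listed algorithms could be swapped in):

```python
# Supervised learning: labeled data in, trained classifier out, predictions on new cases.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                     # features X, "right answers" y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                             # learn from the labeled examples
print(clf.score(X_test, y_test))                      # accuracy on unseen instances
```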
Unsupervised Learning
- Group or label data without given "right answers" (labels)
- Most important unsupervised learning algorithms:
- Clustering: k-Means, DBSCAN, Hierarchical Cluster Analysis (HCA), Expectation Maximization
- Anomaly detection and novelty detection: One-Class SVM, Isolation Forest
- Visualization and dimensionality reduction: Principal Component Analysis (PCA), Kernel PCA, Locally Linear Embedding (LLE), t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning: Apriori, Eclat
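A minimal unsupervised sketch, assuming scikit-learn and a tiny made-up dataset: no labels are given, and k-Means groups the instances purely by similarity in feature space.

```python
# Unsupervised learning: cluster unlabeled points with k-Means.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9], [0.5, 9.0], [0.6, 8.8]])
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(km.labels_)           # cluster assignment for each instance
print(km.cluster_centers_)  # learned cluster centroids
```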
Semisupervised Learning
- Deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data
- Trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques
- Deep Belief Networks (DBNs) and Restricted Boltzmann Machines (RBMs)
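Not the DBN/RBM approach above, just a small sketch of the partially labeled setup itself, assuming scikit-learn: unlabeled instances are marked with -1 and LabelPropagation spreads the few known labels to them.

```python
# Semisupervised learning: a lot of unlabeled data (-1), a little labeled data.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.1], [4.9]])
y = np.array([0, -1, -1, 1, -1, -1])      # only two instances carry labels

model = LabelPropagation().fit(X, y)
print(model.transduction_)                # labels inferred for every instance
```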
Reinforcement Learning
The learning system, called an agent, observes the environment, selects and performs actions, and gets rewards in return, with the goal of maximizing the total reward over time. It learns by itself what the best strategy, called a policy, is. A policy defines what action the agent should choose when it is in a given situation
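A toy sketch of the agent/policy loop in plain Python, with a made-up one-dimensional corridor as the environment (not a standard RL benchmark): tabular Q-learning learns a policy that prefers moving toward the rewarded goal.

```python
# Reinforcement learning: the agent acts, observes rewards, and improves its policy.
import random

N_STATES, ACTIONS = 5, [0, 1]              # corridor of 5 cells; 0 = move left, 1 = move right
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # value estimate for each (state, action) pair
alpha, gamma, eps = 0.1, 0.9, 0.1          # learning rate, discount, exploration rate

for episode in range(500):
    s = 0
    while s != N_STATES - 1:               # rightmost cell is the goal
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[s][x])
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == N_STATES - 1 else 0.0          # reward only at the goal
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max(ACTIONS, key=lambda x: Q[s][x]) for s in range(N_STATES)]
print(policy)                              # learned policy: best action per state
```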
Batch Learning: offline learning, train the model using all the available data
Online learning: train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches
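A minimal online-learning sketch, assuming scikit-learn and a made-up stream of mini-batches: the model is updated incrementally with partial_fit instead of being retrained on the full dataset.

```python
# Online learning: feed the model mini-batches one at a time.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
model = SGDRegressor()

for _ in range(200):                        # pretend each iteration is a new mini-batch arriving
    X_batch = rng.uniform(0, 10, size=(32, 1))
    y_batch = 3.0 * X_batch.ravel() + rng.normal(0, 0.5, size=32)
    model.partial_fit(X_batch, y_batch)     # incremental update, no full retraining

print(model.coef_, model.intercept_)        # coefficient should drift toward the true slope of 3
```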
Instance-based learning: the system learns the examples by heart, then generalizes to new cases using a similarity measure
- Lazy Learners (K-Nearest Neighbors)
- Radial Basis Functions (RBF kernel)
- Case-Based Reasoning (CBR)
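A minimal instance-based sketch, assuming scikit-learn and a made-up dataset: k-NN simply stores the training examples and predicts a new case from its most similar stored neighbors.

```python
# Instance-based learning: memorize the examples, generalize by similarity.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [1.5], [2.0], [8.0], [8.5], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # "training" = storing X and y
print(knn.predict([[1.7], [8.2]]))                    # the nearest stored neighbors decide
```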
Model-based learning: generalize by building a model from a set of examples, then use the built model to make predictions
Utility function (or fitness function): measures how good the model is
Cost function (or loss function): measures how bad the model is, e.g., the distance between the model's predictions and the training examples; the objective is to minimize this distance
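A minimal model-based sketch, assuming scikit-learn/NumPy and made-up numbers: fit a linear model, then compute the mean squared error by hand as one example of a cost function that training tries to minimize.

```python
# Model-based learning: build a model, then judge it with a cost function (here MSE).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])                 # roughly y = 2x

model = LinearRegression().fit(X, y)               # training picks the model parameters
predictions = model.predict(X)
mse = np.mean((predictions - y) ** 2)              # distance between predictions and examples
print(model.coef_, model.intercept_, mse)
```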
Unreasonable Effectiveness of Data
- Data matters more than algorithms
- Google paper: "The Unreasonable Effectiveness of Data" by Halevy, Norvig, and Pereira (2009)
Nonrepresentative Training Data
- It is crucial to use a training set that is representative of the new cases you want to generalize to
- If the sample is too small, there is sampling noise (nonrepresentative data as a result of chance)
- Even large samples can be nonrepresentative if the sampling method is flawed; this is called sampling bias
Poor-Quality Data
- Training data is full of errors, outliers, and noise
- If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually
- If some instances are missing a few features, decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values, or train one model with the feature and one model without it
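A minimal sketch of the missing-value options, assuming pandas/scikit-learn and a made-up toy table: drop the attribute, drop the incomplete instances, or fill the gaps in (here with the median).

```python
# Poor-quality data: three common ways to handle a feature with missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Mileage": [15000, np.nan, 32000, 8000],
                   "Price":   [9000, 12000, 7000, 14000]})

option_a = df.drop(columns=["Mileage"])          # ignore the attribute altogether
option_b = df.dropna(subset=["Mileage"])         # ignore the incomplete instances
option_c = df.copy()                             # fill in the missing values
option_c["Mileage"] = SimpleImputer(strategy="median").fit_transform(df[["Mileage"]]).ravel()
print(option_c)
```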
Irrelevant Features
The system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones
- Feature Engineering
- Feature selection: selecting the most useful features to train on among existing features
- Feature extraction: combining existing features to produce a more useful one, as in dimensionality reduction
- Creating new features by gathering new data
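A minimal feature-engineering sketch, assuming scikit-learn and its bundled iris data: selection keeps the most useful existing features, while extraction (here PCA) combines them into fewer new ones.

```python
# Feature selection vs. feature extraction.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)   # keep the 2 most useful features
X_extracted = PCA(n_components=2).fit_transform(X)             # combine features into 2 components
print(X_selected.shape, X_extracted.shape)
```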
Overfitting
The model performs well on the training data, but does not generalize well
Solutions:
- To simplify the model
- To gather more training data
- To reduce the noise in the training data
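A small sketch of the first solution, assuming scikit-learn/NumPy and synthetic data: the same high-degree polynomial is fit unconstrained and with Ridge regularization; the constrained model typically generalizes better to fresh data.

```python
# Overfitting: constraining (regularizing) a flexible model usually helps it generalize.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, size=30)
X_new = rng.uniform(0, 1, size=(200, 1))                         # fresh data the models never saw
y_new = np.sin(2 * np.pi * X_new).ravel() + rng.normal(0, 0.1, size=200)

overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
constrained = make_pipeline(PolynomialFeatures(15), Ridge(alpha=0.01)).fit(X, y)

print(overfit.score(X, y), overfit.score(X_new, y_new))          # train vs. fresh-data R²; gap is usually larger here
print(constrained.score(X, y), constrained.score(X_new, y_new))  # the regularized fit typically has a smaller gap
```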
Underfitting
The model is too simple to learn the underlying structure of the data
Solutions:
- Selecting a more powerful model
- Feeding better features to the learning algorithm
- Reducing the constraints on the model, e.g., reducing the regularization hyperparameter
- Bias refers to the error that is introduced by approximating a real-life problem with a model that is too simple (wrong assumptions)
- Variance is due to the model's excessive sensitivity to small variations in the training data; it refers to the amount by which the learned model would change if we estimated it using a different training data set
- Irreducible error is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data
- The higher the degrees of freedom, the greater the risk of overfitting: low bias, high variance
- The lower the degrees of freedom, the greater the risk of underfitting: high bias, low variance
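A small sketch of the variance side, assuming scikit-learn/NumPy and synthetic data (the helper fit_and_predict is made up for this note): the same two models are refit on several different training samples, and the spread of their predictions at a fixed point stands in for variance; the high-degree fit usually moves around far more than the straight line.

```python
# Degrees of freedom vs. variance: how much does the fitted model change across training sets?
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def fit_and_predict(degree, seed, x_query=0.5):
    """Fit a polynomial model of the given degree on a fresh sample, predict at x_query."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 1, size=(20, 1))
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, size=20)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    return model.predict([[x_query]])[0]

for degree in (1, 12):
    preds = [fit_and_predict(degree, seed) for seed in range(10)]
    print(degree, round(float(np.std(preds)), 3))   # larger spread across training sets = higher variance
```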
Cross Validation
- Overfitting, performs well on training data but generalizes poorly according to the cross-validation metrics
- Underfitting, performs poorly on both training data and the cross-validation metrics
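A minimal cross-validation sketch, assuming scikit-learn and its bundled iris data: 5-fold CV scores a decision tree on held-out folds, so overfitting shows up as a gap between training accuracy and the CV scores.

```python
# Cross-validation: evaluate on folds the model was not trained on.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=42)

scores = cross_val_score(tree, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())           # compare with the (near-perfect) training accuracy
```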
Learning Curves
- Plots of the model's performance on the training set and the validation set as a function of the training set size
- Underfitting, both curves have reached a plateau, they are close and fairly high
- Overfitting, there is a gap between the curves, the model performs significantly better on the training data than on the validation data
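A minimal learning-curves sketch, assuming scikit-learn and its bundled iris data: training and validation scores are computed for growing training-set sizes; curves that plateau close together suggest underfitting, while a persistent gap suggests overfitting.

```python
# Learning curves: performance as a function of training-set size.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

print(sizes)                           # training-set sizes used
print(train_scores.mean(axis=1))       # training-score curve
print(valid_scores.mean(axis=1))       # validation-score curve
```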
Regularization: constraining a model to make it simpler and reduce the risk of overfitting
Hyperparameters: parameters of the learning algorithm itself (set before training, not learned from the data) whose values control the learning process and thereby determine the values of the model parameters
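A minimal hyperparameter sketch, assuming scikit-learn and its bundled diabetes data: alpha (the Ridge regularization strength) is not learned from the data; it is chosen by grid search with cross-validation.

```python
# Hyperparameters: set before training, tuned with validation rather than learned from data.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)   # the regularization strength picked by cross-validation
```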
Reduction and Cleansing
- Dimensionality Reduction
- Sampling
- Cleaning
Data Wrangling
- Formatting
- Normalization
- Feature Engineering
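A small data-wrangling sketch, assuming pandas/scikit-learn and a made-up toy table: formatting a text column into numbers, normalizing a feature, and engineering a simple derived feature.

```python
# Data wrangling: formatting, normalization, and a simple engineered feature.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"price": ["9,000", "12,000", "7,500"],
                   "mileage_km": [15000, 64000, 32000]})

df["price"] = df["price"].str.replace(",", "").astype(float)                     # formatting: text -> number
df["mileage_scaled"] = MinMaxScaler().fit_transform(df[["mileage_km"]]).ravel()  # normalization to [0, 1]
df["price_per_km"] = df["price"] / df["mileage_km"]                              # feature engineering
print(df)
```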