Semi-supervised Learning¶

  • A small portion of labeled examples
  • A large number of unlabeled examples
  • The model must learn from both and make predictions on new examples
  • Can achieve better performance than a supervised learning algorithm fit only on the labeled training examples

1. Inductive Learning¶

  • Train a model on the labeled training data, then predict the labels of the unlabeled data
  • Builds a predictive model
  • Can predict any point in the input space
  • Lower computational cost

Classifier-based methods¶

  • Start from one or more initial classifiers and iteratively improve them
  • Expectation–maximization (EM) algorithm
  • Co-Training

Self Training¶

  • The data set contains both labeled and unlabeled examples
  1. Train a classifier that implements predict_proba on the labeled data
  2. Predict labels for the unlabeled data with the trained model
  3. Add a subset of the newly labeled examples to the labeled data; the selection criterion is threshold (the default) or k_best (see the sketch below)
  • Repeat steps 1-3 until no new labels are added or max_iter is reached
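
Which pseudo-labels get added in step 3 is controlled by the criterion parameter of sklearn's SelfTrainingClassifier. A minimal sketch of both options (the threshold and k_best values here are illustrative, not tuned):

from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# criterion="threshold" (default): each iteration, add every prediction
# whose probability exceeds the threshold
st_threshold = SelfTrainingClassifier(SVC(probability=True, gamma="auto"),
                                      threshold=0.9)

# criterion="k_best": each iteration, add the k most confident predictions
st_k_best = SelfTrainingClassifier(SVC(probability=True, gamma="auto"),
                                   criterion="k_best", k_best=10)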
In [62]:
import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
In [63]:
# create a data set containing labeled data and unlabeled data
# unlabeled data have the label -1
rng = np.random.RandomState(42)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3
iris.target[random_unlabeled_points] = -1
In [64]:
iris.target
Out[64]:
array([ 0,  0,  0,  0, -1, -1, -1,  0,  0,  0, -1,  0,  0, -1, -1, -1,  0,
        0,  0, -1,  0, -1, -1,  0,  0,  0, -1,  0,  0, -1,  0, -1, -1,  0,
        0,  0,  0, -1,  0,  0, -1,  0, -1,  0, -1,  0,  0,  0,  0, -1,  1,
        1,  1,  1,  1,  1, -1, -1, -1,  1,  1, -1,  1,  1, -1,  1, -1,  1,
       -1,  1,  1, -1, -1,  1,  1,  1,  1, -1,  1, -1,  1,  1,  1, -1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1, -1, -1, -1,  2,
        2,  2,  2, -1,  2,  2, -1, -1, -1, -1,  2,  2,  2,  2,  2, -1,  2,
        2,  2,  2,  2, -1, -1,  2,  2,  2, -1,  2,  2, -1, -1,  2,  2,  2,
        2,  2,  2,  2,  2, -1,  2,  2, -1, -1,  2,  2, -1, -1])
In [65]:
# create a classifier implementing predict_proba
svc = SVC(probability=True, gamma="auto")

# create a self-training model
self_training_model = SelfTrainingClassifier(svc)

# train the model
self_training_model.fit(iris.data, iris.target)
Out[65]:
SelfTrainingClassifier(base_estimator=SVC(gamma='auto', probability=True))
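
The fitted model records how the pseudo-labeling went. A short sketch of the inspection attributes documented for SelfTrainingClassifier (assuming a recent scikit-learn):

# labels used in the final fit, including pseudo-labels added during self-training
print(self_training_model.transduction_[:10])

# iteration in which each sample was labeled (0 = labeled from the start,
# -1 = never pseudo-labeled)
print(self_training_model.labeled_iter_[:10])

# why the loop stopped: 'max_iter', 'no_change', or 'all_labeled'
print(self_training_model.termination_condition_)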
In [70]:
# predict unlabeled data
predict = self_training_model.predict(iris.data[random_unlabeled_points]) # predicted labels
labels = datasets.load_iris().target[random_unlabeled_points] # real labels

from sklearn.metrics import classification_report
print(classification_report(labels, predict))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.80      0.89        15
           2       0.85      1.00      0.92        17

    accuracy                           0.94        51
   macro avg       0.95      0.93      0.94        51
weighted avg       0.95      0.94      0.94        51
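
To make the claim from the introduction concrete, here is a minimal baseline sketch: the same SVC fit only on the points that kept their labels, evaluated on the same unlabeled points. Comparing its report with the one above shows what, if anything, self-training gained on this split:

# supervised baseline trained on the labeled subset only
labeled_mask = ~random_unlabeled_points
baseline = SVC(probability=True, gamma="auto")
baseline.fit(iris.data[labeled_mask], iris.target[labeled_mask])

baseline_predict = baseline.predict(iris.data[random_unlabeled_points])
print(classification_report(labels, baseline_predict))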

2. Transductive Learning¶

  • Train the model and label the unlabeled data in one step
  • Does not build a predictive model; if new unlabeled data are encountered, the algorithm has to be re-run
  • Can label only the unlabeled points seen during training, based on the observed labeled data
  • Higher computational cost

Data-based methods¶

  • Discover the inherent geometry of the data and exploit it to find a good classifier
  • Manifold Regularization
  • Harmonic Mixtures
  • Information Regularization

Label Propagation¶
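
  • Builds a similarity graph over all samples, labeled and unlabeled, and propagates the known labels along the edges until the assignment converges
  • The graph is defined by a kernel: rbf (the default) or knn in scikit-learn's LabelPropagation
  • Although the method is transductive in spirit, the scikit-learn estimator also exposes predict() for new points via the kernel, which the cells below use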

In [83]:
rng = np.random.RandomState(42)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3

# create a data set containing labeled data and unlabeled data
# unlabeled data have the label -1
iris.target[random_unlabeled_points] = -1
In [84]:
iris.target
Out[84]:
array([ 0,  0,  0,  0, -1, -1, -1,  0,  0,  0, -1,  0,  0, -1, -1, -1,  0,
        0,  0, -1,  0, -1, -1,  0,  0,  0, -1,  0,  0, -1,  0, -1, -1,  0,
        0,  0,  0, -1,  0,  0, -1,  0, -1,  0, -1,  0,  0,  0,  0, -1,  1,
        1,  1,  1,  1,  1, -1, -1, -1,  1,  1, -1,  1,  1, -1,  1, -1,  1,
       -1,  1,  1, -1, -1,  1,  1,  1,  1, -1,  1, -1,  1,  1,  1, -1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1, -1, -1, -1,  2,
        2,  2,  2, -1,  2,  2, -1, -1, -1, -1,  2,  2,  2,  2,  2, -1,  2,
        2,  2,  2,  2, -1, -1,  2,  2,  2, -1,  2,  2, -1, -1,  2,  2,  2,
        2,  2,  2,  2,  2, -1,  2,  2, -1, -1,  2,  2, -1, -1])
In [85]:
from sklearn.semi_supervised import LabelPropagation
label_prop_model = LabelPropagation()
label_prop_model.fit(iris.data, iris.target)
Out[85]:
LabelPropagation()
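
The transductive output itself lives on the fitted model: the transduction_ attribute holds the label assigned to every training point, including the ones that went in as -1. A small sketch:

# labels the propagation assigned to the points that were unlabeled during fit
print(label_prop_model.transduction_[random_unlabeled_points])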
In [88]:
# predict unlabeled data
predict = label_prop_model.predict(iris.data[random_unlabeled_points]) # predicted labels
labels = datasets.load_iris().target[random_unlabeled_points] # real labels

from sklearn.metrics import classification_report
print(classification_report(labels, predict))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.87      0.93        15
           2       0.89      1.00      0.94        17

    accuracy                           0.96        51
   macro avg       0.96      0.96      0.96        51
weighted avg       0.96      0.96      0.96        51
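
A closely related estimator in the same module is LabelSpreading, which uses a normalized graph Laplacian and an alpha clamping factor that lets the propagated distribution partially override the initial labels. A minimal sketch on the same data (alpha=0.2 is the library default, not a tuned choice):

from sklearn.semi_supervised import LabelSpreading

label_spread_model = LabelSpreading(kernel="rbf", alpha=0.2)
label_spread_model.fit(iris.data, iris.target)

spread_predict = label_spread_model.predict(iris.data[random_unlabeled_points])
print(classification_report(labels, spread_predict))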
