Feature Engineering
Feature selection: selecting the most useful features to train on among existing features
Feature extraction: combining existing features to produce a more useful one, for example via dimensionality reduction
Creating new features by gathering new data
Encoding Categorical Features
Count Encoding: replaces each categorical value with the number of times it appears in the dataset
Target Encoding: replaces each categorical value with the average value of the target for rows with that value
CatBoost Encoding: based on the target probability for a given value, computed in an ordered (row-by-row) fashion to reduce target leakage
Kaggle Tutorial
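A minimal sketch of the three encodings, assuming the category_encoders package; the toy DataFrame, the "city" column, and the "clicked" target are hypothetical names used only for illustration.

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "city":    ["NY", "SF", "NY", "LA", "SF", "NY"],
    "clicked": [1,    0,    1,    0,    1,    0],
})
X, y = df[["city"]], df["clicked"]

# Count encoding: each city is replaced by how often it appears in the data.
X_count = ce.CountEncoder(cols=["city"]).fit_transform(X)

# Target encoding: each city is replaced by the mean target value for that city.
X_target = ce.TargetEncoder(cols=["city"]).fit_transform(X, y)

# CatBoost encoding: like target encoding, but computed in an ordered,
# row-by-row fashion to reduce target leakage.
X_catboost = ce.CatBoostEncoder(cols=["city"]).fit_transform(X, y)

print(X_count["city"].tolist())
print(X_target["city"].tolist())
print(X_catboost["city"].tolist())
```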
Correlated Features
Why Remove Correlated Features
- Make the learning algorithm faster
- Decrease harmful bias
- If correlated features are also correlated with the target, consider keeping them: Naive Bayes directly benefits from "positively" correlated features, and Random Forest may benefit from them indirectly
- Improve the interpretability of your model
- Deal with the curse of dimensionality and overfitting
- If we can’t solve a problem with a few features, adding more features seems like a good idea
- However, the number of samples usually stays the same
- The model with more features is then likely to perform worse, not better, than expected
- For a fixed sample size n, there is an optimal number of features to use
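One common way to act on this is to drop one feature from every highly correlated pair. A minimal sketch, assuming numeric features in a pandas DataFrame; the 0.95 threshold and the column names in the usage example are arbitrary choices.

```python
import numpy as np
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    # Absolute pairwise correlations; keep only the upper triangle so each pair is checked once.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Drop any column whose correlation with an earlier column exceeds the threshold.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# Usage: "b" is nearly a duplicate of "a", so it gets dropped.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=100),
    "c": rng.normal(size=100),
})
print(drop_correlated(X).columns.tolist())  # ['a', 'c']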
Greedy
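A greedy approach adds (or removes) one feature at a time, keeping whichever change most improves a cross-validated score. A minimal sketch, assuming scikit-learn's SequentialFeatureSelector and its built-in breast cancer dataset; the choice of estimator and of 5 features is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Greedy forward selection: start from zero features and add the single
# feature that best improves cross-validated accuracy, until 5 are chosen.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features
```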
Recursive Feature Elimination (RFE)
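RFE works in the opposite direction: it repeatedly fits a model and removes the least important feature(s) until the desired number remains. A minimal sketch, assuming scikit-learn's RFE with a RandomForestClassifier on the built-in breast cancer dataset; keeping 5 features is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)

# Fit, drop the least important feature, refit, and repeat until 5 features remain.
rfe = RFE(
    RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=5,
    step=1,
)
rfe.fit(X, y)
print(rfe.get_support(indices=True))  # indices of the surviving features
```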
Dimensionality Reduction
Lasso Regularisation
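The L1 penalty drives some coefficients exactly to zero, so the surviving features form a reduced set. A minimal sketch, assuming scikit-learn's Lasso with SelectFromModel on the built-in diabetes regression dataset; alpha=1.0 is an arbitrary choice.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # Lasso is sensitive to feature scale

# Keep only the features whose Lasso coefficients are non-zero.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
print(X.shape[1], "->", X_selected.shape[1], "features")
```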
Principal Component Analysis (PCA)
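PCA projects the features onto orthogonal components ordered by explained variance, so a handful of components can stand in for many correlated columns. A minimal sketch, assuming scikit-learn's PCA on the built-in digits dataset; the 95% explained-variance target is an arbitrary choice.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X.shape[1], "->", X_reduced.shape[1], "components")
```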
Reference