Machine Learning - Categorical Data
What is Categorical Data?
Categorical data in Machine Learning refers to data that consists of categories or labels, rather than numerical values. These categories may be nominal, meaning that there is no inherent order or ranking between them (e.g., color, gender), or ordinal, meaning that there is a natural ordering between the categories (e.g., education level, income bracket).
Categorical data is often represented using discrete values, such as integers or strings, and is frequently encoded as one-hot vectors before being used as input to machine learning models. One-hot encoding involves creating a binary vector for each category, where the vector has a 1 in the position corresponding to the category and 0s in all other positions.
Techniques for Handling Categorical Data
Handling categorical data is an important part of machine learning preprocessing, as many algorithms require numerical input. Depending on the algorithm and the nature of the categorical data, different encoding techniques may be used, such as one-hot encoding, label encoding, ordinal encoding, or binary encoding.
In the subsequent sections of this chapter, we will discuss the different techniques for handling categorical data in machine learning along with their implementations in Python.
One-Hot Encoding
One-hot encoding is a popular technique for handling categorical data in machine learning. It involves creating a binary vector for each category, where each element of the vector represents the presence or absence of the category. For example, if we have a categorical variable for color with values red, blue, and green, one-hot encoding would create three binary vectors: [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.
Example
Below is an example of how to perform one-hot encoding in Python using the Pandas library −
import pandas as pd

# Creating a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# Performing one-hot encoding
one_hot_encoded = pd.get_dummies(df['color'], prefix='color')

# Combining the encoded data with the original data
df = pd.concat([df, one_hot_encoded], axis=1)

# Drop the original categorical variable
df = df.drop('color', axis=1)

# Print the encoded data
print(df)
Output
This will create a one-hot encoded dataframe with three binary variables ("color_blue", "color_green", and "color_red") that take the value 1 if the corresponding color is present and 0 if it is not. The encoded data, shown below, can then be used for machine learning tasks such as classification and regression.
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0
The one-hot encoding technique works well for small, finite categorical variables but can be problematic for large ones, as it leads to a high number of input features.
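If you are building a scikit-learn pipeline, the same one-hot encoding can be done with scikit-learn's OneHotEncoder, which is fitted on training data and can then be reused on new data. Below is a minimal sketch, assuming scikit-learn 1.2 or later (older versions use sparse=False instead of sparse_output=False) −

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# same sample data as the pandas example above
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# handle_unknown='ignore' encodes categories unseen during fit as all
# zeros instead of raising an error; sparse_output=False returns a
# dense array rather than a sparse matrix
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['color']])

# wrap the result in a DataFrame with readable column names
encoded_df = pd.DataFrame(encoded,
    columns=encoder.get_feature_names_out(['color']))
print(encoded_df)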
Label Encoding
Label Encoding is another technique for handling categorical data in machine learning. It involves assigning a unique integer value to each category in a categorical variable.
For example, suppose we have a categorical variable "Size" with three categories: "small," "medium," and "large." Using label encoding, we might assign the values 0, 1, and 2 to these categories, respectively.
Example
Below is an example of how to perform label encoding in Python using the scikit-learn library −
from sklearn.preprocessing import LabelEncoder

# create a sample dataset with a categorical variable
data = ['small', 'medium', 'large', 'small', 'large']

# create a label encoder object
label_encoder = LabelEncoder()

# fit and transform the data using the label encoder
encoded_data = label_encoder.fit_transform(data)

# print the encoded data
print(encoded_data)
This will create an encoded array with the values [2 1 0 2 0], because LabelEncoder assigns values based on the alphabetical order of the categories: "large" becomes 0, "medium" becomes 1, and "small" becomes 2. Note that LabelEncoder does not accept a custom ordering; to control the order explicitly, use scikit-learn's OrdinalEncoder (shown below) or map the values manually.
Output
[2 1 0 2 0]
Label encoding can be useful when there is a natural ordering between the categories, such as in the case of ordinal categorical variables. However, it should be used with caution for nominal categorical variables because the numerical values may imply an order that does not actually exist. In these cases, one-hot encoding is a safer option.
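For a genuinely ordinal variable such as "Size", scikit-learn's OrdinalEncoder lets you state the ordering explicitly instead of relying on alphabetical order. Below is a minimal sketch of this approach −

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# sample data with a natural ordering between categories
df = pd.DataFrame({'size': ['small', 'medium', 'large', 'small', 'large']})

# the categories parameter fixes the order: small=0, medium=1, large=2
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = ordinal_encoder.fit_transform(df[['size']]).ravel()

print(df)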
Frequency Encoding
Frequency Encoding is another technique for handling categorical data in machine learning. It involves replacing each category in a categorical variable with its frequency (or count) in the dataset. The idea behind frequency encoding is that categories that appear more frequently may be more important or informative for the machine learning algorithm.
Example
Below is an example of how to perform frequency encoding in Python −
import pandas as pd

# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# calculate the frequency of each category in the categorical variable
freq = df['color'].value_counts(normalize=True)

# replace each category with its frequency
df['color_freq'] = df['color'].map(freq)

# drop the original categorical variable
df = df.drop('color', axis=1)

# print the encoded data
print(df)
This will create an encoded dataframe with one variable ("color_freq") that holds the relative frequency of each category in the original categorical variable. In this dataset, "red" and "green" each occur twice and "blue" once out of five rows, so their frequencies are 0.4, 0.4, and 0.2, respectively.
Output
   color_freq
0         0.4
1         0.4
2         0.2
3         0.4
4         0.4
Frequency encoding can be a useful alternative to one-hot encoding or label encoding, especially when dealing with high-cardinality categorical variables (i.e., variables with a large number of categories). However, it may not always be effective, and its performance can depend on the particular dataset and machine learning algorithm being used.
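One practical detail worth noting: the frequency map should be learned from the training data only and then applied to new data, where a category may never have been seen. Below is a minimal sketch of that pattern, using a hypothetical train/test split −

import pandas as pd

# hypothetical training and test data for a categorical column
train = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})
test = pd.DataFrame({'color': ['green', 'yellow']})   # 'yellow' is unseen

# learn the frequency map from the training data only
freq = train['color'].value_counts(normalize=True)

# apply it to both sets; unseen categories map to NaN, so fill with 0
train['color_freq'] = train['color'].map(freq)
test['color_freq'] = test['color'].map(freq).fillna(0)

print(test)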
Target Encoding
Target Encoding is another technique for handling categorical data in machine learning. It involves replacing each category in a categorical variable with the mean (or other aggregation) of the target variable (i.e., the variable you want to predict) for that category. The idea behind target encoding is that it can capture the relationship between the categorical variable and the target variable, and therefore improve the predictive performance of the machine learning model.
Example
Below is an example of how to perform target encoding in Python with the Scikit-learn library by using a combination of a label encoder and a mean encoder −
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# create a sample dataset with a categorical variable and a target variable
data = {'color': ['red', 'green', 'blue', 'red', 'green'],
    'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# create a label encoder object and fit it to the data
label_encoder = LabelEncoder()
label_encoder.fit(df['color'])

# transform the categorical variable using the label encoder
df['color_encoded'] = label_encoder.transform(df['color'])

# build a mean-encoding dictionary from the transformed data
mean_encoder = df.groupby('color_encoded')['target'].mean().to_dict()

# map the mean encoded values onto the encoded variable
df['color_encoded'] = df['color_encoded'].map(mean_encoder)

# print the encoded data
print(df)
In this example, we first create a Pandas DataFrame df with a categorical variable 'color' and a target variable 'target'. We then create a LabelEncoder object from scikit-learn and fit it to the 'color' column of df.
Next, we transform the categorical variable 'color' using the label encoder by calling the transform method on the label encoder object and assigning the resulting encoded values to a new column 'color_encoded' in df.
Finally, we create a mean-encoding dictionary by grouping df by the 'color_encoded' column and calculating the mean of the 'target' column for each group. We then map these mean-encoded values back onto the 'color_encoded' column of df.
Output
   color  target  color_encoded
0    red       1            0.5
1  green       0            0.5
2   blue       1            1.0
3    red       0            0.5
4  green       1            0.5
Target encoding can be a powerful technique for improving the predictive performance of machine learning models, especially for datasets with high-cardinality categorical variables. However, it is important to avoid overfitting by using cross-validation and regularization techniques.
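A common way to reduce this kind of overfitting is out-of-fold target encoding, where each row is encoded using target means computed on the other folds only, so no row's own target leaks into its encoding. Below is a minimal sketch of the idea, reusing the same small dataset as the example above −

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
    'target': [1, 0, 1, 0, 1]})

# fallback value for categories that do not appear in a training fold
global_mean = df['target'].mean()
df['color_te'] = global_mean

kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    # category means computed on the training fold only
    fold_means = df.iloc[train_idx].groupby('color')['target'].mean()
    # applied to the validation fold; unseen categories keep the global mean
    df.loc[df.index[val_idx], 'color_te'] = (
        df.iloc[val_idx]['color'].map(fold_means).fillna(global_mean).values)

print(df)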
Binary Encoding
Binary encoding is another technique used for encoding categorical variables in machine learning. In binary encoding, each category is first assigned an integer index (based on the order in which categories appear, or on a sorted list of all categories), and that index is then written out in binary. Each binary digit becomes its own column, taking the value 1 or 0.
Example
Here's an example Python implementation of binary encoding using the category_encoders library −
import pandas as pd
import category_encoders as ce

# create a sample dataset with a categorical variable
data = {'color': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# create a binary encoder object and fit it to the data
binary_encoder = ce.BinaryEncoder(cols=['color'])
binary_encoder.fit(df['color'])

# transform the categorical variable using the binary encoder
encoded_data = binary_encoder.transform(df['color'])

# merge the encoded variable with the original dataframe
df = pd.concat([df, encoded_data], axis=1)

# print the encoded data
print(df)
In this example, we first create a Pandas DataFrame df with a categorical variable 'color'. We then create a BinaryEncoder object from the category_encoders library and fit it to the 'color' column of df.
Next, we transform the categorical variable 'color' using the binary encoder by calling the transform method on the binary encoder object and assigning the resulting encoded values to a new DataFrame encoded_data.
Finally, we merge the encoded variable with the original DataFrame df using the concat method along the column axis (axis=1). The resulting DataFrame should have the original 'color' column along with the encoded binary columns.
Output
When you run the code, it will produce the following output −
   color  color_0  color_1
0    red        0        1
1  green        1        0
2   blue        1        1
3    red        0        1
4  green        1        0
Binary encoding works well for categorical variables with a moderate to large number of categories, since the number of encoded columns grows only logarithmically with the number of categories; the trade-off is that the individual bit columns are not directly interpretable.
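To see why, compare the number of columns each technique needs: one-hot encoding uses one column per category, while binary encoding only needs enough bits to represent the largest category index (indices start at 1 in the category_encoders output above). A quick back-of-the-envelope sketch −

import math

# one-hot needs n columns for n categories; binary encoding needs
# ceil(log2(n + 1)) columns, since category indices start at 1
for n in [3, 10, 100, 1000]:
    one_hot_cols = n
    binary_cols = math.ceil(math.log2(n + 1))
    print(f"{n} categories: one-hot -> {one_hot_cols} columns, "
        f"binary -> {binary_cols} columns")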