Jan 24, 2024

Comprehensive Guide to Data Preprocessing in Python

A detailed exploration of data preprocessing techniques including cleaning, normalizing, and feature engineering with practical Python code examples

Comprehensive Guide to Data Preprocessing in Python

Data preprocessing is a crucial step in any data science project. It involves cleaning, normalizing, and feature engineering the data to make it suitable for analysis and modeling. In this article, we will explore the basics and advanced techniques of data preprocessing and provide code examples in Python to help you get started with the right foot.

Cleaning

Clean as water!!

Cleaning data refers to the process of removing or correcting errors, inconsistencies, and missing values in the data. This is important because these issues can lead to inaccurate or unreliable results.

Handling Missing Values

One common technique for handling missing values is to fill them with the mean or median of the column, but be careful, this is common but not recommended for all, will depend a lot on your data.

import pandas as pd

df = pd.read_csv("data.csv")
df = df.fillna(df.mean())

Another technique is to remove the rows or columns with a lot of missing values:

df = df.dropna()

Dealing with Outliers

Outliers are an important aspect of data cleaning. These are values that fall outside of a certain range and can skew the results. One common technique for dealing with outliers is to use the interquartile range (IQR) method. The IQR is the difference between the 75th and 25th percentiles of the data. Values outside of the range of (Q1–1.5 * IQR) to (Q3 + 1.5 * IQR) are considered outliers.

import numpy as np

Q3, Q1 = np.percentile(df['data'], [75 ,25])
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]

Correcting Errors and Handling Duplicates

Correcting errors, such as typos or incorrect values:

# Standardizing categorical data
df['gender'] = df['gender'].replace(['Female', 'female'], 'F')
df['gender'] = df['gender'].replace(['Male', 'male'], 'M')

# Removing duplicate rows
df = df.drop_duplicates()

Normalizing

Another history of -1, 0 and 1

Normalizing data refers to the process of scaling the numeric data attributes into a 0 to 1 range. This is important because some machine learning algorithms are sensitive to the scale of the data.

MinMaxScaler

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df = scaler.fit_transform(df)

StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df = scaler.fit_transform(df)

Feature Engineering

Your imagination is the limit

Feature engineering is the process of creating new features from the existing ones.

Creating Categorical Features

# Creating age range categories
df['age_range'] = df['age'].map(lambda x: 'young' if x < 30 else ('middle' if x < 60 else 'old'))

# One-Hot Encoding for categorical variables
df_encoded = pd.get_dummies(df, columns=['gender', 'age_range'])

Data Splitting

Splitting the data into training and test sets:

from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Key Takeaways

Data preprocessing is a multi-step process involving cleaning, normalizing, and feature engineering
Each dataset is unique and may require different preprocessing techniques
Always explore and understand your data before applying preprocessing steps
Preprocess training and test sets separately to avoid data leakage and overfitting
The field of data science is constantly evolving, so keep learning and researching new techniques

References

If you are interested in data science, machine learning, artificial intelligence, and education, let’s get in touch and follow me for more!! ( ^-^)**(⁰^ )

Thank you very much for coming this far, I hope these resources will help you. Any comments and feedback are welcome. ╰(°▽°)╯