Data Preprocessing and EDA with the Iris Dataset
Table of Contents
- Introduction
- Understanding the Iris Dataset
- Data Preprocessing
- Exploratory Data Analysis (EDA)
- Principal Component Analysis (PCA)
- Conclusion
Introduction
Welcome to this guide on Data Preprocessing and Exploratory Data Analysis (EDA) using the Iris dataset. We will also delve into Principal Component Analysis (PCA) to understand how to reduce the dimensionality of our data.
Understanding the Iris Dataset
The Iris dataset is a simple but widely used dataset in pattern recognition. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The dataset has the following attributes:
- Sepal Length: Length of the sepal in cm
- Sepal Width: Width of the sepal in cm
- Petal Length: Length of the petal in cm
- Petal Width: Width of the petal in cm
- Class: Species of the iris plant (Iris Setosa, Iris Versicolor, Iris Virginica)
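The structure described above can be checked directly. The following is a minimal sketch, assuming the copy of the dataset bundled with scikit-learn (load_iris) is the one being used; it only confirms the shape, the feature and class names, and the number of instances per class.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# 150 rows (instances) by 4 columns (features)
n_samples, n_features = iris.data.shape

# Number of instances for each integer class label (0, 1, 2)
class_counts = np.bincount(iris.target)

print(n_samples, n_features)
print(iris.feature_names)
print(iris.target_names.tolist())
print(class_counts.tolist())
```

Note that scikit-learn stores the class as integer codes in iris.target and keeps the species names separately in iris.target_names.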
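Data Preprocessing
Before diving into any analysis, it's crucial to get the data into a clean, tabular form. Here that means loading the dataset into a pandas DataFrame and attaching the class labels as an integer-coded column.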
# Importing libraries and loading the dataset
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target
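Once the data is loaded, a few quick checks are a reasonable part of preprocessing. The sketch below is one possible set of checks, not a required recipe: it counts missing values, prints the column types, and builds a readable species label. The variable names missing_per_column and species are illustrative; species is kept as a separate Series so that the column layout of df is left unchanged.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Confirm that the features are floats and the class is an integer code
print(df.dtypes)

# Map the integer codes to species names without modifying df
species = pd.Series(
    pd.Categorical.from_codes(iris.target, categories=iris.target_names),
    name='species',
)
print(species.value_counts())
```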
Exploratory Data Analysis (EDA)
# Summary statistics
df.describe()
# Importing matplotlib for data visualization
import matplotlib.pyplot as plt
# Plotting histograms for each feature (the integer 'class' column is excluded)
df.drop(columns='class').hist(figsize=(8, 6))
plt.tight_layout()
plt.show()
Scatter Plots
# Scatter plot based on Sepal Length and Width
plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'], c=df['class'])
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()
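Coloring by the integer class works, but the plot then has no legend to say which color is which species. A possible variation, sketched here under the assumption that the scikit-learn copy of the dataset is used, draws one scatter call per class so that each species gets a legend entry.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

fig, ax = plt.subplots()

# One scatter call per class so each species gets its own legend entry
for label, name in enumerate(iris.target_names):
    mask = y == label
    ax.scatter(X[mask, 0], X[mask, 1], label=name)  # columns 0, 1: sepal length, sepal width

ax.set_xlabel('Sepal Length (cm)')
ax.set_ylabel('Sepal Width (cm)')
ax.legend(title='Species')
```

In a script, call plt.show() afterwards to display the figure.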
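Principal Component Analysis (PCA)
PCA is a technique for reducing the dimensionality of a dataset: it projects the data onto a small number of orthogonal directions (the principal components) along which the variance is largest.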
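How Does PCA Work?
- Standardization: Rescale each feature to zero mean and unit variance.
- Covariance Matrix: Compute the covariance matrix of the standardized features.
- Eigenvalues and Eigenvectors: Compute the eigenvalues and eigenvectors of the covariance matrix.
- Sort and Select: Sort the eigenvalues in descending order and keep the eigenvectors belonging to the k largest.
- New Dataset: Project the standardized data onto the selected eigenvectors to form the new k-dimensional dataset.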
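The five steps above can be sketched by hand with NumPy. This is an illustrative version, not how scikit-learn implements PCA internally (scikit-learn uses a singular value decomposition); the names X_std, W, and X_manual are chosen here for illustration. The final lines compare the result with scikit-learn's PCA, ignoring sign, because each principal component is only defined up to a sign flip.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # shape (150, 4)

# 1. Standardization: zero mean and unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features (4 x 4)
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues and eigenvectors; eigh is intended for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort in descending order of eigenvalue and keep the top k eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
W = eigvecs[:, :k]  # shape (4, k)

# 5. New dataset: project the standardized data onto the selected eigenvectors
X_manual = X_std @ W  # shape (150, k)

# Compare with scikit-learn, ignoring the sign of each component
X_sklearn = PCA(n_components=k).fit_transform(X_std)
matches = np.allclose(np.abs(X_manual), np.abs(X_sklearn))
print(X_manual.shape, matches)
```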
# Implementing PCA
from sklearn.decomposition import PCA
# Standardizing the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.iloc[:, :-1])
# Applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Scatter plot for the first two principal components
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['class'])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
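Interpretation
The scatter plot of the first two principal components shows Iris setosa clearly separated from the other two species, while Iris versicolor and Iris virginica overlap slightly. Class separation is not the same thing as retained variance, though: the plot alone does not show how much of the original variance survives the reduction. That is measured by the explained_variance_ratio_ attribute of the fitted PCA object, which for the standardized Iris data comes to roughly 96% for the first two components.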
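How much variance the two components retain can be checked numerically rather than read off the plot. The following minimal sketch repeats the standardization and PCA fit from above and inspects explained_variance_ratio_, an attribute of scikit-learn's fitted PCA object; the names ratios and total are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four features, then fit a two-component PCA
X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by each component
ratios = pca.explained_variance_ratio_
total = ratios.sum()

print(ratios)
print(f"Variance retained by 2 components: {total:.1%}")
```

If the retained fraction were low, increasing n_components would be the natural next step.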
Conclusion
We've covered Data Preprocessing, EDA, and PCA using the Iris dataset. Understanding these concepts is crucial for anyone diving into Data Science and Machine Learning.