Data Preprocessing and EDA with the Iris Dataset¶

Table of Contents¶

Introduction
Understanding the Iris Dataset
Data Preprocessing
Exploratory Data Analysis (EDA)
Principal Component Analysis (PCA)
Conclusion

Introduction¶

Welcome to this guide on Data Preprocessing and Exploratory Data Analysis (EDA) using the Iris dataset. We will also delve into Principal Component Analysis (PCA) to understand how to reduce the dimensionality of our data.

Understanding the Iris Dataset¶

The Iris dataset is a simple but widely used dataset in pattern recognition. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The dataset has the following attributes:

Sepal Length: Length of the sepal in cm
Sepal Width: Width of the sepal in cm
Petal Length: Length of the petal in cm
Petal Width: Width of the petal in cm
Class: Species of the iris plant (Iris Setosa, Iris Versicolour, Iris Virginica)

Data Preprocessing¶

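Before diving into any analysis, it's crucial to preprocess the data. Here that starts with loading the dataset into a pandas DataFrame and attaching the class labels as an extra column.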

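# Importing libraries and loading the dataset
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
# Four numeric feature columns, named after iris.feature_names
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
# iris.target holds integer codes 0, 1, 2 (setosa, versicolor, virginica)
df['class'] = iris.target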
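
Once the data is loaded, a few quick integrity checks are a reasonable next preprocessing step. The following is a minimal sketch, not a complete preprocessing pipeline: it confirms the shape, counts missing values, and checks the class balance. The `species` Series is a hypothetical convenience introduced here for illustration; it is kept separate from `df` so the later cells are unaffected.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target

# Basic integrity checks before any analysis
n_rows, n_cols = df.shape                    # rows, and features plus label
missing_per_column = df.isna().sum()         # number of NaN values per column
class_counts = df['class'].value_counts().sort_index()

# Hypothetical helper (not in the original): readable species names,
# stored in a separate Series so df keeps its original columns
species = df['class'].map(dict(enumerate(iris.target_names)))

print(n_rows, n_cols)
print(missing_per_column)
print(class_counts)
print(species.unique())
```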

Exploratory Data Analysis (EDA)¶

EDA is all about understanding the data through visualizations and summaries.

Summary Statistics¶

# Summary statistics
df.describe()
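
Note that `df.describe()` pools all three classes together. As a complementary view, the sketch below groups by the `class` column and computes the per-class mean of each feature; it reloads the data so that it runs on its own.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target

# Mean of each feature within each class (rows are class codes 0, 1, 2)
class_means = df.groupby('class').mean()
print(class_means.round(2))
```

Features whose means differ strongly between classes are the ones most likely to help tell the species apart.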

Data Visualization¶

Histograms¶

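# Importing matplotlib for data visualization
import matplotlib.pyplot as plt

# Plotting histograms for each feature
# (the 'class' column is a label, not a feature, so it is dropped first)
df.drop(columns='class').hist()
plt.tight_layout()
plt.show()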

Scatter Plots¶

# Scatter plot based on Sepal Length and Width, colored by class code
plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'], c=df['class'])
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()
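
The plot above encodes the class only through color, with no legend. One possible variant, sketched here as an illustration, draws each class in its own `scatter` call so that a legend can name the species. The species names come from `iris.target_names`.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target

fig, ax = plt.subplots()
# One scatter call per class so each species gets its own legend entry
for code, name in enumerate(iris.target_names):
    subset = df[df['class'] == code]
    ax.scatter(subset['sepal length (cm)'], subset['sepal width (cm)'],
               label=str(name))
ax.set_xlabel('Sepal Length (cm)')
ax.set_ylabel('Sepal Width (cm)')
ax.legend(title='Species')
plt.show()
```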

Principal Component Analysis (PCA)¶

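PCA is a technique used to reduce the dimensionality of a dataset. It projects the data onto a small number of orthogonal directions, the principal components, along which the variance is greatest.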

How Does PCA Work?¶

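Standardization: Rescale each feature to zero mean and unit variance.
Covariance Matrix: Compute the covariance matrix of the standardized features.
Eigenvalues and Eigenvectors: Compute the eigenvalues and eigenvectors of the covariance matrix.
Sort and Select: Sort the eigenvalues in descending order and select the eigenvectors belonging to the top k.
New Dataset: Project the standardized data onto the selected eigenvectors to form the new k-dimensional dataset.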
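
The five steps above can be sketched directly in NumPy. This is an illustrative from-scratch version under simple assumptions (population standard deviation for scaling, `k = 2`), not how scikit-learn implements PCA internally. Variable names such as `X_std`, `components`, and `X_projected` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                          # shape (150, 4)

# 1. Standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (4 x 4)
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by eigenvalue, largest first, and keep the top k eigenvectors
k = 2
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order[:k]]       # shape (4, k)

# 5. New dataset: project the standardized data onto the components
X_projected = X_std @ components              # shape (150, k)

# Share of total variance carried by each kept component
explained = eigenvalues[:k] / eigenvalues.sum()
print(X_projected.shape, explained.round(3))
```

The projected coordinates should match the output of scikit-learn's `PCA` up to the sign of each component, since an eigenvector's direction is only defined up to a factor of -1.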

# Implementing PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardizing the data (feature columns only, selected by name so the
# class label can never leak into the PCA input)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[iris.feature_names])

# Applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Scatter plot for the first two principal components
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['class'])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

Interpretation¶

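In the scatter plot of the first two principal components, Iris setosa forms a cluster that is clearly separated from the other two species, while Iris versicolor and Iris virginica lie next to each other with some overlap. Class separation by itself does not tell us how much variance was retained. That is measured by the explained variance ratio (pca.explained_variance_ratio_), which for the standardized Iris data comes to roughly 96% for the first two components combined.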
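
To check how much of the original variance the two components retain, one option is to inspect the fitted model's `explained_variance_ratio_` attribute. The sketch below refits the same standardized two-component PCA so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each component, and their sum
ratios = pca.explained_variance_ratio_
print(ratios.round(3), round(float(ratios.sum()), 3))
```

A sum close to 1 means little variance was lost in going from four features to two components.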

Conclusion¶

We've covered Data Preprocessing, EDA, and PCA using the Iris dataset. Understanding these concepts is crucial for anyone diving into Data Science and Machine Learning.