AI-Powered Prediction & Machine Learning with Palmer Penguins
🤖 From Visualization to Prediction: Your Next AI Journey!
This chapter demonstrates how AI can take you beyond descriptive analytics into the powerful world of predictive modeling. You’ll learn to build machine learning models that don’t just show you what happened—they predict what will happen next.
By the end of this chapter, you’ll understand:
- How to transition from visualization to prediction with AI assistance
- Building your first machine learning classifier
- Evaluating model performance and understanding predictions
- Best practices for AI-assisted machine learning workflows
🎯 The WHAT, HOW, and WHY of Predictive AI
📖 WHAT: Understanding Machine Learning in Your Workflow
Key Insight: Machine learning is the natural next step after data visualization. While charts show you patterns in historical data, ML models learn those patterns to make predictions about new, unseen data.
Think of AI-powered machine learning as your intelligent pattern recognition system that brings:
- 🧠 Pattern Learner: Identifies complex relationships in your data
- 🔮 Future Predictor: Makes informed predictions about new cases
- ⚖️ Decision Assistant: Provides confidence scores and reasoning
- 🎯 Performance Monitor: Evaluates and improves prediction accuracy
Key Machine Learning Integration Points (sketched end-to-end after this list):
- Data Preprocessing & Feature Engineering
- Model Selection & Training
- Prediction Generation & Validation
- Performance Evaluation & Interpretation
- Model Deployment & Monitoring
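To make these integration points concrete, here's a minimal end-to-end sketch using a scikit-learn Pipeline. The column names match the Palmer Penguins dataset used throughout this chapter, but the preprocessing and model choices are illustrative assumptions rather than this chapter's prescribed setup (the chapter itself uses label encoding and a standalone classifier later on):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the same dataset used in the rest of the chapter
url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
penguins = pd.read_csv(url).dropna()

numeric = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
categorical = ['island', 'sex']

# 1-2. Preprocessing/feature engineering and model selection, chained together
pipeline = Pipeline([
    ('preprocess', ColumnTransformer([
        ('num', StandardScaler(), numeric),
        ('cat', OneHotEncoder(), categorical),
    ])),
    ('model', RandomForestClassifier(random_state=42)),
])

# 3-4. Train, then generate and evaluate predictions on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    penguins[numeric + categorical], penguins['species'],
    test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(f"Held-out accuracy: {pipeline.score(X_test, y_test):.3f}")
# 5. Deployment/monitoring would wrap this fitted pipeline (e.g., joblib.dump)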
import pandas as pd
# GitHub URL for the CSV file (raw format)
url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
# Load the dataset into a pandas DataFrame
penguins = pd.read_csv(url)
# Display the variable (column) names
print("Variable (column) names in the Palmer Penguins dataset:")
print(penguins.columns.tolist())
Variable (column) names in the Palmer Penguins dataset:
['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year']
pip install plotly
Requirement already satisfied: plotly in /home/runner/micromamba/envs/cookbook-dev/lib/python3.12/site-packages (6.3.0)
Requirement already satisfied: narwhals>=1.15.1 in /home/runner/micromamba/envs/cookbook-dev/lib/python3.12/site-packages (from plotly) (2.1.1)
Requirement already satisfied: packaging in /home/runner/micromamba/envs/cookbook-dev/lib/python3.12/site-packages (from plotly) (25.0)
Note: you may need to restart the kernel to use updated packages.
import plotly
print(penguins.head(10))
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
5  Adelie  Torgersen            39.3           20.6              190.0       3650.0    male  2007
6  Adelie  Torgersen            38.9           17.8              181.0       3625.0  female  2007
7  Adelie  Torgersen            39.2           19.6              195.0       4675.0    male  2007
8  Adelie  Torgersen            34.1           18.1              193.0       3475.0     NaN  2007
9  Adelie  Torgersen            42.0           20.2              190.0       4250.0     NaN  2007
💡 Why This Approach Works:
Notice how we’re not just loading data—we’re immediately getting contextual insights that inform our next steps. The AI helps us understand the scope and structure before we dive deeper.
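For example, a couple of quick calls on the freshly loaded DataFrame already surface the structure and the gaps that will shape the cleaning step (a minimal sketch using the penguins DataFrame loaded above):
# Quick structural profile: dtypes, non-null counts, and summary statistics
penguins.info()
print(penguins.describe())
# Missing values per column, to scope the cleaning work ahead
print(penguins.isna().sum())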
🧹 AI-Assisted Data Cleaning
🤖 How We Engage AI for Intelligent Cleaning
💬 Effective AI Prompting for Data Cleaning
Instead of: “Clean my data”
Try: “Analyze the missing data patterns and recommend a cleaning strategy that balances data retention with model performance for a penguin species classification task”
Why it works: We provide context (classification task), constraints (data retention vs performance), and specific guidance (missing data patterns).
Key Insight: Effective AI-assisted cleaning depends on providing domain context and clear analytical objectives.
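As a sketch of the kind of analysis such a prompt tends to produce, the snippet below quantifies what a blanket dropna() would cost and checks whether missingness is concentrated in one class. It assumes the penguins DataFrame loaded above; the specific checks are illustrative, not a prescribed recipe:
# Missing values per column
missing = penguins.isna().sum()
print("Missing values per column:\n", missing[missing > 0])
# Retention cost of a blanket dropna()
n_total, n_complete = len(penguins), len(penguins.dropna())
print(f"dropna() keeps {n_complete}/{n_total} rows ({n_complete / n_total:.1%} retention)")
# Are incomplete rows concentrated in one species? A strong skew here
# would bias the classification task if we simply dropped them.
print(penguins[penguins.isna().any(axis=1)]['species'].value_counts())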
column_names_list = penguins.columns.tolist()
print("Column names (as a list using .tolist()):")
print(column_names_list)
Column names (as a list using .tolist()):
['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year']
🧠 AI Cleaning Philosophy
Notice how the AI provides not just actions, but rationale for each decision. This builds trust and understanding, making the process educational rather than just automated.
🎨 AI-Enhanced Data Visualization
🤖 Engaging AI for Strategic Visualization Planning
💬 How to Ask AI for Visualization Strategy
Instead of: “Make some charts”
Try: “Given that I have a classification problem with 3 species and 4 numerical features, what visualization strategy would best reveal class separability and feature relationships?”
Result: AI provides a strategic framework rather than random charts.
import pandas as pd
import matplotlib.pyplot as plt
# Load dataset
url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
penguins = pd.read_csv(url)
# Drop rows with missing values in selected columns
penguins = penguins[['species', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].dropna()
# Set style
plt.style.use('seaborn-v0_8-whitegrid')
# 1. Scatter Plot: Bill Length vs Bill Depth
plt.figure(figsize=(8, 6))
for species in penguins['species'].unique():
    subset = penguins[penguins['species'] == species]
    plt.scatter(subset['bill_length_mm'], subset['bill_depth_mm'], label=species, alpha=0.7)
plt.xlabel("Bill Length (mm)")
plt.ylabel("Bill Depth (mm)")
plt.title("Bill Length vs Bill Depth by Species")
plt.legend()
plt.tight_layout()
plt.show()
# 2. Box Plot: Flipper Length by Species
# pandas' boxplot creates its own figure, so pass figsize directly;
# a preceding plt.figure() call would just leave an empty figure behind
penguins.boxplot(column='flipper_length_mm', by='species', figsize=(8, 6))
plt.title("Flipper Length by Species")
plt.suptitle('')
plt.xlabel("Species")
plt.ylabel("Flipper Length (mm)")
plt.tight_layout()
plt.show()
# 3. Scatter Plot: Flipper Length vs Body Mass
plt.figure(figsize=(8, 6))
for species in penguins['species'].unique():
    subset = penguins[penguins['species'] == species]
    plt.scatter(subset['flipper_length_mm'], subset['body_mass_g'], label=species, alpha=0.7)
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Body Mass (g)")
plt.title("Flipper Length vs Body Mass by Species")
plt.legend()
plt.tight_layout()
plt.show()



import pandas as pd
import plotly.express as px
import plotly.io as pio
# --- Detect if in Jupyter Notebook ---
try:
    shell = get_ipython().__class__.__name__
    if "ZMQInteractiveShell" in shell:
        pio.renderers.default = 'notebook'  # Jupyter Notebook
    else:
        pio.renderers.default = 'iframe'    # Other notebook-like interface
except NameError:
    # Not in Jupyter (likely a script or terminal)
    try:
        pio.renderers.default = 'browser'
    except Exception:
        pio.renderers.default = 'svg'
# --- Load dataset ---
url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
penguins = pd.read_csv(url)
# --- Clean data ---
penguins = penguins[['species', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].dropna()
# --- Plot 1: Bill Length vs Bill Depth ---
fig1 = px.scatter(
    penguins,
    x='bill_length_mm',
    y='bill_depth_mm',
    color='species',
    title='Bill Length vs Bill Depth by Species',
    labels={'bill_length_mm': 'Bill Length (mm)', 'bill_depth_mm': 'Bill Depth (mm)'}
)
# --- Plot 2: Flipper Length by Species (Box Plot) ---
fig2 = px.box(
    penguins,
    x='species',
    y='flipper_length_mm',
    color='species',
    title='Flipper Length by Species',
    labels={'flipper_length_mm': 'Flipper Length (mm)'}
)
# --- Plot 3: Flipper Length vs Body Mass ---
fig3 = px.scatter(
    penguins,
    x='flipper_length_mm',
    y='body_mass_g',
    color='species',
    title='Flipper Length vs Body Mass by Species',
    labels={'flipper_length_mm': 'Flipper Length (mm)', 'body_mass_g': 'Body Mass (g)'}
)
# --- Show or Save Plots Based on Environment ---
try:
    # Try to show the plots interactively
    fig1.show()
    fig2.show()
    fig3.show()
except Exception:
    # If that fails (e.g., in a headless environment), save as HTML instead
    print("Interactive display failed. Saving plots as HTML...")
    fig1.write_html("fig1_bill_vs_depth.html")
    fig2.write_html("fig2_flipper_box.html")
    fig3.write_html("fig3_flipper_vs_mass.html")
    print("Plots saved in current directory.")
import pandas as pd
import plotly.express as px
import plotly.io as pio
from IPython.display import HTML, display
# Load and clean data
url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
penguins = pd.read_csv(url)
penguins = penguins[['species', 'bill_length_mm', 'bill_depth_mm']].dropna()
# Create plot
fig = px.scatter(
    penguins,
    x='bill_length_mm',
    y='bill_depth_mm',
    color='species',
    title='Bill Length vs Bill Depth by Species'
)
# Convert Plotly figure to HTML string
html_plot = pio.to_html(fig, full_html=False, include_plotlyjs='cdn')
# Display using IPython
display(HTML(html_plot))
fig.write_html("my_plot.html")
#<iframe src="my_plot.html" width="100%" height="500px"></iframe>
# 📌 Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
# 📌 Step 2: Load Palmer Penguins Dataset
url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
penguins = pd.read_csv(url)
# Drop rows with missing values
penguins.dropna(inplace=True)
# Display first few rows
penguins.head()
# 📌 Step 3: Exploratory Data Analysis (Optional but helpful)
sns.pairplot(penguins, hue="species")
plt.suptitle("Penguin Feature Distributions by Species", y=1.02)
plt.show()

# 📌 Step 4: Preprocessing
# Encode categorical features: 'island', 'sex', and target 'species'
le_species = LabelEncoder()
penguins['species_label'] = le_species.fit_transform(penguins['species'])
le_island = LabelEncoder()
penguins['island_label'] = le_island.fit_transform(penguins['island'])
le_sex = LabelEncoder()
penguins['sex_label'] = le_sex.fit_transform(penguins['sex'])
# Define features and target
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'island_label', 'sex_label']
target = 'species_label'
# 📌 Step 5: Train-Test Split
X = penguins[features]
y = penguins[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 📌 Step 6: Train Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# 📌 Step 7: Evaluate Model
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=le_species.classes_))
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', xticklabels=le_species.classes_, yticklabels=le_species.classes_)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
Classification Report:
               precision    recall  f1-score   support

      Adelie       0.94      1.00      0.97        31
   Chinstrap       1.00      0.89      0.94        18
      Gentoo       1.00      1.00      1.00        18

    accuracy                           0.97        67
   macro avg       0.98      0.96      0.97        67
weighted avg       0.97      0.97      0.97        67
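As a sanity check, the accuracy line can be reproduced from the per-class recalls: 31/31 Adelie, 16/18 Chinstrap, and 18/18 Gentoo test penguins were classified correctly, and (31 + 16 + 18) / 67 ≈ 0.97.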

🤖 AI Pattern Recognition Summary:
- Clear species clustering visible in multiple feature combinations
- Excellent linear separability suggests high classification accuracy potential (see the cross-validation check below)
- Flipper length vs body mass shows strongest species separation
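That 97% accuracy comes from a single train/test split. A quick cross-validation pass (a minimal sketch reusing X, y, and the imports from Steps 1 and 5) checks that the separability holds across folds:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation on the same features and target as Step 5
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")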
# 📌 Step 8: Feature Importance
importances = model.feature_importances_
feature_names = X.columns
plt.figure(figsize=(8, 6))
sns.barplot(x=importances, y=feature_names)
plt.title("Feature Importances in Penguin Classification")
plt.show()
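Finally, to connect back to the 'Decision Assistant' idea from the start of this chapter, here's a minimal sketch of scoring one new penguin with the trained model. The measurement values are hypothetical, and model, features, and the label encoders all come from Steps 4 through 6 above:
# 📌 Bonus: Predict a single new penguin with confidence scores
# (hypothetical measurements; encoders and model come from Steps 4-6)
new_penguin = pd.DataFrame([{
    'bill_length_mm': 45.0,
    'bill_depth_mm': 15.0,
    'flipper_length_mm': 220.0,
    'body_mass_g': 5200.0,
    'island_label': le_island.transform(['Biscoe'])[0],
    'sex_label': le_sex.transform(['female'])[0],
}])[features]  # keep the column order used in training

pred = model.predict(new_penguin)[0]
proba = model.predict_proba(new_penguin)[0]
print("Predicted species:", le_species.inverse_transform([pred])[0])
for species, p in zip(le_species.classes_, proba):
    print(f"  {species}: {p:.2f}")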

🏆 Key Takeaways and Next Steps
🎯 What You've Accomplished
✅ Your AI Integration Achievements:
- Mastered AI-enhanced data loading with immediate insights
- Generated comprehensive data profiles using AI assistance
- Implemented intelligent data cleaning with AI recommendations
- Created strategic visualizations guided by AI analysis
- Discovered key patterns that will drive successful classification
🧠 Key AI Integration Principles
- 🤝 Collaboration over Replacement - AI amplifies human expertise rather than replacing it
- 📊 Data-Driven Decisions - Let AI analyze patterns, humans interpret context
- 🔄 Iterative Improvement - Use AI feedback to refine your approach
- 📈 Scalable Methods - Develop reusable AI-assisted workflows
- 🎯 Goal-Oriented - Align AI assistance with business objectives
🚀 Ready for Mission Assignment 3: Data Interpretation
🎖️ Achievement Unlocked!
Congratulations! You’ve completed Prediction and Machine Learning!
In the next section, you’ll learn to:
- Build and train classification models with AI guidance
- Optimize model performance using AI recommendations
- Deploy your trained models for real-world predictions
- Evaluate and interpret results with AI assistance
Note: Continue to Mission Assignment 3 to become an 'AI Model Builder' and continue your journey toward becoming an AI Integration Specialist!