Simple ML Model with Amazon SageMaker (Full Step-by-Step Guide)
Goal
Train your first Machine Learning model using AWS cloud tools.
Skills You Will Learn
- Machine Learning workflow basics
- How cloud notebooks work
- How to train and evaluate a model
- How cloud storage is used in ML
AWS Services Used
Amazon S3
Used to store:
- Datasets
- Training outputs
- Saved models
Amazon SageMaker
Used to:
- Write ML code
- Train models
- Test models
- Run ML notebooks in the cloud
Big Picture Overview
Think of the workflow like this:
- Collect Data
- Store Data in Cloud
- Train Model
- Test Model
- Save Model
Optional Practice - Before You Start: Draw the workflow on paper as a diagram with arrows connecting each step. Try to explain it out loud to a friend or family member like you're teaching them. Teaching something is one of the best ways to understand it!
PART 1: Create AWS Account
Step 1: Sign Up for AWS
- Go to AWS website
- Click Create Account
- Enter email and password
- Add payment method (Free tier still requires card)
- Verify phone number
- Choose Free Tier plan
PART 2: Open SageMaker
Step 2: Open AWS Console
- Log into AWS
- In search bar type:
SageMaker
- Click SageMaker service
Step 3: Open SageMaker Studio
- Click Studio
- Click Open Studio
Wait for it to load.
PART 3: Create Storage (S3 Bucket)
Step 4: Go to S3
- In AWS search bar type:
S3
- Click S3 service
Step 5: Create Bucket
- Click Create Bucket
- Enter a bucket name (bucket names must be lowercase and globally unique):
ml-project-3-yourname
- Leave settings default
- Click Create Bucket
Optional Practice - Bucket Explorer: After creating your bucket, create a second bucket with a different name and try uploading a regular text file (like a .txt file with a fun message) into it. Then delete that second bucket. This builds comfort with S3 before your real data goes in!
PART 4: Create Notebook
Step 6: Create New Notebook
Inside SageMaker Studio:
- Click File
- Click New
- Click Notebook
- Choose Python 3 kernel
- Click Create
PART 5: Add Dataset
Step 7: Download Simple Dataset
Good beginner dataset:
- Iris dataset (CSV format)
Step 8: Upload Dataset to S3
- Open your bucket
- Click Upload
- Select CSV file
- Click Upload
PART 6: Train First Model
Step 9: Install Required Libraries
Run in a notebook cell (the % prefix installs packages into the notebook's own kernel):
%pip install sagemaker pandas numpy scikit-learn
Step 10: Import Libraries
import pandas as pd
import numpy as np
Step 11: Load Dataset
data = pd.read_csv("your_file.csv")
print(data.head())
Optional Practice - Explore Your Data: Before moving on, try running these extra lines to get to know your dataset better:
print(data.shape)           # How many rows and columns?
print(data.describe())      # Min, max, average of each column
print(data.isnull().sum())  # Are there any missing values?
See if you can answer: How many flowers are in the Iris dataset? How many different species are there?
Step 12: Prepare Data
from sklearn.model_selection import train_test_split
X = data.drop("target", axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2
)
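Two optional arguments make a split like this more dependable: random_state makes it reproducible, and stratify keeps the class balance the same in both halves. A self-contained sketch (it loads Iris straight from scikit-learn instead of your CSV):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load Iris directly from scikit-learn so this sketch runs on its own
X, y = load_iris(return_X_y=True)

# stratify=y keeps the proportion of each species equal in train and test;
# random_state=42 means you get the exact same split every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 120 30
```

Without random_state, every run of the notebook shuffles the data differently, which can make your accuracy numbers jump around for no real reason.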
Step 13: Train Model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
Optional Practice - Change the Forest Size: The RandomForestClassifier has a setting called n_estimators which controls how many decision trees it uses. Try changing it and see if your accuracy improves:
model = RandomForestClassifier(n_estimators=10)   # small forest
model = RandomForestClassifier(n_estimators=100)  # medium forest
model = RandomForestClassifier(n_estimators=500)  # big forest
Train and test each one. Does more trees always mean better accuracy?
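To compare the three forest sizes side by side, a loop saves some copy-pasting. This sketch is self-contained (it loads Iris from scikit-learn rather than your CSV, and fixes random_state so the comparison is fair and repeatable):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train a forest of each size on the same split and compare accuracy
for n in (10, 100, 500):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n} trees -> accuracy {acc:.3f}")
```

On a dataset as small and clean as Iris, you'll often find the small forest does just as well as the big one; more trees cost more time but don't always buy more accuracy.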
PART 7: Evaluate Model
Step 14: Make Predictions
predictions = model.predict(X_test)
Step 15: Measure Accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
If accuracy is close to:
- 1.0 → very good
- 0.5 → average
- Below 0.5 → needs improvement
Optional Practice - Try a Different Model: You used a Random Forest, but there are other models too! Try swapping it out and compare their accuracy scores:
# Option 1: Decision Tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

# Option 2: K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

# Option 3: Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
Train each one and record the accuracy. Which model wins on the Iris dataset? Make a little leaderboard in a comment in your notebook!
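A fair way to run that leaderboard is cross-validation: each model is tested on several different splits of the data instead of just one, so a lucky split can't flatter it. A self-contained sketch using scikit-learn's built-in copy of Iris:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# cv=5 splits the data five ways and averages the five accuracy scores
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

The averaged scores are usually a better basis for your leaderboard than a single train/test split.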
PART 8: Save Model (Basic Concept)
Usually models are saved:
- Locally
- Or back into S3 storage
Optional Practice - Actually Save Your Model: Try saving and reloading your trained model using joblib:
import joblib

# Save the model to a file
joblib.dump(model, "my_first_model.pkl")
print("Model saved!")

# Load it back and make a prediction
loaded_model = joblib.load("my_first_model.pkl")
print("Model loaded! Accuracy:", accuracy_score(y_test, loaded_model.predict(X_test)))
If the accuracy matches your original, the save worked perfectly. Congrats - you just built and preserved your first ML model!
PART 9: Clean Up Resources
Step 16: Stop Notebook
Inside SageMaker:
- Stop notebook instance
Step 17: Stop Training Jobs
Check:
- Running training jobs
- Stop any active jobs
Important: AWS can charge if resources keep running. Always clean up when done.
BONUS: Extra Challenges (When You're Ready)
These are completely optional but a great way to level up your skills!
Beginner Bonus - Predict a Single Flower
Use your trained model to predict just one new flower you make up:
import numpy as np
# A made-up flower: sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = model.predict(new_flower)
print("This flower is probably:", prediction[0])
Try changing the numbers. Does the predicted species change?
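Your model can also tell you how confident it is, not just its best guess. A self-contained sketch using predict_proba (it trains on scikit-learn's built-in copy of Iris, so the numbers may differ slightly from your CSV-trained model):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

# The same made-up flower as above
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# predict_proba returns one probability per species, summing to 1
probs = model.predict_proba(new_flower)[0]
for name, p in zip(iris.target_names, probs):
    print(f"{name}: {p:.2f}")
```

A flower near a class boundary gets split probabilities, while an obvious one gets nearly all the weight on a single species. Try the same borderline measurements here as in the exercise above.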
Intermediate Bonus - Visualize Your Data
See your data as a chart before training:
import matplotlib.pyplot as plt
# Plot two features against each other, colored by species
colors = {"setosa": "red", "versicolor": "blue", "virginica": "green"}
for species, group in data.groupby("target"):
    plt.scatter(group["sepal_length"], group["petal_length"], label=species, color=colors.get(species))
plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.legend()
plt.title("Iris Flowers by Species")
plt.show()
Can you see natural clusters forming? That's what the model is learning!
Advanced Bonus - Try a Totally Different Dataset
The Iris dataset is great for learning, but try one of these next:
| Dataset | What it predicts | Where to find it |
|---|---|---|
| Titanic | Who survived | kaggle.com |
| Penguins | Penguin species | seaborn.load_dataset("penguins") |
| Wine Quality | Wine rating | UCI ML Repository |
| Digits | Handwritten numbers | sklearn.datasets.load_digits() |
Load a new dataset and repeat the full workflow from Step 11 onwards. You already know how!
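As a taste of that, the Digits dataset from the table needs no download at all. Here's a self-contained sketch of the same workflow applied to it:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1,797 images of handwritten digits, each flattened to 64 pixel values
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# Same model, same steps: train, predict, measure accuracy
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print("Accuracy:", acc)
```

Notice that nothing about the workflow changed; only the data did. That's the pattern you'll reuse on every new dataset.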
Ultimate Challenge - Build a Confusion Matrix
A confusion matrix shows you where your model gets confused, not just how accurate it is overall:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.title("Where Does My Model Get Confused?")
plt.show()
Each row is the real answer. Each column is what the model guessed. Diagonal squares = correct! Off-diagonal = mistakes. Which species confuses your model the most?
You Did It! You've gone from zero to training, evaluating, and saving a real Machine Learning model in the cloud. Every data scientist started exactly where you are right now.