Simple ML Model with Amazon SageMaker

Train your first ML model on AWS



🎯 Goal

Train your first Machine Learning model using AWS cloud tools.


🧠 Skills You Will Learn

  • Machine Learning workflow basics
  • How cloud notebooks work
  • How to train and evaluate a model
  • How cloud storage is used in ML

โ˜๏ธ AWS Services Used

Amazon S3

Used to store:

  • Datasets
  • Training outputs
  • Saved models

Amazon SageMaker

Used to:

  • Write ML code
  • Train models
  • Test models
  • Run ML notebooks in the cloud

🌟 Big Picture Overview

Think of the workflow like this:

  1. Collect Data
  2. Store Data in Cloud
  3. Train Model
  4. Test Model
  5. Save Model

💡 Optional Practice – Before You Start: Draw the workflow on paper as a diagram with arrows connecting each step. Try to explain it out loud to a friend or family member like you're teaching them. Teaching something is one of the best ways to understand it!
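The five steps above can also be sketched end to end in a few lines of Python. This is only a preview of the code you will write later in this guide; it uses scikit-learn's built-in copy of the Iris dataset in place of cloud storage so it runs anywhere:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib

# 1. Collect Data -- scikit-learn ships with the Iris dataset
X, y = load_iris(return_X_y=True)

# 2. Store Data -- in this guide that's S3; here the data just stays in memory

# 3. Train Model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# 4. Test Model
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy:", accuracy)

# 5. Save Model
joblib.dump(model, "workflow_demo_model.pkl")
```

Each numbered comment maps to one box in your hand-drawn diagram.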


🧱 PART 1: Create AWS Account

Step 1: Sign Up for AWS

  1. Go to AWS website
  2. Click Create Account
  3. Enter email and password
  4. Add a payment method (the Free Tier still requires a card)
  5. Verify phone number
  6. Choose Free Tier plan

🧱 PART 2: Open SageMaker

Step 2: Open AWS Console

  1. Log into AWS
  2. In the search bar, type:
SageMaker
  3. Click the SageMaker service

Step 3: Open SageMaker Studio

  1. Click Studio
  2. If this is your first visit, you may be prompted to create a SageMaker domain first (the default settings are fine)
  3. Click Open Studio

Wait for it to load.


📦 PART 3: Create Storage (S3 Bucket)

Step 4: Go to S3

  1. In the AWS search bar, type:
S3
  2. Click the S3 service

Step 5: Create Bucket

  1. Click Create Bucket
  2. Enter a bucket name (bucket names must be lowercase and globally unique):
ml-project-3-yourname
  3. Leave the settings at their defaults
  4. Click Create Bucket

🎮 Optional Practice – Bucket Explorer: After creating your bucket, create a second bucket with a different name and try uploading a regular text file (like a .txt file with a fun message) into it. Then delete that second bucket. This builds comfort with S3 before your real data goes in!
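If you prefer code over clicking, a bucket could also be created with boto3. The helper below is a sketch that checks the core S3 naming rules (3–63 characters; lowercase letters, digits, dots, and hyphens; starts and ends with a letter or digit) but not every edge case, such as consecutive dots. The boto3 call is left commented out because it needs real AWS credentials:

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Check the core S3 bucket naming rules (not every edge case)."""
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name) is not None

print(is_valid_bucket_name("ml-project-3-yourname"))  # True
print(is_valid_bucket_name("ML-project-3-yourname"))  # False: uppercase not allowed

# With AWS credentials configured, the same bucket could be created from code
# (outside us-east-1, create_bucket also needs a CreateBucketConfiguration):
# import boto3
# boto3.client("s3").create_bucket(Bucket="ml-project-3-yourname")
```

This is also why the bucket name in Step 5 is all lowercase: S3 rejects names with capital letters.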


📓 PART 4: Create Notebook

Step 6: Create New Notebook

Inside SageMaker Studio:

  1. Click File
  2. Click New
  3. Click Notebook
  4. Choose Python 3 kernel
  5. Click Create

📊 PART 5: Add Dataset

Step 7: Download Simple Dataset

A good beginner dataset:

  • Iris dataset (CSV format)

Step 8: Upload Dataset to S3

  1. Open your bucket
  2. Click Upload
  3. Select CSV file
  4. Click Upload

🤖 PART 6: Train First Model

Step 9: Install Required Libraries

Run in a notebook cell (the leading ! runs it as a shell command):

!pip install sagemaker pandas numpy scikit-learn

Step 10: Import Libraries

import pandas as pd
import numpy as np

Step 11: Load Dataset

data = pd.read_csv("your_file.csv")
print(data.head())
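If you don't have a CSV handy yet, scikit-learn ships its own copy of Iris that loads straight into a DataFrame. Note that its feature columns are named in the "sepal length (cm)" style rather than "sepal_length", but it does include a "target" column like the rest of this guide assumes:

```python
from sklearn.datasets import load_iris

# as_frame=True returns a pandas DataFrame (with a "target" label column)
data = load_iris(as_frame=True).frame
print(data.head())
print(data.shape)  # (150, 5): 150 flowers, 4 measurements plus the label
```

Everything from Step 12 onward works the same on this DataFrame.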

๐Ÿ” Optional Practice โ€” Explore Your Data: Before moving on, try running these extra lines to get to know your dataset better:

print(data.shape)        # How many rows and columns?
print(data.describe())   # Min, max, average of each column
print(data.isnull().sum()) # Are there any missing values?

See if you can answer: How many flowers are in the Iris dataset? How many different species are there?


Step 12: Prepare Data

from sklearn.model_selection import train_test_split

# "target" is the label column; if your CSV names it differently (e.g. "species"), use that name
X = data.drop("target", axis=1)
y = data["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2
)
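One subtlety: train_test_split shuffles randomly, so your accuracy can change every time you rerun the notebook. Passing random_state makes the split repeatable, and stratify keeps the species proportions equal in both halves. A sketch using scikit-learn's built-in Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state makes the split repeatable across runs;
# stratify=y keeps the same share of each species in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 120 30
```

With a repeatable split, any accuracy difference you see later comes from the model, not from luck in the shuffle.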

Step 13: Train Model

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)

🌲 Optional Practice – Change the Forest Size: The RandomForestClassifier has a setting called n_estimators, which controls how many decision trees it uses. Try changing it and see if your accuracy improves:

model = RandomForestClassifier(n_estimators=10)   # small forest
model = RandomForestClassifier(n_estimators=100)  # medium forest
model = RandomForestClassifier(n_estimators=500)  # big forest

Train and test each one. Do more trees always mean better accuracy?
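The comparison above can be automated with a small loop. This sketch is self-contained, using the built-in Iris data and a fixed split:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train one forest per size and record its test accuracy
results = {}
for n in (10, 100, 500):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    results[n] = accuracy_score(y_test, model.predict(X_test))

for n, acc in results.items():
    print(f"n_estimators={n}: accuracy={acc:.3f}")
```

On a small, easy dataset like Iris, you will often find the scores plateau quickly: more trees cost more training time without buying more accuracy.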


📈 PART 7: Evaluate Model

Step 14: Make Predictions

predictions = model.predict(X_test)

Step 15: Measure Accuracy

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

If accuracy is close to:

  • 1.0 → very good
  • 0.9 → solid
  • 0.33 → no better than random guessing (Iris has three species, so chance is about one in three)
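What counts as "good" depends on how many classes there are, so a useful yardstick is scikit-learn's DummyClassifier, which always guesses the most common class. Any real model should beat it. A self-contained sketch with the built-in Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A "model" that always predicts the most common species in the training set
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print("Baseline accuracy:", baseline_accuracy)  # roughly 1/3 for three balanced classes
```

If your Random Forest only matches this baseline, it has not actually learned anything from the measurements.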

๐Ÿ† Optional Practice โ€” Try a Different Model: You used a Random Forest, but there are other models too! Try swapping it out and compare their accuracy scores:

# Option 1: Decision Tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

# Option 2: K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

# Option 3: Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

Train each one and record the accuracy. Which model wins on the Iris dataset? Make a little leaderboard in a comment in your notebook!
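The leaderboard suggested above can be built with a loop instead of retraining by hand. This sketch uses the built-in Iris data; max_iter is raised for LogisticRegression so it converges without a warning:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

contenders = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=200),
}

# Train each contender and record its test accuracy
leaderboard = {}
for name, model in contenders.items():
    model.fit(X_train, y_train)
    leaderboard[name] = accuracy_score(y_test, model.predict(X_test))

# Print the leaderboard, best model first
for name, acc in sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Because Iris is so easy, expect all three to score high; the interesting differences show up on harder datasets.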


💾 PART 8: Save Model (Basic Concept)

Usually models are saved:

  • Locally
  • Or back into S3 storage

💾 Optional Practice – Actually Save Your Model: Try saving and reloading your trained model using joblib:

import joblib

# Save the model to a file
joblib.dump(model, "my_first_model.pkl")
print("Model saved!")

# Load it back and make a prediction
loaded_model = joblib.load("my_first_model.pkl")
print("Model loaded! Accuracy:", accuracy_score(y_test, loaded_model.predict(X_test)))

If the accuracy matches your original, the save worked perfectly. Congrats – you just built and preserved your first ML model! 🎉


🧹 PART 9: Clean Up Resources

Step 16: Stop Notebook

Inside SageMaker:

  • Stop notebook instance

Step 17: Stop Training Jobs

Check:

  • Running training jobs
  • Stop any active jobs

โš ๏ธ Important: AWS can charge if resources keep running. Always clean up when done.


🚀 BONUS: Extra Challenges (When You're Ready)

These are completely optional but a great way to level up your skills!

🥉 Beginner Bonus – Predict a Single Flower

Use your trained model to predict just one new flower you make up:

import numpy as np

# A made-up flower: sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = model.predict(new_flower)   # predict returns an array, one answer per row
print("This flower is probably a:", prediction[0])

Try changing the numbers. Does the predicted species change?
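Beyond the predicted label, predict_proba shows how confident the model is in each species. Here is a self-contained sketch that trains a fresh Random Forest on the built-in Iris data and scores the same made-up flower:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(random_state=42)
model.fit(iris.data, iris.target)

# Same made-up flower as above
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# One probability per species; they always sum to 1
probabilities = model.predict_proba(new_flower)[0]
for species, p in zip(iris.target_names, probabilities):
    print(f"{species}: {p:.2f}")
```

Those measurements are textbook setosa, so expect nearly all the probability on that species; as you nudge the numbers toward another species, watch the probabilities shift before the label flips.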


🥈 Intermediate Bonus – Visualize Your Data

See your data as a chart before training:

import matplotlib.pyplot as plt

# Plot two features against each other, colored by species
colors = {"setosa": "red", "versicolor": "blue", "virginica": "green"}
for species, group in data.groupby("target"):
    plt.scatter(
        group["sepal_length"],
        group["petal_length"],
        color=colors.get(species, "gray"),
        label=species,
    )

plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.legend()
plt.title("Iris Flowers by Species")
plt.show()

Can you see natural clusters forming? That's what the model is learning!
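The clusters you see in the chart can also be checked with numbers: if the average petal length per species is very different, the classes are easy to separate. A sketch using scikit-learn's built-in Iris frame (its column names differ slightly from a downloaded CSV):

```python
from sklearn.datasets import load_iris

data = load_iris(as_frame=True).frame

# Average petal length per species: 0=setosa, 1=versicolor, 2=virginica.
# Well-separated averages are exactly the clusters the scatter plot shows.
means = data.groupby("target")["petal length (cm)"].mean()
print(means)
```

Setosa's average is far below the other two, which is why even a simple model separates it almost perfectly.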


🥇 Advanced Bonus – Try a Totally Different Dataset

The Iris dataset is great for learning, but try one of these next:

| Dataset      | What it predicts    | Where to find it                  |
|--------------|---------------------|-----------------------------------|
| Titanic      | Who survived        | kaggle.com                        |
| Penguins     | Penguin species     | seaborn.load_dataset("penguins")  |
| Wine Quality | Wine rating         | UCI ML Repository                 |
| Digits       | Handwritten numbers | sklearn.datasets.load_digits()    |

Load a new dataset and repeat the full workflow from Step 11 onwards. You already know how!
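As a taste of the Digits option, the whole workflow fits in a few lines. This is a sketch with default settings, not a tuned model:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = load_digits()
print(digits.data.shape)  # (1797, 64): 1797 images, each 8x8 pixels flattened

# Same split-train-score pattern as the Iris workflow
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
digits_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Digits accuracy:", digits_accuracy)
```

Notice nothing about the pattern changed: only the data did. That transferability is the whole point of learning the workflow.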


๐Ÿ… Ultimate Challenge โ€” Build a Confusion Matrix

A confusion matrix shows you where your model gets confused, not just how accurate it is overall:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.title("Where Does My Model Get Confused?")
plt.show()

Each row is the real answer. Each column is what the model guessed. Diagonal squares = correct! Off-diagonal = mistakes. Which species confuses your model the most?


🌟 You Did It! You've gone from zero to training, evaluating, and saving a real Machine Learning model in the cloud. Every data scientist started exactly where you are right now.