Simple ML Model with Amazon SageMaker (Full Step-by-Step Guide)
Goal
Train your first Machine Learning model using AWS cloud tools.
Skills You Will Learn
- Machine Learning workflow basics
- How cloud notebooks work
- How to train and evaluate a model
- How cloud storage is used in ML
AWS Services Used
Amazon S3
Used to store:
- Datasets
- Training outputs
- Saved models
Amazon SageMaker
Used to:
- Write ML code
- Train models
- Test models
- Run ML notebooks in the cloud
Big Picture Overview
Think of the workflow like this:
- Collect Data
- Store Data in Cloud
- Train Model
- Test Model
- Save Model
Optional Practice - Before You Start: Draw the workflow on paper as a diagram with arrows connecting each step. Try to explain it out loud to a friend or family member like you're teaching them. Teaching something is one of the best ways to understand it!
PART 1: Create AWS Account
Step 1: Sign Up for AWS
- Go to AWS website
- Click Create Account
- Enter email and password
- Add payment method (Free tier still requires card)
- Verify phone number
- Choose Free Tier plan
PART 2: Open SageMaker
Step 2: Open AWS Console
- Log into AWS
- In search bar type:
SageMaker
- Click SageMaker service
Step 3: Open SageMaker Studio
- Click Studio
- Click Open Studio
Wait for it to load.
PART 3: Create Storage (S3 Bucket)
Step 4: Go to S3
- In AWS search bar type:
S3
- Click S3 service
Step 5: Create Bucket
- Click Create Bucket
- Enter a bucket name (bucket names must be lowercase and globally unique):
ml-project-3-yourname
- Leave settings default
- Click Create Bucket
Optional Practice - Bucket Explorer: After creating your bucket, create a second bucket with a different name and try uploading a regular text file (like a .txt file with a fun message) into it. Then delete that second bucket. This builds comfort with S3 before your real data goes in!
PART 4: Create Notebook
Step 6: Create New Notebook
Inside SageMaker Studio:
- Click File
- Click New
- Click Notebook
- Choose Python 3 kernel
- Click Create
PART 5: Add Dataset
Step 7: Download Simple Dataset
Good beginner dataset:
- Iris dataset (CSV format)
Step 8: Upload Dataset to S3
- Open your bucket
- Click Upload
- Select CSV file
- Click Upload
PART 6: Train First Model
Step 9: Install Required Libraries
Run in a notebook cell (the % prefix installs packages into the notebook's own kernel):
%pip install sagemaker pandas numpy scikit-learn
Step 10: Import Libraries
import pandas as pd
import numpy as np
Step 11: Load Dataset
data = pd.read_csv("your_file.csv")
print(data.head())
Optional Practice - Explore Your Data: Before moving on, try running these extra lines to get to know your dataset better:
print(data.shape)           # How many rows and columns?
print(data.describe())      # Min, max, average of each column
print(data.isnull().sum())  # Are there any missing values?
See if you can answer: How many flowers are in the Iris dataset? How many different species are there?
Step 12: Prepare Data
from sklearn.model_selection import train_test_split
X = data.drop("target", axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2
)
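Two optional arguments make a split like this more dependable: random_state makes it reproducible, and stratify keeps the class balance the same in both halves. A self-contained sketch (it loads Iris straight from scikit-learn instead of your CSV):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load Iris directly from scikit-learn so this sketch runs on its own
X, y = load_iris(return_X_y=True)

# stratify=y keeps the proportion of each species equal in train and test;
# random_state=42 means you get the exact same split every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 120 30
```

Without random_state, every run of the notebook shuffles the data differently, which can make your accuracy numbers jump around for no real reason.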
Step 13: Train Model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
Optional Practice - Change the Forest Size: The RandomForestClassifier has a setting called n_estimators which controls how many decision trees it uses. Try changing it and see if your accuracy improves:
model = RandomForestClassifier(n_estimators=10)   # small forest
model = RandomForestClassifier(n_estimators=100)  # medium forest
model = RandomForestClassifier(n_estimators=500)  # big forest
Train and test each one. Does more trees always mean better accuracy?
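To compare the three forest sizes side by side, a loop saves some copy-pasting. This sketch is self-contained (it loads Iris from scikit-learn rather than your CSV, and fixes random_state so the comparison is fair and repeatable):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train a forest of each size on the same split and compare accuracy
for n in (10, 100, 500):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n} trees -> accuracy {acc:.3f}")
```

On a dataset as small and clean as Iris, you'll often find the small forest does just as well as the big one; more trees cost more time but don't always buy more accuracy.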
PART 7: Evaluate Model
Step 14: Make Predictions
predictions = model.predict(X_test)
Step 15: Measure Accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
If accuracy is close to:
- 1.0 → very good
- 0.5 → average
- Below 0.5 → needs improvement
Optional Practice - Try a Different Model: You used a Random Forest, but there are other models too! Try swapping it out and compare their accuracy scores:
# Option 1: Decision Tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

# Option 2: K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

# Option 3: Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
Train each one and record the accuracy. Which model wins on the Iris dataset? Make a little leaderboard in a comment in your notebook!
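A fair way to run that leaderboard is cross-validation: each model is tested on several different splits of the data instead of just one, so a lucky split can't flatter it. A self-contained sketch using scikit-learn's built-in copy of Iris:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# cv=5 splits the data five ways and averages the five accuracy scores
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

The averaged scores are usually a better basis for your leaderboard than a single train/test split.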
PART 8: Save Model (Basic Concept)
Usually models are saved:
- Locally
- Or back into S3 storage
Optional Practice - Actually Save Your Model: Try saving and reloading your trained model using joblib:
import joblib

# Save the model to a file
joblib.dump(model, "my_first_model.pkl")
print("Model saved!")

# Load it back and make a prediction
loaded_model = joblib.load("my_first_model.pkl")
print("Model loaded! Accuracy:", accuracy_score(y_test, loaded_model.predict(X_test)))
If the accuracy matches your original, the save worked perfectly. Congrats - you just built and preserved your first ML model!
PART 9: Clean Up Resources
Step 16: Stop Notebook
Inside SageMaker:
- Stop notebook instance
Step 17: Stop Training Jobs
Check:
- Running training jobs
- Stop any active jobs
Important: AWS can charge if resources keep running. Always clean up when done.
BONUS: Extra Challenges (When You're Ready)
These are completely optional but a great way to level up your skills!
Beginner Bonus - Predict a Single Flower
Use your trained model to predict just one new flower you make up:
import numpy as np
# A made-up flower: sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = model.predict(new_flower)
print("This flower is probably:", prediction[0])
Try changing the numbers. Does the predicted species change?
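Your model can also tell you how confident it is, not just its best guess. A self-contained sketch using predict_proba (it trains on scikit-learn's built-in copy of Iris, so the numbers may differ slightly from your CSV-trained model):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

# The same made-up flower as above
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# predict_proba returns one probability per species, summing to 1
probs = model.predict_proba(new_flower)[0]
for name, p in zip(iris.target_names, probs):
    print(f"{name}: {p:.2f}")
```

A flower near a class boundary gets split probabilities, while an obvious one gets nearly all the weight on a single species. Try the same borderline measurements here as in the exercise above.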
Intermediate Bonus - Visualize Your Data
See your data as a chart before training:
import matplotlib.pyplot as plt
# Plot two features against each other, colored by species
colors = {"setosa": "red", "versicolor": "blue", "virginica": "green"}
for species, group in data.groupby("target"):
    plt.scatter(group["sepal_length"], group["petal_length"], label=species, color=colors.get(species))
plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.legend()
plt.title("Iris Flowers by Species")
plt.show()
Can you see natural clusters forming? That's what the model is learning!
Advanced Bonus - Try a Totally Different Dataset
The Iris dataset is great for learning, but try one of these next:
| Dataset | What it predicts | Where to find it |
|---|---|---|
| Titanic | Who survived | kaggle.com |
| Penguins | Penguin species | seaborn.load_dataset("penguins") |
| Wine Quality | Wine rating | UCI ML Repository |
| Digits | Handwritten numbers | sklearn.datasets.load_digits() |
Load a new dataset and repeat the full workflow from Step 11 onwards. You already know how!
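As a taste of that, the Digits dataset from the table needs no download at all. Here's a self-contained sketch of the same workflow applied to it:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1,797 images of handwritten digits, each flattened to 64 pixel values
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# Same model, same steps: train, predict, measure accuracy
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print("Accuracy:", acc)
```

Notice that nothing about the workflow changed; only the data did. That's the pattern you'll reuse on every new dataset.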
Ultimate Challenge - Build a Confusion Matrix
A confusion matrix shows you where your model gets confused, not just how accurate it is overall:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.title("Where Does My Model Get Confused?")
plt.show()
Each row is the real answer. Each column is what the model guessed. Diagonal squares = correct! Off-diagonal = mistakes. Which species confuses your model the most?
You Did It! You've gone from zero to training, evaluating, and saving a real Machine Learning model in the cloud. Every data scientist started exactly where you are right now.