Machine learning can feel overwhelming when you are starting out. The tools, setup, and environment configuration often become the biggest barrier before you even write your first line of code.

The easiest way to begin is Google Colab. It removes all setup complexity and allows you to run Python, PySpark, and machine learning workflows directly in your browser.

This guide walks you through creating your first machine learning notebook from scratch, building it step by step, and implementing a complete end-to-end ML pipeline.

Why Google Colab?

Google Colab is ideal for beginners because:

No installation required
Runs entirely in your browser
Free access to compute resources
Supports Python, PySpark, and ML libraries
Easy sharing and collaboration

Your entire machine learning project will live inside a notebook, where each step is executed in sequence.

Open Google Colab

Go to:
https://colab.research.google.com

Sign in with your Google account
Click File → New notebook

You now have a blank notebook.

Understand Notebook Structure

A notebook consists of cells:

Code cells → run Python code
Text cells → add explanations

You will build your ML project by adding cells one by one.

Install Required Libraries

Create your first code cell

Paste this:

!pip install pyspark pandas matplotlib seaborn scikit-learn

Run the cell

Click the Play button (▶) on the left
Wait for installation to complete

This installs all required tools:

PySpark → distributed processing
Pandas → data handling
Matplotlib/Seaborn → visualisation
Scikit-learn → ML utilities
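
After the install finishes, it can help to confirm which versions Colab actually picked up. The following is a minimal sketch using only the Python standard library; the helper name installed_versions is made up for illustration and is not part of any of the installed packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as passed to pip in the install cell
PACKAGES = ["pyspark", "pandas", "matplotlib", "seaborn", "scikit-learn"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, re-run the install cell before continuing.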

Import Libraries

Add a new code cell

Hover your mouse near the bottom edge of an existing cell and a pop-up will appear with +Code and +Text. Choose +Code to paste code or +Text to add regular text.

			
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# PySpark (explicit imports avoid shadowing Python built-ins such as sum, min and max)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
# MLlib (Pipeline is needed later for the feature vector + scaling step)
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

		

Create Spark Session

Add a new cell

			
spark = SparkSession.builder \
    .appName("ParkinsonsClassification") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
print("Spark Version:", spark.version)

		

This cell:

Creates the Spark engine
Sets shuffle partitions to 8 instead of the default 200, which suits a small dataset
Reduces log noise

This is the entry point for all PySpark operations.

Download and Load Dataset

Add a new cell

			
import os
import urllib.request
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data"
if not os.path.exists("parkinsons.data"):
    urllib.request.urlretrieve(url, "parkinsons.data")
df = spark.read.csv("parkinsons.data", header=True, inferSchema=True)
# Drop non-useful column
df = df.drop("name")
df.printSchema()
print("Shape:", df.count(), len(df.columns))

		

Dataset downloaded automatically (only if the file is not already present)
Loaded into a Spark DataFrame
name column removed (not useful for ML)
Dataset has 195 rows and 23 columns
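
A quick way to double-check the row and column counts independently of Spark is to read the raw file with Python's csv module. This is only a sketch; csv_shape is a hypothetical helper, and the numbers mentioned in the comments are the ones stated above. Note that the raw file still contains the name column, so it has one more column than the DataFrame.

```python
import csv


def csv_shape(path):
    """Return (data_rows, columns) for a CSV file that has a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = 0
        for row in reader:
            if row:  # ignore blank lines
                rows += 1
    return rows, len(header)


# In the notebook, after the download cell has run:
# rows, cols = csv_shape("parkinsons.data")
# print(rows, cols - 1)  # compare with the 195 rows and 23 columns reported above
```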

Data Exploration

7.1 Preview Data

df.show(5)

This confirms data loaded correctly.

7.2 Descriptive Statistics

df.describe().show()

This gives:

Count
Mean
Standard deviation
Min / Max

7.3 Class Distribution

df.groupBy("status").count().show()

Insight

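Healthy: ~25%
Parkinson’s: ~75%

This shows class imbalance, which affects model evaluation: a model that always predicts Parkinson’s would already be right about 75% of the time, so accuracy alone can be misleading.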
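
The effect of the imbalance can be made concrete with a majority-class baseline. The sketch below assumes class counts of 48 healthy and 147 Parkinson's recordings, which is consistent with the approximate percentages above; substitute the counts printed by the groupBy cell if they differ.

```python
# Assumed class counts (replace with the output of the groupBy cell)
class_counts = {0: 48, 1: 147}  # 0 = healthy, 1 = Parkinson's

total = class_counts[0] + class_counts[1]
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a "model" that always predicts the most common class
majority_label = 1 if class_counts[1] >= class_counts[0] else 0
baseline_accuracy = class_counts[majority_label] / total

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any trained model should beat this baseline before its accuracy is considered meaningful, which is one reason the guide reports AUC as well.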

7.4 Feature Distribution Analysis

Understanding how each feature behaves is an essential step before building machine learning models. Instead of plotting all features in a single combined chart, we create a structured grid of histograms so that each feature can be analysed individually.

Add a new code cell

			
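# Convert the Spark DataFrame to pandas for plotting (the dataset is small)
pdf = df.toPandas()
# Feature distributions
feature_cols = [c for c in pdf.columns if c != 'status']
fig, axes = plt.subplots(5, 5, figsize=(22, 18))
axes = axes.flatten()
# Use "name" as the loop variable so the PySpark col() function is not overwritten
for i, name in enumerate(feature_cols):
    axes[i].hist(pdf[name], bins=25, color='#42A5F5', edgecolor='white', alpha=0.85)
    axes[i].set_title(name, fontsize=8, fontweight='bold')
    axes[i].tick_params(labelsize=7)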
# Hide unused subplots
for j in range(len(feature_cols), len(axes)):
    axes[j].set_visible(False)
plt.suptitle("Feature Distributions", fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.savefig("feature_distributions.png", dpi=120, bbox_inches='tight')
plt.show()

		

7.5 Box Plot Analysis by Class

Box plots help compare selected biomedical voice features between the two target classes: healthy subjects and Parkinson’s disease cases. This makes it easier to see whether a feature shows visible separation between the two groups.

Add a new code cell

			
# Box plots: feature vs status
selected = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)',
            'HNR', 'RPDE', 'DFA', 'spread1', 'spread2', 'PPE']
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.flatten()
# Use "name" as the loop variable so the PySpark col() function is not overwritten
for i, name in enumerate(selected):
    pdf.boxplot(column=name, by='status', ax=axes[i],
                boxprops=dict(color='#1565C0'),
                medianprops=dict(color='red', linewidth=2),
                whiskerprops=dict(color='gray'))
    axes[i].set_title(name, fontsize=9, fontweight='bold')
    axes[i].set_xlabel("Status (0=Healthy, 1=PD)")
plt.suptitle("Feature Box Plots by Class", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig("boxplots.png", dpi=120, bbox_inches='tight')
plt.show()

		

Data Visualisation

Convert to Pandas (for plotting)

pdf = df.toPandas()

8.1 Histogram Example

			
pdf.hist(figsize=(15,10))
plt.show()

Insight

Many features are skewed
Some have extreme values

8.2 Correlation Heatmap

			
plt.figure(figsize=(12,10))
sns.heatmap(pdf.corr(), cmap='coolwarm')
plt.show()

Insight

Strong correlations exist between many of the features
Features such as spread1 and PPE look like important predictors

Data Preprocessing

9.1 Check Missing Values

			
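# Alias Spark's sum so it does not replace Python's built-in sum
from pyspark.sql.functions import col, sum as spark_sum
# Count the null values in every column
df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()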

Result

No missing values found

9.2 Feature Preparation

			
feature_cols = [c for c in df.columns if c != "status"]
df = df.select(
    [col(c).cast(DoubleType()) for c in feature_cols] +
    [col("status").cast(DoubleType()).alias("label")]
)

		

9.3 Feature Vector + Scaling

			
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
pipeline = Pipeline(stages=[assembler, scaler])
model = pipeline.fit(df)
df_prepared = model.transform(df)

		

Scaling is needed because the features have very different ranges, which can bias scale-sensitive models such as logistic regression.
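
With its default settings, Spark's StandardScaler divides each feature by its sample standard deviation and does not subtract the mean (withStd=True, withMean=False). The plain-Python sketch below illustrates that arithmetic on two columns; the values and column names are invented for illustration and are not taken from the dataset.

```python
from statistics import stdev  # sample standard deviation (n - 1 in the denominator)

# Two invented feature columns on very different scales
columns = {
    "pitch_hz": [120.0, 150.0, 180.0, 210.0],
    "jitter": [0.002, 0.004, 0.006, 0.008],
}


def scale_to_unit_std(values):
    """Divide by the sample standard deviation without centring on the mean."""
    spread = stdev(values)
    return [v / spread for v in values]


scaled = {name: scale_to_unit_std(values) for name, values in columns.items()}
for name, values in scaled.items():
    print(name, [f"{v:.3f}" for v in values], "std =", f"{stdev(values):.3f}")
```

After scaling, both columns have a standard deviation of 1, so neither dominates purely because of its units. Passing withMean=True to StandardScaler would additionally centre each feature on zero.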

Train-Test Split

			
train, test = df_prepared.randomSplit([0.8, 0.2], seed=42)
print("Train:", train.count())
print("Test:", test.count())

Roughly 80% training
Roughly 20% testing (randomSplit assigns each row at random, so the exact counts vary)
Reproducible using seed

Model Building

11.1 Logistic Regression

			
lr = LogisticRegression(featuresCol="features", labelCol="label")
model_lr = lr.fit(train)
pred_lr = model_lr.transform(test)

11.2 Random Forest

			
rf = RandomForestClassifier(featuresCol="features", labelCol="label")
model_rf = rf.fit(train)
pred_rf = model_rf.transform(test)

11.3 Decision Tree

			
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
model_dt = dt.fit(train)
pred_dt = model_dt.transform(test)

Model Evaluation

			
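# areaUnderROC is the default metric; it is spelled out here for clarity
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")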
print("LR AUC:", evaluator.evaluate(pred_lr))
print("RF AUC:", evaluator.evaluate(pred_rf))
print("DT AUC:", evaluator.evaluate(pred_dt))
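
Area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The following sketch computes it directly from that definition in plain Python. The labels and scores are invented for illustration, and this is not how Spark computes the metric internally, so small differences from the evaluator's output would not be surprising.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = Parkinson's, 0 = healthy
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
print("AUC:", pairwise_auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why AUC is less sensitive to class imbalance than plain accuracy.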

How to Run the Project

At this stage, your notebook contains a complete machine learning pipeline. The next step is understanding how to execute it correctly, whether you want to run everything at once or work step by step.

Running the Entire Notebook

The simplest way to execute your project is to run all cells in sequence.

Steps:

Go to the top menu
Click Runtime → Run all
If prompted, click Run anyway

What happens:

Each cell executes one after another
Data is downloaded automatically
Models are trained and evaluated
Outputs such as tables, charts, and metrics appear below each cell

Expected runtime:

Typically 5 to 15 minutes, depending on Colab resources

You will know it has completed successfully when:

All cells show a number like [1], [2], etc.
No cells are stuck with [*]
Outputs (graphs, metrics, predictions) are visible
No red error messages appear

Running Step by Step (Recommended for Learning)

If you want to understand how everything works, run the notebook one section at a time.

Suggested execution order:

Install libraries
Import libraries
Create Spark session
Load dataset
Explore data
Preprocess data
Prepare features
Train models
Evaluate models

How to run a single cell:

Click the Play button (▶) next to the cell
Or press Shift + Enter

Re-running Specific Sections

Sometimes you may only want to re-run a part of the notebook.

Example scenarios:

Changing a model parameter
Updating visualisations
Fixing a preprocessing step

Important rule:

Always ensure dependencies are executed first.

For example:

If you restart runtime, you must re-run installation and imports
If you change preprocessing, re-run feature engineering and model training
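
One way to make the dependency rule explicit is a small guard at the top of a cell that checks whether earlier cells have already defined what it needs. This is a sketch; missing_names is a hypothetical helper, and the variable names listed are the ones used in this guide.

```python
def missing_names(required, namespace):
    """Return the names from `required` that are not defined in `namespace`."""
    return [name for name in required if name not in namespace]


# Names the model-training cells rely on in this guide
REQUIRED = ["spark", "df_prepared", "train", "test"]

missing = missing_names(REQUIRED, globals())
if missing:
    print("Run the earlier cells first; not yet defined:", ", ".join(missing))
else:
    print("All dependencies are defined.")
```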

Restarting the Environment

Google Colab sessions are temporary. If disconnected:

Steps:

Click Runtime → Restart runtime
Then click Runtime → Run all

This ensures the entire pipeline runs from a clean state.

Running Only Model Training

If you want to focus only on models:

Ensure dataset is already loaded and processed
Run only:
- Feature preparation
- Train-test split
- Model training cells
- Evaluation cells

Key Results

Logistic Regression → baseline
Random Forest → best AUC
Decision Tree → best accuracy

Final Performance

Decision Tree Accuracy: ~90%
Random Forest AUC: ~0.94
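
The evaluation cell in this guide prints AUC only, so an accuracy figure has to be computed separately. A minimal sketch of the arithmetic in plain Python follows. The (label, prediction) pairs are invented; in the notebook they could be collected with something like pred_dt.select("label", "prediction").collect(). Spark's MulticlassClassificationEvaluator with metricName="accuracy" is another option.

```python
def accuracy(pairs):
    """Share of (label, prediction) pairs where the prediction matches the label."""
    if not pairs:
        raise ValueError("no predictions to score")
    correct = 0
    for label, prediction in pairs:
        if label == prediction:
            correct += 1
    return correct / len(pairs)


# Invented predictions for illustration: (label, prediction)
example = [(1.0, 1.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
print(f"accuracy: {accuracy(example):.2f}")
```

With a test set of only a few dozen rows, a single misclassified row shifts accuracy by several percentage points, so small differences between models should be read with caution.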

Understanding the Outcome

You created a complete ML pipeline:

Data loading
Exploration
Preprocessing
Feature engineering
Model training
Evaluation

Key takeaway

Voice features carry enough signal to separate Parkinson’s cases from healthy subjects in this dataset
Non-linear models performed better here
Tree-based models handle correlated features well

Source Code and Full Notebook

The complete implementation of this project, including the fully structured Google Colab notebook with all steps, outputs, and model evaluations, is available in the following repository:

GitHub Repository:
https://github.com/vivekbhadra/Sample_GoogleColabProject_For_ML

This repository contains:

The full end-to-end notebook used in this guide
All machine learning steps from data loading to model evaluation
Visualisations, metrics, and final results
A ready-to-run version that you can directly open in Google Colab

If you are following this guide step by step, you can use the repository as:

A reference implementation to verify your progress
A starting point for your own experiments
A base project to extend with additional models or datasets

How to Upload and Run an Existing Notebook in Google Colab

If you already have a notebook file (.ipynb) from GitHub or your local machine, Google Colab allows you to upload and run it easily without any setup.

This is useful when you want to:

Run a pre-built machine learning project
Reuse an existing notebook
Test or modify someone else’s implementation

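Upload the Notebook

Click File → Upload notebook
Select the .ipynb file from your machine, or use the GitHub tab to load it from a repository

Run the Notebook

Click Runtime → Run all
Click Run anyway if prompted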

All cells will execute sequentially.
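
A notebook in a public GitHub repository can also be opened directly through a Colab URL, without downloading it first. The sketch below builds such a link. The helper colab_url is hypothetical, and the branch name and notebook file name in the example are placeholders, since this guide does not state them; check the repository for the real values.

```python
from urllib.parse import quote

COLAB_GITHUB_PREFIX = "https://colab.research.google.com/github"


def colab_url(user, repo, branch, notebook_path):
    """Build a Colab link for a notebook stored in a public GitHub repository."""
    path = quote(notebook_path)  # escape spaces and other unsafe characters
    return f"{COLAB_GITHUB_PREFIX}/{user}/{repo}/blob/{branch}/{path}"


# Branch and notebook file name are placeholders, not the real ones
print(colab_url("vivekbhadra", "Sample_GoogleColabProject_For_ML", "main", "notebook.ipynb"))
```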
