Machine learning can feel overwhelming when you are starting out. The tools, setup, and environment configuration often become the biggest barrier before you even write your first line of code.

The easiest way to begin is Google Colab. It removes all setup complexity and allows you to run Python, PySpark, and machine learning workflows directly in your browser.

This guide walks you through creating your first machine learning notebook from scratch, building it step by step, and implementing a complete end-to-end ML pipeline.

Why Google Colab?

Google Colab is ideal for beginners because:

  • No installation required
  • Runs entirely in your browser
  • Free access to compute resources
  • Supports Python, PySpark, and ML libraries
  • Easy sharing and collaboration

Your entire machine learning project will live inside a notebook, where each step is executed in sequence.

Open Google Colab

Go to:
https://colab.research.google.com

  • Sign in with your Google account
  • Click File → New notebook

You now have a blank notebook.

Understand Notebook Structure

A notebook consists of cells:

  • Code cells → run Python code
  • Text cells → add explanations

You will build your ML project by adding cells one by one.

Install Required Libraries

Create your first code cell

Paste this:

!pip install pyspark pandas matplotlib seaborn scikit-learn

Run the cell

  • Click the Play button (▶) on the left
  • Wait for installation to complete

This installs all required tools:

  • PySpark → distributed processing
  • Pandas → data handling
  • Matplotlib/Seaborn → visualisation
  • Scikit-learn → ML utilities
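Before moving on, it can help to confirm that the installation worked. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]

# Import names for the packages installed above (scikit-learn imports as sklearn)
required = ["pyspark", "pandas", "matplotlib", "seaborn", "sklearn"]
print("Missing:", missing_packages(required))
```

An empty list means every package can be imported.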

Import Libraries

Add a new code cell

Hover over the bottom edge of an existing cell and a pop-up will show + Code and + Text. Choose + Code to paste code, or + Text for regular text.

# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# PySpark
from pyspark.sql import SparkSession
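# Import only the names used below; a wildcard import would shadow
# Python built-ins such as sum, min and max
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType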
# MLlib
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

Create Spark Session

Add a new cell

spark = SparkSession.builder \
    .appName("ParkinsonsClassification") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
print("Spark Version:", spark.version)

  • Creates the Spark engine
  • Lowers shuffle partitions from Spark's default of 200 to 8, which suits a small dataset
  • Reduces log noise

This is the entry point for all PySpark operations.

Download and Load Dataset

Add a new cell

import os
import urllib.request

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data"
# Download only if the file is not already present
if not os.path.exists("parkinsons.data"):
    urllib.request.urlretrieve(url, "parkinsons.data")
df = spark.read.csv("parkinsons.data", header=True, inferSchema=True)
# Drop non-useful column
df = df.drop("name")
df.printSchema()
print("Shape:", df.count(), len(df.columns))

  • Dataset downloaded automatically
  • Loaded into a Spark DataFrame
  • name column removed (not useful for ML)
  • Dataset has 195 rows and 23 columns
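A quick way to double-check the download, independent of Spark, is to count rows and columns with the standard csv module. `csv_shape` is an illustrative helper, not part of the notebook. Because the raw file still contains the name column, it should report one more column than the DataFrame above.

```python
import csv
import os

def csv_shape(path):
    """Return (data_rows, columns) for a CSV file that has a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        data_rows = 0
        for row in reader:
            if row:  # skip blank lines
                data_rows += 1
    return data_rows, len(header)

# Only runs once the download cell has created the file
if os.path.exists("parkinsons.data"):
    print("Raw file shape:", csv_shape("parkinsons.data"))
```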

Data Exploration

7.1 Preview Data

df.show(5)

This confirms that the data loaded correctly.

7.2 Descriptive Statistics

df.describe().show()

This gives:

  • Count
  • Mean
  • Standard deviation
  • Min / Max

7.3 Class Distribution

df.groupBy("status").count().show()

Insight

  • Healthy: ~25%
  • Parkinson’s: ~75%

This shows class imbalance, which affects model evaluation.
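To see why imbalance matters, consider a model that always predicts the majority class. The sketch below computes that baseline accuracy from class counts. The counts used are the ones commonly reported for this dataset (48 healthy, 147 Parkinson’s) and are an assumption here; substitute the numbers printed by the groupBy cell.

```python
import math

def majority_baseline(class_counts):
    """Accuracy of a model that always predicts the most frequent class."""
    counts = sorted(class_counts.values())
    return counts[-1] / math.fsum(counts)

# Assumed class counts (status 0 = healthy, 1 = Parkinson's); replace with your own output
counts = {0: 48, 1: 147}
print(f"Majority-class baseline accuracy: {majority_baseline(counts):.1%}")
```

A model only adds value if it beats this figure, which is one reason to look at AUC and not accuracy alone.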

7.4 Feature Distribution Analysis

Understanding how each feature behaves is an essential step before building machine learning models. Instead of plotting all features in a single combined chart, we create a structured grid of histograms so that each feature can be analysed individually.

Add a new code cell

# Convert to pandas for plotting (safe here because the dataset is small)
pdf = df.toPandas()

# Feature distributions
feature_cols = [c for c in pdf.columns if c != 'status']
fig, axes = plt.subplots(5, 5, figsize=(22, 18))
axes = axes.flatten()
# The loop variable is called "feature" so it does not overwrite PySpark's col()
for i, feature in enumerate(feature_cols):
    axes[i].hist(pdf[feature], bins=25, color='#42A5F5', edgecolor='white', alpha=0.85)
    axes[i].set_title(feature, fontsize=8, fontweight='bold')
    axes[i].tick_params(labelsize=7)
# Hide unused subplots
for j in range(len(feature_cols), len(axes)):
    axes[j].set_visible(False)
plt.suptitle("Feature Distributions", fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.savefig("feature_distributions.png", dpi=120, bbox_inches='tight')
plt.show()
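The 5 × 5 grid above is hard-coded for this dataset's 22 features. As a small sketch, the grid size can be derived from the feature count instead. `grid_shape` is a hypothetical helper name.

```python
import math

def grid_shape(n_plots, n_cols=5):
    """Return (rows, cols) for the smallest grid with n_cols columns that fits n_plots."""
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols

rows, cols = grid_shape(22)
print(rows, cols)
```

The result can be passed straight to plt.subplots(rows, cols, ...), so the same cell keeps working if the number of features changes.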

7.5 Box Plot Analysis by Class

Box plots help compare selected biomedical voice features between the two target classes: healthy subjects and Parkinson’s disease cases. This makes it easier to see whether a feature shows visible separation between the two groups.

Add a new code cell

# Box plots: feature vs status
selected = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)',
            'HNR', 'RPDE', 'DFA', 'spread1', 'spread2', 'PPE']
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.flatten()
# The loop variable is called "feature" so it does not overwrite PySpark's col()
for i, feature in enumerate(selected):
    pdf.boxplot(column=feature, by='status', ax=axes[i],
                boxprops=dict(color='#1565C0'),
                medianprops=dict(color='red', linewidth=2),
                whiskerprops=dict(color='gray'))
    axes[i].set_title(feature, fontsize=9, fontweight='bold')
    axes[i].set_xlabel("Status (0=Healthy, 1=PD)")
plt.suptitle("Feature Box Plots by Class", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig("boxplots.png", dpi=120, bbox_inches='tight')
plt.show()
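Each box summarises a feature's quartiles for one class. To read the same kind of numbers without a plot, the sketch below computes the first quartile, median, and third quartile per class with the standard library. The function name and the sample values are made up for illustration; in the notebook you could pass a feature column and the status column as lists instead.

```python
import statistics
from collections import defaultdict

def quartiles_by_class(values, labels):
    """Return {label: (q1, median, q3)}, interpolating linearly between data points."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {
        label: tuple(statistics.quantiles(group, n=4, method="inclusive"))
        for label, group in groups.items()
    }

# Toy data: five values per class
values = [1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(quartiles_by_class(values, labels))
```

If the two quartile ranges barely overlap, as in this toy data, the feature separates the classes well.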

Data Visualisation

Convert to Pandas (for plotting)

pdf = df.toPandas()

8.1 Histogram Example

pdf.hist(figsize=(15,10))
plt.show()

Insight

  • Many features are skewed
  • Some have extreme values
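Skew can also be measured instead of judged by eye. Below is a minimal sketch of the uncorrected sample skewness (the third standardised moment) in plain Python. Values above zero indicate a long right tail. The function name and sample data are illustrative.

```python
import math

def skewness(values):
    """Mean cubed deviation divided by the cubed (population) standard deviation."""
    n = len(values)
    mean = math.fsum(values) / n
    m2 = math.fsum((v - mean) ** 2 for v in values) / n
    m3 = math.fsum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))   # symmetric data
print(skewness([1, 1, 1, 2, 10]))  # long right tail
```

pandas offers pdf.skew() for whole DataFrames. It applies a bias correction, so its numbers differ slightly from this formula on small samples.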

8.2 Correlation Heatmap

plt.figure(figsize=(12,10))
sns.heatmap(pdf.corr(), cmap='coolwarm')
plt.show()

Insight

  • Strong correlations exist between features
  • Features such as spread1 and PPE correlate strongly with status, which suggests they are useful predictors
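Each cell in the heatmap is a Pearson correlation coefficient, which is the default for pdf.corr(). As a sketch of what that number means, the function below computes it for two lists in plain Python. The name `pearson` and the toy inputs are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    cov = math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = math.fsum((x - mean_x) ** 2 for x in xs)
    var_y = math.fsum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly aligned
print(pearson([1, 2, 3], [6, 4, 2]))  # perfectly opposed
```

Values near +1 or -1 between two features mean they carry largely the same information.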

Data Preprocessing

9.1 Check Missing Values

from pyspark.sql.functions import col, sum as spark_sum
# Count nulls per column; the alias avoids shadowing Python's built-in sum
df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()

Result

  • No missing values found

9.2 Feature Preparation

feature_cols = [c for c in df.columns if c != "status"]
df = df.select(
    [col(c).cast(DoubleType()) for c in feature_cols] +
    [col("status").cast(DoubleType()).alias("label")]
)

9.3 Feature Vector + Scaling

assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
pipeline = Pipeline(stages=[assembler, scaler])
model = pipeline.fit(df)
df_prepared = model.transform(df)

Scaling matters because the features have very different ranges, which can bias scale-sensitive models such as logistic regression.
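To make the scaling step concrete: with its default settings (withStd=True, withMean=False), Spark's StandardScaler divides each feature by its sample standard deviation and does not subtract the mean. The plain-Python sketch below reproduces that arithmetic for a single column. It is an illustration with a made-up function name, not Spark's implementation.

```python
import statistics

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (n - 1 denominator)."""
    std = statistics.stdev(values)
    return [v / std for v in values]

print(scale_to_unit_std([2.0, 4.0, 6.0]))  # [1.0, 2.0, 3.0]
```

Passing withMean=True to StandardScaler would also centre each feature at zero.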

Train-Test Split

train, test = df_prepared.randomSplit([0.8, 0.2], seed=42)
print("Train:", train.count())
print("Test:", test.count())

  • Roughly 80% training
  • Roughly 20% testing (randomSplit samples rows at random, so the counts are approximate)
  • Reproducible using seed
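randomSplit assigns rows at random and does not cut at an exact 80% mark, so the printed counts will be close to, but not necessarily exactly, 156 and 39. The sketch below imitates the idea in plain Python with a seeded generator. It illustrates the behaviour only and is not Spark's actual sampling algorithm.

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Send each row to train with probability train_fraction, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(195))
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

Running the function again with the same seed returns the same split, which is what makes the experiment repeatable.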

Model Building

11.1 Logistic Regression

lr = LogisticRegression(featuresCol="features", labelCol="label")
model_lr = lr.fit(train)
pred_lr = model_lr.transform(test)

11.2 Random Forest

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
model_rf = rf.fit(train)
pred_rf = model_rf.transform(test)

11.3 Decision Tree

dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
model_dt = dt.fit(train)
pred_dt = model_dt.transform(test)

Model Evaluation

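# areaUnderROC is the evaluator's default metric; it is spelled out here for clarity
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")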
print("LR AUC:", evaluator.evaluate(pred_lr))
print("RF AUC:", evaluator.evaluate(pred_rf))
print("DT AUC:", evaluator.evaluate(pred_dt))
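AUC has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on a toy example. The labels and scores are made up, and the function name is illustrative.

```python
def auc_by_pairs(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

print(auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking, and 1.0 means every positive outranks every negative.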

How to Run the Project

At this stage, your notebook contains a complete machine learning pipeline. The next step is understanding how to execute it correctly, whether you want to run everything at once or work step by step.

Running the Entire Notebook

The simplest way to execute your project is to run all cells in sequence.

Steps:

  1. Go to the top menu
  2. Click Runtime → Run all
  3. If prompted, click Run anyway

What happens:

  • Each cell executes one after another
  • Data is downloaded automatically
  • Models are trained and evaluated
  • Outputs such as tables, charts, and metrics appear below each cell

Expected runtime:

  • Typically 5 to 15 minutes, depending on Colab resources

You will know it has completed successfully when:

  • All cells show a number like [1], [2], etc.
  • No cells are stuck with [*]
  • Outputs (graphs, metrics, predictions) are visible
  • No red error messages appear

Running Step by Step (Recommended for Learning)

If you want to understand how everything works, run the notebook one section at a time.

Suggested execution order:

  1. Install libraries
  2. Import libraries
  3. Create Spark session
  4. Load dataset
  5. Explore data
  6. Preprocess data
  7. Prepare features
  8. Train models
  9. Evaluate models

How to run a single cell:

  • Click the Play button (▶) next to the cell
  • Or press Shift + Enter

Re-running Specific Sections

Sometimes you may only want to re-run a part of the notebook.

Example scenarios:

  • Changing a model parameter
  • Updating visualisations
  • Fixing a preprocessing step

Important rule:

Always ensure dependencies are executed first.

For example:

  • If you restart runtime, you must re-run installation and imports
  • If you change preprocessing, re-run feature engineering and model training
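The rule can be pictured as a simple ordering: when a stage changes, that stage and everything after it must run again. The sketch below encodes the stages from this guide as a list. The stage names and the `stages_to_rerun` helper are illustrative, and the model is simplified because it treats the notebook as strictly linear.

```python
PIPELINE_STAGES = [
    "install libraries",
    "import libraries",
    "create Spark session",
    "load dataset",
    "explore data",
    "preprocess data",
    "prepare features",
    "train models",
    "evaluate models",
]

def stages_to_rerun(changed_stage, stages=PIPELINE_STAGES):
    """Return the changed stage plus every stage that comes after it."""
    start = stages.index(changed_stage)
    return stages[start:]

print(stages_to_rerun("preprocess data"))
```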

Restarting the Environment

Google Colab sessions are temporary. If your session disconnects, restart from a clean state.

Steps:

  1. Click Runtime → Restart runtime
  2. Then click Runtime → Run all

This ensures the entire pipeline runs from a clean state.

Running Only Model Training

If you want to focus only on models:

  • Ensure dataset is already loaded and processed
  • Run only:
    • Feature preparation
    • Train-test split
    • Model training cells
    • Evaluation cells

Key Results

  • Logistic Regression → baseline
  • Random Forest → best AUC
  • Decision Tree → best accuracy

Final Performance

  • Decision Tree Accuracy: ~90%
  • Random Forest AUC: ~0.94
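The evaluation cell in this guide prints only AUC, so the accuracy figure needs its own calculation. As a plain-Python sketch of what accuracy measures, the function below compares predicted labels with true labels. The toy labels are made up. In PySpark, MulticlassClassificationEvaluator with metricName="accuracy" computes the same quantity from a predictions DataFrame.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true label."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = [1 for truth, guess in pairs if truth == guess]
    return len(correct) / len(pairs)

print(accuracy([1, 1, 0, 1, 0], [1, 1, 0, 0, 0]))  # 0.8
```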

Understanding the Outcome

You created a complete ML pipeline:

  1. Data loading
  2. Exploration
  3. Preprocessing
  4. Feature engineering
  5. Model training
  6. Evaluation

Key takeaway

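  • Voice features carry enough signal to separate Parkinson’s cases from healthy subjects in this dataset
  • The tree-based (non-linear) models scored higher than logistic regression here
  • Tree-based models handle correlated features well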

Source Code and Full Notebook

The complete implementation of this project, including the fully structured Google Colab notebook with all steps, outputs, and model evaluations, is available in the following repository:

GitHub Repository:
https://github.com/vivekbhadra/Sample_GoogleColabProject_For_ML

This repository contains:

  • The full end-to-end notebook used in this guide
  • All machine learning steps from data loading to model evaluation
  • Visualisations, metrics, and final results
  • A ready-to-run version that you can directly open in Google Colab

If you are following this guide step by step, you can use the repository as:

  • A reference implementation to verify your progress
  • A starting point for your own experiments
  • A base project to extend with additional models or datasets

How to Upload and Run an Existing Notebook in Google Colab

If you already have a notebook file (.ipynb) from GitHub or your local machine, Google Colab allows you to upload and run it easily without any setup.

This is useful when you want to:

  • Run a pre-built machine learning project
  • Reuse an existing notebook
  • Test or modify someone else’s implementation

Upload the Notebook

  • In Colab, click File → Upload notebook
  • Select the .ipynb file from your machine, or use the GitHub tab to open a notebook from a repository

Run the Notebook

  • Click Runtime → Run all
  • Click Run anyway if prompted

All cells will execute sequentially.
