Machine learning can feel overwhelming when you are starting out. The tools, setup, and environment configuration often become the biggest barrier before you even write your first line of code.
The easiest way to begin is Google Colab. It removes all setup complexity and allows you to run Python, PySpark, and machine learning workflows directly in your browser.
This guide walks you through creating your first machine learning notebook from scratch, building it step by step, and implementing a complete end-to-end ML pipeline.
Why Google Colab?
Google Colab is ideal for beginners because:
- No installation required
- Runs entirely in your browser
- Free access to compute resources
- Supports Python, PySpark, and ML libraries
- Easy sharing and collaboration
Your entire machine learning project will live inside a notebook, where each step is executed in sequence.
Open Google Colab
Go to:
https://colab.research.google.com
- Sign in with your Google account
- Click File → New notebook


You now have a blank notebook.
Understand Notebook Structure
A notebook consists of cells:
- Code cells → run Python code
- Text cells → add explanations
You will build your ML project by adding cells one by one.
Install Required Libraries
Create your first code cell
Paste this:
!pip install pyspark pandas matplotlib seaborn scikit-learn
Run the cell
- Click the Play button (▶) on the left
- Wait for installation to complete
This installs all required tools:
- PySpark → distributed processing
- Pandas → data handling
- Matplotlib/Seaborn → visualisation
- Scikit-learn → ML utilities
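Once the install cell has finished, a quick version check can confirm that each package is actually available. The sketch below uses only the Python standard library; the helper name `installed_versions` is an illustrative choice, not something provided by Colab or pip.

```python
from importlib import metadata

# Distribution names exactly as passed to pip in the install cell
PACKAGES = ["pyspark", "pandas", "matplotlib", "seaborn", "scikit-learn"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

If any line reports "not installed", re-run the install cell before moving on.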

Import Libraries
Add a new code cell

Hover your mouse over the bottom edge of an existing cell and a pop-up will appear with +Code and +Text. Choose +Code to paste code or +Text for regular text.
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# PySpark (explicit names: a wildcard import would shadow built-ins such as sum, min and max)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# MLlib
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

Create Spark Session
Add a new cell
spark = SparkSession.builder \
    .appName("ParkinsonsClassification") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
print("Spark Version:", spark.version)
- Creates the Spark engine
- Lowers the shuffle partition count from Spark's default of 200 to 8, which avoids needless overhead on a small dataset
- Reduces log noise
This is the entry point for all PySpark operations.

Download and Load Dataset
Add a new cell
import os
import urllib.request

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data"

# Download only if the file is not already present
if not os.path.exists("parkinsons.data"):
    urllib.request.urlretrieve(url, "parkinsons.data")

df = spark.read.csv("parkinsons.data", header=True, inferSchema=True)

# Drop non-useful column
df = df.drop("name")

df.printSchema()
print("Shape:", df.count(), len(df.columns))

- Dataset downloaded automatically
- Loaded into Spark DataFrame
- name column removed (not useful for ML)
- Dataset has 195 rows and 23 columns
Data Exploration
7.1 Preview Data
df.show(5)
This confirms data loaded correctly.
7.2 Descriptive Statistics
df.describe().show()
This gives:
- Mean
- Standard deviation
- Min / Max
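To see what those summary rows represent, the same statistics can be computed by hand on a small list. This is a minimal sketch with made-up values standing in for one numeric column; Spark's `describe()` reports the sample standard deviation (the n - 1 form), which `statistics.stdev` mirrors.

```python
import statistics

# Made-up values standing in for one numeric column
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
    "min": min(values),
    "max": max(values),
}

for key, value in summary.items():
    print(f"{key}: {value}")
```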

7.3 Class Distribution
df.groupBy("status").count().show()
Insight
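- Healthy: ~25%
- Parkinson’s: ~75%
This shows class imbalance: a model that always predicts Parkinson’s would already be right about 75% of the time, so accuracy alone can be misleading when you evaluate models.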
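To make the effect of the imbalance concrete, the sketch below computes the accuracy of a trivial classifier that always predicts the most frequent class. The counts used (48 healthy, 147 Parkinson's) are the class totals for this dataset and should match the groupBy output above; the function name is illustrative.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# status -> number of rows (0 = healthy, 1 = Parkinson's)
counts = {0: 48, 1: 147}

baseline = majority_baseline_accuracy(counts)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Any trained model should be judged against this baseline rather than against 50%.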


7.4 Feature Distribution Analysis
Understanding how each feature behaves is an essential step before building machine learning models. Instead of plotting all features in a single combined chart, we create a structured grid of histograms so that each feature can be analysed individually.
Add a new code cell
# Convert to pandas for plotting (the dataset is small enough to fit in memory)
pdf = df.toPandas()

# Feature distributions
feature_cols = [c for c in pdf.columns if c != 'status']
fig, axes = plt.subplots(5, 5, figsize=(22, 18))
axes = axes.flatten()
for i, name in enumerate(feature_cols):
    axes[i].hist(pdf[name], bins=25, color='#42A5F5', edgecolor='white', alpha=0.85)
    axes[i].set_title(name, fontsize=8, fontweight='bold')
    axes[i].tick_params(labelsize=7)

# Hide unused subplots
for j in range(len(feature_cols), len(axes)):
    axes[j].set_visible(False)

plt.suptitle("Feature Distributions", fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.savefig("feature_distributions.png", dpi=120, bbox_inches='tight')
plt.show()
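The 5 by 5 grid in the cell above is sized by hand for this dataset's 22 feature columns. If you reuse the code with a different number of features, the grid can be derived instead. A minimal sketch, where `grid_shape` is a hypothetical helper and not part of matplotlib:

```python
import math


def grid_shape(n_plots, n_cols=5):
    """Return (rows, cols) for a subplot grid that fits n_plots panels."""
    if n_plots < 1 or n_cols < 1:
        raise ValueError("n_plots and n_cols must be positive")
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols


rows, cols = grid_shape(22)   # 22 feature columns in this dataset
unused = rows * cols - 22     # panels that the plotting loop hides
print(rows, cols, unused)
```

The resulting pair can be passed straight to `plt.subplots(rows, cols, ...)`.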

7.5 Box Plot Analysis by Class
Box plots help compare selected biomedical voice features between the two target classes: healthy subjects and Parkinson’s disease cases. This makes it easier to see whether a feature shows visible separation between the two groups.
Add a new code cell
# Box plots: feature vs status
selected = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)',
            'HNR', 'RPDE', 'DFA', 'spread1', 'spread2', 'PPE']

fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.flatten()
for i, name in enumerate(selected):
    pdf.boxplot(column=name, by='status', ax=axes[i],
                boxprops=dict(color='#1565C0'),
                medianprops=dict(color='red', linewidth=2),
                whiskerprops=dict(color='gray'))
    axes[i].set_title(name, fontsize=9, fontweight='bold')
    axes[i].set_xlabel("Status (0=Healthy, 1=PD)")

plt.suptitle("Feature Box Plots by Class", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig("boxplots.png", dpi=120, bbox_inches='tight')
plt.show()

Data Visualisation
Convert to Pandas (for plotting)
pdf = df.toPandas()
8.1 Histogram Example
pdf.hist(figsize=(15,10))
plt.show()
Insight
- Many features are skewed
- Some have extreme values
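Skew can be quantified rather than judged by eye. The sketch below computes the moment-based (Fisher-Pearson) skewness coefficient in plain Python on invented values: a result near 0 means a symmetric distribution, and a positive result means a long right tail. On the notebook's pandas frame, `pdf.skew()` gives a comparable, bias-adjusted figure per column.

```python
import math


def skewness(values):
    """Fisher-Pearson coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = math.fsum(values) / n
    m2 = math.fsum((v - mean) ** 2 for v in values) / n
    m3 = math.fsum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]   # long tail to the right

print(skewness(symmetric))
print(skewness(right_skewed))
```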
8.2 Correlation Heatmap
plt.figure(figsize=(12,10))
sns.heatmap(pdf.corr(), cmap='coolwarm')
plt.show()
Insight
- Strong correlations exist between features
- Features such as spread1 and PPE appear to be strong predictors
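Each cell of the heatmap is a Pearson correlation coefficient. As a minimal sketch of what that number measures, the code below computes it by hand between one invented feature and a binary status label. In the notebook, `pdf.corr()["status"]` would list the same quantity for every real feature.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = math.fsum(xs) / len(xs)
    mean_y = math.fsum(ys) / len(ys)
    cov = math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = math.fsum((x - mean_x) ** 2 for x in xs)
    var_y = math.fsum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Invented values: a feature that tends to be higher when status is 1
feature = [0.10, 0.15, 0.12, 0.30, 0.35, 0.28]
status = [0, 0, 0, 1, 1, 1]

r = pearson(feature, status)
print(f"correlation with status: {r:.3f}")
```

A high correlation with the target suggests a useful feature, but it does not by itself establish how much a given model relies on it.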
Data Preprocessing
9.1 Check Missing Values
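from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Count nulls per column (F.sum avoids shadowing Python's built-in sum)
df.select([F.sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()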
Result
- No missing values found
9.2 Feature Preparation
feature_cols = [c for c in df.columns if c != "status"]

# Cast every feature to double and rename the target column to "label"
df = df.select(
    [col(c).cast(DoubleType()) for c in feature_cols] +
    [col("status").cast(DoubleType()).alias("label")]
)
9.3 Feature Vector + Scaling
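from pyspark.ml import Pipeline  # not included in the earlier import cell

# Combine the feature columns into one vector, then scale it
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

pipeline = Pipeline(stages=[assembler, scaler])
model = pipeline.fit(df)
df_prepared = model.transform(df)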
Scaling is needed because the features have very different ranges; without it, large-valued features can dominate scale-sensitive models such as logistic regression.
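A small sketch of what the scaling step does to a single column may help here. With its default settings (`withStd=True`, `withMean=False`), Spark's `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. The code below reproduces that arithmetic in plain Python on two invented columns; the function and column names are illustrative.

```python
import statistics


def scale_to_unit_std(values):
    """Divide by the sample standard deviation, leaving the mean uncentred."""
    std = statistics.stdev(values)
    if std == 0:
        raise ValueError("cannot scale a constant column")
    return [v / std for v in values]


# Invented columns on very different scales
frequency_hz = [120.0, 150.0, 180.0, 210.0]
jitter = [0.002, 0.004, 0.006, 0.008]

for name, column in [("frequency_hz", frequency_hz), ("jitter", jitter)]:
    scaled = scale_to_unit_std(column)
    print(name, [f"{v:.3f}" for v in scaled], f"std={statistics.stdev(scaled):.3f}")
```

After scaling, both columns have a standard deviation of 1, so neither dominates purely because of its units.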
Step 10: Train-Test Split
train, test = df_prepared.randomSplit([0.8, 0.2], seed=42)

print("Train:", train.count())
print("Test:", test.count())
- 80% training
- 20% testing
- Reproducible using seed
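Note that `randomSplit` assigns rows at random according to the weights, so the printed counts will be close to 80/20 but not necessarily exact. The sketch below imitates the idea with Python's `random` module. It is not Spark's sampler and will not reproduce Spark's counts; it only shows why the sizes are approximate and why a fixed seed makes the split repeatable.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(195))   # stand-in for the 195 dataset rows
train, test = random_split(rows)
print("Train:", len(train), "Test:", len(test))

# Same seed, same split: this is what makes the experiment reproducible
print("Repeatable:", random_split(rows) == (train, test))
```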
Model Building
11.1 Logistic Regression
lr = LogisticRegression(featuresCol="features", labelCol="label")
model_lr = lr.fit(train)
pred_lr = model_lr.transform(test)
11.2 Random Forest
rf = RandomForestClassifier(featuresCol="features", labelCol="label")
model_rf = rf.fit(train)
pred_rf = model_rf.transform(test)
11.3 Decision Tree
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
model_dt = dt.fit(train)
pred_dt = model_dt.transform(test)
Model Evaluation
# The default metric of BinaryClassificationEvaluator is area under the ROC curve
evaluator = BinaryClassificationEvaluator(labelCol="label")

print("LR AUC:", evaluator.evaluate(pred_lr))
print("RF AUC:", evaluator.evaluate(pred_rf))
print("DT AUC:", evaluator.evaluate(pred_dt))
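AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition, using invented labels and scores rather than output from the notebook. It is meant to illustrate the metric, not to replace the evaluator.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Invented labels and predicted probabilities of class 1
labels = [0, 0, 1, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20]

auc = auc_from_scores(labels, scores)
print(f"AUC: {auc:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.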
How to Run the Project
At this stage, your notebook contains a complete machine learning pipeline. The next step is understanding how to execute it correctly, whether you want to run everything at once or work step by step.
Running the Entire Notebook
The simplest way to execute your project is to run all cells in sequence.
Steps:
- Go to the top menu
- Click Runtime → Run all
- If prompted, click Run anyway
What happens:
- Each cell executes one after another
- Data is downloaded automatically
- Models are trained and evaluated
- Outputs such as tables, charts, and metrics appear below each cell
Expected runtime:
- Typically 5 to 15 minutes, depending on Colab resources
You will know it has completed successfully when:
- All cells show a number like [1], [2], etc.
- No cells are stuck with [*]
- Outputs (graphs, metrics, predictions) are visible
- No red error messages appear
Running Step by Step (Recommended for Learning)
If you want to understand how everything works, run the notebook one section at a time.
Suggested execution order:
- Install libraries
- Import libraries
- Create Spark session
- Load dataset
- Explore data
- Preprocess data
- Prepare features
- Train models
- Evaluate models
How to run a single cell:
- Click the Play button (▶) next to the cell
- Or press Shift + Enter
Re-running Specific Sections
Sometimes you may only want to re-run a part of the notebook.
Example scenarios:
- Changing a model parameter
- Updating visualisations
- Fixing a preprocessing step
Important rule:
Always ensure dependencies are executed first.
For example:
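- If you restart the runtime, all variables are cleared, so you must re-run the imports and every cell after them; re-run the installation cell as well if Colab has assigned you a fresh machine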
- If you change preprocessing, re-run feature engineering and model training
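One simple way to apply this rule is to re-run the step you changed and everything that comes after it. The sketch below models that with the step names from the suggested execution order; `steps_to_rerun` is a hypothetical helper for illustration, not a Colab feature.

```python
# Step names follow the suggested execution order in this guide
PIPELINE_STEPS = [
    "Install libraries",
    "Import libraries",
    "Create Spark session",
    "Load dataset",
    "Explore data",
    "Preprocess data",
    "Prepare features",
    "Train models",
    "Evaluate models",
]


def steps_to_rerun(changed_step, steps=PIPELINE_STEPS):
    """Return the changed step plus every step that runs after it."""
    start = steps.index(changed_step)   # raises ValueError for unknown steps
    return steps[start:]


print(steps_to_rerun("Preprocess data"))
```

This is deliberately conservative: it may re-run a step that did not strictly depend on the change, but it never leaves stale results behind.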
Restarting the Environment
Google Colab sessions are temporary. If disconnected:
Steps:
- Click Runtime → Restart runtime
- Then click Runtime → Run all
This ensures the entire pipeline runs from a clean state.
Running Only Model Training
If you want to focus only on models:
- Ensure dataset is already loaded and processed
- Run only:
- Feature preparation
- Train-test split
- Model training cells
- Evaluation cells
Key Results
- Logistic Regression → baseline
- Random Forest → best AUC
- Decision Tree → best accuracy
Final Performance
- Decision Tree Accuracy: ~90%
- Random Forest AUC: ~0.94
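The evaluation cell in this guide prints AUC only, so an accuracy figure has to be computed separately. The sketch below shows the calculation on invented (label, prediction) pairs; in the notebook these would come from the `label` and `prediction` columns of a predictions DataFrame such as `pred_dt`. Spark's `MulticlassClassificationEvaluator` with `metricName="accuracy"` is the built-in route to the same number.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (label, prediction) combinations for a binary classifier."""
    counts = Counter((int(label), int(pred)) for label, pred in pairs)
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tp": counts[(1, 1)],
    }


def accuracy(pairs):
    """Share of rows where the prediction equals the label."""
    c = confusion_counts(pairs)
    return (c["tp"] + c["tn"]) / len(pairs)


# Invented (label, prediction) pairs, not results from the notebook
pairs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0), (1, 1)]

print(confusion_counts(pairs))
print(f"accuracy: {accuracy(pairs):.3f}")
```

Reading the confusion counts alongside accuracy shows whether errors fall mostly on the healthy or the Parkinson's class, which matters for an imbalanced dataset.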
Understanding the Outcome
You created a complete ML pipeline:
- Data loading
- Exploration
- Preprocessing
- Feature engineering
- Model training
- Evaluation
Key takeaway
- Voice features can predict Parkinson’s disease
- Non-linear models perform better
- Tree-based models handle correlated features well
Source Code and Full Notebook
The complete implementation of this project, including the fully structured Google Colab notebook with all steps, outputs, and model evaluations, is available in the following repository:
GitHub Repository:
https://github.com/vivekbhadra/Sample_GoogleColabProject_For_ML
This repository contains:
- The full end-to-end notebook used in this guide
- All machine learning steps from data loading to model evaluation
- Visualisations, metrics, and final results
- A ready-to-run version that you can directly open in Google Colab
If you are following this guide step by step, you can use the repository as:
- A reference implementation to verify your progress
- A starting point for your own experiments
- A base project to extend with additional models or datasets
How to Upload and Run an Existing Notebook in Google Colab
If you already have a notebook file (.ipynb) from GitHub or your local machine, Google Colab allows you to upload and run it easily without any setup.
This is useful when you want to:
- Run a pre-built machine learning project
- Reuse an existing notebook
- Test or modify someone else’s implementation
Run the Notebook
- Click Runtime → Run all
- Click Run anyway if prompted
All cells will execute sequentially.
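For a notebook hosted in a public GitHub repository, Colab can also open it directly from a URL of the form `https://colab.research.google.com/github/<user>/<repo>/blob/<branch>/<path>`. The sketch below builds such a link. The branch name `main` and the file name `notebook.ipynb` are placeholders, since this guide does not state them; substitute the real values from the repository.

```python
from urllib.parse import quote

COLAB_GITHUB_PREFIX = "https://colab.research.google.com/github"


def colab_url(user, repo, branch, notebook_path):
    """Build a Colab link for a notebook in a public GitHub repository."""
    path = quote(notebook_path.lstrip("/"))   # escape spaces etc., keep slashes
    return f"{COLAB_GITHUB_PREFIX}/{user}/{repo}/blob/{branch}/{path}"


# "main" and "notebook.ipynb" are placeholders, not names taken from the repository
url = colab_url(
    "vivekbhadra",
    "Sample_GoogleColabProject_For_ML",
    "main",
    "notebook.ipynb",
)
print(url)
```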