In our previous post, we built the foundation of a serverless ETL pipeline, using AWS Glue and PySpark to ingest, clean, and split retail sales data from Kaggle’s Store Item Demand Forecasting Challenge. That version was intentionally minimal: it dropped null rows and performed basic data type conversions to get the pipeline running end-to-end.
Now that the initial infrastructure is operational, it’s time to make the preprocessing step more robust, transparent, and machine-learning-ready.
Strengthening the Foundation of Data Preprocessing
The first implementation was designed for simplicity. It read raw data, dropped missing rows, and wrote clean outputs. While this helped validate the pipeline, the approach also introduced several shortcomings:
- Columns remained as strings without explicit type casting.
- Missing values were ignored instead of handled systematically.
- The pipeline lacked visibility into schema, null counts, and data distribution.
These issues can silently propagate downstream, leading to type mismatches, degraded model accuracy, and limited traceability. The next iteration of the pipeline addresses these concerns comprehensively.
Implementing Explicit Column Type Conversion
The dataset originally stored all fields as strings. To enforce proper typing, each column is now explicitly cast to its correct data type before further processing:
from pyspark.sql.functions import col
df = (
    df.withColumn("store", col("store").cast("int"))
      .withColumn("item", col("item").cast("int"))
      .withColumn("sales", col("sales").cast("float"))
      .withColumn("date", col("date").cast("date"))
)
This change ensures that numerical columns behave as numbers, and date fields can be parsed correctly for time-based analysis. To validate the schema, the following diagnostic loop prints each column name and its inferred data type:
print("---- Column Data Types ----")
for name, dtype in df.dtypes:
print(f"{name}: {dtype}")
This provides immediate confirmation of data structure before transformation and helps detect mismatches early in the workflow.
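To fail fast when a cast silently produces the wrong type, the same check can be hardened into an assertion. A minimal sketch, assuming the four columns of this dataset and the types cast above:
expected_types = {"date": "date", "store": "int", "item": "int", "sales": "float"}
actual_types = dict(df.dtypes)
for name, expected in expected_types.items():
    # Stop the job immediately if any column did not cast as intended
    assert actual_types.get(name) == expected, f"Unexpected type for {name}: {actual_types.get(name)}"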
Enabling Missing Value Imputation
Instead of removing incomplete rows, the pipeline now fills missing values in the sales column using mean imputation. This preserves valuable records and ensures data completeness, which is critical for real-world retail datasets:
from pyspark.sql.functions import mean
mean_sales = df.select(mean("sales").alias("mean_sales")).collect()[0]["mean_sales"]
df_cleaned = df.fillna({"sales": mean_sales})
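The same result can also be obtained with PySpark’s built-in Imputer transformer, which estimates and applies the mean in one step. A minimal sketch for comparison (the sales_imputed output column name is purely illustrative; the pipeline itself keeps the fillna approach above):
from pyspark.ml.feature import Imputer
# Mean-impute sales into a new column, leaving the original column untouched
imputer = Imputer(inputCols=["sales"], outputCols=["sales_imputed"], strategy="mean")
df_imputed = imputer.fit(df).transform(df)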
Before imputation, a log captures the number of missing values per column:
missing_info = {colname: df.filter(col(colname).isNull()).count() for colname in df.columns}
print("---- Missing Value Check ----")
print(missing_info)
These metrics are sent to CloudWatch Logs, allowing engineers to monitor data health with each pipeline run.
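Note that the dictionary comprehension above launches a separate Spark job for every column; on wider or larger datasets, the same counts can be gathered in a single aggregation pass. A possible sketch:
from pyspark.sql.functions import col, count, when
# Count nulls for every column in one pass over the data
null_counts = df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
).collect()[0].asDict()
print("---- Missing Value Check (single pass) ----")
print(null_counts)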
Adding Statistical Summaries for Observability
Descriptive statistics are an essential part of understanding the dataset before transformation or modelling. Using PySpark’s built-in .describe() method, the pipeline now generates and logs summary statistics for key columns:
df.describe(["sales", "store", "item"]).show()
This simple enhancement provides visibility into mean, standard deviation, min, max, and count. It helps detect outliers, unexpected scales, or data imbalances early.
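Because .show() renders an ASCII table that is awkward to search in CloudWatch, one option is to collect the summary rows and log each statistic as a dictionary instead. A small sketch:
# Emit each summary row (count, mean, stddev, min, max) as its own log line
for row in df.describe(["sales", "store", "item"]).collect():
    print(row.asDict())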
Introducing Sales Value Normalization
To ensure consistent feature scaling for downstream machine learning, the sales column is now normalized using the Min–Max scaling technique. This ensures that values fall within the range [0, 1], preventing any single variable from dominating the model:
from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType
assembler = VectorAssembler(inputCols=["sales"], outputCol="sales_vector")
scaler = MinMaxScaler(inputCol="sales_vector", outputCol="sales_scaled")
pipeline = Pipeline(stages=[assembler, scaler])
scaler_model = pipeline.fit(df_cleaned)
df_scaled = scaler_model.transform(df_cleaned)
# Convert the one-element scaled vector back to a scalar float column
vector_to_float = udf(lambda vec: float(vec[0]), FloatType())
df_final = (
    df_scaled.withColumn("sales_scaled_value", vector_to_float(col("sales_scaled")))
             .drop("sales_vector", "sales_scaled")
)
The resulting sales_scaled_value column becomes a machine-learning-friendly feature, consistently scaled and ready for model input.
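As a sanity check, the scaler’s output can be reproduced with the plain Min–Max formula, (x - min) / (max - min). A rough sketch, using a hypothetical sales_scaled_manual column for comparison:
from pyspark.sql.functions import col, min as spark_min, max as spark_max
# Compute the global min and max of sales, then apply the formula directly
# (assumes sales is not constant, so the denominator is non-zero)
bounds = df_cleaned.agg(spark_min("sales").alias("lo"), spark_max("sales").alias("hi")).collect()[0]
df_check = df_cleaned.withColumn(
    "sales_scaled_manual",
    (col("sales") - bounds["lo"]) / (bounds["hi"] - bounds["lo"]),
)
If sales_scaled_manual and sales_scaled_value diverge, something upstream, typically a lingering null or an unexpected minimum or maximum, deserves a closer look.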
Establishing Cloud-Native Traceability
Observability is a core part of production-grade data engineering. To make transformations fully traceable, print statements were added throughout the script. These emit logs to AWS CloudWatch, allowing each step — from ingestion to scaling — to be verified after execution.
These logs create a detailed trail of transformations, supporting debugging, auditing, and performance tuning across pipeline runs. They also align the ETL process with modern DevOps and MLOps best practices.
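If you prefer structured log levels over bare print statements, AWS Glue also exposes a logger through the GlueContext. A minimal sketch, assuming the standard Glue job setup; the messages shown are examples only:
from awsglue.context import GlueContext
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
logger = glue_context.get_logger()
# These messages land in the job's CloudWatch log streams, tagged by level
logger.info("Preprocessing started")
logger.error("Schema validation failed")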
Providing Access to the Full Enhanced Glue Script
The complete version of the enhanced AWS Glue script, containing all preprocessing improvements — including type casting, imputation, normalization, and logging — is available for review and download from GitHub:
📎 Download Script: Glue Job Script (Preprocessing)
You can copy this script directly into the AWS Glue Script Editor, or upload it to an S3 location and reference it in your Glue job configuration. It’s fully production-ready and aligns with best practices for serverless data pipelines.
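For instance, once the script has been uploaded to S3, the job can be registered programmatically with boto3. A rough sketch, with placeholder bucket, role, and job names:
import boto3
glue = boto3.client("glue")
# Placeholder identifiers: substitute your own job name, IAM role, and script path
glue.create_job(
    Name="retail-preprocessing-job",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-etl-bucket/scripts/preprocess.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
)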
Achieving a Production-Ready Preprocessing Layer
With these enhancements, the ETL preprocessing stage transitions from a minimal prototype to a cloud-native, machine-learning-ready data foundation. It now offers:
- Enforced schema consistency through type casting
- Complete visibility of missing data and descriptive statistics
- Intelligent imputation to handle real-world data gaps
- Feature scaling for stable model training
- Integrated observability via CloudWatch Logs
Each of these improvements adds reliability, transparency, and scalability — qualities essential for any production-grade data pipeline.
Advancing Towards Exploratory Data Analysis
With a robust preprocessing step in place, the dataset is now clean, structured, and consistent, making it ideal for deeper exploration. In the next blog, we will focus on Exploratory Data Analysis (EDA) to uncover sales patterns, reveal store-level behaviour, and highlight seasonal trends.
