After completing the Exploratory Data Analysis (EDA) stage, we move to the next logical step in our pipeline: Feature Engineering, the critical bridge between data exploration and model training.
In this stage, we convert the cleaned and analysed dataset into a machine-learning-ready form by creating new, meaningful features that capture seasonality, store-item interactions, and temporal patterns.

Objective

The objective of this phase is threefold:

  • Generate additional temporal and statistical features that help the model recognise trends, periodic patterns, and seasonality.
  • Encode categorical identifiers such as store and item while maintaining numeric consistency for machine learning algorithms.
  • Ensure the enriched dataset remains complete, schema-consistent, and structured in S3 for direct ingestion into the model training workflow.

Establishing the Feature Engineering Environment

To maintain continuity and reproducibility across all pipeline stages, this phase is executed using AWS Glue in script mode, the same configuration used for preprocessing and exploratory data analysis.
Running all jobs within the same Glue environment ensures identical cluster configurations, consistent IAM roles, and uniform S3 access permissions throughout the workflow.

A new Glue job, scdf-feature-engineering-job, is created under the same IAM role used in the earlier stages. To enable this job to write outputs to a new S3 directory, the IAM policy is extended with the following resource path:

"arn:aws:s3:::scdf-project-data/features/*"

This addition allows the Glue job to persist the transformed feature dataset under the features/ directory, completing the data pipeline’s analytical foundation and preparing the data for model training and evaluation.
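
For reference, the permission could be attached to the role as an inline policy. The sketch below uses boto3 with placeholder role and policy names; the project’s actual names may differ, and the same change can equally be made through the IAM console.

# Illustrative sketch: granting the Glue role write access to the features/ prefix.
# The role and policy names below are placeholders, not the project's actual values.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::scdf-project-data/features/*",
        }
    ],
}

iam.put_role_policy(
    RoleName="scdf-glue-role",                # placeholder role name
    PolicyName="scdf-features-write-access",  # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)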

Code Walkthrough and Output Verification

The feature engineering stage builds upon the earlier preprocessing and EDA pipeline, enriching the dataset with temporal and lag-based predictors that are essential for time-series forecasting.
The implementation, contained in the script glue_script_feature_engineering.py, follows a structured five-step flow consistent with the previous Glue jobs.
The full script is available via the public GitHub repository created for this project.

AWS Glue Job Setup

The script begins by establishing the Glue environment:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

This setup initialises a managed Spark session and integrates it with Glue’s job lifecycle, ensuring a consistent runtime configuration across the pipeline.

Reading the Processed Dataset

# Load the Parquet output of the preprocessing stage.
input_path = "s3://scdf-project-data/processed/"
df = spark.read.parquet(input_path)
print("Processed dataset loaded successfully.")

The CloudWatch logs confirm a successful data load:

2025-10-20T06:26:09.718Z Processed dataset loaded successfully.

This validates the seamless transition from the preprocessing phase to feature engineering and confirms that the pipeline’s intermediate outputs are accessible and schema-consistent.
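
As an optional sanity check beyond the log message, the schema and row count could be inspected immediately after loading. These lines are not part of the original script:

# Optional sanity check (not part of the original script):
# confirm the expected columns and a non-zero row count after loading.
df.printSchema()
print("Columns:", df.columns)
print("Row count:", df.count())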

Creating Temporal Features

Using PySpark’s built-in date functions, the script derives three time-based columns:

df = df.withColumn("year", year(col("date"))) \
       .withColumn("month", month(col("date"))) \
       .withColumn("day_of_week", dayofweek(col("date")))
print("Temporal features (year, month, day_of_week) created.")

CloudWatch confirmation:

2025-10-20T06:26:09.853Z Temporal features (year, month, day_of_week) created.

These derived columns introduce seasonality awareness, enabling future models to capture monthly and weekday variations in sales behaviour.
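
Note that Spark’s dayofweek() numbers days from 1 (Sunday) to 7 (Saturday), which matters when interpreting weekday effects downstream. A quick spot-check of the derived columns (not part of the original script) might look like this:

# Optional spot-check (not in the original script): Spark's dayofweek()
# returns 1 for Sunday through 7 for Saturday.
df.select("date", "year", "month", "day_of_week").show(5)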

Lag and Rolling Average Features

To represent temporal dependencies, the script constructs lag and rolling average features using Spark’s window functions:

from pyspark.sql.functions import avg, lag
from pyspark.sql.window import Window

# Order each store-item series by date so lag and rolling windows stay within one series.
window_spec = Window.partitionBy("store", "item").orderBy("date")
df = df.withColumn("lag_1", lag("sales", 1).over(window_spec))
df = df.withColumn("lag_7", lag("sales", 7).over(window_spec))
df = df.withColumn("rolling_avg_7", avg("sales").over(window_spec.rowsBetween(-6, 0)))
print("Lag and rolling average features created.")

CloudWatch log confirmation:

2025-10-20T06:26:10.077Z Lag and rolling average features created.

  • lag_1: captures previous day’s sales
  • lag_7: captures sales from the previous week
  • rolling_avg_7: computes a seven-day moving average

Together, these features provide temporal context and momentum indicators that enhance model interpretability.
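
To make the window mechanics concrete, the standalone sketch below applies the same lag and rolling-average logic to a toy eight-day series for a single store-item pair. It runs on a local Spark session and is purely illustrative, not part of the Glue job:

# Standalone illustration (not part of the Glue job): how the lag and
# rolling-average features behave on a toy eight-day series for one
# store-item pair. Runs on a local Spark session.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, lag
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[1]").getOrCreate()

toy = spark.createDataFrame(
    [(1, 1, f"2024-01-{d:02d}", float(s))
     for d, s in enumerate([10, 12, 9, 15, 11, 14, 13, 16], start=1)],
    ["store", "item", "date", "sales"],
)

w = Window.partitionBy("store", "item").orderBy("date")
toy = (
    toy.withColumn("lag_1", lag("sales", 1).over(w))
       .withColumn("lag_7", lag("sales", 7).over(w))
       .withColumn("rolling_avg_7", avg("sales").over(w.rowsBetween(-6, 0)))
)
# lag_1 is null on day 1, lag_7 is null until day 8, and rolling_avg_7
# averages the current day together with up to six preceding days.
toy.show()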

Handling Missing Values and Persisting Outputs

Lag-based computations generate missing values for the earliest records in each window. These are imputed with zeros before saving the output:

# Lags are undefined for the earliest rows of each store-item series; fill them with 0.
df = df.na.fill(0, subset=["lag_1", "lag_7", "rolling_avg_7"])
print("Missing values in lag features imputed with zeros.")

output_path = "s3://scdf-project-data/features/"
df.write.mode("overwrite").parquet(output_path)
print("Feature-engineered dataset written to:", output_path)

CloudWatch logs confirm the completion sequence:

2025-10-20T06:26:10.151Z  Missing values in lag features imputed with zeros.
2025-10-20T06:26:25.522Z  Feature-engineered dataset written to: s3://scdf-project-data/features/
2025-10-20T06:26:25.528Z  Feature engineering job completed successfully at: 2025-10-20 06:26:25.527828
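
The final log line indicates that the script ends by printing a completion timestamp and closing the Glue job lifecycle. The exact code lives in the repository script, but the closing step presumably looks something like this:

# Presumed closing step (see the repository script for the exact code):
# log the completion time and commit the Glue job.
from datetime import datetime

print("Feature engineering job completed successfully at:", datetime.now())
job.commit()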

Each module feeds seamlessly into the next, ensuring a consistent and reproducible data flow from ingestion to model-ready outputs.

Source Code

The complete implementation of the feature engineering stage, including configuration and output persistence, is available in the project’s public GitHub repository:

🔗 Glue Script Feature Engineering

This script is designed for execution within AWS Glue in script mode and directly continues from the earlier preprocessing and EDA stages.
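
Once the script is uploaded and the Glue job is configured, a run can be started from the Glue console or programmatically. The snippet below is an illustrative boto3 call, not something taken from the project:

# Illustrative only: trigger the Glue job programmatically with boto3.
import boto3

glue = boto3.client("glue")
run = glue.start_job_run(JobName="scdf-feature-engineering-job")
print("Started run:", run["JobRunId"])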

Conclusion

In conclusion, the Feature Engineering stage is a crucial part of the machine learning pipeline, transforming our dataset into a format conducive to effective model training. By generating temporal features, encoding categorical identifiers, and ensuring a schema-consistent dataset, we lay a robust foundation for predictive analytics. The implementation through AWS Glue not only guarantees seamless integration across the pipeline stages but also promotes reproducibility and scalability.

The addition of lag, rolling-average, and seasonal predictors is crucial: they capture the dynamics of sales behaviour over time, which leads to more accurate forecasting models. As we progress through the pipeline, this enriched dataset marks a significant step toward extracting meaningful insights from the data.

With the feature-engineered dataset now in place, the next milestone in this series is Model Training and Evaluation. In that phase, the enriched features will be used to train demand forecasting models, assess their performance, and visualise prediction accuracy.
That final phase will complete the end-to-end pipeline, from raw data ingestion to predictive insight generation, entirely within AWS Glue and S3.
