In the previous posts in this series, we designed the foundation of our data pipeline. In Building an ETL Pipeline for Retail Demand Data, we established how raw retail data is ingested and organised for downstream processing. We then scaled and operationalised that flow in Enhancing Your ETL Pipeline with AWS Glue and PySpark, creating a fully managed ETL layer that automates transformations across datasets.
Next, we turned our attention to the data itself in Mastering EDA for Demand Forecasting on AWS, exploring its structure, identifying trends, and resolving quality issues. Building on those insights, we engineered predictive features in Mastering Feature Engineering for Machine Learning, preparing a rich dataset optimised for model training. Finally, in Training and Evaluating ML Models with AWS Glue, we trained and validated forecasting models, completing the machine learning stage of the pipeline.
In this post, we move into observability, giving our pipeline a voice. We build an API-driven layer that enables authorised users to retrieve real-time status updates directly from AWS services such as Glue, S3, and CloudWatch. This step transforms our solution from a static data pipeline into a cloud-native, self-monitoring, and transparent system: one that is API-accessible and reports its own operational health without requiring anyone to log into the AWS console.
Retrieving Pipeline Status via AWS APIs
The next stage of the project focused on exposing internal pipeline details through an API-driven interface, allowing authorised users to query the operational status of both the data and machine learning pipelines. This capability eliminates the need to manually inspect AWS services such as Glue or S3 while maintaining full transparency and observability.
The design objective was to show how a cloud-native solution can offer programmatic visibility into system health and performance by leveraging AWS APIs. Through this approach, stakeholders can validate data ingestion, model training, and ETL workflows in near real time, ensuring traceability and accountability across the pipeline lifecycle.
Retrieve Key Application Details Using AWS APIs
To achieve this, a Lambda function named scdf-status-check-api-access was developed. It acts as the central API access layer for the solution.
This function was implemented in Python using the boto3 SDK. It enables seamless interaction with AWS Glue, Amazon S3, and Amazon CloudWatch. Each of these services contributes a unique aspect of pipeline observability:
- AWS Glue – provides metadata and execution states for ETL and ML jobs.
- Amazon S3 – verifies artefacts and ensures that output datasets exist as expected.
- Amazon CloudWatch – integrates metrics and logs for performance monitoring.
Together, these services provide a comprehensive, programmatic view of the entire system.
Source Code
The full implementation of this Lambda function is available in the public GitHub repository:
API Driven Cloud-Native Solutions
The script, named scdf_status_check.py, contains the complete Lambda handler logic, including:
- AWS Glue job discovery and run status retrieval
- Amazon S3 artefact verification
- CloudWatch metrics integration (extensible)
- Structured console output for verification
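The S3 artefact-verification step mentioned above can be sketched with a metadata-only `head_object` call per expected output. This is a minimal sketch, not the repository's exact code: the function name, bucket, and keys are my own assumptions, and error handling is simplified to a broad `except` (production code would catch botocore's `ClientError` and inspect the 404 status code).

```python
def verify_artifacts(s3_client, bucket, keys):
    """Check each expected output object with a cheap, metadata-only head_object call."""
    results = {}
    for key in keys:
        try:
            s3_client.head_object(Bucket=bucket, Key=key)  # raises if the object is missing
            results[key] = True
        except Exception:
            results[key] = False
    return results
```

Inside the Lambda, this would be invoked with the boto3 S3 client created at initialisation, e.g. `verify_artifacts(s3, "scdf-demo-bucket", ["features/train.parquet"])` (bucket and key are illustrative).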
The function was built and tested in the AWS Lambda environment (Python 3.13 runtime) and follows least-privilege IAM access principles. The code is modular and scalable, making it easy to extend with additional AWS service integrations in future.
Understanding the Lambda Function’s Role
Before diving into the implementation, it’s important to understand what this script achieves at a high level.
The scdf-status-check-api-access function acts as a lightweight observability agent within the pipeline. Its goal is to collect, correlate, and present the most essential runtime information about the system — without human intervention.
When triggered, the Lambda performs three primary actions:
- Discovery – automatically identifies all AWS Glue jobs deployed within the environment.
- Inspection – queries the latest job runs to determine current state, start time, and runtime duration.
- Reporting – structures these details into a clear, console-based summary that is logged to CloudWatch.
This approach aligns with the broader vision of a self-aware cloud-native pipeline. In this vision, operational transparency is built in — not bolted on.
Users no longer need to check multiple AWS services manually; instead, they can invoke a single API endpoint to immediately retrieve up-to-date health and performance metrics for the entire system.
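The discovery, inspection, and reporting steps can be sketched as a single handler. This is a hedged outline of the shape, not the repository's exact implementation: `build_status_report` and the return structure are my own assumptions, while `get_jobs` and `get_job_runs` are real AWS Glue APIs.

```python
import json

def build_status_report(glue_client):
    """Return the latest run state for every Glue job visible to the client."""
    jobs = glue_client.get_jobs(MaxResults=20)["Jobs"]             # discovery
    report = []
    for job in jobs:
        runs = glue_client.get_job_runs(JobName=job["Name"], MaxResults=1)
        job_runs = runs.get("JobRuns", [])                          # inspection
        state = job_runs[0].get("JobRunState", "N/A") if job_runs else "NEVER_RUN"
        report.append({"JobName": job["Name"], "Status": state})
    return report

def lambda_handler(event, context):
    import boto3                                # resolved from the Lambda runtime
    glue = boto3.client("glue")
    report = build_status_report(glue)
    print(json.dumps(report, indent=2))                             # reporting
    return {"statusCode": 200, "body": json.dumps(report)}
```

Keeping the report-building logic separate from the handler makes it testable with a stubbed client, without touching live AWS services.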
With that context, let’s walk through the function step by step and validate its behaviour using real execution logs.
Code Walkthrough and Output Verification
The Lambda function is organised into distinct functional segments, each handling one AWS service integration. Let’s explore how these components work together and examine the actual execution logs that validate their operation.
Lambda Initialisation
Upon invocation, the Lambda begins by initialising boto3 clients for the required AWS services — namely AWS Glue and Amazon S3:
import boto3

glue = boto3.client('glue')
s3 = boto3.client('s3')
The boto3 library, the official AWS SDK for Python, is the natural choice for this implementation: it is natively supported within the AWS Lambda runtime, requires no external dependencies, and provides a high-level API for interacting with AWS services.
By using boto3:
- The Lambda securely authenticates through its execution role, without requiring static credentials.
- All calls to Glue, S3, and CloudWatch are handled through fully managed, retry-safe API sessions.
- Data can be retrieved and formatted directly within the same Python environment, ensuring minimal latency and consistent error handling.
A descriptive header marks the start of each Lambda execution:
===== API Driven Cloud-Native Solution - Verification Output =====
Automatic Discovery of AWS Glue Jobs
Once initialised, the Lambda automatically discovers all AWS Glue jobs configured in the account. This eliminates the need for hardcoded job names and ensures that new jobs are automatically included in the monitoring scope.
Implementation snippet
jobs_response = glue.get_jobs(MaxResults=20)
job_names = [job['Name'] for job in jobs_response['Jobs']]
print(f"Discovered {len(job_names)} Glue jobs in this account.\n")
The get_jobs() API returns metadata for all Glue jobs within the same AWS region. The Lambda extracts job names into a list, which is then used for subsequent API queries.
This technique demonstrates dynamic pipeline introspection — a key trait of cloud-native automation. It allows the system to automatically adjust as new ETL or ML jobs are deployed, without requiring any code changes.
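One caveat worth noting: `get_jobs` returns results a page at a time, so accounts with more jobs than one page holds need to follow the `NextToken` field. A minimal sketch of that loop (the function name is my own; the pagination fields are part of the real Glue API):

```python
def list_all_job_names(glue_client):
    """Follow NextToken pages until every Glue job name has been collected."""
    names = []
    token = None
    while True:
        kwargs = {"MaxResults": 20}
        if token:
            kwargs["NextToken"] = token
        resp = glue_client.get_jobs(**kwargs)
        names.extend(job["Name"] for job in resp["Jobs"])
        token = resp.get("NextToken")
        if not token:  # no further pages
            return names
```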
Log Verification
Discovered 6 Glue jobs in this account.
This confirms successful enumeration of all jobs, validating both IAM permissions and boto3 integration.
Retrieving Four Key Application Details
For each discovered Glue job, the Lambda retrieves its most recent run information using the get_job_runs() API. This gives us a real-time snapshot of operational health and performance.
Implementation Logic
runs = glue.get_job_runs(JobName=job, MaxResults=1)
if runs.get('JobRuns'):
    last_run = runs['JobRuns'][0]
    job_details.append({
        "Job Name": job,
        "Status": last_run.get('JobRunState', 'N/A'),
        "Started On": str(last_run.get('StartedOn', 'N/A')),
        "Execution Time (s)": last_run.get('ExecutionTime', 'N/A')
    })
From this metadata, four critical details are extracted:
- Job Name – identifies the Glue pipeline component.
- Status – reflects its most recent state (SUCCEEDED, FAILED, or RUNNING).
- Started On – timestamp marking the start of the run.
- Execution Duration (s) – total runtime, useful for performance assessment.
To make verification straightforward, the Lambda formats these details into a clean tabular output printed to CloudWatch logs.
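The fixed-width table can be produced with plain f-string alignment. This is a sketch of such a formatter, assuming the `job_details` dictionaries built earlier; the exact column widths are my own choice, not taken from the repository:

```python
def format_status_table(job_details):
    """Render job details as a fixed-width table suitable for CloudWatch logs."""
    lines = [
        f"{'Job Name':<32}{'Status':<12}{'Started On':<30}{'Exec Time (s)'}",
        "-" * 95,
    ]
    for d in job_details:
        lines.append(
            f"{d['Job Name']:<32}{d['Status']:<12}"
            f"{d['Started On']:<30}{d['Execution Time (s)']}"
        )
    lines.append("-" * 95)
    return "\n".join(lines)
```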
Log Verification
=== Verification Table: Four Application Details Retrieved via AWS APIs ===
Job Name Status Started On Exec Time (s)
-----------------------------------------------------------------------------------------------
EDA Job ReTry SUCCEEDED 2025-10-20 05:45:58.693000 107
EDA_ExportToCSV_Job SUCCEEDED 2025-10-19 16:30:47.197000 151
EDA_GlueJob SUCCEEDED 2025-10-19 17:31:35.547000 135
scdf-etl-clean-split-job SUCCEEDED 2025-10-19 10:38:51.851000 149
scdf-feature-engineering-job SUCCEEDED 2025-10-20 06:25:24.627000 72
scdf-ml-training-job SUCCEEDED 2025-10-21 06:28:40.961000 115
-----------------------------------------------------------------------------------------------
This confirms that all six Glue jobs completed successfully, with accurate execution timestamps and durations logged.
Verification of Execution Performance
Each Lambda execution concludes with a CloudWatch summary record that provides runtime metrics, including total execution time, memory utilisation, and billed duration.
REPORT RequestId: eda9626d-5266-44c2-a0db-7256a8867355
Duration: 1202.82 ms | Billed Duration: 1203 ms | Memory Size: 128 MB | Max Memory Used: 97 MB
Interpretation of Execution Performance Verification
This CloudWatch summary record provides crucial metrics on the function's runtime behaviour. Here's an analysis of its components:
- Request ID: Each Lambda invocation is assigned a unique identifier, here RequestId: eda9626d-5266-44c2-a0db-7256a8867355. This identifier helps in tracking specific executions and aids in debugging issues if they arise.
- Duration: The function took 1202.82 milliseconds to execute, measured from start to finish. This metric is essential for assessing the function's efficiency, especially in scenarios where response time is critical.
- Billed Duration: This reflects the duration AWS bills for the execution, recorded at 1203 milliseconds. AWS Lambda bills in 1-millisecond increments, rounding any partial millisecond up; here, the measured 1202.82 ms rounds up to 1203 ms.
- Memory Size: The memory allocated to the function was 128 MB. AWS Lambda lets users select a memory size for their functions, which directly influences CPU allocation; a higher memory allocation can deliver better performance through additional CPU resources.
- Max Memory Used: The function peaked at 97 MB of the allocated 128 MB, indicating that it ran comfortably within its memory limit with headroom to spare.
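The rounding rule for billed duration can be sanity-checked with one line of arithmetic:

```python
import math

duration_ms = 1202.82               # measured duration from the REPORT line
billed_ms = math.ceil(duration_ms)  # Lambda bills in whole 1 ms increments, rounding up
print(billed_ms)                    # 1203
```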
Overall, these performance metrics are crucial for evaluating the efficiency of the Lambda function. They provide insights into execution time, resource utilization, and potential areas for optimization. Monitoring these metrics allows developers to refine their applications further, ensuring they remain performant and cost-effective in a cloud environment.
Conclusion
In this post, we explored observability within our data pipeline, building an API-driven layer that provides real-time visibility into its inner workings. By combining the capabilities of AWS Lambda, Glue, S3, and CloudWatch, we transformed what was once a background process into a self-monitoring, transparent, and cloud-native system.
This new layer of programmatic access empowers stakeholders to validate every stage, from data ingestion and feature transformation to model training, without manually inspecting the AWS Console. The result is a pipeline that speaks for itself, continuously reporting on its own operational health and performance.
Beyond transparency, this design also enhances agility, providing a foundation for proactive monitoring, faster troubleshooting, and smarter automation that keeps the system responsive as data volumes and workloads evolve.
With this milestone, we have introduced observability and laid the groundwork for what comes next: a fully event-driven monitoring ecosystem. In the next post, we'll extend this framework using Amazon EventBridge, allowing our pipeline to schedule automated checks, trigger alerts, and respond intelligently to operational events.