Building an AI-Powered Examiner: Automated GCSE Marking with Gemini and Python
A complete walkthrough of a local web app that reads handwritten student answers and produces professional feedback reports end to end
Introduction
Marking a class set of handwritten GCSE exam papers is one of the most time-consuming tasks a teacher faces. Every answer needs to be read against a multi-page mark scheme, level descriptors need to be applied consistently, and feedback needs to be written in a form the student can actually act on. For a class of thirty, that is several hours of focused work repeated across every topic, every mock, every term.
This project automates that pipeline. A teacher uploads a scanned PDF of a student’s handwritten answer to a local web app. Within 20-30 seconds, a professionally formatted .docx feedback report downloads automatically, complete with the student’s answer transcribed verbatim, a gap analysis table, a level assessment, a mark out of the available total, a confidence rating, and two actionable improvement sentences written directly to the student.
The tool is built on Google Gemini (via the google-generativeai Python SDK), a Flask web server, and python-docx for report generation. It runs entirely locally inside Docker: no data leaves your machine except for the Gemini API call itself.
This post walks through every layer of the system: what the app does, how it does it, why the key design decisions were made, and exactly how to replicate it from scratch.
What You Will Build
By the end of this post you will have a running local web application that:
- Accepts scanned PDF uploads of handwritten student answers via a drag-and-drop browser UI
- Converts each PDF page to an image using poppler-utils at 200 DPI
- Sends the page images to Gemini with a detailed, examiner-grade marking prompt embedded in the request
- Parses Gemini’s structured response and renders it into a formatted .docx feedback report using python-docx
- Auto-downloads the report to the teacher’s machine: no login, no cloud storage, no manual steps
- Ships as a single Docker Compose stack that starts with one command
The marking prompt embedded in this project is pre-loaded for the OCR GCSE Economics J205/01 paper, covering Questions 21, 22 and 23 in full, including 1-2 mark questions, 6-mark Analyse questions, and 6-mark Evaluate questions with a three-part supported judgement test. The architecture is subject-agnostic: swapping in a different mark scheme is a single variable edit in app.py.
How It Works: The End-to-End Pipeline
Before diving into the code, it helps to understand the full data flow. A single marking request goes through five distinct stages:
1. The teacher uploads a scanned PDF via the browser UI
2. pdftoppm converts each page to a PNG at 200 DPI
3. The page images are sent to gemini-2.5-flash as base64 inline data, followed by the marking prompt
4. python-docx renders Gemini’s structured response into a formatted .docx
5. The report auto-downloads in the browser

The whole round trip for a three-page PDF typically takes 15-30 seconds, dominated by Gemini’s inference time on the handwriting images.
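The five stages can be sketched as a single function with each stage injected as a callable, so the flow reads independently of Gemini or poppler. This is an illustrative sketch, not the project's actual code; the real app.py calls the concrete functions directly inside the Flask route.

```python
def mark_paper(pdf_path, convert, mark, render, output_path):
    """Pipeline sketch: each stage is passed in as a function so the
    data flow can be read (and tested) without poppler or an API key."""
    images = convert(pdf_path)      # stages 1-2: PDF pages -> PNG bytes
    feedback = mark(images)         # stage 3: images + prompt -> Gemini text
    render(feedback, output_path)   # stage 4: structured text -> .docx
    return output_path              # stage 5: served back as a download
```

In the real app the three callables correspond to pdf_to_images(), mark_with_gemini() and create_docx_report(), each covered in its own section of this post.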
Project Structure
The project is deliberately minimal. Everything lives in six files:
gcse_economic_ai_examiner/
├── app.py ← Flask app + Gemini call + docx generator + marking prompt
├── templates/
│ └── index.html ← Single-page drag-and-drop UI
├── requirements.txt ← Python dependencies
├── Dockerfile ← Container definition (Python 3.11 + poppler)
├── docker-compose.yml ← One-command start/stop
└── .env.example ← API key template (copy to .env)
There is no database, no session state, no background worker queue. Each request is fully self-contained. The only external dependency at runtime is the Gemini API.
Prerequisites
You need three things before you start:
- Docker Desktop (or Docker Engine + Compose plugin) installed and running
- A free Google Gemini API key: get one at aistudio.google.com → “Get API key”
- Git to clone the repository
No Python installation is required on your host machine; everything runs inside the container.
Step-by-Step Setup
Clone the repository
git clone https://github.com/vivekbhadra/gcse_economic_ai_examiner.git
cd gcse_economic_ai_examiner
Create your .env file
The app reads your Gemini API key from a .env file. Copy the template and fill it in:
cp .env.example .env
# Open .env in any editor and replace "your_api_key_here" with your real key
The .env file is not committed to Git (it is listed in .gitignore). Your key stays on your machine.
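Inside the container, Docker Compose’s env_file setting injects the key as an environment variable. A minimal sketch of the kind of startup check app.py can perform (the helper name read_api_key is illustrative, not the project’s actual code):

```python
import os

def read_api_key(env=os.environ) -> str:
    """Return the Gemini key, rejecting the untouched template placeholder.
    'env' is injectable purely to make the helper easy to test."""
    key = env.get("GEMINI_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("GEMINI_API_KEY not set - check your .env file")
    return key
```

Failing fast here means a missing or template key surfaces as a clear error at startup rather than as a confusing Gemini API failure mid-request.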
Build and start the container
docker compose up --build
This builds the Docker image (installs Python packages and poppler-utils) and starts the Flask server. The first build takes 1-2 minutes. Subsequent starts are instant.
Open the app
Navigate to http://localhost:5000 in your browser. You should see the marking interface.
Upload a student answer and mark it
Drag and drop (or click to browse) a scanned PDF of a student’s handwritten answer. Click Mark Answer. The .docx feedback report will download automatically within 15-30 seconds.
To stop the app, press Ctrl+C in the terminal running Docker, then run docker compose down to remove the container. To rebuild after editing app.py, run docker compose up --build again.
The Marking Prompt: The Brain of the System
The most important part of the project is not the Flask server or the docx formatter; it is the MARKING_PROMPT string in app.py. This is what Gemini reads to understand how to mark. Its quality directly determines the quality of the feedback.
The prompt in this project is the OCR GCSE Economics J205/01 Master Marking Prompt v3.0. It runs to approximately 300 lines and is structured in eight parts:
Prompt Architecture (v3.0)
- Part 1 Identity and limitations: tells Gemini it is a standardised OCR examiner, defines the three highest risk areas (handwriting misreads, level boundary judgements, unusual responses), and frames the output as a “reliable first-pass diagnostic tool, not a substitute for human oversight”
- Part 2 The full mark scheme: all marking criteria for Q21, Q22 and Q23, including indicative content bullet points for every 6-mark question and calculation working for Q23(b)
- Part 3 Model answer training: worked examples showing exactly what a Level 1, Level 2 and Level 3 answer looks like, and why each level boundary was applied
- Part 4 Marking protocol: a mandatory seven-step process for every answer: restate the question, quote the student verbatim, confirm the mark scheme section, complete the gap analysis table, assess the level, run calibration checks, self-audit
- Part 5 Output format: exact schema for the feedback document (question, type, verbatim quote, gap analysis, level assessment, mark, confidence, justification, feedback, improvement sentences)
- Part 6 Reliability standard: defines when marking is correct and when it is not, to guide Gemini’s self-review before outputting
- Part 7 (reserved)
- Part 8 Protocol for out-of-scope questions: a five-step decision tree for handling questions not in the embedded mark scheme, preventing Gemini from inventing indicative content
The key design principle is that every mark awarded or withheld must be directly traceable to a verbatim student quote and a specific mark scheme criterion. The prompt instructs Gemini to refuse marks it cannot justify in this way, and to flag borderline decisions explicitly rather than silently round up.
To adapt the tool for another subject, open app.py, find the MARKING_PROMPT variable, and replace the content of Parts 2 and 3 with your own mark scheme and model answers. The protocol in Parts 4-8 is generic and can stay as-is for any GCSE subject.
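As an illustration of the shape (not the actual 300-line string), the variable looks roughly like this, with Parts 2 and 3 being the subject-specific sections you would swap out:

```python
# Illustrative skeleton only; the real MARKING_PROMPT in app.py runs to
# roughly 300 lines, with full mark scheme text and worked model answers.
MARKING_PROMPT = """
PART 1 - IDENTITY AND LIMITATIONS: You are a standardised OCR examiner...
PART 2 - MARK SCHEME: <subject-specific: paste the official mark scheme>
PART 3 - MODEL ANSWERS: <subject-specific: worked Level 1/2/3 examples>
PART 4 - MARKING PROTOCOL: restate, quote verbatim, gap analysis, ...
PART 5 - OUTPUT FORMAT: question, type, quote, mark, confidence, ...
PART 6 - RELIABILITY STANDARD: when a mark counts as correct, self-review
PART 8 - OUT-OF-SCOPE PROTOCOL: never invent indicative content; ...
"""
```

Only Parts 2 and 3 carry subject knowledge; everything else encodes examiner behaviour, which is why a single variable edit is enough to retarget the tool.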
Deep Dive: The Flask Application (app.py)
The entire backend logic lives in app.py. It has three main sections: PDF-to-image conversion, Gemini API call, and docx report generation.
PDF to Images: pdf_to_images()
Gemini’s vision API works on images, not PDFs. We use pdftoppm (part of poppler-utils) to convert each page of the uploaded PDF to a PNG file at 200 DPI: high enough for Gemini to read handwriting reliably, yet low enough to keep request sizes manageable:
import os
import subprocess
import tempfile
from pathlib import Path

def pdf_to_images(pdf_path: str) -> list:
    with tempfile.TemporaryDirectory() as tmpdir:
        out_prefix = os.path.join(tmpdir, "page")
        result = subprocess.run(
            ["pdftoppm", "-png", "-r", "200", pdf_path, out_prefix],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            raise RuntimeError(f"pdftoppm failed: {result.stderr}")
        images = []
        for img_path in sorted(Path(tmpdir).glob("page-*.png")):
            images.append(img_path.read_bytes())
        return images
The function returns a list of raw PNG bytes one item per page. All temporary files are created inside a TemporaryDirectory context manager and cleaned up automatically when it exits.
Sending to Gemini: mark_with_gemini()
The Gemini API accepts a parts list that can mix images and text. We build the list by encoding each page image as base64, then append the marking prompt as the final text part:
import base64
import google.generativeai as genai

def mark_with_gemini(image_bytes_list: list) -> str:
    model = genai.GenerativeModel("gemini-2.5-flash")
    parts = []
    for img_bytes in image_bytes_list:
        parts.append({
            "inline_data": {
                "mime_type": "image/png",
                "data": base64.b64encode(img_bytes).decode()
            }
        })
    parts.append({"text": MARKING_PROMPT})
    response = model.generate_content(parts)
    return response.text
The model used is gemini-2.5-flash, the fastest Gemini model with strong vision capability. The images come first in the parts list, so Gemini sees the handwriting before it sees the instructions, which empirically produces better transcription accuracy.
Generating the Report: create_docx_report()
Gemini returns its feedback as structured text using Markdown conventions: ## headings, pipe-delimited table rows, **bold** inline markers, and - bullet lists. The create_docx_report() function parses this line by line and translates each element into native python-docx formatting:
| Gemini output | docx output |
|---|---|
| ## Heading | Bold blue heading with underline border |
| # Heading | Smaller bold blue heading |
| Pipe-delimited table rows | Real Word table with shaded header row |
| - bullet point | List Bullet paragraph style |
| **bold** | Bold run within paragraph |
| Question: / Mark: labels | Blue bold label + normal text run |
| --- | Horizontal rule (top border on empty paragraph) |
The report opens with a styled header block and closes with a footer disclaimer reminding the teacher that AI-generated marks on borderline decisions should be verified before being returned to students.
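The line-by-line dispatch can be sketched as a small classifier. The element names below are illustrative rather than the project’s internal API, but the matching rules mirror the table above:

```python
def classify_line(line: str) -> str:
    """Map one line of Gemini's Markdown-style output to a report element."""
    s = line.strip()
    if s.startswith("## "):
        return "heading"       # bold blue heading with underline border
    if s.startswith("# "):
        return "subheading"    # smaller bold blue heading
    if s.startswith("|") and s.endswith("|"):
        return "table_row"     # becomes a row in a real Word table
    if s.startswith("- "):
        return "bullet"        # List Bullet paragraph style
    if s == "---":
        return "rule"          # top border on an empty paragraph
    return "paragraph"         # plain text; **bold** is handled per-run
```

A dispatcher like this keeps the Markdown parsing separate from the python-docx rendering calls, which makes the format mapping easy to extend.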
The Flask Routes
There are only two routes. GET / serves the upload UI. POST /mark handles the full marking pipeline:
import os
import tempfile
from flask import request, send_file

@app.route('/mark', methods=['POST'])
def mark():
    pdf_file = request.files['pdf']
    # 1. Save to temp file
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
        pdf_file.save(tmp.name)
    # 2. Convert pages to images
    images = pdf_to_images(tmp.name)
    os.unlink(tmp.name)
    # 3. Send to Gemini
    feedback = mark_with_gemini(images)
    # 4. Build .docx
    output_path = tempfile.mktemp(suffix='.docx')
    create_docx_report(feedback, output_path)
    # 5. Return as download
    return send_file(output_path, as_attachment=True,
                     download_name='marking_feedback.docx')
Flask’s send_file() with as_attachment=True triggers the browser’s download dialogue automatically. No JavaScript is needed to initiate the download on the client side; the browser handles it from the response headers.
The Frontend: index.html
The entire UI is a single HTML file served by Flask’s render_template(). It has no JavaScript framework dependencies: just vanilla JS and a small block of CSS.
Key UI behaviours:
- Drag-and-drop zone with dragover/drop event listeners that accept .pdf files only
- File info bar showing filename and size, with a remove button
- Animated progress bar that steps through labelled stages (uploading, converting, sending to AI, generating report) while the request is in flight; the actual API call is not trackable mid-flight, so the steps are time-based approximations
- Status banner that shows a green success message or red error message once the request completes
- Automatic download: on success, the response blob is turned into an object URL and clicked programmatically, triggering the browser download without any redirect
Docker Setup
The Dockerfile
The container uses python:3.11-slim as the base image, a minimal Debian build that keeps the image small. The only system package installed is poppler-utils, which provides pdftoppm:
FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
poppler-utils \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
COPY templates/ ./templates/
EXPOSE 5000
CMD ["python", "app.py"]
Python packages are installed before copying the application code. This order ensures Docker’s layer cache is used: rebuilding after an app.py change does not reinstall packages.
Docker Compose
The Compose file wires the container to a .env file for the API key, exposes port 5000, and mounts app.py as a volume for live reload during development:
services:
  marker-app:
    build: .
    ports:
      - "5000:5000"
    env_file:
      - .env
    volumes:
      - ./app.py:/app/app.py   # live reload during development
Mounting app.py as a volume means changes to the file on your host are reflected inside the running container immediately, without a full rebuild. To pick up changes, just restart the container with docker compose restart rather than docker compose up --build.
Python Dependencies
| Package | Version | Purpose |
|---|---|---|
| flask | 3.0.3 | Web framework: routes, file upload, response handling |
| google-generativeai | 0.7.2 | Gemini API SDK: model configuration and content generation |
| Werkzeug | 3.0.3 | Flask dependency: WSGI utilities |
| python-docx | 1.1.2 | Word document creation: paragraphs, tables, styles, borders |
Confidence Levels in the Feedback Report
One of the most important features of the marking prompt is its three-tier confidence system. Every marked question in the report carries one of three ratings:
| Rating | Meaning | Action required |
|---|---|---|
| High | Every mark traces directly to a verbatim student quote and a clear mark scheme criterion | None; safe to return to student |
| Moderate | One borderline decision was made that could reasonably go either way | Teacher verification recommended before returning the mark |
| Low | Transcription uncertainty or unusual response; Gemini flagged words as [UNCLEAR] | Teacher must verify before communicating the mark to the student |
This matters because the alternative, a tool that always returns a confident mark regardless of legibility or ambiguity, is actively dangerous in an educational context. The confidence system makes the AI’s uncertainty visible rather than hiding it behind a number.
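Because the output schema includes a labelled confidence rating per question, a batch triage script could pull the ratings straight out of the feedback text before it is rendered to docx. A sketch, assuming lines of the form "Confidence: High" as specified in Part 5 of the prompt:

```python
import re

def extract_confidence(feedback: str) -> list:
    """Collect per-question confidence ratings for quick triage,
    e.g. to flag which reports need teacher review first."""
    return re.findall(r"Confidence:\s*(High|Moderate|Low)", feedback)
```

A teacher marking a full class set could then sort reports so that Low-confidence scripts are reviewed first.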
Limitations and Honest Caveats
This tool is a reliable first-pass diagnostic assistant. It is not a replacement for a trained examiner, and the marking prompt says so explicitly. Key limitations to be aware of:
- Handwriting quality matters. Poor scan resolution or very unclear writing will produce more [UNCLEAR] flags and Low confidence ratings. 200 DPI is the minimum; 300 DPI produces better results for difficult handwriting.
- Level boundary judgements are approximations. The Level 2 / Level 3 boundary is particularly difficult to judge consistently even for human examiners. The AI’s judgements at this boundary should be treated as indicative, not definitive.
- The mark scheme is embedded, not live. The OCR J205/01 mark scheme embedded in this project was current at time of writing. Official mark schemes are updated periodically; check the OCR website for the latest version before relying on this tool for formal assessments.
- One paper, one model. The embedded prompt covers Q21-Q23 of J205/01 only. Questions from other papers require the mark scheme to be updated in the prompt.
- API costs. Gemini Flash has a generous free tier as of writing. Very high volume usage (hundreds of papers per day) may incur costs; check the current Gemini API pricing on Google AI Studio.
Extending the Project
The architecture is intentionally simple so it is easy to adapt. Some natural next steps:
- Different subjects or papers: Replace the MARKING_PROMPT variable with a different mark scheme. The protocol in Parts 4-8 is generic and reusable.
- Batch marking: Modify the upload UI to accept multiple PDFs and queue them for sequential processing, writing all feedback reports into a single ZIP file for download.
- Class summary view: After marking a batch, extract the marks from each report and generate a summary spreadsheet showing marks by question for the whole class.
- Higher DPI for difficult handwriting: Change -r 200 in the pdftoppm call to -r 300. This increases image size and slightly slows the API call, but can significantly improve transcription accuracy for poor handwriting.
- Gemini model upgrade: Replace gemini-2.5-flash with gemini-2.5-pro for more thorough reasoning on complex 6-mark answers. Expect longer response times and higher API usage.
- Authentication: If deploying on a school network rather than localhost, add Flask-Login or a simple API key gate to prevent unauthorised access.
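The batch-marking extension is mostly plumbing; bundling per-student reports into one ZIP is a few lines of the standard library. A sketch, not part of the current app (the function name and dict shape are assumptions):

```python
import io
import zipfile

def reports_to_zip(reports: dict) -> bytes:
    """Bundle {filename: docx_bytes} into a single downloadable ZIP,
    built entirely in memory so no temp files are left behind."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in reports.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

The returned bytes can be handed straight to Flask's send_file() wrapped in an io.BytesIO, with mimetype "application/zip".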
Quick Reference
| Command | What it does |
|---|---|
| docker compose up --build | Build image and start the app (first run or after code changes) |
| docker compose up | Start without rebuilding (faster, uses cached image) |
| docker compose down | Stop and remove the container |
| docker compose restart | Restart after editing app.py (uses volume mount, no rebuild needed) |
| docker compose logs -f | Stream container logs (useful for debugging API errors) |
Troubleshooting
| Error | Cause and fix |
|---|---|
| GEMINI_API_KEY not set | The .env file is missing or the key is still set to “your_api_key_here”. Check that .env exists in the project root and contains your real key. |
| pdftoppm failed | Should not occur inside Docker as poppler-utils is installed in the Dockerfile. If running app.py locally (outside Docker), install poppler: brew install poppler on Mac, apt install poppler-utils on Linux. |
| No file uploaded (400 error) | The uploaded file was not received. Check the file is a valid .pdf and under 32MB (the Flask MAX_CONTENT_LENGTH limit). |
| Very slow response (>60s) | Normal for a 5+ page PDF. Gemini processes each page image sequentially. Reduce the PDF to only the relevant answer pages before uploading. |
| Blank or garbled docx | Gemini returned an unexpected response format. Run docker compose logs -f to see the raw response in the Flask output and check for API error messages. |
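The PDF-only check and the 32MB limit from the table above can also be mirrored in a small pre-flight helper. A sketch only; the real app relies on Flask's MAX_CONTENT_LENGTH setting and the drop zone's file filter, and validate_upload is a hypothetical name:

```python
MAX_CONTENT_LENGTH = 32 * 1024 * 1024  # matches the Flask limit cited above

def validate_upload(filename: str, size_bytes: int):
    """Return an error message, or None if the upload looks acceptable."""
    if not filename.lower().endswith(".pdf"):
        return "Not a PDF: the app accepts .pdf files only"
    if size_bytes > MAX_CONTENT_LENGTH:
        return "File too large: the limit is 32MB"
    return None
```

Checking before the expensive conversion step means an oversized or wrong-type file fails in milliseconds rather than after a Gemini round trip.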
Conclusion
This project demonstrates that a genuinely useful, professionally structured AI marking tool can be built with a small amount of Python and a carefully engineered prompt. The technology stack is not exotic: Flask, a vision-capable LLM, and a document library. What makes the tool effective is the quality and structure of the marking prompt: the effort invested in encoding the mark scheme, the level descriptors, the model answer examples, and the calibration checks is what separates diagnostic feedback from generic AI commentary.
If you use this for your own teaching, adapt it for a different subject, or build on it, I would be glad to hear how it goes. Feel free to open an issue or pull request on the GitHub repository.