Benchmark Arena TCG
TL;DR
A tech-free wellbeing day forced me to think differently, so I built a physical AI trading card game where students built and battled their own AI models using real concepts like capability trade-offs and benchmark optimisation.
Benjamin Hyde
Education Leader & AI Builder
Last week our school ran a wellbeing day where students were given a longer lunch, food trucks, live music, and a break from the normal routine. Staff were challenged to go completely tech-free for the day. No PowerPoints. No smart boards. No laptops.
As a Digital Technologies teacher currently teaching AI models to Year 10 students, that created a bit of a problem.
So naturally, I made an AI trading card game.
The Challenge
The goal of the lesson was to continue our work around AI models, specifically: how different AI models are designed, the trade-offs that exist between models, how models are evaluated against benchmarks, and why there is rarely a single "best" model.
At the same time, I had a prac teacher with me, Daniel, and during one of those completely random brainstorming moments I threw out the idea of making a physical trading card game for the students.
What started as a throwaway comment escalated very quickly.
Building Benchmark Arena
Using a mix of ChatGPT and Claude, I started rapidly prototyping mechanics, balancing systems, and card concepts. I tend to use different models for different stages of creative work.
For me, ChatGPT is an excellent ideation partner, Claude is great at critique and evaluation, and OpenAI's image generation model has become my preferred tool for creating artwork.
The project itself unintentionally became an example of model specialisation.
Once the mechanics were refined and the visual style was locked in, Benchmark Arena was born.
How the Game Works
Each student started with a single base model card. Some models specialised in reasoning. Others focused on creativity, speed, low cost, or contextual understanding. Every model had strengths and weaknesses.
For example, the Edge Device Model was extremely fast and low cost, but had lower contextual understanding and creativity.
Students were then given modifier cards that allowed them to customise and improve their models. One example was the "Long Context Upload" card, which gave: +3 Context, -1 Speed.
This meant students had to evaluate whether the trade-off was worth it. If a model originally had Context: 1 and Speed: 5, then after applying the modifier it became Context: 4 and Speed: 4.
Students quickly realised that improving one capability often came at the expense of another.
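For anyone who prefers to see the arithmetic written out, here is a minimal Python sketch of the "Long Context Upload" example above. The game itself was played entirely on paper, so the function name `apply_modifier` and the dictionary layout are my own assumptions for illustration, not part of the cards.

```python
def apply_modifier(stats, modifier):
    """Return a new stats dict with the modifier's changes added on.

    Both arguments map a capability name to a number. This layout is an
    assumption for illustration; the real game used printed cards.
    """
    updated = dict(stats)  # copy so the base model card is left unchanged
    for capability, delta in modifier.items():
        updated[capability] = updated.get(capability, 0) + delta
    return updated


# Values taken from the example in the post.
base_model = {"context": 1, "speed": 5}
long_context_upload = {"context": +3, "speed": -1}

print(apply_modifier(base_model, long_context_upload))
# {'context': 4, 'speed': 4}
```

Returning a copy rather than editing the base stats mirrors what happened at the table: the base card never changed, students just stacked modifiers on top of it.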
Importantly, students weren't just adding a single modifier. Each student received one base model card and nine modifier cards. Students had to evaluate combinations of cards, make optimisation decisions, and determine what type of AI model they actually wanted to build.
Once all modifiers were applied, students calculated their final scores across all eight capabilities.
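The final-score step can be sketched the same way: start from the base card and add each modifier in turn. This is a rough illustration only. The starting numbers for the Edge Device Model, the `cost_efficiency` name, and the second card are all made up for the example, and only four of the capabilities are shown; "Long Context Upload" is the one card taken from the post.

```python
def build_model(base_stats, modifiers):
    """Add every modifier card to the base stats and return the final stats."""
    final = dict(base_stats)  # copy so the base card stays untouched
    for modifier in modifiers:
        for capability, delta in modifier.items():
            final[capability] = final.get(capability, 0) + delta
    return final


# Hypothetical starting numbers: the post only says this model was fast and
# low cost, with weaker context and creativity.
edge_device_model = {"speed": 5, "cost_efficiency": 5, "context": 1, "creativity": 1}

chosen_cards = [
    {"context": +3, "speed": -1},            # "Long Context Upload" (from the post)
    {"creativity": +2, "cost_efficiency": -1},  # hypothetical card for illustration
]

print(build_model(edge_device_model, chosen_cards))
# {'speed': 4, 'cost_efficiency': 4, 'context': 4, 'creativity': 3}
```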
Benchmark Battles
Once models were finalised, the competition began. I quickly generated a tournament bracket and students competed head-to-head on randomly selected benchmarks.
Each benchmark prioritised three different capabilities: one capability scored at x3, one at x2, and one at x1. The final weighted score determined the winner.
This meant students immediately discovered something important: there was no universally "best" model. Some models dominated creative benchmarks but performed terribly on reasoning tasks. Other models excelled in speed-focused benchmarks but collapsed when contextual understanding became important.
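The weighted scoring is easy to sketch in Python, and doing so shows why no single build wins everywhere. The x3, x2, and x1 weights come from the game; the two models, their numbers, and the two benchmarks below are invented purely for illustration.

```python
def benchmark_score(stats, benchmark):
    """Weighted sum of the capabilities a benchmark cares about.

    `benchmark` maps a capability name to its weight (3, 2, or 1).
    Capabilities the model does not list count as 0.
    """
    return sum(weight * stats.get(capability, 0)
               for capability, weight in benchmark.items())


# Hypothetical finished builds.
creative_model = {"creativity": 5, "context": 3, "speed": 2, "reasoning": 1}
reasoning_model = {"creativity": 1, "context": 2, "speed": 5, "reasoning": 5}

# Hypothetical benchmarks: one capability at x3, one at x2, one at x1.
creative_benchmark = {"creativity": 3, "context": 2, "speed": 1}
reasoning_benchmark = {"reasoning": 3, "speed": 2, "context": 1}

print(benchmark_score(creative_model, creative_benchmark))    # 23
print(benchmark_score(reasoning_model, creative_benchmark))   # 12
print(benchmark_score(creative_model, reasoning_benchmark))   # 10
print(benchmark_score(reasoning_model, reasoning_benchmark))  # 27
```

Each model wins comfortably on one benchmark and loses badly on the other, which is the same pattern students ran into in the bracket.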
The randomness of the benchmark system also created authentic moments where students realised their carefully optimised model was simply not suited to the challenge it faced. Which, honestly, is pretty reflective of real-world AI systems.
Students eliminated from the main bracket were moved into a lower bracket where they were allowed to swap two cards from their build in an attempt to improve performance. That small rule change led to some really interesting discussions around iteration, evaluation, and optimisation.
What Worked
What surprised me most was how quickly the conversations became sophisticated. Students weren't just trying to "win". They were debating optimisation strategies, discussing whether reasoning was worth the speed penalty, and arguing over which capabilities mattered most in different scenarios.
The physical nature of the game also changed the dynamic of the lesson. Students were manually building systems, calculating trade-offs, comparing outcomes, and defending their design choices face-to-face. In a topic area that is almost always taught digitally, the tactile element actually made the thinking more visible.
For a completely tech-free lesson, it ended up being one of the most authentic discussions about AI systems we've had all term.
The students were so invested in the idea that many immediately started asking whether an online version of the game could be created.
Unfortunately for me, that probably means this project isn't finished yet.
Final Thoughts
What started as a throwaway idea during a wellbeing day became one of the most engaging AI lessons I've run in a long time.
It was creative, competitive, chaotic, strategic, and surprisingly thoughtful.
Most importantly, it helped students understand that AI systems are ultimately a series of design decisions and trade-offs rather than magic black boxes.
And honestly, any lesson that gets Year 10 boys arguing passionately about benchmark optimisation without touching a laptop is probably worth revisiting.
Build Notes
Approach
Design a physical card game that teaches AI model capability trade-offs and benchmark optimisation through competitive play. No technology required.
Tools Used
ChatGPT (ideation and mechanics), Claude (critique and evaluation), OpenAI image generation (card artwork), printed cards
What Worked
Students moved past trying to win and started genuinely debating optimisation strategies, trade-offs, and design decisions, exactly the thinking the lesson was designed to prompt. The tactile, physical format made the thinking more visible than a digital equivalent would have.
What Failed
Nothing critical, but the randomised benchmark format meant some students felt unlucky rather than outplayed. Worth refining the reveal mechanic and giving students more information about what benchmarks might appear.
What's Next
Build an online version of Benchmark Arena. Students asked for it immediately, and work on it has already started.