Parallel Testing of 4 Architectural Approaches: How SuperSlide Chose Between LLM Strategies

2026-03-10 #ai#architecture#superslide#testing#llm

You have 4 ways to solve a problem. Each sounds convincing, but none can be dismissed without an experiment. The classic approach is to try one at a time, but when each option requires 3-6 hours of data preparation, sequential testing stretches into weeks. In SuperSlide, I decided to run everything in parallel and delegate each stream to a separate agent.

The Task

SuperSlide makes presentations from text. We have 57 templates — from cover_dark to roadmap_horizontal. The LLM needs to understand that “30% sales growth” is a slide with KPI blocks, not a bullet list.

The first version worked through OpenRouter. I gave the model a text description of the catalog, and it chose a suitable option. The result: 60% accuracy. The remaining 40% were repeated favorite layouts or the wrong format.

I needed a better approach. Which one was unclear.

4 Approaches I Tested

Option A — current baseline. The LLM gets the template catalog as text, chooses by description, and fills placeholders. Simple, but the model “sticks” to 5-6 favorite layouts out of 57.

Option B — examples instead of descriptions. The model gets 57 filled examples instead of empty templates with {{PLACEHOLDER}}. Hypothesis: specifics teach better than abstractions.

Option B2 — Qdrant + semantic search. Each template gets a semantic_description (“3 numerical indicators, KPI metrics, quarterly results”) and is indexed in Qdrant. Logic: POST /api/context/semantic → search nearest templates → filter duplicates. Hypothesis: let vector search decide, not the LLM.
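The search-then-filter step of B2 can be sketched without a running Qdrant instance. In this sketch, plain cosine similarity over toy vectors stands in for the vector index, and every template name and embedding is made up for illustration:

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy 3-dimensional "embeddings" standing in for real semantic_description vectors.
templates = {
    "kpi_3_metrics": [0.9, 0.1, 0.0],
    "bullet_list":   [0.1, 0.9, 0.0],
    "cover_dark":    [0.0, 0.1, 0.9],
}

def pick_templates(query_vec, used, top_k=2):
    # Rank by similarity, then drop templates already used in this deck
    # (the duplicate filter from the pipeline above).
    ranked = sorted(templates, key=lambda t: cosine(query_vec, templates[t]), reverse=True)
    return [t for t in ranked if t not in used][:top_k]

# The closest match (kpi_3_metrics) is already used, so the filter skips it.
print(pick_templates([0.8, 0.2, 0.1], used={"kpi_3_metrics"}))
```

The real endpoint delegates the ranking to Qdrant, but the duplicate filter is the part that keeps a 12-slide deck from repeating one layout.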

Option C — components. Templates broken into 48 atomic blocks: backgrounds (bg_dark_rings, bg_gradient), headers (comp_title_hero), content (comp_kpi_block, comp_bullets, comp_table). Each has constraints, compatible_with[], grid_area. LLM assembles slides from building blocks. Hypothesis: flexible assembly gives more variety.
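The component metadata from option C implies a validation step: before rendering, check that every pair of blocks on a slide appears in each other's compatible_with[]. A minimal sketch, with hypothetical component records mirroring the fields above:

```python
# Hypothetical component records; field names mirror the schema described above.
components = {
    "bg_dark_rings":   {"compatible_with": ["comp_title_hero", "comp_kpi_block"], "grid_area": "full"},
    "comp_title_hero": {"compatible_with": ["bg_dark_rings"], "constraints": {"max_text": 60}},
    "comp_kpi_block":  {"compatible_with": ["bg_dark_rings"], "constraints": {"max_items": 3, "needs_number": True}},
}

def validate_slide(parts):
    """Reject assemblies that pair components not listed in each other's compatible_with[]."""
    errors = []
    for i, a in enumerate(parts):
        for b in parts[i + 1:]:
            if b not in components[a].get("compatible_with", []):
                errors.append(f"{a} is not compatible with {b}")
    return errors

print(validate_slide(["bg_dark_rings", "comp_kpi_block"]))   # valid pair → []
print(validate_slide(["comp_title_hero", "comp_kpi_block"])) # invalid pair → one error
```

A check like this catches exactly the failure mode described later: the LLM combining blocks that were never meant to share a slide.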

How I Ran This in Parallel

Data preparation for each option took 2 to 6 hours. Doing it manually, one option after another, would have taken a week. Instead, I delegated each stream to a separate agent in Claude Code.

For option B2, I sent this prompt:

Prepare data for Qdrant indexing of 57 slide templates.
For each template create:
- semantic_description (2-3 sentences, visual and content description)
- best_for (3-5 use cases)
- never_for (limitations)
- visual_tags[] (filter tags)
Format: SQL INSERT into slide_templates table.
Reference HTML in /var/www/superslide/templates/

For the component option — different agent, different prompt:

Decompose the 36 most complex templates into atomic components.
Each component: html_fragment, css_classes, description,
constraints (max_text, max_items, needs_number),
compatible_with[], grid_area.
Format: SQL INSERT into slide_components table.

The routing infrastructure is an n8n workflow with a Switch node. The user selects an approach via a radio button, and the request goes to the right branch.
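The Switch node itself is configured in the n8n UI, but its logic reduces to a dictionary dispatch. A sketch of the same routing in code, where the handler functions are placeholders for the four pipelines:

```python
# Placeholder handlers standing in for the four generation pipelines.
def run_baseline(request):   return "A"
def run_examples(request):   return "B"
def run_semantic(request):   return "B2"
def run_components(request): return "C"

# One branch per option, mirroring the Switch node's outputs.
BRANCHES = {
    "A":  run_baseline,
    "B":  run_examples,
    "B2": run_semantic,
    "C":  run_components,
}

def route(request):
    approach = request["approach"]  # set by the radio button in the UI
    return BRANCHES[approach](request)

print(route({"approach": "B2"}))
```

Keeping the branch key in the request payload means every generation can be tagged with the approach that produced it, which is what makes the later comparison possible.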

What I Measured

Logs are written to the generation_log table. I track:

  • Number of unique templates in a presentation
  • Category diversity (slide type coverage)
  • Theme alternation (dark/light backgrounds)
  • Generation time
  • Approach used

5 test generations per option, 20 runs in total. Enough to see the pattern.
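The three diversity metrics can be computed directly from the log rows. A sketch, assuming a simplified row shape for generation_log (the real table layout may differ):

```python
# Hypothetical rows from generation_log: (run_id, approach, template, category, theme).
log = [
    (1, "B2", "kpi_3_metrics",      "kpi",     "dark"),
    (1, "B2", "roadmap_horizontal", "roadmap", "light"),
    (1, "B2", "kpi_3_metrics",      "kpi",     "dark"),
]

def run_metrics(rows):
    templates = [r[2] for r in rows]
    categories = {r[3] for r in rows}
    themes = [r[4] for r in rows]
    # Count dark/light switches between consecutive slides.
    alternations = sum(1 for a, b in zip(themes, themes[1:]) if a != b)
    return {
        "unique_templates": len(set(templates)),
        "category_coverage": len(categories),
        "theme_alternations": alternations,
    }

print(run_metrics(log))
```

With metrics this cheap to compute, 20 runs is enough to separate a 4-5 unique-template baseline from a 9-11 unique-template contender.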

What I Learned

Semantic search (B2) gave the best template diversity. Qdrant doesn’t “stick” to popular layouts, averaging 9-11 unique templates per 12-slide presentation vs. 4-5 for the baseline. But it requires precise descriptions: if a semantic_description is poorly written, the nearest vector will be wrong.

Examples (B) work better than the baseline, but the model starts copying content from the examples instead of generating new material. I had to explicitly add to the prompt: “use the structure, not the content.”

Components (C) gave maximum flexibility but also maximum assembly errors. The LLM sometimes combines incompatible blocks despite compatible_with[].

The main takeaway — parallel execution saved 4-5 days compared to sequential testing.

When NOT to Test in Parallel

If you have 2 options and one is obviously simpler — start with the simple one. Parallel testing is justified when:

  • There are 3 or more options
  • None can be dismissed without an experiment
  • Data preparation for each takes hours, not minutes
  • You have the ability to delegate (agents, team)

Can you verify a hypothesis in 20 minutes? Then don’t build infrastructure for parallel execution.

FAQ

Why didn’t you add Function Calling (tools) as a separate option?

That was option D. But it requires tools support in the API: claude -p doesn’t support tools directly, so I would have needed OpenRouter. That would have added a variable that interfered with a clean comparison.

How much did 20 test generations cost?

About 3-4 dollars through OpenRouter on a Claude model. Data preparation (agent tokens) cost another 8-10 dollars. Total: under 15 dollars for a full validation of 4 architectural approaches.

Which approach did you choose?

A hybrid of B2 plus elements of C: semantic search for template selection, component constraints for compatibility validation. The pure component approach is too fragile, and pure semantic search doesn’t control assembly quality.
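The control flow of that hybrid fits in a few lines. In this sketch, both helpers are stand-ins: semantic_candidates replaces the Qdrant search, passes_constraints replaces the component-level check, and the template names and limits are invented for illustration:

```python
def semantic_candidates(query, k=3):
    # Stand-in for the Qdrant search; returns template ids ranked by similarity.
    return ["kpi_3_metrics", "bullet_list", "cover_dark"][:k]

def passes_constraints(template, slide):
    # Stand-in for the component constraint check (e.g. max_items per template).
    limits = {"kpi_3_metrics": 3, "bullet_list": 6, "cover_dark": 1}
    return len(slide["items"]) <= limits.get(template, 99)

def pick(query, slide):
    # Semantic search proposes; constraints dispose.
    for template in semantic_candidates(query):
        if passes_constraints(template, slide):
            return template
    return None

# Four items exceed the KPI limit, so the constraint check falls through
# to the next semantic candidate.
print(pick("quarterly KPI results", {"items": ["q1", "q2", "q3", "q4"]}))
```

The key property is the ordering: the vector search keeps diversity high, while the constraint pass vetoes any candidate the content doesn’t fit.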

Can you test like this without Claude Code?

Yes, with any tool that allows running independent work streams. The point isn’t the specific tool, but the decomposition: each option is an isolated task with clear input and output.