Experiment 06 — Long dialog retention | Viable Prompt Protocol (VPP)

Overview

Experiment 06 examines whether VPP can maintain its structural commitments (tag headers + footers) over longer dialogs, and how this compares to a baseline that sees tags in user messages but has never been given the VPP spec.

Where earlier experiments focused on short, 2–4 turn dialogs, Experiment 06 asks:

“If we run the conversation out to ~10+ tagged turns, does VPP still hold? And what does the baseline do when faced with tags alone?”

Experiment 06 lives in experiments/exp6-long-dialog/.

Directory contents

Under experiments/exp6-long-dialog/:

run-exp6-long-dialog.mjs
Runner that orchestrates long, multi-tag dialogs.
analyze-exp6.mjs
Script that computes long-dialog structural metrics.
configs.jsonl
Configuration for:
- vpp_longdialog_grounded
- vpp_longdialog_tags_only
- baseline_longdialog_tags
prompts/ (optional)
- dialog-templates/ — sequences of tagged user turns simulating a realistic workflow.

Outputs go into corpus/v1.4/sessions/ and corpus/v1.4/index.jsonl.

Hypothesis

H₆: Over multi-turn dialogs:

VPP will maintain near-perfect structural adherence (headers, footers, tag mirroring) across all turns, whether or not the first turn explicitly re-explains the protocol.
Baseline (no VPP header snippet in the system message) will not spontaneously discover the protocol, even when user messages contain tags.

Concretely:

header_present_ratio(VPP) ≈ 1.0
footer_present_ratio(VPP) ≈ 1.0
tag_mirrors_user_ratio(VPP) ≈ 1.0
Baseline ratios remain low, with any headers being incidental or malformed.

Conditions

Each line in configs.jsonl specifies:

condition — one of:
- vpp_longdialog_grounded
- vpp_longdialog_tags_only
- baseline_longdialog_tags
model, temperature, top_p, seed
dialog_template_id — which scripted long dialog to run.
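As a rough illustration of the config format, the sketch below parses JSONL text and checks that each entry carries the fields listed above. The example lines, the model name, the parameter values, the template id, and the helper name parseConfigs are all placeholders made up for this sketch; they are not taken from the real configs.jsonl or from run-exp6-long-dialog.mjs.

```javascript
// Hypothetical configs.jsonl content. Every value here is a placeholder,
// not one of the experiment's real settings.
const exampleJsonl = [
  '{"condition":"vpp_longdialog_grounded","model":"example-model","temperature":0.2,"top_p":1,"seed":1,"dialog_template_id":"example-template"}',
  '{"condition":"baseline_longdialog_tags","model":"example-model","temperature":0.2,"top_p":1,"seed":1,"dialog_template_id":"example-template"}',
].join("\n");

// The three condition names documented for Experiment 06.
const CONDITIONS = new Set([
  "vpp_longdialog_grounded",
  "vpp_longdialog_tags_only",
  "baseline_longdialog_tags",
]);

// The per-line fields documented for configs.jsonl.
const REQUIRED_FIELDS = [
  "condition", "model", "temperature", "top_p", "seed", "dialog_template_id",
];

// Parse JSONL text into config objects, skipping blank lines and
// rejecting entries with missing fields or unknown conditions.
function parseConfigs(jsonlText) {
  return jsonlText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line, i) => {
      const config = JSON.parse(line);
      for (const field of REQUIRED_FIELDS) {
        if (!(field in config)) {
          throw new Error(`config ${i + 1}: missing "${field}"`);
        }
      }
      if (!CONDITIONS.has(config.condition)) {
        throw new Error(`config ${i + 1}: unknown condition "${config.condition}"`);
      }
      return config;
    });
}

const configs = parseConfigs(exampleJsonl);
console.log(configs.map((c) => c.condition));
```

Validating the condition name up front means a typo in configs.jsonl fails before any session is run, rather than silently producing an unlabeled condition in the corpus.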

Semantics:

vpp_longdialog_grounded
- System message includes the VPP header snippet.
- First user turn is a !<g> grounding message that restates the protocol and obligations.
- Subsequent turns use tags normally (!<q>, !<o>, !<c>, !<o_f>).
vpp_longdialog_tags_only
- System message includes the VPP header snippet.
- No explicit grounding; user turns simply begin using tags.
- Tests whether tags + header are sufficient without a “protocol explainer” turn.
baseline_longdialog_tags
- No VPP header snippet.
- User turns still contain tags (same dialog templates as the VPP conditions).
- Tests whether tags alone induce any stable VPP-like behavior.
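The three conditions differ only in what the model sees before and around the scripted turns. A minimal sketch of that difference follows, assuming the grounding turn is prepended to the template turns and that the baseline sends no system message at all. Both are assumptions of this sketch, as are the function name buildDialogInputs and the placeholder strings; the actual runner may assemble its messages differently.

```javascript
// Placeholders: the real VPP header snippet and grounding text are not reproduced here.
const VPP_HEADER_SNIPPET = "[VPP header snippet goes here]";
const GROUNDING_TURN = "!<g>\n[restatement of the protocol and obligations]";

// Build the system messages and the ordered user turns for one condition.
// templateTurns is the list of tagged user turns from a dialog template.
function buildDialogInputs(condition, templateTurns) {
  const usesVppHeader =
    condition === "vpp_longdialog_grounded" ||
    condition === "vpp_longdialog_tags_only";

  // Assumption: the baseline has no system message rather than a generic one.
  const systemMessages = usesVppHeader
    ? [{ role: "system", content: VPP_HEADER_SNIPPET }]
    : [];

  // Assumption: only the grounded condition gets an extra first !<g> turn.
  const userTurns =
    condition === "vpp_longdialog_grounded"
      ? [GROUNDING_TURN, ...templateTurns]
      : [...templateTurns];

  return { systemMessages, userTurns };
}

// Hypothetical two-turn template, far shorter than a real long dialog.
const templateTurns = ["!<q>\nWhat should we build?", "!<o>\nDraft the plan."];
for (const condition of [
  "vpp_longdialog_grounded",
  "vpp_longdialog_tags_only",
  "baseline_longdialog_tags",
]) {
  const { systemMessages, userTurns } = buildDialogInputs(condition, templateTurns);
  console.log(condition, systemMessages.length, userTurns.length);
}
```

Keeping the template turns identical across conditions is what lets the analysis attribute any difference in structure to the header snippet and the grounding turn.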

Metrics

analyze-exp6.mjs reports:

sessions analyzed — number of valid long-dialog sessions per condition.
header_present_ratio — fraction of assistant turns that start with a valid VPP-style tag header (e.g., <g>, <q>, <o>, <c>, <o_f>).
footer_present_ratio — fraction of assistant turns that end with a footer line.
footer_v14_ratio — fraction of assistant turns whose footer parses as a valid VPP v1.4 footer.
tag_mirrors_user_ratio — fraction of assistant turns where the assistant’s tag matches the user’s tag for that turn.
first_structural_failure_turn — mean index (across sessions) of the first structural failure; "none" if no structural failures were observed.
task_coverage_ok — fraction of sessions that appear to answer all requested sub-tasks in the long dialog template.
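To make the structural metrics concrete, here is a minimal per-session sketch. It is not the code in analyze-exp6.mjs. It assumes that a header is a tag at the very start of the assistant text, that the footer is the last non-empty line, that a structural failure means a missing header or footer, and that turns are counted from 1. Footer recognition is passed in as a predicate because the v1.4 footer format is not described on this page; the "[footer]" marker in the example is a made-up stand-in, and footer_v14_ratio and task_coverage_ok are left out for the same reason.

```javascript
const TAGS = ["g", "q", "o", "c", "o_f"];
const USER_TAG_RE = new RegExp(`^!<(${TAGS.join("|")})>`);
const ASSISTANT_TAG_RE = new RegExp(`^<(${TAGS.join("|")})>`);

// Return the tag name at the start of the text, or null if there is none.
function leadingTag(text, re) {
  const match = re.exec(text.trimStart());
  return match ? match[1] : null;
}

// turns: [{ user: string, assistant: string }, ...]
// isFooterLine: caller-supplied predicate applied to the last line of each assistant turn.
function sessionMetrics(turns, isFooterLine) {
  if (turns.length === 0) throw new Error("session has no turns");
  let headers = 0;
  let footers = 0;
  let mirrors = 0;
  let firstFailure = null;

  turns.forEach((turn, i) => {
    const userTag = leadingTag(turn.user, USER_TAG_RE);
    const assistantTag = leadingTag(turn.assistant, ASSISTANT_TAG_RE);
    const lines = turn.assistant.trimEnd().split("\n");
    const hasHeader = assistantTag !== null;
    const hasFooter = isFooterLine(lines[lines.length - 1]);

    if (hasHeader) headers += 1;
    if (hasFooter) footers += 1;
    if (hasHeader && assistantTag === userTag) mirrors += 1;
    // Assumption: 1-based turn index; failure = missing header or footer.
    if (firstFailure === null && !(hasHeader && hasFooter)) firstFailure = i + 1;
  });

  const n = turns.length;
  return {
    header_present_ratio: headers / n,
    footer_present_ratio: footers / n,
    tag_mirrors_user_ratio: mirrors / n,
    first_structural_failure_turn: firstFailure === null ? "none" : firstFailure,
  };
}

// Hypothetical session: the first turn is structured, the second is not.
const looksLikeFooter = (line) => line.startsWith("[footer]");
const session = [
  { user: "!<q>\nWhat is X?", assistant: "<q>\nAn answer.\n[footer] placeholder" },
  { user: "!<o>\nDraft it.", assistant: "Here is a draft." },
];
const metrics = sessionMetrics(session, looksLikeFooter);
console.log(metrics);
```

In this toy session each ratio comes out to 0.5 and the first failure is at turn 2. Averaging first_structural_failure_turn across sessions, as the report does, would be a further step on top of this per-session result.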

Results (current run)

From an initial run (3 sessions per condition):

VPP grounded (vpp_longdialog_grounded)
- sessions analyzed: 3
- header_present_ratio: 1.00
- footer_present_ratio: 1.00
- footer_v14_ratio: 1.00
- tag_mirrors_user_ratio: 1.00
- first_structural_failure_turn: none
- task_coverage_ok: 100.0%
VPP tags-only (vpp_longdialog_tags_only)
- sessions analyzed: 3
- header_present_ratio: 1.00
- footer_present_ratio: 1.00
- footer_v14_ratio: 1.00
- tag_mirrors_user_ratio: 1.00
- first_structural_failure_turn: none
- task_coverage_ok: 100.0%
Baseline with tags (baseline_longdialog_tags)
- sessions analyzed: 3
- header_present_ratio: 0.80
- footer_present_ratio: 0.80
- footer_v14_ratio: 0.00
- tag_mirrors_user_ratio: 0.00
- first_structural_failure_turn: 1.00
- task_coverage_ok: 0.0%

Interpretation and limitations

These early results suggest:

Under VPP (with or without an explicit grounding turn), every assistant turn in the 3 sessions per condition complied with the protocol:
- Every assistant turn has a tag header and v1.4 footer.
- Tags reliably mirror the user’s tags.
- The model stays on task across the entire scripted dialog.
Under baseline, tags appearing in user messages do not induce VPP behavior:
- Headers/footers appear only partially (0.80 ratios), likely reflecting incidental formatting rather than protocol adoption.
- No valid v1.4 footer is ever produced.
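- Tag mirroring is 0.00: no assistant turn was counted as mirroring the user's tag.
- Task coverage is 0.0%: no baseline session was judged to answer all requested sub-tasks in the long-dialog template.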

Limitations

Dialog lengths are modest (on the order of 10 turns); much longer sessions might reveal different behavior.
Only a single model and a single long-dialog template family were used in this first run.
Baseline behavior might change if the system prompt explicitly mentions tags but not the full protocol; this condition is not tested here.

Notes

Exp6 provides a long-horizon structural sanity check for VPP-aware models. If future updates introduce structural drift, this experiment should catch it.
For more realism, you can replace the scripted dialogs with logs from real users (e.g., from Exp7) and reuse the same analysis script.
