Appearance
Overview
Experiment 06 examines whether VPP can maintain its structural commitments (tag headers + footers) over longer dialogs, and how this compares to a baseline that sees tags in user messages but has never been given the VPP spec.
Where earlier experiments focused on short, 2–4 turn dialogs, Experiment 06 asks:
“If we run the conversation out to ~10+ tagged turns, does VPP still hold? And what does the baseline do when faced with tags alone?”
Experiment 06 lives inexperiments/exp6-long-dialog/.
Directory contents
Under experiments/exp6-long-dialog/:
run-exp6-long-dialog.mjs
Runner that orchestrates long, multi-tag dialogs.analyze-exp6.mjs
Script that computes long-dialog structural metrics.configs.jsonl
Configuration for:vpp_longdialog_groundedvpp_longdialog_tags_onlybaseline_longdialog_tags
prompts/(optional)dialog-templates/— sequences of tagged user turns simulating a realistic workflow.
Outputs go into corpus/v1.4/sessions/ and corpus/v1.4/index.jsonl.
Hypothesis
H₆: Over multi-turn dialogs:
- VPP will maintain near-perfect structural adherence (headers, footers, tag mirroring) across all turns, whether or not the first turn explicitly re-explains the protocol.
- Baseline (no VPP header) will not spontaneously discover the protocol, even when user messages contain tags.
Concretely:
header_present_ratio(VPP)≈ 1.0footer_present_ratio(VPP)≈ 1.0tag_mirrors_user_ratio(VPP)≈ 1.0- Baseline ratios remain low, with any headers being incidental or malformed.
Conditions
Each line in configs.jsonl specifies:
condition— one of:vpp_longdialog_groundedvpp_longdialog_tags_onlybaseline_longdialog_tags
model,temperature,top_p,seeddialog_template_id— which scripted long dialog to run.
Semantics:
vpp_longdialog_grounded- System message includes the VPP header snippet.
- First user turn is a
!<g>grounding message that restates the protocol and obligations. - Subsequent turns use tags normally (
!<q>,!<o>,!<c>,!<o_f>).
vpp_longdialog_tags_only- System message includes the VPP header snippet.
- No explicit grounding; user turns simply begin using tags.
- Tests whether tags + header are sufficient without a “protocol explainer” turn.
baseline_longdialog_tags- No VPP header snippet.
- User turns still contain tags (same dialog templates as the VPP conditions).
- Tests whether tags alone induce any stable VPP-like behavior.
Metrics
analyze-exp6.mjs reports:
sessions analyzed— number of valid long-dialog sessions per condition.header_present_ratio— fraction of assistant turns that start with a valid VPP-style tag header (e.g.,<g>,<q>,<o>,<c>,<o_f>).footer_present_ratio— fraction of assistant turns that end with a footer line.footer_v14_ratio— fraction of assistant turns whose footer parses as a valid VPP v1.4 footer.tag_mirrors_user_ratio— fraction of assistant turns where the assistant’s tag matches the user’s tag for that turn.first_structural_failure_turn— mean index (across sessions) of the first structural failure;"none"if no structural failures were observed.task_coverage_ok— fraction of sessions that appear to answer all requested sub-tasks in the long dialog template.
Results (current run)
From an initial run (3 sessions per condition):
VPP grounded (
vpp_longdialog_grounded)sessions analyzed: 3header_present_ratio: 1.00footer_present_ratio: 1.00footer_v14_ratio: 1.00tag_mirrors_user_ratio: 1.00first_structural_failure_turn: nonetask_coverage_ok: 100.0%
VPP tags-only (
vpp_longdialog_tags_only)sessions analyzed: 3header_present_ratio: 1.00footer_present_ratio: 1.00footer_v14_ratio: 1.00tag_mirrors_user_ratio: 1.00first_structural_failure_turn: nonetask_coverage_ok: 100.0%
Baseline with tags (
baseline_longdialog_tags)sessions analyzed: 3header_present_ratio: 0.80footer_present_ratio: 0.80footer_v14_ratio: 0.00tag_mirrors_user_ratio: 0.00first_structural_failure_turn: 1.00task_coverage_ok: 0.0%
Interpretation and limitations
These early results suggest:
Under VPP (with or without an explicit grounding turn), long dialogs remain fully compliant with the protocol:
- Every assistant turn has a tag header and v1.4 footer.
- Tags reliably mirror the user’s tags.
- The model stays on task across the entire scripted dialog.
Under baseline, tags appearing in user messages do not induce VPP behavior:
- Headers/footers appear only partially (0.80 ratios), likely reflecting incidental formatting rather than protocol adoption.
- No valid v1.4 footer is ever produced.
- Tag mirroring is effectively 0.0.
- Task coverage in the long-dialog templates is poor.
Limitations
- Dialog lengths are modest (on the order of ~10 turns); much longer sessions might reveal different behavior.
- Only a single model and a single long-dialog template family were used in this first run.
- Baseline behavior might change if the system prompt explicitly mentions tags but not the full protocol; this condition is not tested here.
Notes
- Exp6 provides a long-horizon structural sanity check for VPP-aware models. If future updates introduce structural drift, this experiment should catch it.
- For more realism, you can replace the scripted dialogs with logs from real users (e.g., from Exp7) and reuse the same analysis script.