Validation / Final Validation.docx

This page is a uniformly styled HTML rendering of the original Word file. The original Word file is retained in the data folder for verification.

Question:

I have revised the experiments again. Please review the annotated outputs once more and compare them for any discrepancies. Note that in Example 2, "What is ordered in the diagram?" is part of GPT's answer, not a new prompt.

Review criteria

Summary

No remaining substantive discrepancies were found in the latest revised experiments. The assigned semantic tasks now appear aligned across all three LLMs.

Example  Discrepancy?  Finding
1        No            Tasks now match substantively across all three models. Gemini’s earlier extra combine-style task appears resolved.
2        No            Tasks now match. All three use the same ordering questions, abstraction task, merge task for i2 and i3, trace task, and LTS task.
3        No            Tasks match across all three.
4        No            Tasks match. The previous GPT repeated-question issue appears resolved.
5        No            Tasks match across all three.
6        No            Tasks match across all three.
7        No            Core HMSC task matches. Extra “draw the two MSCs” follow-ups for GPT/Qwen were ignored.
8        No            Core HMSC task matches. GPT’s redraw request is drawing-related and was ignored.
9        No            Tasks match across all three.

Details

Example 2 is now aligned

The task sequence is consistent across Gemini, GPT, and Qwen:

  1. identify the events in the MSC;

  2. ask about the ordering between sending m1 and sending m2;

  3. ask about the ordering between receiving m1 and receiving m2;

  4. abstract away message m1;

  5. merge instances i2 and i3;

  6. produce the trace set;

  7. produce a labelled transition system.
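The consistency check on the task sequence above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the review: the task labels are hypothetical shorthand for the seven steps listed, the function name `find_discrepancies` is invented here, and the sketch assumes each model's prompts have already been normalised by hand into comparable labels.

```python
from typing import Dict, List

# Hypothetical shorthand labels for the seven Example 2 tasks listed above.
# The real prompts are worded per model, so manual normalisation is assumed.
EXAMPLE_2_TASKS: List[str] = [
    "identify events",
    "order send m1 / send m2",
    "order receive m1 / receive m2",
    "abstract away m1",
    "merge i2 and i3",
    "trace set",
    "labelled transition system",
]


def find_discrepancies(tasks_by_model: Dict[str, List[str]]) -> List[str]:
    """Compare every model's task sequence against the first model's sequence."""
    if not tasks_by_model:
        return []
    models = list(tasks_by_model)
    reference_name = models[0]
    reference = tasks_by_model[reference_name]
    problems: List[str] = []
    for name in models[1:]:
        tasks = tasks_by_model[name]
        # A differing task count is reported once per model.
        if len(tasks) != len(reference):
            problems.append(
                f"{name}: {len(tasks)} tasks, {reference_name} has {len(reference)}"
            )
        # Position-by-position comparison over the shared prefix.
        for step, (expected, actual) in enumerate(zip(reference, tasks), start=1):
            if expected != actual:
                problems.append(
                    f"{name}: step {step} is {actual!r}, "
                    f"{reference_name} has {expected!r}"
                )
    return problems


# All three models are given the same sequence, mirroring the finding above.
tasks_by_model = {m: list(EXAMPLE_2_TASKS) for m in ("Gemini", "GPT", "Qwen")}
print(find_discrepancies(tasks_by_model))  # prints []
```

An empty result corresponds to a "No" in the Discrepancy? column. An extra task for one model, such as the earlier combine-style task noted for Example 1, would show up as a task-count mismatch.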

Note: The line “What is ordered in the diagram?” appears inside GPT’s answer text and does not affect the prompt comparison.

Minor non-discrepancies observed

Conclusion

Bottom line: I found no remaining substantive discrepancies in the tasks assigned to Gemini, GPT, and Qwen in the latest revised experiments.