Validation / Second Validation Round.docx
Prompt:
I have now re-done some of the experiments to address some of the discrepancies identified before. In the attached zip file, I am including the annotated results of the new experiments. Please check if there is any discrepancy in the tasks assigned to the different LLMs; the tasks assigned to all three LLMs are supposed to be identical. Please note that asking the LLMs to draw the MSC should not be considered a discrepancy as sometimes LLMs do not produce the requested MSC.
Answer:
Summary
I reviewed the new annotated zip and compared the task prompts across Gemini, GPT, and Qwen for all 9 examples. Requests to draw or redraw MSCs are not treated as discrepancies, because some LLMs do not produce the requested MSC and may receive follow-up drawing prompts.
Most of the earlier discrepancies appear to have been fixed. The remaining discrepancies are in Example 2 (substantive) and Example 9 (minor).
| Example | Discrepancy? | Notes |
|---|---|---|
| 1 | No substantive discrepancy | The ordering question is phrased in reverse order for Gemini - "receipt of m3 and sending m1" vs "sending m1 and receiving m3" - but it compares the same two events. This should not be treated as a substantive discrepancy. |
| 2 | Yes | Several task mismatches remain. Gemini, GPT, and Qwen are not assigned the same sequence of tasks. |
| 3 | No | Core tasks match across all three. Extra MSC drawing requests were ignored. |
| 4 | No | Core horizontal/vertical composition and ordering tasks match. Drawing prompts ignored. |
| 5 | No | Core tasks match. Drawing prompts ignored. |
| 6 | No | Tasks match across all three. |
| 7 | No | Core HMSC task matches. Extra "draw the two MSCs" follow-up for GPT/Qwen ignored. |
| 8 | No | Core HMSC task matches. GPT’s "redraw the second MSC" follow-up ignored as a drawing-related request. |
| 9 | Yes | One ordering question differs: Gemini asks about "sending Success and sending Deliver," while GPT/Qwen ask "sending Success and sending receiving Deliver." |
Detailed remaining discrepancies
Example 2 - substantive discrepancies remain
The tasks are still not aligned.
Gemini asks:
What is the correct ordering between the receipt of m2 and receiving m1?
What is the correct ordering between sending m2 and sending m1?
Based on the diagram above, generate a message sequence chart in which message m1 is abstracted away.
Produce the list of traces of the original message sequence chart given below.
Draw a labelled transition system...
GPT asks:
What is the correct ordering between sending m1 and sending m2?
What is ordered in the diagram?
What is the correct ordering between receiving m1 and sending receiving m2?
Based on the diagram above, generate a message sequence chart in which message m1 is abstracted away...
Merge instance i1 and i2...
Produce the set of all traces...
Draw a labelled transition system...
Qwen asks:
What is the correct ordering between sending m1 and sending m2?
What is the correct ordering between receiving m1 and receiving m2?
Merge instance i2 and i3...
Produce the set of all traces...
Draw a labelled transition system...
The main problems are:
GPT has an extra or ambiguous task: "What is ordered in the diagram?"
GPT has a typo or malformed question: "receiving m1 and sending receiving m2."
The merge task differs: GPT asks to merge i1 and i2; Qwen asks to merge i2 and i3; Gemini does not show the corresponding merge task in the annotated file.
The abstraction task appears for Gemini and GPT but not Qwen, at least not in the extracted prompt sequence.
Conclusion for Example 2: it is still not comparable as an identical-task experiment.
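The comparison behind this conclusion can be sketched as a set difference over the quoted prompts. This is only an illustrative sketch, not the procedure actually used for the review: the task labels are hypothetical shorthand for the prompts listed above, Gemini's "list of traces" prompt is assumed to be the same task as the "set of all traces" prompt, and GPT's malformed receive question is kept under its own label because its intent is unclear.

```python
# Hypothetical shorthand labels for the Example 2 prompts quoted above.
# Requests to draw or redraw an MSC are not listed, as they are not counted.
tasks = {
    "Gemini": [
        "order(recv m1, recv m2)",
        "order(send m1, send m2)",
        "abstract(m1)",
        "traces",
        "lts",
    ],
    "GPT": [
        "order(send m1, send m2)",
        "what is ordered",
        "order(recv m1, send recv m2)",  # malformed wording, kept as written
        "abstract(m1)",
        "merge(i1, i2)",
        "traces",
        "lts",
    ],
    "Qwen": [
        "order(send m1, send m2)",
        "order(recv m1, recv m2)",
        "merge(i2, i3)",
        "traces",
        "lts",
    ],
}

# Tasks that every model received.
common = set.intersection(*(set(t) for t in tasks.values()))

# Tasks that at least one other model did not receive in the same form.
unmatched = {model: sorted(set(t) - common) for model, t in tasks.items()}

for model, extra in unmatched.items():
    print(model, extra)
```

Under these assumptions only the send-ordering question, the traces task, and the labelled transition system task are shared by all three models; every model has at least two tasks outside that shared set.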
Example 9 - minor but real discrepancy
Gemini asks:
What is the correct ordering between sending Success and sending Deliver?
GPT and Qwen ask:
What is the correct ordering between sending Success and sending receiving Deliver?
This appears to be a wording error in GPT/Qwen. It may have been intended to mean either "sending Success and receiving Deliver" or "sending Success and sending Deliver." As written, it is not identical to Gemini’s task. Example 9 should therefore be flagged as having a minor but real prompt discrepancy.
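The difference between a reworded question and a different question can be made explicit by comparing ordering questions as unordered pairs of events. The sketch below is a minimal illustration under stated assumptions: the function names and the normalisation rule ("receipt of" is read as "receiving") are hypothetical, and the prompts are the ones quoted in this report for Example 2 and Example 9.

```python
import re

# Matches "... ordering between <event A> and <event B>?" (assumed prompt shape).
ORDERING = re.compile(r"ordering between (?:the )?(.+?) and (.+?)\??$")


def normalise_event(event: str) -> str:
    """Map wording variants such as 'receipt of m2' to 'receiving m2'."""
    event = event.strip().lower()
    return re.sub(r"^(?:the )?receipt of ", "receiving ", event)


def ordering_pair(prompt: str) -> frozenset:
    """Return the unordered pair of events that an ordering question compares."""
    match = ORDERING.search(prompt.strip())
    if match is None:
        raise ValueError(f"not an ordering question: {prompt!r}")
    return frozenset(normalise_event(e) for e in match.groups())


# Example 2: same two events, different wording and order.
gemini_ex2 = "What is the correct ordering between the receipt of m2 and receiving m1?"
qwen_ex2 = "What is the correct ordering between receiving m1 and receiving m2?"
reworded_only = ordering_pair(gemini_ex2) == ordering_pair(qwen_ex2)

# Example 9: the second event differs as written.
gemini_ex9 = "What is the correct ordering between sending Success and sending Deliver?"
gpt_ex9 = "What is the correct ordering between sending Success and sending receiving Deliver?"
same_task_ex9 = ordering_pair(gemini_ex9) == ordering_pair(gpt_ex9)

print(reworded_only, same_task_ex9)
```

With this reading, a reversed or reworded question compares equal, while "sending receiving Deliver" does not match "sending Deliver". The sketch deliberately does not guess which event the malformed wording was meant to name.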
Note on diagrams
The first embedded source images are consistent across the three models for most examples. In Example 2, GPT’s first image has a different size/hash from Gemini/Qwen, but visually it appears to be the same MSC, just scaled or cropped differently. This should not be treated as a substantive discrepancy unless exact image-file identity is required.
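If exact image-file identity is required, it can be checked with a cryptographic digest of the file bytes. The following is a minimal sketch; SHA-256 and the function names are assumptions for illustration, and the report does not state which hash was used. A digest only tests byte-level identity, so a rescaled or cropped copy of the same MSC will produce a different digest even when it looks the same.

```python
import hashlib
from pathlib import Path


def digest_bytes(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw file contents."""
    return hashlib.sha256(data).hexdigest()


def digest_file(path: str) -> str:
    """Return the SHA-256 hex digest of the file at a given path."""
    return digest_bytes(Path(path).read_bytes())


def identical_files(a: bytes, b: bytes) -> bool:
    """True only when the two files are byte-for-byte identical."""
    return digest_bytes(a) == digest_bytes(b)
```

Visual equivalence is a separate question that a digest cannot answer; it needs either manual inspection, as was done here, or an image-level comparison.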
Overall conclusion
After excluding drawing/redrawing requests, Examples 1 and 3-8 appear comparable as identical-task experiments. Examples 2 and 9 should still be treated as having task discrepancies, with Example 2 being the more serious issue.