Validation / First Validation Round.docx

This page is a uniformly styled HTML rendering of the original Word file. The original Word file is retained in the data folder for verification.

Prompt:

I have performed a series of 9 experiments on how much three different LLMs (Gemini, GPT, and Qwen) understand the semantics of MSCs. In the attached zip file, I am including the annotated results of the experiments. Please check if there is any discrepancy in the tasks assigned to the different LLMs; the tasks assigned to all three LLMs are supposed to be identical.

Response:

I checked the 27 DOCX files in the zip (9 examples × 3 models), grouped by model and example. The tasks are not identical across Gemini, GPT, and Qwen; the discrepancies are listed by example below.
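A check of this kind can be sketched as follows. This is only an illustration, assuming the prompts have already been extracted from each DOCX file into a mapping keyed by (model, example); the function names, the normalization rule, and the abbreviated prompt strings are hypothetical placeholders, not the actual extraction procedure or the actual prompts.

```python
import re
from collections import defaultdict


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def compare_tasks(prompts_by_run):
    """Map each example number to True if every model received the same prompt sequence."""
    by_example = defaultdict(dict)
    for (model, example), prompts in prompts_by_run.items():
        by_example[example][model] = tuple(normalize(p) for p in prompts)
    return {
        example: len(set(runs.values())) == 1
        for example, runs in sorted(by_example.items())
    }


# Hypothetical, abbreviated input (placeholder strings, not the real prompts).
prompts_by_run = {
    ("Gemini", 5): ["Identify the events.", "Merge i1 and i2."],
    ("GPT", 5): ["Identify the events.", "Merge i1  and i2."],  # extra space only
    ("Qwen", 5): ["Identify the events.", "Merge i1 and i2."],
    ("Gemini", 2): ["Identify the events."],
    ("GPT", 2): ["Identify the events.", "Merge i1 and i2."],
    ("Qwen", 2): ["Identify the events.", "Merge i2 and i3."],
}

result = compare_tasks(prompts_by_run)
print(result)
```

An exact comparison after normalization only flags that a difference exists; deciding whether it is substantive (a different event reference) or cosmetic (a drawing-request rewording) still requires reading the prompts.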

Detailed findings by example

Example	Identical across all three?	Discrepancy found
1	No	The event-ordering questions differ substantially. Gemini asks about “receipt of m3 vs sending m1” and “sending m0 vs sending m4”; GPT asks “sending m1 vs sending m4”; Qwen asks “receiving m1 vs sending m2” and “sending m1 vs sending m3.” GPT/Qwen also have an explicit merge task for i2 and i3, while Gemini’s extracted prompt sequence does not show the same explicit merge task.
2	No	Ordering questions differ in direction/wording: Gemini asks “receipt of m2 vs receiving m1” and “sending m2 vs sending m1”; GPT/Qwen ask “sending m1 vs sending m2” and “receiving m1 vs receiving m2.” The merge task also differs: GPT says merge i1 and i2, while Qwen says merge i2 and i3; Gemini does not show the same explicit merge task.
3	Mostly yes	The core task sequence appears consistent: identify events, order receiving m1 vs sending m4, abstract m4, ask order after abstraction, merge i1 and i2, produce traces, draw LTS. Minor follow-up wording differs, such as image-drawing requests.
4	Mostly yes	Core tasks match. GPT appears to repeat the same horizontal-composition ordering question once. Drawing-request phrasing varies, but the assigned semantic tasks are the same.
5	Yes	No substantive task discrepancy found.
6	Yes	No substantive task discrepancy found.
7	Mostly yes	Initial HMSC task is the same. GPT and Qwen include an extra follow-up “Please draw the two MSCs”; Gemini does not show it as a separate prompt, though the initial task already asks to generate two MSCs.
8	No / partial	Initial HMSC task is the same, but GPT has an extra follow-up task: “Please redraw the second MSC (MSC 2).” Gemini and Qwen do not show this extra task.
9	No / minor but real	Most tasks match, but the second ordering question differs: Gemini asks “between sending Success and sending Deliver”; GPT and Qwen ask “between sending Success and sending receiving Deliver,” which is either a typo or a different event reference.

Additional observation

I also spot-checked the embedded source diagrams. The source image (the first image in each file) is visually the same across the three models for most examples. Example 2’s GPT source image has different dimensions and a different hash, but it appears visually to be the same MSC scaled differently. The main discrepancies are therefore in the textual prompts/tasks, not necessarily in the diagrams.
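The hash comparison mentioned above can be illustrated with a short sketch. It assumes the embedded images have already been read as raw bytes; the byte strings below are stand-ins, not the real image data, and the helper names are hypothetical. A byte-level hash changes whenever an image is rescaled or re-encoded, so a differing hash indicates a different file, not necessarily a different diagram.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw image bytes."""
    return hashlib.sha256(data).hexdigest()


def group_by_digest(images):
    """Group model names whose source images are byte-for-byte identical."""
    groups = {}
    for name, data in images.items():
        groups.setdefault(digest(data), []).append(name)
    return groups


# Stand-in byte strings (assumption): GPT's copy differs, as a rescaled image would.
images = {
    "Gemini": b"original-bytes",
    "GPT": b"rescaled-bytes",
    "Qwen": b"original-bytes",
}

groups = group_by_digest(images)
clusters = sorted(sorted(names) for names in groups.values())
print(clusters)
```

Images that land in different groups still need a visual check, which is what the spot-check above relied on for Example 2.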

Bottom line

Examples 1 and 2 are clearly not comparable as identical-task experiments. Examples 8 and 9 also contain task inconsistencies. Examples 5 and 6 are fully consistent, and Examples 3, 4, and 7 are mostly consistent, with only minor follow-up or drawing-prompt differences.