Analysis / research_question_answers_summary.docx

This page is a uniformly styled HTML rendering of the original Word file. The original Word file is retained in the data folder for verification.

Question:

Please answer the following research questions based on the data in the attached annotation file:

RQ1: To what extent can LLMs identify and reason about the basic semantic elements of MSCs?
RQ2: How well do LLMs perform semantic transformations (abstraction, composition) over MSCs?
RQ3: Can LLMs reason about the behavioural semantics (traces, LTS) of MSCs?
RQ4: How does LLM performance vary with model complexity and task complexity?
RQ5: Which MSC concepts constitute recurring "pain patterns" for LLMs (i.e., those patterns that cannot be handled correctly in most cases)?

Answer:

Summary of Research Question Answers

This document summarises the answers to the five research questions using the annotation-derived accuracy tables. The interpretation is based on 129 judged tasks, balanced across three models with 43 tasks each.

RQ1: To what extent can LLMs identify and reason about the basic semantic elements of MSCs?

The models performed strongly on basic MSC semantics. Identifying events reached 91.7% aggregate accuracy and identifying event orders reached 85.7%. These were the strongest categories in the dataset.

| Basic MSC task | Correct | Total | Accuracy |
| --- | --- | --- | --- |
| Identifying events | 11 | 12 | 91.7% |
| Identifying event orders | 18 | 21 | 85.7% |
| Combined basic semantic tasks | 29 | 33 | 87.9% |

Answer: LLMs are generally competent at recognising basic MSC semantic elements. Event identification and event ordering together reached 87.9% accuracy (29 of 33 tasks).
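
The combined row in the table above matches a pooled calculation (sum the counts, then divide), not an average of the two percentages. A minimal Python sketch of that arithmetic follows. The counts are copied from the table; the helper name `accuracy` and the variable names are illustrative and do not come from the annotation file.

```python
# (correct, total) counts copied from the basic-task table.
basic_tasks = {
    "Identifying events": (11, 12),
    "Identifying event orders": (18, 21),
}


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Per-task accuracy, as reported in the table.
per_task = {name: accuracy(c, t) for name, (c, t) in basic_tasks.items()}

# Pooled accuracy: add up the counts first, then divide (29 / 33).
pooled_correct = sum(c for c, _ in basic_tasks.values())
pooled_total = sum(t for _, t in basic_tasks.values())
combined = accuracy(pooled_correct, pooled_total)

# For comparison only: averaging the two percentages weights both
# task types equally, although they have different numbers of tasks.
unweighted_mean = round(sum(per_task.values()) / len(per_task), 1)

print(per_task)
print("pooled:", combined, "unweighted mean:", unweighted_mean)
```

The pooled figure reproduces the reported 87.9%, whereas the unweighted mean would give 88.7%. The difference is small here but grows when task types have very different totals.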

RQ2: How well do LLMs perform semantic transformations over MSCs?

Semantic transformations were substantially harder. Abstraction was the weakest task type overall, while composition showed model-dependent behaviour.

| Transformation task | Correct | Total | Accuracy |
| --- | --- | --- | --- |
| Abstraction | 6 | 30 | 20.0% |
| Composition | 20 | 42 | 47.6% |
| Combined transformation tasks | 26 | 72 | 36.1% |

| Model | Abstraction accuracy | Composition accuracy |
| --- | --- | --- |
| GPT-5.4-Thinking Deep | 20.0% | 35.7% |
| Gemini 3 - Thinking | 10.0% | 71.4% |
| Qwen 3.6 Plus - Thinking | 30.0% | 35.7% |

Answer: LLMs perform poorly on MSC semantic transformations overall (36.1%). Abstraction is a severe cross-model weakness (no model exceeded 30.0%), whereas composition is strongly model-dependent (35.7% to 71.4%).
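
The per-model table gives only percentages, so it can help to check that they agree with the aggregate counts. The sketch below does this under one labelled assumption: that each task type was split evenly across the three models (10 abstraction and 14 composition tasks per model). That split is inferred from the statement that the dataset is balanced and is not stated explicitly. The variable names are illustrative.

```python
# Reported per-model accuracies in percent, copied from the table above.
reported = {
    "GPT-5.4-Thinking Deep": {"Abstraction": 20.0, "Composition": 35.7},
    "Gemini 3 - Thinking": {"Abstraction": 10.0, "Composition": 71.4},
    "Qwen 3.6 Plus - Thinking": {"Abstraction": 30.0, "Composition": 35.7},
}

# Aggregate (correct, total) counts from the transformation table.
aggregate = {"Abstraction": (6, 30), "Composition": (20, 42)}

implied = {}
for task, (correct, total) in aggregate.items():
    # ASSUMPTION: tasks of each type are split evenly across the models.
    per_model_total = total // len(reported)
    # Convert each percentage back to the nearest whole number of tasks.
    counts = {
        model: round(acc[task] / 100 * per_model_total)
        for model, acc in reported.items()
    }
    # The implied per-model counts should add up to the aggregate count.
    assert sum(counts.values()) == correct, task
    implied[task] = counts

print(implied)
```

Under that assumption the percentages correspond to 2, 1 and 3 correct abstraction answers and 5, 10 and 5 correct composition answers, which sum to the aggregate 6 and 20.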

RQ3: Can LLMs reason about the behavioural semantics of MSCs?

Behavioural semantics were challenging. Trace reasoning was moderate, but LTS reasoning was poor across the dataset.

| Behavioural task | Correct | Total | Accuracy |
| --- | --- | --- | --- |
| Traces | 7 | 12 | 58.3% |
| LTS | 3 | 12 | 25.0% |
| Combined behavioural semantics tasks | 10 | 24 | 41.7% |

| Model | Traces accuracy | LTS accuracy |
| --- | --- | --- |
| GPT-5.4-Thinking Deep | 75.0% | 50.0% |
| Gemini 3 - Thinking | 50.0% | 0.0% |
| Qwen 3.6 Plus - Thinking | 50.0% | 25.0% |

Answer: LLMs can reason about traces to a moderate extent (58.3%), but LTS-based behavioural semantics remain a major difficulty (25.0%, with no model above 50.0%).

RQ4: How does LLM performance vary with model complexity and task complexity?

Overall model accuracy was relatively close (46.5% to 55.8%), and no model led on every task type: Gemini 3 - Thinking was best on composition, GPT-5.4-Thinking Deep on traces and LTS, and Qwen 3.6 Plus - Thinking on abstraction. Task complexity had a clearer effect: accuracy fell from 87.9% on basic recognition tasks to 41.7% on behavioural semantics and 36.1% on transformations.

| Model | Correct | Total | Overall accuracy |
| --- | --- | --- | --- |
| Gemini 3 - Thinking | 24 | 43 | 55.8% |
| Qwen 3.6 Plus - Thinking | 21 | 43 | 48.8% |
| GPT-5.4-Thinking Deep | 20 | 43 | 46.5% |

| Task group | Accuracy |
| --- | --- |
| Basic semantic tasks: events + event orders | 87.9% |
| Behavioural semantics: traces + LTS | 41.7% |
| Transformations: abstraction + composition | 36.1% |

Answer: Task complexity explains performance variation better than model choice: accuracy ranged from 36.1% to 87.9% across task groups but only from 46.5% to 55.8% across models. Basic recognition is strong, while transformations and behavioural semantics are much weaker.
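
One simple way to make this comparison concrete is to compare the spread (maximum minus minimum accuracy) across models with the spread across task groups. The sketch below uses the figures from the two tables above. The spread is a crude descriptive measure chosen here for illustration, not a statistical test, and the short group labels and the `spread` helper are illustrative names.

```python
# Overall accuracy per model (percent), copied from the model table.
model_accuracy = {
    "Gemini 3 - Thinking": 55.8,
    "Qwen 3.6 Plus - Thinking": 48.8,
    "GPT-5.4-Thinking Deep": 46.5,
}

# Accuracy per task group (percent), copied from the task-group table.
group_accuracy = {
    "Basic semantic tasks": 87.9,
    "Behavioural semantics": 41.7,
    "Transformations": 36.1,
}


def spread(values) -> float:
    """Return max minus min, in percentage points, to one decimal place."""
    values = list(values)
    return round(max(values) - min(values), 1)


model_spread = spread(model_accuracy.values())
group_spread = spread(group_accuracy.values())

print(f"spread across models:      {model_spread} points")
print(f"spread across task groups: {group_spread} points")
```

The spread across task groups (51.8 points) is more than five times the spread across models (9.3 points), which is the pattern the answer describes.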

RQ5: Which MSC concepts constitute recurring pain patterns for LLMs?

Recurring pain patterns are task types or examples where most model outputs were incorrect. Abstraction (20.0%) and LTS reasoning (25.0%) are the clearest recurring failures. Difficult composition examples also caused repeated failures, particularly Example 8, which all models failed.

| Pain pattern | Accuracy | Interpretation |
| --- | --- | --- |
| Abstraction | 20.0% | Severe recurring failure across all models |
| LTS | 25.0% | Severe difficulty with state-transition behavioural semantics |
| Composition | 47.6% | Just over half incorrect overall (22 of 42); strongly model-dependent |
| Example 8 | 0.0% | Failed by all models |

Answer: The main pain patterns are abstraction, LTS reasoning, and difficult composition cases such as Example 8. The models answered each of these incorrectly more often than correctly.
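
The definition used here (most outputs incorrect) can be applied mechanically to the task-type counts reported in the earlier tables. The following sketch flags every task type with fewer than half of its tasks correct. The function name `is_pain_pattern` is illustrative, and Example 8 is left out because its per-example counts are not listed in this summary.

```python
# (correct, total) counts per task type, copied from the tables in RQ1-RQ3.
task_counts = {
    "Identifying events": (11, 12),
    "Identifying event orders": (18, 21),
    "Abstraction": (6, 30),
    "Composition": (20, 42),
    "Traces": (7, 12),
    "LTS": (3, 12),
}


def is_pain_pattern(correct: int, total: int) -> bool:
    """True when most outputs were incorrect, i.e. fewer than half correct."""
    return 2 * correct < total


# Collect the flagged task types, weakest first.
pain_patterns = sorted(
    (name for name, (c, t) in task_counts.items() if is_pain_pattern(c, t)),
    key=lambda name: task_counts[name][0] / task_counts[name][1],
)

print(pain_patterns)
```

With the reported counts this flags abstraction, LTS and composition, in that order, and leaves traces (7 of 12 correct) just above the threshold.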

Overall Conclusion

The results show a clear capability hierarchy. LLMs are strongest at event identification and event ordering, moderately reliable on traces, weaker on composition, and weakest on abstraction and LTS reasoning.

| Capability level | MSC task types | Observed performance |
| --- | --- | --- |
| Strong | Event identification, event ordering | High accuracy |
| Moderate | Trace reasoning | Mixed accuracy |
| Weak | Composition | Model-dependent, often poor |
| Very weak | Abstraction, LTS | Recurring failure patterns |

Final summary: LLMs can usually recognise basic MSC semantics, but they are far less reliable for semantic transformations and formal behavioural semantics.