Analysis / research_question_answers_summary.docx
Question:
Please answer the following research questions based on the data in the attached annotation file: RQ1: To what extent can LLMs identify and reason about the basic semantic elements of MSCs? RQ2: How well do LLMs perform semantic transformations (abstraction, composition) over MSCs? RQ3: Can LLMs reason about the behavioural semantics (traces, LTS) of MSCs? RQ4: How does LLM performance vary with model complexity and task complexity? RQ5: Which MSC concepts constitute recurring "pain patterns" for LLMs (i.e., those patterns that cannot be handled correctly in most cases)?
Answer:
Summary of Research Question Answers
This document summarises the answers to the five research questions using the annotation-derived accuracy tables. The interpretation is based on 129 judged tasks, balanced across three models with 43 tasks each.
RQ1: To what extent can LLMs identify and reason about the basic semantic elements of MSCs?
The models performed strongly on basic MSC semantics. Identifying events reached 91.7% aggregate accuracy and identifying event orders reached 85.7%. These were the strongest categories in the dataset.
| Basic MSC task | Correct | Total | Accuracy |
|---|---|---|---|
| Identifying events | 11 | 12 | 91.7% |
| Identifying event orders | 18 | 21 | 85.7% |
| Combined basic semantic tasks | 29 | 33 | 87.9% |

Answer: LLMs are generally competent at recognising basic MSC semantic elements, especially event identification and event ordering.
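The percentages in the basic-task table can be reproduced from the correct/total counts. The following is a minimal sketch, assuming the combined figure is obtained by pooling counts rather than averaging the two percentages; the helper name `accuracy` and the dictionary layout are illustrative and not part of the annotation file.

```python
# Counts copied from the basic-task table: (correct, total).
basic_tasks = {
    "Identifying events": (11, 12),
    "Identifying event orders": (18, 21),
}


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


per_task = {name: accuracy(c, t) for name, (c, t) in basic_tasks.items()}

# Pool the counts instead of averaging percentages, because the two
# task types have different numbers of judged items (12 vs 21).
pooled_correct = sum(c for c, _ in basic_tasks.values())
pooled_total = sum(t for _, t in basic_tasks.values())
combined = accuracy(pooled_correct, pooled_total)

print(per_task)
print(combined)
```

Pooling weights each judged task equally, which is why the combined value sits closer to the event-order accuracy than a plain average of the two rows would.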
RQ2: How well do LLMs perform semantic transformations over MSCs?
Semantic transformations were substantially harder than basic recognition. Abstraction was the weakest task type overall at 20.0%, while composition accuracy ranged from 35.7% to 71.4% depending on the model.
| Transformation task | Correct | Total | Accuracy |
|---|---|---|---|
| Abstraction | 6 | 30 | 20.0% |
| Composition | 20 | 42 | 47.6% |
| Combined transformation tasks | 26 | 72 | 36.1% |

| Model | Abstraction accuracy | Composition accuracy |
|---|---|---|
| GPT-5.4-Thinking Deep | 20.0% | 35.7% |
| Gemini 3 - Thinking | 10.0% | 71.4% |
| Qwen 3.6 Plus - Thinking | 30.0% | 35.7% |

Answer: LLMs perform poorly on MSC semantic transformations overall. Abstraction is a severe cross-model weakness, whereas composition is inconsistent and strongly model-dependent.
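The per-model percentages can be cross-checked against the aggregate transformation counts. The sketch below assumes each model was judged on 10 abstraction and 14 composition tasks (30/3 and 42/3, following the balanced design stated at the top); these per-model totals are inferred, not stated in the tables, and `implied_correct` is a hypothetical helper.

```python
# Assumption: tasks are split evenly across the three models.
PER_MODEL_TOTALS = {"Abstraction": 10, "Composition": 14}

# Reported per-model accuracies (percent), copied from the model table.
reported = {
    "GPT-5.4-Thinking Deep": {"Abstraction": 20.0, "Composition": 35.7},
    "Gemini 3 - Thinking": {"Abstraction": 10.0, "Composition": 71.4},
    "Qwen 3.6 Plus - Thinking": {"Abstraction": 30.0, "Composition": 35.7},
}

# Aggregate correct counts, copied from the transformation table.
aggregate_correct = {"Abstraction": 6, "Composition": 20}


def implied_correct(percent: float, total: int) -> int:
    """Recover the whole-number correct count implied by a rounded percentage."""
    return round(percent / 100 * total)


implied = {
    task: sum(implied_correct(scores[task], total) for scores in reported.values())
    for task, total in PER_MODEL_TOTALS.items()
}

print(implied)
print(implied == aggregate_correct)
```

Under that assumption the implied counts add up to the aggregate rows, so the two tables are mutually consistent.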
RQ3: Can LLMs reason about the behavioural semantics of MSCs?
Behavioural semantics were challenging. Trace reasoning was moderate at 58.3%, but LTS reasoning was poor at 25.0%, with no model exceeding 50.0%.
| Behavioural task | Correct | Total | Accuracy |
|---|---|---|---|
| Traces | 7 | 12 | 58.3% |
| LTS | 3 | 12 | 25.0% |
| Combined behavioural semantics tasks | 10 | 24 | 41.7% |

| Model | Traces accuracy | LTS accuracy |
|---|---|---|
| GPT-5.4-Thinking Deep | 75.0% | 50.0% |
| Gemini 3 - Thinking | 50.0% | 0.0% |
| Qwen 3.6 Plus - Thinking | 50.0% | 25.0% |

Answer: LLMs can reason about traces to a limited or moderate extent, but LTS-based behavioural semantics remain a major difficulty.
RQ4: How does LLM performance vary with model complexity and task complexity?
Overall accuracy differed little between models, ranging from 46.5% to 55.8%, and no single model led on every task type. Task complexity had a clearer effect: accuracy fell from 87.9% on basic recognition to 41.7% on behavioural semantics and 36.1% on transformations.
| Model | Correct | Total | Overall accuracy |
|---|---|---|---|
| Gemini 3 - Thinking | 24 | 43 | 55.8% |
| Qwen 3.6 Plus - Thinking | 21 | 43 | 48.8% |
| GPT-5.4-Thinking Deep | 20 | 43 | 46.5% |

| Task group | Accuracy |
|---|---|
| Basic semantic tasks: events + event orders | 87.9% |
| Behavioural semantics: traces + LTS | 41.7% |
| Transformations: abstraction + composition | 36.1% |

Answer: Task complexity explains performance variation better than model choice. Basic recognition is strong, while transformations and behavioural semantics are much weaker.
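One simple way to compare the two sources of variation is to look at the spread (maximum minus minimum accuracy, in percentage points) across models and across task groups. The sketch below is illustrative only: the `spread` helper is a hypothetical name, the counts are copied from the tables in this document, and a range is a crude measure rather than a statistical test.

```python
# Overall counts per model: (correct, total).
model_counts = {
    "Gemini 3 - Thinking": (24, 43),
    "Qwen 3.6 Plus - Thinking": (21, 43),
    "GPT-5.4-Thinking Deep": (20, 43),
}

# Pooled counts per task group: (correct, total).
group_counts = {
    "Basic semantic tasks": (29, 33),
    "Behavioural semantics": (10, 24),
    "Transformations": (26, 72),
}


def spread(counts: dict[str, tuple[int, int]]) -> float:
    """Return max minus min accuracy in percentage points, one decimal place."""
    accuracies = [100 * correct / total for correct, total in counts.values()]
    return round(max(accuracies) - min(accuracies), 1)


model_spread = spread(model_counts)
group_spread = spread(group_counts)

print(f"Spread across models: {model_spread} points")
print(f"Spread across task groups: {group_spread} points")
```

The gap between task groups is several times larger than the gap between models, which is the pattern the answer above describes.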
RQ5: Which MSC concepts constitute recurring pain patterns for LLMs?
Recurring pain patterns are task types or examples where most model outputs were incorrect. Abstraction and LTS reasoning are the clearest recurring failures. Difficult composition examples also caused repeated failures, particularly Example 8, which all models failed.
| Pain pattern | Accuracy | Interpretation |
|---|---|---|
| Abstraction | 20.0% | Severe recurring failure across all models |
| LTS | 25.0% | Severe difficulty with state-transition behavioural semantics |
| Composition | 47.6% | Slightly under half correct overall; strongly model-dependent |
| Example 8 | 0.0% | Failed by all models |

Answer: The main pain patterns are abstraction, LTS reasoning, and difficult composition cases. These concepts are not handled correctly in most cases.
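The working definition of a pain pattern used here (most model outputs incorrect) can be applied mechanically to the per-task-type counts. This is a minimal sketch, assuming "most" means aggregate accuracy below 50%; the threshold constant and variable names are illustrative, and Example 8 is left out because its judged-task count is not given in the tables.

```python
# Aggregate counts per task type: (correct, total), copied from the tables.
task_counts = {
    "Identifying events": (11, 12),
    "Identifying event orders": (18, 21),
    "Abstraction": (6, 30),
    "Composition": (20, 42),
    "Traces": (7, 12),
    "LTS": (3, 12),
}

# Assumption: a pain pattern is a task type where most outputs are wrong.
PAIN_THRESHOLD = 50.0

accuracies = {
    task: 100 * correct / total for task, (correct, total) in task_counts.items()
}

# Keep task types below the threshold, ordered from worst to best.
pain_patterns = sorted(
    (task for task, acc in accuracies.items() if acc < PAIN_THRESHOLD),
    key=lambda task: accuracies[task],
)

for task in pain_patterns:
    print(f"{task}: {accuracies[task]:.1f}%")
```

With this threshold the flagged task types are abstraction, LTS, and composition, in that order of severity; traces fall just above the cut-off.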
Overall Conclusion
The results show a clear capability hierarchy. LLMs are strongest at event identification and event ordering, moderately reliable on traces, weaker on composition, and weakest on abstraction and LTS reasoning.
| Capability level | MSC task types | Observed performance |
|---|---|---|
| Strong | Event identification, event ordering | High accuracy |
| Moderate | Trace reasoning | Mixed accuracy |
| Weak | Composition | Model-dependent, often poor |
| Very weak | Abstraction, LTS | Recurring failure patterns |

Final summary: LLMs can usually recognise basic MSC semantics, but they are far less reliable for semantic transformations and formal behavioural semantics.