Analysis / research_question_answers_summary.docx

This page is a uniformly styled HTML rendering of the original Word file. The original Word file is retained in the data folder for verification.

Question:

Please answer the following research questions based on the data in the attached annotation file:
RQ1: To what extent can LLMs identify and reason about the basic semantic elements of MSCs?
RQ2: How well do LLMs perform semantic transformations (abstraction, composition) over MSCs?
RQ3: Can LLMs reason about the behavioural semantics (traces, LTS) of MSCs?
RQ4: How does LLM performance vary with model complexity and task complexity?
RQ5: Which MSC concepts constitute recurring "pain patterns" for LLMs (i.e., those patterns that cannot be handled correctly in most cases)?

Answer:

Summary of Research Question Answers

This document summarises the answers to the five research questions using the annotation-derived accuracy tables. The interpretation is based on 129 judged tasks, balanced across three models with 43 tasks each.
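
The task counts can be cross-checked against the per-task totals reported in the tables of this summary. The following is a minimal Python sketch of that check; the variable names are illustrative, and the even split of every task type across the three models is an assumption that the source implies ("balanced") but does not tabulate.

```python
# Per-task totals copied from the "Total" columns of the tables in this summary.
task_totals = {
    "Identifying events": 12,
    "Identifying event orders": 21,
    "Abstraction": 30,
    "Composition": 42,
    "Traces": 12,
    "LTS": 12,
}

MODELS = 3  # GPT-5.4-Thinking Deep, Gemini 3 - Thinking, Qwen 3.6 Plus - Thinking

total_tasks = sum(task_totals.values())
per_model_tasks = total_tasks // MODELS

# Assumption: "balanced" means each task type is split evenly across models.
evenly_divisible = all(n % MODELS == 0 for n in task_totals.values())

print(total_tasks, per_model_tasks, evenly_divisible)  # 129 43 True
```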

RQ1: To what extent can LLMs identify and reason about the basic semantic elements of MSCs?

The models performed strongly on basic MSC semantics. Identifying events reached 91.7% aggregate accuracy and identifying event orders reached 85.7%. These were the strongest categories in the dataset.

Basic MSC task	Correct	Total	Accuracy
Identifying events	11	12	91.7%
Identifying event orders	18	21	85.7%
Combined basic semantic tasks	29	33	87.9%
Answer: LLMs are generally competent at recognising basic MSC semantic elements, especially event identification and event ordering.
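
The percentages in the table follow directly from the correct/total counts. As a hedged illustration (the function and variable names are hypothetical, not part of the original analysis), the sketch below recomputes them and shows that the combined figure pools the counts rather than averaging the two percentages.

```python
# Counts copied from the table above: task -> (correct, total).
basic_tasks = {
    "Identifying events": (11, 12),
    "Identifying event orders": (18, 21),
}


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


per_task = {name: accuracy(c, t) for name, (c, t) in basic_tasks.items()}

# Pool the counts: 29 correct out of 33, not the mean of 91.7 and 85.7.
combined_correct = sum(c for c, _ in basic_tasks.values())
combined_total = sum(t for _, t in basic_tasks.values())
combined = accuracy(combined_correct, combined_total)

print(per_task)  # {'Identifying events': 91.7, 'Identifying event orders': 85.7}
print(combined)  # 87.9
```

Pooling weights each task type by its number of judged tasks, which is why the combined value (87.9%) sits closer to the event-order accuracy than a simple average would.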

RQ2: How well do LLMs perform semantic transformations over MSCs?

Semantic transformations were substantially harder. Abstraction was the weakest task type overall (20.0%), while composition accuracy depended heavily on the model, ranging from 35.7% to 71.4%.

Transformation task	Correct	Total	Accuracy
Abstraction	6	30	20.0%
Composition	20	42	47.6%
Combined transformation tasks	26	72	36.1%

Model	Abstraction accuracy	Composition accuracy
GPT-5.4-Thinking Deep	20.0%	35.7%
Gemini 3 - Thinking	10.0%	71.4%
Qwen 3.6 Plus - Thinking	30.0%	35.7%
Answer: LLMs perform poorly on MSC semantic transformations overall. Abstraction is a severe cross-model weakness, whereas composition is inconsistent and strongly model-dependent.
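
The per-model percentages can be reconciled with the aggregate counts. The sketch below is a consistency check only: it assumes, without confirmation from the source, that each model received 10 abstraction and 14 composition tasks (30 / 3 and 42 / 3), and the helper name `implied_correct` is hypothetical.

```python
# Assumption (not stated in the source): tasks are split evenly across models,
# i.e. 30 / 3 = 10 abstraction and 42 / 3 = 14 composition tasks per model.
TASKS_PER_MODEL = {"abstraction": 10, "composition": 14}

# Percentages copied from the per-model table.
reported = {
    "GPT-5.4-Thinking Deep": {"abstraction": 20.0, "composition": 35.7},
    "Gemini 3 - Thinking": {"abstraction": 10.0, "composition": 71.4},
    "Qwen 3.6 Plus - Thinking": {"abstraction": 30.0, "composition": 35.7},
}


def implied_correct(pct: float, total: int) -> int:
    """Recover the whole-number count of correct answers behind a percentage."""
    return round(pct * total / 100)


totals = {task: 0 for task in TASKS_PER_MODEL}
for scores in reported.values():
    for task, pct in scores.items():
        totals[task] += implied_correct(pct, TASKS_PER_MODEL[task])

print(totals)  # {'abstraction': 6, 'composition': 20}
```

Under that assumption the implied counts sum to the 6 and 20 correct answers reported in the aggregate table.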

RQ3: Can LLMs reason about the behavioural semantics of MSCs?

Behavioural semantics were challenging. Trace reasoning was moderate (58.3%), but LTS reasoning was poor (25.0%), and no model exceeded 50.0% on LTS tasks.

Behavioural task	Correct	Total	Accuracy
Traces	7	12	58.3%
LTS	3	12	25.0%
Combined behavioural semantics tasks	10	24	41.7%

Model	Traces accuracy	LTS accuracy
GPT-5.4-Thinking Deep	75.0%	50.0%
Gemini 3 - Thinking	50.0%	0.0%
Qwen 3.6 Plus - Thinking	50.0%	25.0%
Answer: LLMs can reason about traces to a moderate extent, but LTS-based behavioural semantics remain a major difficulty.

RQ4: How does LLM performance vary with model complexity and task complexity?

Overall model performance was relatively close (46.5% to 55.8%), and no model led consistently: Gemini 3 had the highest overall and composition accuracy but the lowest abstraction and LTS accuracy. Task complexity had a clearer effect, with accuracy falling from 87.9% on basic recognition to 41.7% on behavioural semantics and 36.1% on transformations.

Model	Correct	Total	Overall accuracy
Gemini 3 - Thinking	24	43	55.8%
Qwen 3.6 Plus - Thinking	21	43	48.8%
GPT-5.4-Thinking Deep	20	43	46.5%

Task group	Accuracy
Basic semantic tasks: events + event orders	87.9%
Behavioural semantics: traces + LTS	41.7%
Transformations: abstraction + composition	36.1%
Answer: Task complexity explains performance variation better than model choice. Basic recognition is strong, while transformations and behavioural semantics are much weaker.
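
One way to make the comparison concrete is to look at which model leads on each task type and how far apart the models are. The sketch below uses only the per-model percentages reported in this summary; the dictionary layout and helper names are illustrative assumptions, not the original analysis code.

```python
# Percentages copied from the per-model tables for RQ2, RQ3 and RQ4.
per_model = {
    "GPT-5.4-Thinking Deep": {
        "abstraction": 20.0, "composition": 35.7,
        "traces": 75.0, "LTS": 50.0, "overall": 46.5,
    },
    "Gemini 3 - Thinking": {
        "abstraction": 10.0, "composition": 71.4,
        "traces": 50.0, "LTS": 0.0, "overall": 55.8,
    },
    "Qwen 3.6 Plus - Thinking": {
        "abstraction": 30.0, "composition": 35.7,
        "traces": 50.0, "LTS": 25.0, "overall": 48.8,
    },
}

columns = ["abstraction", "composition", "traces", "LTS", "overall"]


def best_model(column: str) -> str:
    """Return the model with the highest accuracy in the given column."""
    return max(per_model, key=lambda model: per_model[model][column])


def spread(column: str) -> float:
    """Return the gap in percentage points between best and worst model."""
    values = [scores[column] for scores in per_model.values()]
    return round(max(values) - min(values), 1)


leaders = {col: best_model(col) for col in columns}
spreads = {col: spread(col) for col in columns}

for col in columns:
    print(f"{col}: best = {leaders[col]}, spread = {spreads[col]} points")
```

In this data the overall spread is 9.3 points, while the per-task spreads range from 20.0 to 50.0 points, and the leading model changes from one task type to the next.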

RQ5: Which MSC concepts constitute recurring pain patterns for LLMs?

Recurring pain patterns are task types or examples where most model outputs were incorrect. Abstraction and LTS reasoning are the clearest recurring failures. Difficult composition examples also caused repeated failures, particularly Example 8, which all models failed.

Pain pattern	Accuracy	Interpretation
Abstraction	20.0%	Severe recurring failure across all models
LTS	25.0%	Severe difficulty with state-transition behavioural semantics
Composition	47.6%	Incorrect in just over half of cases; strongly model-dependent
Example 8	0.0%	Failed by all models
Answer: The main pain patterns are abstraction, LTS reasoning, and difficult composition cases. These concepts are not handled correctly in most cases.
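
The "not handled correctly in most cases" criterion can be read as an aggregate accuracy below 50%. The sketch below applies that reading to the task-type counts reported in this summary; the 50% threshold and the names used are assumptions made for illustration, and individual examples such as Example 8 are not covered because their per-example counts are not tabulated here.

```python
# Aggregate counts copied from the tables above: task type -> (correct, total).
task_counts = {
    "Identifying events": (11, 12),
    "Identifying event orders": (18, 21),
    "Abstraction": (6, 30),
    "Composition": (20, 42),
    "Traces": (7, 12),
    "LTS": (3, 12),
}

# Assumption: a "pain pattern" is a task type answered correctly less than half the time.
THRESHOLD = 0.5

accuracy = {name: correct / total for name, (correct, total) in task_counts.items()}

# Keep task types below the threshold, worst first.
pain_patterns = sorted(
    (name for name, acc in accuracy.items() if acc < THRESHOLD),
    key=accuracy.get,
)

print(pain_patterns)  # ['Abstraction', 'LTS', 'Composition']
```

Composition falls only slightly below the threshold (20 of 42), so its classification is more sensitive to the chosen cut-off than abstraction or LTS.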

Overall Conclusion

The results show a clear capability hierarchy. LLMs are strongest at event identification and event ordering, moderately reliable on traces, weaker on composition, and weakest on abstraction and LTS reasoning.

Capability level	MSC task types	Observed performance
Strong	Event identification, event ordering	High accuracy
Moderate	Trace reasoning	Mixed accuracy
Weak	Composition	Model-dependent, often poor
Very weak	Abstraction, LTS	Recurring failure patterns
Final summary: LLMs can usually recognise basic MSC semantics, but they are far less reliable for semantic transformations and formal behavioural semantics.