Overview / Browse data / Converted Word document

Analysis / annotation_tables_summary.docx

This page is a uniformly styled HTML rendering of the original Word file. The original Word file is retained in the data folder for verification.

Question:

Based on these annotations. Please analyse the accuracy of each and every LLM per task type in a separate table. Please also produce a table summarising the performance of all LLMs per task type and per Example. Please note that "Judgment" and "Judgement" should be considered the same in counting the number of correct and incorrect tasks. If there is any discrepancy among the number or type of tasks across models, please report them. Please note that all models should be tested on the same (number of) tasks.

Answer:

I’ll treat “Judgment” and “Judgement” as the same label, inspect every annotation file in the zip, and check whether each model has the same task coverage before calculating accuracy.

I found 129 annotated task judgments across the 27 DOCX files. Coverage is balanced by model: each model has 43 judged tasks, with the same task counts per example.

Thought for 2m 19s

Done. I treated “Judgment” and “Judgement” as the same field, and counted any “Partially Incorrect” / “Incorrect?” as Incorrect.

You can download the full workbook here:

llm_annotation_accuracy_analysis.xlsx

SummaryCoverage notesGPT-5.4-Thinking DeepGemini 3 - ThinkingQwen 3.6 Plus - ThinkingRaw extracted judgments

Coverage check

No discrepancy was found in the number of tasks across models after normalization.

Check	Result
Total judged tasks	129
Models	3
Judged tasks per model	43
Examples per model	9
Unknown/unclassifiable judgments	0

I normalized these task-type variants for counting: “Identifying event order”, “Identifying event orders”, and “Event ordering” were counted as Identifying event orders. “Question type” was treated as equivalent to “Task type”.

GPT-5.4-Thinking Deep — accuracy by task type

Task type	Correct	Incorrect	Total	Accuracy
Identifying events	3	1	4	75.0%
Identifying event orders	5	2	7	71.4%
Abstraction	2	8	10	20.0%
Traces	3	1	4	75.0%
LTS	2	2	4	50.0%
Composition	5	9	14	35.7%
Overall	20	23	43	46.5%

Gemini 3 - Thinking — accuracy by task type

Task type	Correct	Incorrect	Total	Accuracy
Identifying events	4	0	4	100.0%
Identifying event orders	7	0	7	100.0%
Abstraction	1	9	10	10.0%
Traces	2	2	4	50.0%
LTS	0	4	4	0.0%
Composition	10	4	14	71.4%
Overall	24	19	43	55.8%

Qwen 3.6 Plus - Thinking — accuracy by task type

Task type	Correct	Incorrect	Total	Accuracy
Identifying events	4	0	4	100.0%
Identifying event orders	6	1	7	85.7%
Abstraction	3	7	10	30.0%
Traces	2	2	4	50.0%
LTS	1	3	4	25.0%
Composition	5	9	14	35.7%
Overall	21	22	43	48.8%

Summary performance by task type

Task type	GPT-5.4 correct/total	GPT-5.4 accuracy	Gemini correct/total	Gemini accuracy	Qwen correct/total	Qwen accuracy
Identifying events	3/4	75.0%	4/4	100.0%	4/4	100.0%
Identifying event orders	5/7	71.4%	7/7	100.0%	6/7	85.7%
Abstraction	2/10	20.0%	1/10	10.0%	3/10	30.0%
Traces	3/4	75.0%	2/4	50.0%	2/4	50.0%
LTS	2/4	50.0%	0/4	0.0%	1/4	25.0%
Composition	5/14	35.7%	10/14	71.4%	5/14	35.7%

Summary performance by example

Example	GPT-5.4 correct/total	GPT-5.4 accuracy	Gemini correct/total	Gemini accuracy	Qwen correct/total	Qwen accuracy
1	4/7	57.1%	2/7	28.6%	2/7	28.6%
2	5/7	71.4%	5/7	71.4%	4/7	57.1%
3	4/7	57.1%	2/7	28.6%	3/7	42.9%
4	1/4	25.0%	4/4	100.0%	3/4	75.0%
5	2/4	50.0%	3/4	75.0%	2/4	50.0%
6	1/4	25.0%	2/4	50.0%	0/4	0.0%
7	1/1	100.0%	1/1	100.0%	0/1	0.0%
8	0/1	0.0%	0/1	0.0%	0/1	0.0%
9	2/8	25.0%	5/8	62.5%	7/8	87.5%

Key takeaways

Gemini has the best overall accuracy at 55.8%, followed by Qwen at 48.8%, then GPT-5.4 at 46.5%.

Gemini is strongest on event identification/order and composition, but weakest on LTS and abstraction. Qwen is best on abstraction among the three, though still low at 30.0%. GPT-5.4 is comparatively strongest on traces and identifying events, but weak on abstraction and composition.

Question:

Please extract these into a Word and a LaTex file.

Answer:

Summary of Annotation Tables

This document summarises the annotation-derived performance tables. The analysis treats “Judgment” and “Judgement” as the same field and counts partially incorrect judgements as incorrect. Task-type variants were normalised, including variants of event ordering.

Coverage and Normalisation

Item	Finding
Total judgments	129 judged tasks across 27 DOCX files.
Per-model judgments	43 judged tasks per model.
Coverage	No discrepancy found in number of tasks per model, per task type, or per example after normalization.
Judgment/Judgement	Both spellings were treated identically. Raw headers: judgment=115; judgement=14.
Task-type normalization	Identifying event order, Identifying event orders, and Event ordering were counted as Identifying event orders; Identifying events. was counted as Identifying events; Question type was treated as Task type.
Unknown labels	None. All judgments were classifiable as Correct or Incorrect.
Partially incorrect	Counted as Incorrect because it contains an incorrect judgment.
Coverage finding: The dataset is balanced: 129 judged tasks, 43 tasks per model, and no observed discrepancy in task counts by model, task type, or example.

Accuracy by Model and Task Type

GPT-5.4-Thinking Deep

Task type	Correct	Incorrect	Total	Accuracy
Identifying events	3	1	4	75.0%
Identifying event orders	5	2	7	71.4%
Abstraction	2	8	10	20.0%
Traces	3	1	4	75.0%
LTS	2	2	4	50.0%
Composition	5	9	14	35.7%

Gemini 3 - Thinking

Task type	Correct	Incorrect	Total	Accuracy
Identifying events	4	0	4	100.0%
Identifying event orders	7	0	7	100.0%
Abstraction	1	9	10	10.0%
Traces	2	2	4	50.0%
LTS	0	4	4	0.0%
Composition	10	4	14	71.4%

Qwen 3.6 Plus - Thinking

Task type	Correct	Incorrect	Total	Accuracy
Identifying events	4	0	4	100.0%
Identifying event orders	6	1	7	85.7%
Abstraction	3	7	10	30.0%
Traces	2	2	4	50.0%
LTS	1	3	4	25.0%
Composition	5	9	14	35.7%

Cross-Model Summary by Task Type

Task type	GPT-5.4 correct/total	GPT-5.4 accuracy	Gemini correct/total	Gemini accuracy	Qwen correct/total	Qwen accuracy
Identifying events	3/4	75.0%	4/4	100.0%	4/4	100.0%
Identifying event orders	5/7	71.4%	7/7	100.0%	6/7	85.7%
Abstraction	2/10	20.0%	1/10	10.0%	3/10	30.0%
Traces	3/4	75.0%	2/4	50.0%	2/4	50.0%
LTS	2/4	50.0%	0/4	0.0%	1/4	25.0%
Composition	5/14	35.7%	10/14	71.4%	5/14	35.7%

Cross-Model Summary by Example

Example	GPT-5.4 correct/total	GPT-5.4 accuracy	Gemini correct/total	Gemini accuracy	Qwen correct/total	Qwen accuracy
1	4/7	57.1%	2/7	28.6%	2/7	28.6%
2	5/7	71.4%	5/7	71.4%	4/7	57.1%
3	4/7	57.1%	2/7	28.6%	3/7	42.9%
4	1/4	25.0%	4/4	100.0%	3/4	75.0%
5	2/4	50.0%	3/4	75.0%	2/4	50.0%
6	1/4	25.0%	2/4	50.0%	0/4	0.0%
7	1/1	100.0%	1/1	100.0%	0/1	0.0%
8	0/1	0.0%	0/1	0.0%	0/1	0.0%
9	2/8	25.0%	5/8	62.5%	7/8	87.5%

Task Count Coverage Checks

Coverage by task type

Task type	GPT-5.4	Gemini	Qwen
Identifying events	4	4	4
Identifying event orders	7	7	7
Abstraction	10	10	10
Traces	4	4	4
LTS	4	4	4
Composition	14	14	14

Coverage by example

Example	GPT-5.4	Gemini	Qwen
1	7	7	7
2	7	7	7
3	7	7	7
4	4	4	4
5	4	4	4
6	4	4	4
7	1	1	1
8	1	1	1
9	8	8	8

Interpretive Summary

Gemini 3 - Thinking had the highest overall accuracy at 55.8%, followed by Qwen 3.6 Plus - Thinking at 48.8% and GPT-5.4-Thinking Deep at 46.5%.
All models performed well on basic MSC semantic tasks, especially identifying events and event orders.
Abstraction was the most difficult task category across all models, with only 20.0% aggregate accuracy.
LTS reasoning was also a recurring weakness, with 25.0% aggregate accuracy.
Composition was model-dependent: Gemini performed strongly at 71.4%, while GPT-5.4 and Qwen both scored 35.7%.