Analysis / annotation_tables_summary.docx

This page is a uniformly styled HTML rendering of the original Word file. The original Word file is retained in the data folder for verification.

Question:

Based on these annotations, please analyse the accuracy of each LLM per task type in a separate table. Please also produce a table summarising the performance of all LLMs per task type and per Example. Please note that "Judgment" and "Judgement" should be considered the same when counting the number of correct and incorrect tasks. If there is any discrepancy in the number or type of tasks across models, please report it. Please note that all models should be tested on the same (number of) tasks.

Answer:

I’ll treat “Judgment” and “Judgement” as the same label, inspect every annotation file in the zip, and check whether each model has the same task coverage before calculating accuracy.

I found 129 annotated task judgments across the 27 DOCX files. Coverage is balanced by model: each model has 43 judged tasks, with the same task counts per example.

Done. I treated “Judgment” and “Judgement” as the same field, and counted any “Partially Incorrect” / “Incorrect?” as Incorrect.
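The label handling described above can be sketched as follows. This is a minimal illustration, not the code that produced the workbook; the function names `normalize_header` and `normalize_judgment` are hypothetical, and the "Unknown" fallback is an assumption about how unclassifiable labels would be flagged.

```python
import re

def normalize_header(header: str) -> str:
    """Map 'Judgment' and 'Judgement' (any case, optional colon) to one field name."""
    cleaned = header.strip().rstrip(":").strip().lower()
    return "judgment" if cleaned in {"judgment", "judgement"} else cleaned

def normalize_judgment(label: str) -> str:
    """Collapse a raw judgment label to 'Correct', 'Incorrect', or 'Unknown'."""
    # Drop punctuation such as '?' or '.' before matching.
    cleaned = re.sub(r"[^a-z ]", "", label.lower()).strip()
    if "incorrect" in cleaned:
        # Covers 'Incorrect', 'Incorrect?', and 'Partially Incorrect'.
        return "Incorrect"
    if cleaned == "correct":
        return "Correct"
    return "Unknown"  # assumed fallback for labels that fit neither class
```

Checking for "incorrect" before "correct" matters, because "incorrect" contains "correct" as a substring.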

You can download the full workbook here:

llm_annotation_accuracy_analysis.xlsx

Summary | Coverage notes | GPT-5.4-Thinking Deep | Gemini 3 - Thinking | Qwen 3.6 Plus - Thinking | Raw extracted judgments

Coverage check

No discrepancy was found in the number of tasks across models after normalization.

Check | Result
Total judged tasks | 129
Models | 3
Judged tasks per model | 43
Examples per model | 9
Unknown/unclassifiable judgments | 0

I normalized these task-type variants for counting: “Identifying event order”, “Identifying event orders”, and “Event ordering” were counted as Identifying event orders. “Question type” was treated as equivalent to “Task type”.
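One way to express this task-type normalization is a small alias table, sketched below. The names `TASK_TYPE_ALIASES`, `FIELD_ALIASES`, `normalize_task_type`, and `normalize_field` are hypothetical, and stripping a trailing period is an assumption based on the "Identifying events." variant reported in the coverage notes.

```python
# Alias table for the task-type variants listed above (keys are lowercase).
TASK_TYPE_ALIASES = {
    "identifying event order": "Identifying event orders",
    "identifying event orders": "Identifying event orders",
    "event ordering": "Identifying event orders",
}

# Field-name variants: "Question type" is read as "Task type".
FIELD_ALIASES = {"question type": "Task type", "task type": "Task type"}

def _clean(raw: str) -> str:
    """Trim whitespace and a trailing period (assumed to be annotation noise)."""
    return raw.strip().rstrip(".").strip()

def normalize_task_type(raw: str) -> str:
    cleaned = _clean(raw)
    return TASK_TYPE_ALIASES.get(cleaned.lower(), cleaned)

def normalize_field(raw: str) -> str:
    cleaned = _clean(raw)
    return FIELD_ALIASES.get(cleaned.lower(), cleaned)
```

Labels that are not in the alias table pass through unchanged apart from trimming, so an unexpected variant stays visible in the counts instead of being silently merged.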

GPT-5.4-Thinking Deep — accuracy by task type

Task type | Correct | Incorrect | Total | Accuracy
Identifying events | 3 | 1 | 4 | 75.0%
Identifying event orders | 5 | 2 | 7 | 71.4%
Abstraction | 2 | 8 | 10 | 20.0%
Traces | 3 | 1 | 4 | 75.0%
LTS | 2 | 2 | 4 | 50.0%
Composition | 5 | 9 | 14 | 35.7%
Overall | 20 | 23 | 43 | 46.5%

Gemini 3 - Thinking — accuracy by task type

Task type | Correct | Incorrect | Total | Accuracy
Identifying events | 4 | 0 | 4 | 100.0%
Identifying event orders | 7 | 0 | 7 | 100.0%
Abstraction | 1 | 9 | 10 | 10.0%
Traces | 2 | 2 | 4 | 50.0%
LTS | 0 | 4 | 4 | 0.0%
Composition | 10 | 4 | 14 | 71.4%
Overall | 24 | 19 | 43 | 55.8%

Qwen 3.6 Plus - Thinking — accuracy by task type

Task type | Correct | Incorrect | Total | Accuracy
Identifying events | 4 | 0 | 4 | 100.0%
Identifying event orders | 6 | 1 | 7 | 85.7%
Abstraction | 3 | 7 | 10 | 30.0%
Traces | 2 | 2 | 4 | 50.0%
LTS | 1 | 3 | 4 | 25.0%
Composition | 5 | 9 | 14 | 35.7%
Overall | 21 | 22 | 43 | 48.8%

Summary performance by task type

Task type | GPT-5.4 correct/total | GPT-5.4 accuracy | Gemini correct/total | Gemini accuracy | Qwen correct/total | Qwen accuracy
Identifying events | 3/4 | 75.0% | 4/4 | 100.0% | 4/4 | 100.0%
Identifying event orders | 5/7 | 71.4% | 7/7 | 100.0% | 6/7 | 85.7%
Abstraction | 2/10 | 20.0% | 1/10 | 10.0% | 3/10 | 30.0%
Traces | 3/4 | 75.0% | 2/4 | 50.0% | 2/4 | 50.0%
LTS | 2/4 | 50.0% | 0/4 | 0.0% | 1/4 | 25.0%
Composition | 5/14 | 35.7% | 10/14 | 71.4% | 5/14 | 35.7%
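Each accuracy figure is correct/total expressed as a percentage to one decimal place. The sketch below recomputes the overall figures from the per-task-type counts in the table above. The counts are copied from the table, while the variable and function names are illustrative only.

```python
# Task totals per type, identical for all three models.
TOTALS = {
    "Identifying events": 4,
    "Identifying event orders": 7,
    "Abstraction": 10,
    "Traces": 4,
    "LTS": 4,
    "Composition": 14,
}

# Correct counts per model, in the same task-type order as TOTALS.
CORRECT = {
    "GPT-5.4": [3, 5, 2, 3, 2, 5],
    "Gemini": [4, 7, 1, 2, 0, 10],
    "Qwen": [4, 6, 3, 2, 1, 5],
}

def accuracy(correct: int, total: int) -> str:
    """Format correct/total as a percentage with one decimal place."""
    return f"{100 * correct / total:.1f}%"

def overall(model: str) -> tuple[int, int, str]:
    """Sum the per-task-type counts into an overall (correct, total, accuracy)."""
    correct = sum(CORRECT[model])
    total = sum(TOTALS.values())
    return correct, total, accuracy(correct, total)

for model in CORRECT:
    print(model, overall(model))
```

Summing the rows this way reproduces the Overall lines of the per-model tables: 20/43, 24/43, and 21/43.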

Summary performance by example

Example | GPT-5.4 correct/total | GPT-5.4 accuracy | Gemini correct/total | Gemini accuracy | Qwen correct/total | Qwen accuracy
1 | 4/7 | 57.1% | 2/7 | 28.6% | 2/7 | 28.6%
2 | 5/7 | 71.4% | 5/7 | 71.4% | 4/7 | 57.1%
3 | 4/7 | 57.1% | 2/7 | 28.6% | 3/7 | 42.9%
4 | 1/4 | 25.0% | 4/4 | 100.0% | 3/4 | 75.0%
5 | 2/4 | 50.0% | 3/4 | 75.0% | 2/4 | 50.0%
6 | 1/4 | 25.0% | 2/4 | 50.0% | 0/4 | 0.0%
7 | 1/1 | 100.0% | 1/1 | 100.0% | 0/1 | 0.0%
8 | 0/1 | 0.0% | 0/1 | 0.0% | 0/1 | 0.0%
9 | 2/8 | 25.0% | 5/8 | 62.5% | 7/8 | 87.5%
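A useful sanity check on the per-example table is that its rows add up to the same overall counts as the per-task-type tables. The sketch below performs that check with the values copied from the tables in this answer; the names `BY_EXAMPLE`, `REPORTED_OVERALL`, and `example_totals` are illustrative.

```python
# (correct, total) per example, Examples 1 to 9, copied from the table above.
BY_EXAMPLE = {
    "GPT-5.4": [(4, 7), (5, 7), (4, 7), (1, 4), (2, 4), (1, 4), (1, 1), (0, 1), (2, 8)],
    "Gemini": [(2, 7), (5, 7), (2, 7), (4, 4), (3, 4), (2, 4), (1, 1), (0, 1), (5, 8)],
    "Qwen": [(2, 7), (4, 7), (3, 7), (3, 4), (2, 4), (0, 4), (0, 1), (0, 1), (7, 8)],
}

# Overall (correct, total) from the per-model task-type tables.
REPORTED_OVERALL = {"GPT-5.4": (20, 43), "Gemini": (24, 43), "Qwen": (21, 43)}

def example_totals(model: str) -> tuple[int, int]:
    """Sum correct and total counts across all examples for one model."""
    rows = BY_EXAMPLE[model]
    return sum(c for c, _ in rows), sum(t for _, t in rows)

for model, reported in REPORTED_OVERALL.items():
    status = "consistent" if example_totals(model) == reported else "MISMATCH"
    print(model, example_totals(model), status)
```

All three models come out consistent: the per-example rows sum to 20/43, 24/43, and 21/43.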

Key takeaways

Gemini has the best overall accuracy at 55.8%, followed by Qwen at 48.8%, then GPT-5.4 at 46.5%.

Gemini is strongest on identifying events and event orders (100.0% each) and on composition (71.4%), but weakest on LTS (0.0%) and abstraction (10.0%). Qwen is best on abstraction among the three, though still low at 30.0%. GPT-5.4 does best on traces and identifying events (75.0% each) and leads the three models on traces and LTS (50.0%), but it is weak on abstraction (20.0%) and composition (35.7%).

Question:

Please extract these into a Word file and a LaTeX file.

Answer:

Summary of Annotation Tables

This document summarises the annotation-derived performance tables. The analysis treats “Judgment” and “Judgement” as the same field and counts partially incorrect judgements as incorrect. Task-type variants were normalised, including variants of event ordering.

Coverage and Normalisation

Item | Finding
Total judgments | 129 judged tasks across 27 DOCX files.
Per-model judgments | 43 judged tasks per model.
Coverage | No discrepancy found in number of tasks per model, per task type, or per example after normalization.
Judgment/Judgement | Both spellings were treated identically. Raw headers: judgment=115; judgement=14.
Task-type normalization | “Identifying event order”, “Identifying event orders”, and “Event ordering” were counted as Identifying event orders; “Identifying events.” (with a trailing period) was counted as Identifying events; “Question type” was treated as “Task type”.
Unknown labels | None. All judgments were classifiable as Correct or Incorrect.
Partially incorrect | Counted as Incorrect because it contains an incorrect judgment.
Coverage finding: the dataset is balanced, with 129 judged tasks, 43 tasks per model, and no observed discrepancy in task counts by model, task type, or example.

Accuracy by Model and Task Type

GPT-5.4-Thinking Deep

Task type | Correct | Incorrect | Total | Accuracy
Identifying events | 3 | 1 | 4 | 75.0%
Identifying event orders | 5 | 2 | 7 | 71.4%
Abstraction | 2 | 8 | 10 | 20.0%
Traces | 3 | 1 | 4 | 75.0%
LTS | 2 | 2 | 4 | 50.0%
Composition | 5 | 9 | 14 | 35.7%

Gemini 3 - Thinking

Task type | Correct | Incorrect | Total | Accuracy
Identifying events | 4 | 0 | 4 | 100.0%
Identifying event orders | 7 | 0 | 7 | 100.0%
Abstraction | 1 | 9 | 10 | 10.0%
Traces | 2 | 2 | 4 | 50.0%
LTS | 0 | 4 | 4 | 0.0%
Composition | 10 | 4 | 14 | 71.4%

Qwen 3.6 Plus - Thinking

Task type | Correct | Incorrect | Total | Accuracy
Identifying events | 4 | 0 | 4 | 100.0%
Identifying event orders | 6 | 1 | 7 | 85.7%
Abstraction | 3 | 7 | 10 | 30.0%
Traces | 2 | 2 | 4 | 50.0%
LTS | 1 | 3 | 4 | 25.0%
Composition | 5 | 9 | 14 | 35.7%

Cross-Model Summary by Task Type

Task type | GPT-5.4 correct/total | GPT-5.4 accuracy | Gemini correct/total | Gemini accuracy | Qwen correct/total | Qwen accuracy
Identifying events | 3/4 | 75.0% | 4/4 | 100.0% | 4/4 | 100.0%
Identifying event orders | 5/7 | 71.4% | 7/7 | 100.0% | 6/7 | 85.7%
Abstraction | 2/10 | 20.0% | 1/10 | 10.0% | 3/10 | 30.0%
Traces | 3/4 | 75.0% | 2/4 | 50.0% | 2/4 | 50.0%
LTS | 2/4 | 50.0% | 0/4 | 0.0% | 1/4 | 25.0%
Composition | 5/14 | 35.7% | 10/14 | 71.4% | 5/14 | 35.7%
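Since the request was for a LaTeX version of these tables, the sketch below shows one way a table like the one above could be turned into a LaTeX `tabular`. It is an illustration under stated assumptions, not the script that produced the delivered file: the function name `to_latex_table`, the column alignment (first column left, the rest right), and the use of `\hline` rules are all choices made here.

```python
def to_latex_table(header: list[str], rows: list[list[str]]) -> str:
    """Render a header and rows as a LaTeX tabular, escaping % and &."""

    def esc(cell: object) -> str:
        # '%' starts a comment in LaTeX and '&' separates columns, so both
        # must be escaped inside cell text.
        return str(cell).replace("&", r"\&").replace("%", r"\%")

    col_spec = "l" + "r" * (len(header) - 1)
    lines = [r"\begin{tabular}{" + col_spec + "}", r"\hline"]
    lines.append(" & ".join(esc(h) for h in header) + r" \\")
    lines.append(r"\hline")
    for row in rows:
        lines.append(" & ".join(esc(c) for c in row) + r" \\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


print(to_latex_table(
    ["Task type", "GPT-5.4", "Gemini", "Qwen"],
    [["Identifying events", "75.0%", "100.0%", "100.0%"],
     ["LTS", "50.0%", "0.0%", "25.0%"]],
))
```

Escaping the percent sign is the step most easily missed: an unescaped `%` would comment out the rest of the row and break the table.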

Cross-Model Summary by Example

Example | GPT-5.4 correct/total | GPT-5.4 accuracy | Gemini correct/total | Gemini accuracy | Qwen correct/total | Qwen accuracy
1 | 4/7 | 57.1% | 2/7 | 28.6% | 2/7 | 28.6%
2 | 5/7 | 71.4% | 5/7 | 71.4% | 4/7 | 57.1%
3 | 4/7 | 57.1% | 2/7 | 28.6% | 3/7 | 42.9%
4 | 1/4 | 25.0% | 4/4 | 100.0% | 3/4 | 75.0%
5 | 2/4 | 50.0% | 3/4 | 75.0% | 2/4 | 50.0%
6 | 1/4 | 25.0% | 2/4 | 50.0% | 0/4 | 0.0%
7 | 1/1 | 100.0% | 1/1 | 100.0% | 0/1 | 0.0%
8 | 0/1 | 0.0% | 0/1 | 0.0% | 0/1 | 0.0%
9 | 2/8 | 25.0% | 5/8 | 62.5% | 7/8 | 87.5%

Task Count Coverage Checks

Coverage by task type

Task type | GPT-5.4 | Gemini | Qwen
Identifying events | 4 | 4 | 4
Identifying event orders | 7 | 7 | 7
Abstraction | 10 | 10 | 10
Traces | 4 | 4 | 4
LTS | 4 | 4 | 4
Composition | 14 | 14 | 14

Coverage by example

Example | GPT-5.4 | Gemini | Qwen
1 | 7 | 7 | 7
2 | 7 | 7 | 7
3 | 7 | 7 | 7
4 | 4 | 4 | 4
5 | 4 | 4 | 4
6 | 4 | 4 | 4
7 | 1 | 1 | 1
8 | 1 | 1 | 1
9 | 8 | 8 | 8
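The coverage checks above amount to counting judged tasks per model for each (example, task type) pair and flagging any pair where the models disagree. A minimal sketch of such a check follows; the `(model, example, task_type)` record layout and the function name `coverage_discrepancies` are hypothetical stand-ins for however the judgments were actually extracted.

```python
from collections import Counter, defaultdict

def coverage_discrepancies(records):
    """Return {(example, task_type): {model: count}} where models disagree.

    `records` is an iterable of (model, example, task_type) tuples
    (an assumed layout for the normalized annotation rows).
    """
    per_model = defaultdict(Counter)
    for model, example, task_type in records:
        per_model[model][(example, task_type)] += 1

    # Collect every (example, task_type) pair seen for any model.
    all_keys = set()
    for counter in per_model.values():
        all_keys.update(counter)

    discrepancies = {}
    for key in sorted(all_keys):
        # A Counter returns 0 for a missing key, so a task absent from
        # one model shows up as a count of 0 for that model.
        counts = {model: per_model[model][key] for model in per_model}
        if len(set(counts.values())) > 1:
            discrepancies[key] = counts
    return discrepancies
```

An empty result corresponds to the finding reported here: every model was judged on the same number of tasks for every example and task type.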

Interpretive Summary