<p>In recent years, there has been a significant surge in the capabilities of large language models (LLMs) in generating human-like text and performing a wide range of natural language processing tasks. State-of-the-art models like GPT-4o, OpenAI o1/o3, and Google's Gemini have achieved superior performance in knowledge QA, instruction-following, and code generation.</p>
<p>Despite recent advances, many real-world applications require not only fluency in the content of the output but also precise control over its structure. This includes tasks where the expected output must follow specific formats such as JSON, XML, LaTeX, HTML, or code in frameworks like React or Vue. These types of structured output are essential in domains like software development, data pipelines, user interface generation, and scientific publishing, where incorrect formatting can lead to disrupted pipelines or non-functional outputs.</p>
<!-------------------------------------------------------------------- Image Type SECTION -------------------------------------------------------------------->
<h2 class="title is-3">Rendered Images from different format types</h2>
<div class="content has-text-justified">
<p>
- We compare the performance of various models across the most frequent structured output formats.
- Across all types, GPT-4 and Claude consistently outperform other models by a significant margin.
- Open-source models demonstrate relatively strong performance in categories like JSON and YAML, which are more common formats.
- However, for more specialized formats like SVG, LaTeX equations, and React components, all models except the leading proprietary ones obtain lower scores.
- This indicates that existing models still struggle with complex structured output generation tasks.
+ Below are examples of various structured output formats rendered by models in our benchmark.
+ These examples showcase the diversity of data formats evaluated in StructEval,
+ from markup languages like HTML and LaTeX to data-interchange formats like JSON and YAML.
+ The quality and correctness of these rendered outputs contribute to the model's overall score.
+ <img src="static/images/imgs/000107.png" alt="Structured Output Example 1" style="width: 100%; max-width: 350px;">
+ <p class="caption">JSON Generation Example</p>
</div>
- <!-- Chart 4: XML -->
- <div class="chart-item">
- <canvas id="chart_XML"></canvas>
- <p class="chart-label">XML (187)</p>
+ <div class="column is-one-third">
+ <img src="static/images/imgs/000829.png" alt="Structured Output Example 2" style="width: 100%; max-width: 350px;">
+ <p class="caption">HTML Generation Example</p>
</div>
- <!-- Chart 5: Markdown -->
- <div class="chart-item">
- <canvas id="chart_Markdown"></canvas>
- <p class="chart-label">Markdown (165)</p>
+ <div class="column is-one-third">
+ <img src="static/images/imgs/000936.png" alt="Structured Output Example 3" style="width: 100%; max-width: 350px;">
+ <p class="caption">React Component Example</p>
</div>
- <!-- Chart 6: CSV -->
- <div class="chart-item">
- <canvas id="chart_CSV"></canvas>
- <p class="chart-label">CSV (155)</p>
+ <div class="column is-one-third">
+ <img src="static/images/imgs/001126.png" alt="Structured Output Example 4" style="width: 100%; max-width: 350px;">
+ <p class="caption">SVG Generation Example</p>
</div>
- <!-- Chart 7: LaTeX -->
- <div class="chart-item">
- <canvas id="chart_LaTeX"></canvas>
- <p class="chart-label">LaTeX (142)</p>
+ <div class="column is-one-third">
+ <img src="static/images/imgs/001210.png" alt="Structured Output Example 5" style="width: 100%; max-width: 350px;">
+ <p class="caption">LaTeX Equation Example</p>
</div>
- <!-- Chart 8: SQL -->
- <div class="chart-item">
- <canvas id="chart_SQL"></canvas>
- <p class="chart-label">SQL (135)</p>
- </div>
- <!-- Chart 9: SVG -->
- <div class="chart-item">
- <canvas id="chart_SVG"></canvas>
- <p class="chart-label">SVG (127)</p>
- </div>
- <!-- Chart 10: React -->
- <div class="chart-item">
- <canvas id="chart_React"></canvas>
- <p class="chart-label">React (118)</p>
+ <div class="column is-one-third">
+ <img src="static/images/imgs/001312.png" alt="Structured Output Example 6" style="width: 100%; max-width: 350px;">
+ <p class="caption">YAML to XML Conversion Example</p>
</div>
</div>
- <p class="bottom-text">Selected models' performance on different structured format types. The numbers indicate the count of examples for each format in the benchmark.</p>
+ <p class="bottom-text">Examples of different structured format types generated by models in our benchmark.</p>
</div>
</div>
</div>
@@ -562,6 +523,23 @@ <h2 class="title is-3">Generation VS Conversion Tasks</h2>