Comparing Structured Data Formats for Large Language Models

October 12, 2025 · 3 min read

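As we start training large language models as agents, we need to think about how best to pass information between the model and real-world environments. If the model needs to call an external function, how should the arguments be passed? How should data from the environment be fed back to the model? The simplest (and most general) solution is a structured data format such as JSON. These formats can encode arbitrarily nested, heterogeneously typed data structures.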

But is JSON the best choice? There are plenty of options, including TOML, YAML, and XML. In this post we explore and measure a few key metrics to help make the right choice.

Token efficiency

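A fundamental limitation of current large language models is finite context. We cannot feed the model the entire world of data over time, so we want to convey as much useful information as possible in as few tokens as possible.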

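So let's test which structured data format tokenizes most efficiently, without needing real telemetry data. To do this, we randomly generate structured data (nested dicts and lists) whose keys are built from the system dictionary. For example, in JSON: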

[
  {
    "leptomeninx_xylitone": [
      147.396,
      { "asellus_spinelike_triliterality": null }
    ]
  },
  {
    "costively_zoetic": [["decipherably_wheat"]],
    "neurectome_sorcery_tangleproof": null
  }
]

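The sampling process picks a tree size $N$, then recursively chooses container types and terminal values at random. You may notice that the number of structural tokens used depends on the kind of data we are dealing with. If your agent's inputs do not involve arbitrarily nested data, a simpler specification may be enough. With that in mind, we define a set of shapes: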

nested: deep dict/list compositions ending in scalar values.
Example
[ { "comforter": { "dosadh_disruption_prosodiac": { "unsnatch_moslem": 837 }, "tone_redefine": { "cribrose_aoul": [ [ "christianization-casuariiformes-overbravery-chevronel" ] ] } } }, { "bovarysm": [ "oropharynx_consentant_fibronuclear", "bajardo-liquidy-calibered-belucki" ], "materialistic": { "paleostylic": -27.23, "praediality_juvenilify_benempt": 104, "roquelaure": -407 } }, { "filicites": [ "unpalatableness-allocaffeine", 126.204, { "manesheet": "emery_tricyclene" } ], "imposing_elchee_mentation": 3, "inadvisability": -12.726 } ]
sparse: mostly null values, with occasional numeric or text scalars in a nested layout.
Example
[ { "areole_auramine_kojiki": { "hyperabsorption_uraniscorrhaphy": -776 }, "maplebush_piete": [ { "shadowgraphist": null, "stakeholder_busybodyness_crebrity": 644 } ], "preadamite": null }, { "bellmaking_brachydont": { "jalapin_chandelier_accelerando": null, "mandative": -79, "totora_peristaphylitis_graphy": null }, "subferryman_dephlegmator": [ { "manuka_uncriminally_archdeceiver": null } ] }, { "daytime": [ { "overfeminine_catholicist": -242.239, "sulfophthalein_irreciprocal": null } ], "gata": null, "macaranga_circuitman": null, "ostraciidae_subsidiariness": "throneward" } ]
tabular: column-based tables with rows of scalar values and a shared schema.
Example
{ "columns": [ "viragoish_isogonality_swarming", "supralocally_nuncioship", "zoomorph", "cavitary_visie", "permutableness_impunity_bipack", "forby_archly", "rivinian", "unheal_annelidian_samurai" ], "rows": [ [ true, false, "cincinnatia-cyanhidrosis-auto", false, true, null, "acetosoluble nonexclamatory homogangliate croupal", -219 ], [ null, 836, -904, "metasomatic-mundanism-hotchpotchly-secantly", null, 309.642, "floodgate-baluchitherium-unimaginary-sheepkeeper", -396 ], [ "postcritical-tug", true, -948, 0.135, 399.166, -123, "palaeoniscus", true ] ] }
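The sampling procedure described before the shape list can be sketched roughly as follows. This is a minimal illustration, not the post's actual generator: the word list, the value ranges, the 50/50 choice between list and dict, and the convention that every container and scalar counts as one node are all assumptions.

```python
import random

# Hypothetical stand-in for the system dictionary mentioned in the post.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]


def random_key(rng):
    """Join one to three dictionary words with underscores."""
    return "_".join(rng.choice(WORDS) for _ in range(rng.randint(1, 3)))


def random_scalar(rng):
    """Pick a terminal value: null, int, float, or string (assumed mix)."""
    kind = rng.choice(["null", "int", "float", "str"])
    if kind == "null":
        return None
    if kind == "int":
        return rng.randint(-999, 999)
    if kind == "float":
        return round(rng.uniform(-500, 500), 3)
    return random_key(rng)


def random_tree(n, rng):
    """Build a tree with exactly n nodes (assumption: containers count too)."""
    if n <= 1:
        return random_scalar(rng)
    remaining = n - 1  # one node is spent on the container itself
    sizes = []
    while remaining > 0:
        size = rng.randint(1, remaining)
        sizes.append(size)
        remaining -= size
    children = [random_tree(size, rng) for size in sizes]
    if rng.random() < 0.5:
        return children
    # The index suffix keeps dict keys unique within one container.
    return {f"{random_key(rng)}_{i}": child for i, child in enumerate(children)}


def count_nodes(tree):
    """Count containers and scalars alike."""
    if isinstance(tree, dict):
        return 1 + sum(count_nodes(v) for v in tree.values())
    if isinstance(tree, list):
        return 1 + sum(count_nodes(v) for v in tree)
    return 1
```

Calling random_tree(12, random.Random(0)) returns a nested structure with 12 nodes under this counting convention. A shape such as sparse could be approximated by weighting random_scalar towards None.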


We consider the following formats:


json: fully minified JSON with sorted keys and compact separators.

  Example
  
    
    
      [{"backlotter_overboast":"calligraphist_megabar_uninstructively","landspout_souper":[null],"liquefier_unconvicting":-151.898,"unbegot":[961],"unreformedness":-189.15},{"detriment_muckender":[469.486,{"aspergillum_sharebroker_akebia":337},-302.978],"heeder_aerophyte_unbase":499.655,"metamer_powsoddy":null},{"fascicled_fibrous_bajardo":{"octaeterid_pharmacolite_tentativeness":{"underfellow":83.76},"plethysmography_unchangeably_positioned":432.985,"transvestitism":82},"mirror":{"uninfallibility_benny":null}}]
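One plausible way to produce this minified form with Python's standard library is shown below. The helper name to_json_min is made up for illustration, and the post's own serializer may differ in details.

```python
import json


def to_json_min(data):
    """Serialize with sorted keys and no whitespace after separators."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


record = {
    "unbegot": [961],
    "landspout_souper": [None],
    "liquefier_unconvicting": -151.898,
}
print(to_json_min(record))
# {"landspout_souper":[null],"liquefier_unconvicting":-151.898,"unbegot":[961]}
```

Compact separators matter here because the default json.dumps output puts a space after every comma and colon, which adds structural overhead to every node.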

    
  





yaml: block-style YAML serialization with deterministic key order.

  Example

- backlotter_overboast: calligraphist_megabar_uninstructively
  landspout_souper:
  - null
  liquefier_unconvicting: -151.898
  unbegot:
  - 961
  unreformedness: -189.15
- detriment_muckender:
  - 469.486
  - aspergillum_sharebroker_akebia: 337
  - -302.978
  heeder_aerophyte_unbase: 499.655
  metamer_powsoddy: null
- fascicled_fibrous_bajardo:
    octaeterid_pharmacolite_tentativeness:
      underfellow: 83.76
    plethysmography_unchangeably_positioned: 432.985
    transvestitism: 82
  mirror:
    uninfallibility_benny: null

    
  





toml: a TOML document that wraps the records in a records array, with null values stringified.

  Example

[[records]]
landspout_souper = [
    "null",
]
backlotter_overboast = "calligraphist_megabar_uninstructively"
liquefier_unconvicting = -151.898
unreformedness = -189.15
unbegot = [
    961,
]

[[records]]
detriment_muckender = [
    469.486,
    { aspergillum_sharebroker_akebia = 337 },
    -302.978,
]
heeder_aerophyte_unbase = 499.655
metamer_powsoddy = "null"

[[records]]

[records.fascicled_fibrous_bajardo]
transvestitism = 82
plethysmography_unchangeably_positioned = 432.985

[records.fascicled_fibrous_bajardo.octaeterid_pharmacolite_tentativeness]
underfellow = 83.76

[records.mirror]
uninfallibility_benny = "null"
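TOML has no null type, which is why the nulls appear as the string "null" in the example above. A minimal sketch of that preprocessing step, assuming it happens before the data is handed to a TOML writer (the function names here are hypothetical):

```python
def stringify_nulls(value):
    """Recursively replace None with the string "null"."""
    if value is None:
        return "null"
    if isinstance(value, dict):
        return {key: stringify_nulls(child) for key, child in value.items()}
    if isinstance(value, list):
        return [stringify_nulls(child) for child in value]
    return value


def wrap_records(records):
    """Place the list of records under a top-level 'records' key."""
    return {"records": stringify_nulls(records)}
```

The result can then be passed to any TOML writer. Python's built-in tomllib only parses TOML, so writing needs a separate library.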

    
  





xml: verbose XML tree with semantic tags and explicit type names.

  Example

<records>
  <object name="record" index="0">
    <array name="landspout_souper">
      <null name="0" />
    </array>
    <string name="backlotter_overboast">calligraphist_megabar_uninstructively</string>
    <number name="liquefier_unconvicting">-151.898</number>
    <number name="unreformedness">-189.15</number>
    <array name="unbegot">
      <number name="0">961</number>
    </array>
  </object>
  <object name="record" index="1">
    <array name="detriment_muckender">
      <number name="0">469.486</number>
      <object name="1">
        <number name="aspergillum_sharebroker_akebia">337</number>
      </object>
      <number name="2">-302.978</number>
    </array>
    <number name="heeder_aerophyte_unbase">499.655</number>
    <null name="metamer_powsoddy" />
  </object>
  <object name="record" index="2">
    <object name="fascicled_fibrous_bajardo">
      <number name="transvestitism">82</number>
      <object name="octaeterid_pharmacolite_tentativeness">
        <number name="underfellow">83.76</number>
      </object>
      <number name="plethysmography_unchangeably_positioned">432.985</number>
    </object>
    <object name="mirror">
      <null name="uninfallibility_benny" />
    </object>
  </object>
</records>
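The layout above can be approximated with xml.etree.ElementTree. This is a sketch inferred from the example only: the boolean tag name and the lowercase true/false text are assumptions, since the example shows no booleans.

```python
import xml.etree.ElementTree as ET


def type_tag(value):
    """Map a Python value to the tag name used for its XML element."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int: bool subclasses int
        return "boolean"         # assumed tag name
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def add_node(parent, name, value):
    """Append value under parent, carrying its key or index as 'name'."""
    elem = ET.SubElement(parent, type_tag(value), {"name": str(name)})
    if isinstance(value, dict):
        for key, child in value.items():
            add_node(elem, key, child)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            add_node(elem, index, child)
    elif isinstance(value, bool):
        elem.text = "true" if value else "false"
    elif value is not None:
        elem.text = str(value)
    return elem


def to_xml(records):
    """Serialize a list of records under a <records> root."""
    root = ET.Element("records")
    for index, record in enumerate(records):
        add_node(root, "record", record).set("index", str(index))
    ET.indent(root)  # pretty-printing; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")
```

Every node pays for an opening tag, a name attribute, and usually a closing tag, which is where the verbosity comes from.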

    
  





csv: comma-separated rows with a header, generated from tabular records.

  Example

bicellular_russification_unsinister,crude_paynim,isoetales,postembryonic_encrisp
braza apology catalufa tofu,,rampager,triformous
,True,481.226,
421.281,868,photodysphoria,escortage
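A CSV like the one above can be produced from a tabular record with the standard csv module. The sketch assumes the input uses the columns/rows layout of the tabular shape, and to_csv is an illustrative name.

```python
import csv
import io


def to_csv(table):
    """Write a {'columns': [...], 'rows': [[...], ...]} table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["columns"])  # header row
    writer.writerows(table["rows"])    # None becomes an empty field
    return buffer.getvalue()


table = {"columns": ["a", "b"], "rows": [[None, True], [1.5, "x y"]]}
print(to_csv(table), end="")
# a,b
# ,True
# 1.5,x y
```

csv.writer emits None as an empty field and stringifies other values with str(), which matches the empty cells and the capitalized True in the example. It also means type information is lost, since a reader cannot tell the number 868 from the string "868".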

    
  





Now, for each format and each shape, we can plot a heatmap of the average number of tokens per node. Token counts are averaged over the Qwen 3, Llama 3.2, and gpt-oss tokenizers.
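The tokens-per-node metric can be sketched as below. The two tokenizers here are toy placeholders, not the Qwen 3, Llama 3.2, or gpt-oss tokenizers, and the order of averaging (mean over tokenizers first, then division by node count) is an assumption.

```python
import re


def count_nodes(tree):
    """Count containers and scalars alike (assumed node definition)."""
    if isinstance(tree, dict):
        return 1 + sum(count_nodes(v) for v in tree.values())
    if isinstance(tree, list):
        return 1 + sum(count_nodes(v) for v in tree)
    return 1


def toy_word_tokenizer(text):
    """Placeholder tokenizer: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_chunk_tokenizer(text, width=4):
    """Placeholder tokenizer: fixed-width character chunks."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def tokens_per_node(text, tree, tokenizers):
    """Average the token count over tokenizers, then divide by node count."""
    counts = [len(tokenize(text)) for tokenize in tokenizers]
    return (sum(counts) / len(counts)) / count_nodes(tree)


tree = {"a": [1, None]}
serialized = '{"a":[1,null]}'
score = tokens_per_node(serialized, tree, [toy_word_tokenizer, toy_chunk_tokenizer])
```

In a real run, each entry in tokenizers would wrap the encode function of one of the actual model tokenizers, and the score would be averaged over many sampled trees per format and shape.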

  
    
  
  
    
  
  


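At a glance, csv is the clear winner for tabular data, while json performs best on average.
For a clearer picture, we can average over the shapes to get the mean token count for each format.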

  
    
  
  
    
  
  


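This shows that, on token efficiency alone, the ranking from most to least efficient is json > yaml > toml > xml.
However, a compact format is not necessarily a good one. But how do we quantify that?
What makes a format good for a large language model? I propose a simple metric, which also happens to work as a long-context/precision benchmark, to capture this.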
Format intuitiveness
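An intuitive format is one that is easy for a language model to parse and generate. To measure intuitiveness, we propose the following benchmark. All runs use DeepSeek V3 (2025-09) in raw chat mode with no tools, so the model has to execute the Python snippets "in its head".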

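Given a format $F$, an input tree size $N$, and an output tree size $M$:
1. Generate an input data tree with $N$ nodes.
2. Generate a Python program that defines a variable target, which evaluates to a nested data tree of size $M$ that queries the input data tree.
3. Prompt the model to produce target serialized in our format $F$.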
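The task-generation step could look something like the sketch below. It is an assumption-laden illustration: the greedy selection only guarantees a target of at most $M$ nodes rather than exactly $M$, and all function names are made up.

```python
import json
import random


def count_nodes(tree):
    """Count containers and scalars alike (assumed node definition)."""
    if isinstance(tree, dict):
        return 1 + sum(count_nodes(v) for v in tree.values())
    if isinstance(tree, list):
        return 1 + sum(count_nodes(v) for v in tree)
    return 1


def iter_paths(tree, prefix=()):
    """Yield (path, subtree) for every node below the root."""
    if isinstance(tree, dict):
        items = tree.items()
    elif isinstance(tree, list):
        items = enumerate(tree)
    else:
        return
    for key, child in items:
        path = prefix + (key,)
        yield path, child
        yield from iter_paths(child, path)


def path_expr(path):
    """Render a path as a Python indexing expression on 'data'."""
    return "data" + "".join(f"[{json.dumps(key)}]" for key in path)


def make_snippet(data, m, rng):
    """Greedily pick random subtrees until the node budget is used up."""
    candidates = list(iter_paths(data))
    rng.shuffle(candidates)
    budget = m - 1  # the outer list counts as one node
    lines = []
    for path, subtree in candidates:
        size = count_nodes(subtree)
        if size <= budget:
            lines.append(f"    {path_expr(path)},")
            budget -= size
    return "target = [\n" + "\n".join(lines) + "\n]"


def run_snippet(snippet, data):
    """Execute the snippet to obtain the reference target."""
    scope = {"data": data}
    exec(snippet, scope)
    return scope["target"]
```

Running the generated snippet locally gives the reference answer, which is then serialized in format $F$ and compared with what the model returns.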


  Example JSON prompt
  
    
    
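Format: json_min Input nodes observed: 8 Target output nodes: 9
Instructions:

Parse the dataset into a Python variable named data.
Execute the Python snippet below to populate a variable named target.
Serialize target using the original format (json_min) and place the result in a fenced code block labeled json.
The code block should contain only the serialized data.
Make sure the format and structure match exactly.

Examples: Example 1: Dataset: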
{"results":{"lo_unaddicted":[{"fleeting_geneserine_desmodynia":[-163.354]},{"subcrepitation_maddeningly":{"homoanisic":-3}},"helminth_vengeable"],"touchiness":[{"cataphyllum_educand":"remilitarize","unhumiliated_poorwill_oryctognostically":"resound","herrnhuter":false},["uptrace",["subastringent"],"scruff","theurgically_tritonymph",[-123]]],"ichthyornithes_revisionary":{"alcogel_freckle":{"inquisition":"lehi"},"oniomaniac_flamineous_ledgerdom":{"tylotoxeate":-141,"hemeralopia":272.837},"unremember":[false,[-30],true]},"amphiumidae":{"unenterprised_meltage":[149],"psilanthropist_garrulinae":{"averrable_deporter":399.228,"riotproof_terebratuloid_monophyodontism":-22},"coed":{"indigoid_pulicid":"airbrush_oenothera","paillasse":"rutelinae"},"inhume_photoprinting_pasturability":["chiselly_backfilling"],"route_anisopogonous":[{"kotal_schematization_zestfulness":-91}]},"unexcised_seamless_intwist":{"cordaitean":-108,"unrising":"monarchist"}}}
Python snippet:
target = [
    data["amphiumidae"]["route_anisopogonous"][0],
    data["amphiumidae"]["inhume_photoprinting_pasturability"],
    data["touchiness"][1],
]
Response:
{"results":[{"kotal_schematization_zestfulness":-91},["chiselly_backfilling"],["uptrace",["subastringent"],"scruff","theurgically_tritonymph",[-123]]]}
Example 2: Dataset:
{"results":[[["selachostomous",88.259,"altair_assiniboin",{"samphire_symbolology":{"scarfed_wambutti":-28}},"bocca_ponerid"],[["gibberosity","footway_antecardium",[true],["myxosporous"],"repopulate"]],{"prairied":-13,"amara_huccatoon_massivity":34,"alehouse_uncumber":154}],{"tartary_loculose":[[{"counterwind":"endophasic"}],[{"subhyaline_asiatical_tobikhar":"angolar_cheeriness","scutelliform_riverweed_putback":-7,"thirdsman_phlogistical_tropacocaine":"bawdry"}]],"hydrophore":[{"insubvertible":119,"overwomanize":{"cobble_orography_caprice":-127},"queriman_episcopally_railway":{"unadoration":["weedage"]},"stactometer_toggle_cleavability":[453.262]},{"forejudge_tacnode":{"undersupport":105},"floorward":-170,"dormer_abysmal_occasional":-484.491,"wheatgrower":346.849,"phobism_intendingly":91.698}]},{"conirostres":[{"monorhymed_kioway":"taxlessly","ungloriousness_urosternite":true},["pendanting_allegation",-30],["hemiobol","monont_paradoxial"]],"sistrum":[{"untaintable_polladz":true},[-162,true],{"preclassic_standoffishness_pagina":true}]},[{"earlock_unmantled":{"philoradical_micranthropos":-10,"derout":["unfrock",90.415]},"hepatologist_unrushed":-270.882},[[["argyrol_art"]],["daftness"],[-12,149.452]],[[{"loatuko":"floriken_tecali"},[-153.065],-51,153.874,"pile"]],{"hexacanth":[[-3,-19]]}]]}
Python snippet:
target = [
    data[1]["tartary_loculose"][0][0],
    data[1]["hydrophore"][1]["wheatgrower"],
    data[1]["tartary_loculose"][1],
    data[1]["hydrophore"][0]["queriman_episcopally_railway"]["unadoration"],
    data[0][1][0][1],
    data[0][2],
    data[2]["conirostres"][0]["monorhymed_kioway"],
    data[3][2],
]
Response:
{"results":[{"counterwind":"endophasic"},346.849,[{"subhyaline_asiatical_tobikhar":"angolar_cheeriness","scutelliform_riverweed_putback":-7,"thirdsman_phlogistical_tropacocaine":"bawdry"}],["weedage"],"footway_antecardium",{"prairied":-13,"amara_huccatoon_massivity":34,"alehouse_uncumber":154},"taxlessly",[[{"loatuko":"floriken_tecali"},[-153.065],-51,153.874,"pile"]]]}
Dataset:
{"results":["relict",{"intolerant_ignify":"cragginess_reapprobation","detriment_wholesalely_spillway":-49},true,"stewardess",-94]}
Python snippet:
target = [
    data[1]["intolerant_ignify"],
    data[4],
    data[1]["detriment_wholesalely_spillway"],
    data[2],
    data[3],
    data[1],
]

    
  



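We omit XML because it is extremely verbose. For each input and output size, we generate 5 data trees and prompt the LLM. Plotting the fraction of correct answers, we get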

      
      

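[Figure: fraction of exact-match answers by input tree size (Y axis) and output tree size (X axis), with one panel each for JSON, YAML, and TOML]
JSON and TOML perform similarly, but TOML is easier to read. YAML is hard for DeepSeek to generate.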
      

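These plots can be read as follows: green high up the Y axis means the score scales well to large inputs, so the format is easy to read. Green far along the X axis means the score scales well to large output trees, so the format is easy to generate. YAML does surprisingly badly, which goes against my intuition that it is the more ergonomic format. The model seems to like TOML and JSON about equally.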
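For the JSON case, exact-match scoring could be sketched as follows. The post does not say whether it compares parsed values or raw strings, so comparing parsed values is an assumption, as are the function names.

```python
import json
import re

# Matches a fenced block labeled json; `{3} stands for three backticks.
BLOCK_RE = re.compile(r"`{3}json\s*\n(.*?)`{3}", re.DOTALL)


def parse_response(response):
    """Return (ok, value) for the first json-labeled fenced block."""
    match = BLOCK_RE.search(response)
    if match is None:
        return False, None
    try:
        return True, json.loads(match.group(1))
    except json.JSONDecodeError:
        return False, None


def exact_match(response, reference):
    """True only if the block parses and equals the reference."""
    ok, value = parse_response(response)
    return ok and value == reference


def accuracy(responses, references):
    """Fraction of responses that exactly match their reference."""
    hits = sum(exact_match(r, ref) for r, ref in zip(responses, references))
    return hits / len(responses)
```

Note that Python's == treats True and 1 as equal, so a stricter benchmark might compare canonical serializations instead of parsed values.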
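However, exact match may be too strict a metric. Instead, we can give more credit to attempts that share more structure with the reference answer. We do this by computing the Jaccard index, the intersection over union between the submitted answer and the reference answer. Plotting this with the same data as the previous figure, we get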

      
      

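[Figure: Jaccard index by input tree size (Y axis) and output tree size (X axis), with one panel each for JSON, YAML, and TOML]
The Jaccard index gives a smoother picture of accuracy. TOML performs very well, with JSON close behind.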
      

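The performance gap between JSON and TOML is now more pronounced. With TOML, the model achieves higher overlap with the correct answer than with JSON. YAML still performs poorly.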
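One way to compute a Jaccard index over two data trees is to flatten each into a set of (path, leaf) pairs and take the intersection over union. The post does not specify which elements it compares, so this choice of set elements is an assumption.

```python
def flatten(tree, prefix=()):
    """Flatten a tree into a set of (path, repr(leaf)) pairs."""
    if isinstance(tree, dict):
        items = list(tree.items())
    elif isinstance(tree, list):
        items = list(enumerate(tree))
    else:
        # repr keeps True distinct from 1 and "1" distinct from 1.
        return {(prefix, repr(tree))}
    if not items:
        # Record empty containers so they still count as an element.
        return {(prefix, type(tree).__name__)}
    pairs = set()
    for key, child in items:
        pairs |= flatten(child, prefix + (key,))
    return pairs


def jaccard(submitted, reference):
    """Intersection over union of the flattened trees."""
    a, b = flatten(submitted), flatten(reference)
    return len(a & b) / len(a | b)


reference = {"results": [1, 2, 3]}
submitted = {"results": [1, 2, 4]}
print(jaccard(submitted, reference))  # 0.5
```

In this example two of the three leaves agree, so the intersection has 2 elements and the union has 4. Because paths include list indices, a single dropped item early in a list shifts every later index, which this simple variant penalizes heavily.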
Conclusion
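The main takeaway from the data is: don't use YAML. I have seen many people online claim that it is better than JSON for large language models, but that is almost certainly not true. It uses about 19% more tokens on average and is far less readable. TOML's read and write performance seems to scale better than JSON's, but it needs about 44% more tokens to encode the same data. For most purposes, JSON is the best choice.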
These results can be reproduced with the code on GitHub.



  

  

  
  
    AI transparency
    The plotting and experiment code for this post was written with Codex CLI and reviewed and tested by me. All of the text was written by me.
  
  
  

  
  
  








  

  

  
  

    
    