As we start training LLMs as agents, we need to think about how best to pass information between the model and the real-world environment. For example, if the model calls an external function, how should arguments be passed? How should data from the environment be fed back to the model? The simplest (and most general) solution is a structured data format, such as JSON. These formats can encode arbitrarily nested data structures with a variety of types.
But is JSON the right choice? We have many options: TOML, YAML, XML, and more. In this post, we define and measure some metrics that can help us make the right choice.
Token efficiency
A fundamental limit of current LLMs is finite context. We can’t just stream the entire world of data into the model across time. This means we want to communicate the most useful information possible in the fewest tokens.
So, let’s test which structured data formats get tokenized most efficiently, without needing any real-world telemetry. To do this, we generate structured data (nested dicts and lists) with random keys constructed from the system dictionary. For example, in JSON:
[
  {
    "leptomeninx_xylitone": [
      147.396,
      { "asellus_spinelike_triliterality": null }
    ]
  },
  {
    "costively_zoetic": [["decipherably_wheat"]],
    "neurectome_sorcery_tangleproof": null
  }
]
The sampling process involves selecting a tree size (a node budget), then recursively choosing container types and terminal values at random. You might notice that the number of structural tokens used depends on what kind of data we’re dealing with. If the input to your agent doesn’t deal with arbitrarily nested data, a simpler specification might suffice. So, we define a set of shapes, which is exactly what it sounds like:
- nested: deep dict/list combinations ending in scalar values. Sample:

  [ { "comforter": { "dosadh_disruption_prosodiac": { "unsnatch_moslem": 837 }, "tone_redefine": { "cribrose_aoul": [ [ "christianization-casuariiformes-overbravery-chevronel" ] ] } } }, { "bovarysm": [ "oropharynx_consentant_fibronuclear", "bajardo-liquidy-calibered-belucki" ], "materialistic": { "paleostylic": -27.23, "praediality_juvenilify_benempt": 104, "roquelaure": -407 } }, { "filicites": [ "unpalatableness-allocaffeine", 126.204, { "manesheet": "emery_tricyclene" } ], "imposing_elchee_mentation": 3, "inadvisability": -12.726 } ]

- sparse: mostly null values with rare numeric or text scalars in nested layouts. Sample:

  [ { "areole_auramine_kojiki": { "hyperabsorption_uraniscorrhaphy": -776 }, "maplebush_piete": [ { "shadowgraphist": null, "stakeholder_busybodyness_crebrity": 644 } ], "preadamite": null }, { "bellmaking_brachydont": { "jalapin_chandelier_accelerando": null, "mandative": -79, "totora_peristaphylitis_graphy": null }, "subferryman_dephlegmator": [ { "manuka_uncriminally_archdeceiver": null } ] }, { "daytime": [ { "overfeminine_catholicist": -242.239, "sulfophthalein_irreciprocal": null } ], "gata": null, "macaranga_circuitman": null, "ostraciidae_subsidiariness": "throneward" } ]

- tabular: column-based tables with rows of scalar values and shared schema. Sample:

  { "columns": [ "viragoish_isogonality_swarming", "supralocally_nuncioship", "zoomorph", "cavitary_visie", "permutableness_impunity_bipack", "forby_archly", "rivinian", "unheal_annelidian_samurai" ], "rows": [ [ true, false, "cincinnatia-cyanhidrosis-auto", false, true, null, "acetosoluble nonexclamatory homogangliate croupal", -219 ], [ null, 836, -904, "metasomatic-mundanism-hotchpotchly-secantly", null, 309.642, "floodgate-baluchitherium-unimaginary-sheepkeeper", -396 ], [ "postcritical-tug", true, -948, 0.135, 399.166, -123, "palaeoniscus", true ] ] }
We consider the following formats:
- json: fully minified JSON with sorted keys and compact separators. Sample:

  [{"backlotter_overboast":"calligraphist_megabar_uninstructively","landspout_souper":[null],"liquefier_unconvicting":-151.898,"unbegot":[961],"unreformedness":-189.15},{"detriment_muckender":[469.486,{"aspergillum_sharebroker_akebia":337},-302.978],"heeder_aerophyte_unbase":499.655,"metamer_powsoddy":null},{"fascicled_fibrous_bajardo":{"octaeterid_pharmacolite_tentativeness":{"underfellow":83.76},"plethysmography_unchangeably_positioned":432.985,"transvestitism":82},"mirror":{"uninfallibility_benny":null}}]

- yaml: YAML serialization in block style with deterministic key ordering. Sample:

  - backlotter_overboast: calligraphist_megabar_uninstructively
    landspout_souper:
    - null
    liquefier_unconvicting: -151.898
    unbegot:
    - 961
    unreformedness: -189.15
  - detriment_muckender:
    - 469.486
    - aspergillum_sharebroker_akebia: 337
    - -302.978
    heeder_aerophyte_unbase: 499.655
    metamer_powsoddy: null
  - fascicled_fibrous_bajardo:
      octaeterid_pharmacolite_tentativeness:
        underfellow: 83.76
      plethysmography_unchangeably_positioned: 432.985
      transvestitism: 82
    mirror:
      uninfallibility_benny: null

- toml: TOML document wrapping records under a records array, with nulls stringified. Sample:

  [[records]]
  landspout_souper = [ "null", ]
  backlotter_overboast = "calligraphist_megabar_uninstructively"
  liquefier_unconvicting = -151.898
  unreformedness = -189.15
  unbegot = [ 961, ]
  [[records]]
  detriment_muckender = [ 469.486, { aspergillum_sharebroker_akebia = 337 }, -302.978, ]
  heeder_aerophyte_unbase = 499.655
  metamer_powsoddy = "null"
  [[records]]
  [records.fascicled_fibrous_bajardo]
  transvestitism = 82
  plethysmography_unchangeably_positioned = 432.985
  [records.fascicled_fibrous_bajardo.octaeterid_pharmacolite_tentativeness]
  underfellow = 83.76
  [records.mirror]
  uninfallibility_benny = "null"

- xml: verbose XML tree using semantic tags and explicit type names. Sample:

  <records>
    <object name="record" index="0">
      <array name="landspout_souper">
        <null name="0" />
      </array>
      <string name="backlotter_overboast">calligraphist_megabar_uninstructively</string>
      <number name="liquefier_unconvicting">-151.898</number>
      <number name="unreformedness">-189.15</number>
      <array name="unbegot">
        <number name="0">961</number>
      </array>
    </object>
    <object name="record" index="1">
      <array name="detriment_muckender">
        <number name="0">469.486</number>
        <object name="1">
          <number name="aspergillum_sharebroker_akebia">337</number>
        </object>
        <number name="2">-302.978</number>
      </array>
      <number name="heeder_aerophyte_unbase">499.655</number>
      <null name="metamer_powsoddy" />
    </object>
    <object name="record" index="2">
      <object name="fascicled_fibrous_bajardo">
        <number name="transvestitism">82</number>
        <object name="octaeterid_pharmacolite_tentativeness">
          <number name="underfellow">83.76</number>
        </object>
        <number name="plethysmography_unchangeably_positioned">432.985</number>
      </object>
      <object name="mirror">
        <null name="uninfallibility_benny" />
      </object>
    </object>
  </records>

- csv: headered comma-separated rows generated from tabular records. Sample:

  bicellular_russification_unsinister,crude_paynim,isoetales,postembryonic_encrisp
  braza apology catalufa tofu,,rampager,triformous
  ,True,481.226,
  421.281,868,photodysphoria,escortage
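For reference, here is roughly how these serializations can be produced in Python. This is my own sketch under stated assumptions: the experiment's exact libraries and options (for example, the TOML writer and its array style) may differ.

```python
import csv
import io
import json

import yaml      # PyYAML
import tomli_w   # assumed TOML writer; the original code may use a different library

def to_json_min(records):
    # Fully minified JSON: sorted keys, no whitespace in separators.
    return json.dumps(records, sort_keys=True, separators=(",", ":"))

def to_yaml(records):
    # Block-style YAML with deterministic (sorted) key ordering.
    return yaml.safe_dump(records, default_flow_style=False, sort_keys=True)

def _stringify_nulls(node):
    # TOML has no null type, so nulls become the string "null".
    if node is None:
        return "null"
    if isinstance(node, dict):
        return {k: _stringify_nulls(v) for k, v in node.items()}
    if isinstance(node, list):
        return [_stringify_nulls(v) for v in node]
    return node

def to_toml(records):
    # Wrap the list of records under a top-level "records" array of tables.
    return tomli_w.dumps({"records": [_stringify_nulls(r) for r in records]})

def to_csv(table):
    # `table` is the tabular shape: {"columns": [...], "rows": [[...], ...]}.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table["columns"])
    writer.writerows(table["rows"])
    return buf.getvalue()
```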
Now, for each format and each shape, we can plot a heatmap of the average number of tokens per node. Token counts come from averaging across the Qwen 3, Llama 3.2, and gpt-oss tokenizers.
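Counting tokens per serialization looks roughly like the following sketch. The specific checkpoints are my assumption; any tokenizer from each of the three model families should give similar relative numbers.

```python
from transformers import AutoTokenizer

# Assumed checkpoints for the three tokenizer families mentioned above.
TOKENIZER_NAMES = [
    "Qwen/Qwen3-8B",
    "meta-llama/Llama-3.2-1B",
    "openai/gpt-oss-20b",
]
TOKENIZERS = [AutoTokenizer.from_pretrained(name) for name in TOKENIZER_NAMES]

def tokens_per_node(serialized: str, n_nodes: int) -> float:
    """Token count averaged across tokenizers, normalized by tree size."""
    counts = [
        len(tok.encode(serialized, add_special_tokens=False))
        for tok in TOKENIZERS
    ]
    return sum(counts) / len(counts) / n_nodes
```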
At a glance, we can see that csv is a clear winner for tabular data, and json performs the best on average.
To get a clearer picture, we can average over the shapes and compare the average token count for each format.
This shows that on token efficiency alone, the ranking is json > yaml > toml > xml.
However, just because a format is dense doesn’t mean it’s good. How can we quantify that? What makes a format good for LLMs? I propose a simple metric that captures this, and it happens to double as a long-context/precision benchmark.
Format intuitiveness
An intuitive format is easy for language models to parse and generate. To measure intuitiveness, we propose the following benchmark. All runs use DeepSeek V3 (2025-09) in raw chat mode without tool use, so the model has to mentally execute the Python snippet.
- Given a format, an input tree size, and an output tree size:
- Generate an input data tree with the chosen number of nodes.
- Generate a Python program that defines a variable `target`, which queries the input data tree and evaluates to a nested data tree of the chosen output size.
- Prompt the model to produce `target`, serialized in our format (a rough sketch of this pipeline follows below).
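Here is a minimal sketch of one benchmark trial following those steps. The helper names, the path-sampling strategy, and the `{"results": ...}` wrapper are assumptions made for illustration; the sample prompt below shows what the model actually receives.

```python
import json
import random

def all_paths(node, path=()):
    """Enumerate every access path into a nested dict/list structure."""
    paths = [path] if path else []
    if isinstance(node, dict):
        for key, child in node.items():
            paths += all_paths(child, path + (key,))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            paths += all_paths(child, path + (i,))
    return paths

def make_snippet(paths):
    """Render the Python snippet the model must mentally execute."""
    lines = ["target = ["]
    for path in paths:
        lines.append("    data" + "".join(f"[{json.dumps(step)}]" for step in path) + ",")
    lines.append("]")
    return "\n".join(lines)

def resolve(data, path):
    # Follow one access path through the tree to get the reference value.
    for step in path:
        data = data[step]
    return data

def exact_match(response_text, reference):
    # Shown for json_min; other formats would use their own loaders.
    try:
        return json.loads(response_text) == {"results": reference}
    except json.JSONDecodeError:
        return False

# One trial, with a toy input tree standing in for a sampled one.
rng = random.Random(0)
data = {"a": [1, 2, {"b": None}], "c": "x"}
paths = rng.sample(all_paths(data), 3)
snippet = make_snippet(paths)
reference = [resolve(data, p) for p in paths]
# `snippet` and the serialized `data` go into the prompt; the model's fenced
# output is then parsed and checked against `reference`.
```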
Sample Prompt for JSON
Format: json_min
Input nodes observed: 8
Target output nodes: 9
Instructions:
- Parse the dataset into a Python variable named `data`.
- Execute the Python snippet below to populate a variable named `target`.
- Serialize `target` using the original format (json_min) and place the result inside a fenced code block tagged `json`.
- The code block must contain only the serialized data.
- Be very careful to make sure the format and structure match exactly.
Examples:

Example 1:

Dataset:
{"results":{"lo_unaddicted":[{"fleeting_geneserine_desmodynia":[-163.354]},{"subcrepitation_maddeningly":{"homoanisic":-3}},"helminth_vengeable"],"touchiness":[{"cataphyllum_educand":"remilitarize","unhumiliated_poorwill_oryctognostically":"resound","herrnhuter":false},["uptrace",["subastringent"],"scruff","theurgically_tritonymph",[-123]]],"ichthyornithes_revisionary":{"alcogel_freckle":{"inquisition":"lehi"},"oniomaniac_flamineous_ledgerdom":{"tylotoxeate":-141,"hemeralopia":272.837},"unremember":[false,[-30],true]},"amphiumidae":{"unenterprised_meltage":[149],"psilanthropist_garrulinae":{"averrable_deporter":399.228,"riotproof_terebratuloid_monophyodontism":-22},"coed":{"indigoid_pulicid":"airbrush_oenothera","paillasse":"rutelinae"},"inhume_photoprinting_pasturability":["chiselly_backfilling"],"route_anisopogonous":[{"kotal_schematization_zestfulness":-91}]},"unexcised_seamless_intwist":{"cordaitean":-108,"unrising":"monarchist"}}}
Python snippet:
target = [
data["amphiumidae"]["route_anisopogonous"][0],
data["amphiumidae"]["inhume_photoprinting_pasturability"],
data["touchiness"][1],
]
Response:
{"results":[{"kotal_schematization_zestfulness":-91},["chiselly_backfilling"],["uptrace",["subastringent"],"scruff","theurgically_tritonymph",[-123]]]}
Example 2:

Dataset:
{"results":[[["selachostomous",88.259,"altair_assiniboin",{"samphire_symbolology":{"scarfed_wambutti":-28}},"bocca_ponerid"],[["gibberosity","footway_antecardium",[true],["myxosporous"],"repopulate"]],{"prairied":-13,"amara_huccatoon_massivity":34,"alehouse_uncumber":154}],{"tartary_loculose":[[{"counterwind":"endophasic"}],[{"subhyaline_asiatical_tobikhar":"angolar_cheeriness","scutelliform_riverweed_putback":-7,"thirdsman_phlogistical_tropacocaine":"bawdry"}]],"hydrophore":[{"insubvertible":119,"overwomanize":{"cobble_orography_caprice":-127},"queriman_episcopally_railway":{"unadoration":["weedage"]},"stactometer_toggle_cleavability":[453.262]},{"forejudge_tacnode":{"undersupport":105},"floorward":-170,"dormer_abysmal_occasional":-484.491,"wheatgrower":346.849,"phobism_intendingly":91.698}]},{"conirostres":[{"monorhymed_kioway":"taxlessly","ungloriousness_urosternite":true},["pendanting_allegation",-30],["hemiobol","monont_paradoxial"]],"sistrum":[{"untaintable_polladz":true},[-162,true],{"preclassic_standoffishness_pagina":true}]},[{"earlock_unmantled":{"philoradical_micranthropos":-10,"derout":["unfrock",90.415]},"hepatologist_unrushed":-270.882},[[["argyrol_art"]],["daftness"],[-12,149.452]],[[{"loatuko":"floriken_tecali"},[-153.065],-51,153.874,"pile"]],{"hexacanth":[[-3,-19]]}]]}
Python snippet:
target = [
data[1]["tartary_loculose"][0][0],
data[1]["hydrophore"][1]["wheatgrower"],
data[1]["tartary_loculose"][1],
data[1]["hydrophore"][0]["queriman_episcopally_railway"]["unadoration"],
data[0][1][0][1],
data[0][2],
data[2]["conirostres"][0]["monorhymed_kioway"],
data[3][2],
]
Response:
{"results":[{"counterwind":"endophasic"},346.849,[{"subhyaline_asiatical_tobikhar":"angolar_cheeriness","scutelliform_riverweed_putback":-7,"thirdsman_phlogistical_tropacocaine":"bawdry"}],["weedage"],"footway_antecardium",{"prairied":-13,"amara_huccatoon_massivity":34,"alehouse_uncumber":154},"taxlessly",[[{"loatuko":"floriken_tecali"},[-153.065],-51,153.874,"pile"]]]}
Dataset:
{"results":["relict",{"intolerant_ignify":"cragginess_reapprobation","detriment_wholesalely_spillway":-49},true,"stewardess",-94]}
Python snippet:
target = [
data[1]["intolerant_ignify"],
data[4],
data[1]["detriment_wholesalely_spillway"],
data[2],
data[3],
data[1],
]
We’re going to omit XML because of its extreme verbosity. For each input and output size, we generate 5 data trees and prompt the LLM. Plotting the proportion of correct answers, we get
The plots can be interpreted as follows: green extending up the Y axis means the score scales well with large inputs, i.e. the format is legible; green extending far along the X axis means the score scales well with large output trees, i.e. the format is easy to generate. YAML is surprisingly poor, contrary to my intuition that it is the more ergonomic format. The model seems to handle TOML and JSON about equally well.
However, using exact match as a metric might be too strict. Instead, we can assign more credit to attempts that share more structure with the reference. We do this by computing the Jaccard index, or the intersection-over-union between the submitted answer and the reference. Plotting this, using the same data as the previous plot, we get
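One way to compute this overlap (my own sketch; the exact definition used in the experiment may differ) is to flatten both trees into sets of (path, value) pairs and take the intersection over the union:

```python
def flatten(node, path=()):
    """Flatten a nested structure into a set of (path, leaf value) pairs."""
    if isinstance(node, dict):
        children = node.items()
    elif isinstance(node, list):
        children = enumerate(node)
    else:
        return {(path, node)}
    pairs = set()
    for key, child in children:
        pairs |= flatten(child, path + (key,))
    return pairs

def jaccard(answer, reference):
    """Intersection-over-union of the two flattened trees."""
    a, b = flatten(answer), flatten(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A partially correct answer gets partial credit instead of a hard zero.
print(jaccard({"x": [1, 2]}, {"x": [1, 3]}))  # ~0.33
```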
We see a starker difference between JSON and TOML: the model achieves a much higher overlap with the correct answer when using TOML than when using JSON. YAML continues to perform poorly.
Conclusion
For me, the main takeaway from the data is: don’t use YAML. I’ve seen many people online say it’s better than JSON for LLMs, but this is definitely not true. It uses ~19% more tokens on average, and it is less legible and harder to write. TOML’s read/write performance seems to scale better than JSON’s, but it uses ~44% more tokens to encode the same data. For most uses, JSON seems to be the best bet.
Reproduce the results with the code: https://github.com/nathom/token-efficiency.