As we start training LLMs as agents, we need to think about how best to pass information between the model and the real-world environment. For example, if the model calls an external function, how should arguments be passed? How should data from the environment be fed back to the model? The simplest (and most general) solution is a structured data format, such as JSON. These formats can encode arbitrarily nested data structures with a variety of types.
But is JSON the right choice? We have many options: TOML, YAML, XML, and more. In this post, we define and measure some metrics that can help us make the right choice.
Token efficiency
A fundamental limit of current LLMs is finite context. We can’t just stream the entire world of data into the model across time. This means we want to communicate the most useful information possible in the fewest tokens.
So, let’s test which structured data formats get tokenized most efficiently, without needing any real-world telemetry. To do this, we generate structured data (nested dicts and lists) with random keys constructed from the system dictionary. For example, in JSON:
[
  {
    "leptomeninx_xylitone": [
      147.396,
      { "asellus_spinelike_triliterality": null }
    ]
  },
  {
    "costively_zoetic": [["decipherably_wheat"]],
    "neurectome_sorcery_tangleproof": null
  }
]
The sampling process involves selecting a tree size (a node budget), then recursively choosing container types and terminal values at random. You might notice that the number of structural tokens used depends on what kind of data we’re dealing with. If the input to your agent doesn’t deal with arbitrarily nested data, a simpler specification might suffice. So, we define a set of shapes, which is exactly what it sounds like:
- nested: deep dict/list combinations ending in scalar values. Sample:

  [ { "comforter": { "dosadh_disruption_prosodiac": { "unsnatch_moslem": 837 }, "tone_redefine": { "cribrose_aoul": [ [ "christianization-casuariiformes-overbravery-chevronel" ] ] } } }, { "bovarysm": [ "oropharynx_consentant_fibronuclear", "bajardo-liquidy-calibered-belucki" ], "materialistic": { "paleostylic": -27.23, "praediality_juvenilify_benempt": 104, "roquelaure": -407 } }, { "filicites": [ "unpalatableness-allocaffeine", 126.204, { "manesheet": "emery_tricyclene" } ], "imposing_elchee_mentation": 3, "inadvisability": -12.726 } ]

- sparse: mostly null values with rare numeric or text scalars in nested layouts. Sample:

  [ { "areole_auramine_kojiki": { "hyperabsorption_uraniscorrhaphy": -776 }, "maplebush_piete": [ { "shadowgraphist": null, "stakeholder_busybodyness_crebrity": 644 } ], "preadamite": null }, { "bellmaking_brachydont": { "jalapin_chandelier_accelerando": null, "mandative": -79, "totora_peristaphylitis_graphy": null }, "subferryman_dephlegmator": [ { "manuka_uncriminally_archdeceiver": null } ] }, { "daytime": [ { "overfeminine_catholicist": -242.239, "sulfophthalein_irreciprocal": null } ], "gata": null, "macaranga_circuitman": null, "ostraciidae_subsidiariness": "throneward" } ]

- tabular: column-based tables with rows of scalar values and shared schema. Sample:

  { "columns": [ "viragoish_isogonality_swarming", "supralocally_nuncioship", "zoomorph", "cavitary_visie", "permutableness_impunity_bipack", "forby_archly", "rivinian", "unheal_annelidian_samurai" ], "rows": [ [ true, false, "cincinnatia-cyanhidrosis-auto", false, true, null, "acetosoluble nonexclamatory homogangliate croupal", -219 ], [ null, 836, -904, "metasomatic-mundanism-hotchpotchly-secantly", null, 309.642, "floodgate-baluchitherium-unimaginary-sheepkeeper", -396 ], [ "postcritical-tug", true, -948, 0.135, 399.166, -123, "palaeoniscus", true ] ] }
We consider the following formats:
- json: fully minified JSON with sorted keys and compact separators. Sample:

  [{"backlotter_overboast":"calligraphist_megabar_uninstructively","landspout_souper":[null],"liquefier_unconvicting":-151.898,"unbegot":[961],"unreformedness":-189.15},{"detriment_muckender":[469.486,{"aspergillum_sharebroker_akebia":337},-302.978],"heeder_aerophyte_unbase":499.655,"metamer_powsoddy":null},{"fascicled_fibrous_bajardo":{"octaeterid_pharmacolite_tentativeness":{"underfellow":83.76},"plethysmography_unchangeably_positioned":432.985,"transvestitism":82},"mirror":{"uninfallibility_benny":null}}]

- yaml: YAML serialization in block style with deterministic key ordering. Sample:

  - backlotter_overboast: calligraphist_megabar_uninstructively
    landspout_souper:
    - null
    liquefier_unconvicting: -151.898
    unbegot:
    - 961
    unreformedness: -189.15
  - detriment_muckender:
    - 469.486
    - aspergillum_sharebroker_akebia: 337
    - -302.978
    heeder_aerophyte_unbase: 499.655
    metamer_powsoddy: null
  - fascicled_fibrous_bajardo:
      octaeterid_pharmacolite_tentativeness:
        underfellow: 83.76
      plethysmography_unchangeably_positioned: 432.985
      transvestitism: 82
    mirror:
      uninfallibility_benny: null

- toml: TOML document wrapping records under a records array, with nulls stringified. Sample:

  [[records]]
  landspout_souper = [ "null", ]
  backlotter_overboast = "calligraphist_megabar_uninstructively"
  liquefier_unconvicting = -151.898
  unreformedness = -189.15
  unbegot = [ 961, ]
  [[records]]
  detriment_muckender = [ 469.486, { aspergillum_sharebroker_akebia = 337 }, -302.978, ]
  heeder_aerophyte_unbase = 499.655
  metamer_powsoddy = "null"
  [[records]]
  [records.fascicled_fibrous_bajardo]
  transvestitism = 82
  plethysmography_unchangeably_positioned = 432.985
  [records.fascicled_fibrous_bajardo.octaeterid_pharmacolite_tentativeness]
  underfellow = 83.76
  [records.mirror]
  uninfallibility_benny = "null"

- xml: verbose XML tree using semantic tags and explicit type names. Sample:

  <records>
    <object name="record" index="0">
      <array name="landspout_souper">
        <null name="0" />
      </array>
      <string name="backlotter_overboast">calligraphist_megabar_uninstructively</string>
      <number name="liquefier_unconvicting">-151.898</number>
      <number name="unreformedness">-189.15</number>
      <array name="unbegot">
        <number name="0">961</number>
      </array>
    </object>
    <object name="record" index="1">
      <array name="detriment_muckender">
        <number name="0">469.486</number>
        <object name="1">
          <number name="aspergillum_sharebroker_akebia">337</number>
        </object>
        <number name="2">-302.978</number>
      </array>
      <number name="heeder_aerophyte_unbase">499.655</number>
      <null name="metamer_powsoddy" />
    </object>
    <object name="record" index="2">
      <object name="fascicled_fibrous_bajardo">
        <number name="transvestitism">82</number>
        <object name="octaeterid_pharmacolite_tentativeness">
          <number name="underfellow">83.76</number>
        </object>
        <number name="plethysmography_unchangeably_positioned">432.985</number>
      </object>
      <object name="mirror">
        <null name="uninfallibility_benny" />
      </object>
    </object>
  </records>

- csv: headered comma-separated rows generated from tabular records. Sample:

  bicellular_russification_unsinister,crude_paynim,isoetales,postembryonic_encrisp
  braza apology catalufa tofu,,rampager,triformous
  ,True,481.226,
  421.281,868,photodysphoria,escortage
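For reference, here is roughly how these serializations can be produced in Python. This is my own sketch under stated assumptions: the experiment's exact libraries and options (for example, the TOML writer and its array style) may differ.

```python
import csv
import io
import json

import yaml      # PyYAML
import tomli_w   # assumed TOML writer; the original code may use a different library

def to_json_min(records):
    # Fully minified JSON: sorted keys, no whitespace in separators.
    return json.dumps(records, sort_keys=True, separators=(",", ":"))

def to_yaml(records):
    # Block-style YAML with deterministic (sorted) key ordering.
    return yaml.safe_dump(records, default_flow_style=False, sort_keys=True)

def _stringify_nulls(node):
    # TOML has no null type, so nulls become the string "null".
    if node is None:
        return "null"
    if isinstance(node, dict):
        return {k: _stringify_nulls(v) for k, v in node.items()}
    if isinstance(node, list):
        return [_stringify_nulls(v) for v in node]
    return node

def to_toml(records):
    # Wrap the list of records under a top-level "records" array of tables.
    return tomli_w.dumps({"records": [_stringify_nulls(r) for r in records]})

def to_csv(table):
    # `table` is the tabular shape: {"columns": [...], "rows": [[...], ...]}.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table["columns"])
    writer.writerows(table["rows"])
    return buf.getvalue()
```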
Now, for each format and each shape, we can plot a heatmap of the average number of tokens per node. Token counts come from averaging across the Qwen 3, Llama 3.2, and gpt-oss tokenizers.
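Counting tokens per serialization looks roughly like the following sketch. The specific checkpoints are my assumption; any tokenizer from each of the three model families should give similar relative numbers.

```python
from transformers import AutoTokenizer

# Assumed checkpoints for the three tokenizer families mentioned above.
TOKENIZER_NAMES = [
    "Qwen/Qwen3-8B",
    "meta-llama/Llama-3.2-1B",
    "openai/gpt-oss-20b",
]
TOKENIZERS = [AutoTokenizer.from_pretrained(name) for name in TOKENIZER_NAMES]

def tokens_per_node(serialized: str, n_nodes: int) -> float:
    """Token count averaged across tokenizers, normalized by tree size."""
    counts = [
        len(tok.encode(serialized, add_special_tokens=False))
        for tok in TOKENIZERS
    ]
    return sum(counts) / len(counts) / n_nodes
```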
At a glance, we can see that csv is a clear winner for tabular data, and json performs the best on average.
To get a clearer picture, we can average over the shapes and compare the average token count for each format.
This shows that on token efficiency alone, the ranking is json > yaml > toml > xml.
However, just because a format is dense doesn’t mean it’s good. How can we quantify that? What makes a format good for LLMs? I propose a simple metric that captures this, and it happens to double as a long-context/precision benchmark.
Format intuitiveness
An intuitive format is easy for language models to parse and generate. To measure intuitiveness, we propose the following benchmark. All runs use DeepSeek V3 (2025-09) in raw chat mode without tool use, so the model has to mentally execute the Python snippet.
- Given a format, an input tree size, and an output tree size:
- Generate an input data tree with the chosen number of nodes.
- Generate a Python program that defines a variable `target`, which queries the input data tree and evaluates to a nested data tree of the chosen output size.
- Prompt the model to produce `target`, serialized in our format (a rough sketch of this pipeline follows below).
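Here is a minimal sketch of one benchmark trial following those steps. The helper names, the path-sampling strategy, and the `{"results": ...}` wrapper are assumptions made for illustration; the sample prompt below shows what the model actually receives.

```python
import json
import random

def all_paths(node, path=()):
    """Enumerate every access path into a nested dict/list structure."""
    paths = [path] if path else []
    if isinstance(node, dict):
        for key, child in node.items():
            paths += all_paths(child, path + (key,))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            paths += all_paths(child, path + (i,))
    return paths

def make_snippet(paths):
    """Render the Python snippet the model must mentally execute."""
    lines = ["target = ["]
    for path in paths:
        lines.append("    data" + "".join(f"[{json.dumps(step)}]" for step in path) + ",")
    lines.append("]")
    return "\n".join(lines)

def resolve(data, path):
    # Follow one access path through the tree to get the reference value.
    for step in path:
        data = data[step]
    return data

def exact_match(response_text, reference):
    # Shown for json_min; other formats would use their own loaders.
    try:
        return json.loads(response_text) == {"results": reference}
    except json.JSONDecodeError:
        return False

# One trial, with a toy input tree standing in for a sampled one.
rng = random.Random(0)
data = {"a": [1, 2, {"b": None}], "c": "x"}
paths = rng.sample(all_paths(data), 3)
snippet = make_snippet(paths)
reference = [resolve(data, p) for p in paths]
# `snippet` and the serialized `data` go into the prompt; the model's fenced
# output is then parsed and checked against `reference`.
```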
Sample Prompt for JSON
Format: json_min
Input nodes observed: 8
Target output nodes: 9
Instructions:
- Parse the dataset into a Python variable named `data`.
- Execute the Python snippet below to populate a variable named `target`.
- Serialize `target` using the original format (json_min) and place the result inside a fenced code block tagged `json`.
- The code block must contain only the serialized data.
- Be very careful to make sure the format and structure match exactly.
Examples:

Example 1:

Dataset:
{"results":{"lo_unaddicted":[{"fleeting_geneserine_desmodynia":[-163.354]},{"subcrepitation_maddeningly":{"homoanisic":-3}},"helminth_vengeable"],"touchiness":[{"cataphyllum_educand":"remilitarize","unhumiliated_poorwill_oryctognostically":"resound","herrnhuter":false},["uptrace",["subastringent"],"scruff","theurgically_tritonymph",[-123]]],"ichthyornithes_revisionary":{"alcogel_freckle":{"inquisition":"lehi"},"oniomaniac_flamineous_ledgerdom":{"tylotoxeate":-141,"hemeralopia":272.837},"unremember":[false,[-30],true]},"amphiumidae":{"unenterprised_meltage":[149],"psilanthropist_garrulinae":{"averrable_deporter":399.228,"riotproof_terebratuloid_monophyodontism":-22},"coed":{"indigoid_pulicid":"airbrush_oenothera","paillasse":"rutelinae"},"inhume_photoprinting_pasturability":["chiselly_backfilling"],"route_anisopogonous":[{"kotal_schematization_zestfulness":-91}]},"unexcised_seamless_intwist":{"cordaitean":-108,"unrising":"monarchist"}}}
Python snippet:
target = [
data["amphiumidae"]["route_anisopogonous"][0],
data["amphiumidae"]["inhume_photoprinting_pasturability"],
data["touchiness"][1],
]
Response:
{"results":[{"kotal_schematization_zestfulness":-91},["chiselly_backfilling"],["uptrace",["subastringent"],"scruff","theurgically_tritonymph",[-123]]]}
Example 2:

Dataset:
{"results":[[["selachostomous",88.259,"altair_assiniboin",{"samphire_symbolology":{"scarfed_wambutti":-28}},"bocca_ponerid"],[["gibberosity","footway_antecardium",[true],["myxosporous"],"repopulate"]],{"prairied":-13,"amara_huccatoon_massivity":34,"alehouse_uncumber":154}],{"tartary_loculose":[[{"counterwind":"endophasic"}],[{"subhyaline_asiatical_tobikhar":"angolar_cheeriness","scutelliform_riverweed_putback":-7,"thirdsman_phlogistical_tropacocaine":"bawdry"}]],"hydrophore":[{"insubvertible":119,"overwomanize":{"cobble_orography_caprice":-127},"queriman_episcopally_railway":{"unadoration":["weedage"]},"stactometer_toggle_cleavability":[453.262]},{"forejudge_tacnode":{"undersupport":105},"floorward":-170,"dormer_abysmal_occasional":-484.491,"wheatgrower":346.849,"phobism_intendingly":91.698}]},{"conirostres":[{"monorhymed_kioway":"taxlessly","ungloriousness_urosternite":true},["pendanting_allegation",-30],["hemiobol","monont_paradoxial"]],"sistrum":[{"untaintable_polladz":true},[-162,true],{"preclassic_standoffishness_pagina":true}]},[{"earlock_unmantled":{"philoradical_micranthropos":-10,"derout":["unfrock",90.415]},"hepatologist_unrushed":-270.882},[[["argyrol_art"]],["daftness"],[-12,149.452]],[[{"loatuko":"floriken_tecali"},[-153.065],-51,153.874,"pile"]],{"hexacanth":[[-3,-19]]}]]}
Python snippet:
target = [
data[1]["tartary_loculose"][0][0],
data[1]["hydrophore"][1]["wheatgrower"],
data[1]["tartary_loculose"][1],
data[1]["hydrophore"][0]["queriman_episcopally_railway"]["unadoration"],
data[0][1][0][1],
data[0][2],
data[2]["conirostres"][0]["monorhymed_kioway"],
data[3][2],
]
Response:
{"results":[{"counterwind":"endophasic"},346.849,[{"subhyaline_asiatical_tobikhar":"angolar_cheeriness","scutelliform_riverweed_putback":-7,"thirdsman_phlogistical_tropacocaine":"bawdry"}],["weedage"],"footway_antecardium",{"prairied":-13,"amara_huccatoon_massivity":34,"alehouse_uncumber":154},"taxlessly",[[{"loatuko":"floriken_tecali"},[-153.065],-51,153.874,"pile"]]]}
Dataset:
{"results":["relict",{"intolerant_ignify":"cragginess_reapprobation","detriment_wholesalely_spillway":-49},true,"stewardess",-94]}
Python snippet:
target = [
data[1]["intolerant_ignify"],
data[4],
data[1]["detriment_wholesalely_spillway"],
data[2],
data[3],
data[1],
]
We’re going to omit XML because of its extreme verbosity. For each input and output size, we generate 5 data trees and prompt the LLM. Plotting the proportion of correct answers, we get
The plots can be interpreted as follows: green extending up the Y axis means the score scales well with large inputs, i.e. the format is legible; green extending far along the X axis means the score scales well with large output trees, i.e. the format is easy to generate. YAML is surprisingly poor, contrary to my intuition that it is the more ergonomic format. The model seems to handle TOML and JSON about equally well.
However, using exact match as a metric might be too strict. Instead, we can assign more credit to attempts that share more structure with the reference. We do this by computing the Jaccard index, or the intersection-over-union between the submitted answer and the reference. Plotting this, using the same data as the previous plot, we get
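One way to compute this overlap (my own sketch; the exact definition used in the experiment may differ) is to flatten both trees into sets of (path, value) pairs and take the intersection over the union:

```python
def flatten(node, path=()):
    """Flatten a nested structure into a set of (path, leaf value) pairs."""
    if isinstance(node, dict):
        children = node.items()
    elif isinstance(node, list):
        children = enumerate(node)
    else:
        return {(path, node)}
    pairs = set()
    for key, child in children:
        pairs |= flatten(child, path + (key,))
    return pairs

def jaccard(answer, reference):
    """Intersection-over-union of the two flattened trees."""
    a, b = flatten(answer), flatten(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A partially correct answer gets partial credit instead of a hard zero.
print(jaccard({"x": [1, 2]}, {"x": [1, 3]}))  # ~0.33
```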
We see a starker difference between JSON and TOML: the model achieves a much higher overlap with the correct answer when using TOML than when using JSON. YAML continues to perform poorly.
Conclusion
For me, the main takeaway from the data is: don’t use YAML. I’ve seen many people online say it’s better than JSON for LLMs, but this is definitely not true. It uses ~19% more tokens on average, and it is less legible and harder to write. TOML’s read/write performance seems to scale better than JSON’s, but it uses ~44% more tokens to encode the same data. For most uses, JSON seems to be the best bet.
Reproduce the results with the code: https://github.com/nathom/token-efficiency.