FlexBuffer Binary Serialization

Current Status

reaktor-flexbuffer is now more than a serializer swap: it is a generated-code path over an internalized FlexBuffers runtime, with explicit hot paths for primitive fields, typed vectors, map field indexing, builder pooling, per-platform encode/decode pools, per-platform UTF-8 fast paths, and a reproducible sample-based iOS profiling pipeline.

What is working today

KSP generates FlexCoder<T> + @JvmInline value class Accessor + asXxx() extensions for every @Struct model.
Generated encode writes map keys alphabetically and calls endMap(..., presorted = true).
Generated decode reads fields by index, not key lookup; zero Reference allocation for primitives.
FlexDecoderV2 has descriptor field-index caching + direct map reads for serializer fallback.
PerPlatformPool<T> (expect/actual) gives JVM/Android ThreadLocal, Native @Volatile, JS plain var — each runtime's cheapest single-slot primitive.
fastDecodeUtf8 / fastEncodeUtf8 / fastEncodedLength per-platform UTF-8 codecs. iOS Map.getString for a 62-byte URL dropped from 268 ns to 102 ns.
JVM, JS Node, Android unit, and iOS simulator test suites pass (56+ tests including MicroBench and CrossPlatformBenchmark).
JVM async-profiler pipeline (phaseProfile Gradle task) and iOS sample-based profiling pipeline (profile-ios-sim.sh) both operational.
Native test binaries now compile with -opt; without it they ran 7-8× slower.

What is still bounded

FlexBuffers are self-describing, so small string-heavy payloads can be larger than JSON (UserProfile: 833 B vs JSON 710 B).
iOS / Kotlin Native is 2.4-4.5× slower than JVM; the gap is structural (no JIT, no escape analysis, heap-only allocation).
C++ index decode remains the floor (~0.04 µs for UserProfile vs ~0.45 µs for the JVM accessor read).
JS is 7-40× slower than JVM in the CrossPlatformBenchmark results; V8's native JSON.parse is hard to beat for small struct workloads.
Controlled JSON token scans still beat best-path Flex on 16/30 adversarial rows because those scans avoid full parsing.

Most important correction: previous docs claimed UserProfile Kotlin decode beat C++ full decode. The current C++ harness disproves that: C++ key decode is about 0.33 µs, C++ index decode is about 0.04 µs for UserProfile. Kotlin FlexCoder is fast, but raw C++ index access remains the floor.

Five Access Tiers

The library exposes five tiers, ordered from fastest to most convenient, with JSON as the comparison baseline. The tier examples use this model:

@Struct @Serializable
data class UserProfile(
    val id: Long,
    val username: String,
    val tags: List<String>,
    val address: Address  // nested @Struct
)

#	Tier	API	Allocation profile
1	Accessor (zero-copy, lazy)	`bytes.asUserProfile().username`	`@JvmInline value class` over a Map; no data-class alloc; lazy collection wrappers.
2	FlexCoder (KSP-generated)	`FlexBuffers.decode<UserProfile>(bytes)`	Data class + collections; zero `Reference` per field; index-based map reads.
3	Accelerated serializer	`FlexBuffers.decode(serializer<T>(), bytes)`	Registry routes to FlexCoder; drop-in replacement for `Json.encodeToString`.
4	Raw kotlinx.serialization	Same after `FlexCoderRegistry.clear()`	`FlexDecoderV2`/`FlexEncoderV2` with field-index cache, current-map/vector-index direct reads.
5	JSON baseline	`Json.decodeFromString(serializer<T>(), s)`	For comparison, debugging, or external interop.

Implementation Map

Area	Files	Responsibility
KSP processor	`reaktor-compiler/.../FlexCoderProcessor.kt`	Scans `@Struct`, emits FlexCoder + Accessor + `asXxx()` extensions + registration aggregator.
Public API	`core/FlexBuffers.kt`, `core/FlexCoder.kt`	Encode/decode entry points, registry lookup, fallback to kotlinx.serialization.
Decoder fallback	`core/FlexDecoderV2.kt`	Descriptor-driven serializer decode with field-index cache, currentMapIndex/currentVectorIndex direct reads, and beginStructure fast path.
Encoder fallback	`core/FlexEncoderV2.kt`	kotlinx.serialization `AbstractEncoder` with bulk primitive collection paths.
Per-platform pool	`core/PerPlatformPool.kt` (+ 4 actuals)	Single-slot pool: JVM/Android `ThreadLocal`, Native `@Volatile`, JS plain.
Builder/runtime	`flatbuffers/FlexBuffersBuilder.kt`, `flatbuffers/FlexBuffers.kt`	Internalized FlatBuffers Kotlin runtime with Reaktor-specific builder optimizations.
UTF-8 codec	`flatbuffers/FastDecode.kt` (+ 4 actuals)	`fastDecodeUtf8` / `fastEncodeUtf8` / `fastEncodedLength` expect/actual.
Collections	`core/FlexCollections.kt`	Lazy zero-copy `FlexIntList` / `FlexStringStringMap` / etc.
Builder pool	`core/FlexBufferPool.kt`	16-slot CAS-backed pool of 16 KB pre-grown builders.
iOS profiling	`iosMain/.../bench/IosBench.kt`, `flamechart/profile-ios-sim.sh`	Long-running release executable + sample-based driver script.
JVM profiling	`jvmMain/.../bench/PhaseProfiler.kt`, `flamechart/analyze.py`	Per-phase async-profiler runner + hot-frame aggregator.
C++ reference	`cpp/bench/flexbuffer_bench.cpp`	Native harness for wire-size verification and key-vs-index decode comparisons.
Benchmarks	`src/commonTest/.../Benchmark.kt`, `MicroBench.kt`	KMP cross-platform benchmark + per-operation micro-benchmark.

Runtime Architecture

Generated fast path

T → FlexCoderRegistry → GeneratedFlexCoder.encode(builder, value) → FlexBuffersBuilder → ByteArray

ByteArray → FlexBuffers.getRoot(bytes).asMap → GeneratedFlexCoder.decode(map) → T

Registry resolves coders by KClass or kotlinx serial name.
Generated fields written in alphabetical order at compile time.
Builder receives presorted maps and skips per-map sorting.
Generated decode uses stable field indexes — O(1) per field.
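
The index-based decode described above can be sketched in a few lines. This is an illustration only, not the generated code: `alphabeticalIndexes` and the toy `values` array are hypothetical names, the sorted array merely stands in for a presorted FlexBuffer map, and field names are assumed to be ASCII so that string order matches byte order.

```kotlin
// Hypothetical sketch: maps declaration-order field positions to their
// positions in an alphabetically sorted key vector, computed once per class.
fun alphabeticalIndexes(declared: List<String>): IntArray {
    val sorted = declared.sorted()
    return IntArray(declared.size) { i -> sorted.indexOf(declared[i]) }
}

fun main() {
    // Declaration order of the UserProfile fields from the example model.
    val declared = listOf("id", "username", "tags", "address")
    val cache = alphabeticalIndexes(declared)

    // Stand-in for a presorted map's value vector: address, id, tags, username.
    val values = arrayOf<Any>("addr", 42L, listOf("a"), "alice")

    // Reading "username" (declared position 1) is a single array lookup,
    // with no key comparison at decode time.
    println(values[cache[1]]) // alice
}
```

Because the mapping depends only on the field names, it can be computed at compile time or cached once per class, which is what makes the per-field read O(1).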

Serializer fallback path

T → kotlinx.serialization descriptor → FlexEncoderV2 → FlexBuffersBuilder → ByteArray

ByteArray → FlexDecoderV2 → descriptor element index → serializer callbacks → T

Keeps third-party and non-@Struct models working.
Descriptor field-index cache + currentMapIndex/currentVectorIndex direct reads avoid Reference allocation.
Decoders and encoders are acquired from PerPlatformPool, which is per-thread on JVM/Android.
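
A minimal JVM-only sketch of the single-slot, per-thread pooling idea follows. `SingleSlotPool` is a hypothetical name and the shape of `acquire`/`release` is an assumption; the actual `PerPlatformPool` expect/actual declarations are not shown in this document.

```kotlin
// Hypothetical JVM-only sketch of a single-slot, per-thread pool.
class SingleSlotPool<T : Any>(private val factory: () -> T) {
    private val slot = ThreadLocal<T?>()

    // Takes the cached instance if present, otherwise creates a new one.
    fun acquire(): T {
        val cached = slot.get()
        if (cached != null) {
            slot.set(null) // The slot stays empty while the instance is in use.
            return cached
        }
        return factory()
    }

    // Returns the instance to this thread's slot, replacing any previous one.
    fun release(value: T) {
        slot.set(value)
    }
}

fun main() {
    val pool = SingleSlotPool { StringBuilder(64) }
    val first = pool.acquire()
    pool.release(first)
    val second = pool.acquire()
    println(first === second) // true: the same thread reuses the released instance
}
```

A single slot keeps acquire and release to one thread-local read and write; a nested acquire on the same thread simply falls through to the factory.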

Zero-copy accessor path

ByteArray → FlexBuffer Map → @JvmInline value-class Accessor → typed property reads

Accessors are for read-heavy paths where the caller does not need a full data class. They wrap FlexBuffer maps and expose typed properties that read directly from the byte buffer. Lazy list wrappers (FlexIntList, FlexStringStringMap) avoid materializing collections until the caller reads an element.
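
The accessor idea can be illustrated with a toy example. The byte layout here is deliberately simplistic and is NOT the FlexBuffers wire format; `ToyAccessor` and `asToy()` are hypothetical names that only mirror the shape of the generated `asXxx()` extensions.

```kotlin
// Hypothetical toy layout, NOT the FlexBuffers wire format:
// bytes 0..3 = little-endian Int "id", bytes 4..7 = little-endian Int "score".
@JvmInline
value class ToyAccessor(private val bytes: ByteArray) {
    val id: Int get() = readIntLE(0)
    val score: Int get() = readIntLE(4)

    // Assembles a little-endian Int on demand; nothing is decoded eagerly.
    private fun readIntLE(offset: Int): Int =
        (bytes[offset].toInt() and 0xFF) or
            ((bytes[offset + 1].toInt() and 0xFF) shl 8) or
            ((bytes[offset + 2].toInt() and 0xFF) shl 16) or
            ((bytes[offset + 3].toInt() and 0xFF) shl 24)
}

fun ByteArray.asToy(): ToyAccessor = ToyAccessor(this)

fun main() {
    val bytes = byteArrayOf(7, 0, 0, 0, 0x2C, 0x01, 0, 0)
    val toy = bytes.asToy()
    println(toy.id)    // 7
    println(toy.score) // 300
}
```

Each property read touches only the bytes it needs, so a caller that reads one field never pays for the others.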

Cross-Platform Results

Encode + decode µs/op, Apple M-series, min of 3 runs of 5,000 iterations with 500 warmup, per CrossPlatformBenchmark (in commonTest, runs identically on every target).

FlexCoder (KSP-generated, the production hot path)

Case	JVM	Android	iOS sim	JS Node
UserProfile (14 fields, nested)	3.1 µs	5.0 µs	7.3 µs	21.9 µs
ChatThread (15 msgs, nested)	7.0 µs	10.1 µs	24.5 µs	96.1 µs
ApiResponse (20 products, lists)	14.4 µs	15.5 µs	55.0 µs	167.0 µs
TimeSeries (256d + 256L typed)	4.1 µs	4.2 µs	18.4 µs	163.3 µs

vs JSON baseline (speedup of FlexCoder over JSON; values below 1.0× mean JSON is faster)

Case	JVM	Android	iOS	JS
UserProfile	1.0×	2.1×	1.1×	0.4×
ChatThread	1.5×	1.4×	1.7×	0.5×
ApiResponse	1.6×	1.5×	1.5×	0.6×
TimeSeries	10.6×	10.6×	10.0×	0.7×

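Headlines: on JVM/Android, FlexBuffer ranges from parity with JSON (1.0× for UserProfile on JVM) to 10.6× faster (TimeSeries). TimeSeries numeric bulk dominates everywhere except JS. On iOS, FlexBuffer roughly matches JSON for small structs (1.1×) and wins on nested and numeric payloads (1.5-10.0×). On JS, V8's native JSON wins every tested row (Flex runs at 0.4-0.7× of JSON speed).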

Shipped Optimizations

This section lists every optimization originally proposed in the improvement plan, marked with its current implementation status. Items annotated (new) were added during the most recent cross-platform performance pass.

Decoder & runtime

Status	Optimization	Where
✅	Field index cache — per-class `IntArray` mapping descriptor index to alphabetical map position. Replaces O(log n) `map.get(key)` with O(1) array lookup; deterministic from field names alone, computed once per class.	`FlexDecoderV2.fieldIndexCache`
✅	currentMapIndex direct reads — `decodeElementIndex` stores the map position; `decodeInt` / `decodeString` / etc. call `map.getInt(idx)` directly. Zero `Reference` allocation for primitive fields. (new)	`FlexDecoderV2`
✅	currentVectorIndex direct reads — same pattern for `VECTOR` contexts. `decodeElementIndex` records the index; `decodeXxx` calls `vec.readInt(i)` / `readString(i)` / etc. (new)	`FlexDecoderV2`
✅	MAP_ENTRIES value fast path — `Map<K,V>` value side stores `currentMapIndex` instead of allocating a Reference. Halves per-entry allocation. (new)	`FlexDecoderV2.decodeElementIndex`
✅	beginStructure direct-dispatch fast path — nested CLASS / LIST / MAP from a parent map/vector context call `map.getMap(i)` / `vec.readMap(i)` directly, never materialising the intermediate Reference. (new)	`FlexDecoderV2.beginStructure`
✅	Lazy decode-context stack init — `DecodingContextStack` / `StructureStack` no longer pre-fill 16 entries per call. Grow on demand; removed 22% / 23% of decode/encode allocations seen in flamegraph. (new)	`FlexDecoderV2`, `FlexEncoderV2`
✅	@JvmField on hot mutable state — bypasses synthetic Kotlin property getters. `DecodingContext.getType()` / `getFieldIndices()` were 2-3% each in CPU profile; gone after this change. (new)	`DecodingContext`, `StructureEntry`
✅	Map.keyVector lazy init — `keyVectorEnd` / `keyVectorByteWidth` computed on first key-vector access, not in constructor. FlexCoder index reads never touch them, saving 2 buffer reads + 2 field writes per Map construction. (new)	`Map` in `FlexBuffers.kt`
✅	Gated registry & bulk-array dispatch — skip `FlexCoderRegistry.getBySerialName` hash lookup when registry is empty; gate `endsWith("Array")` on `kind == LIST` first. (new)	`FlexDecoderV2`, `FlexEncoderV2`
✅	Exact key comparison — Map binary search treats key prefixes as distinct; `ArrayReadBuffer.findFirst` respects slice offsets. Fixes prefix-lookup correctness while preserving fast ASCII key search.	`Map`, `ArrayReadBuffer`
✅	Map.indexOf + direct keyed scalar getters — `getInt` / `getLong` / `getDouble` / `getStringByteLength` skip `Reference` allocation. Improves schema fallback and sparse missing lookup (sparse miss 1.64 µs vs C++ 11.64 µs).	`Map`

Encoder & builder

Status	Optimization	Where
✅	ValueStack as parallel primitive arrays — `IntArray types`, `LongArray iVals`, `DoubleArray dVals`. Zero per-field `Value` object allocation; cache-friendly contiguous storage.	`FlexBuffersBuilder.ValueStack`
✅	Pre-sorted field order — KSP emits `builder.set("a", ...)`, `builder.set("b", ...)` alphabetically + `endMap(... presorted = true)`. No runtime sort for KSP-generated coders.	`FlexCoderProcessor`, `FlexBuffersBuilder.endMap`
✅	In-place dynamic map sorting — `ValueStack.sortByKeys` sorts the parallel value-stack arrays in place instead of allocating index and temporary arrays for every unsorted dynamic map.	`FlexBuffersBuilder.ValueStack`
✅	Primitive collection writers — `setIntCollection` / `setLongCollection` / `setDoubleCollection` / `setFloatCollection` write serializer fallback collections directly without intermediate primitive-array conversion.	`FlexBuffersBuilder`
✅	Bulk vector reads — `TypedVector.toIntArray()` / `toLongArray()` / `toDoubleArray()` / `toFloatArray()` hoist `byteWidth` outside the loop and walk the buffer sequentially.	`FlexBuffers.kt` (vendored layer)
✅	16 KB default buffer — `FlexBufferPool.DEFAULT_BUFFER_SIZE` raised from 4 KB to 16 KB. ApiResponse-sized payloads (7.5 KB) no longer trigger resize during encode. (new)	`FlexBufferPool.kt`
✅	CAS builder pool — 16-slot atomic pool of reusable 16 KB builders; avoids handing the same builder to two threads.	`FlexBufferPool.kt`
✅	Key sharing — builder string/key cache reuses key strings across repeated schema writes.	`FlexBuffersBuilder`
✅	copyInto / System.arraycopy — bulk copies use `ByteArray.copyInto` which intrinsifies to `memcpy` on JVM and uses stdlib's optimised path on Native.	`Buffers.kt`
✅	Copy-free internal API — `FlexBuffers.encodeToBuffer` lets direct-coder callers consume a `ReadBuffer` without forcing the final `ByteArray` copy.	`FlexBuffers.kt`
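
The CAS builder pool row can be sketched as follows. This is a JVM-only illustration under stated assumptions: `CasPool` is a hypothetical name, a plain `ByteArray` stands in for a pre-grown builder, and the real `FlexBufferPool` may differ in how it scans slots.

```kotlin
import java.util.concurrent.atomic.AtomicReferenceArray

// Hypothetical JVM sketch of a fixed-slot pool guarded by compare-and-set.
class CasPool<T : Any>(slots: Int, private val factory: () -> T) {
    private val pool = AtomicReferenceArray<T?>(slots)

    // Claims the first non-empty slot atomically; two threads can never
    // receive the same instance because only one CAS per slot succeeds.
    fun acquire(): T {
        for (i in 0 until pool.length()) {
            val candidate = pool.get(i) ?: continue
            if (pool.compareAndSet(i, candidate, null)) return candidate
        }
        return factory() // Pool is empty: allocate a fresh instance.
    }

    // Parks the instance in the first free slot; drops it if the pool is full.
    fun release(value: T) {
        for (i in 0 until pool.length()) {
            if (pool.compareAndSet(i, null, value)) return
        }
    }
}

fun main() {
    val pool = CasPool(slots = 16) { ByteArray(16 * 1024) }
    val buffer = pool.acquire()
    pool.release(buffer)
    println(pool.acquire() === buffer) // true
}
```

Dropping instances when the pool is full keeps `release` non-blocking, at the cost of occasional reallocation under heavy contention.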

KSP code generation

Status	Optimization	Where
✅	@Struct annotation triggers KSP — emits `FlexCoder<T>` + `@JvmInline value class` Accessor + `Reference.asXxx()` / `ByteArray.asXxx()` extensions for every annotated class.	`FlexCoderProcessor`
✅	@JvmInline value class accessors — getters compile to pointer arithmetic, no data class allocation. Lazy collection wrappers and nested accessors.	All generated `XxxAccessor`
✅	Nullable field support — KSP emits `if (value.x != null) builder.set(...) else builder.putNull(...)`, decode branches on `map.isNullAt(i)`.	`FlexCoderProcessor`, `Map.isNullAt`
✅	String byte-length accessors — KSP-generated accessors include `*ByteLength` properties so previews can measure/filter strings without decoding them.	`FlexCoderProcessor`

Cross-platform & per-platform

Status	Optimization	Where
✅	PerPlatformPool<T> (expect/actual) — JVM/Android `ThreadLocal<T?>`, Native `@Volatile var slot: T?`, JS plain `var slot: T?`. Each runtime uses its cheapest single-slot primitive instead of paying cross-platform CAS overhead (~30 ns saved per acquire on Native). (new)	`PerPlatformPool.kt` + 4 actuals
✅	fastDecodeUtf8 (expect/actual) — JVM/Android/JS delegate to stdlib (JIT intrinsifies); iOS uses ASCII fast path + `NSString.create(bytes:length:encoding:)`. Cut iOS `Map.getString` from 268 ns to 102 ns for a 62-byte URL. (new)	`FastDecode.kt` + platform actuals
✅	fastEncodeUtf8 (expect/actual) — Native uses a direct char→byte ASCII loop without the `.also { cc = it }` closure capture that defeats AOT optimisation in the stock encoder. (new)	`FastDecode.kt` + actuals
✅	fastEncodedLength (expect/actual) — ASCII fast-path UTF-8 byte-count. Eliminated 89% of `Utf8.encodedLength` samples on iOS (from 6.9% CPU to 0.8%). (new)	`FastDecode.kt` + actuals
✅	-opt for Native test binaries — default debug-mode iOS tests were 7-8× slower than release; benchmark numbers were misleading. Fixed via shared KMP target config. (new)	`build.gradle.kts`
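
An ASCII-first byte counter in the spirit of `fastEncodedLength` might look like the sketch below. The function name `utf8EncodedLength` is hypothetical, and counting a lone surrogate as 3 bytes is an assumption (platform encoders differ on malformed input); the platform actuals in `FastDecode.kt` are not reproduced here.

```kotlin
// Hypothetical sketch of an ASCII-first UTF-8 byte counter.
fun utf8EncodedLength(s: String): Int {
    var i = 0
    // Fast path: ASCII characters encode to exactly one byte each.
    while (i < s.length && s[i].code < 0x80) i++
    var bytes = i
    // Slow path: entered only from the first non-ASCII character onward.
    while (i < s.length) {
        val c = s[i]
        when {
            c.code < 0x80 -> bytes += 1
            c.code < 0x800 -> bytes += 2
            c.isHighSurrogate() && i + 1 < s.length && s[i + 1].isLowSurrogate() -> {
                bytes += 4 // A valid surrogate pair is one 4-byte code point.
                i++
            }
            else -> bytes += 3 // Other BMP chars; lone surrogates assumed 3 bytes.
        }
        i++
    }
    return bytes
}

fun main() {
    val samples = listOf("https://example.com/a", "h\u00E9llo", "a\uD83D\uDE00b")
    for (s in samples) {
        // Cross-check against the stdlib encoder for well-formed input.
        check(utf8EncodedLength(s) == s.encodeToByteArray().size)
    }
    println(utf8EncodedLength("h\u00E9llo")) // 6
}
```

For pure-ASCII strings such as URLs, the first loop visits every character and the second loop never runs.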

Profiling pipeline

Status	Tool	Where
✅	JVM phaseProfile — per-phase async-profiler runner, captures CPU + alloc flamegraphs per (tier × payload). HTML + collapsed-format output. Python aggregator filters JIT compiler-thread noise. (new)	`PhaseProfiler.kt`, `flamechart/analyze.py`
✅	iOS sample-based profiler — long-running release executable spawned via simctl, sampled with macOS `sample`. Full Kotlin/Native symbols. Driver script auto-boots sim, finds PID, captures top-of-stack profile. (new)	`IosBench.kt`, `profile-ios-sim.sh`
✅	MicroBench — cross-platform per-operation timings (`Map.getString`, `vec.readLong`, allocations) so each platform's hot operations can be measured directly. (new)	`MicroBench.kt` (commonTest)
✅	CrossPlatformBenchmark — KMP-portable 4-tier benchmark in `commonTest`. Runs identically on JVM, Android unit, iOS sim, JS Node. Single source of truth for cross-platform numbers. (new)	`CrossPlatformBenchmark.kt`
✅	C++ harness — standalone native bench for wire-size verification and key-vs-index decode comparisons.	`cpp/bench/flexbuffer_bench.cpp`

iOS Profile (Real Sample Data)

Sample of ApiResponse + UserProfile + ChatThread + TimeSeries encode + decode hot loop (5 s, 1 ms sampling, ~5,200 worker-thread samples). The exact numbers come from flamechart/output/ios/ios-sim-sample.txt after the optimization pass:

Function	Samples	% of worker	Notes
`Kotlin_String_get`	880	16.9%	String indexing in tight loops (UTF-8 ASCII scanners). Mostly inherent.
`fastDecodeUtf8`	620	11.9%	Our ASCII fast-path decode. Inherent for UTF-8 → String.
`ArrayReadWriteBuffer.requestCapacity`	587	11.3%	Bounds check on every `set()`. Most are early-return; cost is the call + TLS access.
`ArrayReadWriteBuffer.put(CharSequence)`	514	9.9%	UTF-8 encode (string write).
`tlv_get_addr`	404	7.8%	Native thread-local storage access (singletons, GC state). Partly inherent.
`FixedBlockPage::Sweep`	299	5.7%	GC sweep — allocation pressure.
`CustomAllocator::Allocate`	182	3.5%	Heap allocations.
`__CFFromUTF8`	153	2.9%	Apple's `NSString.create` for non-ASCII strings.
`Utf8.encodedLength`	41	0.8%	Was 360 / 6.9% before `fastEncodedLength` — −89%.

What this tells us:

GC + allocation overhead is ~10% combined. Closing it requires fewer allocations (Cursor value class, Map pool, primitive-array List decode).
tlv_get_addr at 7.8% is Native compiler-inserted state — partly inherent.
UTF-8 work (string get, decode, encode) totals roughly 38% of cycles. Already heavily optimized; further wins likely require slice-style APIs (no String materialization).
Interface dispatch on ReadBuffer still costs — Native cannot devirtualise through interfaces.

Realistic best case for iOS with all pending items applied: 1.5-3× slower than JVM. Currently it is 2.4-4.5× slower.

Partial & Pending Optimizations

⚠ Partial — could go further

Item	What's done	What's left
`Map<String, String>` decode	Slow path materializes a `LinkedHashMap<String, String>`; lazy `FlexStringStringMap` view exists.	Could emit a `FlexStringStringMap` directly when the field type is `Map<String, String>` — saves N×2 allocations per nested map.
`endMap` key-width calc	`calculateKeyVectorBitWidth` loops over every entry calling `elemWidth`.	KSP could emit a hard-coded key-vector bit-width for fixed schemas (all known offsets at compile time).
Default buffer sizing	16 KB initial; most payloads fit.	Tiered pool: small (4 KB) / medium (16 KB) / large (256 KB) slots; acquire by size hint.
Bulk primitive list decode	`LongArray` / `IntArray` / `DoubleArray` fields use `vec.toLongArray()`.	`List<Long>` / `List<Double>` still allocate `ArrayList<Long>` with boxed elements. `LongArray.asList()` regressed in testing (anonymous AbstractList wrapper dispatch was costly). Needs a custom non-boxing `List<Long>` adapter.

⏳ Pending — not yet attempted

Item	Estimated impact	Why deferred
`@JvmInline value class Cursor(packed: Long)`	Eliminates residual `Reference` heap allocations (~5-9% of decode allocations). Biggest gap to C++.	Requires plumbing a `ReadBuffer` reference through scope-local state — value classes can only have one field.
Unsafe / VarHandle direct reads on JVM	Skips bounds checks + interface dispatch on `ReadBuffer` reads.	Restricted-method warnings on recent JVMs; needs gating behind a flag.
CPointer / MemorySegment on Native / JDK 22+	Similar gain on Native — direct memory loads.	Partially used in iOS UTF-8 path; could extend to all reads.
Concrete `ReadWriteBuffer` type in `FlexBuffersBuilder`	Replaces interface field with `ArrayReadWriteBuffer` concrete type → Native AOT can devirtualise the `buffer.set(...)` calls (12% in iOS profile).	Breaks public API.
Schema evolution safe decode	Generated field layout fingerprint table so index decode can verify the map shape before fast-path reads.	Substantial KSP generator change.
Compile-time `endMap` skip	KSP could emit pre-computed key-vector geometry, skipping both sort and width-calc loops entirely for fixed schemas.	Substantial generator change; biggest remaining encode-side win.
Adaptive key/string sharing policy	C++ shows unique-string encode can be about 2.2× slower with sharing (137.41 µs vs 61.24 µs).	Two-pass detection can cost too much; needs explicit policy hooks.
FlexUtf8Slice (byte-range view)	Compare / hash UTF-8 without materializing `String`. Useful for filter/preview scans.	Callers must avoid holding slices after backing buffer reuse.
JS-specific Long handling	JS `BigInt` for Long is expensive. For values fitting in 53 bits, `Number` is much faster.	Out of scope unless JS becomes a hot platform.
JMH gates in CI	Cleaner before/after evidence; fewer false regressions.	Longer CI runtime; current benchmark-style tests are sufficient for regression smoke.
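
To make the pending Cursor item concrete, here is one way a single `Long` could carry what a heap-allocated reference would otherwise hold. The bit layout (32-bit offset, 8-bit byte width, 8-bit type tag) is purely an assumption for illustration; the document does not specify the intended packing, and the backing buffer would still have to be supplied separately, which is the deferral reason given in the table.

```kotlin
// Hypothetical packing: low 32 bits = byte offset, next 8 bits = byte width,
// next 8 bits = type tag. The intended Cursor layout is not specified here.
@JvmInline
value class Cursor(val packed: Long) {
    val offset: Int get() = packed.toInt()
    val byteWidth: Int get() = ((packed ushr 32) and 0xFFL).toInt()
    val type: Int get() = ((packed ushr 40) and 0xFFL).toInt()

    companion object {
        fun of(offset: Int, byteWidth: Int, type: Int): Cursor =
            Cursor(
                (offset.toLong() and 0xFFFF_FFFFL) or
                    ((byteWidth.toLong() and 0xFFL) shl 32) or
                    ((type.toLong() and 0xFFL) shl 40)
            )
    }
}

fun main() {
    val c = Cursor.of(offset = 1024, byteWidth = 4, type = 9)
    println("${c.offset} ${c.byteWidth} ${c.type}") // 1024 4 9
}
```

When a value class is not boxed, it is represented by its underlying `Long`, so passing a `Cursor` between functions need not allocate.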

Detailed Benchmarks (May 2026 JVM ledger)

The JVM realistic ledger remains the broadest comparable signal — 26 realistic workloads run via RealisticBenchmark.summary:

Case	FlexCoder	Serializer	JSON	Flex size (B)	JSON size (B)	Speedup (JSON ÷ FlexCoder)
UserProfile	4 µs	6 µs	4 µs	833	710	0.9×
ApiResponse	14 µs	34 µs	37 µs	7483	8506	2.7×
EventLog	3 µs	4 µs	3 µs	758	618	0.9×
ChatThread	8 µs	17 µs	18 µs	3372	3380	2.4×
ConfigSnapshot	4 µs	9 µs	7 µs	1059	1138	1.8×
TimeSeries	4 µs	14 µs	45 µs	4340	5835	10.8×
NotificationInbox	22 µs	47 µs	44 µs	12377	12526	2.0×
OrderHistory	18 µs	31 µs	36 µs	7813	11028	2.0×
MediaLibrary	21 µs	38 µs	45 µs	13326	13911	2.1×
SearchResults	21 µs	38 µs	44 µs	10943	11132	2.1×
WorkoutSession	22 µs	36 µs	87 µs	13270	15760	3.9×
BankingLedger	35 µs	65 µs	70 µs	22635	22072	2.0×
RideHistory	62 µs	127 µs	329 µs	49784	55882	5.3×
ProjectBoard	55 µs	94 µs	134 µs	43033	40367	2.4×
DocumentCorpus	203 µs	283 µs	410 µs	134808	147101	2.0×
SecurityAudit	51 µs	102 µs	113 µs	24312	26938	2.2×
GraphSnapshot	81 µs	172 µs	172 µs	27816	27606	2.1×
Recommendation	49 µs	105 µs	125 µs	27704	23211	2.6×
GameWorld	82 µs	161 µs	184 µs	35920	33116	2.3×
IoTFleet	92 µs	192 µs	268 µs	58328	52442	2.9×
CRMPortfolio	57 µs	106 µs	135 µs	38024	35798	2.4×
TravelItinerary	14 µs	25 µs	25 µs	6484	6185	1.8×
CourseRoster	92 µs	190 µs	267 µs	62120	59741	2.9×
ShipmentBatch	65 µs	116 µs	124 µs	33296	33917	1.9×
MarketData	125 µs	241 µs	359 µs	93935	66512	2.9×
SocialGraphDelta	92 µs	187 µs	193 µs	42624	44784	2.1×

30-row JSON-vs-Flex adversarial harness

Metric	Result	Interpretation
Full FlexCoder vs full kotlinx JSON	0/30 losses	Generated full encode/decode beats full JSON parse/materialization on every adversarial case.
Best Flex path vs best JSON path	16/30 losses	Includes controlled JSON token scans — intentionally hostile to Flex partial reads.
Flex size vs JSON size	14/30 losses	Binary wins on repeated/numeric structures; JSON stays compact for many string-heavy payloads.

Flex Kotlin vs Flex C++ (10-row ledger)

Case	Kotlin	C++	Ratio (Kotlin ÷ C++)	Winner
TinyStatus key decode	0.27 µs	0.07 µs	3.8×	C++
TinyStatus index decode	0.11 µs	0.01 µs	11.0×	C++
TinyStatus partial key	0.19 µs	0.03 µs	6.4×	C++
Sparse missing lookups	1.64 µs	11.64 µs	0.1×	Kotlin
StringTable scan	4.41 µs	7.84 µs	0.6×	Kotlin
TimeSeries index scan	0.76 µs	1.08 µs	0.7×	Kotlin
Wide random key reads	4.27 µs	4.47 µs	1.0×	Kotlin
Wide sequential index	0.52 µs	0.14 µs	3.7×	C++
Unique strings encode (sharing)	79.35 µs	137.41 µs	0.6×	Kotlin
Unique strings encode (no sharing)	60.04 µs	61.24 µs	1.0×	Kotlin

Overall: Kotlin Flex loses 4/10 to C++ Flex; wins 6/10 on the new helper paths (sparse miss / StringTable / TimeSeries / wide random / both unique-strings rows).

Wire Size Tradeoffs

Case	FlexBuffer	JSON	Delta	Interpretation
UserProfile	833 B	710 B	+17.3%	Small mixed/string payload — JSON is compact.
EventLog	758 B	618 B	+22.7%	String and metadata overhead dominates.
ChatThread	3372 B	3380 B	-0.2%	Essentially size-neutral.
ApiResponse	7483 B	8506 B	-12.0%	Nested product payload benefits from binary encoding.
ConfigSnapshot	1059 B	1138 B	-6.9%	Moderate binary win.
TimeSeries	4340 B	5835 B	-25.6%	Numeric vector workload — ideal FlexBuffer use case.
RideHistory	49784 B	55882 B	-10.9%	Route-point numeric arrays offset nested map overhead.
DocumentCorpus	134808 B	147101 B	-8.4%	Large corpus still wins despite text-heavy segments.
Recommendation	27704 B	23211 B	+19.4%	Ranked feed is string/action heavy — JSON is smaller.
MarketData	93935 B	66512 B	+41.2%	Worst added-size case — map-heavy order books expose self-description cost.
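
The Delta column is the FlexBuffer size relative to the JSON size. A small sketch of that arithmetic, using two rows from the table (the helper name `sizeDeltaPercent` is hypothetical):

```kotlin
import java.util.Locale

// Relative size change of FlexBuffer versus JSON, in percent of the JSON size.
fun sizeDeltaPercent(flexBytes: Int, jsonBytes: Int): Double =
    (flexBytes - jsonBytes) * 100.0 / jsonBytes

fun main() {
    val rows = listOf("UserProfile" to (833 to 710), "TimeSeries" to (4340 to 5835))
    for ((name, sizes) in rows) {
        val delta = sizeDeltaPercent(sizes.first, sizes.second)
        println(String.format(Locale.ROOT, "%s %+.1f%%", name, delta))
    }
    // UserProfile +17.3%
    // TimeSeries -25.6%
}
```

A positive delta means the FlexBuffer encoding is larger than JSON; a negative delta means it is smaller.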

Interpretation

Operational rule: Generated FlexCoders are the right default for rich internal payloads. JSON remains a valid choice for tiny string-heavy public payloads and controlled token checks. Fixed binary stays fastest for closed telemetry rows with no schema-evolution requirement.

Where the optimization pays

JVM/Android generated coders are 1.5-10× faster than JSON on nested and numeric-heavy payloads.
TimeSeries shows the clearest combined speed (10.8×) and wire-size (-26%) win on every platform except JS.
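Android unit tests run on the host JVM (same HotSpot path), so results track JVM: within about 1.6× on UserProfile and nearly identical on ApiResponse and TimeSeries.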
iOS sample profile is now reproducible — we know exactly where Kotlin/Native cycles go.
C++ harness confirms index-based reads are the correct design direction.

Where the data is mixed

UserProfile and EventLog are larger than JSON for tiny string-heavy payloads.
iOS gap to JVM is 2.4-4.5× — structural (no JIT, no escape analysis). 1.5-3× is the realistic ceiling.
JS is 7-40× slower than JVM in the CrossPlatformBenchmark results; V8's native JSON is hard to beat for small structs.
Controlled JSON token scans beat best-path Flex on 16/30 adversarial rows.

Claim discipline: Say "generated FlexCoders are faster than full kotlinx JSON round trips for tested Reaktor payloads." Do not say "FlexBuffers are faster than JSON" without qualifiers — the adversarial harness intentionally disproves that broader statement.

Reproduce the Run

# Tests / correctness
./gradlew :reaktor-flexbuffer:jvmTest
./gradlew :reaktor-flexbuffer:iosSimulatorArm64Test
./gradlew :reaktor-flexbuffer:testReleaseUnitTest
./gradlew :reaktor-flexbuffer:jsNodeTest

# Cross-platform benchmark (4 fixtures × 4 tiers on every target)
./gradlew :reaktor-flexbuffer:jvmTest --tests "*.CrossPlatformBenchmark" --rerun
./gradlew :reaktor-flexbuffer:iosSimulatorArm64Test --tests "*.CrossPlatformBenchmark" --rerun

# Per-operation micro-bench (helps localise platform-specific hot spots)
./gradlew :reaktor-flexbuffer:iosSimulatorArm64Test --tests "*.MicroBench" --rerun

# JVM async-profiler: CPU + alloc flamegraphs per tier × payload
./gradlew :reaktor-flexbuffer:phaseProfile
python3 reaktor-flexbuffer/flamechart/analyze.py --top 12 reaktor-flexbuffer/flamechart/output/phase

# iOS sample-based profile (boots sim, spawns bench.kexe, samples, prints top-of-stack)
./gradlew :reaktor-flexbuffer:linkBenchReleaseExecutableIosSimulatorArm64
./reaktor-flexbuffer/flamechart/profile-ios-sim.sh
# For Instruments-grade traces:
xcrun xctrace record --template "Time Profiler" --output trace.trace --launch -- bench.kexe

# C++ reference harness
cd reaktor-flexbuffer/cpp/bench
clang++ -O2 -std=c++17 -I ../../../.github_modules/flatbuffers/include flexbuffer_bench.cpp -o flexbuffer_bench
./flexbuffer_bench --quick --verify

Roadmap

Phase 1 (P0): Schema evolution & registration hygiene

Generate field layout fingerprint per @Struct class; fall back to name lookup when shape doesn't match.
Make registry concurrency explicit (seal() after startup; or copy-on-write).
Add nullable / evolved-schema golden tests.
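
One possible shape for the layout fingerprint is sketched below. Everything here is an assumption: the names `LayoutFingerprint` and `layoutFingerprint`, the choice of FNV-1a, and the decision to fingerprint field names only (not types). A real implementation would likely avoid re-reading every key on the hot path, for example by checking the key count first.

```kotlin
// Hypothetical fingerprint: field count plus an FNV-1a hash of the sorted names.
data class LayoutFingerprint(val fieldCount: Int, val hash: Int)

fun layoutFingerprint(fieldNames: List<String>): LayoutFingerprint {
    var hash = 0x811C9DC5.toInt() // FNV-1a 32-bit offset basis
    for (name in fieldNames.sorted()) {
        for (b in name.encodeToByteArray()) {
            hash = (hash xor (b.toInt() and 0xFF)) * 0x01000193 // FNV prime
        }
        hash *= 0x01000193 // Mixes in a 0x00 separator so key boundaries count.
    }
    return LayoutFingerprint(fieldNames.size, hash)
}

fun main() {
    // Computed once at code-generation time in this sketch.
    val expected = layoutFingerprint(listOf("id", "username", "tags", "address"))

    val sameShape = listOf("address", "id", "tags", "username")
    val evolved = listOf("address", "id", "nickname", "tags", "username")

    println(layoutFingerprint(sameShape) == expected) // true: index reads are safe
    println(layoutFingerprint(evolved) == expected)   // false: fall back to name lookup
}
```

Including the field count makes an added or removed field a guaranteed mismatch, independent of hash collisions.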

Phase 2 (P1): Close the C++ allocation gap

@JvmInline value class Cursor(Long) to replace Reference heap allocations on the hot path.
FlexUtf8Slice for byte-range string compares without materialisation.
Concrete ArrayReadWriteBuffer type in FlexBuffersBuilder to let Native AOT devirtualise the buffer.set(...) calls (about 12% of the iOS profile).
KSP: compile-time endMap skip for fixed schemas.

Phase 3 (P2): Format-choice policy

Teach Service/ObjectStore to route by payload: JSON for tiny/string/public; Flex for internal/nested/cache; compact binary for closed telemetry.
Promote accessors to first-class ObjectStore/actor APIs.
JMH gates for JVM; Instruments-grade trace for iOS, before any external performance claim.
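
The routing rule in Phase 3 could be expressed as a small policy function. This is a hypothetical sketch: `WireFormat`, `PayloadTraits`, `chooseFormat`, and the 1024-byte threshold are all invented for illustration, and the actual Service/ObjectStore hooks are not described in this document.

```kotlin
enum class WireFormat { JSON, FLEX, COMPACT_BINARY }

// Hypothetical payload traits; the real routing hints are not specified.
data class PayloadTraits(
    val isPublic: Boolean,
    val isClosedTelemetry: Boolean,
    val approxBytes: Int,
    val stringHeavy: Boolean,
)

// Mirrors the stated rule: JSON for tiny/string/public payloads, compact
// binary for closed telemetry, Flex for everything else.
fun chooseFormat(t: PayloadTraits): WireFormat = when {
    t.isClosedTelemetry -> WireFormat.COMPACT_BINARY
    t.isPublic -> WireFormat.JSON
    t.stringHeavy && t.approxBytes < 1024 -> WireFormat.JSON // Assumed threshold.
    else -> WireFormat.FLEX
}

fun main() {
    val publicProfile = PayloadTraits(
        isPublic = true, isClosedTelemetry = false, approxBytes = 700, stringHeavy = true,
    )
    val internalCache = PayloadTraits(
        isPublic = false, isClosedTelemetry = false, approxBytes = 7500, stringHeavy = false,
    )
    println(chooseFormat(publicProfile)) // JSON
    println(chooseFormat(internalCache)) // FLEX
}
```

Keeping the policy in one pure function would make it easy to unit-test against the wire-size and speed ledgers above.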

Bottom line: Keep FlexBuffers as the Reaktor internal default for rich/nested/cache payloads. The fastest path is not "decode everything faster" — it is "do not decode everything." Accessors, byte-length helpers, and typed folds matter more than micro-tuning full materialization.