Updated May 2026

Performance engineering · Architecture

FlexBuffer Binary Serialization

Reaktor's optimized KMP FlexBuffer stack: generated FlexCoders, zero-copy accessors, internalized FlatBuffers runtime, per-platform pools, per-platform UTF-8 fast paths, iOS-specific profiling, builder pooling, and the cross-platform benchmark results that drive the optimization roadmap.

  • 4 platforms benchmarked: JVM, Android (host JVM), iOS simulator arm64, JS Node
  • 0/30 full JSON losses: the generated FlexCoder beat the full kotlinx JSON round trip in every adversarial case
  • 10-15× on numeric bulk: TimeSeries dominates JSON on every JVM/Android/iOS path
  • iOS sample profile available: reproducible end-to-end Kotlin/Native flamegraph pipeline

Current Status

reaktor-flexbuffer is now more than a serializer swap: it is a generated-code path over an internalized FlexBuffers runtime, with explicit hot paths for primitive fields, typed vectors, map field indexing, builder pooling, per-platform encode/decode pools, per-platform UTF-8 fast paths, and a reproducible sample-based iOS profiling pipeline.

What is working today

  • KSP generates FlexCoder<T> + @JvmInline value class Accessor + asXxx() extensions for every @Struct model.
  • Generated encode writes map keys alphabetically and calls endMap(..., presorted = true).
  • Generated decode reads fields by index, not key lookup; zero Reference allocation for primitives.
  • FlexDecoderV2 has descriptor field-index caching + direct map reads for serializer fallback.
  • PerPlatformPool<T> (expect/actual) gives JVM/Android ThreadLocal, Native @Volatile, JS plain var — each runtime's cheapest single-slot primitive.
  • fastDecodeUtf8 / fastEncodeUtf8 / fastEncodedLength per-platform UTF-8 codecs. iOS Map.getString for a 62-byte URL dropped from 268 ns to 102 ns.
  • JVM, JS Node, Android unit, and iOS simulator test suites pass (56+ tests including MicroBench and CrossPlatformBenchmark).
  • JVM async-profiler pipeline (phaseProfile Gradle task) and iOS sample-based profiling pipeline (profile-ios-sim.sh) both operational.
  • Native test binaries now compile with -opt; without it, iOS tests ran 7-8× slower and benchmark numbers were misleading.
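The single-slot pool idea behind PerPlatformPool can be sketched for the JVM/Android case, where each thread keeps at most one cached instance in a ThreadLocal. This is a minimal sketch under that assumption: SingleSlotThreadPool is a hypothetical name, not the PerPlatformPool API, and the Native (@Volatile) and JS (plain var) actuals would replace the ThreadLocal with a single field.

```kotlin
// Hypothetical sketch of a JVM-style single-slot pool; not Reaktor's PerPlatformPool API.
class SingleSlotThreadPool<T : Any>(private val factory: () -> T) {
    // One slot per thread: no CAS is needed because no other thread can see this slot.
    private val slot = ThreadLocal<T?>()

    /** Returns this thread's cached instance, or creates a fresh one if the slot is empty. */
    fun acquire(): T {
        val cached = slot.get()
        if (cached != null) {
            slot.set(null) // empty the slot so a nested acquire cannot receive the same instance
            return cached
        }
        return factory()
    }

    /** Puts the instance back; the next acquire on the same thread reuses it. */
    fun release(value: T) {
        slot.set(value)
    }
}
```

Emptying the slot on acquire keeps re-entrant use safe: a nested decode on the same thread simply allocates a second instance instead of sharing the first.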

What is still bounded

  • FlexBuffers are self-describing, so small string-heavy payloads can be larger than JSON (UserProfile: 833 B vs JSON 710 B).
  • iOS / Kotlin Native is 2.4-4.5× slower than JVM; the gap is structural (no JIT, no escape analysis, heap-only allocation).
  • C++ index decode remains the floor (~0.04 µs for UserProfile vs ~0.45 µs for the JVM accessor read).
  • JS is 7-45× slower than JVM — V8's native JSON.parse is hard to beat for small struct workloads.
  • Controlled JSON token scans still beat best-path Flex on 16/30 adversarial rows because those scans avoid full parsing.
Most important correction: previous docs claimed UserProfile Kotlin decode beat C++ full decode. The current C++ harness disproves that: C++ key decode is about 0.33 µs, C++ index decode is about 0.04 µs for UserProfile. Kotlin FlexCoder is fast, but raw C++ index access remains the floor.

Five Access Tiers

The library exposes five tiers from fastest to most convenient:

@Struct @Serializable
data class UserProfile(
    val id: Long,
    val username: String,
    val tags: List<String>,
    val address: Address  // nested @Struct
)
# | Tier | API | Allocation profile
1 | Accessor (zero-copy, lazy) | bytes.asUserProfile().username | @JvmInline value class over a Map; no data-class allocation; lazy collection wrappers.
2 | FlexCoder (KSP-generated) | FlexBuffers.decode<UserProfile>(bytes) | Data class + collections; no Reference allocation per field; index-based map reads.
3 | Accelerated serializer | FlexBuffers.decode(serializer<T>(), bytes) | Registry routes to the FlexCoder; drop-in replacement for Json.encodeToString.
4 | Raw kotlinx.serialization | Same as tier 3, after FlexCoderRegistry.clear() | FlexDecoderV2/FlexEncoderV2 with field-index cache and current-map/vector-index direct reads.
5 | JSON baseline | Json.decodeFromString(serializer<T>(), s) | For comparison, debugging, or external interop.

Implementation Map

Area | Files | Responsibility
KSP processor | reaktor-compiler/.../FlexCoderProcessor.kt | Scans @Struct, emits FlexCoder + Accessor + asXxx() extensions + registration aggregator.
Public API | core/FlexBuffers.kt, core/FlexCoder.kt | Encode/decode entry points, registry lookup, fallback to kotlinx.serialization.
Decoder fallback | core/FlexDecoderV2.kt | Descriptor-driven serializer decode with field-index cache, currentMapIndex/currentVectorIndex direct reads, and beginStructure fast path.
Encoder fallback | core/FlexEncoderV2.kt | kotlinx.serialization AbstractEncoder with bulk primitive collection paths.
Per-platform pool | core/PerPlatformPool.kt (+ 4 actuals) | Single-slot pool: JVM/Android ThreadLocal, Native @Volatile, JS plain var.
Builder/runtime | flatbuffers/FlexBuffersBuilder.kt, flatbuffers/FlexBuffers.kt | Internalized FlatBuffers Kotlin runtime with Reaktor-specific builder optimizations.
UTF-8 codec | flatbuffers/FastDecode.kt (+ 4 actuals) | fastDecodeUtf8 / fastEncodeUtf8 / fastEncodedLength expect/actual.
Collections | core/FlexCollections.kt | Lazy zero-copy FlexIntList / FlexStringStringMap / etc.
Builder pool | core/FlexBufferPool.kt | 16-slot CAS-backed pool of 16 KB pre-grown builders.
iOS profiling | iosMain/.../bench/IosBench.kt, flamechart/profile-ios-sim.sh | Long-running release executable + sample-based driver script.
JVM profiling | jvmMain/.../bench/PhaseProfiler.kt, flamechart/analyze.py | Per-phase async-profiler runner + hot-frame aggregator.
C++ reference | cpp/bench/flexbuffer_bench.cpp | Native harness for wire-size verification and key-vs-index decode comparisons.
Benchmarks | src/commonTest/.../*Benchmark*.kt, MicroBench.kt | KMP cross-platform benchmark + per-operation micro-benchmark.
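The "16-slot CAS-backed pool" can be illustrated with a fixed array of atomic slots. This is a minimal JVM sketch assuming java.util.concurrent's AtomicReferenceArray; CasPool is a hypothetical name and the builder type is replaced by a generic T. A slot is emptied with compareAndSet before its instance is returned, so two threads can never receive the same builder; when every slot is empty the caller simply allocates.

```kotlin
import java.util.concurrent.atomic.AtomicReferenceArray

// Hypothetical sketch of a fixed-size CAS pool; not Reaktor's FlexBufferPool.
class CasPool<T : Any>(capacity: Int, private val factory: () -> T) {
    private val slots = AtomicReferenceArray<T?>(capacity)

    fun acquire(): T {
        for (i in 0 until slots.length()) {
            val candidate = slots.get(i) ?: continue
            // Only the thread whose CAS succeeds takes ownership of this instance.
            if (slots.compareAndSet(i, candidate, null)) return candidate
        }
        return factory() // pool empty: allocate a fresh instance
    }

    fun release(value: T) {
        for (i in 0 until slots.length()) {
            if (slots.compareAndSet(i, null, value)) return
        }
        // Pool full: drop the instance and let the GC reclaim it.
    }
}
```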

Runtime Architecture

Generated fast path

T → FlexCoderRegistry → GeneratedFlexCoder.encode(builder, value) → FlexBuffersBuilder → ByteArray
ByteArray → FlexBuffers.getRoot(bytes).asMap → GeneratedFlexCoder.decode(map) → T
  • Registry resolves coders by KClass or kotlinx serial name.
  • Generated fields written in alphabetical order at compile time.
  • Builder receives presorted maps and skips per-map sorting.
  • Generated decode uses stable field indexes — O(1) per field.
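The shape of a generated coder can be sketched with a toy in-memory map in place of real FlexBuffers bytes. Everything here (ToyMap, ToyBuilder, UserSummary, UserSummaryCoder) is hypothetical and only mirrors the pattern described above: because the alphabetical key order is fixed at compile time, encode writes values in that order with presorted = true, and decode reads each field from a constant index with no key search. The nullable field follows the generated putNull / isNullAt pattern.

```kotlin
// Toy stand-in for a FlexBuffers map: keys sorted, values at matching positions.
class ToyMap(val keys: List<String>, val values: List<Any?>) {
    fun isNullAt(i: Int): Boolean = values[i] == null
    fun getLong(i: Int): Long = values[i] as Long
    fun getString(i: Int): String = values[i] as String
}

class ToyBuilder {
    private val keys = ArrayList<String>()
    private val values = ArrayList<Any?>()
    fun set(key: String, value: Any) { keys.add(key); values.add(value) }
    fun putNull(key: String) { keys.add(key); values.add(null) }

    fun endMap(presorted: Boolean): ToyMap {
        if (!presorted) {
            // Dynamic maps pay for a sort at runtime.
            val order = keys.indices.sortedBy { keys[it] }
            return ToyMap(order.map { keys[it] }, order.map { values[it] })
        }
        return ToyMap(keys.toList(), values.toList()) // generated coders skip the sort
    }
}

data class UserSummary(val id: Long, val username: String, val nickname: String?)

// Roughly what a generated coder does: "id" < "nickname" < "username".
object UserSummaryCoder {
    private const val ID = 0
    private const val NICKNAME = 1
    private const val USERNAME = 2

    fun encode(builder: ToyBuilder, value: UserSummary): ToyMap {
        builder.set("id", value.id)
        if (value.nickname != null) builder.set("nickname", value.nickname) else builder.putNull("nickname")
        builder.set("username", value.username)
        return builder.endMap(presorted = true)
    }

    // O(1) per field: every read uses a compile-time index.
    fun decode(map: ToyMap): UserSummary = UserSummary(
        id = map.getLong(ID),
        username = map.getString(USERNAME),
        nickname = if (map.isNullAt(NICKNAME)) null else map.getString(NICKNAME),
    )
}
```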

Serializer fallback path

T → kotlinx.serialization descriptor → FlexEncoderV2 → FlexBuffersBuilder → ByteArray
ByteArray → FlexDecoderV2 → descriptor element index → serializer callbacks → T
  • Keeps third-party and non-@Struct models working.
  • Descriptor field-index cache + currentMapIndex/currentVectorIndex direct reads avoid Reference allocation.
  • Pool acquires decoder/encoder via PerPlatformPool; per-thread on JVM/Android.
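The descriptor field-index cache can be sketched in a few lines, assuming FlexBuffers keys are ordered by the unsigned byte order of their UTF-8 encoding. buildFieldIndexCache is a hypothetical name. It maps each descriptor (declaration-order) index to the position that field's key occupies in the sorted map, so it depends only on the field names and can be computed once per class.

```kotlin
// Compare two UTF-8 byte arrays as unsigned bytes; a proper prefix sorts first.
fun compareUtf8(a: ByteArray, b: ByteArray): Int {
    val n = minOf(a.size, b.size)
    for (i in 0 until n) {
        val diff = (a[i].toInt() and 0xFF) - (b[i].toInt() and 0xFF)
        if (diff != 0) return diff
    }
    return a.size - b.size
}

/** Hypothetical helper: cache[descriptorIndex] = position of that field's key in the sorted map. */
fun buildFieldIndexCache(fieldNames: List<String>): IntArray {
    val encoded = fieldNames.map { it.encodeToByteArray() }
    // Descriptor indexes, ordered by the byte order of their keys.
    val sorted = fieldNames.indices.sortedWith(Comparator { x, y -> compareUtf8(encoded[x], encoded[y]) })
    val cache = IntArray(fieldNames.size)
    sorted.forEachIndexed { mapPosition, descriptorIndex -> cache[descriptorIndex] = mapPosition }
    return cache
}
```

For the UserProfile model above, the declaration order (id, username, tags, address) yields the cache [1, 3, 2, 0], because the sorted key order is address, id, tags, username.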

Zero-copy accessor path

ByteArray → FlexBuffer Map → @JvmInline value-class Accessor → typed property reads

Accessors are for read-heavy paths where the caller does not need a full data class. They wrap FlexBuffer maps and expose typed properties that read directly from the byte buffer. Lazy collection wrappers (FlexIntList, FlexStringStringMap) avoid materializing collections until the caller reads an element.
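A sketch of the accessor pattern over raw bytes, using a deliberately simplified fixed layout instead of the real FlexBuffers encoding: a little-endian Int id at offset 0, an Int count at offset 4, then count Int scores. ScoreCardAccessor, ByteBackedIntList, readIntLE, and asScoreCard are hypothetical names. The value class avoids a wrapper allocation wherever the compiler can inline it, and the list wrapper decodes an element only when get() is called.

```kotlin
// Read a little-endian 32-bit int at the given byte offset.
fun readIntLE(bytes: ByteArray, offset: Int): Int =
    (bytes[offset].toInt() and 0xFF) or
        ((bytes[offset + 1].toInt() and 0xFF) shl 8) or
        ((bytes[offset + 2].toInt() and 0xFF) shl 16) or
        ((bytes[offset + 3].toInt() and 0xFF) shl 24)

// Lazy, zero-copy List<Int> view: nothing is decoded until an element is read.
class ByteBackedIntList(
    private val bytes: ByteArray,
    private val offset: Int,
    override val size: Int,
) : AbstractList<Int>() {
    override fun get(index: Int): Int {
        if (index < 0 || index >= size) throw IndexOutOfBoundsException("index $index, size $size")
        return readIntLE(bytes, offset + 4 * index)
    }
}

// Hypothetical accessor for a record laid out as [id: Int][count: Int][scores: Int * count].
@JvmInline
value class ScoreCardAccessor(private val bytes: ByteArray) {
    val id: Int get() = readIntLE(bytes, 0)
    val scores: List<Int> get() = ByteBackedIntList(bytes, 8, readIntLE(bytes, 4))
}

fun ByteArray.asScoreCard(): ScoreCardAccessor = ScoreCardAccessor(this)
```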

Cross-Platform Results

Encode + decode µs/op, Apple M-series, min of 3 runs of 5,000 iterations with 500 warmup, per CrossPlatformBenchmark (in commonTest, runs identically on every target).
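The measurement rule (minimum of 3 runs, 5,000 timed iterations, 500 warmup) can be sketched as a small JVM helper. This is an illustration using System.nanoTime, not the CrossPlatformBenchmark source; minMicrosPerOp is a hypothetical name, and repeating the warmup before every run is an assumption. Taking the minimum filters out GC pauses and scheduler noise instead of averaging them in.

```kotlin
/**
 * Hypothetical helper: returns the best (minimum) average µs/op over [runs] runs.
 * Each run executes [warmup] untimed iterations, then [iterations] timed ones.
 */
fun minMicrosPerOp(
    runs: Int = 3,
    iterations: Int = 5_000,
    warmup: Int = 500,
    op: () -> Unit,
): Double {
    var best = Double.MAX_VALUE
    repeat(runs) {
        repeat(warmup) { op() }
        val start = System.nanoTime()
        repeat(iterations) { op() }
        val elapsedNs = System.nanoTime() - start
        best = minOf(best, elapsedNs / 1_000.0 / iterations)
    }
    return best
}
```

A production harness would also consume each operation's result (for example by folding it into a checksum) so the JIT cannot discard the measured work.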

FlexCoder (KSP-generated, the production hot path)

Case | JVM | Android | iOS sim | JS Node
UserProfile (14 fields, nested) | 3.1 µs | 5.0 µs | 7.3 µs | 21.9 µs
ChatThread (15 msgs, nested) | 7.0 µs | 10.1 µs | 24.5 µs | 96.1 µs
ApiResponse (20 products, lists) | 14.4 µs | 15.5 µs | 55.0 µs | 167.0 µs
TimeSeries (256d + 256L typed) | 4.1 µs | 4.2 µs | 18.4 µs | 163.3 µs

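vs JSON baseline (JSON time ÷ FlexCoder time; above 1.0× means FlexCoder is faster)

Case | JVM | Android | iOS sim | JS Node
UserProfile | 1.0× | 2.1× | 1.1× | 0.4×
ChatThread | 1.5× | 1.4× | 1.7× | 0.5×
ApiResponse | 1.6× | 1.5× | 1.5× | 0.6×
TimeSeries | 10.6× | 10.6× | 10.0× | 0.7×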
Headlines: on JVM/Android, FlexCoder ranges from parity with JSON (UserProfile on JVM, 1.0×) to 10.6× faster (TimeSeries). TimeSeries numeric bulk dominates everywhere except JS. On iOS, FlexBuffer matches JSON for small structs and wins on numeric/nested payloads. On JS, V8's native JSON wins every row (0.4-0.7×).

Shipped Optimizations

Optimizations from the original improvement plan that have shipped. Items annotated (new) were added during the most recent cross-platform performance pass.

Decoder & runtime

Optimization | Details | Where
Field index cache | Per-class IntArray mapping descriptor index to alphabetical map position. Replaces O(log n) map.get(key) with an O(1) array lookup; deterministic from field names alone, computed once per class. | FlexDecoderV2.fieldIndexCache
currentMapIndex direct reads (new) | decodeElementIndex stores the map position; decodeInt / decodeString / etc. call map.getInt(idx) directly. No Reference allocation for primitive fields. | FlexDecoderV2
currentVectorIndex direct reads (new) | Same pattern for VECTOR contexts: decodeElementIndex records the index; decodeXxx calls vec.readInt(i) / readString(i) / etc. | FlexDecoderV2
MAP_ENTRIES value fast path (new) | The value side of a Map<K,V> stores currentMapIndex instead of allocating a Reference, halving per-entry allocation. | FlexDecoderV2.decodeElementIndex
beginStructure direct-dispatch fast path (new) | Nested CLASS / LIST / MAP from a parent map/vector context calls map.getMap(i) / vec.readMap(i) directly, never materialising the intermediate Reference. | FlexDecoderV2.beginStructure
Lazy decode-context stack init (new) | DecodingContextStack / StructureStack no longer pre-fill 16 entries per call; they grow on demand. Removed 22% / 23% of the decode/encode allocations seen in the flamegraph. | FlexDecoderV2, FlexEncoderV2
@JvmField on hot mutable state (new) | Bypasses synthetic Kotlin property getters. DecodingContext.getType() / getFieldIndices() were 2-3% each in the CPU profile; gone after this change. | DecodingContext, StructureEntry
Map.keyVector lazy init (new) | keyVectorEnd / keyVectorByteWidth are computed on first key-vector access, not in the constructor. FlexCoder index reads never touch them, saving 2 buffer reads + 2 field writes per Map construction. | Map in FlexBuffers.kt
Gated registry & bulk-array dispatch (new) | Skip the FlexCoderRegistry.getBySerialName hash lookup when the registry is empty; check kind == LIST before endsWith("Array"). | FlexDecoderV2, FlexEncoderV2
Exact key comparison | Map binary search treats key prefixes as distinct; ArrayReadBuffer.findFirst respects slice offsets. Fixes prefix-lookup correctness while preserving fast ASCII key search. | Map, ArrayReadBuffer
Map.indexOf + direct keyed scalar getters | getInt / getLong / getDouble / getStringByteLength skip Reference allocation. Improves schema fallback and sparse missing lookups (sparse miss 1.64 µs vs C++ 11.64 µs). | Map
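The exact-key rule and Map.indexOf can be sketched as a binary search over sorted UTF-8 keys that compares full byte sequences, so a probe such as "id" never matches a stored "idx" and vice versa. indexOfKey and compareKeyBytes are hypothetical names, and the keys are held in an array here, whereas the real Map reads them from the key vector. Returning -1 on a miss is what lets the keyed scalar getters skip allocating a Reference.

```kotlin
// Unsigned byte-wise comparison; a shorter key that is a prefix of a longer one sorts first.
fun compareKeyBytes(a: ByteArray, b: ByteArray): Int {
    val n = minOf(a.size, b.size)
    for (i in 0 until n) {
        val diff = (a[i].toInt() and 0xFF) - (b[i].toInt() and 0xFF)
        if (diff != 0) return diff
    }
    // Lengths must also match for equality, so a prefix is never reported as a hit.
    return a.size - b.size
}

/** Hypothetical sketch of Map.indexOf: position of [key] in [sortedKeys], or -1 if absent. */
fun indexOfKey(sortedKeys: Array<ByteArray>, key: String): Int {
    val probe = key.encodeToByteArray()
    var lo = 0
    var hi = sortedKeys.size - 1
    while (lo <= hi) {
        val mid = (lo + hi) ushr 1
        val cmp = compareKeyBytes(sortedKeys[mid], probe)
        when {
            cmp < 0 -> lo = mid + 1
            cmp > 0 -> hi = mid - 1
            else -> return mid
        }
    }
    return -1
}
```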

Encoder & builder

Optimization | Details | Where
ValueStack as parallel primitive arrays | IntArray types, LongArray iVals, DoubleArray dVals. No per-field Value object allocation; cache-friendly contiguous storage. | FlexBuffersBuilder.ValueStack
Pre-sorted field order | KSP emits builder.set("a", ...), builder.set("b", ...) alphabetically + endMap(..., presorted = true). No runtime sort for KSP-generated coders. | FlexCoderProcessor, FlexBuffersBuilder.endMap
In-place dynamic map sorting | ValueStack.sortByKeys sorts the parallel value-stack arrays in place instead of allocating index and temporary arrays for every unsorted dynamic map. | FlexBuffersBuilder.ValueStack
Primitive collection writers | setIntCollection / setLongCollection / setDoubleCollection / setFloatCollection write serializer-fallback collections directly, without intermediate primitive-array conversion. | FlexBuffersBuilder
Bulk vector reads | TypedVector.toIntArray() / toLongArray() / toDoubleArray() / toFloatArray() hoist byteWidth out of the loop and walk the buffer sequentially. | FlexBuffers.kt (vendored layer)
16 KB default buffer (new) | FlexBufferPool.DEFAULT_BUFFER_SIZE raised from 4 KB to 16 KB. ApiResponse-sized payloads (7.5 KB) no longer trigger a resize during encode. | FlexBufferPool.kt
CAS builder pool | 16-slot atomic pool of reusable 16 KB builders; never hands the same builder to two threads. | FlexBufferPool.kt
Key sharing | The builder's string/key cache reuses key strings across repeated schema writes. | FlexBuffersBuilder
copyInto / System.arraycopy | Bulk copies use ByteArray.copyInto, which intrinsifies to memcpy on JVM and uses the stdlib's optimised path on Native. | Buffers.kt
Copy-free internal API | FlexBuffers.encodeToBuffer lets direct-coder callers consume a ReadBuffer without forcing the final ByteArray copy. | FlexBuffers.kt
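The parallel-array layout and in-place key sort can be sketched as follows. ToyValueStack, its keys array, pushLong / pushDouble, and the 0/1 type tags are hypothetical; only the types / iVals / dVals naming follows the table. The insertion sort is an illustrative choice for short maps, not necessarily the algorithm ValueStack.sortByKeys uses, keys are assumed ASCII (so String order equals UTF-8 byte order), and capacity is fixed for brevity. Every swap moves all parallel arrays in lockstep, so no index or temporary arrays are allocated.

```kotlin
// Struct-of-arrays value stack: one slot per pending map entry, no per-value objects.
class ToyValueStack(capacity: Int) {
    var size = 0
        private set
    val keys = arrayOfNulls<String>(capacity)
    val types = IntArray(capacity) // toy type tags: 0 = long, 1 = double
    val iVals = LongArray(capacity)
    val dVals = DoubleArray(capacity)

    fun pushLong(key: String, v: Long) { keys[size] = key; types[size] = 0; iVals[size] = v; size++ }
    fun pushDouble(key: String, v: Double) { keys[size] = key; types[size] = 1; dVals[size] = v; size++ }

    // In-place insertion sort by key, swapping all parallel arrays together.
    fun sortByKeys() {
        for (i in 1 until size) {
            var j = i
            while (j > 0 && keys[j - 1]!! > keys[j]!!) {
                swap(j - 1, j)
                j--
            }
        }
    }

    private fun swap(a: Int, b: Int) {
        val k = keys[a]; keys[a] = keys[b]; keys[b] = k
        val t = types[a]; types[a] = types[b]; types[b] = t
        val l = iVals[a]; iVals[a] = iVals[b]; iVals[b] = l
        val d = dVals[a]; dVals[a] = dVals[b]; dVals[b] = d
    }
}
```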

KSP code generation

Optimization | Details | Where
@Struct annotation triggers KSP | Emits FlexCoder<T> + @JvmInline value class Accessor + Reference.asXxx() / ByteArray.asXxx() extensions for every annotated class. | FlexCoderProcessor
@JvmInline value class accessors | Getters compile to pointer arithmetic; no data-class allocation. Lazy collection wrappers and nested accessors. | All generated XxxAccessor
Nullable field support | KSP emits if (value.x != null) builder.set(...) else builder.putNull(...); decode branches on map.isNullAt(i). | FlexCoderProcessor, Map.isNullAt
String byte-length accessors | KSP-generated accessors include *ByteLength properties so previews can measure or filter strings without decoding them. | FlexCoderProcessor

Cross-platform & per-platform

Optimization | Details | Where
PerPlatformPool<T> (expect/actual) (new) | JVM/Android ThreadLocal<T?>, Native @Volatile var slot: T?, JS plain var slot: T?. Each runtime uses its cheapest single-slot primitive instead of paying cross-platform CAS overhead (~30 ns saved per acquire on Native). | PerPlatformPool.kt + 4 actuals
fastDecodeUtf8 (expect/actual) (new) | JVM/Android/JS delegate to the stdlib (JIT intrinsifies); iOS uses an ASCII fast path + NSString.create(bytes:length:encoding:). Cut iOS Map.getString from 268 ns to 102 ns for a 62-byte URL. | FastDecode.kt + platform actuals
fastEncodeUtf8 (expect/actual) (new) | Native uses a direct char→byte ASCII loop without the .also { cc = it } closure capture that defeats AOT optimisation in the stock encoder. | FastDecode.kt + actuals
fastEncodedLength (expect/actual) (new) | ASCII fast-path UTF-8 byte count. Eliminated 89% of Utf8.encodedLength samples on iOS (from 6.9% CPU to 0.8%). | FastDecode.kt + actuals
-opt for Native test binaries (new) | Default debug-mode iOS tests were 7-8× slower than release, so benchmark numbers were misleading. Fixed via shared KMP target config. | build.gradle.kts
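The ASCII fast paths can be sketched in portable Kotlin: check whether every byte (or char) is ASCII, take the cheap path if so, and otherwise fall back to a general codec. decodeUtf8Sketch and encodedLengthSketch are hypothetical stand-ins for fastDecodeUtf8 / fastEncodedLength; the iOS actual described above calls NSString.create rather than the stdlib fallback used here, and counting an unpaired surrogate as 3 bytes (the size of U+FFFD) is an assumption.

```kotlin
/** Decode bytes[offset, offset + length) as UTF-8, with an ASCII-only fast path. */
fun decodeUtf8Sketch(bytes: ByteArray, offset: Int, length: Int): String {
    val end = offset + length
    for (i in offset until end) {
        // A byte >= 0x80 is negative as a signed Byte: non-ASCII, so use the general decoder.
        if (bytes[i] < 0) return bytes.decodeToString(offset, end)
    }
    // All ASCII: each byte maps to exactly one char.
    val chars = CharArray(length) { bytes[offset + it].toInt().toChar() }
    return chars.concatToString()
}

/** Number of bytes [s] occupies in UTF-8, with an ASCII-only fast path. */
fun encodedLengthSketch(s: String): Int {
    var i = 0
    while (i < s.length && s[i].code < 0x80) i++
    if (i == s.length) return s.length // fast path: pure ASCII, one byte per char
    var bytes = i
    while (i < s.length) {
        val c = s[i].code
        when {
            c < 0x80 -> bytes += 1
            c < 0x800 -> bytes += 2
            c in 0xD800..0xDBFF && i + 1 < s.length && s[i + 1].code in 0xDC00..0xDFFF -> {
                bytes += 4 // a surrogate pair encodes one supplementary code point
                i++
            }
            else -> bytes += 3 // other BMP chars; unpaired surrogates also counted as 3 (assumption)
        }
        i++
    }
    return bytes
}
```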

Profiling pipeline

Tool | Details | Where
JVM phaseProfile (new) | Per-phase async-profiler runner; captures CPU + alloc flamegraphs per (tier × payload) with HTML + collapsed-format output. A Python aggregator filters JIT compiler-thread noise. | PhaseProfiler.kt, flamechart/analyze.py
iOS sample-based profiler (new) | Long-running release executable spawned via simctl and sampled with macOS sample, with full Kotlin/Native symbols. The driver script auto-boots the simulator, finds the PID, and captures a top-of-stack profile. | IosBench.kt, profile-ios-sim.sh
MicroBench (new) | Cross-platform per-operation timings (Map.getString, vec.readLong, allocations) so each platform's hot operations can be measured directly. | MicroBench.kt (commonTest)
CrossPlatformBenchmark (new) | KMP-portable 4-tier benchmark in commonTest. Runs identically on JVM, Android unit, iOS sim, and JS Node; single source of truth for cross-platform numbers. | CrossPlatformBenchmark.kt
C++ harness | Standalone native bench for wire-size verification and key-vs-index decode comparisons. | cpp/bench/flexbuffer_bench.cpp

iOS Profile (Real Sample Data)

Sample of ApiResponse + UserProfile + ChatThread + TimeSeries encode + decode hot loop (5 s, 1 ms sampling, ~5,200 worker-thread samples). The exact numbers come from flamechart/output/ios/ios-sim-sample.txt after the optimization pass:

Function | Samples | % of worker | Notes
Kotlin_String_get | 880 | 16.9% | String indexing in tight loops (UTF-8 ASCII scanners). Mostly inherent.
fastDecodeUtf8 | 620 | 11.9% | Our ASCII fast-path decode. Inherent for UTF-8 → String.
ArrayReadWriteBuffer.requestCapacity | 587 | 11.3% | Bounds check on every set(). Most calls return early; the cost is the call + TLS access.
ArrayReadWriteBuffer.put(CharSequence) | 514 | 9.9% | UTF-8 encode (string write).
tlv_get_addr | 404 | 7.8% | Native thread-local storage access (singletons, GC state). Partly inherent.
FixedBlockPage::Sweep | 299 | 5.7% | GC sweep driven by allocation pressure.
CustomAllocator::Allocate | 182 | 3.5% | Heap allocations.
__CFFromUTF8 | 153 | 2.9% | Apple's NSString.create for non-ASCII strings.
Utf8.encodedLength | 41 | 0.8% | Was 360 samples / 6.9% before fastEncodedLength (−89%).
What this tells us:
  • GC + allocation overhead is ~10% combined. Closing it requires fewer allocations (Cursor value class, Map pool, primitive-array List decode).
  • tlv_get_addr at 7.8% is Native compiler-inserted state — partly inherent.
  • UTF-8 work (string get, decode, encode) totals roughly 38% of cycles. Already heavily optimized; further wins likely require slice-style APIs (no String materialization).
  • Interface dispatch on ReadBuffer still costs — Native cannot devirtualise through interfaces.
Realistic iOS ceiling with all pending items applied: 1.5-3× JVM. Currently 2.4-4.5× JVM.

Partial & Pending Optimizations

⚠ Partial — could go further

Item | What's done | What's left
Map<String, String> decode | Slow path materializes a LinkedHashMap<String, String>; a lazy FlexStringStringMap view exists. | Could emit a FlexStringStringMap directly when the field type is Map<String, String>, saving N×2 allocations per nested map.
endMap key-width calc | calculateKeyVectorBitWidth loops over every entry calling elemWidth. | KSP could emit a hard-coded key-vector bit width for fixed schemas (all offsets known at compile time).
Default buffer sizing | 16 KB initial; most payloads fit. | Tiered pool: small (4 KB) / medium (16 KB) / large (256 KB) slots, acquired by size hint.
Bulk primitive list decode | LongArray / IntArray / DoubleArray fields use vec.toLongArray() and its Int/Double equivalents. | List<Long> / List<Double> still allocate an ArrayList<Long> with boxed elements. LongArray.asList() regressed in testing (dispatch through the anonymous AbstractList wrapper was costly). Needs a custom non-boxing List<Long> adapter.

⏳ Pending — not yet attempted

Item | Estimated impact | Why deferred
@JvmInline value class Cursor(packed: Long) | Eliminates residual Reference heap allocations (~5-9% of decode allocations). Biggest gap to C++. | Requires plumbing a ReadBuffer reference through scope-local state, because a value class can have only one field.
Unsafe / VarHandle direct reads on JVM | Skips bounds checks + interface dispatch on ReadBuffer reads. | Restricted-method warnings on recent JVMs; needs gating behind a flag.
CPointer / MemorySegment on Native / JDK 22+ | Similar gain on Native via direct memory loads. | Partially used in the iOS UTF-8 path; could extend to all reads.
Concrete ReadWriteBuffer type in FlexBuffersBuilder | Replaces the interface field with the concrete ArrayReadWriteBuffer type so Native AOT can devirtualise the buffer.set(...) calls (12% in the iOS profile). | Breaks public API.
Schema evolution safe decode | Generated field-layout fingerprint table so index decode can verify the map shape before fast-path reads. | Substantial KSP generator change.
Compile-time endMap skip | KSP could emit pre-computed key-vector geometry, skipping both the sort and width-calc loops entirely for fixed schemas. | Substantial generator change; biggest remaining encode-side win.
Adaptive key/string sharing policy | C++ shows unique-string encode can be 2.0× slower with sharing. | Two-pass detection can cost too much; needs explicit policy hooks.
FlexUtf8Slice (byte-range view) | Compare / hash UTF-8 without materializing a String. Useful for filter/preview scans. | Callers must avoid holding slices after the backing buffer is reused.
JS-specific Long handling | JS BigInt for Long is expensive. For values fitting in 53 bits, Number is much faster. | Out of scope unless JS becomes a hot platform.
JMH gates in CI | Cleaner before/after evidence; fewer false regressions. | Longer CI runtime; current benchmark-style tests are sufficient for regression smoke checks.
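To illustrate why the pending Cursor item fits the one-field constraint, here is one possible way to pack a reference's position into a single Long. The bit layout (bits 0-31 offset, 32-39 parent width, 40-47 byte width, 48-55 type) and all names are assumptions for illustration, not a committed design; the ReadBuffer itself would still have to come from scope-local state, which is the plumbing cost the table mentions.

```kotlin
// Hypothetical layout: bits 0-31 offset, 32-39 parentWidth, 40-47 byteWidth, 48-55 type.
@JvmInline
value class Cursor(private val packed: Long) {
    val offset: Int get() = (packed and 0xFFFF_FFFFL).toInt()
    val parentWidth: Int get() = ((packed ushr 32) and 0xFF).toInt()
    val byteWidth: Int get() = ((packed ushr 40) and 0xFF).toInt()
    val type: Int get() = ((packed ushr 48) and 0xFF).toInt()

    companion object {
        fun of(offset: Int, parentWidth: Int, byteWidth: Int, type: Int): Cursor {
            require(offset >= 0) { "offset must be non-negative" }
            require(parentWidth in 0..255 && byteWidth in 0..255 && type in 0..255) { "field out of range" }
            return Cursor(
                (offset.toLong() and 0xFFFF_FFFFL) or
                    (parentWidth.toLong() shl 32) or
                    (byteWidth.toLong() shl 40) or
                    (type.toLong() shl 48)
            )
        }
    }
}
```

Because the whole position lives in one Long, passing a Cursor between decode helpers costs a register move rather than a heap allocation, as long as it is not used as a nullable or generic type.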

Detailed Benchmarks (May 2026 JVM ledger)

The JVM realistic ledger remains the broadest comparable signal — 26 realistic workloads run via RealisticBenchmark.summary:

Case | FlexCoder | Serializer | JSON | Flex size (B) | JSON size (B) | Speedup (JSON ÷ FlexCoder)
UserProfile | 4 µs | 6 µs | 4 µs | 833 | 710 | 0.9×
ApiResponse | 14 µs | 34 µs | 37 µs | 7483 | 8506 | 2.7×
EventLog | 3 µs | 4 µs | 3 µs | 758 | 618 | 0.9×
ChatThread | 8 µs | 17 µs | 18 µs | 3372 | 3380 | 2.4×
ConfigSnapshot | 4 µs | 9 µs | 7 µs | 1059 | 1138 | 1.8×
TimeSeries | 4 µs | 14 µs | 45 µs | 4340 | 5835 | 10.8×
NotificationInbox | 22 µs | 47 µs | 44 µs | 12377 | 12526 | 2.0×
OrderHistory | 18 µs | 31 µs | 36 µs | 7813 | 11028 | 2.0×
MediaLibrary | 21 µs | 38 µs | 45 µs | 13326 | 13911 | 2.1×
SearchResults | 21 µs | 38 µs | 44 µs | 10943 | 11132 | 2.1×
WorkoutSession | 22 µs | 36 µs | 87 µs | 13270 | 15760 | 3.9×
BankingLedger | 35 µs | 65 µs | 70 µs | 22635 | 22072 | 2.0×
RideHistory | 62 µs | 127 µs | 329 µs | 49784 | 55882 | 5.3×
ProjectBoard | 55 µs | 94 µs | 134 µs | 43033 | 40367 | 2.4×
DocumentCorpus | 203 µs | 283 µs | 410 µs | 134808 | 147101 | 2.0×
SecurityAudit | 51 µs | 102 µs | 113 µs | 24312 | 26938 | 2.2×
GraphSnapshot | 81 µs | 172 µs | 172 µs | 27816 | 27606 | 2.1×
Recommendation | 49 µs | 105 µs | 125 µs | 27704 | 23211 | 2.6×
GameWorld | 82 µs | 161 µs | 184 µs | 35920 | 33116 | 2.3×
IoTFleet | 92 µs | 192 µs | 268 µs | 58328 | 52442 | 2.9×
CRMPortfolio | 57 µs | 106 µs | 135 µs | 38024 | 35798 | 2.4×
TravelItinerary | 14 µs | 25 µs | 25 µs | 6484 | 6185 | 1.8×
CourseRoster | 92 µs | 190 µs | 267 µs | 62120 | 59741 | 2.9×
ShipmentBatch | 65 µs | 116 µs | 124 µs | 33296 | 33917 | 1.9×
MarketData | 125 µs | 241 µs | 359 µs | 93935 | 66512 | 2.9×
SocialGraphDelta | 92 µs | 187 µs | 193 µs | 42624 | 44784 | 2.1×

30-row JSON-vs-Flex adversarial harness

Metric | Result | Interpretation
Full FlexCoder vs full kotlinx JSON | 0/30 losses | Generated full encode/decode beats full JSON parse/materialization on every adversarial case.
Best Flex path vs best JSON path | 16/30 losses | Includes controlled JSON token scans, which are intentionally hostile to Flex partial reads.
Flex size vs JSON size | 14/30 losses | Binary wins on repeated/numeric structures; JSON stays compact for many string-heavy payloads.

Flex Kotlin vs Flex C++ (10-row ledger)

Case | Kotlin | C++ | Ratio (Kotlin ÷ C++) | Winner
TinyStatus key decode | 0.27 µs | 0.07 µs | 3.8× | C++
TinyStatus index decode | 0.11 µs | 0.01 µs | 11.0× | C++
TinyStatus partial key | 0.19 µs | 0.03 µs | 6.4× | C++
Sparse missing lookups | 1.64 µs | 11.64 µs | 0.1× | Kotlin
StringTable scan | 4.41 µs | 7.84 µs | 0.6× | Kotlin
TimeSeries index scan | 0.76 µs | 1.08 µs | 0.7× | Kotlin
Wide random key reads | 4.27 µs | 4.47 µs | 1.0× | Kotlin
Wide sequential index | 0.52 µs | 0.14 µs | 3.7× | C++
Unique strings encode (sharing) | 79.35 µs | 137.41 µs | 0.6× | Kotlin
Unique strings encode (no sharing) | 60.04 µs | 61.24 µs | 1.0× | Kotlin

Overall: Kotlin Flex loses 4/10 to C++ Flex; wins 6/10 on the new helper paths (sparse miss / StringTable / TimeSeries / wide random / both unique-strings rows).

Wire Size Tradeoffs

Case | FlexBuffer | JSON | Delta vs JSON | Interpretation
UserProfile | 833 B | 710 B | +17.3% | Small mixed/string payload; JSON is compact.
EventLog | 758 B | 618 B | +22.7% | String and metadata overhead dominates.
ChatThread | 3372 B | 3380 B | -0.2% | Essentially size-neutral.
ApiResponse | 7483 B | 8506 B | -12.0% | Nested product payload benefits from binary encoding.
ConfigSnapshot | 1059 B | 1138 B | -6.9% | Moderate binary win.
TimeSeries | 4340 B | 5835 B | -25.6% | Numeric vector workload; the ideal FlexBuffer use case.
RideHistory | 49784 B | 55882 B | -10.9% | Route-point numeric arrays offset nested map overhead.
DocumentCorpus | 134808 B | 147101 B | -8.4% | Large corpus still wins despite text-heavy segments.
Recommendation | 27704 B | 23211 B | +19.4% | Ranked feed is string/action heavy; JSON is smaller.
MarketData | 93935 B | 66512 B | +41.2% | Worst added-size case: map-heavy order books expose the self-description cost.

Interpretation

Operational rule: Generated FlexCoders are the right default for rich internal payloads. JSON remains a valid choice for tiny string-heavy public payloads and controlled token checks. Fixed binary stays fastest for closed telemetry rows with no schema-evolution requirement.

Where the optimization pays

  • JVM/Android generated coders are 1.5-10× faster than JSON on nested and numeric-heavy payloads.
  • TimeSeries shows the clearest combined speed (10.8×) and wire-size (-26%) win on every platform except JS.
  • Android unit results match JVM — same HotSpot path, no surprises.
  • iOS sample profile is now reproducible — we know exactly where Kotlin/Native cycles go.
  • C++ harness confirms index-based reads are the correct design direction.

Where the data is mixed

  • UserProfile and EventLog are larger than JSON for tiny string-heavy payloads.
  • iOS gap to JVM is 2.4-4.5× — structural (no JIT, no escape analysis). 1.5-3× is the realistic ceiling.
  • JS is 7-45× slower than JVM — V8's native JSON is hard to beat for small structs.
  • Controlled JSON token scans beat best-path Flex on 16/30 adversarial rows.
Claim discipline: Say "generated FlexCoders are faster than full kotlinx JSON round trips for tested Reaktor payloads." Do not say "FlexBuffers are faster than JSON" without qualifiers — the adversarial harness intentionally disproves that broader statement.

Reproduce the Run

# Tests / correctness
./gradlew :reaktor-flexbuffer:jvmTest
./gradlew :reaktor-flexbuffer:iosSimulatorArm64Test
./gradlew :reaktor-flexbuffer:testReleaseUnitTest
./gradlew :reaktor-flexbuffer:jsNodeTest

# Cross-platform benchmark (4 fixtures × 4 tiers on every target)
./gradlew :reaktor-flexbuffer:jvmTest --tests "*.CrossPlatformBenchmark" --rerun
./gradlew :reaktor-flexbuffer:iosSimulatorArm64Test --tests "*.CrossPlatformBenchmark" --rerun

# Per-operation micro-bench (helps localise platform-specific hot spots)
./gradlew :reaktor-flexbuffer:iosSimulatorArm64Test --tests "*.MicroBench" --rerun

# JVM async-profiler: CPU + alloc flamegraphs per tier × payload
./gradlew :reaktor-flexbuffer:phaseProfile
python3 reaktor-flexbuffer/flamechart/analyze.py --top 12 reaktor-flexbuffer/flamechart/output/phase

# iOS sample-based profile (boots sim, spawns bench.kexe, samples, prints top-of-stack)
./gradlew :reaktor-flexbuffer:linkBenchReleaseExecutableIosSimulatorArm64
./reaktor-flexbuffer/flamechart/profile-ios-sim.sh
# For Instruments-grade traces:
xcrun xctrace record --template "Time Profiler" --device <booted-simulator-UDID> --output trace.xctrace --launch -- bench.kexe

# C++ reference harness
cd reaktor-flexbuffer/cpp/bench
clang++ -O2 -std=c++17 -I ../../../.github_modules/flatbuffers/include flexbuffer_bench.cpp -o flexbuffer_bench
./flexbuffer_bench --quick --verify

Roadmap

Phase 1: Schema evolution & registration hygiene (P0)
  • Generate field layout fingerprint per @Struct class; fall back to name lookup when shape doesn't match.
  • Make registry concurrency explicit (seal() after startup; or copy-on-write).
  • Add nullable / evolved-schema golden tests.
Phase 2: Close the C++ allocation gap (P1)
  • @JvmInline value class Cursor(Long) to replace Reference heap allocations on the hot path.
  • FlexUtf8Slice for byte-range string compares without materialisation.
  • Concrete ArrayReadWriteBuffer type in FlexBuffersBuilder so Native AOT can devirtualise the buffer.set calls (the 12% hot frame in the iOS profile).
  • KSP: compile-time endMap skip for fixed schemas.
Phase 3: Format-choice policy (P2)
  • Teach Service/ObjectStore to route by payload: JSON for tiny/string/public; Flex for internal/nested/cache; compact binary for closed telemetry.
  • Promote accessors to first-class ObjectStore/actor APIs.
  • JMH gates for JVM; Instruments-grade trace for iOS, before any external performance claim.
Bottom line: Keep FlexBuffers as the Reaktor internal default for rich/nested/cache payloads. The fastest path is not "decode everything faster" — it is "do not decode everything." Accessors, byte-length helpers, and typed folds matter more than micro-tuning full materialization.