Current Status
reaktor-flexbuffer is now more than a serializer swap: it is a generated-code path over an internalized FlexBuffers runtime, with explicit hot paths for primitive fields, typed vectors, map field indexing, builder pooling, per-platform encode/decode pools, per-platform UTF-8 fast paths, and a reproducible sample-based iOS profiling pipeline.
What is working today
- KSP generates FlexCoder<T> + @JvmInline value class Accessor + asXxx() extensions for every @Struct model.
- Generated encode writes map keys alphabetically and calls endMap(..., presorted = true).
- Generated decode reads fields by index, not key lookup; zero Reference allocation for primitives.
- FlexDecoderV2 has descriptor field-index caching + direct map reads for serializer fallback.
- PerPlatformPool<T> (expect/actual) gives JVM/Android ThreadLocal, Native @Volatile, JS plain var: each runtime's cheapest single-slot primitive.
- fastDecodeUtf8 / fastEncodeUtf8 / fastEncodedLength per-platform UTF-8 codecs. iOS Map.getString for a 62-byte URL dropped from 268 ns to 102 ns.
- JVM, JS Node, Android unit, and iOS simulator test suites pass (56+ tests including MicroBench and CrossPlatformBenchmark).
- JVM async-profiler pipeline (phaseProfile Gradle task) and iOS sample-based profiling pipeline (profile-ios-sim.sh) are both operational.
- Native test binaries now compile with -opt; without it they ran 7-8× slower.
What is still bounded
- FlexBuffers are self-describing, so small string-heavy payloads can be larger than JSON (UserProfile: 833 B vs JSON 710 B).
- iOS / Kotlin Native is 2.4-4.5× slower than JVM; the gap is structural (no JIT, no escape analysis, heap-only allocation).
- C++ index decode remains the floor (~0.04 µs for UserProfile vs ~0.45 µs for the JVM accessor read).
- JS is 7-45× slower than JVM; V8's native JSON.parse is hard to beat for small struct workloads.
- Controlled JSON token scans still beat best-path Flex on 16/30 adversarial rows because those scans avoid full parsing.
Most important correction: previous docs claimed UserProfile Kotlin decode beat C++ full decode. The current C++ harness disproves that: C++ key decode is about 0.33 µs, C++ index decode is about 0.04 µs for UserProfile. Kotlin FlexCoder is fast, but raw C++ index access remains the floor.
Five Access Tiers
The library exposes five tiers from fastest to most convenient:
@Struct @Serializable
data class UserProfile(
val id: Long,
val username: String,
val tags: List<String>,
val address: Address // nested @Struct
)
| # | Tier | API | Allocation profile |
| 1 | Accessor (zero-copy, lazy) | bytes.asUserProfile().username | @JvmInline value class over a Map; no data-class alloc; lazy collection wrappers. |
| 2 | FlexCoder (KSP-generated) | FlexBuffers.decode<UserProfile>(bytes) | Data class + collections; zero Reference per field; index-based map reads. |
| 3 | Accelerated serializer | FlexBuffers.decode(serializer<T>(), bytes) | Registry routes to FlexCoder; drop-in replacement for Json.encodeToString. |
| 4 | Raw kotlinx.serialization | Same after FlexCoderRegistry.clear() | FlexDecoderV2/FlexEncoderV2 with field-index cache, current-map/vector-index direct reads. |
| 5 | JSON baseline | Json.decodeFromString(serializer<T>(), s) | For comparison, debugging, or external interop. |
Implementation Map
| Area | Files | Responsibility |
| KSP processor | reaktor-compiler/.../FlexCoderProcessor.kt | Scans @Struct, emits FlexCoder + Accessor + asXxx() extensions + registration aggregator. |
| Public API | core/FlexBuffers.kt, core/FlexCoder.kt | Encode/decode entry points, registry lookup, fallback to kotlinx.serialization. |
| Decoder fallback | core/FlexDecoderV2.kt | Descriptor-driven serializer decode with field-index cache, currentMapIndex/currentVectorIndex direct reads, and beginStructure fast path. |
| Encoder fallback | core/FlexEncoderV2.kt | kotlinx.serialization AbstractEncoder with bulk primitive collection paths. |
| Per-platform pool | core/PerPlatformPool.kt (+ 4 actuals) | Single-slot pool: JVM/Android ThreadLocal, Native @Volatile, JS plain. |
| Builder/runtime | flatbuffers/FlexBuffersBuilder.kt, flatbuffers/FlexBuffers.kt | Internalized FlatBuffers Kotlin runtime with Reaktor-specific builder optimizations. |
| UTF-8 codec | flatbuffers/FastDecode.kt (+ 4 actuals) | fastDecodeUtf8 / fastEncodeUtf8 / fastEncodedLength expect/actual. |
| Collections | core/FlexCollections.kt | Lazy zero-copy FlexIntList / FlexStringStringMap / etc. |
| Builder pool | core/FlexBufferPool.kt | 16-slot CAS-backed pool of 16 KB pre-grown builders. |
| iOS profiling | iosMain/.../bench/IosBench.kt, flamechart/profile-ios-sim.sh | Long-running release executable + sample-based driver script. |
| JVM profiling | jvmMain/.../bench/PhaseProfiler.kt, flamechart/analyze.py | Per-phase async-profiler runner + hot-frame aggregator. |
| C++ reference | cpp/bench/flexbuffer_bench.cpp | Native harness for wire-size verification and key-vs-index decode comparisons. |
| Benchmarks | src/commonTest/.../*Benchmark*.kt, MicroBench.kt | KMP cross-platform benchmark + per-operation micro-benchmark. |
Runtime Architecture
Generated fast path
T → FlexCoderRegistry → GeneratedFlexCoder.encode(builder, value) → FlexBuffersBuilder → ByteArray
ByteArray → FlexBuffers.getRoot(bytes).asMap → GeneratedFlexCoder.decode(map) → T
- Registry resolves coders by KClass or kotlinx serial name.
- Generated fields are written in alphabetical order, fixed at compile time.
- Builder receives presorted maps and skips per-map sorting.
- Generated decode uses stable field indexes — O(1) per field.
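The bullets above can be illustrated with a minimal, self-contained sketch. It is not the generated code: ToyMap, this endMap, and the Point model are hypothetical stand-ins, and field names are assumed to be ASCII so that Kotlin string order matches byte order. Only the idea (write keys alphabetically, skip the sort, read by fixed position) comes from the text.

```kotlin
// Toy stand-in for a FlexBuffers map: keys kept sorted, values stored in the same order.
class ToyMap(val keys: List<String>, val values: List<Any?>)

// Hypothetical endMap: sorts entries unless the caller promises they are already sorted.
fun endMap(entries: List<Pair<String, Any?>>, presorted: Boolean): ToyMap {
    val ordered = if (presorted) entries else entries.sortedBy { it.first }
    return ToyMap(ordered.map { it.first }, ordered.map { it.second })
}

// Declaration order deliberately differs from alphabetical order.
data class Point(val y: Int, val label: String, val x: Int)

// What a generator could emit: fields written in alphabetical order (label, x, y).
fun encodePoint(p: Point): ToyMap =
    endMap(listOf("label" to p.label, "x" to p.x, "y" to p.y), presorted = true)

// Positional decode: indexes 0, 1, 2 are fixed by the alphabetical order of the field names.
fun decodePoint(m: ToyMap): Point =
    Point(label = m.values[0] as String, x = m.values[1] as Int, y = m.values[2] as Int)

fun main() {
    val original = Point(y = 7, label = "origin", x = 3)
    val decoded = decodePoint(encodePoint(original))
    println(decoded == original) // prints "true"
}
```

Because the field order is known when the coder is generated, neither the encode-side sort nor the decode-side key search has to run per call.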
Serializer fallback path
T → kotlinx.serialization descriptor → FlexEncoderV2 → FlexBuffersBuilder → ByteArray
ByteArray → FlexDecoderV2 → descriptor element index → serializer callbacks → T
- Keeps third-party and non-@Struct models working.
- Descriptor field-index cache + currentMapIndex/currentVectorIndex direct reads avoid Reference allocation.
- Decoder/encoder instances are acquired from PerPlatformPool; per-thread on JVM/Android.
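As a rough illustration of the per-thread single-slot idea on JVM/Android, the sketch below caches one instance per thread in a ThreadLocal. SingleSlotPool and its acquire/release methods are hypothetical names; the actual PerPlatformPool API is not shown in this document.

```kotlin
// Hypothetical JVM-flavoured single-slot pool: one cached instance per thread.
class SingleSlotPool<T : Any>(private val factory: () -> T) {
    private val slot = ThreadLocal<T?>()

    // Take the cached instance if present, otherwise create a fresh one.
    fun acquire(): T {
        val cached = slot.get()
        if (cached != null) {
            slot.set(null) // the slot stays empty while the instance is in use
            return cached
        }
        return factory()
    }

    // Return an instance so the next acquire() on this thread can reuse it.
    fun release(value: T) {
        slot.set(value)
    }
}

fun main() {
    var created = 0
    val pool = SingleSlotPool { created++; StringBuilder() }
    val first = pool.acquire()
    pool.release(first)
    val second = pool.acquire()
    println(first === second) // prints "true": the same instance was reused
    println(created)          // prints "1"
}
```

No atomic operation is needed here because a ThreadLocal slot is only ever touched by its owning thread.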
Zero-copy accessor path
ByteArray → FlexBuffer Map → @JvmInline value-class Accessor → typed property reads
Accessors are for read-heavy paths where the caller does not need a full data class. They wrap FlexBuffer maps and expose typed properties that read directly from the byte buffer. Lazy list wrappers (FlexIntList, FlexStringStringMap) avoid materializing collections until the caller reads an element.
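A minimal sketch of the accessor idea follows. It assumes a toy byte layout invented for this example (not the FlexBuffers wire format), and ToyAccessor, ToyTagList, and asToy() are hypothetical names modeled on the asXxx() helpers mentioned above.

```kotlin
// Toy layout (an assumption, not the FlexBuffers wire format):
// byte 0 = id, byte 1 = tag count n, bytes 2 until 2 + n = one tag byte each.
@JvmInline
value class ToyAccessor(private val bytes: ByteArray) {
    // Reads straight from the buffer; no data class is built.
    val id: Int get() = bytes[0].toInt()

    // Lazy view: nothing is copied until an element is read.
    val tags: List<Int> get() = ToyTagList(bytes)
}

class ToyTagList(private val bytes: ByteArray) : AbstractList<Int>() {
    override val size: Int get() = bytes[1].toInt()

    override fun get(index: Int): Int {
        if (index !in 0 until size) throw IndexOutOfBoundsException("index $index, size $size")
        return bytes[2 + index].toInt()
    }
}

// Extension in the style of the asXxx() helpers named in the text.
fun ByteArray.asToy(): ToyAccessor = ToyAccessor(this)

fun main() {
    val payload = byteArrayOf(42, 3, 10, 20, 30)
    val view = payload.asToy()
    println(view.id)      // prints "42"
    println(view.tags[1]) // prints "20"
}
```

A caller that only needs view.id never pays for the tag list, which is the trade the accessor tier makes against full decode.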
Encode + decode µs/op, Apple M-series, min of 3 runs of 5,000 iterations with 500 warmup, per CrossPlatformBenchmark (in commonTest, runs identically on every target).
FlexCoder (KSP-generated, the production hot path)
| Case | JVM | Android | iOS sim | JS Node |
| UserProfile (14 fields, nested) | 3.1 µs | 5.0 µs | 7.3 µs | 21.9 µs |
| ChatThread (15 msgs, nested) | 7.0 µs | 10.1 µs | 24.5 µs | 96.1 µs |
| ApiResponse (20 products, lists) | 14.4 µs | 15.5 µs | 55.0 µs | 167.0 µs |
| TimeSeries (256d + 256L typed) | 4.1 µs | 4.2 µs | 18.4 µs | 163.3 µs |
vs JSON baseline
| Case | JVM | Android | iOS | JS |
| UserProfile | 1.0× | 2.1× | 1.1× | 0.4× |
| ChatThread | 1.5× | 1.4× | 1.7× | 0.5× |
| ApiResponse | 1.6× | 1.5× | 1.5× | 0.6× |
| TimeSeries | 10.6× | 10.6× | 10.0× | 0.7× |
Headlines: per the table above, FlexBuffer ranges from parity (1.0×) to 10.6× faster than JSON on JVM/Android. TimeSeries numeric bulk dominates everywhere except JS. On iOS, FlexBuffer roughly matches JSON for small structs (1.1×) and wins on numeric/nested payloads. V8's native JSON wins on JS for every row in this table (0.4-0.7×).
Shipped Optimizations
This section lists every optimization originally proposed in the improvement plan, marked with its current implementation status. Items annotated (new) were added during the most recent cross-platform performance pass.
Decoder & runtime
| Status | Optimization | Where |
| ✅ | Field index cache — per-class IntArray mapping descriptor index to alphabetical map position. Replaces O(log n) map.get(key) with O(1) array lookup; deterministic from field names alone, computed once per class. | FlexDecoderV2.fieldIndexCache |
| ✅ | currentMapIndex direct reads — decodeElementIndex stores the map position; decodeInt / decodeString / etc. call map.getInt(idx) directly. Zero Reference allocation for primitive fields. (new) | FlexDecoderV2 |
| ✅ | currentVectorIndex direct reads — same pattern for VECTOR contexts. decodeElementIndex records the index; decodeXxx calls vec.readInt(i) / readString(i) / etc. (new) | FlexDecoderV2 |
| ✅ | MAP_ENTRIES value fast path — Map<K,V> value side stores currentMapIndex instead of allocating a Reference. Halves per-entry allocation. (new) | FlexDecoderV2.decodeElementIndex |
| ✅ | beginStructure direct-dispatch fast path — nested CLASS / LIST / MAP from a parent map/vector context call map.getMap(i) / vec.readMap(i) directly, never materialising the intermediate Reference. (new) | FlexDecoderV2.beginStructure |
| ✅ | Lazy decode-context stack init — DecodingContextStack / StructureStack no longer pre-fill 16 entries per call. Grow on demand; removed 22% / 23% of decode/encode allocations seen in flamegraph. (new) | FlexDecoderV2, FlexEncoderV2 |
| ✅ | @JvmField on hot mutable state — bypasses synthetic Kotlin property getters. DecodingContext.getType() / getFieldIndices() were 2-3% each in CPU profile; gone after this change. (new) | DecodingContext, StructureEntry |
| ✅ | Map.keyVector lazy init — keyVectorEnd / keyVectorByteWidth computed on first key-vector access, not in constructor. FlexCoder index reads never touch them, saving 2 buffer reads + 2 field writes per Map construction. (new) | Map in FlexBuffers.kt |
| ✅ | Gated registry & bulk-array dispatch — skip FlexCoderRegistry.getBySerialName hash lookup when registry is empty; gate endsWith("Array") on kind == LIST first. (new) | FlexDecoderV2, FlexEncoderV2 |
| ✅ | Exact key comparison — Map binary search treats key prefixes as distinct; ArrayReadBuffer.findFirst respects slice offsets. Fixes prefix-lookup correctness while preserving fast ASCII key search. | Map, ArrayReadBuffer |
| ✅ | Map.indexOf + direct keyed scalar getters — getInt / getLong / getDouble / getStringByteLength skip Reference allocation. Improves schema fallback and sparse missing lookup (sparse miss 1.64 µs vs C++ 11.64 µs). | Map |
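The field index cache row above can be sketched as follows. This is an illustration under assumptions, not the FlexDecoderV2 code: buildFieldIndexCache is a hypothetical name, field names are assumed to be distinct and ASCII, and the declared-order list stands in for a serialization descriptor.

```kotlin
// Maps each declaration-order field index to its position in the alphabetically sorted key vector.
fun buildFieldIndexCache(declaredFields: List<String>): IntArray {
    val sorted = declaredFields.sorted()
    // binarySearch is only paid once per class; later reads are plain array lookups.
    return IntArray(declaredFields.size) { i -> sorted.binarySearch(declaredFields[i]) }
}

fun main() {
    // Declaration order, as a serialization descriptor would report it.
    val declared = listOf("id", "username", "tags", "address")
    val cache = buildFieldIndexCache(declared)
    // Sorted keys are: address, id, tags, username.
    println(cache.joinToString()) // prints "1, 3, 2, 0"
}
```

With the cache in hand, decoding descriptor element i becomes a read at map position cache[i] rather than a key search.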
Encoder & builder
| Status | Optimization | Where |
| ✅ | ValueStack as parallel primitive arrays — IntArray types, LongArray iVals, DoubleArray dVals. Zero per-field Value object allocation; cache-friendly contiguous storage. | FlexBuffersBuilder.ValueStack |
| ✅ | Pre-sorted field order — KSP emits builder.set("a", ...), builder.set("b", ...) alphabetically + endMap(... presorted = true). No runtime sort for KSP-generated coders. | FlexCoderProcessor, FlexBuffersBuilder.endMap |
| ✅ | In-place dynamic map sorting — ValueStack.sortByKeys sorts the parallel value-stack arrays in place instead of allocating index and temporary arrays for every unsorted dynamic map. | FlexBuffersBuilder.ValueStack |
| ✅ | Primitive collection writers — setIntCollection / setLongCollection / setDoubleCollection / setFloatCollection write serializer fallback collections directly without intermediate primitive-array conversion. | FlexBuffersBuilder |
| ✅ | Bulk vector reads — TypedVector.toIntArray() / toLongArray() / toDoubleArray() / toFloatArray() hoist byteWidth outside the loop and walk the buffer sequentially. | FlexBuffers.kt (vendored layer) |
| ✅ | 16 KB default buffer — FlexBufferPool.DEFAULT_BUFFER_SIZE raised from 4 KB to 16 KB. ApiResponse-sized payloads (7.5 KB) no longer trigger resize during encode. (new) | FlexBufferPool.kt |
| ✅ | CAS builder pool — 16-slot atomic pool of reusable 16 KB builders; avoids handing the same builder to two threads. | FlexBufferPool.kt |
| ✅ | Key sharing — builder string/key cache reuses key strings across repeated schema writes. | FlexBuffersBuilder |
| ✅ | copyInto / System.arraycopy — bulk copies use ByteArray.copyInto which intrinsifies to memcpy on JVM and uses stdlib's optimised path on Native. | Buffers.kt |
| ✅ | Copy-free internal API — FlexBuffers.encodeToBuffer lets direct-coder callers consume a ReadBuffer without forcing the final ByteArray copy. | FlexBuffers.kt |
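The CAS builder pool row can be approximated with the JVM-only sketch below. CasPool is a hypothetical class built on java.util.concurrent.atomic.AtomicReferenceArray; the real FlexBufferPool is multiplatform and its internals are not shown here.

```kotlin
import java.util.concurrent.atomic.AtomicReferenceArray

// Hypothetical fixed-size pool; slot claims use compare-and-set so two threads never get the same item.
class CasPool<T : Any>(slots: Int, private val factory: () -> T) {
    private val items = AtomicReferenceArray<T?>(slots)

    fun acquire(): T {
        for (i in 0 until items.length()) {
            val candidate = items.get(i)
            // Only the thread whose CAS succeeds owns the candidate.
            if (candidate != null && items.compareAndSet(i, candidate, null)) return candidate
        }
        return factory() // pool empty: allocate a fresh instance
    }

    fun release(item: T) {
        for (i in 0 until items.length()) {
            if (items.compareAndSet(i, null, item)) return
        }
        // Pool full: drop the item and let the garbage collector reclaim it.
    }
}

fun main() {
    val pool = CasPool(slots = 16) { ByteArray(16 * 1024) }
    val buffer = pool.acquire()
    pool.release(buffer)
    println(pool.acquire() === buffer) // prints "true"
}
```

The slot count and buffer size in main mirror the 16-slot, 16 KB figures from the table; everything else is illustrative.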
KSP code generation
| Status | Optimization | Where |
| ✅ | @Struct annotation triggers KSP — emits FlexCoder<T> + @JvmInline value class Accessor + Reference.asXxx() / ByteArray.asXxx() extensions for every annotated class. | FlexCoderProcessor |
| ✅ | @JvmInline value class accessors — getters compile to pointer arithmetic, no data class allocation. Lazy collection wrappers and nested accessors. | All generated XxxAccessor |
| ✅ | Nullable field support — KSP emits if (value.x != null) builder.set(...) else builder.putNull(...), decode branches on map.isNullAt(i). | FlexCoderProcessor, Map.isNullAt |
| ✅ | String byte-length accessors — KSP-generated accessors include *ByteLength properties so previews can measure/filter strings without decoding them. | FlexCoderProcessor |
Cross-platform & per-platform
| Status | Optimization | Where |
| ✅ | PerPlatformPool<T> (expect/actual) — JVM/Android ThreadLocal<T?>, Native @Volatile var slot: T?, JS plain var slot: T?. Each runtime uses its cheapest single-slot primitive instead of paying cross-platform CAS overhead (~30 ns saved per acquire on Native). (new) | PerPlatformPool.kt + 4 actuals |
| ✅ | fastDecodeUtf8 (expect/actual) — JVM/Android/JS delegate to stdlib (JIT intrinsifies); iOS uses ASCII fast path + NSString.create(bytes:length:encoding:). Cut iOS Map.getString from 268 ns to 102 ns for a 62-byte URL. (new) | FastDecode.kt + platform actuals |
| ✅ | fastEncodeUtf8 (expect/actual) — Native uses a direct char→byte ASCII loop without the .also { cc = it } closure capture that defeats AOT optimisation in the stock encoder. (new) | FastDecode.kt + actuals |
| ✅ | fastEncodedLength (expect/actual) — ASCII fast-path UTF-8 byte-count. Eliminated 89% of Utf8.encodedLength samples on iOS (from 6.9% CPU to 0.8%). (new) | FastDecode.kt + actuals |
| ✅ | -opt for Native test binaries — default debug-mode iOS tests were 7-8× slower than release; benchmark numbers were misleading. Fixed via shared KMP target config. (new) | build.gradle.kts |
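To make the ASCII fast-path idea behind fastEncodedLength concrete, here is a small sketch. The function name utf8Length and the handling of lone surrogates (counted as 3 bytes, as if replaced by U+FFFD) are assumptions of this example; platform encoders may treat malformed input differently.

```kotlin
// Sketch of an ASCII fast path for UTF-8 byte counting (function name is illustrative).
fun utf8Length(s: String): Int {
    var i = 0
    // Fast path: every ASCII char encodes to exactly one byte.
    while (i < s.length && s[i].code < 0x80) i++
    if (i == s.length) return i

    // Slow path: count the remaining chars by encoded width.
    var bytes = i
    while (i < s.length) {
        val c = s[i]
        when {
            c.code < 0x80 -> bytes += 1
            c.code < 0x800 -> bytes += 2
            c.isHighSurrogate() && i + 1 < s.length && s[i + 1].isLowSurrogate() -> {
                bytes += 4 // a valid surrogate pair is one 4-byte code point
                i++        // skip the low surrogate
            }
            else -> bytes += 3 // other BMP chars; lone surrogates are counted as 3 bytes
        }
        i++
    }
    return bytes
}

fun main() {
    val samples = listOf("https://example.com/a", "héllo", "héllo 😀")
    for (s in samples) {
        // Compare against the stdlib encoder for well-formed input.
        println(utf8Length(s) == s.encodeToByteArray().size) // prints "true"
    }
}
```

For an all-ASCII URL the function returns after the first loop, which is the case the 62-byte URL measurement above exercises.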
Profiling pipeline
| Status | Tool | Where |
| ✅ | JVM phaseProfile — per-phase async-profiler runner, captures CPU + alloc flamegraphs per (tier × payload). HTML + collapsed-format output. Python aggregator filters JIT compiler-thread noise. (new) | PhaseProfiler.kt, flamechart/analyze.py |
| ✅ | iOS sample-based profiler — long-running release executable spawned via simctl, sampled with macOS sample. Full Kotlin/Native symbols. Driver script auto-boots sim, finds PID, captures top-of-stack profile. (new) | IosBench.kt, profile-ios-sim.sh |
| ✅ | MicroBench — cross-platform per-operation timings (Map.getString, vec.readLong, allocations) so each platform's hot operations can be measured directly. (new) | MicroBench.kt (commonTest) |
| ✅ | CrossPlatformBenchmark — KMP-portable 4-tier benchmark in commonTest. Runs identically on JVM, Android unit, iOS sim, JS Node. Single source of truth for cross-platform numbers. (new) | CrossPlatformBenchmark.kt |
| ✅ | C++ harness — standalone native bench for wire-size verification and key-vs-index decode comparisons. | cpp/bench/flexbuffer_bench.cpp |
iOS Profile (Real Sample Data)
Sample of ApiResponse + UserProfile + ChatThread + TimeSeries encode + decode hot loop (5 s, 1 ms sampling, ~5,200 worker-thread samples). The exact numbers come from flamechart/output/ios/ios-sim-sample.txt after the optimization pass:
| Function | Samples | % of worker | Notes |
| Kotlin_String_get | 880 | 16.9% | String indexing in tight loops (UTF-8 ASCII scanners). Mostly inherent. |
| ArrayReadWriteBuffer.requestCapacity | 587 | 11.3% | Bounds check on every set(). Most are early-return; cost is the call + TLS access. |
| fastDecodeUtf8 | 620 | 11.9% | Our ASCII fast-path decode. Inherent for UTF-8 → String. |
| ArrayReadWriteBuffer.put(CharSequence) | 514 | 9.9% | UTF-8 encode (string write). |
| tlv_get_addr | 404 | 7.8% | Native thread-local storage access (singletons, GC state). Partly inherent. |
| FixedBlockPage::Sweep | 299 | 5.7% | GC sweep (allocation pressure). |
| CustomAllocator::Allocate | 182 | 3.5% | Heap allocations. |
| __CFFromUTF8 | 153 | 2.9% | Apple's NSString.create for non-ASCII strings. |
| Utf8.encodedLength | 41 | 0.8% | Was 360 / 6.9% before fastEncodedLength (−89%). |
What this tells us:
- GC + allocation overhead is ~10% combined. Closing it requires fewer allocations (Cursor value class, Map pool, primitive-array List decode).
- tlv_get_addr at 7.8% is Native compiler-inserted state, partly inherent.
- UTF-8 work (string get, decode, encode) totals roughly 38% of cycles. Already heavily optimized; further wins likely require slice-style APIs (no String materialization).
- Interface dispatch on ReadBuffer still costs: Native cannot devirtualise through interfaces.
Realistic iOS target with all pending items applied: 1.5-3× slower than JVM, down from 2.4-4.5× slower today.
Partial & Pending Optimizations
⚠ Partial — could go further
| Item | What's done | What's left |
| Map<String, String> decode | Slow path materializes a LinkedHashMap<String, String>; lazy FlexStringStringMap view exists. | Could emit a FlexStringStringMap directly when the field type is Map<String, String>, saving N×2 allocations per nested map. |
| endMap key-width calc | calculateKeyVectorBitWidth loops over every entry calling elemWidth. | KSP could emit a hard-coded key-vector bit-width for fixed schemas (all known offsets at compile time). |
| Default buffer sizing | 16 KB initial; most payloads fit. | Tiered pool: small (4 KB) / medium (16 KB) / large (256 KB) slots; acquire by size hint. |
| Bulk primitive list decode | LongArray / IntArray / DoubleArray fields use vec.toLongArray(). | List<Long> / List<Double> still allocate ArrayList<Long> with boxed elements. LongArray.asList() regressed in testing (anonymous AbstractList wrapper dispatch was costly). Needs a custom non-boxing List<Long> adapter. |
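One possible shape for the non-boxing List<Long> adapter mentioned in the last row is sketched below. LongArrayView is a hypothetical class, and whether a named adapter like this avoids the dispatch cost seen with LongArray.asList() is untested here; the sketch only shows how a primitive accessor can sit next to the boxed List interface.

```kotlin
// Hypothetical adapter: a List<Long> view over a LongArray with extra unboxed accessors.
class LongArrayView(private val data: LongArray) : AbstractList<Long>(), RandomAccess {
    override val size: Int get() = data.size

    // Interface read: callers going through List<Long> still receive a boxed value.
    override fun get(index: Int): Long = data[index]

    // Primitive accessor for callers that know the concrete type.
    fun getLong(index: Int): Long = data[index]

    // Sums without boxing by walking the backing array directly.
    fun sumUnboxed(): Long {
        var total = 0L
        for (v in data) total += v
        return total
    }
}

fun main() {
    val view = LongArrayView(longArrayOf(1L, 2L, 3L))
    println(view.sumUnboxed()) // prints "6"
    println(view.getLong(2))   // prints "3"
    println(view.size)         // prints "3"
}
```

The view shares the backing array, so decode cost stays at one bulk toLongArray() call and no per-element ArrayList inserts.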
⏳ Pending — not yet attempted
| Item | Estimated impact | Why deferred |
| @JvmInline value class Cursor(packed: Long) | Eliminates residual Reference heap allocations (~5-9% of decode allocations). Biggest gap to C++. | Requires plumbing a ReadBuffer reference through scope-local state, because value classes can only have one field. |
| Unsafe / VarHandle direct reads on JVM | Skips bounds checks + interface dispatch on ReadBuffer reads. | Restricted-method warnings on recent JVMs; needs gating behind a flag. |
| CPointer / MemorySegment on Native / JDK 22+ | Similar gain on Native — direct memory loads. | Partially used in iOS UTF-8 path; could extend to all reads. |
| Concrete ReadWriteBuffer type in FlexBuffersBuilder | Replaces interface field with ArrayReadWriteBuffer concrete type → Native AOT can devirtualise the buffer.set(...) calls (12% in iOS profile). | Breaks public API. |
| Schema evolution safe decode | Generated field layout fingerprint table so index decode can verify the map shape before fast-path reads. | Substantial KSP generator change. |
| Compile-time endMap skip | KSP could emit pre-computed key-vector geometry, skipping both sort and width-calc loops entirely for fixed schemas. | Substantial generator change; biggest remaining encode-side win. |
| Adaptive key/string sharing policy | C++ shows unique-string encode can be 2.0× slower with sharing. | Two-pass detection can cost too much; needs explicit policy hooks. |
| FlexUtf8Slice (byte-range view) | Compare / hash UTF-8 without materializing String. Useful for filter/preview scans. | Callers must avoid holding slices after backing buffer reuse. |
| JS-specific Long handling | JS BigInt for Long is expensive. For values fitting in 53 bits, Number is much faster. | Out of scope unless JS becomes a hot platform. |
| JMH gates in CI | Cleaner before/after evidence; fewer false regressions. | Longer CI runtime; current benchmark-style tests are sufficient for regression smoke. |
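The Cursor(packed: Long) item in the table above could look roughly like the sketch below. The bit layout (32-bit offset, 8-bit byte width, 8-bit type tag) is an assumption made for illustration and is not taken from the library; it also shows why the buffer itself must travel separately, since the single Long field has no room for a reference.

```kotlin
// Hypothetical packing: low 32 bits = byte offset, next 8 bits = byte width, next 8 bits = type tag.
@JvmInline
value class Cursor(val packed: Long) {
    val offset: Int get() = (packed and 0xFFFF_FFFFL).toInt()
    val byteWidth: Int get() = ((packed ushr 32) and 0xFFL).toInt()
    val type: Int get() = ((packed ushr 40) and 0xFFL).toInt()

    companion object {
        fun of(offset: Int, byteWidth: Int, type: Int): Cursor =
            Cursor(
                (offset.toLong() and 0xFFFF_FFFFL) or
                    ((byteWidth.toLong() and 0xFFL) shl 32) or
                    ((type.toLong() and 0xFFL) shl 40)
            )
    }
}

// The buffer is passed alongside the cursor because the value class cannot hold it.
fun readByteAt(buffer: ByteArray, cursor: Cursor): Byte = buffer[cursor.offset]

fun main() {
    val c = Cursor.of(offset = 2, byteWidth = 4, type = 5)
    println("${c.offset} ${c.byteWidth} ${c.type}")    // prints "2 4 5"
    println(readByteAt(byteArrayOf(10, 20, 30, 40), c)) // prints "30"
}
```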
Detailed Benchmarks (May 2026 JVM ledger)
The JVM realistic ledger remains the broadest comparable signal — 26 realistic workloads run via RealisticBenchmark.summary:
| Case | FlexCoder | Serializer | JSON | Flex B | JSON B | Speedup |
| UserProfile | 4 µs | 6 µs | 4 µs | 833 | 710 | 0.9× |
| ApiResponse | 14 µs | 34 µs | 37 µs | 7483 | 8506 | 2.7× |
| EventLog | 3 µs | 4 µs | 3 µs | 758 | 618 | 0.9× |
| ChatThread | 8 µs | 17 µs | 18 µs | 3372 | 3380 | 2.4× |
| ConfigSnapshot | 4 µs | 9 µs | 7 µs | 1059 | 1138 | 1.8× |
| TimeSeries | 4 µs | 14 µs | 45 µs | 4340 | 5835 | 10.8× |
| NotificationInbox | 22 µs | 47 µs | 44 µs | 12377 | 12526 | 2.0× |
| OrderHistory | 18 µs | 31 µs | 36 µs | 7813 | 11028 | 2.0× |
| MediaLibrary | 21 µs | 38 µs | 45 µs | 13326 | 13911 | 2.1× |
| SearchResults | 21 µs | 38 µs | 44 µs | 10943 | 11132 | 2.1× |
| WorkoutSession | 22 µs | 36 µs | 87 µs | 13270 | 15760 | 3.9× |
| BankingLedger | 35 µs | 65 µs | 70 µs | 22635 | 22072 | 2.0× |
| RideHistory | 62 µs | 127 µs | 329 µs | 49784 | 55882 | 5.3× |
| ProjectBoard | 55 µs | 94 µs | 134 µs | 43033 | 40367 | 2.4× |
| DocumentCorpus | 203 µs | 283 µs | 410 µs | 134808 | 147101 | 2.0× |
| SecurityAudit | 51 µs | 102 µs | 113 µs | 24312 | 26938 | 2.2× |
| GraphSnapshot | 81 µs | 172 µs | 172 µs | 27816 | 27606 | 2.1× |
| Recommendation | 49 µs | 105 µs | 125 µs | 27704 | 23211 | 2.6× |
| GameWorld | 82 µs | 161 µs | 184 µs | 35920 | 33116 | 2.3× |
| IoTFleet | 92 µs | 192 µs | 268 µs | 58328 | 52442 | 2.9× |
| CRMPortfolio | 57 µs | 106 µs | 135 µs | 38024 | 35798 | 2.4× |
| TravelItinerary | 14 µs | 25 µs | 25 µs | 6484 | 6185 | 1.8× |
| CourseRoster | 92 µs | 190 µs | 267 µs | 62120 | 59741 | 2.9× |
| ShipmentBatch | 65 µs | 116 µs | 124 µs | 33296 | 33917 | 1.9× |
| MarketData | 125 µs | 241 µs | 359 µs | 93935 | 66512 | 2.9× |
| SocialGraphDelta | 92 µs | 187 µs | 193 µs | 42624 | 44784 | 2.1× |
30-row JSON-vs-Flex adversarial harness
| Metric | Result | Interpretation |
| Full FlexCoder vs full kotlinx JSON | 0/30 losses | Generated full encode/decode beats full JSON parse/materialization on every adversarial case. |
| Best Flex path vs best JSON path | 16/30 losses | Includes controlled JSON token scans — intentionally hostile to Flex partial reads. |
| Flex size vs JSON size | 14/30 losses | Binary wins on repeated/numeric structures; JSON stays compact for many string-heavy payloads. |
Flex Kotlin vs Flex C++ (10-row ledger)
| Case | Kotlin | C++ | Ratio | Winner |
| TinyStatus key decode | 0.27 µs | 0.07 µs | 3.8× | C++ |
| TinyStatus index decode | 0.11 µs | 0.01 µs | 11.0× | C++ |
| TinyStatus partial key | 0.19 µs | 0.03 µs | 6.4× | C++ |
| Sparse missing lookups | 1.64 µs | 11.64 µs | 0.1× | Kotlin |
| StringTable scan | 4.41 µs | 7.84 µs | 0.6× | Kotlin |
| TimeSeries index scan | 0.76 µs | 1.08 µs | 0.7× | Kotlin |
| Wide random key reads | 4.27 µs | 4.47 µs | 1.0× | Kotlin |
| Wide sequential index | 0.52 µs | 0.14 µs | 3.7× | C++ |
| Unique strings encode (sharing) | 79.35 µs | 137.41 µs | 0.6× | Kotlin |
| Unique strings encode (no sharing) | 60.04 µs | 61.24 µs | 1.0× | Kotlin |
Overall: Kotlin Flex loses 4/10 to C++ Flex; wins 6/10 on the new helper paths (sparse miss / StringTable / TimeSeries / wide random / both unique-strings rows).
Wire Size Tradeoffs
| Case | FlexBuffer | JSON | Delta | Interpretation |
| UserProfile | 833 B | 710 B | +17.3% | Small mixed/string payload — JSON is compact. |
| EventLog | 758 B | 618 B | +22.7% | String and metadata overhead dominates. |
| ChatThread | 3372 B | 3380 B | -0.2% | Essentially size-neutral. |
| ApiResponse | 7483 B | 8506 B | -12.0% | Nested product payload benefits from binary encoding. |
| ConfigSnapshot | 1059 B | 1138 B | -6.9% | Moderate binary win. |
| TimeSeries | 4340 B | 5835 B | -25.6% | Numeric vector workload — ideal FlexBuffer use case. |
| RideHistory | 49784 B | 55882 B | -10.9% | Route-point numeric arrays offset nested map overhead. |
| DocumentCorpus | 134808 B | 147101 B | -8.4% | Large corpus still wins despite text-heavy segments. |
| Recommendation | 27704 B | 23211 B | +19.4% | Ranked feed is string/action heavy — JSON is smaller. |
| MarketData | 93935 B | 66512 B | +41.2% | Worst added-size case — map-heavy order books expose self-description cost. |
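The Delta column is the FlexBuffer size relative to the JSON size. A short sketch of that arithmetic, using two rows from the table (the helper name sizeDeltaPercent is made up for this example):

```kotlin
import java.util.Locale

// Relative size of the FlexBuffer encoding versus JSON, as a percentage of the JSON size.
fun sizeDeltaPercent(flexBytes: Int, jsonBytes: Int): Double =
    (flexBytes - jsonBytes) * 100.0 / jsonBytes

fun main() {
    // Locale.ROOT keeps the decimal separator a dot on every machine.
    println("%+.1f%%".format(Locale.ROOT, sizeDeltaPercent(833, 710)))   // UserProfile: prints "+17.3%"
    println("%+.1f%%".format(Locale.ROOT, sizeDeltaPercent(4340, 5835))) // TimeSeries: prints "-25.6%"
}
```

Positive values mean the FlexBuffer payload is larger than the JSON one.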
Interpretation
Operational rule: Generated FlexCoders are the right default for rich internal payloads. JSON remains a valid choice for tiny string-heavy public payloads and controlled token checks. Fixed binary stays fastest for closed telemetry rows with no schema-evolution requirement.
Where the optimization pays
- JVM/Android generated coders are 1.5-10× faster than JSON on nested and numeric-heavy payloads.
- TimeSeries shows the clearest combined speed (10.8×) and wire-size (-26%) win on every platform except JS.
- Android unit results match JVM — same HotSpot path, no surprises.
- iOS sample profile is now reproducible — we know exactly where Kotlin/Native cycles go.
- C++ harness confirms index-based reads are the correct design direction.
Where the data is mixed
- UserProfile and EventLog are larger than JSON for tiny string-heavy payloads.
- iOS gap to JVM is 2.4-4.5× — structural (no JIT, no escape analysis). 1.5-3× is the realistic ceiling.
- JS is 7-45× slower than JVM — V8's native JSON is hard to beat for small structs.
- Controlled JSON token scans beat best-path Flex on 16/30 adversarial rows.
Claim discipline: Say "generated FlexCoders are faster than full kotlinx JSON round trips for tested Reaktor payloads." Do not say "FlexBuffers are faster than JSON" without qualifiers — the adversarial harness intentionally disproves that broader statement.
Reproduce the Run
# Tests / correctness
./gradlew :reaktor-flexbuffer:jvmTest
./gradlew :reaktor-flexbuffer:iosSimulatorArm64Test
./gradlew :reaktor-flexbuffer:testReleaseUnitTest
./gradlew :reaktor-flexbuffer:jsNodeTest
# Cross-platform benchmark (4 fixtures × 4 tiers on every target)
./gradlew :reaktor-flexbuffer:jvmTest --tests "*.CrossPlatformBenchmark" --rerun
./gradlew :reaktor-flexbuffer:iosSimulatorArm64Test --tests "*.CrossPlatformBenchmark" --rerun
# Per-operation micro-bench (helps localise platform-specific hot spots)
./gradlew :reaktor-flexbuffer:iosSimulatorArm64Test --tests "*.MicroBench" --rerun
# JVM async-profiler: CPU + alloc flamegraphs per tier × payload
./gradlew :reaktor-flexbuffer:phaseProfile
python3 reaktor-flexbuffer/flamechart/analyze.py --top 12 reaktor-flexbuffer/flamechart/output/phase
# iOS sample-based profile (boots sim, spawns bench.kexe, samples, prints top-of-stack)
./gradlew :reaktor-flexbuffer:linkBenchReleaseExecutableIosSimulatorArm64
./reaktor-flexbuffer/flamechart/profile-ios-sim.sh
# For Instruments-grade traces:
xctrace record --template "Time Profiler" --output trace.xctrace --launch -- bench.kexe
# C++ reference harness
cd reaktor-flexbuffer/cpp/bench
clang++ -O2 -std=c++17 -I ../../../.github_modules/flatbuffers/include flexbuffer_bench.cpp -o flexbuffer_bench
./flexbuffer_bench --quick --verify
Roadmap
Phase 1: Schema evolution & registration hygiene (P0)
- Generate a field layout fingerprint per @Struct class; fall back to name lookup when the shape doesn't match.
- Make registry concurrency explicit (seal() after startup, or copy-on-write).
- Add nullable / evolved-schema golden tests.
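The fingerprint-then-fallback idea in the first bullet might be sketched like this. Everything here is hypothetical: layoutFingerprint hashes only the sorted field names (a real design might also cover types, and would need to consider hash collisions), and readField works on plain lists rather than an encoded map.

```kotlin
// Hypothetical fingerprint: hash of the sorted field names, computed once per class.
fun layoutFingerprint(fieldNames: List<String>): Int =
    fieldNames.sorted().joinToString(separator = "\u0000").hashCode()

// Decode one field: use the fixed index when the shape matches, otherwise look the key up by name.
fun readField(
    keys: List<String>,       // sorted keys found in the encoded map
    values: List<Any?>,
    expectedFingerprint: Int, // fingerprint baked in by the generator
    fieldName: String,
    fastIndex: Int,
): Any? {
    if (layoutFingerprint(keys) == expectedFingerprint) return values[fastIndex]
    val i = keys.binarySearch(fieldName)
    return if (i >= 0) values[i] else null // missing field: caller applies a default
}

fun main() {
    val expected = layoutFingerprint(listOf("id", "name"))

    // Same shape as the generator saw: the fast index is used.
    println(readField(listOf("id", "name"), listOf(1, "a"), expected, "name", fastIndex = 1))

    // Evolved payload with an extra "age" field: the name lookup finds "name" at a new position.
    println(readField(listOf("age", "id", "name"), listOf(30, 1, "a"), expected, "name", fastIndex = 1))
}
```

In practice the fingerprint would be checked once per map rather than once per field; the per-field check here just keeps the sketch short.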
Phase 2: Close the C++ allocation gap (P1)
- @JvmInline value class Cursor(Long) to replace Reference heap allocations on the hot path.
- FlexUtf8Slice for byte-range string compares without materialisation.
- Concrete ArrayReadWriteBuffer type in FlexBuffersBuilder to let Native AOT devirtualise the buffer.set 12% hot frame.
- KSP: compile-time endMap skip for fixed schemas.
Phase 3: Format-choice policy (P2)
- Teach Service/ObjectStore to route by payload: JSON for tiny/string/public; Flex for internal/nested/cache; compact binary for closed telemetry.
- Promote accessors to first-class ObjectStore/actor APIs.
- JMH gates for JVM; Instruments-grade trace for iOS, before any external performance claim.
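A format-choice router along the lines of the first bullet could start as small as the sketch below. The PayloadTraits flags and the precedence between them are assumptions for illustration; the text only states which format suits which kind of payload.

```kotlin
enum class WireFormat { JSON, FLEX, COMPACT_BINARY }

// Hypothetical payload traits; a real router would derive them from the model or call site.
data class PayloadTraits(
    val isPublic: Boolean,
    val isTinyStringHeavy: Boolean,
    val isClosedTelemetry: Boolean,
)

// Precedence (telemetry first, then JSON cases, then Flex) is an assumption of this sketch.
fun chooseFormat(t: PayloadTraits): WireFormat = when {
    t.isClosedTelemetry -> WireFormat.COMPACT_BINARY   // fixed rows, no schema evolution
    t.isPublic || t.isTinyStringHeavy -> WireFormat.JSON // tiny/string/public payloads
    else -> WireFormat.FLEX                            // internal/nested/cache payloads
}

fun main() {
    val cacheEntry = PayloadTraits(isPublic = false, isTinyStringHeavy = false, isClosedTelemetry = false)
    val publicProfile = PayloadTraits(isPublic = true, isTinyStringHeavy = true, isClosedTelemetry = false)
    println(chooseFormat(cacheEntry))    // prints "FLEX"
    println(chooseFormat(publicProfile)) // prints "JSON"
}
```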
Bottom line: Keep FlexBuffers as the Reaktor internal default for rich/nested/cache payloads. The fastest path is not "decode everything faster" — it is "do not decode everything." Accessors, byte-length helpers, and typed folds matter more than micro-tuning full materialization.