Skip to main content

Benchmarks: sqlite-wasm-run vs native sqlite3

Status (2026-06-22): living benchmark record. Both the pre-perf-push and post-perf-push snapshots are kept here as the canonical reference. Re-run via make bench after any perf-relevant change to refresh.

First numbers on the table. Until now the project's perf claims ("you can ship a 50 MB wasm cli that runs anywhere") were unverified this doc establishes the actual cost of running SQLite through a wasm component vs the native binary.

The harness is tooling/bench.py. Run it yourself with:

make bench # all workloads, default sizes
python3 tooling/bench.py --sizes 100000 # just the big ones
python3 tooling/bench.py --workloads read # one workload

Methodology

Each measurement is the median of N trials (default 3). Each trial uses a fresh on-disk db in a fresh tempdir, then runs the workload's SQL through subprocess.run with time.perf_counter around the call. The whole-process wall-clock time is what's reported cli startup is INCLUDED, deliberately, so the small sizes show the constant overhead the user pays.

nativesqlite3 3.43.2 (system)
wasmsqlite-wasm-run (wasmtime) + sqlite_cli.component.wasm (libsqlite3-sys 0.38, SQLite 3.53.2)
repeats3 (median)
dbfile-backed in a fresh tempdir per trial
journaldelete unless workload is *-wal

Caveat: native is 3.43.2; wasm is 3.53.2. Planner differences exist but are small at this workload shape.

Results

Snapshot from one run (Apple Silicon, macOS, June 2026).

Two rows per workload: .wasm parsed + compiled on every invocation (the default before this session), and .cwasm precompiled once via make precompile-cli and loaded via Component::deserialize_file. Run make bench for the first, make bench CWASM=1 for the second.

Original snapshot (pre-perf-push, kept for context)

WorkloadSizenative.wasmratio.cwasmratio
insert1,0007 ms628 ms79.5x53 ms7.5x
insert10,00019 ms486 ms23.2x100 ms5.2x
insert100,000138 ms1.32 s9.0x563 ms4.1x
insert-wal1,00012 ms473 ms39.4x~ same~ same
insert-wal100,000150 ms1.16 s7.8x~ same~ same
read1,00017 ms707 ms35.5x109 ms6.5x
read10,000108 ms1.39 s11.2x671 ms6.2x
read100,0001.08 s6.42 s5.5x6.32 s5.8x
agg1,0009 ms434 ms45.0x51 ms5.5x
agg10,00023 ms458 ms20.3x104 ms4.5x
agg100,000159 ms946 ms6.0x614 ms3.9x
join1,00010 ms442 ms47.3x76 ms7.9x
join10,00021 ms468 ms22.8x115 ms5.6x
join100,000129 ms899 ms7.0x569 ms4.4x

Post-perf-push snapshot (current; .cwasm only)

After the six perf rounds shipped in commits 21c941d (pragma defaults + sqlite flags), 14a0baf (pread + LTO + memory_reservation), 026f350 (256 MB page cache), 1325026 (SIMD + vec0 kernels), d62ef61 (fuel-disabled engine for the cli), and 6773484 (WIT vtab fetch_batch):

WorkloadSizenativewasmratiodrop vs pre-push
insert1,0007 ms41 ms6.2x7.5x → 6.2x
insert10,00018 ms72 ms3.9x5.2x → 3.9x
insert100,000135 ms396 ms2.9x4.1x → 2.9x
read1,00015 ms86 ms5.6x6.5x → 5.6x
read10,000105 ms517 ms4.9x6.2x → 4.9x
read100,0001.06 s4.89 s4.6x5.8x → 4.6x
agg1,0008 ms38 ms5.0x5.5x → 5.0x
agg10,00021 ms74 ms3.6x4.5x → 3.6x
agg100,000153 ms415 ms2.7x3.9x → 2.7x
join1,0009 ms51 ms5.7x7.9x → 5.7x
join10,00019 ms79 ms4.1x5.6x → 4.1x
join100,000125 ms377 ms3.0x4.4x → 3.0x

Batched vtab scan (new column)

The WIT vtab fetch_batch path (shipped 6773484) collapses N WIT crossings to N/batch_size crossings. For series (the canonical proof port) at 100k rows:

PathTime
per-row WIT (xColumn + xRowid + xNext + xEof per row × N rows)367 ms
fetch_batch (one crossing per 64 rows)49 ms

7.3x speedup on the WIT-loaded vtab scan; embed-path vtabs were already this fast.

Phase A of PLAN-perf-rollout.md rolled this across 8 more read-only vtabs (listargs, vec_each, completion, text-utils, time-series, trie, pmtiles, vec0); the proof bench (prefixes over a 10000-char input) showed 238 → 210 ms (~12%) where the workload's group_concat aggregation dominates the total time.

Best ratios this run hit

  • agg 100k at 2.7x native — closest the catalog has gotten
  • insert 100k at 2.9x
  • join 100k at 3.0x
  • read 100k at 4.6x — bottleneck is now sqlite-compiled-to- wasm vs sqlite-native plus the wasi shim cost per page; the memvfs experiment (commit 4f83e82) confirmed those calls are already absorbed by the OS file cache on macOS, so further read-side wins need either lower-level wasmtime changes or a workload that pushes past the OS cache.

Precompilation the big startup win

The .wasm .cwasm swap is the headline finding of this session. Wasmtime's Engine::precompile_component AOT-compiles the component to a host-CPU-specific blob; loading via Component::deserialize_file then skips parse + validate + cranelift compile.

Bare SQL throughput:

$ for i in 1..5; do time sqlite-wasm-run cli.wasm <<<'SELECT 1;'; done
~370 ms wall-clock per invocation
$ for i in 1..5; do time sqlite-wasm-run cli.cwasm <<<'SELECT 1;'; done
~10 ms wall-clock per invocation

~37x reduction in startup overhead. From make bench above:

  • 1k-row workloads (startup-dominated): from 28-80x 3-8x
  • 100k-row workloads (steady-state): from 5.5-9x 3.9-5.8x

Cost: the .cwasm blob is ~5x larger (12 MB vs 2.5 MB for the cli) because it embeds native machine code. Not portable across CPU architectures or wasmtime versions must be regenerated on each machine after upgrades. make precompile-cli is the one- liner; depends on the .wasm and the host binary, so it re-runs automatically.

What this tells us

Steady-state overhead is ~4-6x with precompilation, 5-9x without. At 100k rows the ratio converges to a small single- digit multiple. That's the actual cost of the wasm component model + wasmtime instantiation + wasi shims for file I/O. The 1k-row numbers are startup-dominated every .wasm workload paid ~370 ms before the first row was inserted. Precompiling to .cwasm cuts that to ~10 ms.

Read is the closest to native. At 100k rows read lands at 5.5x the B-tree traversal is sqlite3 code that wasm doesn't slow down meaningfully; only the wasi shim cost on each sqlite3_step() boundary shows up.

Insert is around 9x. Every INSERT crosses the wasi boundary for the page-write fsync path. Larger transactions amortize this (insert-wal at 100k is 7.8x; insert without WAL is 9.0x).

WAL costs nothing extra at single-writer scale. The 1k-row insert vs insert-wal numbers are noisy at this size (39x vs 79x, but with 8-12 ms native variance the wasm numbers swing around too). At 10k+ rows the difference is in single-digit ms range on the native side and well within noise on the wasm side. WAL exists for concurrent readers; this single-writer harness can't exercise that.

The constant overhead is real and significant. ~400 ms per invocation is paid up front. That's wasmtime fast-path instantiation, component-model module wiring, and the wasi preopen / argv setup. For an interactive cli session this is paid once (the cli stays running); for batch scripts that invoke sqlite-wasm-run per statement, this dominates anything smaller than a few thousand rows.

WIT extension boundary cost measured

The ext-scalar workload loads the sha3 extension and runs SELECT sum(length(sha3_256(name))) FROM t every row crosses the canonical ABI. builtin-scalar is the same shape but with length(name), a sqlite builtin compiled into the wasm cli with no inter-component crossing. The delta is the WIT cost.

Measured on .cwasm at three sizes (median of 3 trials):

Workload1k10k100k
builtin-scalar54 ms99 ms583 ms
ext-scalar150 ms222 ms955 ms

Marginal per-row cost (subtract the smaller size from the larger):

  • Builtin scalar (length): 5.4 µs/row wasm overhead vs native
  • WIT scalar (sha3_256): 8.1 µs/row wasm + WIT + sha3 work
  • Delta = ~2.7 µs per WIT boundary crossing

That number is sha3-with-empty-payload work plus the canonical ABI cost. The sha3 portion is fast (tens of ns); most of the 2.7 µs is the WIT crossing: serialize args cross-store call deserialize result. For comparison, a native sqlite scalar dispatch is ~100 ns the WIT crossing is ~25-30x slower per call than native.

What that means for real workloads:

  • 100k rows ~270 ms of pure WIT overhead. Not catastrophic at this scale, but it dominates for tight scalar loops.
  • The cost is FIXED PER CALL payload size barely matters.
  • The embed path eliminates this for any extension compiled in at build time: a 100k-row SELECT sha3_256(name) drops from 951 ms (WIT) to 679 ms (embedded) exactly the 272 ms predicted. See PLAN-embed-extensions.md for the user-facing tool (sqlite-wasm-run compose --embed NAME[,NAME...]).

Page-size tuning measured

Two outcomes for two workload shapes (measured on .cwasm):

Bulk insert in a single BEGIN/COMMIT (100k rows):

nativewasm
page_size=4096 (default)138 ms572 ms
page_size=16384 + 200MB cache135 ms548 ms

Effectively a wash. SQLite already batches writes within a transaction so per-page wasi calls aren't the bottleneck.

Auto-commit per row (500 rows, no BEGIN/COMMIT):

nativewasm
page_size=4096 (default)1.94 s18.6 s
page_size=163841.23 s18.7 s

Native got 37% faster. Wasm got nothing.

The diagnosis is in the asymmetry. Native fsync per commit is cheap (macOS is essentially a no-op for HFS+/APFS), so the per-byte cost matters and bigger pages cut total bytes written. Wasm goes through wasmtime wasi host fsync per commit, and the per-call overhead dominates everything else cutting bytes per call doesn't help. The lesson: for our wasm runtime, page_size doesn't move the needle. Tell users to batch their inserts in a transaction; the page_size lever is a native-side trick.

(The cli-smokes/page_size smoke still ships, to prove the pragma reaches wasivfs end-to-end.)

What this still doesn't measure

  • Cold startup every trial here gets a freshly-instantiated wasm. Wasmtime caches compiled modules to disk; repeated runs in the same shell are faster than these numbers suggest.
  • vec0 sqlite-vec parity the catalog claim that vec0 is competitive with sqlite-vec needs a side-by-side KNN bench. Punt for now; needs the extension running through the wasm cli AND a native sqlite3 with sqlite-vec loaded, which is more setup than this scaffolding covers.
  • Concurrent readers under WAL single-process single- threaded WASI can't exercise this. Would need a multi- process driver that spawns several cli readers against the same on-disk db.

Wins this run unlocked

FindingImplication
Precompile drops startup 370 ms 10 ms (37x).The biggest single improvement. Small-workload ratios drop from 50-80x to 5-8x. README-quotable.
Steady-state overhead with .cwasm: 4-6xQuotable. "Wasm cli runs SQLite at ~4-6x cost of native for non-trivial workloads, single-digit multiple."
insert-wal matches insertThe WAL unlock (libsqlite3-sys 0.30 0.38) doesn't add overhead for single-writer workloads it's purely a capability addition.
100k rows in 563 ms (.cwasm)Real production-size workloads complete in half a second; the wasm cli is usable, not a toy.
read lands at 5.5-6xThe read path is the closest to native, because most of the time is in sqlite3's B-tree walk (compiled to wasm just like everything else) the WIT call boundary doesn't dominate.
WIT boundary = ~2.7 µs/call (~25-30x native).This is the cost the composed-cli design would eliminate for embedded extensions. At 100k-row tight scalar loops, it's ~270 ms of pure boundary overhead.
page_size tuning is a native trick.Bigger pages save 37% on native auto-commit workloads but nothing in wasm per-wasi-call overhead dominates byte volume. Tell users to batch in transactions, not tune page_size.

Follow-up items

  • Add a wasm column for wasmtime serve-style persistent process (one instantiation, many invocations) to isolate startup amortization.
  • Add an extension-overhead workload (e.g. SELECT sha3(name) FROM t vs a native shathree.so load) to measure the WIT boundary cost.
  • Compare cold vs warm wasmtime cache to show what the component-cache buys.
  • Run on Linux x86_64 too. Apple Silicon results are reproducible here but might not translate.