stream-json and the side quests

11 May, 2026 · 11 min read

How a slow first cut, a surprising benchmark, and a tour through V8’s regex engine landed five small libraries on npm that still earn their keep a decade later.


Pre-history: parsers

It just so happened that during my professional career I wrote a few parsers. At some point I realized that it is usually possible to write a streaming parser: one that reads a text stream and produces a stream of tokens without ever holding the whole file in memory. Yet for some reason I couldn’t find any generic streaming parsers available.

Another point was that it makes sense to use a regular expression engine as a lexer. The rationale is simple: at that time I worked mostly with interpreted languages (Python, JavaScript), and interpreted languages are relatively slow, while regular expression engines are usually written in C/C++ and highly optimized. Regular expressions are not some ad hoc code — they have a solid theory behind them. While Donald Knuth famously wrote: “I define UNIX as 30 definitions of regular expressions living under one roof” in his Digital Typography, ch. 33, p. 649 (1999), Stephen Kleene defined them, Ken Thompson implemented them, and Alfred Aho devised efficient algorithms.

So I was confident in regular expression engines, but less so in interpreted languages.

Of course there is this:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Attributed to Jamie Zawinski (see history).

Working with interpreted languages, I had learned to rely on built-in libraries and compiled code for speed.

So I wrote parser-toolkit — a library that took a grammar in some form, generated a regex-based lexer and a parser from it, and processed streaming text with values of unbounded size (e.g. strings). Obviously only certain grammar types were possible.
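
To make the idea concrete, here is a tiny made-up sketch of a regex-driven lexer (this is my illustration, not parser-toolkit’s actual API): one regular expression with an alternative per token type does all the character scanning inside the engine, and the interpreted code merely dispatches on which group matched.

```js
// Hypothetical regex-driven lexer: one alternation per token type.
// The character scanning happens inside the regex engine, not in JS.
const tokenRe = /(\s+)|(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|("(?:[^"\\]|\\.)*")|(.)/g;

function* lex(text) {
  tokenRe.lastIndex = 0;
  let match;
  while ((match = tokenRe.exec(text)) !== null) {
    if (match[1]) continue; // skip whitespace
    if (match[2]) yield {type: 'number', value: match[2]};
    else if (match[3]) yield {type: 'id', value: match[3]};
    else if (match[4]) yield {type: 'string', value: match[4]};
    else yield {type: 'punct', value: match[5]};
  }
}

console.log([...lex('x = 3.14 + "hi"')]);
```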

Work

During my work I spent a lot of effort processing huge files for my clients. These files frequently exceeded the memory of the computers available back then, so for XML files I used a SAX parser. But over time I was getting more and more JSON files: database dumps, records accumulated over time, data generated programmatically from some original raw data, and so on.

JSON.parse() was useless because it loaded everything into memory. npm didn’t help much either: the few packages that were supposed to solve the problem typically had bugs and were slow. For example, one package looked right, yet didn’t respect backpressure (important for streams) and actually read the file into memory, stuffing buffers without checking whether they were consumed.

First cut

So I remembered my previous experience and wrote stream-json. The first release shipped in August 2013, and unfortunately it was barely usable: performance just wasn’t there. If your file has 1,000,000 records and processing each record takes 1ms (not a lot), you will wait 1,000s — about 17m, not counting any useful work on each record. 100,000,000 records (pretty normal) — 1d 4h. Now imagine that your code has a bug and you learn about it a day in, and have to restart. Then it happens repeatedly.

Obviously we need to make per-record processing as fast as possible. The actual processing is business logic, which is pretty much out of our hands. And it is frequently asynchronous, so stream backpressure must be observed. But everything else — reading files, lexing, parsing, and generic preprocessing such as converting tokens into sub-objects — is on us.

It turned out my code was slow. Time was lost in interpreting the grammar and in regular expressions. I could do something about the former, but the latter looked impossible to fix. Or so I thought.

Regular expressions

There is a way

Then at some point I found an article where the author was talking exactly about this — about how slow regular expressions are in JS. He even implemented a JSON parser (!) in pure JS (!) that, he claimed, was faster than a parser built on regular expressions. He even published the code: he literally looped over characters in JS. I couldn’t believe it!

So I wrote a non-streaming JSON parser using my principle (do everything we reasonably can in regular expressions) and … yes, his code was faster! So I wrote yet another parser using his principles — different code, but similarly fast. By mid-2015, stream-json’s parser had been rewritten the same way, character by character, with no regex anywhere.
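
For a sense of what “looping over characters” looks like, here is a minimal illustration (mine, not the actual stream-json code) of scanning a JSON-style number one character at a time:

```js
// Illustrative only: scan a JSON-style number starting at position i,
// advancing the cursor one character at a time.
function scanNumber(text, i) {
  const start = i;
  if (text[i] === '-') ++i;
  while (i < text.length && text[i] >= '0' && text[i] <= '9') ++i;
  if (text[i] === '.') {
    ++i;
    while (i < text.length && text[i] >= '0' && text[i] <= '9') ++i;
  }
  return {token: text.slice(start, i), next: i};
}

console.log(scanNumber('"a": 12.5,', 5)); // { token: '12.5', next: 9 }
```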

What’s going on? How can it be?

Regular expression engines

I dug into V8’s code and looked at the regular expression implementation. It turned out V8 was using JSCRE, a PCRE-derived library. Looking at PCRE I learned that it didn’t use much of the smarts afforded by the theory of regular expressions — it was written with mostly naive algorithms. Big shock. (Russ Cox has the canonical write-up: “Regular Expression Matching Can Be Simple And Fast”.)

Obviously there should be a better way, and so there was: I found Google’s RE2 library, written by Russ Cox. It was written more or less by the book, with all the right incantations like DFA and NFA. And it solved another problem that bothered me: ReDoS, which strikes when your program runs in the wild and the input is not sanitized in any way.

So I started a new project in late 2014: node-re2 AKA re2 on npm. The idea was to take Google’s RE2 and wrap it in an API compatible with RegExp, so I could use both RegExp and RE2 in stream-json. That way users would have a choice — no one was forced to take a binary add-on, but if they wanted the performance boost they could opt in.
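
Swapping one for the other is meant to be a one-line change; a sketch of typical usage (the pattern and text are made up):

```js
// RE2 objects mirror the RegExp API: exec, test, flags, lastIndex, etc.
const RE2 = require('re2');

const re = new RE2('(\\w+)@(\\w+\\.\\w+)', 'g');
const text = 'write to alice@example.com or bob@example.org';

let match;
while ((match = re.exec(text)) !== null) {
  console.log(match[1], match[2]); // alice example.com, then bob example.org
}
```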

Regular expressions revisited

I had to implement a lot of RegExp functionality that wasn’t provided by the underlying library. By the time I finished fleshing out RE2 I re-ran my tests — and native RegExp was now faster than both RE2 and the pure-JS parser. What had changed? V8 had switched regex engines: it dropped JSCRE (an interpreter) and replaced it with Irregexp, a JIT-based engine. By June 2018, stream-json’s parser was rewritten once more to lean on sticky regex (/.../y) — the third major incarnation of the same code.
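
The point of the sticky flag is that /.../y matches only at lastIndex, so several regexes can share one cursor over the same chunk of input. A simplified illustration (not the real stream-json tokenizer):

```js
// A sticky regex (the y flag) matches only at lastIndex, which turns
// a handful of regexes into a lexer that never rescans the input.
const ws = /[ \t\r\n]+/y;
const number = /-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?/y;
const string = /"(?:[^"\\]|\\.)*"/y;

function nextToken(text, pos) {
  for (const [type, re] of [['ws', ws], ['number', number], ['string', string]]) {
    re.lastIndex = pos;            // point the regex at the cursor
    const m = re.exec(text);
    if (m) return {type, value: m[0], next: re.lastIndex};
  }
  return null;                     // punctuation etc. would be handled here
}

let pos = 0, t;
while ((t = nextToken(' 42 "a\\"b"', pos)) !== null) {
  if (t.type !== 'ws') console.log(t.type, t.value);
  pos = t.next;
}
```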

Yet the new engine shares the key weakness of the PCRE family: it backtracks, so it is still susceptible to ReDoS. So I still support RE2.

The cost of going native

Fixing the regex problem with a C++ binding solved one problem and created another. Every npm install re2 shells out to node-gyp, finds (or fails to find) a C++ toolchain, Python, the right Node headers, and tries to compile from source. On many machines that “just works”; on plenty of others — CI containers without make, Windows boxes without the Build Tools, Alpine Docker images with musl, small servers that can run a compiled binary but lack the CPU, memory, or disk to build one — it doesn’t. Users get a wall of compile errors before the package has ever run.

The fix is unglamorous: ship pre-built binaries for the common platforms, fetch the right one in a postinstall script, and fall through to source compile only when none of them match. I cleaned up that pattern as a separate library so other native-addon authors could reuse it: install-artifact-from-github — a zero-dependency helper that checks the cache, fetches an artifact from your GitHub releases, unpacks it, and gets out of the way. Same principle as stream-chain — small tool, narrow scope, long shelf life.

I inspected the available solutions and found them lacking — too big, too inflexible. And when you fetch code from the web, you have to worry about security. A lot. I, as a person, cannot provide reliable, safe hosting of binary artifacts, nor run a build farm to produce new versions, nor cover the matrix of OS, CPU architecture, libc, and Node ABI that a binary add-on spans. So I decided to reuse GitHub: we trust it with the source we compile — we can trust it with the binaries we don’t. A build farm? GitHub Actions on different runners. No matching runner? GitHub Actions supports Docker images. Attestations? GitHub has those too — see node-re2’s attestations. And, obviously, I added compression options to save users network bytes and download time ⇒ faster install.

The biggest surprise was where the hard part wasn’t. I had expected selecting and unpacking the right binary on each user’s machine — matching OS, CPU architecture, libc, and Node ABI — to be a swamp. With GitHub carrying the trust, the hosting, and the build farm, the client side reduced to a tiny lookup over process.platform, process.arch, and the Node ABI, followed by an unpack. The whole project is basically one small JS file — easy to write and debug, easy to read and to audit for security.
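
Roughly what that lookup amounts to, as an illustration (this is not install-artifact-from-github’s actual code or naming scheme; the owner, repo, and version are placeholders):

```js
// Illustrative only: derive an artifact name from the running platform.
// A real helper also verifies the download and falls back to node-gyp.
const platform = process.platform;     // 'linux', 'darwin', 'win32', ...
const arch = process.arch;             // 'x64', 'arm64', ...
const abi = process.versions.modules;  // Node ABI (NODE_MODULE_VERSION)

const artifact = `addon-${platform}-${arch}-node${abi}.node.gz`;
const url = `https://github.com/OWNER/REPO/releases/download/vX.Y.Z/${artifact}`;

console.log('would fetch', url);
```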

Parsing JSON is hard

I already mentioned the numerous problems with available JSON streaming solutions: they were slow, they violated streaming rules (e.g. for “performance reasons” they simulated stream APIs without supporting the correct semantics), and so on. What I didn’t expect were actual problems with parsing itself.

For example, some projects assumed that a string had a certain maximum size, or that it could always fit in memory — some of my strings ran to gigabytes; they were huge CSV files stored in database columns. Most assumed that a number was short, e.g. 28 characters tops, which was not always the case — some programs generated long numbers padded with leading zeros, or values like "0.0(...a lot of zeros...)0123". I know it is effectively zero, yet it broke restrictive parsers.

The JSON spec lists no size restrictions. Moral of the story: individual values (strings and numbers) should be streamed and assembled into actual values down the line. It is not a parser’s job to interpret values — users should have a say in that if they want to.
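
With stream-json that means consuming the token stream directly: a multi-gigabyte string arrives as a series of chunk tokens instead of one value. A sketch, assuming the parser options and token names described in stream-json’s documentation (packValues, streamValues, stringChunk):

```js
const fs = require('fs');
const {chain} = require('stream-chain');
const {parser} = require('stream-json');

// Keep values as chunk tokens rather than packing them into full strings,
// so even a multi-gigabyte string value never sits in memory at once.
const pipeline = chain([
  fs.createReadStream('huge.json'),
  parser({packValues: false, streamValues: true}),
  token => (token.name === 'stringChunk' ? token.value.length : null)
]);

let chars = 0;
pipeline.on('data', n => (chars += n));
pipeline.on('end', () => console.log('characters inside strings:', chars));
```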

CSV is hard

Yes, CSV. What can go wrong with such a simple format? All values are strings; they are comma-separated (yes, they can be tab-separated or whatever — doesn’t change anything). Just read a line, split it on the delimiter, and there you have it. Yep, that’s what most of the available packages did.

Not really. What if a string has a delimiter inside? OK, it should be quoted. What if it has a newline in it? There goes “just read a line and split it”. CSV is not that simple and, just like JSON, there is a standard that defines how to handle all those problems: RFC 4180. That is exactly what stream-csv-as-json implements, reusing stream-json’s token machinery so the output looks the same to downstream code regardless of whether the source was JSON or CSV.
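
One RFC 4180 record is enough to break “read a line, split on commas”. The snippet below is a header row plus a single record with three fields; the middle field contains a comma, an escaped quote (doubled), and a line break:

```
id,comment,total
42,"Says ""hi, there""
and keeps going on a second line",19.99
```

A line-based splitter sees three lines and the wrong number of columns; a conforming parser sees one record.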

Streaming data

It turned out that streaming data calls for a whole suite of helpers:

  • Frequently data is organized as a huge array of items. The array is too big to read whole, yet individual items fit in memory.
    • A similar scheme applies when the top level is a huge dictionary instead of an array.
  • Another riff on “huge array of items”: JSONL.
  • Tools to edit a stream of tokens — remove/ignore items we don’t want, keep only the ones we need.
  • Use different parsers to produce tokens. For example, CSV.
  • Serialize tokens back to JSON.
  • Use objects as a stream of tokens without parsing text.
  • Provide a controlled way to assemble objects from tokens.
  • Many more.

All non-JSON-specific things were split off into stream-chain, which is now the foundation of stream-json. It builds pipelines out of streaming components and functions, and can optimize pipelines to make them more efficient. stream-json itself concentrates on JSON and token streams.
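
Put together, a typical pipeline looks roughly like this (adapted from the pattern in stream-json’s documentation; the file name and the filter are made up):

```js
const fs = require('fs');
const zlib = require('zlib');
const {chain} = require('stream-chain');
const {parser} = require('stream-json');
const {streamArray} = require('stream-json/streamers/StreamArray');

// Read a gzipped file holding one huge JSON array and process it item
// by item; only the current item needs to fit in memory.
const pipeline = chain([
  fs.createReadStream('records.json.gz'),
  zlib.createGunzip(),
  parser(),
  streamArray(),                                   // emits {key, value}
  data => (data.value.active ? data.value : null)  // drop inactive items
]);

let count = 0;
pipeline.on('data', () => ++count);
pipeline.on('end', () => console.log('active records:', count));
```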

Summary

Out of this adventure I got:

  • stream-chain — streaming helpers for creating pipelines out of streams and functions, covering all basic situations.
  • stream-json — a toolkit to work with JSON files in a streaming way. Based on stream-chain.
  • stream-csv-as-json — an adapter for CSV that reuses stream-json for data processing and stream-chain as a foundation.
  • re2 — a drop-in replacement for RegExp resilient against ReDoS. It doesn’t use JIT, but if you have to process raw files from the internet — this is the tool to use.
  • install-artifact-from-github — a postinstall helper that fetches pre-built binaries from GitHub releases instead of compiling from source. Born to take the pain out of npm install re2; useful to any native-addon author.

Today these libraries do real work in the wild: stream-chain and stream-json each pull around 6 million downloads a week from npm, re2 another 2 million, and install-artifact-from-github another 2 million. And maintenance is surprisingly light — across all those years stream-chain has had 12 issues filed total, stream-json 137 (all closed), node-re2 132 (the 5 still open are all enhancement trackers, not bugs), and install-artifact-from-github 4 (all closed). Most are questions or small enhancement suggestions. Picking primitives close to the bone — regular expressions, Node streams, the JSON spec — pays back for a long time. Not bad for a side quest that started with one slow parser.