Handling CPU-heavy tasks in NodeJS

Introduction

Over the past months, I have been working professionally on several NestJS backend projects. While my first impression of the NodeJS ecosystem was initially mixed, using TypeScript (along with well-maintained libraries such as NestJS) makes it possible to write quality software for the NodeJS platform (assuming, of course, you are willing to invest time in design, architecture and best practices ^^). TypeScript's structural typing is very flexible and allows writing maintainable code, while being quite fast to type-check, which provides a pleasant, quick feedback loop. Furthermore, type-level TypeScript opens the avenue for a lot of advanced use cases, and it is a very interesting subject that would deserve its own article.

To thread or not to thread

One major peculiarity of NodeJS is that it is based on a single-threaded event loop model. This means that even though the runtime (libuv) can leverage threads under the hood, your own business JS code runs on a single thread.

Fortunately, by leveraging async IO and an event loop (aka the reactor pattern), a single thread is able to handle many concurrent async tasks. By employing async/await (imperative style) or Promises (functional style), we can improve the ergonomics of working with the event loop.
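
As a tiny illustration (the db object below is just a stand-in for any async IO client, not a real library):

async function handler(db: { query: (sql: string) => Promise<unknown[]> }): Promise<number> {
  // While this handler awaits the async IO, the event loop is free to run
  // other callbacks; only the synchronous code between awaits blocks it.
  const rows = await db.query('SELECT 1'); // non-blocking async IO
  return rows.length;                      // tiny bit of blocking CPU work
}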

(As a side note, the rise of reactive libraries, then of coroutine / continuation / async systems, in other ecosystems over the past decade (Java, Scala, Kotlin, Go, Rust, ...) is a good indication of the adoption of async IO as an industry standard for highly scalable backends.)

If you are not familiar with the event-loop, there are plenty of articles on the net discussing the subject, and I encourage you to read one of them before diving into this article.

An often understated benefit of the single-threaded approach is simplicity. Having a single thread eliminates a lot of low-level race conditions (races are still possible at the logical level, though) and lets you focus on the business logic, without having to mess with locks, semaphores, atomics and heisenbugs.

My experience with other ecosystems where the usage of threads is liberal, such as the JVM, is that codebases are frequently riddled with potential races: hard-to-debug issues waiting to happen in production. Writing correct multithreaded code is hard and should not be taken lightly. However, it is usually the price to pay for highly performant code.

NodeJS is appropriate for tasks involving orchestration of IO, with lightweight business logic on top. So for a classical backend, where most of the time is spent waiting for the DB to execute your query, and where you want to iterate quickly on your business logic, NodeJS will be a very good fit. And unless you have very CPU-intensive tasks, or a very heavy load where you want to get the most out of your hardware, NodeJS with TypeScript can get you very far (quickly).

In this article, we will examine the case of long-running, CPU-intensive tasks in NodeJS, and discuss the strategies we can employ to cope with the single-threaded design.

Setup

For our use-case, we will bootstrap a simple NestJS application using nx.

npx create-nx-workspace@latest --name=nest-worker --preset=nest

The code for this article is available in this repository

To simulate a CPU-bound task, we will use the following code, which approximates Pi using a Monte Carlo approach.

export function computation(nPoints = 1_000_000): number {
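  // Sample random points in the unit square: the fraction that falls inside
  // the quarter circle (x^2 + y^2 < 1) tends to pi / 4.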
  let n = 0;
  for (let i = 0; i < nPoints; i++) {
    const x = Math.random();
    const y = Math.random();
    if (x ** 2 + y ** 2 < 1) {
      n += 1;
    }
  }
  return (4 * n) / nPoints;
}

(Please don't use this method at home, it is one of the least effective ways to approximate Pi ^^)

Let's create a simple controller to expose our computation, along with a health check endpoint.

@Controller()
export class AppController {
  @Get('computation')
  async getData() {
    return { data: computation() };
  }

  @Get('health')
  healthCheck() {
    return { status: 'healthy' };
  }
}

Let's run the server and fire autocannon using a small helper script (see the repository for the full script source code).

npx nx build nest-worker
node dist/apps/nest-worker/main.js
./bench.mjs
$ mkdir -p benches
$ autocannon -n -j http://localhost:3000/api/computation > 'benches/computation.json'
$ autocannon -n -j -R1 http://localhost:3000/api/health >  'benches/health.json'

┌──────────────────────────────────────────┬────────┐
│ (index)                                  │ Values │
├──────────────────────────────────────────┼────────┤
│ computation_request_rate_average (req/s) │ 62.3   │
│ computation_request_rate_stddev (req/s)  │ 0.9    │
│ health_latency_average (ms)              │ 113    │
│ health_latency_stddev (ms)               │ 67.47  │
└──────────────────────────────────────────┴────────┘

The script benches our main computation endpoint while pinging the health endpoint to measure its latency over time. Please note this is not intended to be a scientific benchmark; it merely gives us a rough estimate.
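
For reference, a rough equivalent of the helper script could look like this (the version in the repository may differ):

#!/usr/bin/env node
// bench.mjs: bench the computation endpoint while pinging /health at 1 req/s.
import { exec } from 'node:child_process';
import { mkdirSync } from 'node:fs';
import { promisify } from 'node:util';

const sh = promisify(exec);

mkdirSync('benches', { recursive: true });

// Run both autocannon instances concurrently so the health latency is
// measured while the computation endpoint is under load.
await Promise.all([
  sh('autocannon -n -j http://localhost:3000/api/computation > benches/computation.json'),
  sh('autocannon -n -j -R1 http://localhost:3000/api/health > benches/health.json'),
]);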

On my machine (an M1 Pro MacBook), with 1 million points per simulation run, we are able to service around 60 requests per second. When we try to reach the /api/health endpoint during the test, we can see it takes longer than usual (the baseline for health is sub-millisecond latency).

The event loop is pegged by our CPU task and struggles to answer the unrelated health endpoint. The more points we add to our simulation, the more obvious this problem becomes. As a personal anecdote, I once had an issue in production where a method unexpectedly consumed a lot of CPU and caused the health check to become unresponsive, which in turn made the orchestrator (Kubernetes) kill the service.

Using cluster module

The first approach you should try to improve performance on a multicore system is the cluster module. The cluster module forks your node process (usually the number of processes matches the number of CPU cores) and allows a single network port to be load-balanced across the forks. A basic integration looks like this:

import cluster from 'cluster';
import { bootstrap } from './bootstrap';

if (cluster.isPrimary) {
  console.log(`Primary ${process.pid} is running`);

  // Fork workers.
  for (let i = 0; i < 8; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} died`);
  });
} else {
  bootstrap();
}
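
Here the number of forks is hardcoded to 8 to match my machine; a more portable variant derives the count from the host at runtime, for example:

import * as os from 'os';

// Fork one worker per available core instead of hardcoding the count.
const nWorkers = os.availableParallelism?.() ?? os.cpus().length;
for (let i = 0; i < nWorkers; i++) {
  cluster.fork();
}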

Let's throw another salvo of requests using autocannon:

./bench.mjs
$ mkdir -p benches
$ autocannon -n -j http://localhost:3000/api/computation > 'benches/computation.json'
$ autocannon -n -j -R1 http://localhost:3000/api/health >  'benches/health.json'
┌──────────────────────────────────────────┬────────┐
│ (index)                                  │ Values │
├──────────────────────────────────────────┼────────┤
│ computation_request_rate_average (req/s) │ 444.3  │
│ computation_request_rate_stddev (req/s)  │ 11.07  │
│ health_latency_average (ms)              │ 5.87   │
│ health_latency_stddev (ms)               │ 3.58   │
└──────────────────────────────────────────┴────────┘

We can see that the average throughput has improved a lot, from ~60 to ~450 requests per second (not far from the theoretical 8x improvement on my 8-core machine, but probably capped by Amdahl's law).

A downside is that, by having more processes, our memory footprint is slightly higher than with the single-process version. Also, while the /api/health endpoint is much faster, there is still some unwanted latency: even though the load is balanced between the forks, at some point every fork's event loop gets pegged by the CPU-bound computation.

Using worker threads

A limit of our previous approach is that we have "replicated" the event loop in each process, but each loop can still get stuck in a long CPU-bound task. Another approach is to offload the CPU task from the main loop into a separate worker thread. This way, our main server loop stays responsive and can quickly answer every endpoint, such as the health check.

At a lower level, we can use the worker_threads module to spawn individual threads from a standalone script. A worker thread has its own system thread, its own event loop and its own isolated heap. To launch a worker thread, we need to provide it with its own standalone JS script.
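
As an illustration, a minimal hand-rolled version (spawning one worker per call, which, as we will see, is precisely what we want to avoid) could look like the sketch below; the file layout is hypothetical, and we will use workerpool instead later on:

// Main thread: spawn a worker running a standalone script and await its result.
import { Worker } from 'node:worker_threads';
import * as path from 'path';

export function computeInWorker(): Promise<number> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(path.join(__dirname, 'worker.js'));
    worker.once('message', (result: number) => {
      resolve(result);
      void worker.terminate();
    });
    worker.once('error', reject);
  });
}

// The worker script side would simply compute and post the result back:
// import { parentPort } from 'node:worker_threads';
// import { computation } from './computation/computation';
// parentPort?.postMessage(computation());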

By default, our nx build will bundle our server into dist/apps/nest-worker/main.js. In order to transpile an additional script for our worker thread, we will tweak the build target in project.json to include an additional entry point:

{
  "targets": {
    "build": {
      "executor": "@nx/webpack:webpack",
      <...>
      "options": {
        <...>
        "additionalEntryPoints": [
          {
            "entryName": "worker",
            "entryPath": "apps/nest-worker/src/worker.ts"
          }
        ]
      }
    }
  }
}

Now, webpack will build worker.js co-located with main.js.

In order to make effective use of worker_threads, we should take the following considerations into account:

  • Spawning a thread is costly, and we want to avoid doing it on each request.
  • To exploit all the available CPU cores, we need to spawn a thread per core and balance the load across those threads.
  • We need to set up and orchestrate communication between the main thread and the worker threads. Since each thread has its own isolated memory, transferring an object may require a copy and be potentially costly, unless we transfer the ownership of the object instead (see transfer considerations; a small sketch follows this list).
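
To illustrate the last point, here is a small sketch of copy versus transfer with the low-level worker_threads API (the worker script path is hypothetical):

import { Worker } from 'node:worker_threads';

const worker = new Worker('./worker.js'); // hypothetical worker script
const buffer = new ArrayBuffer(8 * 1024 * 1024);

// worker.postMessage({ buffer }) would copy the 8 MB via structured clone.
// Listing the buffer as transferable moves its ownership instead: no copy,
// but `buffer` becomes detached (unusable) in the main thread afterwards.
worker.postMessage({ buffer }, [buffer]);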

In order to address those points and reduce the boilerplate, we will use the workerpool package.

import { Injectable } from '@nestjs/common';
import * as path from 'path';
import { pool } from 'workerpool';

// Alias for the pool handle returned by workerpool's pool() factory.
type WorkerPool = ReturnType<typeof pool>;

@Injectable()
export class WorkerLauncherService {
  private workerPool: WorkerPool;

  constructor() {
    this.workerPool = pool(path.join(__dirname, 'worker.js'), {
      maxWorkers: 8,
      workerType: 'thread',
      workerThreadOpts: {},
    });
  }

  async computation() {
    return await this.workerPool.exec<() => number>('computation', null);
  }
}
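
The benchmark below hits an /api/worker route; a minimal controller wired to this service could look like the following (class and file names are illustrative, the repository may differ):

import { Controller, Get } from '@nestjs/common';
import { WorkerLauncherService } from './worker-launcher.service';

@Controller()
export class WorkerController {
  constructor(private readonly workerLauncher: WorkerLauncherService) {}

  // Delegates the heavy computation to the worker pool, keeping the main
  // event loop free to serve other requests such as /api/health.
  @Get('worker')
  async getWorkerComputation() {
    return { data: await this.workerLauncher.computation() };
  }
}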

In worker.ts, we will reuse and expose our computation function:

import { worker } from 'workerpool';
import { computation } from './computation/computation';

async function bootstrap() {
  worker({
    computation: () => computation(),
  });
}

bootstrap();

Let's run the benchmark one last time:

./bench.mjs
$ mkdir -p benches
$ autocannon -n -j http://localhost:3000/api/worker > 'benches/computation.json'
$ autocannon -n -j -R1 http://localhost:3000/api/health >  'benches/health.json'
┌──────────────────────────────────────────┬────────┐
│ (index)                                  │ Values │
├──────────────────────────────────────────┼────────┤
│ computation_request_rate_average (req/s) │ 464.3  │
│ computation_request_rate_stddev (req/s)  │ 2.87   │
│ health_latency_average (ms)              │ 0.5    │
│ health_latency_stddev (ms)               │ 0.96   │
└──────────────────────────────────────────┴────────┘

We can see that our average rate remains mostly the same, but our health check latency is back in the sub-millisecond range, which is what we expected. Furthermore, we can tweak the memory usage of our worker threads when constructing the pool, so we can keep a reduced footprint.
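
For instance, since workerThreadOpts is handed to the underlying worker_threads Worker, one option is to cap each worker's V8 heap through the standard resourceLimits option (the 64 MB figure below is purely illustrative):

this.workerPool = pool(path.join(__dirname, 'worker.js'), {
  maxWorkers: 8,
  workerType: 'thread',
  workerThreadOpts: {
    // resourceLimits is a standard worker_threads option (values in MB).
    resourceLimits: { maxOldGenerationSizeMb: 64 },
  },
});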

Conclusion

In this article, we have seen the pros and cons of the event-loop approach, and some ways to mitigate CPU-heavy workloads. By using the cluster module, we can scale our service across all available cores, and by using worker threads we can offload heavy work off the main event loop. It is possible to combine both approaches for greater scalability (at the cost of a higher memory footprint).

In a future article, we will investigate rewriting the CPU-intensive parts of our app as a native module, using Rust and the Neon bindings. Stay tuned for more!