Multi-threaded Scaling

The ONNX Runtime supports multi-threading and can take advantage of multiple CPU cores.

Using multiple threads on a multi-core CPU can reduce the latency of creating a vector for most embedding models. It can also increase throughput by creating vectors for multiple requests in parallel. The ONNX Runtime automatically sizes its thread pools for intra-op parallelism (threads working within a single operator) and inter-op parallelism (threads running independent operators concurrently) based on your workload.
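As a rough illustration of request-level parallelism, the following Python sketch fans a batch of embedding requests out over a thread pool. It is not the Embedding Service's implementation: `embed_text` and `embed_batch` are hypothetical names, and `embed_text` is a toy stand-in for a real model call. With a pure-Python function like this one, threads mainly illustrate the pattern; a throughput gain would be expected only when the real call does its work in native code that releases Python's global interpreter lock.

```python
from __future__ import annotations

import os
from concurrent.futures import ThreadPoolExecutor


def embed_text(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding call (not a real model)."""
    # Toy deterministic "vector" built from character codes.
    codes = [ord(c) for c in text] or [0]
    return [float(len(text)), float(sum(codes)), float(max(codes))]


def embed_batch(texts: list[str], max_workers: int | None = None) -> list[list[float]]:
    """Create vectors for several requests concurrently, preserving input order."""
    # Assumption: default to one worker per logical CPU reported by the OS.
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(embed_text, texts))


if __name__ == "__main__":
    vectors = embed_batch(["hello", "multi-threaded scaling", ""])
    print(len(vectors), "vectors created")
```

Capping `max_workers` at or below the number of available cores is a reasonable starting point in a sketch like this, since the runtime's own thread pools also compete for the same cores.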
