Simple model service

How I built a simple embedding model service using open-source tools.

2023-07-30

Introduction

I've been working on a side project which involves computing, storing, and retrieving text embeddings. (Isn't everyone these days?)

In this post, I'll share how I built a simple model service to handle the computation part, using open-source tools and only a little bit of my time.

Most online tutorials I've seen in this space have you write all your code in a single script, which is difficult to integrate and maintain. But not on my watch, err, blog! We're going to do this the software engineering way.

The accompanying code is available in this GitHub repository. It's a template repository, so you can use it as a starting point for your own projects. I suggest reading this post before diving into the code.

Embeddings

In order to compute embeddings, you need to perform inference on a model. Model inference is the process of using a trained model to compute outputs given some inputs. In my case, the input is a text string, and the output is a vector of floats.

This output vector, often referred to as an embedding, is the representation of the input text in the model's latent space. Although the input text makes sense to us humans, the model can only understand numbers. The embedding is the model's translation of the text into a set of numbers which captures the text's meaning for the model.

Because embeddings are specific to the model that computed them, you can't compare embeddings from different models to each other. This means that if you want to switch models, you need to re-compute any stored embeddings.
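
To make "compare" concrete: within a single model's embedding space, similarity is usually measured with something like cosine similarity. Here's a minimal, model-free sketch of that calculation (the toy vectors are made up for illustration).

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
print(cosine_similarity([0.1, 0.9, 0.2], [0.2, 0.8, 0.1]))   # close to 1: similar
print(cosine_similarity([0.1, 0.9, 0.2], [0.9, -0.1, 0.4]))  # close to 0: dissimilar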

Transformer models

Transformer models, in particular the BERT model family, are the most popular type of model for computing embeddings. BERT models are bidirectional transformers, which means they consider both the left and right context of each token when computing embeddings. This makes BERT models well-suited for tasks where contextual understanding is important, like sentiment analysis and question answering.

In contrast, GPT models are autoregressive transformers, which means they only consider the left context of each token. This makes GPT models well-suited for tasks involving text generation, like summarisation and translation. GPT models are also often larger than BERT models and are trained on more data, although this is not an inherent characteristic. This can make them better at understanding nuance too.

The original transformer model has an encoder-decoder architecture, which means that it can be used for both encoding and decoding. In the context of embeddings, encoding refers to generating embeddings for a given input text, and decoding refers to generating text from given input tokens. As you might be realising, BERT models are encoder-only, and GPT models are decoder-only.

Model services

The most common way to obtain embeddings these days is to make requests to a hosted model via an API. For example, you can use OpenAI's Embeddings API to make requests to their text-embedding-ada-002 model.
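
For reference, here's roughly what that request looks like with the openai Python package (the pre-1.0 interface that was current at the time of writing; assumes an API key is set in the OPENAI_API_KEY environment variable).

import openai  # reads OPENAI_API_KEY from the environment

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="Hello world!",
)
embedding = response["data"][0]["embedding"]
print(len(embedding))  # 1536 dimensions for text-embedding-ada-002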

OpenAI, and similar services, "serve" their models via a model service. Model serving encompasses the entire process of making a model available for inference. This includes managing the infrastructure, hosting the model, receiving requests, performing inference, and returning responses.

Why build my own?

I've been curious about open-source alternatives for model serving for a while now. Part of this is because, as an ex-Cohere software engineer, I witnessed first-hand how much hard work and brilliance my colleagues put into building a production-grade model service. I wanted to see how easy it would be to build a simpler model service myself, off-loading the difficult parts to open-source tools.

As a bonus, serving my own model means that I don't have to pay for usage while I'm in the early stages of iterating on my project. My entire project can be self-contained, without reliance on external APIs. I can choose to use a model with a more recent knowledge cut-off, which is important for my use case. It also keeps the door open for fine-tuning a custom model with full visibility into the training process.

Building the model service

In this section, I'll walk you through all the steps involved in building the model service. Although I built an embedding model service, you can use a similar approach to build model services for other types of models. I'll also share how to integrate your model service with your app service, and how to deploy both services using Docker Compose.

I chose to use Python for my model service because the bulk of the "work" being done by the service is the model inference. Therefore, Python's machine learning ecosystem is more valuable to me than the marginal performance gains I might have gotten from using a lower-level language. (I'm not overhead-bound.)

Because each service will be deployed in its own container, you can still use your preferred language for your app service. For my side project, I'm using Go. For this tutorial, we'll use Ruby.

Choosing a model

Hugging Face is a popular open-source platform for sharing models, datasets, and applications. They are famous for their Transformers Python package, which makes it easy to use (you guessed it) transformer models.

On Hugging Face's model hub, you can find a wide variety of open-source models. You can filter models by tasks, datasets, licenses, and more.

Hugging Face also provides benchmarks which compare model performance on a variety of tasks. For example, the Massive Text Embedding Benchmark (MTEB) evaluates embedding models on tons of datasets and several tasks. It displays both overall performance and performance on each individual task. This is important because you might care more about some tasks than others.

You should also consider model size, sequence length (or "context length"), and embedding dimensions (or "vector length"). Models with more parameters tend to perform better, but they also require more memory to store. This means they will take longer to download and require more resources to host. (Other factors affect model size, like floating point precision, but I won't go into that here.) Larger sequence lengths allow models to consider more context, but you might not need this for your use case. Larger vector lengths allow models to capture more information in each embedding, but they require more memory to store.

I personally like to skim the paper for a given model to get an idea of how it was trained and what data went into it. I'm not a fan of using models where the weights are publicly available, but the training details are not. All in all, use the model that best fits your needs, not just the one that's at the top of the leaderboard. Better models are being released so frequently that it doesn't make sense for me to prescribe a specific model to use.

At the time of writing, Microsoft's E5 V2 models are pretty good for general-purpose embeddings. For this tutorial, I'll use e5-base-v2, which is the latest medium-sized E5 model. Microsoft also has multi-lingual versions of these models, like multilingual-e5-base.

Take note of the model ID, sequence length, and vector length for your chosen model. For us that's intfloat/e5-base-v2, 512, and 768 respectively. We will need the model ID and sequence length for our model service, and the vector length is important if you want to store the embeddings. Remember to replace these values with your own if you're following along and using a different model.
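
If you want to sanity-check those numbers yourself, one way is to inspect the model's configuration with the Transformers package (a quick sketch; we'll be installing Transformers later anyway).

from transformers import AutoConfig

config = AutoConfig.from_pretrained("intfloat/e5-base-v2")
print(config.max_position_embeddings)  # sequence length: 512
print(config.hidden_size)              # vector length: 768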

Creating our project scaffolding

Initialise a new Git repository for your project. (Or just a regular directory, I'm not your boss.)

Create model/app.py with the following contents.

print("Hello from the model service!")

This simple script will be the placeholder for our model server.

Create compose.yaml with the following contents.

version: '3.8'

services:
  model:
    build:
      context: ./model
      dockerfile: Dockerfile
    command: ["python", "app.py"]
  app:
    image: docker/whalesay:latest
    command: ["cowsay", "Hello from the app service!"]

This is our Docker Compose file, which we will use to deploy our services. It defines two services, model and app. The model service will be built from the Dockerfile in the model directory, and will run the app.py script. For now, we're using a placeholder image and command for the app service.

Create model/Dockerfile with the following contents.

FROM python:3.11

# Set the working directory inside the container
WORKDIR /app

# Install Rust (fixes Python package installation issues)
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Install Python requirements
COPY requirements.txt .
RUN pip install --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy the rest of the files
COPY . .

This Dockerfile specifies the build instructions for our model service image. It uses the official Python image as the base image, installs the requirements, and copies the files into the container. We use relative paths because of the build context, which was set in the Compose file.

We install Rust because some of our Python requirements have native dependencies, and it's common for image builds to fail while compiling those packages if the toolchain isn't available. Installing Rust increases the size of the image, but it avoids issues with packages that build from source.

Create model/requirements.txt, but leave it empty for now.

Your project structure should now look like this.

.
├── compose.yaml
└── model
    ├── Dockerfile
    ├── app.py
    └── requirements.txt

Developing locally

Instead of testing our changes by deploying our services every time, we can test them locally. This requires us to install our model service's requirements on our local machine.

First, make sure you're using the same version of Python as the base image. Use pyenv to install and set local and global Python versions without affecting your system Python version.

pyenv install 3.11.4
pyenv local 3.11.4

Create and activate a virtual environment for our model service. This will allow us to install the requirements for our model service without polluting our global Python installation.

python -m venv venv/model
source venv/model/bin/activate

Create a .gitignore file in your project root with the following contents.

# pyenv version file
.python-version

# Virtual environments
venv/

To actually deploy our services, we will use the docker compose command. Let's walk through the commands we'll use. You should see the output from both services in your terminal when you start them.

# Build the images
docker compose build
# Start the services
# Note: Type ^C (control-C) to stop them
docker compose up
# Build and start the services
docker compose up --build
# Remove the containers
docker compose down

For each of these commands, you can add the service name to the end to run the command on a specific service. Read the Compose docs if you're interested in learning more.

I recommend using Postman to send requests to the model service, once we have one. However, good old cURL will work too.

Inference boilerplate

Okay, now it's time to actually write the inference code. Your model's documentation will usually tell you what code you need to use it. In my case, it looked something like this.

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-base-v2')
model = AutoModel.from_pretrained('intfloat/e5-base-v2')

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def get_embedding(input_text: str) -> list[float]:
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = average_pool(outputs.last_hidden_state, inputs['attention_mask'])
        embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings[0].tolist()

Do I understand what this code is doing? Yes actually, but not well enough to explain it simply. But that's okay! This is just boilerplate for us.

Replace the contents of model/app.py with the above code. Add the following to the bottom so we can make sure it's really working.

print(get_embedding("Hello world!"))

Add the requirements to model/requirements.txt.

torch
transformers

Build the model service and run it. You should see a vector of 768 floats printed to your terminal. Congrats, you've successfully run model inference!

docker compose up --build model

Now, try running both services. You should see both of their outputs in the terminal, but do you notice anything interesting?

docker compose up

The model service takes a long time to start, so the app service outputs its message first. But in a real-world scenario, our app service should wait for the model service to be ready before it starts. Otherwise, the app service might unknowingly send doomed requests to the unavailable model service. We'll fix this later.

Model-specific adjustments

Remember to always check the details of the model you're using. Depending on how a model was trained, you might need to make some changes to the inference code.

For example, you're supposed to prefix input texts with either "query: " or "passage: " when using Microsoft's E5 V2 models. I didn't integrate this into the tutorial code to keep it general, but you should do it if you're using these models.

Even if it's not mentioned in the model card, it's generally a good idea to format your input data the same way that the model was trained. This is also true for other types of models. For example, some GPT models have been trained with specific tags to denote special texts, like system prompts.
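
To show what that prefixing might look like in practice, here's a small sketch using the get_embedding function from our boilerplate. The with_e5_prefix helper is just for illustration; only the "query: " and "passage: " prefixes themselves come from the E5 model card.

def with_e5_prefix(text: str, kind: str = "query") -> str:
    # E5 V2 models expect inputs prefixed with "query: " or "passage: "
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return f"{kind}: {text}"

# Embed stored documents as passages and search strings as queries
print(get_embedding(with_e5_prefix("what are embeddings?", "query")))
print(get_embedding(with_e5_prefix("Embeddings are vectors of floats.", "passage")))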

Simplifying inference

The boilerplate code we copied works, but it's pretty cluttered considering that we're not going to modify the math-y parts.

Luckily, the Sentence Transformers package provides a much simpler interface for computing embeddings. If you're using a different type of model, like a GPT model, you won't be able to use this package.

It wasn't totally obvious to me when I skimmed the package docs, but Sentence Transformers can be used for arbitrary models. I checked the source code and the SentenceTransformer class looks up Hugging Face models by ID upon initialisation, so I think it should work for other PyTorch-based embedding models in the Hugging Face model hub.

Replace the contents of model/app.py with the following.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-base-v2')

def single_embedding(input_text: str) -> list[float]:
    embedding = model.encode(input_text, normalize_embeddings=True)
    return embedding.tolist()

print(single_embedding("Hello world!"))

You can also use the encode method to compute embeddings for multiple texts at once.

def batch_embeddings(input_texts: list[str]) -> list[list[float]]:
    embeddings = model.encode(input_texts, normalize_embeddings=True)
    return embeddings.tolist()

print(batch_embeddings(["Hello world!", "Goodbye world!"]))

By default, the maximum sequence length for a Sentence Transformer model is 128. Input texts that are longer than the sequence length will be truncated, which means information loss. Our model has a sequence length of 512, and we should use it!

model = SentenceTransformer('intfloat/e5-base-v2')
model.max_seq_length = 512

Replace the contents of our model/requirements.txt file to reflect our new dependency.

sentence-transformers

Build the model service and run it. It should work the same as before.

docker compose up --build model

Downloading the model

Currently, our model service downloads the model every time it starts. This takes a long time, wastes resources unnecessarily, and fails if the model is unavailable. Instead, we can download the model files and store them in a directory. Then, we can load the model from this directory when the service starts.

Install Git LFS. Then run the following commands to download the model files into a new directory, model/model_files. Remember to replace the model ID with the one you're using.

# Initialise Git LFS
git lfs install

# Download with SSH (recommended)
git clone git@hf.co:intfloat/e5-base-v2 model/model_files

# Or download with HTTPS
git clone https://huggingface.co/intfloat/e5-base-v2 model/model_files

Update the model service in our Compose file to mount the model_files directory into the container.

volumes:
  - ./model/model_files:/app/model_files

This is not strictly required. The model files will already be copied into this location at build time, since our Dockerfile copies everything in the model directory (our build context) into /app. However, we're mounting the volume anyway because it allows us to update the model files without updating the Dockerfile and rebuilding the image.

Since we're defining the path to the model files in our Compose file, we should surface it as an environment variable. This will allow us to update the path without updating the service code and rebuilding the image. For good measure, set the sequence length too.

There are other ways to manage configuration, but I like the simplicity of environment variables when using Docker Compose.

Create a new file env/model.env with the following contents.

MODEL_PATH=/app/model_files
MODEL_SEQ_LEN=512

Update the model service in our Compose file to use this env file.

env_file:
  - env/model.env

Update model/app.py to load the model from the local path instead of fetching it from Hugging Face, and to set the sequence length.

import os
# ...

model_path = os.environ.get("MODEL_PATH")
model_seq_len = os.environ.get("MODEL_SEQ_LEN")

model = SentenceTransformer(model_path)
model.max_seq_length = int(model_seq_len)

Add the env and model_files directories to our .gitignore file so that you don't accidentally commit them to your repository.

# Environment variables
env/

# Model files
model/model_files/

Build the model service.

docker compose build model

Now try running the model service again. See, it's lightning fast!

docker compose up model

Creating a model server

So far we have a script that loads the model and performs inference on a hard-coded input text. But we want to be able to interact with our model! We need to create a server that can receive requests and return responses. The requests will be the input texts, and the responses will be the embeddings.

My favourite Python package for creating web servers is FastAPI. It's fast (hah), easy to use, and has solid documentation. We'll use Uvicorn as the ASGI server, as recommended by FastAPI.

We can turn our script into a server by adding a few lines of code. Let's simplify the function header while we're at it. Remember to delete the print statement.

from fastapi import FastAPI
# ...

app = FastAPI()
# ...

@app.post("/embed")
def embed(text: str) -> list[float]:
    embedding = model.encode(text, normalize_embeddings=True)
    return embedding.tolist()

We created a new FastAPI app and registered a route to an /embed endpoint. This endpoint accepts POST requests with a text field. Now our model service actually contains a server for making requests to the model!

Update the model service in our Compose file to run our FastAPI server and expose the port it's running on.

command: ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]
ports:
  - 8000:80

Add FastAPI and Uvicorn to our requirements file. I also needed to explicitly require httpx, which FastAPI's test client depends on but doesn't install by default; we'll use it when we write tests later.

fastapi
httpx
uvicorn

You know the drill by now. Build and run the model service.

docker compose up --build model

Test out our new endpoint by sending a request to it. Here's how you can do that with cURL. The endpoint currently uses query parameters, but we'll change that next.

curl -X POST "http://localhost:8000/embed?text=Hello%20world!" \
  -H "accept: application/json"

Creating a Pydantic model

Instead of using query parameters for our endpoint, we can create a model with Pydantic, which will also validate the request body. This will allow us to send a JSON body instead, which is more flexible and easier to read.

from pydantic import BaseModel, Field
# ...

class Request(BaseModel):
    text: str = Field(..., min_length=1)
# ...

@app.post("/embed")
async def embed(request: Request) -> list[float]:
    text = request.text
# ...

We now have a Request model with a single field, text. This field is required and must have a minimum length of 1.

Let's say we decide that we don't want to embed empty texts, and that we consider whitespace-only texts to be empty as well. This is just an example; you might want to have different validation rules.

We can write a function to check if a string is non-whitespace, then register it as a field validator to our text field.

from pydantic import BaseModel, Field, field_validator
# ...

class Request(BaseModel):
    text: str = Field(..., min_length=1)

    @field_validator("text")
    def check_non_whitespace(cls, value: str):
        if value.strip() == "":
            raise ValueError("value cannot be empty or contain only whitespaces")
        return value
# ...

Validating the validator

Instead of sending a request to our model service every time we want to test our validator, we can write automated tests.

Create a new file model/test_app.py with the following contents.

import pytest
from fastapi.testclient import TestClient
from app import app

client = TestClient(app)

def test_valid_request():
    data = {"text": "This is a valid request."}
    response = client.post("/embed", json=data)
    assert response.status_code == 200
    assert response.json() is not None

def test_invalid_field():
    data = {"invalid_field": "This is an invalid request."}
    response = client.post("/embed", json=data)
    assert response.status_code == 422
    assert "detail" in response.json()

@pytest.mark.parametrize("text", ["", " ", "\t", "\n"])
def test_empty_text_request(text):
    data = {"text": text}
    response = client.post("/embed", json=data)
    assert response.status_code == 422
    assert "detail" in response.json()

We've written three tests using pytest and FastAPI's TestClient. The first test sends a valid request, the second test sends an invalid request with an invalid field, and the third test sends multiple invalid requests with "empty" text values.

Add Pydantic and pytest to our requirements file. We need at least version 2.0 of Pydantic.

pydantic>=2.0
pytest

There's one more thing we need to do. Our environment variables are being set when we deploy our model service, but not when we run our tests. Our MODEL_PATH is also different locally than it is in the container. We can fix this by adding a new env file and ensuring it's loaded when we run our tests.

Create a new file env/test.env with the following contents.

MODEL_PATH=model_files
MODEL_SEQ_LEN=512

The typical approach in Python circles is to load the env file within the test script, but I find that messy. Instead, we're going to create a Makefile rule which temporarily loads the env file before running the tests.

Create a new file Makefile in your project root with the following contents.

test-model:
	@export $(shell grep -v '^#' env/test.env | xargs) && \
	cd model && pytest

Now we can run our tests locally, without deploying our model service.

# Activate the virtual environment
source venv/model/bin/activate

# Install the requirements
pip install -r model/requirements.txt

# Run the tests
make test-model

We can also send requests like before, but with a JSON body instead of query parameters.

curl -X POST "http://localhost:8000/embed" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d "{\"text\":\"Hello world!\"}"

Adding a health check

Remember how our app service was starting before our model service? It's not as bad now that we've downloaded the model files, but ideally we should guarantee that our model service is ready before our app service starts. We can do this by adding a health check to our model service.

Add a new endpoint to our model server which always returns a successful response.

@app.get("/healthcheck")
async def healthcheck() -> dict[str, str]:
    return {"status": "ok"}

This endpoint won't work until FastAPI is ready, so if we can successfully send a request to it, we know that our model service is ready to compute embeddings too.

Update our Compose file to add a health check to our model service, and to make our app service wait until the model service is healthy.

services:
  model:
    # ...
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:80/healthcheck"]
      interval: 10s
      timeout: 1s
      retries: 3
  app:
    # ...
    depends_on:
      model:
        condition: service_healthy

Our health check sends a GET request to the /healthcheck endpoint with cURL. The --fail flag tells cURL to return a non-zero exit code if the request fails. This is what we want, because the health check should fail if the request fails. We've set the health check to run every 10 seconds, time out after 1 second, and retry failed checks 3 times.

Build the model service.

docker compose build model

Now when we start both services, the app service waits until the model service is healthy.

docker compose up

Integrating the services

So far, our app service hasn't been doing much. Let's change that! For fun, let's write the app service in Ruby.

Create a new directory app in your project root.

Create app/demo.rb with the following contents.

require 'httpx'
require 'uri'
require 'json'

# Flush stdout immediately for real-time output
$stdout.sync = true

model_service_url = URI('http://model:80/embed')

# Get the text to embed from the CLI arguments
text_value = ARGV[0]

loop do
  # Send a request to the model service
  response = HTTPX.post(model_service_url, json: { text: text_value })
  
  # Handle the response
  if response.status >= 200 && response.status < 300
    vector = JSON.parse(response.body)
    puts "Received a vector of length #{vector.length}"
  else
    puts "Error: #{response.status}"
  end
  
  # Wait before sending the next request
  sleep 30
end

Create app/Gemfile with the following contents.

source 'https://rubygems.org'

gem 'httpx'

Create app/Dockerfile with the following contents.

FROM ruby:3-slim

# Set the working directory inside the container
WORKDIR /app

# Install the requirements
COPY Gemfile .
RUN bundle install

# Copy the rest of the files
COPY . .

Update the app service in our Compose file to use this Dockerfile.

app:
  build:
    context: ./app
    dockerfile: Dockerfile
  command: ["ruby", "demo.rb", "Hello model!"]

Build the app service.

docker compose build app

Start both services.

docker compose up

You should see the model service log a successful embed request. Right after, the app service should log the received vector length. This process should repeat every 30 seconds. As you can imagine, the app service could easily be doing something more interesting.

But what did we actually do? We built an app service and made it send a request to the model service. Did you notice that the server URL contains the service name model? Docker Compose automatically resolves this to the IP address of the model service.

I used Ruby to drive home the idea that languages can differ between services. All that matters is that data contracts between services are well-defined, and that services can translate between their internal representations and data contracts. Or more simply: It's nothing but bytes, baby!

Your final project structure should look like this.

.
├── .gitignore
├── Makefile
├── app
│   ├── Dockerfile
│   ├── Gemfile
│   └── demo.rb
├── compose.yaml
├── env
│   ├── model.env
│   └── test.env
├── model
│   ├── Dockerfile
│   ├── app.py
│   ├── model_files
│   ├── requirements.txt
│   └── test_app.py
└── venv

We're living in the future

The initial version of my model service took about ten minutes to build, and the final version took a couple hours. (Okay, I also spent time before that figuring out what I needed to do.)

Knowing how complex these types of systems are when they're built from scratch, I'm amazed at how easy it was with open-source models and tools. Plus, they keep getting better and the ecosystem keeps growing.

Shoutout to all the people who have contributed to this community, whether with code or otherwise! You rock!

Taking it further

It doesn't have to stop here. There are a lot of ways that you can take this project further.

You can add more endpoints, like for tokenising input texts or calculating embedding similarity. You can support different models, like GPT models for text generation. You can improve scalability, efficiency, and reliability if you want to use this in production. You can also accelerate inference with GPUs, if you have them available. And of course, you can build your own app service.
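
As a taste of that first idea, here's a rough sketch of what an extra similarity endpoint could look like alongside the existing code in model/app.py. The endpoint name and request shape are my own; only model, app, BaseModel, and Field come from the tutorial code.

from sentence_transformers import util
# ...

class SimilarityRequest(BaseModel):
    text_a: str = Field(..., min_length=1)
    text_b: str = Field(..., min_length=1)

@app.post("/similarity")
async def similarity(request: SimilarityRequest) -> float:
    # Embed both texts, then return their cosine similarity
    embeddings = model.encode([request.text_a, request.text_b], normalize_embeddings=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()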

Heck, you can even take what you've learned and apply it to other projects!

Conclusion

I hope this post has inspired you to build your own model service, or even to build something else entirely. I also hope that you learned something new, whether it was about embedding models, Docker Compose, or something else. If not, then I hope you at least enjoyed the read!