Here's how the txtai docs define a basic Embeddings instance:

```python
from txtai import Embeddings

embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")
```
As you can see, it supports Hugging Face models by default. This is fine for many cases, but what if you would rather use a service like OpenAI or Cohere?
I dug around the documentation, the issues, the examples, and even the source code for a good while before I found a working approach. Let me save you the trouble. First, the imports:
```python
from os import getenv
from typing import List

import numpy as np
from dotenv import load_dotenv
from langchain_openai import AzureOpenAIEmbeddings
from txtai import Embeddings

load_dotenv()
```
Next, define your Azure OpenAI embeddings client:
```python
openai_embeddings = AzureOpenAIEmbeddings(
    azure_deployment=getenv("EMBEDDINGS_NAME"),
    api_key=getenv("OPENAI_API_KEY"),
    azure_endpoint=getenv("AZURE_OPENAI_ENDPOINT"),
)
```
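This assumes a `.env` file that defines the three variables read above. The values below are placeholders, not real settings; substitute your own deployment name, key, and endpoint:

```
# .env — placeholder values, replace with your own Azure OpenAI settings
EMBEDDINGS_NAME=my-embeddings-deployment
OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
```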
and a function to generate embeddings:
```python
def get_openai_embeddings(texts: List[str]) -> np.ndarray:
    # txtai expects a 2D float32 array back: one vector per input text
    results = openai_embeddings.embed_documents(texts=texts)
    return np.array(results, dtype=np.float32)
```
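If you want to verify the wiring without an API key, any function with the same contract works. Here's a hypothetical stand-in (a toy hashed bag-of-words, not a real embedding model) that you could pass as the transform instead:

```python
import numpy as np
from typing import List

def fake_transform(texts: List[str], dims: int = 64) -> np.ndarray:
    # Toy stand-in with the same contract as get_openai_embeddings:
    # a list of strings in, a 2D float32 array out, one row per text.
    vectors = np.zeros((len(texts), dims), dtype=np.float32)
    for row, text in enumerate(texts):
        for token in text.lower().split():
            vectors[row, hash(token) % dims] += 1.0
    # L2-normalize so dot products behave like cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return (vectors / np.maximum(norms, 1e-9)).astype(np.float32)
```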
Now, define the Embeddings object:
```python
embeddings = Embeddings(
    {
        "transform": get_openai_embeddings,
        "backend": "numpy",
        "content": True,
    }
)
```
As you can see, we are passing a custom transform function here: txtai calls it with a list of texts and expects a numpy array of vectors back, one per text.
Let's test this out:
```python
data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]
```
```python
embeddings.index(data)
print(embeddings.search("feel good story", 1))
```

which prints

```python
[{'id': '4', 'text': 'Maine man wins $1M from $25 lottery ticket', 'score': 0.764844536781311}]
```
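The score reported here is, in the typical setup, cosine similarity between the query vector and the stored document vector. A quick numpy sketch of what that number means (this is an illustration of the metric, not txtai's internal code):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the two L2-normalized vectors
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Identical directions score 1.0; orthogonal vectors score 0.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))  # → 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # → 0.0
```

So a score of ~0.76 means the query and the lottery headline point in broadly the same direction in embedding space.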