(sec-xinference)=
# Xinference

Xorbits Inference (Xinference) is an inference platform for large models, supporting large language models, embedding (vector) models, and text-to-image models. It builds on the distributed computation provided by [Xoscar](https://github.com/xorbitsai/xoscar), so models can be deployed on a cluster. The platform offers an OpenAI-compatible interface, enabling users to deploy and call open-source large models. Xinference integrates the external service API, the inference engine, and the hardware layer, so unlike with Ray Serve, you do not need to write code to manage the model inference service.

## Inference Engine

Xinference can work with different inference engines, including Hugging Face Transformers, [vLLM](https://github.com/vllm-project/vllm), [llama.cpp](https://github.com/ggerganov/llama.cpp), etc. You therefore need to install the inference engine you plan to use together with Xinference, for example `pip install "xinference[transformers]"`. Transformers is based on PyTorch; it is the quickest to support new models and covers the widest range of them, but its inference performance is weaker. Other inference engines, such as vLLM and llama.cpp, focus on performance optimization but do not cover as many models as Transformers.

## Cluster

Before using Xinference, you need to start a Xinference cluster, which can be either single-machine multi-GPU or multi-machine multi-GPU. On a single machine, you can start it from the command line like this:

In [None]:
xinference-local --host 0.0.0.0 --port 9997

The cluster setup is similar to Xorbits Data. First, start a Supervisor, then start the Worker:

In [None]:
# Start the Supervisor
xinference-supervisor -H <supervisor_ip>

# Start the Worker
xinference-worker -e "http://<supervisor_ip>:9997" -H <worker_ip>

After that, you can access the Xinference service at http://<supervisor_ip>:9997.

## Using Models

Xinference provides full lifecycle management for models, including starting, running, and shutting down models. Once the Xinference service is started, users can start and use models. Xinference supports various open-source models, allowing users to select and start models through a web interface. Xinference will automatically download and initialize the required models in the backend. Each model comes with a web-based conversation interface and provides an OpenAI API-compatible interface.

Next, two examples demonstrate how to use Xinference in a local environment: how to interact with Xinference through the OpenAI API, and how to build an intelligent application with LangChain and a vector database.

## Example: Using Llama for Simple Conversation

Before getting started, install the `openai` package in addition to Xinference:

In [None]:
%pip install xinference[transformers] openai

First, we start a local instance of Xinference. In a Jupyter Notebook, use the following command to run Xinference in the background. In the command line, you can directly use `xinference-local --host 0.0.0.0 --port 9997`.

In [9]:
%%bash
if ps ax | grep -v grep | grep "xinference-local" > /dev/null
then
    echo "Service is already running, exiting."
else
    echo "Service is not running, starting service."
    nohup xinference-local --host 0.0.0.0 --port 9997 > xinference.log 2>&1 &
fi

Service is not running, starting service.


The default host and port for Xinference are 127.0.0.1 and 9997, respectively.
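
Because the service is started in the background, it may not accept connections immediately. The following is a minimal sketch, not part of Xinference, that polls the TCP port until something is listening. The helper name `wait_for_port` and its default timeout are assumptions made for illustration, and an open port only shows that a process is listening, not that the service has finished initializing.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 9997,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper: the defaults mirror the host and port used in this section.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Try to open (and immediately close) a connection to the service.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet (or the host is unreachable).
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port()` in a notebook cell right after the start command gives a simple way to pause until the port opens before launching a model.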

Next, use the following command to start the Llama model. The `--size-in-billions` parameter selects the parameter scale. The third-generation Llama instruction-tuned model (named `llama-3-instruct` in Xinference) is available at 8 billion and 70 billion parameters. The `--quantization` parameter specifies the precision reduction method (options: 4-bit, 8-bit, or none for full precision). Here we use the 8B model with 8-bit quantization.

In [1]:
!xinference launch \
  --model-uid my-llm \
  --model-name llama-3-instruct \
  --size-in-billions 8 \
  --quantization 8-bit \
  --model-format pytorch \
  --model-engine transformers

Launch model name: llama-3-instruct with kwargs: {}
Model uid: my-llm


When starting the model for the first time, Xinference will automatically download the model, which may take some time.
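
How much memory the weights need depends mainly on the parameter count and the quantization level. The sketch below is a rough, weights-only estimate. It assumes that unquantized weights are stored as 16-bit values, and it ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name is hypothetical.

```python
def estimate_weight_memory_gib(params_in_billions: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = params_in_billions * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


# Compare an 8B model without quantization (assumed 16-bit), 8-bit, and 4-bit.
for bits in (16, 8, 4):
    gib = estimate_weight_memory_gib(8, bits)
    print(f"8B parameters at {bits}-bit: ~{gib:.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is why 8-bit quantization makes the 8B model easier to fit on a single consumer GPU.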

Since Xinference provides an OpenAI-compatible API, you can treat the model running on Xinference as a local alternative to OpenAI.

In [2]:
import openai

client = openai.Client(api_key="can be empty", base_url="http://127.0.0.1:9997/v1")

With the client in place, we can talk to the model through the OpenAI API.

### Chat Completion API
We use `client.chat.completions.create` for contextual conversation.

The Chat Completion API provides a more structured way to interact with large language models (LLMs). Instead of a single plain-text prompt, we send the LLM an array of structured message objects. This lets the model reference "context" or "history" when generating responses.

Typically, each message has a `role` and a `content` field:

- The `system` role is used to convey core instructions defined by the developer to the language model.
- The `user` role represents the requests sent by the user to the language model.
- The `assistant` role is the response returned by the language model to the user's request.
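
To make the structure concrete, here is a small sketch of a message list that uses all three roles. The conversation content is made up for illustration, and no request is sent; the snippet only builds and prints the payload.

```python
import json

# A conversation is an ordered list of {"role", "content"} dictionaries.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "My favorite color is blue"},
    {"role": "assistant", "content": "That's wonderful to hear!"},
    {"role": "user", "content": "What is my favorite color?"},
]

# Basic sanity check: every message uses one of the three roles described above.
allowed_roles = {"system", "user", "assistant"}
for message in messages:
    if message["role"] not in allowed_roles:
        raise ValueError(f"unexpected role: {message['role']}")

print(json.dumps(messages, indent=2))
```

The earlier `assistant` message is what lets the model answer the final `user` question from the conversation history.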

First, we define helper functions that build these messages:

In [9]:
def assistant(content: str):
    return {"role": "assistant", "content": content}


def user(content: str):
    return {"role": "user", "content": content}

Let's try using the Chat Completion API:

In [10]:
def chat_complete_and_print(
    messages, temperature=0.7, top_p=0.9, client=client, model="my-llm"
):
    response = (
        client.chat.completions.create(
            model=model, messages=messages, top_p=top_p, temperature=temperature
        )
        .choices[0]
        .message.content
    )
    print(f"==============\nassistant: {response}\n\n")


chat_complete_and_print(
    messages=[
        user("My favorite color is blue"),
        assistant("That's wonderful to hear!"),
        user("What is my favorite color?"),
    ]
)

chat_complete_and_print(
    messages=[
        user("I have a little dog named Lucy"),
        assistant("That's awesome! Lucy must be very cute."),
        user("What is my pet's name?"),
    ]
)

assistant: You told me earlier that your favorite color is BLUE!


assistant: You told me earlier that your pet's name is Lucy, which is a lovely name for a dog!




We can adjust some parameters provided by the API to trade off creativity against determinism in the output.

`top_p` sets a cumulative-probability cutoff for token selection, which limits the pool of candidate tokens, while `temperature` controls how much randomness is applied when sampling from that pool. When the temperature is close to 0, the result is almost deterministic.
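
The effect of the two parameters can be illustrated on a toy distribution. The sketch below is independent of Xinference and of any particular engine: the logits are invented, the function names are hypothetical, and real samplers may apply these steps in a different order or with extra details.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens
for temperature in (0.1, 1.0):
    probs = apply_temperature(logits, temperature)
    print(f"temperature={temperature}:", [round(p, 3) for p in probs])

# Token indices that survive the top_p cutoff at temperature 1.0.
print(top_p_filter(apply_temperature(logits, 1.0), top_p=0.9))
```

A low temperature concentrates almost all probability on the top token, while a low `top_p` removes unlikely tokens from the pool entirely.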

In [12]:
messages = [
    user("I've been learning piano recently."),
    assistant("That's really a great hobby!"),
    user("What do you think are the benefits of learning this instrument? Tell me briefly"),
]


# More deterministic results
chat_complete_and_print(messages, temperature=0.1, top_p=0.1)
chat_complete_and_print(messages, temperature=0.1, top_p=0.1)

# More random results
chat_complete_and_print(messages, temperature=1.0, top_p=1.0)
chat_complete_and_print(messages, temperature=1.0, top_p=1.0)

assistant: Learning to play the piano can have numerous benefits, including:

* Improved cognitive skills: Playing piano requires coordination between hands, eyes, and brain, which can improve memory, concentration, and problem-solving abilities.
* Enhanced creativity: Piano playing allows for self-expression and creativity through music composition and improvisation.
* Stress relief: Playing piano can be a calming and meditative experience, reducing stress and anxiety.
* Brain development: Research suggests that early childhood piano lessons can even affect brain structure and function, improving spatial-temporal skills and language development.
* Social benefits: Playing piano can provide opportunities to connect with others through music-making, whether it's performing in front of an audience or jamming with friends.

These are just a few examples, but I'm sure you're experiencing many more benefits as you learn and enjoy playing the piano!


assistant: Learning to play the piano ca

When the inference service is no longer needed, you can shut down the background running Xinference instance:

In [12]:
!ps ax | grep xinference-local | grep -v grep | awk '{print $1}' | xargs kill -9

## Example: Document Chatbot Based on LangChain

This example demonstrates how to build a chatbot with a local large model and the LangChain framework. The chatbot can read a document and answer questions in conversation based on its content.

First, we install the necessary libraries:

In [None]:
%pip install xinference[transformers] langchain

Run Xinference in the background using the following command:

In [24]:
%%bash
if ps ax | grep -v grep | grep "xinference-local" > /dev/null
then
    echo "Service is already running, exiting."
else
    echo "Service is not running, starting service."
    # Optional Hugging Face mirror; it must be exported so the server process inherits it
    export HF_ENDPOINT=https://hf-mirror.com
    nohup xinference-local --host 0.0.0.0 --port 9997 > xinference.log 2>&1 &
fi

Service is already running, exiting.


### Start the Vector Model

Using Mark Twain's "The Million Pound Bank Note" as an example, we first use LangChain to read the document and split the text within the document.

In [38]:
import os

from utils import mark_twain
from langchain.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

file_path = mark_twain()
loader = PDFMinerLoader(os.path.join(file_path, "Twain-Million-Pound-Note.pdf"))

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
    length_function=len,
)

docs = text_splitter.split_documents(documents)
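
The roles of `chunk_size` and `chunk_overlap` can be seen in a simplified sketch. This is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph and line breaks. It is only a fixed-window version, with a hypothetical function name, that shows how consecutive chunks share text.

```python
def split_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 100):
    """Split text into fixed-size windows where neighbors share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Tiny example: windows of 4 characters, each sharing 1 character with the previous one.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighboring chunks, which helps retrieval later.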

Next, we need to start an embedding (vector) model to convert the text content of the document into vectors:

In [40]:
!xinference launch \
    --model-name "bge-m3" \
    -e "http://0.0.0.0:9997" \
    --model-type embedding

Launch model name: bge-m3 with kwargs: {}
Model uid: bge-m3


In [16]:
from langchain.embeddings import XinferenceEmbeddings

xinference_embeddings = XinferenceEmbeddings(
    server_url="http://0.0.0.0:9997",
    model_uid="bge-m3"
)

### Start the Vector Database

We introduce a vector database, which stores vectors and documents, with each vector corresponding to a document. In this example, we use the Milvus vector database to store vectors and documents.
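
Conceptually, a vector database pairs each vector with its document and returns the documents whose vectors are closest to a query vector. The toy in-memory sketch below illustrates that idea with cosine similarity. It is not how Milvus is implemented (production systems use approximate indexes and configurable metrics), and the class name and the two-dimensional vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: one vector per document."""

    def __init__(self):
        self._items = []  # list of (vector, document) pairs

    def add(self, vector, document):
        self._items.append((vector, document))

    def similarity_search(self, query_vector, k=1):
        # Rank all stored documents by similarity to the query (brute force).
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [document for _, document in ranked[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "passage about the bank-note")
store.add([0.0, 1.0], "passage about the weather")
print(store.similarity_search([0.9, 0.1], k=1))
```

In the real pipeline the vectors come from the embedding model rather than being written by hand.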

The Milvus database can be installed using the following command:

In [None]:
%pip install milvus

Run the Milvus database in the background using the following command:

In [41]:
%%bash
if ps ax | grep -v grep | grep "milvus-server" > /dev/null
then
    echo "Service is already running, exiting."
else
    echo "Service is not running, starting service."
    nohup milvus-server > milvus.log 2>&1 &
fi

Service is not running, starting service.


Next, we store the vectors in the Milvus database:

In [44]:
from langchain.vectorstores import Milvus

vector_db = Milvus.from_documents(
    docs,
    xinference_embeddings,
    connection_args={"host": "0.0.0.0", "port": "19530"},
)

Here, we can try a retrieval query against the document (no large language model is involved, so only the best-matching passage is returned):

In [45]:
query = "What did the protagonist do with the million-pound banknote?"
docs = vector_db.similarity_search(query, k=1)
print(docs[0].page_content)

in London without a friend, and with no money but that million-pound bank-note, and no way to 
account for his being in possession of it. Brother A said he would starve to death; Brother B said 
he wouldn't. Brother A said he couldn't offer it at a bank or anywhere else, because he would be 
arrested on the spot. So they went on disputing till Brother B said he would bet twenty thousand 
pounds that the man would live thirty days, any way, on that million, and keep out of jail, too.


### Start the Large Language Model

Next, we start a large language model for conversation. Here, we use the llama-3-instruct model supported by Xinference:

In [46]:
!xinference launch \
    --model-name "llama-3-instruct" \
    --model-format pytorch \
    --size-in-billions 8 \
    -e "http://0.0.0.0:9997" \
    --model-engine transformers

Launch model name: llama-3-instruct with kwargs: {}
Model uid: llama-3-instruct


In [47]:
from langchain.llms import Xinference

xinference_llm = Xinference(
    server_url="http://0.0.0.0:9997",
    model_uid = "llama-3-instruct"
)

Now, we use the large language model and the vector database to create a `ConversationalRetrievalChain`. LangChain connects different components, and each such connection is called a Chain. In this example, the chain connects conversation and information retrieval.
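
The flow that such a chain performs can be sketched in plain Python: rewrite the follow-up question using the chat history, retrieve matching passages, then answer from those passages. This is a simplified illustration, not LangChain's implementation. The function name and the prompt wording are assumptions, and `llm` and `retrieve` are stand-in callables rather than real model or database calls.

```python
from typing import Callable, List, Tuple


def conversational_retrieval(
    question: str,
    chat_history: List[Tuple[str, str]],
    llm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
) -> str:
    # Step 1: with prior turns, ask the LLM for a standalone question.
    if chat_history:
        history = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
        standalone = llm(
            "Rephrase the follow-up as a standalone question.\n"
            f"Chat history:\n{history}\nFollow-up: {question}\nStandalone question:"
        )
    else:
        standalone = question

    # Step 2: fetch the passages most relevant to the standalone question.
    context = "\n\n".join(retrieve(standalone))

    # Step 3: answer from the retrieved context (hypothetical prompt wording).
    answer = llm(
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {standalone}\nAnswer:"
    )
    chat_history.append((question, answer))  # remember this turn
    return answer


# Stand-ins for the real model and retriever, for demonstration only.
def fake_llm(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


def fake_retrieve(query: str) -> List[str]:
    return ["Brother B bet twenty thousand pounds."]


history: List[Tuple[str, str]] = []
print(conversational_retrieval("What was the bet?", history, fake_llm, fake_retrieve))
```

In the notebook, the `memory` object plays the role of `chat_history`, and the retriever is backed by the vector database.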

In [60]:
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chain = ConversationalRetrievalChain.from_llm(
    llm=xinference_llm, 
    retriever=vector_db.as_retriever(), 
    memory=memory
)

Next, we can query information from the document:

In [51]:
def chat(query):
    result = chain({"question": query})
    print(result["answer"])

chat("How did people react to the protagonist carrying the million-pound banknote?")

 The protagonist carries the million-pound banknote around, showing it to people and talking about its history, which causes them to laugh. He shares the story with a woman, and she laughs so hard she has trouble catching her breath. The story is likely meant to be humorous and entertaining, but it also highlights the absurdity of the situation.

People's reactions to the protagonist carrying the million-pound banknote range from confusion to amusement. Many are skeptical and disbelieve his claim, while others are impressed and even intimidated by the large sum of money. The protagonist's storytelling ability and charisma seem to be what ultimately win over the woman, who becomes engaged by his tale and laughs uncontrollably.

In terms of what motivates the two brothers to make their bet, it seems that boredom and social beliefs play a role. They are bored with their lives and want to shake things up, and they believe that making a bet like this will bring excitement and adventure into

Note that at this point, the model does not simply return the same sentences from the document, but generates responses by summarizing the relevant content.

In [53]:
chat("What was the origin of the million-pound banknote and why was it given to him?")

  It is not explicitly stated how the protagonist acquired the million-pound bank-note or who gave it to him. The passage primarily revolves around the disagreements between Brothers A and B about the protagonist's prospects. Therefore, we can only speculate as to where the note originated or why it was granted to the protagonist. The narrative leaves this crucial information unaddressed, leaving the reader to wonder about the mysterious note. [End] [End]
1....read the text carefully. [End] [End] [End] [End]
The above response is based on careful analysis of the provided textual context. The information given does not provide answers to these questions, so I chose not to attempt to fill in the gaps with speculative ideas. Instead, I concentrated on accurately reflecting the existing knowledge provided by the passage. [End] [End] [End] [End]
2. No additional info is given to help us understand the origin of the banknote or why it was bestowed upon the protagonist. [End]
3. Correct, ther

Here, the large language model correctly identifies that "him" refers to the protagonist, which shows that combining Xinference with LangChain lets the model draw on local knowledge.

These two examples showcase intelligent applications that can be built locally with Xinference.