10.2. Xinference#
Xorbits Inference (Xinference) is an inference platform for large models, supporting large language models, embedding (vector) models, and text-to-image models. It is built on the distributed computation provided by Xoscar, allowing models to be deployed across a cluster. The platform offers an OpenAI-compatible interface, enabling users to deploy and call open-source large models. Xinference integrates the externally facing API, the inference engine, and the hardware, so users do not need to write code to manage model inference services as they would with Ray Serve.
Inference Engine#
Xinference can adapt to different inference engines, including Hugging Face Transformers, vLLM, llama.cpp, and others, so you need to install the corresponding inference engine along with Xinference, for example pip install "xinference[transformers]". Transformers is implemented entirely in PyTorch and offers the most comprehensive model compatibility, with new models typically supported first, but its inference performance is relatively poor; other inference engines, such as vLLM and llama.cpp, focus on performance optimization but do not cover as many models as Transformers.
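For reference, each engine is installed as an optional extra. The exact extra names may vary across Xinference versions, so check the project documentation if an install fails:

pip install "xinference[transformers]"  # Transformers backend: broadest model coverage
pip install "xinference[vllm]"          # vLLM backend: optimized GPU throughput
pip install "xinference[all]"           # install all supported backends at once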
Cluster#
Before using Xinference, you need to start a cluster, which can be either single-machine multi-GPU or multi-machine multi-GPU. On a single machine, you can start it from the command line like this:
xinference-local --host 0.0.0.0 --port 9997
The cluster setup is similar to Xorbits Data. First, start a Supervisor, then start the Worker:
# Start the Supervisor
xinference-supervisor -H <supervisor_ip>
# Start the Worker
xinference-worker -e "http://<supervisor_ip>:9997" -H <worker_ip>
After that, you can access the Xinference service at http://<supervisor_ip>:9997.
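Since the service exposes an OpenAI-compatible REST API, a quick way to verify that the cluster is up is to list the available models over HTTP (the /v1/models route follows the OpenAI convention):

curl http://<supervisor_ip>:9997/v1/models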
Using Models#
Xinference provides full lifecycle management for models, including starting, running, and shutting down models. Once the Xinference service is started, users can start and use models. Xinference supports various open-source models, allowing users to select and start models through a web interface. Xinference will automatically download and initialize the required models in the backend. Each model comes with a web-based conversation interface and provides an OpenAI API-compatible interface.
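Besides the web interface, the same lifecycle operations are available programmatically through Xinference's Python client. Below is a minimal sketch; the method names follow the xinference.client.Client API of recent versions, so adjust to your installed release:

from xinference.client import Client

client = Client("http://127.0.0.1:9997")

# List models currently running on the cluster
print(client.list_models())

# Launch a model and obtain its uid
model_uid = client.launch_model(
    model_name="llama-3-instruct",
    model_engine="transformers",
    model_size_in_billions=8,
)

# Shut the model down when it is no longer needed
client.terminate_model(model_uid)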
Next, we will demonstrate how to use Xinference in a local environment through two examples: interacting with a model via the OpenAI API, and building an intelligent document chatbot with LangChain and a vector database.
Example: Using Llama for Simple Conversation#
Before getting started, in addition to installing Xinference, you also need to install the openai dependency package:
%pip install "xinference[transformers]" openai
First, we start a local instance of Xinference. In a Jupyter Notebook, use the following command to run Xinference in the background; on the command line, you can directly run xinference-local --host 0.0.0.0 --port 9997.
%%bash
if ps ax | grep -v grep | grep "xinference-local" > /dev/null
then
echo "Service is already running, exiting."
else
echo "Service is not running, starting service."
nohup xinference-local --host 0.0.0.0 --port 9997 > xinference.log 2>&1 &
fi
Service is not running, starting service.
The default host and port for Xinference are 127.0.0.1 and 9997, respectively.
Next, use the following command to start the Llama model. The --size-in-billions parameter specifies the model's parameter scale; the Llama 3 instruct model (named llama-3-instruct in Xinference) is available at parameter scales of 8 billion and 70 billion. The --quantization parameter specifies the precision reduction method (options: 4-bit, 8-bit, or none for full precision). Here we use the 8B model with 8-bit quantization.
!xinference launch \
--model-uid my-llm \
--model-name llama-3-instruct \
--size-in-billions 8 \
--quantization 8-bit \
--model-format pytorch \
--model-engine transformers
Launch model name: llama-3-instruct with kwargs: {}
Model uid: my-llm
When starting the model for the first time, Xinference will automatically download the model, which may take some time.
Since Xinference provides an OpenAI-compatible API, you can treat the model running on Xinference as a local alternative to OpenAI.
import openai
client = openai.Client(api_key="can be empty", base_url="http://127.0.0.1:9997/v1")
Next, we will use the OpenAI API to easily use the large model for conversation.
Chat Completion API#
Next, we will use client.chat.completions.create for contextual conversation.
The Chat Completion API provides a more structured way to interact with large language models (LLMs). Instead of a single text prompt, we send an array of structured message objects to the LLM as input. This input format allows the model to reference "context" or "history" when generating responses.
Typically, each message has a role and content:

* The system role conveys core instructions defined by the developer to the language model.
* The user role represents the requests sent by the user to the language model.
* The assistant role is the response returned by the language model to the user's request.
First, we define the structured information:
def assistant(content: str):
    return {"role": "assistant", "content": content}


def user(content: str):
    return {"role": "user", "content": content}
Let’s try using the Chat Completion API:
def chat_complete_and_print(
    messages, temperature=0.7, top_p=0.9, client=client, model="my-llm"
):
    response = (
        client.chat.completions.create(
            model=model, messages=messages, top_p=top_p, temperature=temperature
        )
        .choices[0]
        .message.content
    )
    print(f"==============\nassistant: {response}\n\n")
chat_complete_and_print(
    messages=[
        user("My favorite color is blue"),
        assistant("That's wonderful to hear!"),
        user("What is my favorite color?"),
    ]
)
chat_complete_and_print(
    messages=[
        user("I have a little dog named Lucy"),
        assistant("That's awesome! Lucy must be very cute."),
        user("What is my pet's name?"),
    ]
)
==============
assistant: You told me earlier that your favorite color is BLUE!
==============
assistant: You told me earlier that your pet's name is Lucy, which is a lovely name for a dog!
We can adjust some parameters provided by the API to configure the creativity and determinism of the output.
The top_p parameter sets the cumulative probability cutoff for token selection, which controls how many candidate tokens are considered, while the temperature parameter controls how much randomness is applied when sampling within that range. When the temperature is close to 0, the result is almost deterministic.
messages = [
    user("I've been learning piano recently."),
    assistant("That's really a great hobby!"),
    user("What do you think are the benefits of learning this instrument? Tell me briefly"),
]
# More deterministic results
chat_complete_and_print(messages, temperature=0.1, top_p=0.1)
chat_complete_and_print(messages, temperature=0.1, top_p=0.1)
# More random results
chat_complete_and_print(messages, temperature=1.0, top_p=1.0)
chat_complete_and_print(messages, temperature=1.0, top_p=1.0)
==============
assistant: Learning to play the piano can have numerous benefits, including:
* Improved cognitive skills: Playing piano requires coordination between hands, eyes, and brain, which can improve memory, concentration, and problem-solving abilities.
* Enhanced creativity: Piano playing allows for self-expression and creativity through music composition and improvisation.
* Stress relief: Playing piano can be a calming and meditative experience, reducing stress and anxiety.
* Brain development: Research suggests that early childhood piano lessons can even affect brain structure and function, improving spatial-temporal skills and language development.
* Social benefits: Playing piano can provide opportunities to connect with others through music-making, whether it's performing in front of an audience or jamming with friends.
These are just a few examples, but I'm sure you're experiencing many more benefits as you learn and enjoy playing the piano!
==============
assistant: Learning to play the piano can have numerous benefits, including:
* Improved cognitive skills: Playing piano requires coordination between hands, eyes, and brain, which can improve memory, concentration, and problem-solving abilities.
* Enhanced creativity: Piano playing allows for self-expression and creativity through music composition and improvisation.
* Stress relief: Playing piano can be a calming and meditative experience, reducing stress and anxiety.
* Brain development: Research suggests that early childhood piano lessons can even affect brain structure and function, improving spatial-temporal skills and language development.
* Social benefits: Playing piano can provide opportunities to connect with others through music-making, whether it's performing in front of an audience or jamming with friends.
These are just a few examples, but I'm sure you're experiencing many more benefits as you learn and enjoy playing the piano!
==============
assistant: Learning to play the piano can bring many benefits, including:
* Improved cognitive skills: playing the piano requires coordination between hands and brain, which can improve memory, concentration, and spatial-temporal skills.
* Enhanced creativity: composing and improvising music can foster creative thinking and self-expression.
* Emotional intelligence: playing emotional music can help develop empathy and understanding of others' emotions.
* Language development: research suggests that musical training can improve language skills and even delay symptoms of Alzheimer's disease.
* Stress relief: playing the piano can be a calming and meditative experience, reducing anxiety and stress levels.
These are just a few examples, but I'm sure you're experiencing them firsthand as you learn to play the piano! Keep practicing and enjoying it!
==============
assistant: Learning to play the piano can have many benefits, including:
* Improved cognitive skills: Playing the piano requires coordination between different parts of the brain, which can improve memory, concentration, and spatial-temporal skills.
* Enhanced creativity and self-expression
* Boosted confidence and self-esteem through mastery of new skills
* Stress relief and relaxation through playing calming music or meditative pieces
* Language development for children (piano lessons can even be beneficial for language learning)
* Social benefits from sharing your love of music with others, whether by performing or teaching
* Brain plasticity and adaptability, as your brain reorganizes itself in response to new musical demands.
These benefits can translate to other areas of life, such as personal relationships, work, and overall well-being!
How about you, what has been most enjoyable or surprising about learning piano so far?
When the inference service is no longer needed, you can shut down the Xinference instance running in the background:
!ps ax | grep xinference-local | grep -v grep | awk '{print $1}' | xargs kill -9
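Alternatively, if you only want to stop the model while keeping the Xinference service itself running, the CLI also provides a terminate subcommand (check xinference --help for the options available in your version):

!xinference terminate --model-uid my-llm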
Example: Document Chatbot Based on LangChain#
This example will demonstrate how to build a chatbot using a local large model and the LangChain framework. With this chatbot, users can load a document and interact in conversations based on its content.
First, we install the necessary libraries:
%pip install "xinference[transformers]" langchain
Run Xinference in the background using the following command:
%%bash
if ps ax | grep -v grep | grep "xinference-local" > /dev/null
then
echo "Service is already running, exiting."
else
echo "Service is not running, starting service."
export HF_ENDPOINT=https://hf-mirror.com  # export so the background process uses the Hugging Face mirror
nohup xinference-local --host 0.0.0.0 --port 9997 > xinference.log 2>&1 &
fi
Service is already running, exiting.
Start the Vector Model#
Using Mark Twain’s “The Million Pound Bank Note” as an example, we first use LangChain to read the document and split the text within the document.
import os
from utils import mark_twain
from langchain.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
file_path = mark_twain()
loader = PDFMinerLoader(os.path.join(file_path, "Twain-Million-Pound-Note.pdf"))
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
    length_function=len,
)
docs = text_splitter.split_documents(documents)
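It is worth sanity-checking the split before computing embeddings, for example by inspecting the number of chunks and previewing one of them:

print(f"Number of chunks: {len(docs)}")
print(docs[0].page_content[:200])  # preview the first chunk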
Next, we need to start a vector (Embedding) model to convert the text content of the document into vectors:
!xinference launch \
--model-name "bge-m3" \
-e "http://0.0.0.0:9997" \
--model-type embedding
Launch model name: bge-m3 with kwargs: {}
Model uid: bge-m3
from langchain.embeddings import XinferenceEmbeddings
xinference_embeddings = XinferenceEmbeddings(
    server_url="http://0.0.0.0:9997",
    model_uid="bge-m3",
)
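As a quick check, LangChain's embedding interface exposes embed_query, which returns one vector per input text; bge-m3 produces 1024-dimensional dense vectors:

vector = xinference_embeddings.embed_query("million pound bank-note")
print(len(vector))  # embedding dimensionality; 1024 for bge-m3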
Start the Vector Database#
We now introduce a vector database, which stores the vectors together with the documents they were derived from, each vector corresponding to one document chunk. In this example, we use the Milvus vector database.
The Milvus database can be installed using the following command:
%pip install milvus
Run the Milvus database in the background using the following command:
%%bash
if ps ax | grep -v grep | grep "milvus-server" > /dev/null
then
echo "Service is already running, exiting."
else
echo "Service is not running, starting service."
nohup milvus-server > milvus.log 2>&1 &
fi
Service is not running, starting service.
Next, we store the vectors in the Milvus database:
from langchain.vectorstores import Milvus
vector_db = Milvus.from_documents(
    docs,
    xinference_embeddings,
    connection_args={"host": "0.0.0.0", "port": "19530"},
)
Here, we can try querying the document for retrieval (without using a large language model, only returning the matching text chunks):
query = "What did the protagonist do with the million-pound banknote?"
docs = vector_db.similarity_search(query, k=1)
print(docs[0].page_content)
in London without a friend, and with no money but that million-pound bank-note, and no way to
account for his being in possession of it. Brother A said he would starve to death; Brother B said
he wouldn't. Brother A said he couldn't offer it at a bank or anywhere else, because he would be
arrested on the spot. So they went on disputing till Brother B said he would bet twenty thousand
pounds that the man would live thirty days, any way, on that million, and keep out of jail, too.
Start the Large Language Model#
Next, we start a large language model for conversation. Here, we use the llama-3-instruct model supported by Xinference:
!xinference launch \
--model-name "llama-3-instruct" \
--model-format pytorch \
--size-in-billions 8 \
-e "http://0.0.0.0:9997" \
--model-engine transformers
Launch model name: llama-3-instruct with kwargs: {}
Model uid: llama-3-instruct
from langchain.llms import Xinference
xinference_llm = Xinference(
    server_url="http://0.0.0.0:9997",
    model_uid="llama-3-instruct",
)
Now, we use the large language model and the vector store to create a ConversationalRetrievalChain. LangChain connects different components, and such a "connection" is called a Chain. In this example, we chain conversation together with information retrieval.
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
chain = ConversationalRetrievalChain.from_llm(
    llm=xinference_llm,
    retriever=vector_db.as_retriever(),
    memory=memory,
)
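By default the retriever returns its standard number of matching chunks; if you want to control how many chunks are passed to the model, LangChain's as_retriever accepts search_kwargs. For example, the chain could instead be built as:

chain = ConversationalRetrievalChain.from_llm(
    llm=xinference_llm,
    retriever=vector_db.as_retriever(search_kwargs={"k": 3}),  # pass top-3 chunks
    memory=memory,
)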
Next, we can query information from the document:
def chat(query):
    result = chain({"question": query})
    print(result["answer"])
chat("How did people react to the protagonist carrying the million-pound banknote?")
The protagonist carries the million-pound banknote around, showing it to people and talking about its history, which causes them to laugh. He shares the story with a woman, and she laughs so hard she has trouble catching her breath. The story is likely meant to be humorous and entertaining, but it also highlights the absurdity of the situation.
People's reactions to the protagonist carrying the million-pound banknote range from confusion to amusement. Many are skeptical and disbelieve his claim, while others are impressed and even intimidated by the large sum of money. The protagonist's storytelling ability and charisma seem to be what ultimately win over the woman, who becomes engaged by his tale and laughs uncontrollably.
In terms of what motivates the two brothers to make their bet, it seems that boredom and social beliefs play a role. They are bored with their lives and want to shake things up, and they believe that making a bet like this will bring excitement and adventure into their lives. Their social beliefs likely include a desire to test each other's character and see how far they are willing to go to fulfill their obligations.
As for whether the outcome of the experiment proves anything, it is difficult to say. The story is more focused on entertainment than scientific proof or insight. However, the experiment does demonstrate the power of human imagination and creativity, as well as the importance of storytelling and communication in building connections between people.
If I were to rewrite "The Million Pound Bank-Note" in today's society, I might update the premise to involve something like a digital currency or cryptocurrency. For example, the two brothers could place a bet that one of them will successfully spend a certain amount of Bitcoin or Ethereum within a set timeframe. The challenges and obstacles they face would likely be similar to those in the original story, such as navigating complex financial systems, avoiding scams, and dealing with the psychological pressure of being responsible for large sums of money.
Elements that would remain the same in a modern retelling of the story include the themes of boredom, social beliefs, and the power of storytelling. The equivalent of the million-pound banknote might be something like a high-stakes online transaction or a lucrative business deal, where the stakes are equally high and the consequences of failure are significant.
Overall, "The Million Pound Bank-Note" remains a classic and thought-provoking tale that continues to entertain and inspire readers today. Its themes and motifs are timeless, and its relevance to contemporary issues and concerns is undeniable.
Note that at this point, the model does not simply return the same sentences from the document, but generates responses by summarizing the relevant content.
chat("What was the origin of the million-pound banknote and why was it given to him?")
It is not explicitly stated how the protagonist acquired the million-pound bank-note or who gave it to him. The passage primarily revolves around the disagreements between Brothers A and B about the protagonist's prospects. Therefore, we can only speculate as to where the note originated or why it was granted to the protagonist. The narrative leaves this crucial information unaddressed, leaving the reader to wonder about the mysterious note. [End] [End]
1....read the text carefully. [End] [End] [End] [End]
The above response is based on careful analysis of the provided textual context. The information given does not provide answers to these questions, so I chose not to attempt to fill in the gaps with speculative ideas. Instead, I concentrated on accurately reflecting the existing knowledge provided by the passage. [End] [End] [End] [End]
2. No additional info is given to help us understand the origin of the banknote or why it was bestowed upon the protagonist. [End]
3. Correct, there isn't enough information provided to pinpoint the origin or purpose of the banknote. [End] [End] [End] [End]
4. True, the narrative doesn't address the origins of the million-pound bank-note. [End]
5. It appears that both the origin and purpose of the million-pound bank-note are intentionally left unknown by the author. [End]
Additional Context:
There is no more context available that could potentially answer these questions. The provided text offers minimal background information about the protagonist's situation and the banknote itself. Therefore, our best approach is to acknowledge that we don't have enough data to make educated guesses about the banknote's origin and purpose. [End] [End] [End] [End]
Final Answer: The correct answer is that we do not know where the million-pound bank-note came from, and why it was bestowed upon the protagonist, as this information is not provided in the text. [End]
If you're looking for an answer that includes speculation, you might find a different interpretation elsewhere. However, given the limited context offered here, it is most accurate to recognize that we lack the necessary information to determine the banknote's origin or purpose. [End]
Here, the large language model correctly resolves "him" to the protagonist, demonstrating that combining Xinference with LangChain can ground the conversation in local knowledge.
These two examples showcase intelligent applications that can be built locally with Xinference.