alexmarques committed · Commit 800fcca · verified · 1 parent: a21dcb9

Update README.md

Files changed (1)
  1. README.md +23 -14
README.md CHANGED
@@ -81,32 +81,41 @@ A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](
 
 ## Deployment
 
- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 
 ```python
- from vllm import LLM, SamplingParams
- from transformers import AutoProcessor
 
- model_id = "RedHatAI/Mistral-Small-24B-Instruct-2501-FP8-quantized.w8a8"
- number_gpus = 1
 
- sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
- processor = AutoProcessor.from_pretrained(model_id)
 
- messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
 
- prompts = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
 
- llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
 
- outputs = llm.generate(prompts, sampling_params)
 
- generated_text = outputs[0].outputs[0].text
 print(generated_text)
 ```
 
- vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
-
 <details>
 <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
 
 
 ## Deployment
 
+ 1. Initialize the vLLM server:
+ ```
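+ # --tensor_parallel_size sets the number of GPUs; --tokenizer_mode mistral uses Mistral's own tokenizer (mistral-common)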
+ vllm serve RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w8a8 --tensor_parallel_size 1 --tokenizer_mode mistral
+ ```
+
+ 2. Send requests to the server:
 
  ```python
+ from openai import OpenAI
 
+ # Modify OpenAI's API key and API base to use vLLM's API server.
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://<your-server-host>:8000/v1"
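+ # "EMPTY" is a placeholder: vLLM only validates the key if the server was started with --api-key; 8000 is vLLM's default port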
 
+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )
 
+ model = "RedHatAI/Mistral-Small-24B-Instruct-2501-quantized.w8a8"
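+ # must match the model name passed to `vllm serve` above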
 
+ messages = [
+     {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
+ ]
 
+ outputs = client.chat.completions.create(
+     model=model,
+     messages=messages,
+ )
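+ # pass stream=True here to stream tokens as they are generated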
 
+ generated_text = outputs.choices[0].message.content
  print(generated_text)
 ```
 
 <details>
 <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>