Checking out a free open-source ChatGPT alternative: Zephyr-7b-Beta

Nov 12, 2023 selfhosting

opensource AI selfhosted

In a previous article we reviewed the free LLM Llama 2, the 70b version. That model is prohibitively large to run on a local machine, as it requires around 48GB of GPU VRAM. I tried the 13b version instead, but it was still too large for my GPU, which has only 8GB of VRAM. So I ran it on my CPU, and it was painfully slow. Here we will look at a viable alternative: Zephyr-7b-Beta, the latest sensation in the world of natural language processing!
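As a rough sanity check on these sizes, the memory needed just to hold a model's weights is about (number of parameters) x (bits per weight) / 8. The sketch below is only a back-of-the-envelope estimate: the bits-per-weight values are my assumptions for typical quantization levels, and it ignores the extra memory needed for the context cache and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


# Assumed bits per weight: 16 for unquantized fp16, about 4.5 to 5.5 for
# common 4-bit and 5-bit quantizations. These are illustrative values only.
examples = [
    ("70b at ~4.5 bits", 70, 4.5),
    ("13b at 16 bits", 13, 16),
    ("7b at ~5.5 bits", 7, 5.5),
]

for label, params_billion, bits in examples:
    print(f"{label}: ~{weight_memory_gb(params_billion, bits):.1f} GB")
```

Under those assumptions the weights alone come to roughly 39 GB, 26 GB and 5 GB, which is why a quantized 7b model is the first of the three that plausibly fits on an 8GB card.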

Zephyr-7b-Beta is a large language model that has been capturing attention for how capable it is despite being relatively small. Developed by the research wing of HuggingFace, it challenges the notion that larger models are always better. Although Zephyr-7b-Beta has just over 7 billion parameters, on some chat benchmarks it outperforms much larger models such as the 70-billion-parameter Llama 2 chat model. This performance has been attributed to its training method rather than to a new architecture: it is a fine-tune of Mistral-7B. The result is a balance between accuracy and efficiency that makes it a good candidate for real-world applications where resources are constrained, as in the case of hosting on a home PC.

It gives an impressive interactive experience, and I found that it felt closer to GPT-4 than the Llama 2 model did.

I first sought to install it locally, as it is reported to run comfortably on 8GB GPUs. I installed the prerequisite library ROCm (for driving the AMD GPU), then Text generation web UI, and downloaded the model from the HuggingFace site. Some of the troubleshooting steps I took can be seen in this thread. But again it was too slow to be usable. Maybe I’m doing something wrong¹.

Then I resorted to hosting on a cloud GPU provider. I made an account at vast.ai (these folks being the cheapest) and created an RTX A6000 (48GB VRAM) instance with 8GB of disk space. I downloaded the same model file (the Q5_K_M variant) and then made the following parameter choices in Text generation web UI (not necessarily optimal, but something to play around with and test):

Set Generation/Preset to debug-deterministic
Increase max new tokens to 2000
Set Instruction template to ChatML

Also remember to set n-gpu-layers to more than 35 when loading this GGUF model, so that all of its layers are offloaded to the GPU.
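For anyone who prefers a script to the web UI, roughly the same choices can be written down with the llama-cpp-python package. To be clear, this is not what I used here; it is a minimal sketch under assumptions. The model path and the n_ctx value are hypothetical, and temperature 0 with top_k 1 is only my approximation of the debug-deterministic preset.

```python
# Hypothetical path: point this at wherever the Q5_K_M .gguf file was saved.
MODEL_PATH = "models/zephyr-7b-beta.Q5_K_M.gguf"

LOAD_SETTINGS = {
    "n_gpu_layers": 36,  # "over 35", so every layer is offloaded to the GPU
    "n_ctx": 4096,       # assumption: context window size, not stated in this post
}

GENERATION_SETTINGS = {
    "max_tokens": 2000,  # corresponds to "max new tokens" in the web UI
    "temperature": 0.0,  # assumption: stand-in for the debug-deterministic preset
    "top_k": 1,          # only ever consider the single most likely token
}


def generate(prompt: str) -> str:
    """Load the model and return the completion text for one prompt."""
    # Imported lazily so the settings above can be inspected without the
    # llama-cpp-python package (or a GPU) being available.
    from llama_cpp import Llama

    llm = Llama(model_path=MODEL_PATH, **LOAD_SETTINGS)
    result = llm(prompt, **GENERATION_SETTINGS)
    return result["choices"][0]["text"]
```

Reloading the model on every call is wasteful; a real script would create the Llama object once and reuse it.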

There are plenty of YouTube videos on how to use Text generation web UI.

I used the following prompt (copied from the model page and tweaked):

<|system|>
You are a creative writing assistant
<|user|>
<your prompt here>
<|assistant|>
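If you want to fill in this template from a script rather than by hand, a small helper along these lines should do. It is a sketch that mirrors my tweaked template exactly as shown above; the function name is made up for illustration. If the template on the model page closes each turn with an end-of-sequence marker, you may want to add that back.

```python
def build_prompt(user_prompt: str,
                 system_prompt: str = "You are a creative writing assistant") -> str:
    """Fill the tweaked template with a system message and a user message."""
    return (
        "<|system|>\n"
        f"{system_prompt}\n"
        "<|user|>\n"
        f"{user_prompt}\n"
        "<|assistant|>\n"  # the model continues writing from here
    )


# Example: the text that would be pasted into the Default/Notebook tab.
print(build_prompt("Write a short poem about self-hosting."))
```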

I have to say, the results were snappy. And like I said earlier, the responses were impressive.

So far I have been using the Default/Notebook input method of Text generation web UI. Next I’m looking to study the chat prompt and see if it gives a more fluid interactive experience.

This has been my log of trying out a free new ChatGPT alternative. I still don’t know why it didn’t run well on my local system, since a model of this size should be well suited to it. Let me know what your go-to model is these days!

¹ I’ve finally managed to run it. See this post for updated info.