Showing posts with label mistral. Show all posts

Monday, April 22, 2024

Hugging Face - Part6 - Repo model xyz is gated and you must be authenticated to access it

Today, while working locally on my machine with mistralai/Mistral-7B-Instruct-v0.2 from Hugging Face, I encountered the following issue:

 


Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/resolve/main/config.json.
Repo model mistralai/Mistral-7B-Instruct-v0.2 is gated. You must be authenticated to access it.

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2.
403 Client Error. (Request ID: Root=1-66266e88-14951c696b21d7515a1dd516;df373d0d-261c-41ec-9142-bca579e082fc)

Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/resolve/main/config.json.
Access to model mistralai/Mistral-7B-Instruct-v0.2 is restricted and you are not in the authorized list. Visit https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 to ask for access.

A quick search showed that some Hugging Face repositories are gated, and you need an access token to download models from them to your local machine.


The following discussion threads cover the issue:

https://huggingface.co/google/gemma-7b/discussions/31

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/discussions/93

So, if you want to use this code to download and interact with a Mistral model on your local system, you need a Hugging Face access token and a few minor changes, as shown below:

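import os

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"

# Read the Hugging Face access token from the environment (raises KeyError if HFREADACCESS is unset).
access_token = os.environ["HFREADACCESS"]

# Pass the token so that files from the gated repo can be downloaded.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=access_token)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, token=access_token)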


Note: You can pass the access token to the script as an environment variable. If the script runs as a pod on Kubernetes, consider creating a Secret that holds the access token and injecting it into the container as an environment variable using secretKeyRef.
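
To make the secretKeyRef idea from the note a bit more concrete, here is a minimal sketch in Python that builds the env entry you would place under a container's env list in the pod spec. The helper name secret_env_var and the Secret name "hf-token" with key "token" are assumptions for illustration only; they are not taken from my deployment spec.

```python
import json


def secret_env_var(env_name: str, secret_name: str, secret_key: str) -> dict:
    """Build one entry for a container's `env` list that reads its value from a Secret."""
    return {
        "name": env_name,
        "valueFrom": {"secretKeyRef": {"name": secret_name, "key": secret_key}},
    }


if __name__ == "__main__":
    # "hf-token" and "token" are hypothetical Secret and key names.
    entry = secret_env_var("HFREADACCESS", "hf-token", "token")
    print(json.dumps(entry, indent=2))
```

With an entry like this in the container spec, os.environ["HFREADACCESS"] in the script resolves to the value stored in the Secret, so the token never has to be written into the image or the manifest.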

Next, log in to Hugging Face, open the model card of the model you want to download, and select "Agree and access repository". Once that is done, the Python script should be able to download the model locally and interact with it.

Hope it was useful. Cheers!

Saturday, April 20, 2024

Hugging Face - Part5 - Deploy your LLM app on Kubernetes

In our previous blog post, we explored the process of containerizing the Large Language Model (LLM) from Hugging Face using FastAPI and Docker. The next step is deploying this containerized application on a Kubernetes cluster. Additionally, I'll share my observations and insights gathered during this exercise. 


You can access the deployment yaml spec and detailed instructions in my GitHub repo: 

https://github.com/vineethac/huggingface/tree/main/6-deploy-on-k8s

Requirements

  • I am using a Tanzu Kubernetes Cluster (TKC).
  • Each node uses the best-effort-2xlarge VM class, which has 8 vCPUs and 64Gi of memory.

❯ KUBECONFIG=gckubeconfig k get node
NAME                                             STATUS   ROLES                  AGE    VERSION
tkc01-control-plane-49jx4                        Ready    control-plane,master   97d    v1.23.8+vmware.3
tkc01-control-plane-m8wmt                        Ready    control-plane,master   105d   v1.23.8+vmware.3
tkc01-control-plane-z6gxx                        Ready    control-plane,master   97d    v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-8gjn8   Ready    <none>                 21d    v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-c9nfq   Ready    <none>                 21d    v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-cngff   Ready    <none>                 21d    v1.23.8+vmware.3
❯

  • I've attached 256Gi storage volumes to the worker nodes, mounted at /var/lib/containerd. The worker nodes that run these LLM pods need enough storage space; otherwise you may see the pods getting stuck, restarting, or going into an Unknown status. If a worker node runs out of disk space, pods are evicted with the warning "The node was low on resource: ephemeral-storage". The TKC spec is available in the Git repo mentioned above.

Deployment

  • This works on a CPU-powered Kubernetes cluster. Additional configuration might be required if you want to run it on a GPU-powered cluster.
  • The readiness and liveness checks are already built into the LLM app itself.
  • The readiness probe invokes the /healthz endpoint exposed by the FastAPI app. This makes sure that FastAPI itself is healthy and responding to API calls.
  • The liveness probe invokes the liveness.py script within the app. The script calls the /ask endpoint, which interacts with the LLM and returns the response. This makes sure the LLM is responding to user queries. If the LLM stops responding or hangs, the liveness probe fails and Kubernetes eventually restarts the container.
  • You can apply the deployment yaml spec as follows:
❯ KUBECONFIG=gckubeconfig k apply -f fastapi-llm-app-deploy-cpu.yaml
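
To illustrate the kind of check the liveness probe performs, here is a minimal sketch of a script that posts a small prompt to /ask and reports whether a non-empty answer came back. This is not the liveness.py from my repo; the function names, the prompt text, and the 60 second timeout are assumptions. The URL, the "text" request field, and the "response" reply field follow the curl examples in this series.

```python
import json
import urllib.request

ASK_URL = "http://localhost:5000/ask"  # port and path as used in the curl examples


def llm_is_alive(url: str = ASK_URL, timeout: float = 60.0, opener=urllib.request.urlopen) -> bool:
    """Return True if POSTing a tiny prompt yields HTTP 200 and a non-empty "response" field."""
    body = json.dumps({"text": "Reply with the single word: ok"}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    try:
        with opener(request, timeout=timeout) as reply:
            if reply.status != 200:
                return False
            payload = json.loads(reply.read().decode("utf-8"))
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; malformed JSON raises ValueError.
        return False
    return isinstance(payload, dict) and bool(str(payload.get("response", "")).strip())


def main() -> int:
    """Map the check to a process exit code: 0 means healthy, 1 means unhealthy."""
    return 0 if llm_is_alive() else 1
```

An exec-style probe treats exit code 0 as success and anything else as failure, so a real probe script would end with raise SystemExit(main()). The opener parameter exists only so the check can be exercised without a running server.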

Validation


❯ KUBECONFIG=gckubeconfig k get deploy fastapi-llm-app
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
fastapi-llm-app   2/2     2            2           21d
❯
❯ KUBECONFIG=gckubeconfig k get pods | grep fastapi-llm-app
fastapi-llm-app-758c7c58f7-79gmq                               1/1     Running   1 (71m ago)    13d
fastapi-llm-app-758c7c58f7-gqdc6                               1/1     Running   1 (99m ago)    13d
❯
❯ KUBECONFIG=gckubeconfig k get svc fastapi-llm-app
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
fastapi-llm-app   LoadBalancer   10.110.228.33   10.216.24.104   5000:30590/TCP   5h24m
❯

Now you can curl the EXTERNAL-IP of the fastapi-llm-app service shown above.

❯ curl http://10.216.24.104:5000/ask -X POST -H "Content-Type: application/json" -d '{"text":"list comprehension examples in python"}'

In the next blog post, we'll add instrumentation to our FastAPI application. Specifically, we'll integrate FastAPI metrics to gain insight into the application's performance and usage, and we'll add traces using OpenTelemetry, a framework for distributed tracing and observability. With OpenTelemetry, we can see how the application behaves across distributed systems, identify performance bottlenecks, and optimize resource utilization.

Stay tuned for an insightful exploration of FastAPI metrics instrumentation and OpenTelemetry integration in our upcoming blog post!

Hope it was useful. Cheers!

Saturday, March 30, 2024

Hugging Face - Part4 - Containerize your LLM app using Python, FastAPI, and Docker

In this exercise, our objective is to integrate an API endpoint for the Large Language Model (LLM) provided by Hugging Face using FastAPI. Additionally, we aim to encapsulate this whole application within a Docker container for portability and ease of deployment.

To achieve this, our project consists of several key components:

  • Large Language Model: Our application logic resides in model.py, where the model_pipeline function serves as the core engine behind our LLM interaction using LangChain. We've chosen the Mistral Instruct model from Hugging Face for this exercise.

  • API Endpoint Integration: We'll be incorporating an API endpoint using FastAPI to seamlessly interact with the LLM downloaded from Hugging Face. The main.py file implements the FastAPI framework, defining routes and endpoints. Specifically, the /ask endpoint invokes the model_pipeline function to interact with the Mistral Instruct model and generate a response.

  • Containerization: Utilizing the Dockerfile, we containerize our FastAPI LLM application. This ensures that our application, along with its dependencies, can be easily packaged and deployed across various environments.


You can access the Dockerfile, Python code, and other observations on my GitHub repository:

https://github.com/vineethac/huggingface/tree/main/5-containerize-llm-app

Deploy on Kubernetes as a pod

Deploying directly as a pod is not the preferred approach; it is only for quick testing. In the next blog post we will see how to deploy this as a Kubernetes Deployment resource.

❯ KUBECONFIG=gckubeconfig k run hf-11 --image=vineethac/fastapi-llm-app:latest --image-pull-policy=Always
pod/hf-11 created
❯ KUBECONFIG=gckubeconfig kg po hf-11
NAME    READY   STATUS              RESTARTS   AGE
hf-11   0/1     ContainerCreating   0          2m23s
❯
❯ KUBECONFIG=gckubeconfig kg po hf-11
NAME    READY   STATUS    RESTARTS   AGE
hf-11   1/1     Running   0          26m
❯
❯ KUBECONFIG=gckubeconfig k logs hf-11 -f
Downloading shards: 100%|██████████| 3/3 [02:29<00:00, 49.67s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:03<00:00,  1.05s/it]
INFO:     Will watch for changes in these directories: ['/fastapi-llm-app']
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO:     Started reloader process [7] using WatchFiles
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00,  3.88s/it]
INFO:     Started server process [25]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
2024-03-28 08:19:12 hf-11 watchfiles.main[7] INFO 3 changes detected
2024-03-28 08:19:48 hf-11 root[25] INFO User prompt: select head or tail randomly. strictly respond only in one word. no explanations needed.
2024-03-28 08:19:48 hf-11 root[25] INFO Model: mistralai/Mistral-7B-Instruct-v0.2
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
2024-03-28 08:19:54 hf-11 root[25] INFO LLM response:  Head.
2024-03-28 08:19:54 hf-11 root[25] INFO FastAPI response:  Head.
INFO:     127.0.0.1:53904 - "POST /ask HTTP/1.1" 200 OK
INFO:     127.0.0.1:55264 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:43342 - "GET /healthz HTTP/1.1" 200 OK

For a quick validation, I exec'd into the pod and ran curl against the exposed APIs.

❯ KUBECONFIG=gckubeconfig k exec -it hf-11 -- bash
root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl -d '{"text":"select head or tail randomly. strictly respond only in one word. no explanations needed."}' -H "Content-Type: application/json" -X POST http://localhost:5000/ask
{"response":" Head."}root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl localhost:5000
"Welcome to FastAPI for your local LLM!"root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl localhost:5000/healthz
{"Status":"OK"}root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app#


You can also use the kubectl expose command to create a service for this pod, port-forward to it, and then curl it.

Hope it was useful. Cheers!

Thursday, March 28, 2024

Generative AI and LLMs Blog Series

In this blog series we will explore the fascinating world of Generative AI and Large Language Models (LLMs). We will look at the latest advancements in AI technology, focusing particularly on LLMs, which have revolutionized various fields, including natural language processing and text generation.

Throughout this series, we will discuss LLM serving platforms such as Ollama and Hugging Face, providing insights into their capabilities, features, and applications. I will also guide you through the process of getting started with LLMs, from setting up your development/test environment to deploying these models on Kubernetes clusters. Additionally, we'll demonstrate how to prompt and interact with LLMs using frameworks like LangChain.

Stay tuned for insightful articles, and hands-on guides that will equip you with the knowledge and skills to unlock the transformative capabilities of LLMs. Let's explore the future of AI together!

Image credits: designer.microsoft.com/image-creator


Ollama

Part1 - Deploy Ollama on Kubernetes

Part2 - Prompt LLMs using Ollama, LangChain, and Python

Part3 - Web UI for Ollama to interact with LLMs

Part4 - Vision assistant using LLaVA


Hugging Face

Part1 - Getting started with Hugging Face

Part2 - Code generation with Code Llama Instruct

Part3 - Inference with Code Llama using LangChain

Part4 - Containerize your LLM app using Python, FastAPI, and Docker

Part5 - Deploy your LLM app on Kubernetes 

Part6 - LLM app observability <coming soon>


Thursday, February 1, 2024

Ollama - Part4 - Vision assistant using LLaVA

In this exercise we will interact with LLaVA, an end-to-end trained large multimodal model and vision assistant. We will use the Ollama REST API to prompt the model using Python.
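
As a rough idea of what such a call looks like, here is a minimal sketch that sends an image and a prompt to Ollama's /api/generate endpoint using only the Python standard library. It is not the query_image.py script from my repo. The function names are hypothetical, and the URL assumes Ollama is listening on its default local address; adjust it to wherever your Ollama service is exposed.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed: Ollama's default local address


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava") -> dict:
    """Build a non-streaming /api/generate request body with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for a single JSON object instead of a stream of chunks
    }


def query_image(path: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send the image at `path` together with `prompt` and return the parsed JSON reply."""
    with open(path, "rb") as image_file:
        payload = build_payload(image_file.read(), prompt)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

Calling query_image("images/img1.jpg", "describe the picture") against a running Ollama instance with the llava model pulled should return a JSON object whose "response" field holds the generated text.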

The full project is in my GitHub repo:

https://github.com/vineethac/Ollama/tree/main/ollama_vision_assistant


LLaVA, being a large multimodal model and vision assistant, can be utilized for various tasks. Here are a couple of use cases:

  • Image Description Generation

Input: Provide LLaVA with an image.
Use Case: LLaVA can generate descriptive text or captions for the content of the image. This is particularly useful for automating image cataloging or enhancing accessibility for visually impaired users.

  • Question-Answering on Text and Image

Input: Ask LLaVA a question related to a given text or show it an image.
Use Case: LLaVA can comprehend the context and provide relevant answers. For instance, you could ask about details in a picture or seek information from a paragraph, and LLaVA will attempt to answer accordingly.

These are just a few examples, and the versatility of LLaVA allows for exploration across a wide range of multimodal tasks and applications.

Sample interaction with LLaVA model


Image


Image credits: shutterstock

Prompt

python3 query_image.py --path=images/img1.jpg --prompt="describe the picture 
and what are the essentials that one need to carry generally while going these 
kind of places?"

Response
{
    "model": "llava",
    "created_at": "2024-01-23T17:41:27.771729767Z",
    "response": " The image shows a man riding his bicycle on a country road, surrounded by 
    beautiful scenery and mountains. He appears to be enjoying the ride as he navigates 
    through the countryside. \n\nWhile cycling in such environments, an essential item one 
    would need to carry is a water bottle or hydration pack, to ensure they stay well-hydrated 
    during the journey. In addition, it's important to have a map or GPS device to navigate 
    through potentially less familiar routes and avoid getting lost. Other useful items for 
    cyclists may include a multi-tool, first aid kit, bike lock, snacks, spare clothes, 
    and a small portable camping stove if planning an overnight stay in the wilderness.",

Hope it was useful. Cheers!

Friday, January 26, 2024

Ollama - Part3 - Web UI for Ollama to interact with LLMs

In the previous blog posts, we covered the deployment of Ollama on a Kubernetes cluster and demonstrated how to prompt Large Language Models (LLMs) using LangChain and Python. Now we will deploy a web user interface (UI) for Ollama on a Kubernetes cluster. This provides a ChatGPT-like experience when engaging with the LLMs.

The full project is in my GitHub repo:

https://github.com/vineethac/Ollama/tree/main/ollama_webui


The GitHub repository referenced above details all the steps required to deploy the Ollama web UI. The following diagram outlines the components and services that interact with each other as part of this system:


For detailed information on deploying Prometheus, Grafana, and Loki on a Kubernetes cluster, please refer to this blog post.

A sample interaction with the mistral model using the web UI is given below.


Hope it was useful. Cheers!

Monday, January 15, 2024

Ollama - Part1 - Deploy Ollama on Kubernetes

Docker published the GenAI Stack around October 2023. It consists of large language models (LLMs) from Ollama, vector and graph databases from Neo4j, and the LangChain framework. These tools give developers the resources they need to start building new applications using generative AI. Ollama can be used to deploy and run LLMs locally. In this exercise we will deploy Ollama to a Kubernetes cluster and prompt it.

In my case I am using a Tanzu Kubernetes Cluster (TKC) running on the vSphere with Tanzu 7u3 platform, powered by Dell PowerEdge R640 servers. The TKC nodes use the best-effort-2xlarge VM class with 8 vCPUs and 64Gi of memory. Note that I am running it on a regular Kubernetes cluster without a GPU. If you have a GPU, additional configuration steps might be required.



Hope it was useful. Cheers!