Monday, April 22, 2024

Hugging Face - Part6 - Repo model xyz is gated and you must be authenticated to access it

Today, while working locally on my machine with mistralai/Mistral-7B-Instruct-v0.2 from Hugging Face, I encountered the following issue:

 


Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/resolve/main/config.json.
Repo model mistralai/Mistral-7B-Instruct-v0.2 is gated. You must be authenticated to access it.

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2.
403 Client Error. (Request ID: Root=1-66266e88-14951c696b21d7515a1dd516;df373d0d-261c-41ec-9142-bca579e082fc)

Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/resolve/main/config.json.
Access to model mistralai/Mistral-7B-Instruct-v0.2 is restricted and you are not in the authorized list. Visit https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 to ask for access.

A quick search showed that some Hugging Face repositories are gated: to download models from them locally, you have to request access on the model card and authenticate with an access token.


The following discussion threads cover this:

https://huggingface.co/google/gemma-7b/discussions/31

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/discussions/93

So, if you want to use this code to download a Mistral model and work with it on your local system, you need a Hugging Face access token and the minor changes shown below:

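import os

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"

# Read the Hugging Face access token from an environment variable.
access_token = os.environ["HFREADACCESS"]

# Pass the token so the files can be fetched from the gated repo.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=access_token)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, token=access_token)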
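
One rough edge: os.environ["HFREADACCESS"] raises a bare KeyError when the variable is not set. Here is a minimal sketch of a friendlier lookup, assuming the same HFREADACCESS variable name. The helper name read_hf_token is my own invention and not part of any Hugging Face library.

```python
import os


def read_hf_token(var_name="HFREADACCESS", environ=None):
    """Return the access token stored in an environment variable.

    read_hf_token is a hypothetical helper, not a Hugging Face API.
    `environ` defaults to os.environ and exists only to ease testing.
    """
    env = os.environ if environ is None else environ
    # Strip whitespace, since a trailing newline can sneak in from a file.
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {var_name} to a Hugging Face access token with read "
            "permission before running this script."
        )
    return token
```

With this in place, access_token = read_hf_token() could replace the direct os.environ lookup, and the rest of the script stays the same.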


Note: You can pass the access token to the script as an environment variable. If the script runs in a Kubernetes pod, consider storing the access token in a secret and injecting it into the container as an environment variable using secretKeyRef.
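
For the Kubernetes case, here is a rough sketch of the container env entry that pulls the token from a secret through secretKeyRef. I build it as a Python dict and print it as JSON, which kubectl accepts alongside YAML. The secret name hf-token and the key token are placeholders I made up; only the HFREADACCESS variable name comes from the script above.

```python
import json


def token_env_entry(secret_name="hf-token", secret_key="token"):
    """Build one entry for a container's `env` list.

    The defaults are made-up placeholders; use the name and key of the
    secret that actually holds your access token.
    """
    return {
        "name": "HFREADACCESS",
        "valueFrom": {
            "secretKeyRef": {"name": secret_name, "key": secret_key}
        },
    }


# This fragment belongs under spec.containers[].env in the pod template.
print(json.dumps({"env": [token_env_entry()]}, indent=2))
```

A secret with those placeholder names could be created with kubectl create secret generic hf-token --from-literal=token=<your token>.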

Next, log in to Hugging Face, open the model card of the model you want to download, and select "Agree and access repository". Once that is done, running the Python script should download the model locally and let you interact with it.

Hope it was useful. Cheers!

Saturday, April 20, 2024

Hugging Face - Part5 - Deploy your LLM app on Kubernetes

In our previous blog post, we explored the process of containerizing the Large Language Model (LLM) from Hugging Face using FastAPI and Docker. The next step is deploying this containerized application on a Kubernetes cluster. Additionally, I'll share my observations and insights gathered during this exercise. 


You can access the deployment yaml spec and detailed instructions in my GitHub repo: 

https://github.com/vineethac/huggingface/tree/main/6-deploy-on-k8s

Requirements

  • I am using a Tanzu Kubernetes Cluster (TKC).
  • Each node is of size best-effort-2xlarge which has 8 vCPU and 64Gi of memory.

❯ KUBECONFIG=gckubeconfig k get node
NAME                                             STATUS   ROLES                  AGE    VERSION
tkc01-control-plane-49jx4                        Ready    control-plane,master   97d    v1.23.8+vmware.3
tkc01-control-plane-m8wmt                        Ready    control-plane,master   105d   v1.23.8+vmware.3
tkc01-control-plane-z6gxx                        Ready    control-plane,master   97d    v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-8gjn8   Ready    <none>                 21d    v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-c9nfq   Ready    <none>                 21d    v1.23.8+vmware.3
tkc01-worker-nodepool-a1-pqq7j-dc6957d97-cngff   Ready    <none>                 21d    v1.23.8+vmware.3
❯

  • I've attached 256Gi storage volumes to the worker nodes, mounted at /var/lib/containerd. The worker nodes on which these LLM pods run need enough storage space; otherwise you may see the pods getting stuck, restarting, or going into Unknown status. If a worker node runs out of disk space, pods get evicted with the warning "The node was low on resource: ephemeral-storage". The TKC spec is available in the GitHub repo mentioned above.
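
To spot a nearly full disk before pods start getting evicted, a quick check like the one below can be run on a worker node. It is only a sketch: the 20 GiB threshold is an arbitrary number I picked, and the fallback to / is there so the snippet also runs on machines without containerd.

```python
import os
import shutil

GIB = 1024 ** 3


def free_gib(path):
    """Return the free space, in GiB, of the filesystem holding `path`."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(path, min_free_gib=20):
    """True if at least `min_free_gib` GiB are free (arbitrary threshold)."""
    return free_gib(path) >= min_free_gib


if __name__ == "__main__":
    # /var/lib/containerd is where the extra volume is mounted in this post.
    path = "/var/lib/containerd"
    if not os.path.exists(path):
        path = "/"
    print(f"{path}: {free_gib(path):.1f} GiB free, ok={has_headroom(path)}")
```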

Deployment

  • This works on a CPU-powered Kubernetes cluster. Additional configuration might be required if you want to run it on a GPU-powered cluster.
  • The readiness and liveness checks are already built into the LLM app itself.
  • The readiness probe invokes the /healthz endpoint exposed by the FastAPI app. This makes sure that FastAPI itself is healthy and responding to API calls.
  • The liveness probe invokes the liveness.py script within the app. The script calls the /ask endpoint, which interacts with the LLM and returns the response. This makes sure the LLM is responding to user queries. If the LLM hangs or stops responding for any reason, the liveness probe fails and Kubernetes eventually restarts the container.
  • You can apply the deployment yaml spec as follows:
❯ KUBECONFIG=gckubeconfig k apply -f fastapi-llm-app-deploy-cpu.yaml
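
The actual liveness.py lives in the GitHub repo linked above, and I am not reproducing it here. As a rough illustration of the idea, a probe script could POST a small prompt to /ask and report failure when no answer arrives in time. The URL, port, prompt, and timeout below are assumptions for this sketch, and the JSON body shape mirrors the curl example later in this post.

```python
import json
import urllib.error
import urllib.request


def llm_is_alive(url, timeout=30.0):
    """Return True if a POST to the given /ask URL answers with HTTP 200."""
    body = json.dumps({"text": "ping"}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def probe_exit_code(url="http://127.0.0.1:5000/ask", timeout=30.0):
    """Map the check to a process exit code: 0 is healthy, 1 is not.

    The default URL is an assumed address of the app inside its container.
    """
    return 0 if llm_is_alive(url, timeout=timeout) else 1
```

A script used by an exec liveness probe would end with sys.exit(probe_exit_code()), since a non-zero exit code is what marks the probe as failed.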

Validation


❯ KUBECONFIG=gckubeconfig k get deploy fastapi-llm-app
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
fastapi-llm-app   2/2     2            2           21d
❯
❯ KUBECONFIG=gckubeconfig k get pods | grep fastapi-llm-app
fastapi-llm-app-758c7c58f7-79gmq                               1/1     Running   1 (71m ago)    13d
fastapi-llm-app-758c7c58f7-gqdc6                               1/1     Running   1 (99m ago)    13d
❯
❯ KUBECONFIG=gckubeconfig k get svc fastapi-llm-app
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
fastapi-llm-app   LoadBalancer   10.110.228.33   10.216.24.104   5000:30590/TCP   5h24m
❯

Now you can run curl against the EXTERNAL-IP of the fastapi-llm-app service shown above.

❯ curl http://10.216.24.104:5000/ask -X POST -H "Content-Type: application/json" -d '{"text":"list comprehension examples in python"}'

In our next blog post, we'll try enhancing our FastAPI application with instrumentation. Specifically, we'll integrate metrics into the application to gain insight into its performance and usage. We'll also look at adding traces with OpenTelemetry, a tool for distributed tracing and observability, so that we can follow the application's behavior across distributed systems, identify performance bottlenecks, and optimize resource utilization.

Stay tuned for an insightful exploration of FastAPI metrics instrumentation and OpenTelemetry integration in our upcoming blog post!

Hope it was useful. Cheers!