Saturday, March 30, 2024

Hugging Face - Part4 - Containerize your LLM app using Python, FastAPI, and Docker

In this exercise, we use FastAPI to put an API endpoint in front of a Large Language Model (LLM) from Hugging Face, and then package the whole application in a Docker container for portability and ease of deployment.

To achieve this, our project consists of several key components:

  • Large Language Model: Our application logic resides in model.py, where the model_pipeline function serves as the core engine behind our LLM interaction using LangChain. We've chosen the Mistral Instruct model from Hugging Face for this exercise.

  • API Endpoint Integration: We use FastAPI to expose the LLM downloaded from Hugging Face over HTTP. The main.py file creates the FastAPI application and defines its routes. Specifically, the /ask endpoint invokes the model_pipeline function to query the Mistral Instruct model and returns the generated response.

  • Containerization: Utilizing the Dockerfile, we containerize our FastAPI LLM application. This ensures that our application, along with its dependencies, can be easily packaged and deployed across various environments.


You can access the Dockerfile, Python code, and other observations on my GitHub repository:

https://github.com/vineethac/huggingface/tree/main/5-containerize-llm-app

Deploy on Kubernetes as a pod

Deploying directly as a pod is not the preferred approach; it is used here only for quick testing. In the next blog post, we will see how to deploy this as a Kubernetes Deployment resource.

❯ KUBECONFIG=gckubeconfig k run hf-11 --image=vineethac/fastapi-llm-app:latest --image-pull-policy=Always
pod/hf-11 created
❯ KUBECONFIG=gckubeconfig kg po hf-11
NAME    READY   STATUS              RESTARTS   AGE
hf-11   0/1     ContainerCreating   0          2m23s
❯
❯ KUBECONFIG=gckubeconfig kg po hf-11
NAME    READY   STATUS    RESTARTS   AGE
hf-11   1/1     Running   0          26m
❯
❯ KUBECONFIG=gckubeconfig k logs hf-11 -f
Downloading shards: 100%|██████████| 3/3 [02:29<00:00, 49.67s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:03<00:00,  1.05s/it]
INFO:     Will watch for changes in these directories: ['/fastapi-llm-app']
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO:     Started reloader process [7] using WatchFiles
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00,  3.88s/it]
INFO:     Started server process [25]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
2024-03-28 08:19:12 hf-11 watchfiles.main[7] INFO 3 changes detected
2024-03-28 08:19:48 hf-11 root[25] INFO User prompt: select head or tail randomly. strictly respond only in one word. no explanations needed.
2024-03-28 08:19:48 hf-11 root[25] INFO Model: mistralai/Mistral-7B-Instruct-v0.2
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
2024-03-28 08:19:54 hf-11 root[25] INFO LLM response:  Head.
2024-03-28 08:19:54 hf-11 root[25] INFO FastAPI response:  Head.
INFO:     127.0.0.1:53904 - "POST /ask HTTP/1.1" 200 OK
INFO:     127.0.0.1:55264 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:43342 - "GET /healthz HTTP/1.1" 200 OK

For quick validation, I exec'd into the pod and ran curl against the exposed endpoints.

❯ KUBECONFIG=gckubeconfig k exec -it hf-11 -- bash
root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl -d '{"text":"select head or tail randomly. strictly respond only in one word. no explanations needed."}' -H "Content-Type: application/json" -X POST http://localhost:5000/ask
{"response":" Head."}root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl localhost:5000
"Welcome to FastAPI for your local LLM!"root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl localhost:5000/healthz
{"Status":"OK"}root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app#
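
The same /ask request can also be sent from Python instead of curl. The sketch below uses only the standard library. The function names build_ask_request and ask, and the 120 second timeout, are my own choices for illustration; the URL, header, and JSON body mirror the curl command above.

```python
import json
import urllib.request

# Port 5000 is taken from the Uvicorn log line above.
BASE_URL = "http://localhost:5000"


def build_ask_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST request that the curl command sends to /ask."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(text: str, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """Send the prompt and return the "response" field of the JSON reply."""
    request = build_ask_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Calling ask("select head or tail randomly. strictly respond only in one word. no explanations needed.") from a shell that can reach the application sends the same prompt as the curl example. The generous timeout is there because text generation on a 7B model can take several seconds or more, depending on the hardware.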


You can also use the kubectl expose command to create a service for this pod, port-forward to that service, and then run curl against it.

Hope it was useful. Cheers!

Thursday, March 28, 2024

Generative AI and LLMs Blog Series

In this blog series, we will explore the fascinating world of Generative AI and Large Language Models (LLMs). We will delve into the latest advancements in AI technology, focusing particularly on LLMs, which have revolutionized various fields, including natural language processing and text generation.

Throughout this series, we will discuss LLM serving platforms such as Ollama and Hugging Face, providing insights into their capabilities, features, and applications. I will also guide you through the process of getting started with LLMs, from setting up your development/test environment to deploying these powerful models on Kubernetes clusters. Additionally, we'll demonstrate how to effectively prompt and interact with LLMs using frameworks like LangChain, empowering you to harness the full potential of these cutting-edge technologies.

Stay tuned for insightful articles and hands-on guides that will equip you with the knowledge and skills to unlock the transformative capabilities of LLMs. Let's explore the future of AI together!

Image credits: designer.microsoft.com/image-creator


Ollama

Part1 - Deploy Ollama on Kubernetes

Part2 - Prompt LLMs using Ollama, LangChain, and Python

Part3 - Web UI for Ollama to interact with LLMs

Part4 - Vision assistant using LLaVA


Hugging Face

Part1 - Getting started with Hugging Face

Part2 - Code generation with Code Llama Instruct

Part3 - Inference with Code Llama using LangChain

Part4 - Containerize your LLM app using Python, FastAPI, and Docker

Part5 - Deploy your LLM app on Kubernetes 

Part6 - LLM app observability <coming soon>