Showing posts with label LangChain. Show all posts

Saturday, March 30, 2024

Hugging Face - Part4 - Containerize your LLM app using Python, FastAPI, and Docker

In this exercise, our objective is to expose a Large Language Model (LLM) from Hugging Face through an API endpoint built with FastAPI. We also package the whole application in a Docker container for portability and ease of deployment.

To achieve this, our project consists of several key components:

  • Large Language Model: Our application logic resides in model.py, where the model_pipeline function serves as the core engine behind our LLM interaction using LangChain. We've chosen the Mistral Instruct model from Hugging Face for this exercise.

  • API Endpoint Integration: We use FastAPI to expose the LLM downloaded from Hugging Face over HTTP. The main.py file creates the FastAPI app and defines its routes. Specifically, the /ask endpoint invokes the model_pipeline function to interact with the Mistral Instruct model and generate a response.

  • Containerization: Utilizing the Dockerfile, we containerize our FastAPI LLM application. This ensures that our application, along with its dependencies, can be easily packaged and deployed across various environments.


You can access the Dockerfile, Python code, and other observations on my GitHub repository:

https://github.com/vineethac/huggingface/tree/main/5-containerize-llm-app

Deploy on Kubernetes as a pod

Deploying directly as a pod is not the preferred approach; it is just for quick testing. In the next blog post we will see how to deploy this as a Kubernetes Deployment resource.

❯ KUBECONFIG=gckubeconfig k run hf-11 --image=vineethac/fastapi-llm-app:latest --image-pull-policy=Always
pod/hf-11 created
❯ KUBECONFIG=gckubeconfig kg po hf-11
NAME    READY   STATUS              RESTARTS   AGE
hf-11   0/1     ContainerCreating   0          2m23s
❯
❯ KUBECONFIG=gckubeconfig kg po hf-11
NAME    READY   STATUS    RESTARTS   AGE
hf-11   1/1     Running   0          26m
❯
❯ KUBECONFIG=gckubeconfig k logs hf-11 -f
Downloading shards: 100%|██████████| 3/3 [02:29<00:00, 49.67s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:03<00:00,  1.05s/it]
INFO:     Will watch for changes in these directories: ['/fastapi-llm-app']
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO:     Started reloader process [7] using WatchFiles
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00,  3.88s/it]
INFO:     Started server process [25]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
2024-03-28 08:19:12 hf-11 watchfiles.main[7] INFO 3 changes detected
2024-03-28 08:19:48 hf-11 root[25] INFO User prompt: select head or tail randomly. strictly respond only in one word. no explanations needed.
2024-03-28 08:19:48 hf-11 root[25] INFO Model: mistralai/Mistral-7B-Instruct-v0.2
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
2024-03-28 08:19:54 hf-11 root[25] INFO LLM response:  Head.
2024-03-28 08:19:54 hf-11 root[25] INFO FastAPI response:  Head.
INFO:     127.0.0.1:53904 - "POST /ask HTTP/1.1" 200 OK
INFO:     127.0.0.1:55264 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:43342 - "GET /healthz HTTP/1.1" 200 OK

For a quick validation, I exec'd into the pod and ran curl against the exposed APIs.

❯ KUBECONFIG=gckubeconfig k exec -it hf-11 -- bash
root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl -d '{"text":"select head or tail randomly. strictly respond only in one word. no explanations needed."}' -H "Content-Type: application/json" -X POST http://localhost:5000/ask
{"response":" Head."}root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl localhost:5000
"Welcome to FastAPI for your local LLM!"root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl localhost:5000/healthz
{"Status":"OK"}root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app#


You can also use the kubectl expose command to create a service for this pod, port-forward to it, and then run the same curl commands from your own machine.
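
The same /ask call can also be made from Python instead of curl. Below is a minimal sketch using only the standard library. It assumes the service has been port-forwarded so that the app is reachable at http://localhost:5000 (the local port is an assumption); build_ask_request and ask are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: pod port 5000 is forwarded to the same port on localhost
BASE_URL = "http://localhost:5000"


def build_ask_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST /ask request that the curl command sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(text: str, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """Send the prompt and return the "response" field of the JSON reply."""
    request = build_ask_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)["response"]
```

Calling ask("select head or tail randomly. strictly respond only in one word.") sends the prompt to the running app; the generous timeout is there because generation took several seconds in the logs above.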

Hope it was useful. Cheers!

Friday, February 23, 2024

Hugging Face - Part3 - Inference with Code Llama using LangChain

In natural language processing (NLP), Hugging Face is a key platform that provides many pre-trained models for different tasks. With Transformers, LangChain, and Python, developers can run Hugging Face models on their own machines. LangChain offers a streamlined way to build prompts and call pre-trained language models. In this blog post we focus on how to run inference locally with the Code Llama - Instruct model from Hugging Face using LangChain.


You can access the Python script in my GitHub repository:
https://github.com/vineethac/huggingface/tree/main/4-codellama_with_langchain


To run inference with Code Llama, start by specifying the model by its identifier, for example MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf". Transformers provides a unified Python interface for initializing the model and its tokenizer from that identifier.

Once the model and tokenizer are set up, developers can use LangChain's HuggingFacePipeline class to create a text generation pipeline, configured with parameters such as max_new_tokens and repetition_penalty, for local inference. Combining this pipeline with LangChain's PromptTemplate lets developers construct prompts and invoke the whole chain to generate responses. The same approach works for other Hugging Face models and NLP tasks in Python applications.
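
The steps above can be sketched as follows. This is one possible arrangement, not the script from the repository: the generation values, the bare "{question}" template, and the build_chain helper are assumptions for illustration, and the LangChain import paths vary between releases, so they may need adjusting for your installed version.

```python
MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf"

# Assumed values; the post names these parameters but not their settings
GENERATION_KWARGS = {"max_new_tokens": 512, "repetition_penalty": 1.1}


def build_chain(model_id: str = MODEL_ID):
    """Return a LangChain runnable that maps {"question": ...} to generated text."""
    # The heavy third-party imports live inside the function so that importing
    # this module does not require the libraries or trigger a model download.
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
    from langchain_core.prompts import PromptTemplate

    # Download (on first use) and load the tokenizer and model weights
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Transformers text-generation pipeline, wrapped so LangChain can call it
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, **GENERATION_KWARGS)
    llm = HuggingFacePipeline(pipeline=pipe)

    # Hypothetical minimal prompt template with a single input variable
    prompt = PromptTemplate.from_template("{question}")
    return prompt | llm
```

With the libraries installed, build_chain().invoke({"question": "write a function that reverses a string"}) would run the whole chain. Expect the first call to download roughly 13 GB of weights, as the transcript below shows.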


Example

root@hf-3:/codellama# python3 codellama_langchain.py
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 749/749 [00:00<00:00, 3.57MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 4.48MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 6.13MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████| 411/411 [00:00<00:00, 1.86MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████| 646/646 [00:00<00:00, 3.40MB/s]
model.safetensors.index.json: 100%|██████████████████████████████████████████████| 25.1k/25.1k [00:00<00:00, 68.2MB/s]
model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████| 9.98G/9.98G [01:50<00:00, 90.0MB/s]
model-00002-of-00002.safetensors: 100%|██████████████████████████████████████████| 3.50G/3.50G [00:39<00:00, 89.5MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████| 2/2 [02:30<00:00, 75.16s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.86s/it]
generation_config.json: 100%|█████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 110kB/s]

Ask codellama: given two unsorted integer lists. merge the two lists, sort the merged list, and find median using python. consider the length of the merged list while finding the median value.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Here is a possible solution to the problem:

def merge_and_find_median(list1, list2):
    # Merge the two lists
    merged_list = list1 + list2

    # Sort the merged list
    merged_list.sort()

    # Find the median value
    if len(merged_list) % 2 == 0:
        # Even number of elements in the merged list
        median = (merged_list[len(merged_list) // 2 - 1] + merged_list[len(merged_list) // 2]) / 2
    else:
        # Odd number of elements in the merged list
        median = merged_list[len(merged_list) // 2]

    return median

Explanation:

* First, we merge the two lists by concatenating them.
* Then, we sort the merged list using the `sort()` method.
* Next, we check whether the length of the merged list is even or odd. If it's even, we take the average of the middle two elements of the list. If it's odd, we simply take the middle element as the median.
* Finally, we return the median value.

Note that this solution assumes that both input lists are sorted in ascending order. If they are not sorted, you may need to add additional code to sort them before merging and finding the median.</s>

Ask codellama: /bye
root@hf-3:/codellama#
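
It is worth checking generated code before trusting it. The quick test below reproduces the function from the model's answer and compares it against Python's statistics.median. It also shows that the model's closing note is inaccurate: the function sorts the merged list itself, so the inputs do not need to be sorted beforehand. The sample lists are arbitrary values chosen for illustration.

```python
import statistics


def merge_and_find_median(list1, list2):
    # Same logic as the model's answer above
    merged_list = list1 + list2
    merged_list.sort()
    if len(merged_list) % 2 == 0:
        median = (merged_list[len(merged_list) // 2 - 1] + merged_list[len(merged_list) // 2]) / 2
    else:
        median = merged_list[len(merged_list) // 2]
    return median


# Unsorted inputs, even merged length: [1, 2, 3, 4] -> (2 + 3) / 2
print(merge_and_find_median([3, 1], [4, 2]))     # 2.5

# Unsorted inputs, odd merged length: [1, 2, 3, 4, 5] -> middle element
print(merge_and_find_median([5, 1, 3], [2, 4]))  # 3

# Cross-check against the standard library
print(merge_and_find_median([3, 1], [4, 2]) == statistics.median([3, 1, 4, 2]))  # True
```

One case the generated function does not handle is two empty lists, which raises an IndexError.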


Hope it was useful. Cheers!

Thursday, January 25, 2024

Ollama - Part2 - Prompt Large Language Models (LLMs) using Ollama, LangChain and Python


In this exercise we will learn to interact with LLMs using Ollama, LangChain, and Python.

The full project is in my GitHub repository:

https://github.com/vineethac/Ollama/tree/main/ollama_langchain


Import the necessary modules from the LangChain library and Python's argparse module

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import Ollama
import argparse

Argument parsing

parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default="llama2")

args = parser.parse_args()
model = args.model
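
To see how the --model flag behaves, the same parser can be exercised with explicit argument lists instead of the real command line. This is a small sketch: parse_model is a hypothetical helper, and "mistral" is only an example value.

```python
import argparse


def parse_model(argv=None) -> str:
    """Return the model name chosen on the command line, defaulting to llama2."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default="llama2")
    return parser.parse_args(argv).model


print(parse_model([]))                      # llama2
print(parse_model(["--model", "mistral"]))  # mistral
```

Running the script without arguments therefore talks to llama2, while passing --model switches to any other model that the Ollama server has available.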

Initialize Ollama

llm = Ollama(
    model=model,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    base_url="http://ollama:11434",
)

Interactive loop

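while True:
    print(f"Model: {model}")
    prompt = input("Ask me anything: ")

    # Typing /bye ends the session
    if prompt == "/bye":
        break

    # The streaming callback prints tokens to stdout as they are generated
    llm(prompt)
    print("\n \n")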


In summary, this script sets up a simple command-line interface for interacting with a language model served by Ollama. It takes user prompts, sends them to the Ollama server at the configured base_url, and streams the model's responses to the terminal through the callback handler. The loop continues until the user enters "/bye" to exit.
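
If you want to exercise the loop logic without a running Ollama server, one option is to pass the model call and the prompt source in as arguments. The sketch below is a variation on the script above, not part of the project: chat_loop and fake_llm are hypothetical names.

```python
from typing import Callable, Iterable, List


def chat_loop(ask: Callable[[str], str], prompts: Iterable[str], exit_command: str = "/bye") -> List[str]:
    """Feed prompts to `ask` until the exit command appears, then return the replies."""
    replies = []
    for prompt in prompts:
        if prompt == exit_command:
            break
        replies.append(ask(prompt))
    return replies


def fake_llm(prompt: str) -> str:
    # Stand-in for the model call so the loop runs without an Ollama server
    return prompt.upper()


print(chat_loop(fake_llm, ["hello", "/bye", "never sent"]))  # ['HELLO']
```

With the real model, the llm object's invoke method could be passed as ask, and the prompts could come from repeated calls to input().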

Hope it was useful. Cheers!