vineethac.blogspot.com: LangChain

Showing posts with label LangChain. Show all posts

Saturday, March 30, 2024

Hugging Face - Part4 - Containerize your LLM app using Python, FastAPI, and Docker

In this exercise, our objective is to integrate an API endpoint for the Large Language Model (LLM) provided by Hugging Face using FastAPI. Additionally, we aim to encapsulate this whole application within a Docker container for portability and ease of deployment.

To achieve this, our project consists of several key components:

Large Language Model: Our application logic resides in model.py, where the model_pipeline function serves as the core engine behind our LLM interaction using LangChain. We've chosen the Mistral Instruct model from Hugging Face for this exercise.
API Endpoint Integration: We'll be incorporating an API endpoint using FastAPI to seamlessly interact with the LLM downloaded from Hugging Face. The main.py file implements the FastAPI framework, defining routes and endpoints. Specifically, the /ask endpoint invokes the model_pipeline function to interact with the Mistral Instruct model and generate a response.
Containerization: Utilizing the Dockerfile, we containerize our FastAPI LLM application. This ensures that our application, along with its dependencies, can be easily packaged and deployed across various environments.

You can access the Dockerfile, Python code, and other observations on my GitHub repository:

https://github.com/vineethac/huggingface/tree/main/5-containerize-llm-app

Deploy on Kubernetes as a pod

Deploying directly as a pod is not a preferred way. This is just for quick testing purpose! In the next blog post we will see how to deploy this as a Kubernetes deployment resource.

❯ KUBECONFIG=gckubeconfig k run hf-11 --image=vineethac/fastapi-llm-app:latest --image-pull-policy=Always
pod/hf-11 created
❯ KUBECONFIG=gckubeconfig kg po hf-11
NAME    READY   STATUS              RESTARTS   AGE
hf-11   0/1     ContainerCreating   0          2m23s
❯
❯ KUBECONFIG=gckubeconfig kg po hf-11
NAME    READY   STATUS    RESTARTS   AGE
hf-11   1/1     Running   0          26m
❯
❯ KUBECONFIG=gckubeconfig k logs hf-11 -f
Downloading shards: 100%|██████████| 3/3 [02:29<00:00, 49.67s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:03<00:00,  1.05s/it]
INFO:     Will watch for changes in these directories: ['/fastapi-llm-app']
INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO:     Started reloader process [7] using WatchFiles
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00,  3.88s/it]
INFO:     Started server process [25]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
2024-03-28 08:19:12 hf-11 watchfiles.main[7] INFO 3 changes detected
2024-03-28 08:19:48 hf-11 root[25] INFO User prompt: select head or tail randomly. strictly respond only in one word. no explanations needed.
2024-03-28 08:19:48 hf-11 root[25] INFO Model: mistralai/Mistral-7B-Instruct-v0.2
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
2024-03-28 08:19:54 hf-11 root[25] INFO LLM response:  Head.
2024-03-28 08:19:54 hf-11 root[25] INFO FastAPI response:  Head.
INFO:     127.0.0.1:53904 - "POST /ask HTTP/1.1" 200 OK
INFO:     127.0.0.1:55264 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:43342 - "GET /healthz HTTP/1.1" 200 OK

For a quick validation, I did exec into the pod and curl against the exposed APIs.

❯ KUBECONFIG=gckubeconfig k exec -it hf-11 -- bash
root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl -d '{"text":"select head or tail randomly. strictly respond only in one word. no explanations needed."}' -H "Content-Type: application/json" -X POST http://localhost:5000/ask
{"response":" Head."}root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl localhost:5000
"Welcome to FastAPI for your local LLM!"root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app# curl localhost:5000/healthz
{"Status":"OK"}root@hf-11:/fastapi-llm-app#
root@hf-11:/fastapi-llm-app#

You can also use kubectl expose command to create a service for this pod and then port forward to it and then curl to it.

Hope it was useful. Cheers!

Friday, February 23, 2024

Hugging Face - Part3 - Inference with Code Llama using LangChain

In the field of understanding and working with human language (NLP), Hugging Face is a key platform that provides many pre-trained models for different tasks. With Transformers, LangChain, and Python developers can easily use Hugging Face's models on their own computers for quick processing. Using LangChain offers a streamlined and user-friendly approach to tapping into the capabilities of pre-trained language models. In this blog post we focus on how to inference with Code Llama - Instruct model from Hugging Face locally using LangChain.

You can access the Python script in my GitHub repository:
https://github.com/vineethac/huggingface/tree/main/4-codellama_with_langchain

To initiate inference with Code Llama, developers can start by specifying the desired model using its identifier, such as MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf". Transformers simplifies the process by providing a unified interface with the familiar Python programming language, allowing users to effortlessly initialize the model and tokenizer.

Once the model and tokenizer are set up, developers can leverage LangChain's HuggingFacePipeline class to create a text generation pipeline. This pipeline, defined with parameters like max_new_tokens and repetition_penalty, becomes a powerful tool for local inferencing. By combining this pipeline with LangChain's PromptTemplate, developers can easily construct prompts and invoke the entire chain to generate responses. This streamlined process facilitates local inferencing with Code Llama, empowering developers to leverage Hugging Face's models for a wide range of natural language processing tasks in their Python applications.

Example

root@hf-3:/codellama# python3 codellama_langchain.py
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████| 749/749 [00:00<00:00, 3.57MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 4.48MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 6.13MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████| 411/411 [00:00<00:00, 1.86MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████| 646/646 [00:00<00:00, 3.40MB/s]
model.safetensors.index.json: 100%|██████████████████████████████████████████████| 25.1k/25.1k [00:00<00:00, 68.2MB/s]
model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████| 9.98G/9.98G [01:50<00:00, 90.0MB/s]
model-00002-of-00002.safetensors: 100%|██████████████████████████████████████████| 3.50G/3.50G [00:39<00:00, 89.5MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████| 2/2 [02:30<00:00, 75.16s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.86s/it]
generation_config.json: 100%|█████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 110kB/s]

Ask codellama: given two unsorted integer lists. merge the two lists, sort the merged list, and find median using python. consider the length of the merged list while finding the median value.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 Here is a possible solution to the problem:

def merge_and_find_median(list1, list2):
   # Merge the two lists
   merged_list = list1 + list2

   # Sort the merged list
   merged_list.sort()

   # Find the median value
   if len(merged_list) % 2 == 0:
       # Even number of elements in the merged list
       median = (merged_list[len(merged_list) // 2 - 1] + merged_list[len(merged_list) // 2]) / 2
   else:
       # Odd number of elements in the merged list
       median = merged_list[len(merged_list) // 2]

   return median

Explanation:

* First, we merge the two lists by concatenating them.
* Then, we sort the merged list using the `sort()` method.
* Next, we check whether the length of the merged list is even or odd. If it's even, we take the average of the middle two elements of the list. If it's odd, we simply take the middle element as the median.
* Finally, we return the median value.

Note that this solution assumes that both input lists are sorted in ascending order. If they are not sorted, you may need to add additional code to sort them before merging and finding the median.</s>

Ask codellama: /bye
root@hf-3:/codellama#

Hope it was useful. Cheers!

Thursday, January 25, 2024

Ollama - Part2 - Prompt Large Language Models (LLMs) using Ollama, LangChain and Python

In this exercise we will learn to interact with the LLMs using Ollama, LangChain, and Python.

Full project in my GitHub

https://github.com/vineethac/Ollama/tree/main/ollama_langchain

Import necessary modules from LangChain library and Python's argparse module

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import Ollama
import argparse

Argument parsing

parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default="llama2")

args = parser.parse_args()
model = args.model

Initialize Ollama

llm = Ollama(
        model=model, callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]), base_url="http://ollama:11434"
)

Interactive loop

while True:
    print(f"Model: {model}")
    prompt = input("Ask me anything: ")

    if prompt=="/bye":
        break

    llm(prompt)
    print("\n \n")

In summary, this script sets up a simple command-line interface for interacting with the Ollama language model. It takes user prompts, sends them to the Ollama model for processing, and prints the model's responses. The loop continues until the user enters "/bye" to exit.

Hope it was useful. Cheers!

Friday, June 11, 2021

Index

Generative AI and LLMs

Azure AI Foundry

Ollama

Hugging Face

Kubernetes

Kubernetes mini project
Storage performance benchmarking of Tanzu Kubernetes Clusters
Benchmarking Kubernetes infrastructure using K-Bench
Validate your Kubernetes cluster using Sonobuoy
vSphere with Tanzu using NSX-T Blog Series
Part1 - Prerequisites
Part2 - Configure NSX
Part3 - Edge Cluster
Part4 - Tier-0 Gateway and BGP peering
Part5 - Tier-1 Gateway and Segments
Part6 - Create tags, storage policy, and content library
Part7 - Enable workload management
Part8 - Create namespace and deploy Tanzu Kubernetes Cluster
Part9 - Monitoring
Part10 - Upgrade Tanzu Kubernetes Cluster
Part11 - Troubleshooting Tanzu Kubernetes Cluster
Part12 - Deploy application on TKC and access it
Part13 - Export WCP admin kubeconfig
Part14 - Testing TKC storage using kubestr
Part15 - Working with etcd on TKC with one control plane
Part16 - Troubleshooting content library related issues
Part17 - Troubleshooting TKC stuck at updating phase
Part18 - Troubleshooting vSphere pods with ProviderFailed status
Part19 - Troubleshooting TKC stuck at creating phase
Part20 - Safely deleting NotReady nodes from a TKC
Part21 - Pointers while upgrading the stack
Part22 - Working with NGINX Ingress Controller
Part23 - Supervisor cluster certificates expiry
Part24 - Kubernetes component certs in TKC
Part25 - Spherelet
Part26 - Jumpbox kubectl plugin to SSH to TKC node
Part27 - nullfinalizer kubectl plugin
Part28 - Create a custom VM Class
Part29 - Logging using Loki stack
Part30 - Troubleshooting inaccesssible TKC with server pool members missing in the LB VS
Part31 - Troubleshooting inaccessible TKC with expired control plane certs
Part32 - Troubleshooting BGP related issues
Part33 - Troubleshooting intermittent connection timeouts to apiserver and workloads
Part34 - CPU and Memory utilization of a supervisor cluster
Part35 - Monitoring supervisor cluster health with Python and vCenter APIs

Tanzu Kubernetes Grid (TKG) on vSphere 6.7 U3
Part1 - Install
Part2 - Deploy, and manage multiple Kubernetes workload clusters
Part3 - Deploy FIO application with persistent storage

Kubernetes 101 Blog Series

Part1 - Create K8s cluster with kubeadm
Part2 - Basic operations
Part3 - Install kubectl on Windows
Part4 - Kubectl autocomplete and alias
Visualize your Kubernetes clusters and workloads using Octant

vRealize Operations (vROps)

vRealize Operations 7.5 Blog Series

Dell EMC PowerFlex Management Pack for vROps 8.x Blog Series

PowerShell

PowerShell 101
Part 1: Microsoft PowerShell Help System
Part 2: Objects, properties, and methods in PowerShell
Part 3: PowerShell Pipeline and object filtering
Part 4: Avoid disasters in PowerShell
Part 5: PowerShell Remoting

VMware PowerCLI Blog Series
Part1 - Installing the module and working with stand-alone ESXi host
Part2 - Working with vCenter server
Part3 - Basic VM operations
Part4 - Snapshots
Part5 - Real time storage IOPS and latency
Part6 - vSphere networking
Part7 - Working with vROps
Part8 - Working with vSAN
Part9 - Working with NSX-T

Working with iDRAC9 using PowerShell
Part 1
Part 2
Part 3
Part 4

Working with ScaleIO/ VxFlex OS/ PowerFlex REST API using PowerShell
Part 1
Part 2
Part 3
Part 4

Pages

Saturday, March 30, 2024

Deploy on Kubernetes as a pod

Friday, February 23, 2024

Example

Thursday, January 25, 2024

Full project in my GitHub

Friday, June 11, 2021

Generative AI and LLMs

Kubernetes

Tanzu Kubernetes Grid (TKG) on vSphere 6.7 U3Part1 - InstallPart2 - Deploy, and manage multiple Kubernetes workload clustersPart3 - Deploy FIO application with persistent storage

vRealize Operations (vROps)

PowerShell

Tanzu Kubernetes Grid (TKG) on vSphere 6.7 U3
Part1 - Install
Part2 - Deploy, and manage multiple Kubernetes workload clusters
Part3 - Deploy FIO application with persistent storage