LangChain Chatbot

Airline support chat bot with evaluation walkthrough using LangChain

If you're developing an AI chatbot on top of LangChain, a common problem you might have is ensuring that the chatbot stays within its guardrails and refuses to answer inappropriate or off-topic questions.

In this cookbook, we'll show you how you can use Context.ai to evaluate your chatbot on a wide range of potential user inputs and make sure it behaves as intended. We will build "London Airlines", a simple support agent for customer queries.

Let's get started by installing LangChain and Context's Python SDK.

pip install --upgrade --quiet langchain langchain-openai context-python

We will first build a simple LangChain application using an LLMChain and PromptTemplates.

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.prompts.chat import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI


context_token = "GETCONTEXT_TOKEN"
openai_key = "OPENAI_KEY"
model = "gpt-3.5-turbo-0125"

For some basic testing, we will define some test inputs for our application and specify whether we would like the application to respond to each one. We want our application to answer only relevant questions, and we want to be able to test for this.

user_inputs = [
    {"name": "Bob Dylan", "query": "When is flight 334 arriving in Gatwick?", "answerable": True},
    {"name": "Christie Moore", "query": "What are the max dimensions for baggage?", "answerable": True},
    {"name": "Elton John", "query": "How many flights are there to Dublin from London every day?", "answerable": True},
    {"name": "Madonna", "query": "Do you think I could become an opera singer?", "answerable": False},
    {"name": "Sam Altman", "query": "Why do bats live in caves?", "answerable": False},
    {"name": "Elvis Presley", "query": "How heavy is an average camel?", "answerable": False}
]

To create a chain, we will use ChatPromptTemplate and ChatOpenAI and combine them in an LLMChain.

human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template="""You are a support agent for London Airlines. Make up any information if you need to. The user's name is {name}.
        User query: {query}
        """,
        input_variables=["name", "query"],
    )
)
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt])
chat = ChatOpenAI(temperature=0.9, api_key=openai_key, model=model)

# create a chain
chain = LLMChain(llm=chat, prompt=chat_prompt_template)

Finally, we can do a quick test run to make sure everything is working nicely! A sample result for Elvis Presley is included below.

for user_input in user_inputs:
    chain_invoke = chain.invoke({'name': user_input["name"], 'query': user_input["query"]})
    print(f"Chain invoke: {chain_invoke}")

Chain invoke: {'name': 'Elvis Presley', 'query': 'How heavy is an average camel?', 'text': 'Hello, Elvis Presley! Thank you for reaching out to London Airlines. \n\nOn average, a fully grown adult camel can weigh anywhere between 900 to 1,600 pounds. Their weight can vary depending on factors such as age, gender, and breed. \n\nIf you have any other questions or need assistance with anything else, feel free to let me know. Have a great day!'}

Great! We now have some working output. But we can see that our application responded to a query it really should not have answered! With only 6 cases it is feasible to manually inspect the results and check they are as expected, but this is not a scalable solution, particularly if you need to test your LLM before every deployment!

Let's use Context.ai to automate this process.

from getcontext import ContextAPI
from getcontext.token import Credential
from getcontext.generated.models import TestSet, TestCase, TestCaseMessage, TestCaseMessageRole, TestCaseFrom, Evaluator

In our for loop from before, we build a TestCase for each input, assign either the attempts_answer or refuse_answer evaluator as appropriate, and store the results in the test_cases list. Since we have already generated the responses, we will upload them using the pregenerated_response field.

We could also upload the query without pregenerated_response, and Context would generate the model response for you automatically. pregenerated_response is particularly useful if you are using a fine-tuned model.

test_cases = []
for user_input in user_inputs:
    chain_invoke = chain.invoke({'name': user_input["name"], 'query': user_input["query"]})
    print(f"Chain invoke: {chain_invoke}")
    
    # fetch appropriate evaluators based on the user input
    if user_input["answerable"]:
        evaluators = [Evaluator(evaluator="attempts_answer")]
    else:
        evaluators = [Evaluator(evaluator="refuse_answer")]

    # create test case messages
    messages = [
        TestCaseMessage(
            message=chat_prompt_template.format_messages(name=user_input["name"], query=user_input["query"])[0].content,
            role=TestCaseMessageRole.USER,
        )
    ]
    test_cases.append(TestCase(name=user_input['name'].title(),
                               model=model,
                               messages=messages,
                               evaluators=evaluators,
                               pregenerated_response=chain_invoke['text'])
                      )

Notice that we have used chat_prompt_template.format_messages to recreate the full formatted prompt. We could instead use user_input["query"] if we wanted to submit just the original user query.
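As a rough sketch of those two alternatives combined, here is how we might build the test cases from the raw user queries and omit pregenerated_response so that Context generates the responses for us. It reuses the same classes as above; treat it as an illustration rather than the exact setup we use in this walkthrough.

# sketch: submit the raw user query and let Context generate the responses
generated_test_cases = []
for user_input in user_inputs:
    if user_input["answerable"]:
        evaluators = [Evaluator(evaluator="attempts_answer")]
    else:
        evaluators = [Evaluator(evaluator="refuse_answer")]

    generated_test_cases.append(TestCase(
        name=user_input["name"].title(),
        model=model,
        # raw query instead of the formatted prompt
        messages=[TestCaseMessage(message=user_input["query"], role=TestCaseMessageRole.USER)],
        evaluators=evaluators,
        # no pregenerated_response: Context generates the model response automatically
    ))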

The final step is to upload our test_cases to Context.

# Initialize the Context.ai client
context = ContextAPI(credential=Credential(context_token))

# Upload the generated text to Context.ai evaluations
context.log.test_sets(
    copy_test_cases_from=TestCaseFrom.NONE, # ignore test cases from previous test set version
    body=TestSet(name='London Airlines', test_cases=test_cases)
)

That's it! The only thing left to do is run the TestSet! We can do this directly in Python or via the Context platform at with.context.ai/evaluations/sets.

context.evaluations.run(body={"test_set_name": "London Airlines", "version": 1})

We can now repeat this process after altering the PromptTemplate. It took me a couple of attempts to find a prompt that does not respond to off-topic queries.

human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template="""You are a support agent for London Airlines.
        Make up any information if you need to.
        Refuse to answer any questions on any other topic besides airline queries.
        The user's name is {name}.
        User query: {query}
        """,
        input_variables=["name", "query"],
    )
)
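To repeat the process with the updated prompt, we rebuild the chain, regenerate the responses, and upload them as a new version of the test set before running the evaluation again. The sketch below reuses the steps shown earlier; the version number passed to evaluations.run is an assumption and depends on how many versions of the test set you have already uploaded.

# rebuild the chain with the updated prompt
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt])
chain = LLMChain(llm=chat, prompt=chat_prompt_template)

# regenerate responses and test cases with the new prompt
test_cases = []
for user_input in user_inputs:
    chain_invoke = chain.invoke({'name': user_input["name"], 'query': user_input["query"]})
    if user_input["answerable"]:
        evaluators = [Evaluator(evaluator="attempts_answer")]
    else:
        evaluators = [Evaluator(evaluator="refuse_answer")]
    test_cases.append(TestCase(
        name=user_input['name'].title(),
        model=model,
        messages=[TestCaseMessage(
            message=chat_prompt_template.format_messages(name=user_input["name"], query=user_input["query"])[0].content,
            role=TestCaseMessageRole.USER,
        )],
        evaluators=evaluators,
        pregenerated_response=chain_invoke['text'],
    ))

# upload the new responses as another version of the same test set, then run it
context.log.test_sets(
    copy_test_cases_from=TestCaseFrom.NONE,
    body=TestSet(name='London Airlines', test_cases=test_cases)
)
context.evaluations.run(body={"test_set_name": "London Airlines", "version": 2})  # assumes this is the second uploaded version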

After some prompt engineering and uploading new versions to Context for evaluation, I have found a prompt which works!

If you have any feedback, please let us know by emailing henry@context.ai
