How To Test LLMs at Test IO

Quickly learn how to test LLMs on our platform

Written by Zorica Micanovic
Updated over 8 months ago

Can anyone test LLMs at Test IO?

Although anyone can test LLMs, it helps to have a specific set of skills, including:

  • Analytical mind

  • Critical thinking and

  • Curiosity

Equipped with these skills, you'll streamline your testing activities, save resources, and produce the best possible results for our customers.

What types of testing scenarios should I know before I start testing LLMs?

Positive Testing

Positive Testing in LLM Agent Testing checks whether the system behaves as expected when given valid inputs. The main purpose of positive testing is to confirm that the software product or application functions as intended and provides the expected outputs. This type of testing is based on the principle of "happy path" testing, where the focus is on the normal, expected behavior of the system.

Imagine you have a new toy car. You wind it up and let it go to see if it moves forward. If it does, that's like a successful positive test.

So, in the case of an LLM Agent, positive testing means giving it the right kind of information and seeing if it responds with expected data.

Positive Testing LLM for Context

Represents checking if the AI model correctly understands and responds to the context of a conversation or a text.

Here is one interesting example where an LLM successfully followed the topic of the conversation:

User: "I'm thinking about buying a new car."
LLM: "That's great! What kind of car are you considering?"
User: "I'm not sure yet, but I want something with good fuel efficiency."
LLM: "There are many cars known for their fuel efficiency. Some popular options are hybrid cars or electric vehicles. Would you like more information on these?"

The LLM was following the context of the conversation about cars. It understands that:

  • the user is considering buying a new car and

  • wants one with good fuel efficiency.

The LLM responds appropriately by suggesting types of cars known for their fuel efficiency. If the LLM continues responding in a way that shows it understands the context, it will pass the positive test.
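A check like the one above can be sketched in code. The snippet below is a minimal illustration, not a Test IO tool: `check_context`, `fake_llm`, and the keyword lists are hypothetical names invented for this example, and keyword matching is only a rough stand-in for a tester's own judgment.

```python
from typing import Callable, List, Tuple


def check_context(
    ask: Callable[[List[str]], str],
    turns: List[Tuple[str, List[str]]],
) -> List[str]:
    """Replay a conversation turn by turn.

    `ask` receives the full history so far and returns the model's reply.
    Returns the user messages whose reply contained none of the expected keywords.
    """
    history: List[str] = []
    failures: List[str] = []
    for user_message, expected_keywords in turns:
        history.append(user_message)
        reply = ask(history)
        history.append(reply)
        if not any(k.lower() in reply.lower() for k in expected_keywords):
            failures.append(user_message)
    return failures


def fake_llm(history: List[str]) -> str:
    """Hypothetical stand-in for the model under test."""
    last_message = history[-1].lower()
    if "fuel efficiency" in last_message:
        return "Hybrid cars or electric vehicles are known for their fuel efficiency."
    return "That's great! What kind of car are you considering?"


car_conversation = [
    ("I'm thinking about buying a new car.", ["car"]),
    ("I'm not sure yet, but I want something with good fuel efficiency.",
     ["hybrid", "electric"]),
]
failed_turns = check_context(fake_llm, car_conversation)
```

An empty `failed_turns` list means every reply stayed on topic according to the chosen keywords; any listed turn is worth reviewing by hand.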

Positive Testing LLM for Performance

Represents checking if the AI model can respond quickly and accurately under normal conditions.

e.g. If you ask the AI a question, it should be able to give you a correct answer in a reasonable amount of time. If it can do this consistently, then it's performing well.

Here's an example of Positive Testing LLM for Performance:

User: "What's the weather like in New York today?"
LLM: (Responds within a few seconds) "The weather in New York today is sunny with a high of 75 degrees."

In the example above, the LLM quickly and accurately provides the requested information. This means the AI is performing well in terms of:

  • speed and

  • accuracy.
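Speed and accuracy can be checked together in a short script. The sketch below is only an illustration: `timed_ask`, `passes_performance`, `fake_weather_llm`, and the 5-second limit are assumptions made for this example, since the article only says "a reasonable amount of time".

```python
import time
from typing import Callable, Tuple


def timed_ask(ask: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Send one prompt and return the reply with the elapsed time in seconds."""
    start = time.perf_counter()
    reply = ask(prompt)
    return reply, time.perf_counter() - start


def passes_performance(
    ask: Callable[[str], str],
    prompt: str,
    expected_fragment: str,
    max_seconds: float = 5.0,  # assumed limit, adjust per test instructions
) -> bool:
    """True if the reply contains the expected text and arrived within the limit."""
    reply, elapsed = timed_ask(ask, prompt)
    accurate = expected_fragment.lower() in reply.lower()
    return accurate and elapsed <= max_seconds


def fake_weather_llm(prompt: str) -> str:
    """Hypothetical stand-in for the model under test."""
    return "The weather in New York today is sunny with a high of 75 degrees."
```

Running the same prompt several times and comparing the results helps confirm that the performance is consistent, not a one-off.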

⚠️Be careful not to confuse the Functional endless loading bug with the LLM Performance bug.⚠️

We suggest the following way to check whether a bug is a Functional bug or an LLM Performance bug:

  • While testing the LLM, observe whether it loads endlessly on every request or only on specific requests.

Endless loading on every request indicates a Functional bug, while infinite loading after a particular input indicates an LLM Performance bug.
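The rule above can be written down as a small decision helper. This is a sketch under stated assumptions: `classify_loading_bug` and its input format (a mapping from each prompt you tried to whether it loaded endlessly) are hypothetical and exist only to make the rule explicit.

```python
from typing import Dict


def classify_loading_bug(observations: Dict[str, bool]) -> str:
    """Classify endless loading from a tester's notes.

    `observations` maps each prompt that was tried to True if the LLM
    loaded endlessly on it, or False if it answered normally.
    """
    if not observations or not any(observations.values()):
        return "no endless loading observed"
    if all(observations.values()):
        # Every request hangs, so the problem is not tied to a specific input.
        return "Functional bug"
    # Only particular inputs hang, while others are answered.
    return "LLM Performance bug"
```

The classification is only meaningful when several different prompts have been tried; a single hanging request cannot tell the two bug types apart.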

Positive Testing LLM for Focus During Long Conversations

Represents checking if the AI model can maintain the context and stay on topic throughout a lengthy conversation.

e.g. If you're having a long conversation with the AI about a specific topic the AI should be able to follow along, remember previous parts of the conversation, and provide relevant responses. It should not get confused or start talking about unrelated topics.

User: "I'm trying to bake a cake and I need some help."
LLM: "Of course, I'd be happy to help. What kind of cake are you baking?"
User: "I'm baking a chocolate cake."
LLM: "Sounds delicious! What step are you currently on in the recipe?"
User: "I'm at the part where I need to melt the chocolate."
LLM: "Alright, you can melt chocolate by using a double boiler or by microwaving it in short intervals and stirring in between. Which method do you prefer?"

The LLM is maintaining focus during a long conversation about baking a cake:

  • It follows the context,

  • remembers the user is baking a chocolate cake and

  • provides relevant advice.

Positive Testing LLM for Language Accuracy

Represents checking if the AI model can use language correctly. This includes proper grammar, spelling, punctuation, and appropriate use of words and phrases.

e.g. If you ask the AI a question, it should respond with a sentence that is grammatically correct, spelled correctly, and makes sense in the context of the conversation.

User: "Can you tell me about the Eiffel Tower?"
LLM: "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. It's one of the most famous landmarks in the world. It was originally built as the entrance arch for the 1889 World's Fair and stands at approximately 330 meters tall."

The LLM uses:

  • correct grammar

  • spelling and

  • punctuation.

The sentence structure is clear, and the information provided is accurate and relevant to the question.

Positive Testing LLM for Cross-platform Compatibility

Represents checking if the AI model works well across different platforms or devices.

e.g. The AI should function properly whether you're using it on a computer, a smartphone, a tablet, or any other device. It should also work well if you're using Windows, macOS, Android, iOS, or any other operating system.

Let's say you're using an AI chatbot to help manage your schedule. You first interact with the chatbot on your Windows computer to set a reminder for a meeting. Later, while you're out and about, you check the reminder on your Android smartphone. The chatbot should be able to display the reminder correctly on both devices.

The AI chatbot works well across different platforms:

  • a Windows computer and

  • an Android smartphone.
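When several devices and operating systems are involved, it can help to list every combination up front so none is forgotten. The sketch below assumes a hypothetical `PLATFORMS` mapping and `build_test_matrix` helper; the device and operating system pairings are illustrative, not a list of what Test IO requires.

```python
from typing import Dict, List, Tuple

# Assumed pairings for illustration; use the devices named in your test instructions.
PLATFORMS: Dict[str, List[str]] = {
    "computer": ["Windows", "macOS"],
    "smartphone": ["Android", "iOS"],
    "tablet": ["Android", "iOS"],
}


def build_test_matrix(
    platforms: Dict[str, List[str]], checks: List[str]
) -> List[Tuple[str, str, str]]:
    """Return one (device, operating system, check) entry per combination."""
    return [
        (device, system, check)
        for device, systems in platforms.items()
        for system in systems
        for check in checks
    ]


matrix = build_test_matrix(PLATFORMS, ["set a reminder", "view the reminder"])
```

Each entry in `matrix` is one thing to try, for example setting a reminder on a Windows computer and viewing it on an Android smartphone.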

Positive Testing LLM for Data Privacy

Represents checking if the AI model respects and protects user data as it should.

e.g. The AI should not store or share personal information unless it's necessary and the user has given consent. It should also have measures to protect the data it handles from unauthorized access or leaks.

Let's say you use an AI assistant to help manage your schedule and tell the AI about an upcoming doctor's appointment. The AI should be able to remind you about the appointment, when necessary, but it should not share this information with anyone else or use it for any other purpose.

The AI respected data privacy by:

  • keeping your information confidential and

  • using it only for the intended purpose.

Negative Testing

Negative Testing in LLM Testing is a type of testing where the system is checked to see how it handles incorrect or unexpected inputs. The main purpose of negative testing is to ensure that the software or application behaves gracefully and does not crash or produce errors when faced with invalid data or unexpected user behavior.

For Negative Testing, imagine you try to wind up the toy car backward or you try to make it move without winding it up at all. The car shouldn't break or do anything strange. It just shouldn't move, because that's not how it's supposed to work.

Similarly, in the case of a computer program or AI, negative testing means giving it the wrong kind of information or instructions and seeing how it handles them.

The program shouldn't crash or give a weird response. It should either do nothing or let you know that something went wrong.

Negative Testing LLM for Context

Represents checking how the AI model handles situations where the context is unclear or confusing.

e.g. If you're having a conversation with the AI and suddenly change the topic without any clear transition, or if you use a word or phrase that could have multiple meanings, the AI should still be able to respond sensibly. It might ask for clarification or make a best guess based on the information it has.

User: "I'm thinking about getting a new pet."
LLM: "That's exciting! What kind of pet are you considering?"
User: "I'm thinking about a bulldog or a goldfish."

The user has given two very different types of pets, which could create a confusing context for the AI. A good AI should be able to handle this situation gracefully. It might respond with something like:

LLM: "Both are great choices but require different levels of care. Would you like information on taking care of a bulldog or a goldfish?"

This response shows:

  • that the AI has recognized the potential confusion and

  • has asked for clarification, which would be a good result in a negative test for context.

On the other hand, the interaction could run like this as well:

User: "I'm thinking about getting a new pet."
LLM: "That's exciting! What kind of pet are you considering?"
User: "I'm thinking about a bulldog or a goldfish."
LLM: "That's great! Bulldogs love to swim."

The AI has:

  • misunderstood the context and

  • provided an incorrect response.

Bulldogs, in general, are not known for their love of swimming, and the AI has also ignored the mention of the goldfish.

Negative Testing LLM for Performance

Represents checking how the AI model handles situations that are not ideal for performance.

e.g. The AI might be given a very complex task that requires a lot of processing power, or it might be used by many people at once. Negative Testing aims to check how the AI handles these high-stress situations.

Let's say you're using an AI chatbot to help manage your schedule. Normally, you would ask the chatbot to set one or two reminders at a time, but for a Negative test, you would set hundreds of reminders for the same time slot.

In this situation, the chatbot might take longer than usual to process your request, or it might not be able to set all the reminders, but it shouldn't crash or stop working completely. It should:

  • set as many reminders as it can and

  • let you know that it couldn't set them all.

Here's an example of a failure in a Negative Test for Performance:

Let's say you're using an AI chatbot to help manage your schedule. Normally, you ask the chatbot to set one or two reminders at a time. But for a negative test, you try to set hundreds of reminders all at once.

If the chatbot

  • crashes

  • freezes or

  • takes an extremely long time to respond

it would be considered a failure in the negative test for performance.
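The pass and fail conditions for this scenario can be summarized in a small checker. This is a sketch, not an official scoring rule: `BulkResult`, `evaluate_bulk_test`, and the 60-second limit are assumptions made for this example, and a freeze is treated here as a response that exceeds the limit.

```python
from dataclasses import dataclass


@dataclass
class BulkResult:
    """What a tester recorded after asking for many reminders at once."""
    crashed: bool
    elapsed_seconds: float
    requested: int
    created: int
    reported_shortfall: bool  # did the chatbot say it could not set them all?


def evaluate_bulk_test(result: BulkResult, max_seconds: float = 60.0) -> str:
    """Return "pass" or "fail" for a negative performance test."""
    if result.crashed or result.elapsed_seconds > max_seconds:
        return "fail"
    if result.created < result.requested and not result.reported_shortfall:
        # Silently dropping reminders is not graceful handling.
        return "fail"
    return "pass"
```

A chatbot that sets only some of the reminders still passes, as long as it responds in time and tells the user which part of the request it could not complete.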

Negative Testing LLM for Focus During Long Conversations

Represents checking how the AI model handles situations where the conversation is long and potentially confusing.

e.g. The user might change the topic frequently, introduce irrelevant information, or use unclear or ambiguous language. Negative Testing aims to see how the AI handles these challenging conversations.

User: "I'm thinking about getting a new pet."
LLM: "That's exciting! What kind of pet are you considering?"
User: "I like dogs. But I also like pizza. And I need to do laundry."

In the exchange above, the user is changing the topic rapidly and introducing unrelated information. A good AI might respond with something like:

LLM: "It sounds like you have a lot on your mind. Would you like to continue discussing pets, or would you prefer to talk about something else?"

This response shows that the AI has:

  • recognized the rapid topic changes and

  • is trying to steer the conversation back to a clear path.

Here's an example of a failure in a negative test for focus during long conversations:

User: "I'm thinking about getting a new pet."
LLM: "That's exciting! What kind of pet are you considering?"
User: "I like dogs. But I also like pizza. And I need to do laundry."
LLM: "That's great! Dogs make wonderful pets. What breed are you considering?"

In this example, the AI has:

  • failed to acknowledge the rapid change of topics and

  • simply continued discussing the first topic, pets, while ignoring the rest of the message.

Negative Testing LLM for Language Accuracy

Represents checking how the AI model handles situations where the language used is incorrect or unclear.

e.g. The user might use incorrect grammar, misspell words, or use slang or jargon that the AI might not recognize. The goal of negative testing is to see how the AI handles these language inaccuracies.

User: "I has a question about dogs. Can they eats chocolate?"

In this example, the user has used incorrect grammar ("I has" instead of "I have" and "eats" instead of "eat"). A good AI should be able to understand the intended meaning despite the errors.

It might respond with something like:

LLM: "It's important to know that chocolate is not safe for dogs to eat. It contains a substance called theobromine, which is toxic to dogs."

This response shows that the AI:

  • has understood the question and

  • provided a correct answer, despite the incorrect grammar.

Negative Testing LLM for Cross-platform Compatibility

Represents checking how the AI model handles situations where it's used on a platform or device it's not optimized for.

e.g. The AI might be designed to work best on a computer, but a user might try to use it on a smart TV or a smartwatch. The Negative Testing aims to see how the AI handles these less-than-ideal platforms.

Let's say you're using an AI chatbot designed to work best on a computer or a smartphone. But for a negative test, you try to use it on a smartwatch.

The chatbot might:

  • be unable to display all the information as clearly on the small screen or

  • be harder to navigate

but it should still function in some capacity.

It should be able to receive commands and provide responses, even if they're not as detailed or easy to interact with as they would be on a larger screen.

Negative Testing LLM for Data Privacy

Represents checking how the AI model handles situations where there could be a risk to data privacy.

e.g. The user might try to share sensitive personal information with the AI, or a third party might try to access the data the AI is handling. The Negative Testing aims to see how the AI protects data privacy in these situations.

Let's say you're using an AI chatbot for online shopping. During the checkout process, you need to enter your credit card information. For a Negative test, try to enter your credit card number into a field that's not secure.

The chatbot should not allow you to enter your credit card information into an insecure field. It should either:

  • prevent you from entering the information or

  • warn you that the field is not secure and ask you to enter it in a secure field instead.

Edge Case Testing

Edge Case Testing in LLM is a type of testing where the system is checked to see how it handles inputs at the extreme ends or boundaries of what's expected. These could be:

  • very large or very small numbers

  • special characters or

  • unusual combinations of inputs.

The main purpose of Edge Case Testing is to ensure that the software or application behaves correctly in scenarios that are less common, but still possible. It helps to identify any issues that might only occur under these special circumstances.

Imagine you take the toy car to the top of a very steep hill and let it go, or you try to make it move on a surface it's not designed for, like sand or water. The car might not behave as it usually does, but it shouldn't break or do anything dangerous.

Similarly, for a computer program or AI, Edge Case Testing means giving it unusual or extreme inputs and seeing how it handles them. The program shouldn't crash or give a weird response. It should either:

  • handle the situation as best as it can or

  • let you know it can't process the input.
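To prepare for this kind of testing, it can help to collect a few extreme inputs before you start. The sketch below builds examples of the kinds of input listed above; `edge_case_inputs` is a hypothetical helper, and the empty and very long inputs are extra assumptions added for this example, not items from the article.

```python
import sys
from typing import Dict


def edge_case_inputs(long_text_length: int = 10_000) -> Dict[str, str]:
    """Return labelled inputs at the boundaries of what a chat field usually receives."""
    return {
        # Listed in the article:
        "very_large_number": str(10 ** 100),
        "very_small_number": repr(sys.float_info.min),
        "special_characters": "!@#$%^&*()_+-=[]{}|;:'\",.<>/?`~",
        "unusual_combination": "42 + apples / \u00e9\u00e8 ?? 0.0001",
        # Assumed additions for illustration:
        "empty_input": "",
        "very_long_input": "a" * long_text_length,
    }


cases = edge_case_inputs()
```

Each value in `cases` can be pasted into the chat one at a time while you note whether the LLM handles it or clearly says that it cannot.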

Edge Case Testing LLM for Context

Represents checking how the AI model handles situations where the context is unusual or extreme.

e.g. The user might:

  • use a rare word or phrase or

  • change the topic of the conversation in a way that's unexpected or abrupt.

The Edge Case Testing goal is to check how the AI handles these unusual contexts.

User: "I'm thinking about getting a new pet."
LLM: "That's exciting! What kind of pet are you considering?"
User: "I'm thinking about a Jackalope."

The user mentioned a "Jackalope", a mythical creature from North American folklore, described as a jackrabbit with antelope horns. This is an unusual context because most people don't consider mythical creatures when getting a new pet.

A good AI should be able to handle this situation gracefully. It might respond with something like:

LLM: "A Jackalope sounds like an interesting choice! However, as far as I know, Jackalopes are creatures of folklore. If you're interested in real animals, rabbits or antelopes might be a good choice."

This response shows that the AI:

  • has recognized the unusual context and

  • provided a relevant and sensible response.

Edge Case Testing LLM for Performance

Represents checking how the AI model handles situations that are extreme in terms of performance demands.

e.g. The AI might be:

  • given an extremely complex task that requires a lot of processing power or

  • used by an extremely large number of people at the same time.

The Edge Case Testing aims to see how the AI handles these extreme situations.

Imagine you're using an AI language model to translate a text. Normally, the text is in a common language like English or Spanish. But for an edge case test, you might give it a text in a very rare or complex language, or even a made-up or coded language.

The AI might not be able to translate the text as accurately or quickly as it usually does, but it should:

  • try its best to provide a translation or

  • let you know that it can't process the text.

Edge Case Testing LLM for Focus During Long Conversations

Represents checking how the AI model handles situations where the conversation is extremely long or complex.

e.g. The user might have a conversation with the AI that lasts for several hours, or they might discuss a topic that's very complex and difficult to follow. The Edge Case Testing aims to see how the AI handles these extreme conversations.

User: "I'm thinking about writing a book."
LLM: "That's a great endeavor! What's the book going to be about?"
User: "It's going to be a science fiction novel about quantum physics, time travel, and interdimensional beings, with a subplot involving political intrigue in a post-apocalyptic society. The main character is a physicist who..."

In this example, the user introduced a very complex topic with many different elements. A good AI might respond with something like:

LLM: "That sounds like a fascinating and complex story. Would you like to discuss the plot further, or do you need help with specific aspects of your book?"

This response shows that the AI:

  • is maintaining focus during a long and complex conversation.

It's acknowledging the complexity of the topic and offering to continue the discussion.

Edge Case Testing LLM for Language Accuracy

Represents checking how the AI model handles situations where the language used is extremely complex or unusual.

e.g. The user might use very rare words or phrases, complex grammatical structures, or a style of language that's very formal or very informal. The Edge Case Testing aims to see how the AI handles these extreme language scenarios.

User: "I'm in a quandary over the conundrum posed by the juxtaposition of antidisestablishmentarianism and floccinaucinihilipilification in contemporary discourse."

In this example, the user has used some very complex and rare words. A good AI might respond with something like:

LLM: "That's quite a complex statement! It seems like you're pondering the contrast between the opposition to the withdrawal of state support from an established church (antidisestablishmentarianism) and the act of estimating something as worthless (floccinaucinihilipilification) in today's discussions. Could you please elaborate on your thoughts?"

This response shows that the AI has:

  • understood the complex language and

  • provided a relevant and sensible response.

Edge Case Testing LLM for Cross-platform Compatibility

Represents checking how the AI model handles situations where it's used on a platform or device that's very unusual or not commonly used.

e.g. An AI might be designed to work best on common devices like computers, smartphones, and tablets, but for an Edge Case Test, a user might try to use it on:

  • a very old device

  • a device with a very small screen or

  • a device that uses an uncommon operating system.

Imagine you're using an AI language model that's designed to work best on web browsers like Chrome, Firefox, or Safari. But for an edge case test, you try to use it on a less common browser, like an old version of Internet Explorer or a niche browser that's not widely used.

In that case:

  • the interface might not be displayed as intended or

  • some features might not work as smoothly.

However, the core functionality should still work. It should still be able to:

  • process inputs and

  • respond, even if some of the additional features or visual elements don't work perfectly.

Edge Case Testing LLM for Data Privacy

Represents checking how the AI model handles situations where there could be an extreme risk to data privacy.

e.g. The user might try to share very sensitive personal information with the AI, or a third party might try to access the data the AI is handling in a very aggressive or sophisticated way. The Edge Case Testing aims to see how the AI protects data privacy in these extreme situations.

Imagine you're using an AI language model that's designed to respect user privacy and not store personal data. For an edge case test, you might try to share a large amount of personal data with the AI all at once, such as a long list of names and addresses.

The AI should:

  • recognize that this is sensitive information and

  • refuse to process it.

It might respond with a message like:

"Sorry, but I can't assist with that."

This would show the AI is adhering to the data privacy rules, even in an extreme situation.

If the AI processes this information and continues the conversation without acknowledging the sensitivity of the data, this would be a failure. At a minimum, the AI should either refuse to process the data or explicitly confirm that it won't be stored or misused.
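The pass and fail rule above can be approximated with a simple text check. This is a rough sketch: `privacy_check` and the phrases in `ACCEPTABLE_MARKERS` are assumptions made for this example, and a real LLM may word a refusal differently, so a tester should still read the reply.

```python
# Assumed phrases that signal a refusal or an explicit privacy confirmation.
ACCEPTABLE_MARKERS = (
    "can't assist",
    "cannot assist",
    "won't be stored",
    "will not be stored",
)


def privacy_check(reply: str) -> str:
    """Return "pass" if the reply refuses or confirms the data is not stored."""
    # Treat typographic apostrophes like plain ones before matching.
    text = reply.replace("\u2019", "'").lower()
    if any(marker in text for marker in ACCEPTABLE_MARKERS):
        return "pass"
    return "fail"
```

A reply such as "Sorry, but I can't assist with that." is marked as a pass, while a reply that simply carries on with the shared names and addresses is marked as a fail.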

