Insight

Structuring data using artificial intelligence

Unstructured data such as chats or email contains valuable information that can be used to improve business processes. We explain how to structure this data using the latest AI technologies.

Introduction

Unstructured data is common in the business world. It can be found in emails, chats, documents, news outlets, and many other textual sources. This data is often not used to its full potential, as it is difficult to extract information from it. Some companies resort to human labor to extract information from this data, but this is both time-consuming and error-prone. In this article, we will explain how you can leverage AI capabilities to structure this data and extract valuable information from it.

Converting unstructured data to a readily usable format is a complex task; however, it can be broken down into three main steps:

  1. Preparing for structuring: understanding the data and defining the structure
  2. Structuring: using AI to structure the data
  3. Working towards production: understanding if the solution is viable and, if so, how to take it one step further

Preparing for structuring

For the rest of this article, we will assume that we are working with a simple trading conversation, where two parties discuss closed deals. The goal is to extract the information from the conversation and store it in a structured format.

Data collection

The first step is to collect data. This can usually be done by compiling a list of emails, or exporting a chat history. In our case, we will use a chat history, which contains messages that look like these:

Hi John,
We confirm that we will buy 1000 shares of Apple at 100$ per share.
Have a good day,
Jane

Or it could also look like this:

Dear John,
The transaction is confirmed for 1000 shares of AAPL@100$.
Best regards

Make assumptions

The next step is to make assumptions about the data. This is a crucial step, as it will determine the structure of the data. In our case, we know the following:

  • The data is in a chat format
  • The data is about trading
  • The data is about closed deals (NB: this is an assumption and may not hold; the chat could also be about a deal that is still in progress, or simple chit-chat)

We also make some assumptions about the information contained when talking about closed deals:

  • The number of shares, company and price must always be present.
  • There should usually be a currency symbol (e.g. $, €, £, ¥) or other currency information.
  • There may be some information about the buyer, but this is not always the case; the buyer might also be inferred from the sender of the message.

Define the structure

Based on the assumptions we made, we can define the structure of the data. In our case, we will define the following structure in YAML format:

trade:
  buyer: Name of the buyer (optional)
  stock: Name of the bought stock
  currency: Three letter code of the currency
  price: Price of the stock (per share)
  quantity: Number of shares, is an integer

Here is the same model in Pydantic form, with some validation rules added. This will be useful later on.

To install Pydantic, run pip install pydantic in your terminal, or use your preferred package manager.

from typing import Optional
from pydantic import BaseModel, Field, field_validator

class Trade(BaseModel):
    buyer: Optional[str] = Field(description="Name of the buyer (optional)")
    stock: str = Field(description="Name of the bought stock")
    currency: str = Field(description="Three letter code of the currency")
    price: float = Field(description="Price of the stock (per share)")
    quantity: int = Field(description="Number of shares, as an integer")

    @field_validator("currency")
    @classmethod
    def currency_length(cls, v: str) -> str:
        if len(v) != 3:
            raise ValueError("Currency must be a three letter code")
        return v
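To check that the validation rules behave as expected, we can instantiate the model directly. A minimal sketch, where the invalid record is a hypothetical example:

from pydantic import ValidationError

# A well-formed trade passes validation.
trade = Trade(buyer="Jane", stock="Apple", currency="USD", price=100.0, quantity=1000)
print(trade.model_dump_json())

try:
    # A hypothetical invalid record: the currency is a symbol, not a three letter code.
    Trade(buyer=None, stock="Apple", currency="$", price=100.0, quantity=1000)
except ValidationError as error:
    print(error)  # Reports: Currency must be a three letter code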

This structure is kept simple for the sake of the example, but it could be extended to include more information, such as the date of the transaction, the seller, etc.

Structuring

We will now use AI to structure the data. There are many ways to do this, but we will focus on an LLM-based approach, as it is the most flexible and can be used for many different tasks. It is also the most readily available and the cheapest to implement for small datasets.

Setting up the OpenAI API

To use LLMs, we will use OpenAI's API. OpenAI is a company that specializes in AI research and has developed very powerful LLMs, which it makes available through an API that we will use in this example.

To get started, head over to https://platform.openai.com/ and register a new account. Then, create a new API key, and copy it somewhere safe. We will use it later.

Important: you will also need to install the OpenAI Python package. You can do this by running pip install openai in your terminal, or use your preferred package manager.

Using Langchain

Large Language Models work similarly to typeahead: they try to predict the next word based on the previous words. This can be used to generate free-form text but, given that the models have also been trained on structured data such as JSON or XML, it can also be used to generate - somewhat - structured data.

The very nature of LLMs prevents them from reliably generating structured data, as they may miss some information or generate plainly incorrect syntax. However, tooling such as Langchain aims to solve this problem by pointing the LLM in the right direction.

One major feature of Langchain is that it also leverages the ability of LLMs to correct faulty syntax. This essentially means that if the first attempt at generating structured data fails, Langchain will ask the LLM to correct itself, yielding very good results.

To install Langchain, run pip install langchain in your terminal, or use your preferred package manager.
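As an illustration of this self-correction mechanism, Langchain provides an OutputFixingParser that wraps another parser and, when parsing fails, sends the malformed output back to the LLM with a request to fix it. A minimal sketch, anticipating the Pydantic parser and model set up in the next snippet:

from langchain.llms import OpenAI
from langchain.output_parsers import OutputFixingParser, PydanticOutputParser

parser = PydanticOutputParser(pydantic_object=Trade)  # The Trade model defined above
fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=OpenAI(temperature=0.0))

# A hypothetical malformed output (missing closing brace): the fixing parser
# asks the LLM to repair the syntax, then parses the corrected result.
structured_data = fixing_parser.parse('{"buyer": "Jane", "stock": "Apple", "currency": "USD", "price": 100, "quantity": 1000')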

For our example, a basic setup using OpenAI's API and Langchain would look like this, added to the code we wrote earlier:

from langchain.output_parsers import PydanticOutputParser # Import the Pydantic output parser
from langchain.llms import OpenAI # Import the OpenAI LLM
from langchain.prompts import PromptTemplate # Import the prompt template
import os # Import the os module to set the environment API key
# ...


messages = [
  "Hi John,\nWe confirm that we will buy 1000 shares of Apple at 100$ per share.\nHave a good day,\nJane",
  "Dear John,\nThe transaction is confirmed for 1000 shares of AAPL@100$.\nBest regards",
] # Define the messages

os.environ['OPENAI_API_KEY'] = 'xxxxxxx' # Set the API key, replacing xxxxxxx with your API key.

model_name = "text-davinci-003" # Set the model name
temperature = 0.0 # Set the temperature. A higher temperature will yield more diverse results, but is not recommended for structured data. It can be seen as "freedom" given to the model to generate whatever it wants.
model = OpenAI(model_name=model_name, temperature=temperature)

parser = PydanticOutputParser(pydantic_object=Trade) # Create the output parser

prompt = PromptTemplate(
    template="If this message is about a closed deal, please fillin the information in the given format.\n{format_instructions}\n{message}\n",
    input_variables=["message"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
) # Create a prompt template

_input = prompt.format_prompt(message=messages[0])
# Feed the first message to the model

output = model(_input.to_string())
# Generate the output

structured_data = parser.parse(output)
# Parse the output

print(structured_data.model_dump_json(indent=2))
# Print the structured data

Running the complete file should yield the following output:

{
  "buyer": "John",
  "stock": "Apple",
  "currency": "USD",
  "price": 100,
  "quantity": 1000
}

Trying it on the second message is as simple as:

_input = prompt.format_prompt(message=messages[1])

This also yields a correct structured output.
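In practice, you would loop over the whole chat history rather than re-formatting each message by hand. A minimal sketch, reusing the objects defined above:

for message in messages:
    _input = prompt.format_prompt(message=message)  # Fill the prompt template
    output = model(_input.to_string())              # Query the LLM
    trade = parser.parse(output)                    # Parse into a Trade object
    print(trade.model_dump_json(indent=2))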

Congrats, you have successfully structured your first messages!

You can get the full code on Replit.

NB: the Replit example will not run as-is, as you need to provide your own API key.

Few-shot learning to the rescue

As you can see above, the model was able to parse the first message with no prior understanding of the data and no training. In just a few minutes of setup, we structured a message that would otherwise have required manual parsing, filling in the information in a structured format.

However, things are not always that easy. The model picked up John as the buyer, which, given the context, we can infer is wrong: the buyer is most likely Jane, as she is the one sending the message. This is a common problem with LLMs, as they do not automatically understand the conversational context of a chat message.

Most such parsing errors can be addressed with the few-shot learning technique. To put it simply, few-shot learning consists of giving the model a few additional indications or examples of the task directly in the prompt, without any retraining. The idea is that the LLM will generalize from the few examples we give it to perform the task.

In our case, we can use few-shot learning to tell the model that the buyer is most likely the sender of the message. This can be done by adding a few examples to the prompt template, adding context, or being more verbose in the field descriptions. Here, we will choose to add context to the prompt template.

prompt = PromptTemplate(
    template="""
If this message is about a closed deal, please fill in the information in the given format.
Note that we are passing you chats, where the buyer is most likely the person who wrote the chat, and not the recipient. If the sender did not give their name in the message, we cannot tell who the buyer is.

{format_instructions}

{message}
""",
    input_variables=["message"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

We now have the following output for message 1:

{
  "buyer": "Jane",
  "stock": "Apple",
  "currency": "USD",
  "price": 100,
  "quantity": 1000
}

And the following output for message 2:

{
  "buyer": null,
  "stock": "AAPL",
  "currency": "USD",
  "price": 100,
  "quantity": 1000
}

These outputs are correct, as the buyer is not specified in the second message.

Working towards production

While the above solution gets good results, it is not production-ready. It remains to be verified that the model generalizes to other messages, including messages that differ from the examples we tested it on.

There are also other elements that should be taken into account, for example the fact that the stock name can be provided in various ways (AAPL, Apple, etc.). This can be solved by adding more examples or context to the prompt, or by adding correctors further down the line, as sketched below.
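As an example of such a corrector, a simple post-processing step could map the various spellings to a canonical ticker. A minimal sketch, where the alias table is a hypothetical example:

# Hypothetical alias table mapping spellings seen in chats to canonical tickers.
STOCK_ALIASES = {
    "apple": "AAPL",
    "aapl": "AAPL",
}

def normalize_stock(trade: Trade) -> Trade:
    # Rewrite the stock field to a canonical ticker when the alias is known.
    ticker = STOCK_ALIASES.get(trade.stock.lower())
    if ticker is None:
        return trade
    return trade.model_copy(update={"stock": ticker})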

Monitoring

The key to a successful production-ready solution is monitoring. It is important to monitor the model's performance and to understand when it fails. This can be done by logging the model's output and then manually checking whether it is correct. If it is not, the prompt should be refined or the model retrained on the new data.

This process needs to be kept up to date, as input data may evolve over time.
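As a starting point, monitoring can be as simple as logging every raw and parsed output so that failures can be reviewed manually. A minimal sketch, reusing the prompt, model, and parser defined earlier:

import json
import logging

logger = logging.getLogger("trade_structuring")

def structure_message(message: str) -> Optional[Trade]:
    # Run the pipeline on a single message, logging everything for later review.
    _input = prompt.format_prompt(message=message)
    output = model(_input.to_string())
    try:
        trade = parser.parse(output)
        logger.info(json.dumps({"message": message, "raw": output, "parsed": trade.model_dump()}))
        return trade
    except Exception:
        # Failed parses are the most interesting cases to review manually.
        logger.exception(json.dumps({"message": message, "raw": output}))
        return None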

Enterprise-ready AI providers

Finally, while the above solution is a good starting point, its reliance on third-party AI providers might not be the best fit for your company. Although enterprise-oriented AI providers exist, such as Azure or AWS, you may also want to take the extra step of deploying your own AI solution.

Lleed & Partners can help you deploy your own AI solution, and integrate it with your existing infrastructure. We can also help you build a custom solution, tailored to your needs.

Get in touch

Lleed & Partners experts are here to help you in the digitalisation process, from strategy to implementation. Let's discuss your current needs, and how we can help you achieve your goals.
