
Making Data More Accessible

  • Writer: Boaz Ein-Gil
  • 3 days ago
  • 5 min read
Asking your database questions

Making your data more accessible to users is always a challenge. Businesses produce data at a growing rate, yet our search and retrieval solutions (read: search box) force users to ‘understand’ and ‘speak’ the specific dialect of the data repository in order to find the information they need. As a result, only a handful of ‘expert’ employees can find the answers to key questions. Needless to say, this does not fully unlock the potential value of the data.


With LLMs, natural-language information retrieval becomes feasible at a much lower cost than ever before. The promise is that users will be able to ask their questions the same way they would ask a co-worker for help, while receiving well-formulated (and grounded) answers. This time around, every user can leverage this technology to access data and unlock much more of its potential.


However, using LLMs does not mean you can simply unleash one on your data and get a natural-language data retrieval solution. Several aspects of such solutions require careful planning and implementation before a proper business impact can be gained.


Your Data

You need to consider several aspects of your data in order to make it truly accessible. Some will impact the architecture of your solution, while others will dictate how much effort is required to build it. Unfortunately, many of these aspects are initially overlooked.


Data Source and Formats

Data can reside in different sources, ranging from document libraries to web sites and ERP systems. Extracting data from these sources is an essential first step in making it available to the language model. The quality of the extraction process has a direct impact on the quality of your solution – the rule of thumb is Garbage In -> Garbage Out. Each type of source may require a different implementation and potentially different development skills. For example, if the source is a web site, do you have access to its source code, or do you need to implement web scraping to reach the relevant content? If your source is a document library, how will you extract text from the different document formats it contains? Extracting text from a Word document is quite different from extracting it from a PDF or a PowerPoint presentation. Furthermore, handling particular components of a document – extracting tables from a PDF file, for example – may require careful implementation.
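The per-format routing described above can be sketched as a small dispatch table. This is a minimal illustration, not a production pipeline: the PDF and Word extractors here are placeholder stubs, since real implementations would wrap a dedicated parsing library for each format.

```python
from pathlib import Path


# Placeholder extractors for binary formats; in practice each would
# wrap a format-specific parsing library.
def extract_pdf(path: str) -> str:
    raise NotImplementedError("plug in a PDF text extractor")


def extract_docx(path: str) -> str:
    raise NotImplementedError("plug in a Word document extractor")


def extract_plain(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


# One entry per supported format; extending the pipeline to a new
# source format means adding a single row here.
EXTRACTORS = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".txt": extract_plain,
    ".md": extract_plain,
}


def extract_text(path: str) -> str:
    """Route a file to the extractor matching its format."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported format: {suffix}")
    return EXTRACTORS[suffix](path)
```

The value of the dispatch-table shape is that the quality of each extractor can be improved independently, without touching the rest of the ingestion flow.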


Freshness

In some cases, the data seldom changes, so extracting it once is good enough. In other cases, data changes from time to time or even frequently, and your solution must ensure that no stale data is used to answer users’ questions. How the data is stored and what level of freshness is required will have a major impact on your implementation and potentially on your long-term cost to serve.
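One common way to keep re-indexing cost down is to detect change rather than re-ingest everything on a schedule. The sketch below, purely illustrative, fingerprints each document's extracted text and reports which documents differ from the state recorded at the last indexing run.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Stable fingerprint of a document's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_documents(index_state: dict, current_docs: dict) -> list:
    """Return IDs of documents that changed since the last indexing run.

    index_state:  doc_id -> fingerprint recorded when last indexed
    current_docs: doc_id -> freshly extracted text
    """
    changed = []
    for doc_id, text in current_docs.items():
        if index_state.get(doc_id) != content_fingerprint(text):
            changed.append(doc_id)
    return changed
```

Only the documents returned here need to be re-embedded and re-indexed, which is where the long-term cost to serve is usually won or lost.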


Sensitivity and Privacy

The data to be served might have different levels of sensitivity and privacy requirements, depending on its nature. In some cases, the data is publicly available and therefore no specific action is required. However, in many cases we’ve seen, the data is business critical or even contains Personally Identifiable Information (PII), so special consideration should be given to how it is handled before and while it is handed over to the language model. For example, you might be required to redact PII such as customer contact details from documents before sending their content to the LLM or, in some extreme cases, to use a private deployment of the LLM so that the data never leaves your secure network environment.
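The redaction step mentioned above can sit between extraction and the LLM call. The patterns below are deliberately simple examples for emails and phone numbers; real deployments typically rely on a dedicated PII-detection service rather than a couple of regexes.

```python
import re

# Illustrative patterns only: production redaction would use a
# purpose-built PII detector with far better recall.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Mask obvious contact details before the text reaches the LLM."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Because the masking happens before the LLM call, the guarantee holds regardless of which model or hosting option is used downstream.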


There is a clear benefit to using publicly hosted LLMs, since they tend to be bigger and newer than those available for private hosting. However, you should consider how you access the model and whether the data you send to it can be retained and used by the provider. Most providers offer a variety of subscription options with different levels of privacy; you should choose the one most appropriate to your needs and understand the cost associated with it.


Your Users and Business Scenario

An important part of the solution is making sure you deliver what your users need in order to accomplish their goals. Making your solution valuable and easy to use is critical if you are seeking real business impact. Understanding that from the get-go is essential in order to make proper decisions on both the fundamental backend implementation and the frontend experience. Here are a few areas you should consider:


Accuracy

LLMs can be leveraged both to identify information relevant to the user’s question and to present that information to the user. We’ve seen cases in which the required information is prescribed operational steps that must be followed to the letter, while in other cases keeping answers brief and easy to understand is the priority. Understanding the business requirements can drive very different solutions, ranging from presenting exact quotes and even deep linking into the original material (i.e. opening a specific page in a specific document to provide the answer) to letting the LLM express the idea behind the information in its own words.


User Roles and Personalization

Do all your users have the same role? Or is there a difference between them? Is this difference relevant to the answers they should get for similar questions? For example, if you are making HR information more accessible for employees, it is important to know that different roles/levels/etc. may need to follow different guidance, and serving the wrong guide to a user is worse than not serving one at all. Personalized answers require both organizing the information accordingly and guiding the LLM correctly when retrieving it and formulating answers.
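One way role awareness can be wired through the flow is to use the role twice: as a filter on what gets retrieved, and as an explicit instruction in the prompt. The sketch below assumes a hypothetical chunk layout where each piece of guidance is tagged with the role it applies to; the tag names and prompt wording are illustrative, not a prescribed format.

```python
def retrieve_for_role(chunks: list, role: str) -> list:
    """Keep only chunks tagged for this role (or for everyone)."""
    return [c["text"] for c in chunks if c.get("role") in (role, "all")]


def build_prompt(question: str, role: str, passages: list) -> str:
    """Assemble a prompt that pins the answer to the user's role.

    The passages are already filtered to the role, and the role is
    also stated explicitly so the model does not blend in guidance
    meant for other employee groups.
    """
    context = "\n\n".join(passages)
    return (
        f"You are answering an HR question for an employee whose role is "
        f"'{role}'. Use only the context below; if it does not cover this "
        f"role, say so instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Filtering before prompting matters: if guidance for other roles never reaches the context window, the model cannot accidentally serve it.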


Preserving Information Access Control

There are cases in which the underlying information served by your solution has access control requirements. Your solution has to obey these requirements, which means you need to enforce access control in your information retrieval phase so that no leakage is introduced: when the LLM handles a question from a specific user, it can only use information that user has access to.
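Enforcement at the retrieval phase can be as direct as the sketch below: each indexed chunk carries the access-control groups of its source document, and anything the asking user is not entitled to see is dropped before the LLM is invoked. The chunk shape and group names are assumptions for illustration.

```python
def allowed_chunks(chunks: list, user_groups: list) -> list:
    """Drop any retrieved chunk the user is not entitled to see.

    Each chunk carries the access-control groups of its source
    document ('acl'); enforcement happens before the LLM ever sees
    the text, so no leakage can occur at answer time.
    """
    groups = set(user_groups)
    return [c for c in chunks if groups & set(c["acl"])]
```

Keeping the check on the retrieval side, rather than asking the model to withhold restricted content, is the safer design: the model cannot leak text it was never given.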


Spreadsheet whispering insights

A Bit More Complicated Than Advertised...

At Agent Factory, we believe AI should be leveraged by every organization, and as such, we try to spread our knowledge and experience to businesses large and small. The LLM wave is a great enabler of this goal, but as always, details matter.

As we demonstrated with this common business scenario, there are many variables and aspects to consider and plan for when building an LLM business solution. As you gain more experience building such solutions, these aspects will become clearer, and you will adopt a mindset that increases your chances of success.

In the meantime, we will keep posting on the subject hoping that it can help you get on the right path.


Feel free to reach out if you want to explore ways we can assist you on this journey.
