AI Tools Notes & Disclaimers
How we created our AI apps
We developed our AI evaluation apps using MindStudio, an application for building AI 'agents'. An AI agent follows a series of commands to achieve a task, using one or more Large Language Models (LLMs). Instead of providing a single prompt like "Create a logic model", we broke each task down into a number of sub-tasks to improve the chance that the end product would be useful. The field of artificial intelligence is changing so quickly that we wanted a tool that could switch to better LLMs as soon as they are released, while keeping the same step-by-step approach to complex tasks.
MindStudio can use many different Large Language Models, including GPT-4o and Claude 3 Opus. Typically, a single MindStudio app uses a mix of inexpensive LLMs (like Claude 3 Haiku) and expensive ones (like GPT-4 Turbo), depending on the sub-task. Public apps like the ones offered here generally use the inexpensive models, because costs mount up quickly when many people are using them. However, if a single organization is using an app for its own evaluation, it probably makes sense to use the most capable LLMs for the most complex tasks.
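To illustrate that step-by-step approach, here is a minimal sketch in Python. It does not use MindStudio's own interface; the call_llm function and the model names are placeholders we made up for illustration, showing how a task like "Create a logic model" can be split into sub-tasks that each use a model suited to their complexity.

```python
# A minimal sketch of the step-by-step ('agent') approach described above.
# call_llm is a placeholder, not a real MindStudio or vendor API; in a real
# workflow each step would be sent to a model like the ones named in the text.

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned response for illustration."""
    return f"[{model} response to: {prompt[:60]}...]"

def plan_logic_model(program_description: str) -> str:
    # Sub-task 1: pull out the program's activities (a cheap model is enough here).
    activities = call_llm(
        "inexpensive-model",
        f"List the main activities in this program description:\n{program_description}",
    )

    # Sub-task 2: draft intended outcomes from those activities (a harder step,
    # so a more capable model would be used).
    outcomes = call_llm(
        "capable-model",
        f"Suggest short-term and long-term outcomes for these activities:\n{activities}",
    )

    # Sub-task 3: assemble the pieces into a draft logic model.
    return call_llm(
        "capable-model",
        f"Combine these into a draft logic model.\nActivities:\n{activities}\nOutcomes:\n{outcomes}",
    )

if __name__ == "__main__":
    print(plan_logic_model("A weekly after-school tutoring program for grade 7 students."))
```

Breaking the work into narrow sub-tasks like this, with cheaper models on the simpler steps, is the pattern the apps follow, whatever specific models are plugged in at any given time.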
All of the apps on this site were developed by LogicalOutcomes, a Canadian nonprofit, with support from the Ontario Trillium Foundation. They are all free to use and revise for your own organization. All you need is a free MindStudio account and a template from LogicalOutcomes.
Security and confidentiality
Your chats with these apps cannot be seen by LogicalOutcomes or MindStudio employees. The Large Language Models that analyze your conversations do have temporary access to your chats and data, but that data is not retained afterwards. MindStudio uses the 'Enterprise API' versions of the LLMs, which do not permit user data to be used for training.
Files that you upload should not contain personally identifiable client information. Before uploading any file, review it to see whether it contains information that could be harmful if it were revealed, such as sensitive health information. Generally we recommend not asking for that kind of information unless it is necessary, and only after you have done a privacy assessment to ensure you can handle the data responsibly. Assuming there is no risk of harm, strip the file of identifying information such as names, email addresses, personal record numbers and other fields that could identify individual people. That generally means deleting a few columns in a spreadsheet.
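To make that last step concrete, here is a small Python sketch that drops identifying columns from a spreadsheet saved as a CSV file before it is uploaded. The column names ('name', 'email', 'client_id', 'phone') are examples only; adjust the list to match your own file, and remember that removing direct identifiers is a minimum step, not full de-identification.

```python
import csv

# Example column names that identify individual people; adjust to match your file.
IDENTIFYING_COLUMNS = {"name", "email", "client_id", "phone"}

def strip_identifying_columns(infile: str, outfile: str) -> None:
    """Copy a CSV file, dropping any column whose header is in IDENTIFYING_COLUMNS."""
    with open(infile, newline="", encoding="utf-8") as src, \
         open(outfile, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        kept = [col for col in reader.fieldnames if col not in IDENTIFYING_COLUMNS]
        writer = csv.DictWriter(dst, fieldnames=kept)
        writer.writeheader()
        for row in reader:
            writer.writerow({col: row[col] for col in kept})

# Example use: strip_identifying_columns("clients.csv", "clients_deidentified.csv")
```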
We are testing data analysis services that have strong enough security protections, and MindStudio is also working towards SOC certification. In the meantime, we have decided to focus on AI tools that do not depend on sensitive confidential information.
Disclaimers
These tools are based on generative artificial intelligence (AI) using Large Language Models (LLMs). The world of LLMs is changing rapidly, and not even AI researchers fully understand how LLMs work. When you use these tools, you need to know that they are unpredictable and often inaccurate.
Every time you ask an LLM a complex, open-ended question, you will get a different answer. Even narrow, 'fact-based' questions will often produce different, possibly inaccurate results, and you won't know which responses are correct and which are not.
"Because you cannot rely on LLMs to provide correct responses, and you cannot generate a confidence score for any given response, you have to either accept potential inaccuracies (which makes sense in many cases, humans are wrong sometimes too) or keep a Human-in-the-Loop (HITL) to validate the response." From https://lethain.com/mental-model-for-how-to-use-llms-in-products/ .
It is essential to keep a human in the loop, which means that you or another human must check responses to ensure they are good enough to use. That doesn't mean the responses must be perfectly correct; they just need to be good enough to be helpful and not harmful. For example, if you want to know more about 'human in the loop' or 'LLM', just ask any of the good AI chatbots available for free (like Bing Copilot in 'precise' mode).
For our evaluation tools, we have tried to prevent inaccurate, unhelpful or harmful results in several ways:
We developed a basic evaluation framework and design approach for the evaluation planning apps. The framework defines the common elements of evaluations that have a positive impact on participants and service providers. By targeting these common, basic elements of effective evaluations, the app is unlikely to recommend irrelevant or needlessly expensive methodologies.
We broke down each step into individual tasks, with each task having its own prompt or function. For example, one step scrapes a web page about a program, extracts the text, and saves it into a variable. The next step analyzes the text to summarize the program. By using separate, narrowly defined tasks, the LLM has less room to hallucinate and is more likely to be accurate. (A simplified sketch of this kind of pipeline appears after this list.)
We tested each step using a less capable and inexpensive LLM (Claude 3 Haiku, which is still amazing) to identify mistakes and misunderstandings that could be fixed with more precise prompting. Then we often switched to a more capable (and expensive) LLM like GPT-4 Turbo for actual use, to provide more accurate and helpful answers.
We built in some automated review and correction within the app itself to reduce errors.
We ask users to copy and paste the results into a document for review, so that they act as the 'human in the loop'.
Most of the evaluation effort should be in communicating with participants, stakeholders and decision-makers, exploring emerging findings with them and coming to shared conclusions about how to improve services. That process in itself will reduce the risk of inaccuracies.
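Here is the simplified sketch referred to above: a rough Python illustration of separate, defined tasks (fetch a page, summarize it, then automatically review the summary). The page fetching and tag-stripping are deliberately crude, and call_llm is again a made-up placeholder rather than a real MindStudio or vendor function; the point is the structure, where each step does one narrow job and passes its output to the next.

```python
import re
import urllib.request

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned response for illustration."""
    return f"[{model} response to: {prompt[:60]}...]"

def fetch_page_text(url: str) -> str:
    """Step 1: download a web page and crudely strip the HTML tags."""
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    return re.sub(r"<[^>]+>", " ", html)

def summarize_program(page_text: str) -> str:
    """Step 2: summarize the program described in the page text."""
    return call_llm("inexpensive-model",
                    f"Summarize the program described in this text:\n{page_text[:4000]}")

def review_summary(page_text: str, summary: str) -> str:
    """Step 3: an automated review step that checks the summary against the source."""
    return call_llm("capable-model",
                    "Check this summary against the source text and flag anything "
                    f"inaccurate or missing.\nSource:\n{page_text[:4000]}\nSummary:\n{summary}")

if __name__ == "__main__":
    text = fetch_page_text("https://example.com/")  # replace with the program's web page
    summary = summarize_program(text)
    print(review_summary(text, summary))
```

The review step at the end is a simplified version of the automated review and correction mentioned above; it reduces errors but does not replace the human reviewer.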
That said, the accuracy of each app will be the result of a collaboration between you and the LLMs. It is entirely possible to interact with an app in a way that delivers bad results. Users are responsible for the final result and how they use it. There is no other way to manage LLMs at this stage of their development.
On the other hand, LLMs are an incredibly useful resource. Working with them is like working with a weird intern who is astoundingly knowledgeable in some areas and completely without judgement in others; you don't know which until you ask them to do something, and then they improve by the next month anyway. They need close supervision, but they can help you do a lot more than you could by yourself. Use them with care, but also (we suggest) with an appreciation of how helpful they can be.