Show HN: Improve LLM Performance by Maximizing Iterative Development

github.com

102 points by asif_ 3 days ago

I have been working in the AI space for a while now, first on ML at a FAANG company since 2021, then on LLMs at start-ups since early 2023. I think LLM application development is extremely iterative, more so than most other kinds of development. To improve an LLM application's performance (accuracy, hallucinations, latency, cost), you need to try many combinations of LLM models, prompt templates (e.g., few-shot, chain-of-thought), prompt context built with different RAG architectures, different agent architectures, and more. There are thousands of possible combinations, and you need a process that lets you quickly test and evaluate them.
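
As a rough illustration of how quickly the combinations multiply (hypothetical option names, not our actual API), even three knobs with three options each already give 27 candidate configurations:

    // TypeScript sketch with made-up option names, illustrative only.
    const models = ["model-a", "model-b", "model-c"];
    const promptTemplates = ["zero-shot", "few-shot", "chain-of-thought"];
    const retrievalConfigs = [
      { chunkSize: 256, topK: 3 },
      { chunkSize: 512, topK: 5 },
      { chunkSize: 1024, topK: 8 },
    ];

    // Every combination is a candidate configuration worth evaluating.
    const candidates = models.flatMap((model) =>
      promptTemplates.flatMap((template) =>
        retrievalConfigs.map((retrieval) => ({ model, template, retrieval }))
      )
    );

    console.log(candidates.length); // 27, and it grows fast with each extra component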

I have had the chance to talk with many companies working on AI products. The biggest mistake I see is the lack of a standard process that lets them rapidly iterate towards their performance goals. Using what I've learned, I'm working on an open-source framework that structures your application development for rapid iteration, so you can easily test different combinations of your LLM application's components and quickly iterate towards your accuracy goals.

You can check out the project at https://github.com/palico-ai/palico-ai

You can set up a complete LLM chat app locally with a single command. Stars are always appreciated!

Would love any feedback or your thoughts around LLM Development.

leobg 2 days ago

It seems to me that the fastest way for iterative improvement is to use LLMs with pure Python with as few frameworks in between as possible:

You know exactly what goes into the prompt, how it’s parsed, what params are used or when they are changed. You can abstract away as much or as little of it as you like. Your API is going to change only when you make it so. And everything you learn about patterns in the process will be applicable to Python in general - not just one framework that may be replaced two months from now.

  • asif_ 2 days ago

    I agree that you want flexibility in what you use to build your LLM application. For example, if you want direct API-level access for building and searching your RAG layer, calling your LLM models, and other business logic, you should have it. There are a lot of opportunities to fine-tune each of these layers. However, you are still left with thousands of combinations to experiment with, e.g. which prompt template x RAG context x LLM model gives you the best results, and you need a framework that helps you manage those experiments. That is where I'm trying to position this framework: it helps you try thousands of different configurations of your LLM application at scale so you can improve its performance, while giving you as much flexibility as possible in what components you use to actually build it.

    With our framework, you keep the flexibility you describe:

    > You know exactly what goes into the prompt, how it’s parsed, what params are used or when they are changed

    We provide all of that. We just give you a process that lets you try and evaluate different configurations of your LLM application layer at scale.

  • TNWin 2 days ago

    I agree. The complexity doesn’t lie in the abstractions that these frameworks are selling.

    • asif_ 2 days ago

      I try to think about what should be a framework and what should be a library. Libraries are tools that help you achieve a task, for example building a prompt, calling LLM models, or communicating with a vector database.

      Frameworks are more process-driven, aimed at achieving a complex task. ReactJS is a good example with its component model: it sets a process for building web applications so that you can build more complex ones, while still leaving you lots of flexibility in the implementation details. A framework should provide as much flexibility as possible.

      Similarly, we are trying to build our framework to streamline the process of LLM development so that you can iterate on your LLM application faster. To set up this process, we enforce only very high-level interfaces for how you build (input and output schemas), evaluate, and deploy your application. We leave all the low-level implementation details to the developer and keep the framework extensible, so you can also use any external tools you want within it.
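
      As a rough sketch of what I mean by a high-level interface (illustrative names only, not our exact API), the framework would only care about the boundary of your application, not what happens inside it:

        // Illustrative TypeScript sketch, hypothetical names, not our exact API.
        interface ChatRequest {
          userMessage: string;
          // Arbitrary knobs the developer wants to vary between experiments.
          appConfig?: Record<string, unknown>;
        }

        interface ChatResponse {
          message: string;
          metadata?: Record<string, unknown>;
        }

        // The framework only enforces this boundary; prompt building, RAG,
        // and model calls inside the handler are entirely up to the developer.
        type ChatHandler = (request: ChatRequest) => Promise<ChatResponse>;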

jeswin 2 days ago

Thanks for the TypeScript support when nearly everything else is in Python - trying it right away. Although we were familiar with Python, the lack of types was slowing us down tremendously every time we wanted to refactor. Python's typing is really baby typing.

  • campers 2 days ago

    It feels like a step backwards to develop with such a minimal type system. Seeing so many projects in Python is one reason we're open-sourcing our LLM/agentic framework, to give the TypeScript community more options. We haven't launched it yet, but the code is up at https://github.com/trafficguard/nous

  • zarathustreal 2 days ago

    As a Haskell and then TypeScript dev now having to work with Python for exactly this reason, I second this sentiment. I can't understand why Python became the industry's de facto AI language.

orliesaurus 3 days ago

This is a good idea, I wonder if you have a write-up/blog about the performance gains in real world applications?

  • asif_ 2 days ago

    Hey, thanks for checking out the framework! We just released this week, so there aren't any data points to share yet. But as we onboard more dev teams, we're planning to write about their process and outcomes over the next few months.

    If you are curious about the theory and best practices behind iterating on LLM applications to improve their performance, this is a good blog post from Data Science at Microsoft: https://medium.com/data-science-at-microsoft/evaluating-llm-...

    I am also working on taking the theory from the blog post above and turning it into a more practical guide using our framework. It should be out within the next two weeks. You can get notified when we release a blog post by signing up for our newsletter: https://palico.us22.list-manage.com/subscribe?u=84ba2d0a4c03...

spacecadet 2 days ago

This is not unique. Hardware, for instance: you can easily exceed 10-100 iteration cycles on a single part.

Also, in general, we are currently in a time of comparatively low iteration. Most companies don't have the tolerance for it anymore and choose cheap one-shot execution at stupid risk, because of FOMO.

Iteration cycles are a function of your inputs: creative potential, vision, energy, runway.

mdp2021 2 days ago

> quickly iterate towards your accuracy goals

Don't you get a phenomenon akin to overfitting? How do you ensure that enhancing accuracy on foreseen inputs does not weaken results on unforeseen future ones?

  • IanCal 2 days ago

    Yes, but it's less of an issue here than in the contexts where we typically talk about it.

    Overfitting in most ML is a problem because you task an automated process, which has no understanding, with mercilessly optimising for a goal, and then you have to figure out how to spot when it has gone too far.

    Here you're actively picking architectures and you should be actively creating new tests to see how your system performs.

    You're also dealing with much more general-purpose systems, so the chance that you're overfitting is lower.

    Beyond that, you're into the production ML environment, where you need to be monitoring how things are going for your actual users.
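
    One simple guard, sketched here with placeholder data (not part of the framework under discussion): hold out part of your eval set from the tuning loop and only check it once you've picked a winning configuration.

      // TypeScript sketch; the eval cases and split rule are placeholders.
      type EvalCase = { input: string; expected: string };

      const allCases: EvalCase[] = [ /* your evaluation cases */ ];

      // Iterate on prompts/configs against the tuning set only...
      const tuningSet = allCases.filter((_, i) => i % 5 !== 0);
      // ...and confirm the final configuration on the holdout set once.
      const holdoutSet = allCases.filter((_, i) => i % 5 === 0);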

    • asif_ 2 days ago

      Absolutely agree, thanks for the clear explanation!

jonnycoder 3 days ago

How does this intersect with evaluation in LLM integration & testing?

  • asif_ 2 days ago

    Hey, thanks for the question. Are you talking about standard evaluation tools like promptfoo? Those evaluation frameworks are often just tools that help you grade the responses of your LLM application. They don't, however, help you build an LLM application in a way that makes it easy to test different configurations and evaluate them. That is where we differ: we help you build an application that is designed for easily testing different configurations, so you can evaluate them much faster.

    The process we usually see when companies adopt an evaluation framework is this: when they want to try a new configuration, they change their code base, write the code to run an evaluation, review that result on its own, and then try to compare it with other changes they made, sometimes far in the past. This usually makes introducing new changes very slow and disorganized.

    With us, you build your LLM application so that it's easy to swap components. From there, when you want to see how your application behaves with a certain configuration, we have a UI where you can pass in the configuration settings and run an evaluation. We also save all your previous evaluations so you can easily compare them with each other. As a result, testing different configurations of your application and evaluating them is very easy and fast.
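
    Roughly, the shape of that is something like this (hypothetical names, not our exact API): the application reads a config object at request time, so swapping a component is just passing a different config from the evaluation UI.

      // TypeScript sketch; names are illustrative, not our exact API.
      interface AppConfig {
        model: string;          // which LLM to call
        promptTemplate: string; // e.g. "few-shot" vs "chain-of-thought"
        topK: number;           // retrieval depth
      }

      // Stubs standing in for the developer's own RAG, templating, and model code.
      declare function retrieveContext(query: string, topK: number): Promise<string[]>;
      declare function buildPrompt(template: string, query: string, context: string[]): string;
      declare function callModel(model: string, prompt: string): Promise<string>;

      // The evaluation UI just passes a different AppConfig to compare runs.
      async function handleChat(userMessage: string, config: AppConfig): Promise<string> {
        const context = await retrieveContext(userMessage, config.topK);
        const prompt = buildPrompt(config.promptTemplate, userMessage, context);
        return callModel(config.model, prompt);
      }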

saberience 2 days ago

I may be wrong, but this seems to only be useful if you want to write your code in TypeScript? If my application uses Java or Python, I can't use Palico?

  • asif_ 2 days ago

    Hi, yeah, unfortunately this is TypeScript-only at the moment. As we refine the framework, we'll look at a more language-agnostic approach, or provide support across different languages.

f6v 2 days ago

  • asif_ 2 days ago

    Interesting read. Honestly, I don't have enough business experience to draw any conclusions, but here's one point I disagreed with.

    The article argues that companies are pivoting towards more specialized verticals (e.g., LlamaIndex focusing on managed document parsing / OCR), and that this means they will keep getting smaller and eventually die. I don't think narrowing scope means a company can't have a viable business. If LlamaIndex charged a $100K base price per enterprise and had 1,000 customers, they would be doing at least $100M in revenue, which is a very viable business.

    If you are curious about this topic, maybe this is a good podcast for you :)

    https://www.youtube.com/@opensourcebusiness/videos

E_Bfx 2 days ago

How easy is it to switch from OpenAI to testing an LLM on premises?

  • asif_ 2 days ago

    We give you complete flexibility in how you call your LLM model. So if you have your on-prem LLM behind an API, you just write the standard code to call that API from within our framework.
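
    For example, if your on-prem deployment exposes an OpenAI-compatible endpoint (as servers like vLLM or Ollama do), the call is just ordinary HTTP code. The URL and payload below are placeholders for your own setup, not part of our framework:

      // TypeScript sketch of calling an on-prem model behind an HTTP endpoint.
      async function callOnPremLLM(prompt: string): Promise<string> {
        const response = await fetch("http://llm.internal:8000/v1/chat/completions", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            model: "local-model",
            messages: [{ role: "user", content: prompt }],
          }),
        });
        const data = await response.json();
        return data.choices[0].message.content;
      }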