Running chained LLMs with TypeScript in production
5/16/2023 · 6 min read
Right now, LLMs are perhaps the most exciting new technology available, used everywhere from startups to global companies. Why? The time to value with these large models is remarkably short. They are pre-trained on a huge amount of data, so you can get decent output with almost zero knowledge of AI: you only need to write a prompt.
For real-world usage you'll almost certainly want to chain LLM calls together in your backend, though chaining in production takes a lot of time to build well. In this post we'll explore how you can develop reliable chained LLMs and deploy to Vercel in minutes.
Why chaining?
A one-shot response from an LLM often isn’t enough to build the experience you need. You might need a chained workflow that refines and improves LLM responses through each iteration. Some examples:
- Given a user’s input, you might need to run 4 different prompts or ideas and present the output to users as choices (think Midjourney)
- You might need to chunk a user’s input to reduce context/tokens in each call
- You might need to continue to refine input, such as going from question → data → SQL → human readable answer
- You might just want the LLM to check whether it gave the right answer (e.g. ask “Are you sure?”). This is a basic, but common, approach to testing LLM output
- You might ask an LLM whether the prompt is susceptible to injection before running the actual prompt
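As a rough illustration of the question → data → SQL → answer pattern above, a chain is a series of calls where each prompt embeds the previous output. The sketch below is not tied to any provider: `LlmCall`, `QueryRunner`, `answerQuestion`, `fakeLlm` and `fakeQuery` are hypothetical names, and the fake model returns canned text so the example runs without an API key.

```typescript
// A hypothetical signature for "send a prompt, get text back".
// Swap in a real client in practice.
type LlmCall = (prompt: string) => Promise<string>;

// Hypothetical query runner; a real one would hit your database.
type QueryRunner = (sql: string) => Promise<Array<Record<string, unknown>>>;

async function answerQuestion(
  llm: LlmCall,
  runQuery: QueryRunner,
  question: string,
): Promise<string> {
  // Link 1: turn the user's question into SQL.
  const sql = await llm(`Write a SQL query that answers: ${question}`);

  // Link 2: ask the model to double-check its own output ("Are you sure?").
  const verdict = await llm(`Is this SQL safe and read-only? Answer YES or NO.\n${sql}`);
  if (!verdict.trim().toUpperCase().startsWith("YES")) {
    throw new Error("Generated SQL failed the self-check");
  }

  // Link 3: run the query, then have the model phrase the rows for a human.
  const rows = await runQuery(sql);
  return llm(`Question: ${question}\nRows: ${JSON.stringify(rows)}\nAnswer in one sentence.`);
}

// A canned stand-in for a model, so the sketch runs offline.
const fakeLlm: LlmCall = async (prompt) => {
  if (prompt.startsWith("Write a SQL query")) return "SELECT COUNT(*) AS n FROM users";
  if (prompt.startsWith("Is this SQL")) return "YES";
  return "There are 42 users.";
};

// A canned stand-in for a database.
const fakeQuery: QueryRunner = async () => [{ n: 42 }];
```

Every `await` here is a separate network call that can fail or time out on its own, which is why the problems in the next section grow with each link you add.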
Issues with chaining LLMs
Though chaining is often necessary, it exacerbates every issue with running LLMs in production:
- Reliability becomes harder
- Latency and costs increase
- The infrastructure required to store state and context through the chain becomes more difficult to manage
- Even small things, like cancelling a chain to reduce cost, become hard
- Observing the chain and introspecting state can be difficult
Taking an LLM pipeline from toy to production is a surprisingly difficult challenge that requires strong durable systems. Availability, reliability, state, and maintenance all become much harder when running something for many users.
Zero-infra, zero-ops chained LLMs
The annoying part of chaining is the distributed state and orchestration required to reliably run the chains. The ideal end state for chaining looks similar to the following:
- A single function with automatically managed function context/state
- Retries and durability built in
- Optional parallelisation included, to improve performance across complex flows
- Observability and transparency included
- Cancellation of flows possible to reduce costs
Using Inngest, you get all of this for free — zero infra or setup required. Using our SDK, you can build durable functions with retries, automatically persisted state, parallelism, and cancellation on any platform. All you need to focus on is the business logic: your prompts and your chain.
Here’s an example:
    import { Inngest } from "inngest";
    // https://www.inngest.com/docs/learn/serving-inngest-functions#setting-up-the-api
    import { serve } from "inngest/next";
    import OpenAI from "openai";

    const inngest = new Inngest({ name: "Chained LLM app" });

    export const chain = inngest.createFunction(
      { name: "Summarize chat and documents" },
      { event: "api/chat.submitted" },
      async ({ event, step }) => {
        // Reads the API key from the OPENAI_API_KEY environment variable.
        const llm = new OpenAI();

        // `step.run` creates a new reliable step which retries automatically, and
        // only runs once on success. It returns data which is stored in function
        // state automatically.
        const output = await step.run("Summarize input", async () => {
          const prompt =
            "You are an executive assistant. " +
            "You must summarize the given document accurately within 4 paragraphs.";
          // gpt-3.5-turbo is a chat model, so use the chat completions endpoint.
          const res = await llm.chat.completions.create({
            model: "gpt-3.5-turbo",
            messages: [{ role: "user", content: `${prompt}: ${event.data.input}` }],
          });
          // Return only the generated text so later steps can interpolate it.
          return res.choices[0].message.content;
        });

        const title = await step.run("Generate a title", async () => {
          const prompt =
            "You are a business leader who writes reports on different topics. " +
            "Given the following report, generate a title which introduces " +
            "the report in under 100 words";
          // Uses output from the previous LLM call, stored automatically
          // in function memory!
          const res = await llm.chat.completions.create({
            model: "gpt-3.5-turbo",
            messages: [{ role: "user", content: `${prompt}: ${output}` }],
          });
          return res.choices[0].message.content;
        });

        // Save the generated content to the database, and return it to be captured
        // as the final function state. `db` stands in for your own database client.
        await step.run("Save to DB", async () => {
          await db.summaries.create({ output, title, requestID: event.data.requestID });
        });

        return { output, title };
      },
    );

    // Create an HTTP handler that serves your chained functions. This function
    // will be called any time the `api/chat.submitted` event is received.
    //
    // It can be hosted anywhere: Vercel, Netlify, Cloudflare, Fly.io, Railway, etc.
    export const handler = serve(inngest, [chain]);

    // You can trigger this function by sending an event. This is a single HTTP
    // POST.
    await inngest.send({
      name: "api/chat.submitted", // matches the event name in createFunction
      data: {
        requestID: "ef2fc16e-5f9b-48fb-a996-e3adbf1accb9",
        input: "<Add any data you want summarized here>",
      },
    });
In this example, we’re defining a function which automatically runs when you send a specific event to Inngest. It’s served via HTTP, and can be hosted on any platform. Under the hood, whenever we receive a new api/chat.submitted event we:
- Create a new blank function run, with empty state
- Call your function with empty state
- Run each step in series or parallel, injecting the current function state from step.run blocks into the appropriate variables
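One way to picture the flow above is a runner that re-invokes the function and replays saved step results instead of running those steps again. The following is a toy, in-memory sketch of that idea, not Inngest's actual implementation; `RunState`, `makeStep`, `runWithRetries` and the demo values are all hypothetical.

```typescript
// State for one function run: step name -> saved result.
type RunState = Map<string, unknown>;

// Build a `run` helper bound to one run's state. A step that already
// succeeded returns its saved result instead of executing again.
function makeStep(state: RunState) {
  return {
    async run<T>(name: string, fn: () => Promise<T>): Promise<T> {
      if (state.has(name)) return state.get(name) as T;
      const result = await fn();
      state.set(name, result); // saved only when the step succeeds
      return result;
    },
  };
}

type Step = ReturnType<typeof makeStep>;

// Re-invoke the whole function until it completes or attempts run out.
// Completed steps are skipped on each retry because their results are saved.
async function runWithRetries<T>(
  fn: (step: Step) => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  const state: RunState = new Map(); // a new run starts with empty state
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(makeStep(state));
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Demo: the second step fails once; the first step is not repeated.
const calls = { summarize: 0, title: 0 };

const demo = runWithRetries(async (step) => {
  const summary = await step.run("Summarize input", async () => {
    calls.summarize++;
    return "a short summary";
  });
  const title = await step.run("Generate a title", async () => {
    calls.title++;
    if (calls.title === 1) throw new Error("simulated timeout");
    return `Title for: ${summary}`;
  });
  return { summary, title };
});
```

In this sketch the expensive first call is paid for once even though the function body runs twice. A production system would also need to persist the state outside the process and space out retries, which this toy version leaves out.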
You can then start to chain LLMs without building out complex queues, state, or services. You're writing standard TypeScript, so you don't need to learn a new framework. Everything “just works”, allowing you to focus specifically on the model code without worrying about any infrastructure at all — meaning you can deploy to production within an hour.
Beyond this simple case, you can extend your functions to:
- Automatically parallelize calls using actual parallelism to speed up your chain
- Automatically cancel if desired, e.g. on window close, to save on model costs
- Manage concurrency, ensuring that you automatically enqueue functions if they exceed rate limits
- Handle failures, allowing you to create cleanup code or notify users on error
We also provide branch environments across platforms and work with your current CI/CD process, since functions are served via HTTP anywhere they’re accessible.
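To illustrate the parallelism point: when links in a chain do not depend on each other, they can be started together and awaited as a group. The sketch below uses plain `Promise.all` with a simulated model call rather than any Inngest API; `generateVariants`, `slowFakeLlm` and the counters are hypothetical names, and the 20ms delay is made up.

```typescript
type LlmCall = (prompt: string) => Promise<string>;

// Start one call per style at once and collect the results in input order.
async function generateVariants(
  llm: LlmCall,
  input: string,
  styles: string[],
): Promise<string[]> {
  return Promise.all(
    styles.map((style) => llm(`Rewrite in a ${style} tone: ${input}`)),
  );
}

// Simulated model call that tracks how many requests overlap.
let inFlight = 0;
let peakInFlight = 0;

const slowFakeLlm: LlmCall = (prompt) =>
  new Promise<string>((resolve) => {
    inFlight++;
    peakInFlight = Math.max(peakInFlight, inFlight);
    setTimeout(() => {
      inFlight--;
      resolve(`[done] ${prompt}`);
    }, 20);
  });

const variants = generateVariants(slowFakeLlm, "hello", [
  "formal",
  "casual",
  "playful",
  "terse",
]);
```

Because all four calls are started before any is awaited, the total wait is roughly that of the slowest call instead of the sum of all four. Note that `Promise.all` rejects as soon as any single call fails, so a real chain would still want per-call retries.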
We’re already helping a number of users build reliable products with LLMs. Inngest is free to use, and we’d love feedback on how to make this better for your own AI usage.
Future plans
Our initial list of ideas for the LLM use case space is:
- A higher-level library for working with chained functions
- An integration with LangChain, allowing people to run LangChain models via serverless environments with zero infra or state
- Function state introspection beyond our application UI, via an API and stylable React components
- And, once this is done, streaming function state to browsers
We’re excited about what this makes possible. Being able to focus on business logic and skip tedious, complex infrastructure is ideal, and seeing ideas go from code to production in hours — using your current development flow — is a dream. Please do give it a try, and if you have feedback let us know!