TEQNation 2024: Using RAG and ReACT to augment your LLM app
I always love speaking at TEQNation. This year, I showed how to use RAG and ReACT to build an LLM-powered app. TEQNation is a great conference with a high-quality speaker lineup every year, and since I live in the Netherlands, it’s very easy for me to get there.
Here’s the outline of my talk, with links to the deck and recording of the session at the end.
The idea
I built an app that lets you ask a large language model (LLM) questions grounded in Hacker News comment threads. Ask it something like ‘Can satellites fly in formation?’, and it will sift through over 20 million comments to find relevant threads and give you an answer. You might also know this pattern as RAG, or retrieval-augmented generation.
Getting 20 million comments
To make this work, I first had to gather all the data. While Hacker News provides a search API, I opted for scraping every single comment and post using the Hacker News API, starting from May 2019, which amounted to over 20 million individual comments. Once I had this dataset, I built my own system to search and find relevant comments to feed to the LLM.
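Each item lives at its own numbered endpoint, and maxitem.json tells you the highest id so far, so walking the full range is mostly a matter of issuing a lot of small GET requests. Here is a minimal sketch of those two calls in Python (the actual pipeline is a Go program, and retries and error handling are omitted):

```python
import requests

HN_API = "https://hacker-news.firebaseio.com/v0"

def max_item_id() -> int:
    """Highest item id currently known to the API."""
    return requests.get(f"{HN_API}/maxitem.json", timeout=10).json()

def fetch_item(item_id: int) -> dict | None:
    """Fetch a single item (story, comment, job, ...); returns None for ids that don't exist."""
    resp = requests.get(f"{HN_API}/item/{item_id}.json", timeout=10)
    resp.raise_for_status()
    return resp.json()

print(max_item_id())
print(fetch_item(1))  # the very first Hacker News item
```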
Cloud Run jobs with parallelization to do the work
To run the job to scrape these millions of comments, I used Cloud Run Jobs, a Google Cloud product that lets you run a containerized task with a timeout of up to 7 days.
My initial approach was a simple for loop, but it was incredibly slow and only achieved 3% CPU utilization. I then tried using goroutines to improve performance, which helped utilize the instances better, but it still wasn’t fast enough.
The real breakthrough came when I leveraged the parallelism feature of Cloud Run jobs, allowing me to spin up 160 containers concurrently. This significantly sped up the process, and I successfully downloaded all 20 million items!
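Each task in a parallel Cloud Run job gets the CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT environment variables, so a container can work out which slice of the id range it owns. A sketch of that partitioning in Python (my job is written in Go, and the id bounds here are made up for illustration):

```python
import math
import os

# Cloud Run jobs set these for every task in a parallel execution.
task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))

# Hypothetical id range to download; the real bounds come from the HN API.
FIRST_ID, LAST_ID = 19_000_000, 40_000_000

chunk = math.ceil((LAST_ID - FIRST_ID + 1) / task_count)
start = FIRST_ID + task_index * chunk
end = min(start + chunk - 1, LAST_ID)

print(f"task {task_index}/{task_count} handles ids {start}..{end}")
# ...fetch every item in [start, end] and write it to storage...
```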
Indexing comment threads
However, my goal wasn’t just to collect individual items; I needed to index entire comment threads. My first attempt involved a complex SQL query with a recursive subquery, but it turned out to be painfully slow. Instead of debugging it, I opted for a pre-order traversal of the comment tree, using many simple SQL queries to retrieve individual comments.
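The traversal itself is straightforward: visit a comment, look up its children, and recurse. A rough Python sketch of the idea, assuming a hypothetical items table with id, parent, and text columns (the real code is in Go and the schema differs):

```python
import psycopg

def thread_preorder(conn, root_id: int) -> list[dict]:
    """Collect a comment thread in pre-order: each comment, then its children."""
    thread: list[dict] = []

    def visit(item_id: int) -> None:
        with conn.cursor() as cur:
            cur.execute("SELECT id, text FROM items WHERE id = %s", (item_id,))
            row = cur.fetchone()
            if row is None:
                return
            thread.append({"id": row[0], "text": row[1]})
            cur.execute("SELECT id FROM items WHERE parent = %s ORDER BY id",
                        (item_id,))
            children = [r[0] for r in cur.fetchall()]
        for child_id in children:
            visit(child_id)

    visit(root_id)
    return thread

# with psycopg.connect("dbname=hn") as conn:   # connection details assumed
#     print(thread_preorder(conn, some_story_id))
```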
Implementing search using text embeddings on PostgreSQL
To enable search functionality, I decided to use text embeddings. These are numerical representations of text that capture the meaning and relationships between words and concepts, allowing for semantic search.
I generated embeddings using the Google Cloud Vertex AI text embedding API. I ran into a few challenges, such as rate limits on the API and a cap on the number of tokens per request, but I worked around them by batching requests and, again, using goroutines.
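In Python, the batching idea looks roughly like this; the project, location, model name, and batch size are placeholders, and the real pipeline adds rate-limit handling and goroutine-based concurrency in Go:

```python
import vertexai
from vertexai.language_models import TextEmbeddingModel

# Project, location, model name, and batch size are placeholders; check the
# Vertex AI docs for current models and per-request limits.
vertexai.init(project="my-project", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")
BATCH_SIZE = 5

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed texts in small batches to stay under per-request limits."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i : i + BATCH_SIZE]
        for embedding in model.get_embeddings(batch):
            vectors.append(embedding.values)
    return vectors
```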
For efficient vector search, I used the pgvector extension for PostgreSQL. I then built a simple API around this and deployed it as a Cloud Run service.
While dedicated vector stores like Pinecone and Weaviate exist, I opted for PostgreSQL because I’m familiar with it.
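A nice side effect is that the search itself stays plain SQL: pgvector adds distance operators you can use directly in ORDER BY. A sketch of a nearest-neighbour lookup, with table and column names as assumptions rather than my actual schema:

```python
import psycopg

# Table and column names are assumptions; "<=>" is pgvector's cosine
# distance operator, so ascending order returns the closest threads first.
FIND_SIMILAR = """
    SELECT thread_id, text
    FROM thread_embeddings
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""

def similar_threads(conn, query_embedding: list[float], k: int = 5):
    """Return the k threads whose embeddings are closest to the query embedding."""
    literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(FIND_SIMILAR, (literal, k))
        return cur.fetchall()
```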
Connecting all the pieces with LangChain
Finally, I demonstrated how I implemented everything in Python using LangChain, a framework for developing applications powered by large language models. I explained the concept of a LangChain “chain,” which is essentially a sequence of operations that transform a user query into a final answer.
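Not the exact chain from the talk, but a minimal RAG-style chain in LangChain's expression language looks something like this; the retriever function is a stand-in for the Cloud Run search service, and the chat model class and name are assumptions:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_google_vertexai import ChatVertexAI  # any chat model would do

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only these Hacker News threads:\n"
    "{context}\n\nQuestion: {question}"
)

def retrieve_threads(question: str) -> str:
    """Stand-in for the pgvector-backed search service on Cloud Run."""
    return "...comment threads returned by the embedding search..."

chain = (
    {"context": RunnableLambda(retrieve_threads), "question": RunnablePassthrough()}
    | prompt
    | ChatVertexAI(model_name="gemini-1.0-pro")  # model name is an assumption
    | StrOutputParser()
)

print(chain.invoke("Can satellites fly in formation?"))
```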
I also explored ReACT (Reasoning and Acting), a prompting technique in which the LLM alternates between reasoning steps and actions, such as calling a search tool, in a loop, with each action's result fed back into the prompt. While I implemented and tested ReACT, it didn’t significantly improve the responses in my specific use case.
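At its core, ReACT is a loop: the model writes a Thought and an Action, your code executes the action (here, a thread search) and appends the Observation, and this repeats until the model emits a final answer. A framework-free sketch of that loop, where llm and search_threads are stand-in callables, not actual APIs from my app:

```python
def react_loop(llm, search_threads, question: str, max_steps: int = 5) -> str:
    """Minimal ReACT loop: llm and search_threads are stand-in callables."""
    transcript = (
        "Answer the question. You may use:\n"
        "Action: search[<query>]  - search Hacker News comment threads\n"
        "When you know the answer, reply with 'Final Answer: <answer>'.\n\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        reply = llm(transcript)                 # text in, text out
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        if "Action: search[" in reply:
            query = reply.split("Action: search[", 1)[1].split("]", 1)[0]
            observation = search_threads(query)  # e.g. the pgvector search API
            transcript += f"Observation: {observation}\n"
    return "No final answer within the step budget."
```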
Links
- Slides (docs.google.com)
- Session recording (youtube.com)
- Hacker News API
- LangChain
- RAG from scratch - an educational YouTube playlist by the folks at LangChain
- The original ReACT paper
- pgvector
- Announcing pgvector support in Cloud SQL
- Cloud Run jobs
- Cloud Run jobs parallelism feature
- Vertex AI text embedding API
- TEQNation