Loading CSV files
When working with Retrieval Augmented Generation (RAG), you often need to load tabular data, such as a CSV file. This guide provides recommendations for loading CSV files so that their content can be ingested and retrieved in a RAG setup.
Loading a CSV file involves the following steps:
- Transforming each row into a document.
- Ingesting the set of documents using an appropriate document splitter.
- Storing the documents in the database.
You can find a complete example in the GitHub Repository.
From CSV to Documents
There are multiple ways to load CSV files in Java. In this example, we use Apache Commons CSV, which requires the following dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
<version>1.10.0</version>
</dependency>
You can choose a different library; the APIs are similar enough that the code is easy to adapt.
Once you have the dependency, load the CSV and process the rows:
/**
* The CSV file to load.
*/
@ConfigProperty(name = "csv.file")
File file;
/**
* The CSV file headers.
* Some libraries provide an API to extract them.
*/
@ConfigProperty(name = "csv.headers")
List<String> headers;
/**
* Ingest the CSV file.
* This method is executed when the application starts.
*/
public void ingest(@Observes StartupEvent event) throws IOException {
    // Configure the CSV format: use the configured headers and skip the header row of the file.
    CSVFormat csvFormat = CSVFormat.DEFAULT.builder()
            .setHeader(headers.toArray(new String[0]))
            .setSkipHeaderRecord(true)
            .build();
    // This will be the resulting list of documents:
    List<Document> documents = new ArrayList<>();
    try (Reader reader = new FileReader(file)) {
        // Generate one document per row, using the configured CSV format.
        Iterable<CSVRecord> records = csvFormat.parse(reader);
        int i = 1;
        for (CSVRecord record : records) {
            Map<String, String> metadata = new HashMap<>();
            metadata.put("source", file.getAbsolutePath());
            metadata.put("row", String.valueOf(i++));
            StringBuilder content = new StringBuilder();
            for (String header : headers) {
                // Include all headers in the metadata.
                metadata.put(header, record.get(header));
                // The document text is one "header: value" line per column.
                content.append(header).append(": ").append(record.get(header)).append("\n");
            }
            documents.add(new Document(content.toString(), Metadata.from(metadata)));
        }
        // ...
    }
}
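To make the row-to-document mapping concrete, here is a minimal, library-free sketch of the text that the loop above builds for each row. The class `CsvRowText` and its methods are hypothetical names used only for illustration; the actual code works on `CSVRecord` and `Document` as shown above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper (hypothetical): renders one CSV row as "header: value" lines. */
final class CsvRowText {

    private CsvRowText() {
    }

    /** Pairs each header with the value at the same position, preserving the header order. */
    static Map<String, String> toFields(List<String> headers, List<String> values) {
        if (headers.size() != values.size()) {
            throw new IllegalArgumentException(
                    "Expected " + headers.size() + " values but got " + values.size());
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < headers.size(); i++) {
            fields.put(headers.get(i), values.get(i));
        }
        return fields;
    }

    /** Builds the document text: one "header: value" line per field, each ending with a newline. */
    static String toContent(Map<String, String> fields) {
        StringBuilder content = new StringBuilder();
        fields.forEach((header, value) -> content.append(header).append(": ").append(value).append("\n"));
        return content.toString();
    }
}
```

For the headers `name` and `price` and a row containing `Espresso` and `2.50`, `toContent` yields the two lines `name: Espresso` and `price: 2.50`. Keeping the header next to each value gives the embedding model some context about what each value means.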
Ingesting the Documents
Once you have the list of documents, they need to be ingested. For this, use a document splitter. We recommend the recursive splitter, a simple splitter that divides the document into chunks of a given size. While it may not be the most suitable splitter for your use case, it serves as a good starting point.
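var ingestor = EmbeddingStoreIngestor.builder()
        .embeddingStore(store) // Injected
        .embeddingModel(embeddingModel) // Injected
        // recursive(...) is statically imported from DocumentSplitters:
        // maximum segment size of 500 characters, no overlap between segments.
        .documentSplitter(recursive(500, 0))
        .build();
ingestor.ingest(documents);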
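The two arguments passed to the splitter are a maximum segment size and an overlap. The following sketch illustrates how these two values interact, using a deliberately naive fixed-size chunker. It is not the library's algorithm: the recursive splitter shipped with the library is more elaborate, and `NaiveChunker` is a hypothetical class written only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only (hypothetical): cuts text into fixed-size chunks with an optional overlap. */
final class NaiveChunker {

    private NaiveChunker() {
    }

    /**
     * Splits the text into chunks of at most maxChars characters.
     * Each chunk starts overlapChars characters before the end of the previous one.
     */
    static List<String> split(String text, int maxChars, int overlapChars) {
        if (maxChars <= 0 || overlapChars < 0 || overlapChars >= maxChars) {
            throw new IllegalArgumentException("Require maxChars > 0 and 0 <= overlapChars < maxChars");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxChars - overlapChars;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // The last chunk reached the end of the text.
            }
        }
        return chunks;
    }
}
```

With a limit of 500 and no overlap, a row whose text is shorter than 500 characters stays in a single chunk, which is often what you want for CSV rows. If your rows are longer, consider a larger limit or a non-zero overlap so that a value is less likely to be cut off from its header.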
Implementing the Retriever
With the documents ingested, you can now implement the retriever:
package io.quarkiverse.langchain4j.sample.chatbot;
import java.util.List;
import jakarta.enterprise.context.ApplicationScoped;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.retriever.EmbeddingStoreRetriever;
import dev.langchain4j.retriever.Retriever;
import io.quarkiverse.langchain4j.redis.RedisEmbeddingStore;
@ApplicationScoped
public class RetrieverExample implements Retriever<TextSegment> {

    private final EmbeddingStoreRetriever retriever;

    RetrieverExample(RedisEmbeddingStore store, EmbeddingModel model) {
        // Limit the number of documents to avoid exceeding the context size.
        retriever = EmbeddingStoreRetriever.from(store, model, 10);
    }

    @Override
    public List<TextSegment> findRelevant(String s) {
        return retriever.findRelevant(s);
    }
}
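Conceptually, the retriever embeds the query, compares it with the stored embeddings, and returns only the closest matches, here at most 10. The sketch below illustrates that idea with an in-memory map and cosine similarity. It is an assumption-laden illustration, not the implementation of `EmbeddingStoreRetriever` or of the Redis store: `TopKSketch`, its methods, and the use of cosine similarity are all chosen for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative only (hypothetical): ranks stored texts by cosine similarity to a query vector. */
final class TopKSketch {

    private TopKSketch() {
    }

    /** Cosine similarity of two vectors of equal length; returns 0 if either vector has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns at most maxResults texts, ordered from most to least similar to the query. */
    static List<String> findRelevant(float[] query, Map<String, float[]> store, int maxResults) {
        // Score every stored text once, then sort by score in descending order.
        Map<String, Double> scores = new HashMap<>();
        store.forEach((text, embedding) -> scores.put(text, cosine(query, embedding)));
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(maxResults)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

The `maxResults` cap plays the same role as the `10` in the retriever above: each returned segment is added to the prompt, so the cap multiplied by the segment size should fit comfortably within the context size of the model.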