Ollama
Prerequisites
To use Ollama, you need to have a running Ollama server. Go to the Ollama download page and download the server for your platform.
Once installed, check that Ollama is running using:
> ollama list
It may not display any models, which is fine. Let's pull the llama2 model:
> ollama pull llama2
Models are huge. For example, Llama 2 is 3.8 GB. Make sure you have enough disk space.
Let’s also pull the default embedding model:
> ollama pull nomic-embed-text
Dev Service
If you have Ollama running locally, you do not need a dev service. However, if you want to use the Ollama dev service, add the corresponding dev service dependency to your project and configure it in your application.properties file.
The dev service will start an Ollama server for you in a Docker container. Note that the provisioning can take some time.
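Conversely, if you always want to use your locally installed Ollama instance, dev services can be switched off. A minimal sketch using the standard global Quarkus toggle (adjust to your setup):
# Disable all Quarkus dev services, including the Ollama container
quarkus.devservices.enabled=false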
Using Ollama
To integrate with models running on Ollama, add the following dependency into your project:
<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-ollama</artifactId>
    <version>0.13.1</version>
</dependency>
If no other LLM extension is installed, AI Services will automatically use the configured Ollama model.
By default, the extension uses llama2, the model we pulled in the previous section. You can change it by setting the quarkus.langchain4j.ollama.chat-model.model-id property in the application.properties file:
# Do not forget to pull the model before using it using `ollama pull <model-id>`
quarkus.langchain4j.ollama.chat-model.model-id=mistral
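For illustration, here is a minimal sketch of an AI Service backed by the configured Ollama chat model. The Assistant interface, its package, and its prompt are assumptions for the example, not part of the extension:
package org.acme; // hypothetical package

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// Registers an AI Service; with only the Ollama extension installed,
// it is wired to the Ollama chat model configured above.
@RegisterAiService
public interface Assistant {

    @SystemMessage("You are a concise assistant.")
    String chat(@UserMessage String question);
}
Injecting Assistant into any CDI bean and calling chat(...) sends the prompt to the configured Ollama server.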
Configuration
Several configuration properties are available:
Configuration properties fixed at build time cannot be changed at runtime; all other configuration properties are overridable at runtime.

Configuration property | Description | Type | Default
---|---|---|---
quarkus.langchain4j.ollama.chat-model.enabled | Whether the chat model should be enabled (fixed at build time) | boolean | true
quarkus.langchain4j.ollama.embedding-model.enabled | Whether the embedding model should be enabled (fixed at build time) | boolean | true
quarkus.langchain4j.ollama.base-url | Base URL where the Ollama server is running | string | http://localhost:11434
quarkus.langchain4j.ollama.timeout | Timeout for Ollama calls | Duration | 10s
quarkus.langchain4j.ollama.log-requests | Whether the Ollama client should log requests | boolean | false
quarkus.langchain4j.ollama.log-responses | Whether the Ollama client should log responses | boolean | false
quarkus.langchain4j.ollama.enable-integration | Whether to enable the integration | boolean | true
quarkus.langchain4j.ollama.chat-model.model-id | Model to use | string | llama2
quarkus.langchain4j.ollama.chat-model.temperature | The temperature of the model. Increasing the temperature makes the model answer with more variability; a lower temperature makes it answer more conservatively | double | 0.8
quarkus.langchain4j.ollama.chat-model.num-predict | Maximum number of tokens to predict when generating text | int | 128
quarkus.langchain4j.ollama.chat-model.stop | Stop sequences to use; when such a sequence is encountered, the LLM stops generating text and returns | list of string |
quarkus.langchain4j.ollama.chat-model.top-p | Works together with top-k. A higher value (e.g., 0.95) leads to more diverse text, while a lower value (e.g., 0.5) generates more focused and conservative text | double | 0.9
quarkus.langchain4j.ollama.chat-model.top-k | Reduces the probability of generating nonsense. A higher value (e.g., 100) gives more diverse answers, while a lower value (e.g., 10) is more conservative | int | 40
quarkus.langchain4j.ollama.chat-model.seed | With a static number the result is always the same; with a random number the result varies | int |
quarkus.langchain4j.ollama.chat-model.format | The format to return a response in. Currently, the only accepted value is json | string |
quarkus.langchain4j.ollama.chat-model.log-requests | Whether chat model requests should be logged | boolean | false
quarkus.langchain4j.ollama.chat-model.log-responses | Whether chat model responses should be logged | boolean | false
quarkus.langchain4j.ollama.embedding-model.model-id | Model to use | string | nomic-embed-text
quarkus.langchain4j.ollama.embedding-model.temperature | The temperature of the model (see above) | double | 0.8
quarkus.langchain4j.ollama.embedding-model.num-predict | Maximum number of tokens to predict when generating text | int | 128
quarkus.langchain4j.ollama.embedding-model.stop | Stop sequences to use | list of string |
quarkus.langchain4j.ollama.embedding-model.top-p | Works together with top-k (see above) | double | 0.9
quarkus.langchain4j.ollama.embedding-model.top-k | Reduces the probability of generating nonsense (see above) | int | 40
quarkus.langchain4j.ollama.embedding-model.log-requests | Whether embedding model requests should be logged | boolean | false
quarkus.langchain4j.ollama.embedding-model.log-responses | Whether embedding model responses should be logged | boolean | false
The same runtime properties are also available per named model configuration, using the quarkus.langchain4j.ollama."model-name" prefix instead of quarkus.langchain4j.ollama (for example, quarkus.langchain4j.ollama."model-name".chat-model.model-id). The build-time enabled properties exist only for the default configuration.
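As an illustration, the default chat model could be tuned in application.properties like this (property names as listed in the table above; the values are arbitrary examples):
# Make answers more deterministic and cap generation length
quarkus.langchain4j.ollama.chat-model.temperature=0.2
quarkus.langchain4j.ollama.chat-model.num-predict=256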
About the Duration format
To write duration values, use the standard java.time.Duration format. See the Duration#parse() Java API documentation for more information.
You can also use a simplified format, starting with a number:
If the value is only a number, it represents time in seconds.
If the value is a number followed by ms, it represents time in milliseconds.
In other cases, the simplified format is translated to the java.time.Duration format for parsing:
If the value is a number followed by h, m, or s, it is prefixed with PT.
If the value is a number followed by d, it is prefixed with P.
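For example, a 60-second timeout for Ollama calls can be written in either style (property name taken from the configuration reference above):
quarkus.langchain4j.ollama.timeout=60s
# or, using the standard Duration format:
# quarkus.langchain4j.ollama.timeout=PT60S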
Document Retriever and Embedding
Ollama also provides embedding models.
By default, it uses nomic-embed-text (make sure you pulled that model as indicated in the prerequisites section).
You can change the default embedding model by setting the quarkus.langchain4j.ollama.embedding-model.model-id property in the application.properties file:
quarkus.langchain4j.ollama.log-requests=true
quarkus.langchain4j.ollama.log-responses=true
quarkus.langchain4j.ollama.chat-model.model-id=mistral
quarkus.langchain4j.ollama.embedding-model.model-id=mistral
If no other LLM extension is installed, retrieve the embedding model as follows:
@Inject EmbeddingModel model; // Injects the embedding model
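For illustration, a minimal sketch of a bean that embeds a piece of text with the injected model; the EmbeddingExample class and its method are assumptions for the example:
package org.acme; // hypothetical package

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class EmbeddingExample {

    @Inject
    EmbeddingModel model; // backed by the Ollama embedding model configured above

    public float[] embed(String text) {
        // embed(...) returns a Response<Embedding>; content() unwraps the embedding
        Embedding embedding = model.embed(text).content();
        return embedding.vector();
    }
}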
However, in general, we recommend using local embedding models, as Ollama embeddings are rather slow.