examples/server/README.md (+52 -8)

@@ -1,13 +1,13 @@
 # llama.cpp/example/server
 
-This example demonstrates a simple HTTP API server to interact with llama.cpp.
+This example demonstrates a simple HTTP API server and a simple web front end to interact with llama.cpp.
 
 Command line options:
 
 - `--threads N`, `-t N`: Set the number of threads to use during computation.
 - `-m FNAME`, `--model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
 - `-a ALIAS`, `--alias ALIAS`: Set an alias for the model. The alias will be returned in API responses.
-- `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+- `-c N`, `--ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference. The size may differ in other models; for example, baichuan models were built with a context of 4096.
 - `-ngl N`, `--n-gpu-layers N`: When compiled with appropriate support (currently CLBlast or cuBLAS), this option allows offloading some layers to the GPU for computation. Generally results in increased performance.
 - `-mg i`, `--main-gpu i`: When using multiple GPUs, this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used. Requires cuBLAS.
 - `-ts SPLIT`, `--tensor-split SPLIT`: When using multiple GPUs, this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM, but this may not be optimal for performance. Requires cuBLAS.
@@ -21,24 +21,22 @@ Command line options:
 - `-to N`, `--timeout N`: Server read/write timeout in seconds. Default: `600`.
 - `--host`: Set the hostname or IP address to listen on. Default: `127.0.0.1`.
 - `--port`: Set the port to listen on. Default: `8080`.
+- `--path`: Path from which to serve static files. Default: `examples/server/public`.
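For illustration, the options listed above can be combined into a single launch command. The following is a minimal sketch rather than part of the change itself; the binary name `./server`, the model path, and the chosen values are assumptions.

```sh
# Minimal sketch: binary name, model path, and values are placeholders.
# Load a 7B model with a 2048-token context and 4 threads, and serve the
# bundled web front end on the default host and port.
./server \
  -m models/7B/ggml-model.bin \
  -c 2048 \
  -t 4 \
  --host 127.0.0.1 \
  --port 8080 \
  --path ./examples/server/public
```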
 Then you can utilize llama.cpp as an OpenAI-compatible **chat.completion** or **text_completion** API
+
+### Extending or building alternative Web Front End
+
+The default location for the static files is `examples/server/public`. You can extend the front end by running the server binary with `--path` set to `./your-directory` and importing `/completion.js` to get access to the `llamaComplete()` method.
+
+Read the documentation in `/completion.js` to see convenient ways to access llama.
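To make the extension workflow above concrete, here is a hedged sketch: point `--path` at a directory of your own whose pages import `/completion.js`. The directory name and layout below are hypothetical.

```sh
# Hypothetical layout: ./your-directory/index.html loads /completion.js
# (served by the server itself) and calls llamaComplete() from there.
./server -m models/7B/ggml-model.bin --path ./your-directory
```

The server then serves static files from `./your-directory` instead of `examples/server/public`, while the API endpoints stay the same.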
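The front end accesses llama through the server's HTTP API, so the same functionality can also be reached with a plain HTTP client. The sketch below assumes the default host and port and a JSON `/completion` endpoint with `prompt` and `n_predict` fields; treat the endpoint name and fields as assumptions and confirm them against the full README.

```sh
# Hedged example: request a completion directly over HTTP.
# The endpoint name and JSON fields are assumptions; verify them against
# the server documentation before relying on them.
curl --request POST \
  --url http://127.0.0.1:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'
```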