Completion results are worse with Windows line endings (CRLF) #3279

Open
kleincode opened this issue Oct 16, 2024 · 6 comments · May be fixed by #3303
Labels
enhancement New feature or request

@kleincode

Describe the bug

Tabby is running on a Linux machine, but I'm using it through the VSCode extension on Windows. When I edit a document with CRLF line endings (the standard on Windows), code completion returns an empty string or unfitting suggestions substantially more often. When I change the line endings to LF, the completion suggestions improve.

Looking at the completion events, I can see that the line endings are fed directly to the model: they are preserved in the prompt field. Thus, the same piece of code will produce different results by default on Windows and on Linux/Mac machines. I have tested this with different variants of CodeLlama.

I'd like to ask whether this is intended. I would suggest feeding only LF line endings to the model, because in my experiments the results were consistently better than with CRLF. Theoretically, this makes sense, since the models are probably trained mostly on LF data.

Information about your version

tabby 0.18.0 via Docker Compose on Debian with CUDA

Information about your GPU

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     On  |   00000000:B3:00.0 Off |                    0 |
|  0%   65C    P0             71W /  150W |    8653MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    975463      C   /usr/bin/python3                             1394MiB |
|    0   N/A  N/A    976533      C   /opt/tabby/bin/llama-server                   786MiB |
|    0   N/A  N/A    980188      C   /opt/tabby/bin/llama-server                  6454MiB |
+-----------------------------------------------------------------------------------------+

Additional context

Minimal example

Sending the Swagger example to the completions endpoint works perfectly fine:

{
  "language": "python",
  "segments": {
    "prefix": "def fib(n):\n    ",
    "suffix": "\n        return fib(n - 1) + fib(n - 2)"
  }
}
{
  "id": "cmpl-b2f200d4-8cad-4bcc-98d0-18c2d5263fbc",
  "choices": [
    {
      "index": 0,
      "text": "if n <= 1:"
    }
  ]
}

But the same request with CRLF returns an empty string:

{
  "language": "python",
  "segments": {
    "prefix": "def fib(n):\r\n    ",
    "suffix": "\r\n        return fib(n - 1) + fib(n - 2)"
  }
}
{
  "id": "cmpl-315f7020-6435-409e-b257-08fcf4c3f77b",
  "choices": [
    {
      "index": 0,
      "text": ""
    }
  ]
}
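
For reference, the minimal example above can be replayed against a local Tabby instance with curl; the host and port are assumptions for a default local setup, and an Authorization header may be needed if the server requires a token:

curl http://localhost:8080/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"language": "python", "segments": {"prefix": "def fib(n):\r\n    ", "suffix": "\r\n        return fib(n - 1) + fib(n - 2)"}}'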

Real example 1

Some examples of the exact same code leading to good suggestions with LF, but no suggestions with CRLF:

{
  "completion": {
    "completion_id": "cmpl-ffa784c2-4f2d-4502-8606-143ff2c2bec7",
    "language": "java",
    "prompt": "<PRE> // Path: file:///c%3A/Users/x/Documents/temp/HelloWorld.java\n//     int fib(int n) {\n//         if (n <= 1) {\n//             return n;\n//         } else {\n//             return fib(n - 1) + fib(n - 2);\n//         }\n//     }\n//\n//     int sum(int n, int m) {\n//         return n + m;\n//     }\n//\n//     int factorial(int n) {\n//         if (n <= 1) {\n//             return 1;\n//         } else {\n//             return n * factorial(n - 1);\n//         }\n//     }\n//\n//     void printPyramid(int height) {\n//         for (int i = 0; i < height; i++) {\n//             for (int j = 0; j < height - i; j++) {\n//\n// Path: file:///c%3A/Users/x/Documents/temp/HelloWorld.java\n//             for (int j = 0; j < height - i; j++) {\n//                 System.out.print(\" \");\n//             }\n//             for (int j = 0; j <= i; j++) {\n//                 System.out.print(\"*\");\n//             }\n//             System.out.println(\"\");\n//         }\n//     }\n//\nclass MathUtils {\r\n    static double clamp(double value, double min, double max) {\r\n        return math.max(min, math.min(max, value));\r\n    }\r\n\r\n    static double lerp(double a, double b, double t) {\r\n        return a + (b - a) * t;\r\n    }\r\n    \r\n    static int fibonacci(int n) {\r\n         <SUF>\r\n    }\r\n} <MID>",
    "segments": {
      "prefix": "class MathUtils {\r\n    static double clamp(double value, double min, double max) {\r\n        return math.max(min, math.min(max, value));\r\n    }\r\n\r\n    static double lerp(double a, double b, double t) {\r\n        return a + (b - a) * t;\r\n    }\r\n    \r\n    static int fibonacci(int n) {\r\n        ",
      "suffix": "\r\n    }\r\n}"
    },
    "choices": [
      {
        "index": 0,
        "text": ""
      }
    ],
    "user_agent": "Node.js/v20.16.0 tabby-agent/1.8.0-dev Visual-Studio-Code-desktop/1.94.2 TabbyML.vscode-tabby/1.12.2"
  }
}

With LF:

{
  "completion": {
    "completion_id": "cmpl-f35217b2-6e40-421f-9d74-77bb0ff7c0ba",
    "language": "java",
    "prompt": "<PRE> // Path: file:///c%3A/Users/x/Documents/temp/HelloWorld.java\n//     int fib(int n) {\n//         if (n <= 1) {\n//             return n;\n//         } else {\n//             return fib(n - 1) + fib(n - 2);\n//         }\n//     }\n//\n//     int sum(int n, int m) {\n//         return n + m;\n//     }\n//\n//     int factorial(int n) {\n//         if (n <= 1) {\n//             return 1;\n//         } else {\n//             return n * factorial(n - 1);\n//         }\n//     }\n//\n//     void printPyramid(int height) {\n//         for (int i = 0; i < height; i++) {\n//             for (int j = 0; j < height - i; j++) {\n//\n// Path: file:///c%3A/Users/x/Documents/temp/HelloWorld.java\n//             for (int j = 0; j < height - i; j++) {\n//                 System.out.print(\" \");\n//             }\n//             for (int j = 0; j <= i; j++) {\n//                 System.out.print(\"*\");\n//             }\n//             System.out.println(\"\");\n//         }\n//     }\n//\nclass MathUtils {\n    static double clamp(double value, double min, double max) {\n        return math.max(min, math.min(max, value));\n    }\n\n    static double lerp(double a, double b, double t) {\n        return a + (b - a) * t;\n    }\n    \n    static int fibonacci(int n) {\n         <SUF>\n    }\n} <MID>",
    "segments": {
      "prefix": "class MathUtils {\n    static double clamp(double value, double min, double max) {\n        return math.max(min, math.min(max, value));\n    }\n\n    static double lerp(double a, double b, double t) {\n        return a + (b - a) * t;\n    }\n    \n    static int fibonacci(int n) {\n        ",
      "suffix": "\n    }\n}"
    },
    "choices": [
      {
        "index": 0,
        "text": "if (n <= 1) {\n            return n;\n        } else {\n            return fibonacci(n - 1) + fibonacci(n - 2);\n        }\n    }\n    \n    static int factorial(int n) {\n        if (n"
      }
    ],
    "user_agent": "Node.js/v20.16.0 tabby-agent/1.8.0-dev Visual-Studio-Code-desktop/1.94.2 TabbyML.vscode-tabby/1.12.2"
  }
}

Real example 2

Another example with CRLF:

{
  "completion": {
    "completion_id": "cmpl-e4c9f951-de3b-4b09-bdea-e0e1acf0827e",
    "language": "java",
    "prompt": "<PRE> // Path: file:///c%3A/Users/x/Documents/temp/HelloWorld.java\n//     int fib(int n) {\n//         if (n <= 1) {\n//             return n;\n//         } else {\n//             return fib(n - 1) + fib(n - 2);\n//         }\n//     }\n//\n//     int sum(int n, int m) {\n//         return n + m;\n//     }\n//\n//     int factorial(int n) {\n//         if (n <= 1) {\n//             return 1;\n//         } else {\n//             return n * factorial(n - 1);\n//         }\n//     }\n//\n//     void printPyramid(int height) {\n//         for (int i = 0; i < height; i++) {\n//             for (int j = 0; j < height - i; j++) {\nclass Fibonacci {\r\n    int fib(int n) {\r\n        if (n <= 1) {\r\n            return n;\r\n        }\r\n        return fib(n - 1) + fib(n - 2);\r\n    }\r\n\r\n    int fibIterative(int n) {\r\n         <SUF>\r\n    }\r\n} <MID>",
    "segments": {
      "prefix": "class Fibonacci {\r\n    int fib(int n) {\r\n        if (n <= 1) {\r\n            return n;\r\n        }\r\n        return fib(n - 1) + fib(n - 2);\r\n    }\r\n\r\n    int fibIterative(int n) {\r\n        ",
      "suffix": "\r\n    }\r\n}"
    },
    "choices": [
      {
        "index": 0,
        "text": ""
      }
    ],
    "user_agent": "Node.js/v20.16.0 tabby-agent/1.8.0-dev Visual-Studio-Code-desktop/1.94.2 TabbyML.vscode-tabby/1.12.2"
  }
}

After changing the line endings from CRLF to LF:

{
  "completion": {
    "completion_id": "cmpl-f107d218-dbc5-4fe3-a40d-2fa2aa82aa5b",
    "language": "java",
    "prompt": "<PRE> // Path: file:///c%3A/Users/x/Documents/temp/HelloWorld.java\n//     int fib(int n) {\n//         if (n <= 1) {\n//             return n;\n//         } else {\n//             return fib(n - 1) + fib(n - 2);\n//         }\n//     }\n//\n//     int sum(int n, int m) {\n//         return n + m;\n//     }\n//\n//     int factorial(int n) {\n//         if (n <= 1) {\n//             return 1;\n//         } else {\n//             return n * factorial(n - 1);\n//         }\n//     }\n//\n//     void printPyramid(int height) {\n//         for (int i = 0; i < height; i++) {\n//             for (int j = 0; j < height - i; j++) {\nclass Fibonacci {\n    int fib(int n) {\n        if (n <= 1) {\n            return n;\n        }\n        return fib(n - 1) + fib(n - 2);\n    }\n\n    int fibIterative(int n) {\n         <SUF>\n    }\n} <MID>",
    "segments": {
      "prefix": "class Fibonacci {\n    int fib(int n) {\n        if (n <= 1) {\n            return n;\n        }\n        return fib(n - 1) + fib(n - 2);\n    }\n\n    int fibIterative(int n) {\n        ",
      "suffix": "\n    }\n}"
    },
    "choices": [
      {
        "index": 0,
        "text": "int a = 0;\n        int b = 1;\n        for (int i = 0; i < n; i++) {\n            int c = a + b;\n            a = b;\n            b = c;\n        }\n        return b;\n    }\n"
      }
    ],
    "user_agent": "Node.js/v20.16.0 tabby-agent/1.8.0-dev Visual-Studio-Code-desktop/1.94.2 TabbyML.vscode-tabby/1.12.2"
  }
}
@wsxiaoys
Member

Thank you for the detailed bug report. My initial thought is that this issue stems from a limitation of the model rather than from any manipulation within Tabby's layer.

One approach you could take is to submit a relevant completion request at https://demo.tabbyml.com/swagger-ui/#/v1/completion, which is powered by a much more powerful model (>100B parameters). Based on my testing, requests with CRLF seem to work just fine.

@kleincode
Author

kleincode commented Oct 16, 2024

Hi, thanks for the quick reply! You are right, it works more often with a more powerful model. My minimal example and real example 1 work well on the test instance, even with CRLF. But with real example 2, there is still some weird output with CRLF only:

Input from example 2 (CRLF):

{
  "language": "java",
  "segments": {
    "prefix": "class Fibonacci {\r\n    int fib(int n) {\r\n        if (n <= 1) {\r\n            return n;\r\n        }\r\n        return fib(n - 1) + fib(n - 2);\r\n    }\r\n\r\n    int fibIterative(int n) {\r\n        ",
    "suffix": "\r\n    }\r\n}"
  }
}

Output (CRLF):

{
  "id": "cmpl-1c3809e9-03f1-4edb-a22a-f4fdd1f706ef",
  "choices": [
    {
      "index": 0,
      "text": "if (n <= 1)bolds\n            return nbolds\nbolds\n        int fib = 1bolds\n        int prevFib = 1bolds\nbolds\n        for (int i = 2; i < n; i++)bolds\n        {bolds\n            int temp = fibbolds\n            fib += prevFibbolds\n            prevFib = tempbolds\n        }bolds\nbolds\n        return fibbolds\n    }bolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds\nbolds"
    }
  ]
}

This happens repeatedly. With LF, it works fine.

@wsxiaoys
Member

I feel this is more or less related to how the model's training data was processed. One way to solve it would be to replace CRLF with LF in requests within Tabby, and then replace it back in responses.
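
A minimal sketch of what that round trip could look like; this is purely illustrative rather than Tabby's actual code, and the helper names and the CRLF-detection heuristic are assumptions:

// Normalize CRLF to LF before building the prompt, remember whether the
// original segments used CRLF, and restore CRLF in the generated text.
fn normalize_to_lf(s: &str) -> String {
    s.replace("\r\n", "\n")
}

struct NormalizedSegments {
    prefix: String,
    suffix: String,
    uses_crlf: bool,
}

fn normalize_segments(prefix: &str, suffix: &str) -> NormalizedSegments {
    let uses_crlf = prefix.contains("\r\n") || suffix.contains("\r\n");
    NormalizedSegments {
        prefix: normalize_to_lf(prefix),
        suffix: normalize_to_lf(suffix),
        uses_crlf,
    }
}

fn restore_line_endings(generated: &str, uses_crlf: bool) -> String {
    if uses_crlf {
        // Normalize first so any CRLF already in the generated text is not doubled.
        normalize_to_lf(generated).replace('\n', "\r\n")
    } else {
        generated.to_string()
    }
}

Normalizing before prompt construction would also keep the repository/LSP context consistent with the rest of the prompt, since everything the model sees would use LF.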

@wsxiaoys added the enhancement (New feature or request) label and removed the bug-unconfirmed label on Oct 16, 2024
@wsxiaoys
Member

wsxiaoys commented Oct 17, 2024

Additional notes: https://demo.tabbyml.com/files/github/TabbyML/tabby/-/search/main?q=join%5C%5C(%5C%22%5C%5C%5C%5Cn%5C%22%5C%5C)%20lang%3Arust

Inserted context (e.g. from LSP / repository context) is currently constructed solely with LF, which might make the situation worse when the surrounding document uses CRLF.

@zwpaper
Member

zwpaper commented Oct 22, 2024

Hi @kleincode, which model are you using that most reliably reproduces this case?

@kleincode
Author

Hi @zwpaper, I can reproduce all my examples using CodeLlama-7B. It usually fails to give proper predictions as soon as there are 2-3 CRLF line breaks in the input. I would assume, though, that for any LLM you can find examples where replacing LF with CRLF (or even mixing them) leads to different, and usually worse, results.
