
Enable LLM-Driven Data Exploration with Presets and 3W Integration #55

Open · wants to merge 14 commits into main
Conversation

@zRafaF commented Oct 18, 2024

Overview

This PR lays the groundwork for integrating Large Language Models (LLMs) into BibMon. It introduces a client that lets users interact with LLM endpoints for data processing and exploration. As a starting point, we have implemented access to the 3W dataset, enabling the model to infer which column is most relevant to a given event based on the provided data.

Data tailoring is handled through what we call presets, which are located in bibmon/llm/presets. These presets allow users to customize how data is structured before being sent to the model.
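As an illustration, a preset can be thought of as a function that turns a prepared payload into the arguments for a structured chat completion (the usage example below unpacks one with *). The sketch here is hypothetical: the function name, signature, and return shape are assumptions, and the real implementations live in bibmon/llm/presets.

import json

# Hypothetical preset sketch -- not the actual bibmon implementation.
# A preset pairs an instruction prompt with a JSON schema describing
# the structured answer expected from the model.
def my_custom_preset(formatted_data: dict) -> tuple:
    prompt = (
        "Given the event description and per-column statistics below, "
        "identify the column most linked to the event.\n\n"
        + json.dumps(formatted_data)
    )
    response_schema = {
        "type": "object",
        "properties": {
            "column": {"type": "string"},
            "extra": {"type": "string"},
        },
        "required": ["column", "extra"],
    }
    return prompt, response_schema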

Limitations

This feature acts purely as a client, meaning it requires an external endpoint for model interaction. For instance, you can use OpenAI's API or self-host an alternative model, as we have done.
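Because the client only needs a base URL, an API key, and a model name (see the usage example below), switching providers is a one-line change. Both constructor calls here are illustrative; the key and model names are placeholders.

import bibmon.llm

# Hosted provider (placeholder key and model name):
hosted_client = bibmon.llm.Client("https://api.openai.com/v1", "YOUR_API_KEY", "gpt-4o-mini")

# Self-hosted OpenAI-compatible server, as in the usage example below:
local_client = bibmon.llm.Client("http://127.0.0.1:5000/v1", "NO_API_KEY", "Llama3")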

Direct LLM inference within BibMon could be achieved with tools such as llama-cpp-python. However, we avoided this approach to prevent unnecessary complexity and bloat in the library.

While fine-tuning the LLM and writing more precise, dataset-specific instructions are possible, both require a detailed data annotation process. Additional information on this can be found in our auxiliary notebook.

Note: This PR depends on #50.

Implementation Details (3W Dataset Integration)

Data Preset

The data sent to the model follows this structure:

{
    "event_name": "string",
    "event_description": "string",
    "columns_and_description": "dict",
    "data": [
        {
            "event_name": "string",
            "average_values": "string",
            "standard_deviation": "string",
            "head": "string",
            "tail": "string"
        },
        ...
    ]
}
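
For concreteness, a filled-in payload for a single event could look like the following; every value is fabricated for illustration and is not real dataset output:

{
    "event_name": "Hydrate in Production Line",
    "event_description": "Hydrate formation restricting flow in the production line.",
    "columns_and_description": {"P-PDG": "Pressure at the permanent downhole gauge"},
    "data": [
        {
            "event_name": "Hydrate in Production Line",
            "average_values": "P-PDG: 1.2e7",
            "standard_deviation": "P-PDG: 3.4e5",
            "head": "first rows rendered as text",
            "tail": "last rows rendered as text"
        }
    ]
}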

Model Response Format

The model will respond with the following structure:

{
    "column": "key of the column of interest",
    "extra": "additional information deemed relevant by the model"
}
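
A minimal sketch of consuming such a reply, assuming it arrives as a JSON string; in practice bibmon.llm.tools.parse_chat_completion (used in the example below) covers this step, and the raw string here is fabricated:

import json

# Fabricated raw reply following the response format above
raw = '{"column": "P-PDG", "extra": "Pressure deviates sharply near the event."}'

parsed = json.loads(raw)
assert "column" in parsed and "extra" in parsed  # basic shape check
print(parsed["column"])  # P-PDG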

Usage Example

import bibmon
import bibmon.llm
import bibmon.llm.presets
import bibmon.llm.tools
import bibmon.three_w

# Loading the data
df, ini, class_id = bibmon.load_3w(8, "WELL-00026_20170608230000.parquet")

# Pre-process: drop empty variables, forward-fill NaNs,
# and remove frozen variables
pre_processed = bibmon.PreProcess(
    f_pp=["remove_empty_variables", "ffill_nan", "remove_frozen_variables"]
).apply(df)

# Tailor the pre-processed data into the preset structure shown above
formatted_data = bibmon.three_w.tools.format_for_llm_prediction(
    pre_processed, ini, class_id, 10
)

# Create a client for an OpenAI-compatible endpoint
# (here, a self-hosted Llama3 server)
llm_client = bibmon.llm.Client(
    "http://127.0.0.1:5000/v1",
    "NO_API_KEY",
    "Llama3",
)

# Ask the model, via the 3W preset, which column is linked to the event
res = llm_client.chat_completion_json_tool(
    *bibmon.llm.presets.three_w_find_linked_column(formatted_data)
)

# Parse the chat completion response into a Python object
parsed_result = bibmon.llm.tools.parse_chat_completion(res)

print(parsed_result)
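
For reference, parsed_result is a dict following the Model Response Format above, so the final print yields something of this shape (column name and explanation fabricated for illustration):

{'column': 'P-PDG', 'extra': 'The pressure signal shows the clearest deviation during the event.'}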

Additional Resources

For further information and examples, please refer to our detailed notebook showcasing this feature.

zRafaF and others added 14 commits October 15, 2024 13:00
* fixed relative imports

* fixed relative imports

* removed debug prints
- Refactor the LLM class to improve readability and maintainability.
- Add support for JSON tool schema in the `chat_completion_json_tool` method.
- Update the default value of `max_tokens` parameter in the `chat_completion` method.
- Add a new method `parse_chat_completion` to parse the chat completion response.

Fix relative imports in LLM module

- Fix relative imports in the `__init__.py` file of the LLM module.

Add functionality to find linked column in three_w module

- Add a new function `three_w_find_linked_column` in the `presets.py` file of the LLM module.
- Modify the function to accept a JSON dataset and find the column linked to the error.

Add functionality to format dataset for LLM prediction

- Add a new function `format_for_llm_prediction` in the `tools.py` file of the three_w module.
- Modify the function to format the dataset for LLM prediction based on the provided configuration file.

Update formatting and documentation in tools.py

- Update formatting and documentation in the `tools.py` file of the three_w module.
@zRafaF changed the title from "Adding LLM client support to BibMon, implemented 3W integration for finding columns of interest" to "Enable LLM-Driven Data Exploration with Presets and 3W Integration" on Oct 18, 2024