
Enable LLM-Driven Data Exploration with Presets and 3W Integration #55

Open · wants to merge 14 commits into main
Conversation

@zRafaF commented Oct 18, 2024

Overview

This PR lays the groundwork for integrating Large Language Models (LLMs) into BibMon. It introduces a client that lets users interact with LLM endpoints for data processing and exploration. As a starting point, we have implemented access to the 3W dataset, enabling the model to infer which column is most relevant to a given event based on the provided data.

Data tailoring is handled through what we call presets, which are located in bibmon/llm/presets. These presets allow users to customize how data is structured before being sent to the model.
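As an illustration, a preset can be thought of as a function that turns a prepared payload into the arguments for a structured chat completion (the usage example below unpacks one with *). The sketch here is hypothetical: the function name, signature, and return shape are assumptions, and the real implementations live in bibmon/llm/presets.

import json

# Hypothetical preset sketch -- not the actual bibmon implementation.
# A preset pairs an instruction prompt with a JSON schema describing
# the structured answer expected from the model.
def my_custom_preset(formatted_data: dict) -> tuple:
    prompt = (
        "Given the event description and per-column statistics below, "
        "identify the column most linked to the event.\n\n"
        + json.dumps(formatted_data)
    )
    response_schema = {
        "type": "object",
        "properties": {
            "column": {"type": "string"},
            "extra": {"type": "string"},
        },
        "required": ["column", "extra"],
    }
    return prompt, response_schema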

Limitations

This feature acts purely as a client, meaning it requires an external endpoint for model interaction. For instance, you can use OpenAI's API or self-host an alternative model, as we have done.
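Because the client only needs a base URL, an API key, and a model name (see the usage example below), switching providers is a one-line change. Both constructor calls here are illustrative; the key and model names are placeholders.

import bibmon.llm

# Hosted provider (placeholder key and model name):
hosted_client = bibmon.llm.Client("https://api.openai.com/v1", "YOUR_API_KEY", "gpt-4o-mini")

# Self-hosted OpenAI-compatible server, as in the usage example below:
local_client = bibmon.llm.Client("http://127.0.0.1:5000/v1", "NO_API_KEY", "Llama3")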

Direct LLM inference within BibMon could be achieved with tools such as llama-cpp-python. However, we avoided this approach to prevent unnecessary complexity and bloat in the library.

While fine-tuning the LLM and writing more precise, dataset-specific instructions are possible, both require a detailed data annotation process. Additional information on this can be found in our auxiliary notebook.

Note: This PR depends on #50.

Implementation Details (3W Dataset Integration)

Data Preset

The data sent to the model follows this structure:

{
    "event_name": "string",
    "event_description": "string",
    "columns_and_description": "dict",
    "data": [
        {
            "event_name": "string",
            "average_values": "string",
            "standard_deviation": "string",
            "head": "string",
            "tail": "string"
        },
        ...
    ]
}
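
For concreteness, a filled-in payload for a single event could look like the following; every value is fabricated for illustration and is not real dataset output:

{
    "event_name": "Hydrate in Production Line",
    "event_description": "Hydrate formation restricting flow in the production line.",
    "columns_and_description": {"P-PDG": "Pressure at the permanent downhole gauge"},
    "data": [
        {
            "event_name": "Hydrate in Production Line",
            "average_values": "P-PDG: 1.2e7",
            "standard_deviation": "P-PDG: 3.4e5",
            "head": "first rows rendered as text",
            "tail": "last rows rendered as text"
        }
    ]
}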

Model Response Format

The model will respond with the following structure:

{
    "column": "key of the column of interest",
    "extra": "additional information deemed relevant by the model"
}
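
A minimal sketch of consuming such a reply, assuming it arrives as a JSON string; in practice bibmon.llm.tools.parse_chat_completion (used in the example below) covers this step, and the raw string here is fabricated:

import json

# Fabricated raw reply following the response format above
raw = '{"column": "P-PDG", "extra": "Pressure deviates sharply near the event."}'

parsed = json.loads(raw)
assert "column" in parsed and "extra" in parsed  # basic shape check
print(parsed["column"])  # P-PDG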

Usage Example

import bibmon
import bibmon.llm
import bibmon.llm.presets
import bibmon.llm.tools
import bibmon.three_w

# Loading the data
df, ini, class_id = bibmon.load_3w(8, "WELL-00026_20170608230000.parquet")

# Pre-process: drop empty variables, forward-fill NaNs,
# and remove frozen variables
pre_processed = bibmon.PreProcess(
    f_pp=["remove_empty_variables", "ffill_nan", "remove_frozen_variables"]
).apply(df)

# Tailor the pre-processed data into the preset structure shown above
formatted_data = bibmon.three_w.tools.format_for_llm_prediction(
    pre_processed, ini, class_id, 10
)

# Create a client for an OpenAI-compatible endpoint
# (here, a self-hosted Llama3 server)
llm_client = bibmon.llm.Client(
    "http://127.0.0.1:5000/v1",
    "NO_API_KEY",
    "Llama3",
)

# Ask the model, via the 3W preset, which column is linked to the event
res = llm_client.chat_completion_json_tool(
    *bibmon.llm.presets.three_w_find_linked_column(formatted_data)
)

# Parse the chat completion response into a Python object
parsed_result = bibmon.llm.tools.parse_chat_completion(res)

print(parsed_result)
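
For reference, parsed_result is a dict following the Model Response Format above, so the final print yields something of this shape (column name and explanation fabricated for illustration):

{'column': 'P-PDG', 'extra': 'The pressure signal shows the clearest deviation during the event.'}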

Additional Resources

For further information and examples, please refer to our detailed notebook showcasing this feature.

zRafaF and others added 14 commits October 15, 2024 13:00
* fixed relative imports

* fixed relative imports

* removed debug prints
- Refactor the LLM class to improve readability and maintainability.
- Add support for JSON tool schema in the `chat_completion_json_tool` method.
- Update the default value of `max_tokens` parameter in the `chat_completion` method.
- Add a new method `parse_chat_completion` to parse the chat completion response.

Fix relative imports in LLM module

- Fix relative imports in the `__init__.py` file of the LLM module.

Add functionality to find linked column in three_w module

- Add a new function `three_w_find_linked_column` in the `presets.py` file of the LLM module.
- Modify the function to accept a JSON dataset and find the column linked to the error.

Add functionality to format dataset for LLM prediction

- Add a new function `format_for_llm_prediction` in the `tools.py` file of the three_w module.
- Modify the function to format the dataset for LLM prediction based on the provided configuration file.

Update formatting and documentation in tools.py

- Update formatting and documentation in the `tools.py` file of the three_w module.
@zRafaF changed the title from "Adding LLM client support to BibMon, implemented 3W integration for finding columns of interest" to "Enable LLM-Driven Data Exploration with Presets and 3W Integration" on Oct 18, 2024