Add query_dataframe function to be able to then use the labels in pandas #8

n-peugnet · 2025-11-04T12:36:03Z

This is especially useful for using the labels later in plots, for example to group by configuration and such.

dcoles · 2025-11-04T22:39:11Z

Thanks for the PR!

I'll need to think a little about this, since the query API can return several different data-types depending on the query. The Prometheus.query and Prometheus.query_range methods currently map directly to the matching APIs, though I can understand why a table format might be preferred.

Query returning instant vector

node_cpu_seconds_total{mode=~"user|system"}

`query`

Returns: <class 'pandas.core.series.Series'>

node_cpu_seconds_total{cpu="0",instance="localhost:9100",job="node",mode="system"}    25504.27
node_cpu_seconds_total{cpu="0",instance="localhost:9100",job="node",mode="user"}      46959.06
dtype: float64

`query_dataframe`

~~Returns: <class 'pandas.core.series.Series'> ← Should have been a DataFrame?~~
Returns: <class 'pandas.core.frame.DataFrame'>

                 __name__ cpu        instance   job    mode     value
0  node_cpu_seconds_total   0  localhost:9100  node  system  25504.27
1  node_cpu_seconds_total   0  localhost:9100  node    user  46959.06

Query returning range vector

node_cpu_seconds_total{mode=~"user|system"}[1m]

`query`

Returns: <class 'pandas.core.frame.DataFrame'>

                               node_cpu_seconds_total{cpu="0",instance="localhost:9100",job="node",mode="system"}  node_cpu_seconds_total{cpu="0",instance="localhost:9100",job="node",mode="user"}
2025-11-04 22:10:26.881000042                                           25503.80                                                                            46958.37
2025-11-04 22:10:41.881000042                                           25503.98                                                                            46958.58
2025-11-04 22:10:56.881000042                                           25504.10                                                                            46958.80
2025-11-04 22:11:11.881000042                                           25504.27                                                                            46959.06

`query_dataframe`

KeyError: 'value'

Query returning scalar

scalar(sum(node_cpu_seconds_total{mode=~"user|system"}))

`query`

Returns: <class 'numpy.ndarray'>

[1.76229428e+09 7.24633300e+04]

`query_dataframe`

TypeError: 'float' object is not subscriptable
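To make the three cases above concrete, here is a minimal sketch of a table conversion that accepts only instant vectors and fails with a clear message for the other result types. The helper name `vector_to_frame` and the sample payload are hypothetical and are not part of the library or of this PR; the payload shape follows the `data` object of the Prometheus HTTP API query response.

```python
import pandas as pd


def vector_to_frame(data: dict) -> pd.DataFrame:
    """Hypothetical helper: one row per series, one column per label."""
    result_type = data["resultType"]
    if result_type != "vector":
        # matrix, scalar and string results have a different shape,
        # so refuse them explicitly instead of failing with KeyError.
        raise TypeError(f"expected an instant vector, got {result_type!r}")
    rows = [
        {**r["metric"], "value": float(r["value"][1])}
        for r in data["result"]
    ]
    return pd.DataFrame(rows)


# Hand-written sample payloads (assumed, for illustration only).
vector = {
    "resultType": "vector",
    "result": [
        {"metric": {"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "system"},
         "value": [1762294271.881, "25504.27"]},
        {"metric": {"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "user"},
         "value": [1762294271.881, "46959.06"]},
    ],
}
scalar = {"resultType": "scalar", "result": [1762294271.881, "72463.33"]}

frame = vector_to_frame(vector)
print(frame)
```

Calling `vector_to_frame(scalar)` raises `TypeError` up front, rather than the less helpful errors shown above.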

n-peugnet · 2025-11-05T12:53:28Z

> I'll need to think a little about this, since the query API can return several different data-types depending on the query.

Ah, yes you are right, I didn't pay attention to this and thought the basic query always returned a vector. 😅

> Query returning instant vector
> node_cpu_seconds_total{mode=~"user|system"}
> [...]
>
> query_dataframe
>
> Returns: <class 'pandas.core.series.Series'> ← Should have been a DataFrame?

This is strange, considering I'm calling the pandas.DataFrame constructor.

For the rest of your reply, my bad, I was not thinking about these types of query. But I am completely open to another API, provided that I am able to get this kind of result for the basic vector query, as you showed it:

                 __name__ cpu        instance   job    mode     value
0  node_cpu_seconds_total   0  localhost:9100  node  system  25504.27
1  node_cpu_seconds_total   0  localhost:9100  node    user  46959.06

dcoles · 2025-11-06T07:56:00Z

> This is strange, considering I'm calling the pandas.DataFrame constructor.

Looks like it was a copy-paste bug in my test script. Sorry to cause confusion. 🙇

I think I now understand what's going on and how I tried to tackle it when first writing the library.
The Prometheus selector is effectively the index, but dictionaries can't be used as indexes, which is why I flattened it to a string matching what you would see in the Prometheus query console.
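As a rough illustration of the flattening described above, the sketch below turns a label dictionary into the selector string shown in the Prometheus query console. The function name `selector_string` is hypothetical, this is not the library's actual implementation, and it does not escape quotes or backslashes inside label values.

```python
def selector_string(metric: dict) -> str:
    """Hypothetical helper: flatten a label dict into a selector string."""
    labels = dict(metric)                # copy so the caller's dict is untouched
    name = labels.pop("__name__", "")    # the metric name goes before the braces
    body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{body}}}" if body else name


metric = {
    "__name__": "node_cpu_seconds_total",
    "cpu": "0",
    "instance": "localhost:9100",
    "job": "node",
    "mode": "system",
}
flat = selector_string(metric)
print(flat)
```

The string is hashable and therefore usable as an index label, but the individual labels are no longer available as separate values, which is the limitation this PR is about.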

Potentially the right tool would have been a MultiIndex. That would look like the following:

Query returning instant vector

node_cpu_seconds_total{mode=~"user|system"}

`query`

Returns: <class 'pandas.core.series.Series'>

__name__                cpu  instance        job   mode    @
node_cpu_seconds_total  0    localhost:9100  node  system  2025-11-01    22905.89
                                                   user    2025-11-01    42495.91
dtype: float64

Query returning range vector

node_cpu_seconds_total{mode=~"user|system"}[1m]

`query`

Returns: <class 'pandas.core.series.Series'> ← Yes! This is actually also a Series and not another typo!

__name__                cpu  instance        job   mode    @
node_cpu_seconds_total  0    localhost:9100  node  system  2025-10-31 23:59:11.881000042    22905.65
                                                           2025-10-31 23:59:26.881000042    22905.71
                                                           2025-10-31 23:59:41.881000042    22905.83
                                                           2025-10-31 23:59:56.881000042    22905.89
                                                   user    2025-10-31 23:59:11.881000042    42495.48
                                                           2025-10-31 23:59:26.881000042    42495.59
                                                           2025-10-31 23:59:41.881000042    42495.78
                                                           2025-10-31 23:59:56.881000042    42495.91
dtype: float64

Would that structure be suitable for your needs? It looks like it would be possible to generalize across all the API result types.

dcoles · 2025-11-06T08:10:26Z

This would be the alternate to_pandas implementation:

import numpy as np
import pandas as pd

def to_pandas2(data: dict) -> pd.Series:
    """Convert Prometheus data object to Pandas Series."""
    result_type = data['resultType']
    if result_type == 'vector':
        index_frame = pd.DataFrame(
                r['metric'] | {'@': pd.Timestamp(r['value'][0], unit='s')}
                for r in data['result'])
        return pd.Series(
            data=(np.float64(r['value'][1]) for r in data['result']),
            index=pd.MultiIndex.from_frame(index_frame)
        )
    elif result_type == 'matrix':
        index_frame = pd.DataFrame(
                r['metric'] | {'@': pd.Timestamp(v[0], unit='s')}
                for r in data['result'] for v in r['values'])
        return pd.Series(
            data=(np.float64(v[1]) for r in data['result'] for v in r['values']),
            index=pd.MultiIndex.from_frame(index_frame)
        )
    elif result_type == 'scalar':
        return pd.Series(
            data=[np.float64(data['result'][1])],
            index=[pd.Timestamp(data['result'][0], unit='s')])
    elif result_type == 'string':
        return pd.Series(
            data=[data['result'][1]],
            index=[pd.Timestamp(data['result'][0], unit='s')])
    else:
        raise ValueError('Unknown type: {}'.format(result_type))

n-peugnet · 2025-11-06T15:09:30Z

> This would be the alternate to_pandas implementation:

I just tried it and it works nicely for my use case. I can simply call to_frame(name='value') to get the DataFrame I need.

But I think directly replacing to_pandas might break some other use cases for other people. So maybe it is better to provide it as a new API?
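A minimal sketch of the `to_frame(name='value')` step mentioned above, using a hand-built Series with the same kind of MultiIndex (labels plus an `@` timestamp level). The sample values are illustrative, and the extra `reset_index()` call is an assumption about how to turn the index levels into ordinary columns for grouping.

```python
import pandas as pd

# Hand-built stand-in for a MultiIndex-based query result.
index = pd.MultiIndex.from_tuples(
    [
        ("node_cpu_seconds_total", "0", "system", pd.Timestamp("2025-11-01")),
        ("node_cpu_seconds_total", "0", "user", pd.Timestamp("2025-11-01")),
    ],
    names=["__name__", "cpu", "mode", "@"],
)
series = pd.Series([22905.89, 42495.91], index=index)

# Name the values column, then move every index level into a column.
table = series.to_frame(name="value").reset_index()
print(table)
```

Each label is then a plain column, so expressions such as `table.groupby("mode")["value"].sum()` work directly.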

dcoles · 2025-11-07T08:52:13Z

I was working on updating the Jupyter workbook, but it turns out that the form above is pretty frustrating to work with (especially for plotting). This column DataFrame layout seems a lot more convenient to work with than a Series one.

> data = to_pandas2(...)
> df = data.unstack(data.index.names.difference(['@']))
> df

__name__            node_cpu_seconds_total          
cpu                                      0          
instance                    localhost:9100          
job                                   node          
mode                                system      user
@                                                   
2025-11-01 00:00:00               22905.89  42495.91
2025-11-01 00:01:00               22906.22  42496.44
2025-11-01 00:02:00               22906.50  42496.99
2025-11-01 00:03:00               22906.83  42497.54
2025-11-01 00:04:00               22907.13  42498.09
...                                    ...       ...
2025-11-01 23:56:00               23605.30  43677.62
2025-11-01 23:57:00               23605.65  43678.27
2025-11-01 23:58:00               23606.06  43678.86
2025-11-01 23:59:00               23606.36  43679.39
2025-11-02 00:00:00               23606.75  43680.00

[1441 rows x 2 columns]

> df.xs('user', level='mode', axis=1).diff().rolling(timedelta(minutes=15)).sum().plot()
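The same unstack, select, and rolling-sum pipeline can be tried on a small synthetic Series. The data below is made up, the window is shortened to two minutes so the result is easy to check by hand, and the label levels are collected with a list comprehension instead of `names.difference`.

```python
from datetime import timedelta

import pandas as pd

# Synthetic counter samples: two modes, four one-minute steps each.
times = pd.date_range("2025-11-01", periods=4, freq="1min")
index = pd.MultiIndex.from_product(
    [["node_cpu_seconds_total"], ["system", "user"], times],
    names=["__name__", "mode", "@"],
)
series = pd.Series([1.0, 2.0, 4.0, 7.0, 10.0, 20.0, 40.0, 70.0], index=index)

# Move every label level into the columns, leaving '@' as the row index.
label_levels = [name for name in series.index.names if name != "@"]
wide = series.unstack(label_levels)

# Select one label value, take per-step increases, then sum over a time window.
user = wide.xs("user", level="mode", axis=1)
increase = user.diff().rolling(timedelta(minutes=2)).sum()
print(increase)
```

Because the row index is a plain `DatetimeIndex`, time-based windows and `.plot()` work without further reshaping.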

dcoles · 2025-11-08T05:28:52Z

I've created an updated multiindex branch that has this new logic and some other cleanups (e.g. correctly returning timestamps on scalar queries). This would be a major version release because, while it does support a string_labels flag to emulate the old behavior, it's not completely identical.

n-peugnet · 2025-11-14T17:51:43Z

Hey, I just had the time to look into this. I'm sorry, but this layout makes it a pain to do what I want to do, and I still haven't managed to do it.

With the table layout I was able to plot the data with seaborn this way:

df = p.query_dataframe(
    'sum by (exp,config,seed,num) (node_network_transmit_bytes_total{device=~"link.*"})',
    '1970-01-01T00:10:00Z',
    '1s'
)
df['value'] = df['value'] / 1_000_000

# ... some more customisations ...

sns.barplot(x='exp', y="value", hue="config", data=df)

I have not yet found a way to make this work with the multiindex output. I got pretty close by transposing it and resetting the index, but then I have a column named by the timestamp that I don't know how to select (this query does not have group by exp):

df = p.query(
    'sum by (config,seed,num) (node_network_transmit_bytes_total{device=~"link.*"})',
    t,
    '1s'
).T.reset_index()
print(df)

     config num seed  1970-01-01 00:10:00
0   meshmon   1    1            2579685.0
1   meshmon   1   10            2313246.0
2   meshmon   1   11            2762638.0
3   meshmon   1   12            2209689.0
4   meshmon   1   13            2403765.0
..      ...  ..  ...                  ...
59  meshsim   1    5            2644419.0
60  meshsim   1    6            2208169.0
61  meshsim   1    7            2602326.0
62  meshsim   1    8            2406648.0
63  meshsim   1    9            2654732.0
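One possible way around the timestamp-named column, sketched on hand-built data: assuming the result is a DataFrame with a timestamp row index and label levels as MultiIndex columns (as the transposed output above suggests), `DataFrame.unstack()` yields a Series indexed by labels plus timestamp, which can then be reset into a long table. The index name `@` and the sample values are assumptions, and this has not been checked against the multiindex branch itself.

```python
import pandas as pd

# Hand-built stand-in for the wide result: one timestamp row,
# one column per (config, num, seed) label combination.
columns = pd.MultiIndex.from_tuples(
    [("meshmon", "1", "1"), ("meshmon", "1", "10"), ("meshsim", "1", "5")],
    names=["config", "num", "seed"],
)
wide = pd.DataFrame(
    [[2579685.0, 2313246.0, 2644419.0]],
    index=pd.DatetimeIndex(["1970-01-01 00:10:00"]),
    columns=columns,
)

# Name the time axis, fold the columns into the index, then flatten it:
# the timestamp becomes a regular '@' column and the numbers a 'value' column.
long = wide.rename_axis(index="@").unstack().rename("value").reset_index()
long["value"] = long["value"] / 1_000_000
print(long)
```

The resulting frame has one row per series and a named `value` column, which is the shape the earlier `sns.barplot(..., y="value", hue="config", data=df)` call expects.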

Add query_dataframe function to be able to then use the labels in pandas

0eabd62
