Conversation

@n-peugnet

This is especially useful for using the labels later in plots, for example to group by configuration.

@dcoles
Owner

dcoles commented Nov 4, 2025

Thanks for the PR!

I'll need to think a little about this, since the query API can return several different data-types depending on the query. The Prometheus.query and Prometheus.query_range methods currently map directly to the matching APIs, though I can understand why a table format might be preferred.

Query returning instant vector

node_cpu_seconds_total{mode=~"user|system"}

query

  • Returns: <class 'pandas.core.series.Series'>
node_cpu_seconds_total{cpu="0",instance="localhost:9100",job="node",mode="system"}    25504.27
node_cpu_seconds_total{cpu="0",instance="localhost:9100",job="node",mode="user"}      46959.06
dtype: float64

query_dataframe

  • Returns: <class 'pandas.core.series.Series'> ← Should have been a DataFrame?
  • Returns: <class 'pandas.core.frame.DataFrame'>
                 __name__ cpu        instance   job    mode     value
0  node_cpu_seconds_total   0  localhost:9100  node  system  25504.27
1  node_cpu_seconds_total   0  localhost:9100  node    user  46959.06

Query returning range vector

node_cpu_seconds_total{mode=~"user|system"}[1m]

query

  • Returns: <class 'pandas.core.frame.DataFrame'>
                               node_cpu_seconds_total{cpu="0",instance="localhost:9100",job="node",mode="system"}  node_cpu_seconds_total{cpu="0",instance="localhost:9100",job="node",mode="user"}
2025-11-04 22:10:26.881000042                                           25503.80                                                                            46958.37
2025-11-04 22:10:41.881000042                                           25503.98                                                                            46958.58
2025-11-04 22:10:56.881000042                                           25504.10                                                                            46958.80
2025-11-04 22:11:11.881000042                                           25504.27                                                                            46959.06

query_dataframe

KeyError: 'value'

Query returning scalar

scalar(sum(node_cpu_seconds_total{mode=~"user|system"}))

query

  • Returns: <class 'numpy.ndarray'>
[1.76229428e+09 7.24633300e+04]

query_dataframe

TypeError: 'float' object is not subscriptable
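
Both failures follow from the different shapes of `result` for each `resultType`. A minimal sketch, assuming a converter that treats every entry as an instant-vector sample; `naive_query_dataframe`, `try_convert`, and the sample payloads are hypothetical stand-ins, not the code from this PR:

```python
import pandas as pd

# Hypothetical payloads shaped like the "data" object of a Prometheus query response.
VECTOR = {
    "resultType": "vector",
    "result": [
        {"metric": {"__name__": "node_cpu_seconds_total", "mode": "system"},
         "value": [1762294271.881, "25504.27"]},
        {"metric": {"__name__": "node_cpu_seconds_total", "mode": "user"},
         "value": [1762294271.881, "46959.06"]},
    ],
}
MATRIX = {
    "resultType": "matrix",
    "result": [
        {"metric": {"__name__": "node_cpu_seconds_total", "mode": "system"},
         "values": [[1762294256.881, "25504.10"], [1762294271.881, "25504.27"]]},
    ],
}
SCALAR = {"resultType": "scalar", "result": [1762294280.0, "72463.33"]}


def naive_query_dataframe(data: dict) -> pd.DataFrame:
    """Build a label table, assuming every entry has 'metric' and 'value' keys."""
    rows = [r["metric"] | {"value": float(r["value"][1])} for r in data["result"]]
    return pd.DataFrame(rows)


def try_convert(data: dict) -> str:
    """Return the resulting type name, or the exception type name on failure."""
    try:
        return type(naive_query_dataframe(data)).__name__
    except (KeyError, TypeError) as exc:
        return type(exc).__name__
```

Matrix entries carry a `values` list rather than a single `value`, and a scalar result is a bare `[timestamp, value]` pair, so iterating over it yields a float instead of a dict.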

@n-peugnet
Author

I'll need to think a little about this, since the query API can return several different data-types depending on the query.

Ah, yes you are right, I didn't pay attention to this and thought the basic query always returned a vector. 😅

Query returning instant vector

node_cpu_seconds_total{mode=~"user|system"}

[...]

query_dataframe

  • Returns: <class 'pandas.core.series.Series'> ← Should have been a DataFrame?

This is strange, considering I'm calling the pandas.DataFrame constructor.

For the rest of your reply, my bad, I was not thinking about these types of query. But I am completely open to another API, provided that I can get this kind of result for the basic vector query, as you showed:

                 __name__ cpu        instance   job    mode     value
0  node_cpu_seconds_total   0  localhost:9100  node  system  25504.27
1  node_cpu_seconds_total   0  localhost:9100  node    user  46959.06

@dcoles
Owner

dcoles commented Nov 6, 2025

This is strange, considering I'm calling the pandas.DataFrame constructor.

Looks like it was a copy-paste bug in my test script. Sorry to cause confusion. 🙇

I think I now understand what's going on and how I tried to tackle it when first writing the library.
The Prometheus selector is effectively the index, but dictionaries can't be used as indexes, which is why I flattened it to a string matching what you would see in the Prometheus query console.

Potentially the right tool would have been a MultiIndex. That would look like the following:

Query returning instant vector

node_cpu_seconds_total{mode=~"user|system"}

query

  • Returns: <class 'pandas.core.series.Series'>
__name__                cpu  instance        job   mode    @
node_cpu_seconds_total  0    localhost:9100  node  system  2025-11-01    22905.89
                                                   user    2025-11-01    42495.91
dtype: float64

Query returning range vector

node_cpu_seconds_total{mode=~"user|system"}[1m]

query

  • Returns: <class 'pandas.core.series.Series'> ← Yes! This is actually also a Series and not another typo!
__name__                cpu  instance        job   mode    @
node_cpu_seconds_total  0    localhost:9100  node  system  2025-10-31 23:59:11.881000042    22905.65
                                                           2025-10-31 23:59:26.881000042    22905.71
                                                           2025-10-31 23:59:41.881000042    22905.83
                                                           2025-10-31 23:59:56.881000042    22905.89
                                                   user    2025-10-31 23:59:11.881000042    42495.48
                                                           2025-10-31 23:59:26.881000042    42495.59
                                                           2025-10-31 23:59:41.881000042    42495.78
                                                           2025-10-31 23:59:56.881000042    42495.91
dtype: float64

Would that structure be suitable for your needs? It looks like it would be possible to generalize across all the API result types.
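
The two indexing approaches described in this comment can be compared side by side. This is only a sketch: `selector_string` is an illustrative helper, not the library's actual flattening code, and the label sets are made up to resemble the output above.

```python
import pandas as pd

# Hypothetical label sets, as found under "metric" in a Prometheus result.
metrics = [
    {"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "system"},
    {"__name__": "node_cpu_seconds_total", "cpu": "0", "mode": "user"},
]
values = [22905.89, 42495.91]


def selector_string(metric: dict) -> str:
    """Flatten a label dict into a selector-like string (illustrative helper)."""
    labels = ",".join(f'{k}="{v}"' for k, v in metric.items() if k != "__name__")
    return f'{metric.get("__name__", "")}{{{labels}}}'


# Option 1: one opaque string per series; labels are not individually addressable.
flat = pd.Series(values, index=[selector_string(m) for m in metrics])

# Option 2: one index level per label; labels can be selected or grouped by name.
multi = pd.Series(values, index=pd.MultiIndex.from_frame(pd.DataFrame(metrics)))

# Selecting by label only works on the MultiIndex form.
user_only = multi.xs("user", level="mode")
```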

@dcoles
Owner

dcoles commented Nov 6, 2025

This would be the alternate to_pandas implementation:

import numpy as np
import pandas as pd

def to_pandas2(data: dict) -> pd.Series:
    """Convert Prometheus data object to Pandas Series."""
    result_type = data['resultType']
    if result_type == 'vector':
        index_frame = pd.DataFrame(
                r['metric'] | {'@': pd.Timestamp(r['value'][0], unit='s')}
                for r in data['result'])
        return pd.Series(
            data=(np.float64(r['value'][1]) for r in data['result']),
            index=pd.MultiIndex.from_frame(index_frame)
        )
    elif result_type == 'matrix':
        index_frame = pd.DataFrame(
                r['metric'] | {'@': pd.Timestamp(v[0], unit='s')}
                for r in data['result'] for v in r['values'])
        return pd.Series(
            data=(np.float64(v[1]) for r in data['result'] for v in r['values']),
            index=pd.MultiIndex.from_frame(index_frame)
        )
    elif result_type == 'scalar':
        return pd.Series(
            data=[np.float64(data['result'][1])],
            index=[pd.Timestamp(data['result'][0], unit='s')])
    elif result_type == 'string':
        return pd.Series(
            data=[data['result'][1]],
            index=[pd.Timestamp(data['result'][0], unit='s')])
    else:
        raise ValueError('Unknown type: {}'.format(result_type))

@n-peugnet
Author

This would be the alternate to_pandas implementation:

I just tried it and it works nicely for my use case. I can simply call to_frame(name='value') to get the DataFrame I need.

But I think directly replacing to_pandas might break some other use cases for other people. So maybe it is better to provide it as a new API?
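
For reference, a rough sketch of that conversion on a hand-built Series. The sample index is hypothetical, and the extra `reset_index()` call is an assumption about how to obtain the flat label table shown earlier; it is not something confirmed in this thread.

```python
import pandas as pd

# Hypothetical MultiIndex Series shaped like the instant-vector output shown earlier.
index = pd.MultiIndex.from_tuples(
    [
        ("node_cpu_seconds_total", "0", "system", pd.Timestamp("2025-11-01")),
        ("node_cpu_seconds_total", "0", "user", pd.Timestamp("2025-11-01")),
    ],
    names=["__name__", "cpu", "mode", "@"],
)
series = pd.Series([22905.89, 42495.91], index=index)

# to_frame() names the value column; reset_index() turns each index level into a column.
table = series.to_frame(name="value").reset_index()
```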

@dcoles
Owner

dcoles commented Nov 7, 2025

I was working on updating the Jupyter workbook, but it turns out that the form above is pretty frustrating to work with (especially for plotting). This column DataFrame layout seems a lot more convenient to work with than a Series one.

> df = to_pandas2(...).unstack(data.index.names.difference(['@']))
> df

__name__            node_cpu_seconds_total          
cpu                                      0          
instance                    localhost:9100          
job                                   node          
mode                                system      user
@                                                   
2025-11-01 00:00:00               22905.89  42495.91
2025-11-01 00:01:00               22906.22  42496.44
2025-11-01 00:02:00               22906.50  42496.99
2025-11-01 00:03:00               22906.83  42497.54
2025-11-01 00:04:00               22907.13  42498.09
...                                    ...       ...
2025-11-01 23:56:00               23605.30  43677.62
2025-11-01 23:57:00               23605.65  43678.27
2025-11-01 23:58:00               23606.06  43678.86
2025-11-01 23:59:00               23606.36  43679.39
2025-11-02 00:00:00               23606.75  43680.00

[1441 rows x 2 columns]
> df.xs('user', level='mode', axis=1).diff().rolling(timedelta(minutes=15)).sum().plot()
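
A self-contained sketch of the same reshaping on a few hand-written samples. The data is hypothetical and reduced to two label levels; the level names follow the output above.

```python
from datetime import timedelta

import pandas as pd

# Hypothetical range-vector result as a MultiIndex Series (labels plus the '@' level).
times = pd.date_range("2025-11-01 00:00:00", periods=3, freq="1min")
samples = {
    "system": [22905.89, 22906.22, 22906.50],
    "user": [42495.91, 42496.44, 42496.99],
}
rows = [
    ("node_cpu_seconds_total", mode, ts, value)
    for mode, mode_values in samples.items()
    for ts, value in zip(times, mode_values)
]
series = pd.DataFrame(rows, columns=["__name__", "mode", "@", "value"]).set_index(
    ["__name__", "mode", "@"]
)["value"]

# Move every label level into the columns, leaving '@' as the row index.
label_levels = [name for name in series.index.names if name != "@"]
df = series.unstack(label_levels)

# Select one label value across the columns, then take per-sample increases.
user_increase = df.xs("user", level="mode", axis=1).diff()

# Time-based rolling sum over the DatetimeIndex, as in the plotting one-liner.
smoothed = user_increase.rolling(timedelta(minutes=15)).sum()
```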

@dcoles
Owner

dcoles commented Nov 8, 2025

I've created an updated multiindex branch that has this new logic and some other cleanups (e.g. correctly returning timestamps on scalar queries). This would be a major version release because, while it does support a string_labels flag to emulate the old behavior, it's not completely identical.

@n-peugnet
Author

n-peugnet commented Nov 14, 2025

Hey, I just had time to look into this. I'm sorry, but this layout makes what I want to do a pain, and I still haven't managed it.

With the table layout I was able to plot the data with seaborn this way:

df = p.query_dataframe(
    'sum by (exp,config,seed,num) (node_network_transmit_bytes_total{device=~"link.*"})',
    '1970-01-01T00:10:00Z',
    '1s'
)
df['value'] = df['value'] / 1_000_000

# ... some more customisations ...

sns.barplot(x='exp', y="value", hue="config", data=df)

I have not yet found a way to make this work with the MultiIndex output. I got pretty close by transposing it and resetting the index, but then I end up with a column named after the timestamp that I don't know how to select (this query does not group by exp):

df = p.query(
    'sum by (config,seed,num) (node_network_transmit_bytes_total{device=~"link.*"})',
    t,
    '1s'
).T.reset_index()
print(df)
     config num seed  1970-01-01 00:10:00
0   meshmon   1    1            2579685.0
1   meshmon   1   10            2313246.0
2   meshmon   1   11            2762638.0
3   meshmon   1   12            2209689.0
4   meshmon   1   13            2403765.0
..      ...  ..  ...                  ...
59  meshsim   1    5            2644419.0
60  meshsim   1    6            2208169.0
61  meshsim   1    7            2602326.0
62  meshsim   1    8            2406648.0
63  meshsim   1    9            2654732.0
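
One possible way to handle the timestamp-named column, offered only as a sketch: `melt` folds it into a pair of ordinary columns. The frame below is a hypothetical stand-in for the printed output, with three of its rows copied by hand.

```python
import pandas as pd

# Hypothetical stand-in for the transposed, index-reset frame printed above:
# label columns plus one value column whose name is a Timestamp.
t = pd.Timestamp("1970-01-01 00:10:00")
df = pd.DataFrame(
    {
        "config": ["meshmon", "meshmon", "meshsim"],
        "num": ["1", "1", "1"],
        "seed": ["1", "10", "5"],
        t: [2579685.0, 2313246.0, 2644419.0],
    }
)

# melt() keeps the label columns and folds every timestamp-named column
# into a ('@', 'value') pair, yielding one row per sample.
long_df = df.melt(id_vars=["config", "num", "seed"], var_name="@", value_name="value")

# Same scaling as in the earlier snippet (divide by one million).
long_df["value"] = long_df["value"] / 1_000_000
```

The resulting `long_df` has plain `config` and `value` columns, so it should be usable wherever the table layout was, for example as the `data` argument of `sns.barplot`.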
