
always use the selected timeframe for sla calculation #834

Open
marcojarjour opened this issue Oct 30, 2024 · 2 comments

Comments

@marcojarjour

Preface

I noticed this bug when we evaluated the Icinga reporting module for our internal SLA monitoring. The values seemed to be off when a host did not exist for the whole time frame used in the report.

Describe the bug

The SLA value is not calculated based on the total time covered by the selected time frame. Instead, if there is no data at the beginning of the time frame, the total time starts with the first entry, which results in a wrong value.

For example, if a time frame of a year is selected and the data starts after half a year, the total_time should still be a full year. Let's assume a downtime of 08 hours 41 minutes and 38 seconds, which should result in an SLA value of 99.9 over a year.

Expected: time frame: 1 year, total_time: 1 year, with a downtime of 08:41:38, which results in 99.9
Actual: time frame: 1 year, total_time: 1/2 year, with a downtime of 08:41:38, which results in less
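The expected and actual numbers above can be sketched with the underlying formula (a simplified sketch assuming sla_ok_percent = 100 * (total_time - problem_time) / total_time, as quoted later in this thread, not the actual schema function):

```go
package main

import "fmt"

// slaOkPercent is a simplified sketch of the reporting formula:
// 100 * (total_time - problem_time) / total_time.
func slaOkPercent(totalTime, problemTime float64) float64 {
	return 100 * (totalTime - problemTime) / totalTime
}

func main() {
	const downtime = 8*3600 + 41*60 + 38 // 08:41:38 = 31298 s
	const year = 365 * 24 * 3600.0

	// Expected: total_time spans the full selected year.
	fmt.Printf("full year: %.4f\n", slaOkPercent(year, downtime))

	// Actual: total_time starts with the first entry after half a year,
	// so the same downtime weighs twice as heavily.
	fmt.Printf("half year: %.4f\n", slaOkPercent(year/2, downtime))
}
```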

To Reproduce

Steps to reproduce the behavior:

⚠️ don't test this on production!

  1. Insert some test data into the database
INSERT INTO sla_history_state
    (id, environment_id, endpoint_id, object_type, host_id, service_id, event_time, hard_state, previous_hard_state)
VALUES ('111', '222', '333', 'service', 'hhhh', 'ssss', '1609459200000', '0', '99'),
       ('112', '222', '333', 'service', 'hhhh', 'ssss', '1609545600000', '3', '0'),
       ('113', '222', '333', 'service', 'hhhh', 'ssss', '1609576898000', '0', '3');    
  2. Select the data for verification
MariaDB [icingadb]> select event_time, FROM_UNIXTIME(event_time/1000,  '%Y-%m-%d %H:%i') as date, hard_state, previous_hard_state from sla_history_state where hex(host_id) = '6868686800000000000000000000000000000000' and hex(service_id) = '7373737300000000000000000000000000000000' order by event_time desc;
+---------------+------------------+------------+---------------------+
| event_time    | date             | hard_state | previous_hard_state |
+---------------+------------------+------------+---------------------+
| 1609576898000 | 2021-01-02 08:41 |          0 |                   3 |
| 1609545600000 | 2021-01-02 00:00 |          3 |                   0 |
| 1609459200000 | 2021-01-01 00:00 |          0 |                  99 |
+---------------+------------------+------------+---------------------+
3 rows in set (0.039 sec)
  3. Calculate the SLA with the time frame starting at the same time as the data
MariaDB [icingadb]> select get_sla_ok_percent( 'hhhh', 'ssss', UNIX_TIMESTAMP('2021-01-01')*1000, UNIX_TIMESTAMP('2022-01-01')*1000);
+-----------------------------------------------------------------------------------------------------------+
| get_sla_ok_percent( 'hhhh', 'ssss', UNIX_TIMESTAMP('2021-01-01')*1000, UNIX_TIMESTAMP('2022-01-01')*1000) |
+-----------------------------------------------------------------------------------------------------------+
|                                                                                                   99.9008 |
+-----------------------------------------------------------------------------------------------------------+
  • The time frame is a year and the result is 99.9, as expected
  4. Calculate the SLA with the data somewhere within the time frame
MariaDB [icingadb]> select get_sla_ok_percent( 'hhhh', 'ssss', UNIX_TIMESTAMP('2020-01-03')*1000, UNIX_TIMESTAMP('2021-01-03')*1000);
+-----------------------------------------------------------------------------------------------------------+
| get_sla_ok_percent( 'hhhh', 'ssss', UNIX_TIMESTAMP('2020-01-03')*1000, UNIX_TIMESTAMP('2021-01-03')*1000) |
+-----------------------------------------------------------------------------------------------------------+
|                                                                                                   81.8877 |
+-----------------------------------------------------------------------------------------------------------+
  • The time frame is still one year, but the result is much lower than the expected 99.9
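The gap between the expected value and 81.8877 can be reproduced by hand (a sketch under the assumption, inferred from the query results above, that total_time shrinks to the span from the first history entry to the end of the time frame):

```go
package main

import "fmt"

// slaOkPercent is a simplified sketch of the reporting formula:
// 100 * (total_time - problem_time) / total_time.
func slaOkPercent(totalTime, problemTime float64) float64 {
	return 100 * (totalTime - problemTime) / totalTime
}

func main() {
	const downtime = 8*3600 + 41*60 + 38 // 08:41:38 = 31298 s

	// Selected time frame: 2020-01-03 .. 2021-01-03 (366 days; 2020 is a leap year).
	fullWindow := 366 * 24 * 3600.0
	// Observed behavior: total_time shrinks to the span between the first
	// history row (previous_hard_state = 99) and the end of the time frame,
	// i.e. 2021-01-01 .. 2021-01-03.
	reducedWindow := 2 * 24 * 3600.0

	fmt.Printf("full time frame:    %.4f\n", slaOkPercent(fullWindow, downtime))
	fmt.Printf("reduced total_time: %.4f\n", slaOkPercent(reducedWindow, downtime))
}
```

With the full window the result stays around 99.9; with the reduced window the same 31298 seconds of downtime produce the observed 81.8877.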

Expected behavior

Both queries have a time frame of one year, therefore I would expect both to yield the same result.

Your Environment

# os
Platform: Debian GNU/Linux
Platform version: 11 (bullseye)
Kernel: Linux
Kernel version: 6.10.6
Architecture: x86_64

# icinga
icinga2:		v2.14.2
Icinga Web 2 Version:	2.12.1
Git Commit:		cd2daeb2cb8537c633d343a29eb76c54cd2ebbf2
PHP-Version:		8.2.20
Git Commit Date:	2023-11-15

# libraries
icinga/icinga-php-library	0.14.1
icinga/icinga-php-thirdparty	0.12.1

# modules
director		1.11.1
icingadb		1.1.3
incubator		0.22.0
pdfexport		0.11.0
reporting		1.0.2

Additional context

The database function is located in the icingadb schema file, and in my tests the issue is resolved when it is changed like this:

-    IF row_previous_hard_state = 99 THEN
-      SET total_time = total_time - (row_event_time - last_event_time);
-    ELSEIF ((in_service_id IS NULL AND last_hard_state > 0) OR (in_service_id IS NOT NULL AND last_hard_state > 1))
+    IF ((in_service_id IS NULL AND last_hard_state > 0) OR (in_service_id IS NOT NULL AND last_hard_state > 1))
       AND last_hard_state != 99
       AND active_downtimes = 0
     THEN

I could not think of a situation where this behaviour would be needed, but I guess there is a reason why it is in there 🙂. Maybe someone with a deeper understanding of the reporting module has a clue?

I hope it's not a stupid mistake on my end, even though I tried to check everything thoroughly.

Last but not least, thanks to anyone who spent the time looking into it, and especially thanks for the software and all the effort everyone has put into it.

@oxzi
Member

oxzi commented Nov 4, 2024

First, thanks a lot for this very detailed and well-written issue.

I was able to reproduce your reported behavior and have integrated it into the SLA tests as follows. As one might expect, the second test "i834-2" fails.

diff --git a/tests/sql/sla_test.go b/tests/sql/sla_test.go
index de2bace..bde20c2 100644
--- a/tests/sql/sla_test.go
+++ b/tests/sql/sla_test.go
@@ -210,6 +210,26 @@ func TestSla(t *testing.T) {
                Start:    1000,
                End:      2000,
                Expected: 60,
+       }, {
+               Name: "i834",
+               Events: []SlaHistoryEvent{
+                       &State{Time: 1609459200000, State: 0, PreviousState: 99}, // 2021-01-01 00:00:00.0000
+                       &State{Time: 1609545600000, State: 3, PreviousState: 0},  // 2021-01-02 00:00:00.0000
+                       &State{Time: 1609576898000, State: 0, PreviousState: 3},  // 2021-01-02 08:41:38.0000
+               },
+               Start:    1609459200000, // 2021-01-01 00:00:00.0000
+               End:      1640995200000, // 2022-01-01 00:00:00.0000
+               Expected: 99.9008,
+       }, {
+               Name: "i834-2",
+               Events: []SlaHistoryEvent{
+                       &State{Time: 1609459200000, State: 0, PreviousState: 99}, // 2021-01-01 00:00:00.0000
+                       &State{Time: 1609545600000, State: 3, PreviousState: 0},  // 2021-01-02 00:00:00.0000
+                       &State{Time: 1609576898000, State: 0, PreviousState: 3},  // 2021-01-02 08:41:38.0000
+               },
+               Start:    1578009600000, // 2020-01-03 00:00:00.0000
+               End:      1609632000000, // 2021-01-03 00:00:00.0000
+               Expected: 99.901,
        }}

        for _, test := range tests {

However, the issue's core lies in the handling of initially pending data. The ongoing PR #566 tries to explain this with prose and pseudo code, https://github.com/Icinga/icingadb/blob/add-create-delete-history-events/doc/10-Sla-Reporting.md#computing-sla-ok-percent. As @yhabteab wrote over there, the current logic follows the idea that an initial pending state is equivalent to no usable data so far.

This can be demonstrated with the following test case, which effectively "ignores" the first half of the tested time window. When applying your suggested schema fix, this test will fail with an SLA of 80%.

	}, {
		Name: "InitialUnknownReducesTotalTime",
		Events: []SlaHistoryEvent{
			&State{Time: 1500, State: 2, PreviousState: 99},
			&State{Time: 1700, State: 0, PreviousState: 2},
			&CurrentState{State: 0},
		},
		Start:    1000,
		End:      2000,
		Expected: 60,
	}, {
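The arithmetic behind this test can be sketched as follows (a simplified illustration of the two interpretations discussed in this thread, not the actual schema function; the window is 1000..2000, pending until 1500, problem from 1500..1700):

```go
package main

import "fmt"

// currentLogic subtracts the initially pending interval from total_time,
// as the current schema function does.
func currentLogic(window, pending, problem float64) float64 {
	return 100 * (window - pending - problem) / (window - pending)
}

// fullWindow keeps total_time at the whole selected window, the
// alternative reading suggested in this thread.
func fullWindow(window, problem float64) float64 {
	return 100 * (window - problem) / window
}

func main() {
	fmt.Printf("pending reduces total_time: %.0f%%\n", currentLogic(1000, 500, 200)) // 60%
	fmt.Printf("pending counted as usable:  %.0f%%\n", fullWindow(1000, 200))        // 80%
}
```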

The bug description addresses missing data for the time frame, but as I have tried to explain above, this is not the issue at hand. For example, altering "i834-2" to have a first State entry of PreviousState: 0 would result in the expected uptime of 99.901%.

Go Code of Changed Test

	{
		Name: "i834-3",
		Events: []SlaHistoryEvent{
			&State{Time: 1609459200000, State: 0, PreviousState: 0}, // 2021-01-01 00:00:00.0000
			&State{Time: 1609545600000, State: 3, PreviousState: 0}, // 2021-01-02 00:00:00.0000
			&State{Time: 1609576898000, State: 0, PreviousState: 3}, // 2021-01-02 08:41:38.0000
		},
		Start:    1578009600000, // 2020-01-03 00:00:00.0000
		End:      1609632000000, // 2021-01-03 00:00:00.0000
		Expected: 99.901,
	}

The concrete issue is due to initially pending data, not due to absent data. Thus, we are dealing with a design decision. Please feel free to comment on this or suggest changes. For the moment, I just wanted to describe why things are happening the way they are.

@oxzi oxzi removed their assignment Nov 4, 2024
@marcojarjour
Author

First of all, thanks for your time and detailed answer!

As @yhabteab wrote over there, the current logic follows the idea that an initial pending state is equivalent to no usable data so far.

This is the new concept once PR #566 is finished, right? I read through the linked documentation and code and tried to understand it as well as possible. If I got it right, as you also said, the time before the first event should be counted as not usable, which should not negatively impact the SLA calculation.

What I do not understand is the InitialUnknownReducesTotalTime test: shouldn't it be 80% in the first place?
We have a calculation time frame of 1000 seconds. The "service" is, over its lifetime, only for 200 seconds in a non-zero state (and therefore down), or in state 99, which indicates it did not exist before. Therefore, in my mind, 800 of these 1000 seconds should be counted towards an up state, which should result in an uptime of 80%.

But what seems to happen is exactly what I meant in my description: the time gets subtracted from the total time, which results in a shorter time frame and therefore in a wrong result.

SET total_time = total_time - (row_event_time - last_event_time);

Example based on the values of the InitialUnknownReducesTotalTime test:

# sla calculation
sla_ok_percent := 100 * (total_time - problem_time) / total_time

# current situation
# as referenced above, the first 500 seconds get subtracted from the total_time,
# therefore the total_time is 500 instead of 1000, which results in a wrong SLA result
60 = 100 * (500 - 200) / 500

# proposed solution
# do not modify the total_time
80 = 100 * (1000 - 200) / 1000

The bug description addresses missing data for the time frame, but as I have tried to explain above, this is not the issue at hand. For example, altering "i834-2" to have a first State entry of PreviousState: 0 would result in the expected uptime of 99.901%.

I guess we are already on the same page, but maybe my description wasn't the best. I also assume that the non-existing data before the first event, or rather how its absence is handled, is the root cause. And in my opinion that time should be counted as if the service was up.

The only purpose of the "quickfix" in the SQL function was to avoid having to manually replace all PreviousStates of 99 with 0 and regularly check for those. I added it to demonstrate what a quick fix could look like.

The concrete issue is due to initially pending data, not due to absent data. Thus, we are dealing with a design decision. Please feel free to comment on this or suggest changes. For the moment, I just wanted to describe why things are happening the way they are.

It was not my intention to say it's based on absent data, but rather that the calculation is wrong if there is no data for the whole time frame, which in my opinion should not be the case.

And if I read through PR #566, especially the Computing SLA OK percent description, correctly, it seems to have the same issue as described above. The pending event time still gets subtracted from the total_time, which in my opinion should not be the case: changing the total_time influences the SLA percentage, so a downtime has a higher impact.

total_time = total_time - (event.event_time - last_event_time)

I hope I could make my point clearer and did not miss yours.
Thanks for your time
