Fix broken downtime comment sync #10000

Open
wants to merge 2 commits into
base: master

yhabteab
Member

@yhabteab yhabteab commented Feb 12, 2024

All objects must be synced in the order of their load dependencies. Otherwise, downtimes and/or comments might get synced before their respective checkables, in which case the other endpoint ignores them because it does not yet know about those checkables. Because a runtime config update event does not trigger a reload on the remote endpoint, these objects won't be synced again until the next reload.
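As a rough illustration of what "sorted by their load dependency" means at the type level, the following sketch orders a handful of types so that each one comes after the types it depends on. The `LOAD_AFTER` table is a hypothetical, hand-written subset chosen for illustration; it is not read from the `.ti` files, and this is not the code used in the PR.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical subset of type-level load dependencies (an assumption for
# illustration): each type maps to the types that must already be known
# on the receiving endpoint before objects of this type are synced.
LOAD_AFTER = {
    "Host": set(),
    "Service": {"Host"},
    "Downtime": {"Host", "Service"},
    "Comment": {"Host", "Service"},
}


def sync_order(load_after):
    """Return the types ordered so that every type follows its dependencies."""
    return list(TopologicalSorter(load_after).static_order())


order = sync_order(LOAD_AFTER)
print(order)
```

With such an order, a Downtime is never sent before the Host or Service it belongs to, which is the failure described above.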

~/master2/icinga2 (bundled-cluster-fixes ✗) ls prefix/var/lib/icinga2/api/packages/_api/*/conf.d/downtimes | wc -l
    3501
~/master2/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5666/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
    1501
~/master1/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5665/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
    3501

After master2 reload:

~/master2/icinga2 (bundled-cluster-fixes ✗) curl -sSku root:icinga 'https://localhost:5666/v1/objects/downtimes?pretty=1' | grep ' "attrs": {' | wc -l
    3501
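The counts above rely on grepping the pretty-printed output, which depends on the exact whitespace of the response. A slightly sturdier way to count is to parse the JSON, sketched below under the assumption that the response body wraps the objects in a top-level `results` array. The sample body is made up and heavily shortened; `count_objects` is a hypothetical helper name.

```python
import json


def count_objects(response_text):
    """Count the objects in a /v1/objects/... response body.

    Assumes the objects are listed under a top-level "results" key.
    """
    return len(json.loads(response_text)["results"])


# Hypothetical, shortened response body used only for illustration.
sample = json.dumps({
    "results": [
        {"name": "host1!downtime-a", "type": "Downtime", "attrs": {}},
        {"name": "host1!downtime-b", "type": "Downtime", "attrs": {}},
    ]
})

print(count_objects(sample))
```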

closes #7786
closes #9873

TODO

@yhabteab yhabteab added bug Something isn't working area/distributed Distributed monitoring (master, satellites, clients) ref/IP area/runtime Downtimes, comments, dependencies, events labels Feb 12, 2024
@cla-bot cla-bot bot added the cla/signed label Feb 12, 2024
Review comment on lib/remote/apilistener-configsync.cpp (outdated, resolved)
@log1-c
Contributor

log1-c commented Jul 22, 2024

Quick question: would this also fix the following issue:
Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078

@Mordecaine

Mordecaine commented Aug 13, 2024

Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078

It would be very helpful to get an answer to this.

@yhabteab
Member Author

Quick question: would this also fix the following issue: Downtimes scheduled via API (sometimes) get synced/re-created with delay and are doubled #10078

It would be very helpful to get an answer to this.

Hi, we don't know for sure whether this will fix #10078, as we still haven't identified exactly what is going wrong there, other than that something is not working as expected. It's unlikely that this PR will fix #10078, but we can't tell you for sure until the cause of #10078 is identified.

@Al2Klimov Al2Klimov removed their assignment Aug 20, 2024
@Al2Klimov Al2Klimov self-requested a review August 20, 2024 11:16
@dgiesselbach

@yhabteab When will this request be completed? Is there a timeline?

@yhabteab yhabteab changed the base branch from master to enhanced-sort-types-by-load-dependencies September 16, 2024 07:23
@yhabteab yhabteab force-pushed the enhanced-sort-types-by-load-dependencies branch 4 times, most recently from cb4fe57 to eb97676 Compare September 20, 2024 14:18
@Al2Klimov Al2Klimov changed the base branch from enhanced-sort-types-by-load-dependencies to master September 20, 2024 15:30
@Al2Klimov Al2Klimov marked this pull request as draft September 20, 2024 15:31
@julianbrost
Contributor

I believe similar problems will still exist for other types where there's no load_after dependency at the moment: many objects can refer to a time period, however, there's not a single load_after TimePeriod. There are more such examples: Host/Service -> *Command, Notification -> NotificationCommand (not necessarily a complete list).
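The gap described here (a type references another type without a matching load_after) can be checked mechanically. The sketch below compares two hand-written tables; both `REFERENCES` and `LOAD_AFTER` are hypothetical, incomplete examples and are not extracted from the actual `.ti` files.

```python
# Hypothetical data (assumptions for illustration only):
# which types an object type can refer to by name ...
REFERENCES = {
    "Host": {"TimePeriod", "CheckCommand"},
    "Service": {"Host", "TimePeriod", "CheckCommand"},
    "Notification": {"Host", "Service", "TimePeriod", "NotificationCommand"},
}
# ... and which load_after declarations exist for that type.
LOAD_AFTER = {
    "Host": set(),
    "Service": {"Host"},
    "Notification": {"Host", "Service"},
}


def missing_load_after(references, load_after):
    """Return, per type, the referenced types that no load_after covers."""
    missing = {}
    for type_name, referenced in references.items():
        # Self-references cannot be expressed at the type level, skip them.
        gap = referenced - load_after.get(type_name, set()) - {type_name}
        if gap:
            missing[type_name] = sorted(gap)
    return missing


for type_name, gap in missing_load_after(REFERENCES, LOAD_AFTER).items():
    print(type_name, "->", ", ".join(gap))
```

The check only looks at direct declarations; a fuller version would also follow load_after transitively.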

@yhabteab yhabteab added the consider backporting Should be considered for inclusion in a bugfix release label Sep 26, 2024
@yhabteab
Member Author

A complete list of the navigable aka navigation types (attributes):

Another list of non-navigable dependencies :):

~/Workspace/icinga2 (broken-downtime-comment-sync ✗) grep -rE 'array\(name.*' lib     
lib/icinga/host.ti:     [config, no_user_modify, required, signal_with_old_value] array(name(HostGroup)) groups {
lib/icinga/notification.ti:     [config, signal_with_old_value] array(name(User)) users (UsersRaw);
lib/icinga/notification.ti:     [config, signal_with_old_value] array(name(UserGroup)) user_groups (UserGroupsRaw);
lib/icinga/servicegroup.ti:     [config, no_user_modify] array(name(ServiceGroup)) groups;
lib/icinga/hostgroup.ti:        [config, no_user_modify] array(name(HostGroup)) groups;
lib/icinga/usergroup.ti:        [config, no_user_modify] array(name(UserGroup)) groups;
lib/icinga/service.ti:  [config, no_user_modify, required, signal_with_old_value] array(name(ServiceGroup)) groups {
lib/icinga/user.ti:     [config, no_user_modify, required, signal_with_old_value] array(name(UserGroup)) groups {
lib/icinga/timeperiod.ti:       [config, required, signal_with_old_value] array(name(TimePeriod)) excludes {
lib/icinga/timeperiod.ti:       [config, required, signal_with_old_value] array(name(TimePeriod)) includes {
lib/remote/zone.ti:     [config] array(name(Endpoint)) endpoints (EndpointsRaw);
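To work with this list programmatically, the grep output can be turned into (declaring type, attribute, referenced type) triples. The sketch below parses three of the lines shown above with a regular expression; the pattern and the helper name `parse_references` are assumptions tailored to this output format, and the declaring type is simply taken from the `.ti` file name.

```python
import re

# Matches lines such as
#   lib/icinga/hostgroup.ti:  [config, ...] array(name(HostGroup)) groups;
LINE = re.compile(
    r"^lib/\w+/(?P<file>\w+)\.ti:.*array\(name\((?P<target>\w+)\)\)\s+(?P<attr>\w+)"
)


def parse_references(lines):
    """Extract (file stem, attribute, referenced type) from grep output."""
    refs = []
    for line in lines:
        match = LINE.match(line)
        if match:
            refs.append((match["file"], match["attr"], match["target"]))
    return refs


sample = [
    "lib/icinga/hostgroup.ti:        [config, no_user_modify] array(name(HostGroup)) groups;",
    "lib/icinga/timeperiod.ti:       [config, required, signal_with_old_value] array(name(TimePeriod)) excludes {",
    "lib/remote/zone.ti:     [config] array(name(Endpoint)) endpoints (EndpointsRaw);",
]

refs = parse_references(sample)
# Entries whose referenced type is the declaring type itself.
self_refs = [ref for ref in refs if ref[0] == ref[2].lower()]
print(self_refs)
```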

Al2Klimov
Al2Klimov previously approved these changes Sep 27, 2024
@julianbrost
Contributor

julianbrost commented Oct 8, 2024

While cross-checking[1] the dependencies added by this PR, I noticed another problem and also came up with a different idea to fix this in a hopefully more reliable way.

The problem: There are object types that allow referencing other objects of the same type. The first example of this I found is HostGroup: it has a groups attribute where other host groups can be referenced to recursively define groups:

[config, no_user_modify] array(name(HostGroup)) groups;

Something very similar also exists for TimePeriod with the includes/excludes attributes:

[config, required, signal_with_old_value] array(name(TimePeriod)) excludes {
default {{{ return new Array(); }}}
};
[config, required, signal_with_old_value] array(name(TimePeriod)) includes {
default {{{ return new Array(); }}}
};

So no matter how we reorder the types, the issue remains that the order of individual objects within a type can also make a difference, i.e. the included/excluded time periods have to be sent first. Otherwise, the same problem can happen there as well (and then continue to affect other objects that reference these time periods, like hosts and services).

However, on a positive note, it looks like there's actually an easy solution to this: Icinga 2 already tracks these dependencies between individual config objects on a per-object basis (https://github.com/Icinga/icinga2/blob/master/lib/base/dependencygraph.hpp, https://github.com/Icinga/icinga2/blob/master/lib/base/dependencygraph.cpp). Thus, we should be able to solve both the problem that was the reason for starting this PR as well as the other one I just mentioned by sending the config objects in a topological sort order in respect to DependencyGraph (that should actually remove the requirement to iterate the types in a particular order at all).
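The idea of sending objects in topological order of their per-object dependencies can be sketched with Kahn's algorithm. This is a minimal illustration only: it uses a plain dictionary instead of the real DependencyGraph class, and the object names are hypothetical.

```python
from collections import deque


def topological_order(dependencies):
    """Order objects so that each one comes after everything it depends on.

    `dependencies` maps an object name to the names it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)

    remaining = {node: set(dependencies.get(node, ())) for node in nodes}
    dependents = {node: set() for node in nodes}
    for node, deps in remaining.items():
        for dep in deps:
            dependents[dep].add(node)

    # Start with the objects that depend on nothing.
    ready = deque(sorted(node for node, deps in remaining.items() if not deps))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in sorted(dependents[node]):
            remaining[child].discard(node)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical runtime-created objects and what they reference.
deps = {
    "timeperiod:workhours": {"timeperiod:holidays"},
    "host:web": {"timeperiod:workhours"},
    "downtime:web!maintenance": {"host:web"},
}
print(topological_order(deps))
```

Because the order is computed per object, the included time period is sent before the one that includes it, without any type-level ordering.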

Footnotes

  1. These two commands were my friends (just so that they are not only forgotten in my shell history): git grep -E 'name\([A-Za-z]+\)' -- '*.ti' and git grep load_after -- '*.ti'

Contributor

@julianbrost julianbrost left a comment

I had already written the following inline comment before figuring out #10000 (comment). I'm submitting it for completeness so that it doesn't get forgotten as a pending comment, but there's a good chance it will become obsolete.

Review comment on lib/icinga/dependency.ti (outdated, resolved)
@yhabteab
Member Author

Thus, we should be able to solve both the problem that was the reason for starting this PR as well as the other one I just mentioned by sending the config objects in a topological sort order in respect to DependencyGraph

No, it won't!

I actually did notice this, but didn't want to go into it too much since it's a relationship on a per object basis and not on the actual types themselves. Icinga 2 happily accepts such a config, and none of the proposed alternatives is going to overcome it.

object HostGroup "foo" {
    groups = [ "bar" ]
}

object HostGroup "bar" {
    groups = [ "foo" ]
}

{
    "attrs": {
        "__name": "bar",
        "display_name": "bar",
        "groups": [
            "foo"
        ],
        "name": "bar",
        "type": "HostGroup"
    },
    "name": "bar",
    "type": "HostGroup"
},
{
    "attrs": {
        "__name": "foo",
        "display_name": "foo",
        "groups": [
            "bar"
        ],
        "name": "foo",
        "type": "HostGroup"
    },
    "name": "foo",
    "type": "HostGroup"
}

@yhabteab
Member Author

And this config!

object TimePeriod "included" {
  excludes = ["excluded"]
  ranges = { "2024-10-14" = "14:00-15:00" }
}

object TimePeriod "excluded" {
  excludes = ["included"]
  ranges = { "2024-10-14" = "13:00-15:00" }
}

There is absolutely nothing in the code that prevents two objects from becoming dependent on each other.
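A check for such mutual dependencies could look roughly like the depth-first search below. It is a sketch over a plain dictionary of names, using the two HostGroup objects from above as input; `find_cycle` is a hypothetical helper and nothing like it is claimed to exist in the Icinga 2 code.

```python
def find_cycle(dependencies):
    """Return one dependency cycle as a list of names, or None if acyclic.

    `dependencies` maps an object name to the names it references.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in sorted(dependencies.get(node, ())):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path: cycle found.
                return path[path.index(dep):] + [dep]
            if dep_state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in sorted(dependencies):
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


host_groups = {"foo": ["bar"], "bar": ["foo"]}
print(find_cycle(host_groups))
```

The recursion keeps the sketch short; for very large object sets an explicit stack would avoid Python's recursion limit.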

@julianbrost
Contributor

Thus, we should be able to solve both the problem that was the reason for starting this PR as well as the other one I just mentioned by sending the config objects in a topological sort order in respect to DependencyGraph

No, it won't!

Indeed, it would not make every (currently) possible situation work, but it would at least work for all objects where a valid order exists.

object HostGroup "foo" {
    groups = [ "bar" ]
}

object HostGroup "bar" {
    groups = [ "foo" ]
}

object TimePeriod "included" {
  excludes = ["excluded"]
  ranges = { "2024-10-14" = "14:00-15:00" }
}

object TimePeriod "excluded" {
  excludes = ["included"]
  ranges = { "2024-10-14" = "13:00-15:00" }
}

With these objects, it's simply impossible to order them such that each object is synced after all of its dependencies. And such a config being accepted looks like a bug to me.
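To illustrate "at least for all objects where it's possible", the sketch below orders whatever can be ordered and reports the rest as stuck. It is a simple repeated-peeling approach over a plain dictionary, not the DependencyGraph API. The two cyclic pairs come from the configs above, while `hostgroup:linux` and `host:web` are hypothetical additions.

```python
def best_effort_order(dependencies):
    """Order all objects that can be ordered; report the rest as stuck.

    `dependencies` maps an object name to the names it depends on.
    Returns (order, stuck): `order` lists objects after their dependencies,
    `stuck` lists objects that are part of, or depend on, a cycle.
    """
    remaining = {name: set(deps) for name, deps in dependencies.items()}
    order = []
    while True:
        # An object is ready once none of its dependencies is still pending.
        ready = sorted(
            name for name, deps in remaining.items() if deps.isdisjoint(remaining)
        )
        if not ready:
            break
        order.extend(ready)
        for name in ready:
            del remaining[name]
    return order, sorted(remaining)


deps = {
    "hostgroup:foo": {"hostgroup:bar"},
    "hostgroup:bar": {"hostgroup:foo"},
    "timeperiod:included": {"timeperiod:excluded"},
    "timeperiod:excluded": {"timeperiod:included"},
    "hostgroup:linux": set(),           # hypothetical
    "host:web": {"hostgroup:linux"},    # hypothetical
}
order, stuck = best_effort_order(deps)
print(order)
print(stuck)
```

Anything that depends on a stuck object stays stuck as well, which matches the observation that the problem then spreads to objects referencing the affected time periods.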

Note that this ordering is only relevant for runtime-created objects, so defining something like this in a config file wouldn't even be a problem here. However, you can also create the dependency cycle by modifying attributes at runtime, where that is allowed: for nested host/service groups it isn't, since the groups attribute is marked no_user_modify, but with the time period example it's actually possible at the moment.

Review comment on lib/remote/apilistener-configsync.cpp (outdated, resolved)
Review comment on lib/remote/apilistener-configsync.cpp (outdated, resolved)
@yhabteab yhabteab force-pushed the broken-downtime-comment-sync branch 2 times, most recently from 0435672 to 6f4d8c8 Compare November 4, 2024 15:25
@yhabteab yhabteab requested review from Al2Klimov and julianbrost and removed request for Al2Klimov and julianbrost November 4, 2024 16:17
Labels
area/distributed Distributed monitoring (master, satellites, clients) area/runtime Downtimes, comments, dependencies, events bug Something isn't working cla/signed consider backporting Should be considered for inclusion in a bugfix release ref/IP
6 participants