This repository has been archived by the owner on Nov 8, 2022. It is now read-only.

Issues #29 and #30 #31

Open · evilezh wants to merge 8 commits into master

Conversation

@evilezh commented Nov 15, 2016

This is an approximate implementation of the ideas in #29 and #30.
I tested #29 and it works; I will test #30 a bit later. I need to roll it out and change some things before I can say it works correctly.

It is roughly the first code I have written in Go, so feel free to modify it and make it nicer.

Sorry, no tests and no documentation changes.

Here is what I added to the configuration:
key: "plugin_running_on"
output_type: "tree"

Both are optional; if not specified, the plugin defaults to the previous behavior.
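For illustration, the publish section of a task manifest using both options might look like the snippet below. The key and output_type entries are what this PR adds; the surrounding field names and values (topic, brokers) are assumptions for the sketch, not taken from this PR:

```json
"publish": [
  {
    "plugin_name": "kafka",
    "config": {
      "topic": "snap",
      "brokers": "localhost:9092",
      "key": "plugin_running_on",
      "output_type": "tree"
    }
  }
]
```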

@evilezh (Author) commented Dec 28, 2016

I switched to the new plugin API.
New defaults:

  • output type = tree. My guess is that anyone who uses Kafka and has more than 5-10 servers would like messages reduced in size by 80-90%. In my case the array format produces a 250+ kB message every 10 seconds; in tree format, with redundant attributes removed, it is ~25 kB.
  • the tag "plugin_running_on" is used as the partitioning key. There is no option to disable the partitioning key entirely (e.g. to use random partitioning); I can add one if someone thinks it would be useful. (A sketch of how such a key is attached with sarama appears at the end of this comment.)

P.S.
If someone could fix the tests and the build scripts/godeps, that would be awesome.
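To make the partitioning-key default concrete, here is a minimal sketch of how a tag value could be attached as the Kafka message key with sarama. The buildKey and publish helpers are illustrative, not the PR's actual code:

```go
package publisher

import "github.com/Shopify/sarama"

// buildKey picks the partitioning key from a metric's tags. With the
// "plugin_running_on" tag, all messages from one host hash to the same
// partition; returning nil lets sarama fall back to its default partitioner.
func buildKey(tags map[string]string, keyTag string) sarama.Encoder {
	if v, ok := tags[keyTag]; ok {
		return sarama.StringEncoder(v)
	}
	return nil
}

// publish sends one serialized document, keyed by the configured tag.
func publish(producer sarama.SyncProducer, topic string, tags map[string]string, payload []byte) error {
	msg := &sarama.ProducerMessage{
		Topic: topic,
		Key:   buildKey(tags, "plugin_running_on"),
		Value: sarama.ByteEncoder(payload),
	}
	_, _, err := producer.SendMessage(msg)
	return err
}
```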

@mbbroberg

Thanks for the update @evilezh. The maintainers of Snap are unfamiliar with Kafka, so this PR will take much longer to review than we'd like. I might use this as an excuse to dig into it. We'll keep you posted. Happy New Year!

@katarzyna-z (Contributor) commented Jan 31, 2017

I've launched the plugin with your changes and I see that some information is lost: unit, version, and last_advertised_time.

Output from the current kafka publisher:

[
  {
    "timestamp": "2017-01-30T11:02:29.933141255+01:00",
    "namespace": "/intel/psutil/load/load1",
    "data": 0.44,
    "unit": "Load/1M",
    "tags": {
      "plugin_running_on": "kujawka-Z97X-UD7-TH"
    },
    "version": 0,
    "last_advertised_time": "2017-01-30T11:02:29.93389848+01:00"
  },
  {
    "timestamp": "2017-01-30T11:02:29.933141825+01:00",
    "namespace": "/intel/psutil/load/load15",
    "data": 0.56,
    "unit": "Load/15M",
    "tags": {
      "plugin_running_on": "kujawka-Z97X-UD7-TH"
    },
    "version": 0,
    "last_advertised_time": "2017-01-30T11:02:29.933898807+01:00"
  }
]

Output from the kafka publisher with your changes:

{
  "_tags": {
    "plugin_running_on": "kujawka-Z97X-UD7-TH"
  },
  "_timestamp": 1485772518024965428,
  "intel": {
    "psutil": {
      "load": {
        "load1": 0.15,
        "load15": 0.5
      }
    }
  }
}

@evilezh (Author) commented Feb 7, 2017

That is the general idea of the conversion: drop unused fields and keep only the actual data.
So, what happens:
There is a single timestamp per document, converted to nanoseconds.
Tags likewise appear once per document.
last_advertised_time, version, and unit are dropped as unnecessary.

The format is optional and depends on your requirements. If you need the full output, you can use the old method. If the path and the data are enough for you, you are welcome to use the tree-like output, which in general saves around 90% of traffic and storage on Kafka.
It is a big difference whether each host sends 300 kB or 30 kB documents every 10 seconds.
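As a rough illustration of that conversion, here is a minimal sketch that folds a batch of metrics into one tree document. It assumes each metric is reduced to its namespace path and data value; the Metric type and buildTree helper are mine, not the PR's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Metric is a stripped-down stand-in for a Snap metric: just the
// namespace path and the value; everything else is dropped.
type Metric struct {
	Namespace string
	Data      float64
}

// buildTree produces one document per batch: a single _timestamp (in
// nanoseconds) and a single _tags object, with values nested by path.
func buildTree(metrics []Metric, tags map[string]string, tsNano int64) map[string]interface{} {
	root := map[string]interface{}{
		"_timestamp": tsNano,
		"_tags":      tags,
	}
	for _, m := range metrics {
		parts := strings.Split(strings.Trim(m.Namespace, "/"), "/")
		node := root
		for _, p := range parts[:len(parts)-1] {
			child, ok := node[p].(map[string]interface{})
			if !ok {
				child = map[string]interface{}{}
				node[p] = child
			}
			node = child
		}
		node[parts[len(parts)-1]] = m.Data
	}
	return root
}

func main() {
	doc := buildTree(
		[]Metric{{"/intel/psutil/load/load1", 0.15}, {"/intel/psutil/load/load15", 0.5}},
		map[string]string{"plugin_running_on": "kujawka-Z97X-UD7-TH"},
		1485772518024965428,
	)
	out, _ := json.Marshal(doc)
	fmt.Println(string(out)) // matches the tree-format example above
}
```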

@mkleina commented Feb 13, 2017

I think data like unit and version should be kept for future filtering and debugging (to know which version of the plugin was used in case a value is incorrect), so the output should look like:

{
  "_tags": {
    "plugin_running_on": "kujawka-Z97X-UD7-TH"
  },
  "_timestamp": 1485772518024965428,
  "intel": {
    "psutil": {
      "load": {
        "load1": {
          "value": 0.15,
          "version": 0,
          "unit": "Load/1M",
          "last_advertised_time": "2017-01-30T11:02:29.93389848+01:00"
        },
        "load15": {
          "value": 0.5,
          "version": 0,
          "unit": "Load/15M",
          "last_advertised_time": "2017-01-30T11:02:29.933898807+01:00"
        }
      }
    }
  }
}

Anyway, I don't think that changing the data format is a good way to achieve lower bandwidth. The correct way to optimize data transfer is to compress it. It seems that data compression is supported both by Kafka and by sarama (Kafka's client library for Go). You can find a simple compression example here:
https://github.com/Shopify/sarama/blob/9a9e66f928cbc9febba598484ab34dcef6f8f43e/examples/http_server/http_server.go#L219

I see that in snap-plugin-publisher-kafka, nil is passed as the config parameter to NewSyncProducer, so the config's Compression parameter is left at 0, which corresponds to CompressionNone (https://godoc.org/github.com/Shopify/sarama#CompressionCodec).

Maybe it's worth playing with the compression parameter instead of refactoring the whole output format?
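For concreteness, a minimal sketch of that change, assuming Snappy as the codec (the function name and broker list are placeholders):

```go
package publisher

import "github.com/Shopify/sarama"

// newCompressedProducer replaces the nil config currently passed to
// NewSyncProducer with one that enables on-the-wire compression.
func newCompressedProducer(brokers []string) (sarama.SyncProducer, error) {
	config := sarama.NewConfig()
	config.Producer.Compression = sarama.CompressionSnappy // CompressionGZIP also works
	config.Producer.Return.Successes = true                // required by SyncProducer
	return sarama.NewSyncProducer(brokers, config)
}
```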

@evilezh (Author) commented Mar 3, 2017

There is nothing to argue about: whoever wants the previous format can still output it. I added a new output format which throws out all the unnecessary data that in real life maybe 0.01% of users need. This is not only about compression, but also about the data representation for further processing.
Do you know what last_advertised_time is and what it can be used for? Do you need version?
If you want the extended format, you can still use [{},{},{}]; if you want something simpler, I suggest using my way.
My use case is systems based on Logstash/Elasticsearch, which take documents as input rather than single metrics. In that case all the extra fields like version, unit, and last_advertised_time are garbage.

@PatrykMatyjasek

@evilezh maybe it can be resolved like in the elasticsearch publisher. I've added a configuration option which lets the user choose which fields of a metric to publish. The default is to publish the whole metric structure, but you can choose, for example, only Namespace and Data, and then only those parts are published to the database.
Sample manifest:
https://github.com/intelsdi-x/snap-plugin-publisher-elasticsearch/blob/master/examples/tasks/logs-openstack-elasticsearch.json#L40
Maybe it should be configurable for other publisher plugins as well.

@evilezh (Author) commented Mar 10, 2017

As for Elasticsearch: if you use it, you will agree that this format is the most effective. In Elasticsearch you define mappings anyway, so anything except the value is garbage. You do not store a single metric per document; you do it like this:

{ "timestamp":12123123,
  "data1": 123,
  "data2": 321
}

not like:

{ "data1" : { 
    "value": 123,
    "timestamp":123123123
   },
  "data2": { 
    "value": 321,
    "timestamp":123123123
  }
}

I did some research on the plugins I use: all the fields other than the value are usually filled with the same repeated value.

And about compression on the wire: yes, it reduces the amount of data to transfer, but it is still a huge amount to store and a huge amount to parse. I would also need to post-process it, as I did before, to cut it all out. It makes much more sense to cut all the unnecessary data out before it is even sent to Kafka. Otherwise the pipeline is server -> kafka -> post-processing -> kafka -> many readers.

And you can always add other output formats for Kafka, like CSV, syslog, etc. :) I just added one, because I think any Elasticsearch user would benefit a lot from it, as the data can be consumed very easily without expensive post-processing.
