-
As we discussed previously, we should not accept pre-trained models, as they are difficult to validate. Instead, we should accept new data and use it to train a new model or extend an existing one. That way, we can ensure the new data covers all the scenarios necessary for our purposes. For example, we can verify that the dataset has:
The dataset should also include metadata about the server:
This approach enables us to accurately characterize the model for a particular server configuration and to validate the data, since each set of server characteristics will produce a different power model. To protect privacy, users will have the option to include or exclude their own workloads in the dataset, in addition to our chosen set of applications. To prevent sensitive information from being exposed, we should recommend running the benchmarks in isolation so that no private data is leaked. Relatedly, how can we enhance workload isolation? Disabling scheduling on certain CPU cores might be an area worth investigating.
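As a sketch of the server-metadata idea above, the following shows how contributed data could carry basic machine characteristics parsed from `/proc/cpuinfo` and `/proc/meminfo`. The field names (`cpu_model`, `core_count`, `memory_kb`) are illustrative assumptions, not an agreed schema.

```python
"""Minimal sketch of collecting server metadata to attach to a contributed
dataset. Field names are illustrative assumptions, not a fixed schema."""
import json
import re


def read_proc_cpuinfo(text):
    """Parse the first 'model name' entry and count processors from
    /proc/cpuinfo-style text."""
    model = None
    cores = 0
    for line in text.splitlines():
        if line.startswith("processor"):
            cores += 1
        elif model is None and line.startswith("model name"):
            model = line.split(":", 1)[1].strip()
    return {"cpu_model": model, "core_count": cores}


def read_proc_meminfo(text):
    """Extract MemTotal (in kB) from /proc/meminfo-style text."""
    m = re.search(r"MemTotal:\s+(\d+)\s+kB", text)
    return {"memory_kb": int(m.group(1)) if m else None}


if __name__ == "__main__":
    # Inline sample text stands in for reading the real /proc files.
    cpuinfo = "processor : 0\nmodel name : Example CPU\nprocessor : 1\n"
    meminfo = "MemTotal:       16384000 kB\n"
    metadata = {**read_proc_cpuinfo(cpuinfo), **read_proc_meminfo(meminfo)}
    print(json.dumps(metadata))
```

On a real server the two parsers would be fed the contents of `/proc/cpuinfo` and `/proc/meminfo`, and the resulting JSON would travel alongside the Kepler metrics in the contribution.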
-
As discussed previously, we can set up the profiling/clustering as part of an unsupervised learning pipeline that runs separately from the model server proper. This way, we might be able to identify clusters we did not know about in advance. Contributors can contribute directly to the profiling and clustering models (via k-means clustering). We can also store the data for use in our training pipelines.
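To make the clustering step concrete, here is a minimal pure-Python k-means over 2-D metric vectors. This is a sketch, not Kepler's actual pipeline; a real implementation would likely use a library such as scikit-learn, and the features and choice of k here are assumptions.

```python
"""Minimal k-means sketch for grouping profiled workloads by their metric
vectors. Pure Python for clarity; features and k are illustrative."""
import random


def kmeans(points, k, iters=20, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


if __name__ == "__main__":
    # Two obvious groups: idle-like low values vs. compute-heavy high values.
    data = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8),
            (8.0, 9.1), (8.5, 9.0), (9.0, 8.7)]
    centroids, labels = kmeans(data, k=2)
    print(labels)
```

Running this pipeline periodically over contributed profiles would let new workload clusters surface on their own, after which each cluster could get its own power model.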
-
We need a contributing process and guidelines for obtaining data/models from contributors.
(i) What and how?
(i-1) Kepler metrics: anonymise container names? Is a license needed?
(i-2) Server metadata: cpuinfo, memoryinfo (related to How can we serve a model for the server that has no power measurement? #91 (ii-1))
(i-3) Performance values (not low-level counters) reported by the benchmark workload (related to How can we serve a model for the server that has no power measurement? #91 (ii-2))
(ii) How do we validate and merge data (PRs) from contributors? How do we verify a contributor? Should merging be limited to maintainers? GitHub verification?
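A sketch of what an automated validation step for contributed data PRs might check, covering the points above. The required fields and the container-name anonymisation convention (`c-` plus an 8-character hex hash) are assumptions for illustration, not an agreed schema.

```python
"""Sketch of validating a contributed dataset entry before merging.
Required fields and the anonymisation rule are illustrative assumptions."""
import re

REQUIRED_METADATA = {"cpu_model", "core_count", "memory_kb"}
# Assumed convention: container names are anonymised to "c-<8 hex chars>".
ANON_NAME = re.compile(r"^c-[0-9a-f]{8}$")


def validate_contribution(entry):
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    missing = REQUIRED_METADATA - set(entry.get("server_metadata", {}))
    if missing:
        problems.append(f"missing server metadata: {sorted(missing)}")
    for name in entry.get("container_names", []):
        if not ANON_NAME.match(name):
            problems.append(f"container name not anonymised: {name}")
    if "benchmark_results" not in entry:
        problems.append("no benchmark performance values included")
    return problems


if __name__ == "__main__":
    entry = {
        "server_metadata": {"cpu_model": "Example CPU", "core_count": 8},
        "container_names": ["c-1a2b3c4d", "my-app"],
        "benchmark_results": {"throughput": 123.4},
    }
    for problem in validate_contribution(entry):
        print(problem)
```

A check like this could run in CI on each contribution PR, so maintainers only review entries that already satisfy the schema and anonymisation rules.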