Pass PhEDEx metadata to FTS tasks for hadoop monitoring. #1085
See a similar open issue:
ASO implementation of the job metadata passed to FTS: https://github.com/dmwm/AsyncStageout/blob/master/src/python/AsyncStageOut/TransferWorker.py#L417
```
"job_metadata": {"issuer": "ASO", "user": self.user}
```
I'd advise extending this info further: it is better to use the DN instead of the user name. Also add a timestamp and the application name/version.
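Along those lines, a sketch of how the ASO snippet above might be extended (illustrative values; not what ASO actually sends):
```python
# Hypothetical extension of the ASO job_metadata: DN instead of the user
# name, plus a timestamp and the application name/version.
job_metadata = {
    "issuer": "ASO",
    "dn": "/DC=ch/DC=cern/OU=Users/CN=...",  # instead of "user"
    "time": 1495113054,                      # seconds since epoch
    "client": "AsyncStageOut_v1.0.8",        # application name/version
}
```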
Natalia,
does the PhEDEx DB itself record information about the application and the DN of whoever requested the transfer? If so, which table holds this info, and do you have an API for accessing it?
Yes, the DB has the information on who submitted the transfer request and who decided on the approval; have a look at the https://cmsweb.cern.ch/phedex/datasvc/doc/transferrequests API.
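For illustration, a minimal sketch of querying that API (the JSON endpoint layout /datasvc/json/prod/transferrequests and the request parameter follow the usual datasvc conventions, but treat the details, including the made-up request id, as assumptions to be checked against the linked docs):
```python
# Hypothetical lookup of a transfer request in the PhEDEx data service.
import json
import urllib.request

url = ("https://cmsweb.cern.ch/phedex/datasvc/json/prod/"
       "transferrequests?request=123456")  # 123456 is a made-up request id

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# The datasvc wraps results in a top-level "phedex" object.
print(json.dumps(data["phedex"], indent=2))
```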
I have addressed the extra info in the JSON file, which opens the possibility of more options. Meanwhile I have tried to address Valentin's basic request, so that the transfers from PhEDEx can be distinguished in his monitoring. If you want to test it before a new release happens, just patch your perl_lib/PHEDEX/Transfer/Backend/Job.pm file.
Hi all, sorry for the delay, but I forgot to subscribe to this issue :/ Btw, summarizing on the ASO side: in order to be compliant with what is proposed above, we may need to add:
```
"dn": "/abc/abc/...",
"time": 12315,
"request": {
    "source": "T...",
    "destination": "T..."
}
```
IMO the field "dataset" probably doesn't make too much sense for the CRAB case, while for example we may go for "taskname": "1231231_123123:user_taskname". What do you think?
Also, @vkuznet, by application name/version did you mean the ASO or the FTS client?
I guess we also need to be consistent with what WMA/CRAB will report to ES via WMArchive and HTCondor ClassAd feeding, so that we can e.g. find both jobs and transfers for a given user activity.
Not that I like to make things complex and vague, but IIUC this naming schema is the foundation of the monitoring work for the next N years, so we may not want to hurry too much to a conclusion.
> IMO the field "dataset" probably doesn't make too much sense for the CRAB case, while for example we may go for "taskname": "1231231_123123:user_taskname". What do you think?

Do you have a standard convention for task names? What do the integer fields mean, and why does your example contain a colon? These and other questions of this type will be important at the parsing level, so we had better standardize them.

> While, @vkuznet, by application name/version did you mean the ASO or the FTS client?

It is a tricky question. In the case of PhEDEx we have both the PhEDEx version and the underlying middleware one, and we should capture both:
```
"client": {
    "phedex": bla, "fts": bla, another_middleware...
}
```
About task names, let's refer to the documentation for ES which Brian wrote and which I pointed to earlier. I am all for collecting all such descriptions in a single place, but not in this issue.
The taskname is in the format YYMMDD_hhmmss:<username>_<user specified field>. In any case, please find below a proposed schema, slightly different from the one at the beginning of the thread:
```
{
    "issuer": "ASO | PHEDEX/user | DDM | ...",
    "time": 123456, (need to specify a common time zone)
    "client": {
        "service": "AsyncStageOut_v1.0.8 | PHEDEX-client-v4",
        "fts_client": "blabla"
    },
    "user": "username as from SiteDB",
    "dn": "/my/DN/here",
    "request": {
        "workflow": "belforte_crab_Pbp_HM185_250_q3_v3",
        "CRAB_Workflow": "170406_201711:belforte_crab_Pbp_HM185_250_q3_v3",
        "dataset": "/a/b/c", (may be empty for CRAB)
        "source": "T1_XXX",
        "destination": "T2_XXX"
    }
}
```
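Parsing-wise, that format splits on the first colon and the first underscore after it; a minimal sketch (the helper name is made up for illustration):
```python
def parse_taskname(taskname):
    # "YYMMDD_hhmmss:<username>_<user specified field>"
    stamp, rest = taskname.split(":", 1)
    username, user_field = rest.split("_", 1)
    return stamp, username, user_field

print(parse_taskname("170406_201711:belforte_crab_Pbp_HM185_250_q3_v3"))
# -> ('170406_201711', 'belforte', 'crab_Pbp_HM185_250_q3_v3')
```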
Time should always be in seconds since the epoch; then you don't really care about time zones. And we may be more generic about middleware: e.g. instead of fts_client:bla we may say middleware:fts-version. This type of generalization is better since it keeps the schema keys intact when the middleware client changes. You may also add metadata about the PhEDEx agent, e.g. agent:<some description of the agent> (such as host name|ip, etc.).
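To make that concrete, a sketch of the generalized keys (names and values are illustrative, not an agreed schema):
```python
# Illustrative only: job metadata with epoch-seconds time, a generic
# "middleware" key in place of fts_client, and an agent descriptor.
job_metadata = {
    "issuer": "PHEDEX",
    "time": 1495113054,                      # long, seconds since epoch
    "client": {
        "service": "PHEDEX-client-v4",
        "middleware": "fts-3.6.8",           # key stays put if the client changes
    },
    "agent": {"host": "phedex-agent01.example.org"},  # hypothetical agent info
}
```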
Hi,
I want to point out one important issue which may require a slight change in this metadata structure. Quite often people who visualize data want to see aggregation on dataset sub-parts, e.g. the data-tier. Therefore it will make more sense to replace dataset with three pieces: primds, procds, tier. When you store these three pieces it is easy to aggregate data, say, by data-tier or a similar query. At the same time it is easy to "compose" a dataset name out of them, since a dataset is just /primds/procds/tier.
Best,
Valentin.
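As a quick illustration of how cheap the round trip is (a minimal sketch; the helper names are made up):
```python
# Split a dataset name into (primds, procds, tier) and compose it back.
def split_dataset(dataset):
    # "/a/b/c" -> ("a", "b", "c")
    _, primds, procds, tier = dataset.split("/")
    return primds, procds, tier

def join_dataset(primds, procds, tier):
    return "/{}/{}/{}".format(primds, procds, tier)

parts = split_dataset("/a/b/c")
assert join_dataset(*parts) == "/a/b/c"
```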
Hi @vkuznet, what if we report four things, the full dataset name and the three pieces; would that be a problem? In the spirit that IIUC we are encouraging every operator, coordinator and user to build their favorite ES search or kibana/grafana dashboard, I would not mind looking also at "convenience".
I don't mind, but data-wise it is redundant. If a user knows the dataset, (s)he may construct a filter simply using the primds, procds, tier parts. The only difference for users would be whether to use the string /a/b/c or the parts. But storage-wise (if we really care) the additional strings (dataset) will add some overhead.
Hi @vkuznet, I wonder if you can see in your monitoring the FTS jobs dc6a077e-4640-11e7-b2fa-a0369f23cf8e or 52f63152-464a-11e7-873e-a0369f23cf8e. Is the metadata I put in there visible? Or could you please share the way you look at this, so we can have a look as well.
Alberto,
what are those values? Are they job_ids? I ran a Spark job over the 20170531 and 20170601 dates and asked to find docs with job_id equal to the values you posted; so far I have found nothing. But I'm not sure what I'm looking for, and neither do I know which dates such documents may belong to. Please clarify.
For example, here is how a typical FTS doc looks:
/afs/cern.ch/user/v/valya/public/DDP/fts.json
which has job_id == "10b4f232-aecd-5b51-b596-e943541159cb", and I was able to find this type of job.
So I need to know the date to look at and confirmation that the values you posted are job_ids or something else.
Best,
Valentin.
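For reference, a minimal PySpark sketch of this kind of lookup (the HDFS path and record layout are assumptions based on the description above, not the actual job that was run):
```python
# Scan one day of FTS JSON records on HDFS and keep the documents whose
# job_id matches. Point the path at the real FTS log location.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fts-jobid-lookup").getOrCreate()

wanted = ["dc6a077e-4640-11e7-b2fa-a0369f23cf8e",
          "52f63152-464a-11e7-873e-a0369f23cf8e"]

docs = spark.read.json("hdfs:///path/to/fts/2017/05/31/")  # assumed layout
matches = docs.filter(docs.job_id.isin(*wanted))
matches.select("job_id", "job_metadata").show(truncate=False)
```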
Hi Valentin, yes, they are job_ids. They are from 2017-05-31. Best regards.
Alberto,
in this case I can't find these jobs on HDFS.
Valentin.
Hi,
I found both Alberto's job IDs on the FNAL FTS server. I guess for the purpose of this test we need to submit to CERN FTS?
Yes, I'm looking at FTS logs at CERN HDFS.
I have submitted to CERN; the job_id (from a few minutes ago) is f2857a08-47c0-11e7-9660-02163e018fe3.
Alberto,
I found a few records with this job id, but when I look at the metadata of a particular record it is far from complete compared with what we discussed here: e.g. there is no request part, user_dn is empty, etc.
Here is a record I found:
```
{
    "activity": "PHEDEX",
    "block_size": 0,
    "buf_size": 0,
    "channel_type": "urlcopy",
    "chk_timeout": 0,
    "dest_srm_v": " ",
    "dst_hostname": "proton.fis.cinvestav.mx",
    "dst_se": "",
    "dst_site_name": "",
    "dst_url": "gsiftp://proton.fis.cinvestav.mx//meson/data/store/mc/RunIISummer16MiniAODv2/InclusiveBtoJpsitoMuMu_SoftQCDnonD_TuneCUEP8M1_wFilter_13TeV-pythia8-evtgen/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/120000/028FFD1E-1D15-E711-A6FA-008CFA0A58F8.root",
    "endpnt": "fts3.cern.ch",
    "f_size": 2207040911,
    "file_id": "1545492631",
    "file_size": 2207040911,
    "ipv6": false,
    "job_id": "f2857a08-47c0-11e7-9660-02163e018fe3",
    "job_metadata": {
        "client": "fts-client-3.6.8",
        "issuer": "PHEDEX",
        "time": "1495113054",
        "user": "phedex"
    },
    "job_state": "UNKNOWN",
    "log_link": "https://fts3.cern.ch:8449/fts3/ftsmon/#/f2857a08-47c0-11e7-9660-02163e018fe3",
    "nstreams": 0,
    "remote_access": true,
    "retry": 0,
    "retry_max": 0,
    "src_hostname": "cmsdcatape01.fnal.gov",
    "src_se": "srm://cmsdcatape01.fnal.gov",
    "src_site_name": "",
    "src_srm_v": "2.2.0",
    "src_url": "srm://cmsdcatape01.fnal.gov:8443/srm/managerv2?SFN=/11/store/mc/RunIISummer16MiniAODv2/InclusiveBtoJpsitoMuMu_SoftQCDnonD_TuneCUEP8M1_wFilter_13TeV-pythia8-evtgen/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/120000/028FFD1E-1D15-E711-A6FA-008CFA0A58F8.root",
    "srm_space_token_dst": "null",
    "srm_space_token_src": "",
    "t__error_message": "Destination file exists and overwrite is not enabled",
    "t_channel": "cmsdcatape01.fnal.gov__proton.fis.cinvestav.mx",
    "t_error_code": 17,
    "t_failure_phase": "TRANSFER_PREPARATION",
    "t_final_transfer_state": "Error",
    "t_final_transfer_state_flag": 0,
    "t_timeout": 3600,
    "tcp_buf_size": 0,
    "time_srm_fin_end": 0,
    "time_srm_fin_st": 0,
    "time_srm_prep_end": 0,
    "time_srm_prep_st": 0,
    "timestamp_checksum_dest_ended": 0,
    "timestamp_checksum_dest_st": 0,
    "timestamp_chk_src_ended": 0,
    "timestamp_chk_src_st": 0,
    "timestamp_tr_comp": 0,
    "timestamp_tr_st": 0,
    "tr_bt_transfered": 0,
    "tr_error_category": "FILE_EXISTS",
    "tr_error_scope": "DESTINATION",
    "tr_id": "2017-06-02-1841__cmsdcatape01.fnal.gov__proton.fis.cinvestav.mx__1545492631__f2857a08-47c0-11e7-9660-02163e018fe3",
    "tr_timestamp_complete": 1496428892196,
    "tr_timestamp_start": 1496428886331,
    "user": "",
    "user_dn": "",
    "vo": "
```
Hi Valentin,
Hi All,
```
"issuer": "PHEDEX"
"client": "fts-client-3.6.8"
"time": "1495113054"
"user": "phedex"
```
Valentin, do you know where this "activity" field value is coming from?
```
{ "activity": "PHEDEX",
```
Thanks,
Natalia.
Natalia,
I have no idea how, where, and by whom the metadata attributes are filled out. My understanding is that the docs are initiated at PhEDEx/clients, then they're probably wrapped by MONIT/kafka/etc., i.e. by the tools you sent the info to.
Best,
Valentin.
Here are a few comments from my side:

> "issuer": "PHEDEX" - looks fine to me and is consistent both with ASO (see their code snippet above) and ATLAS jobs: {"multi_sources": false, "issuer": "rucio"}.

It would be nice if you settled on either a lower-case or an upper-case convention. As you pointed out, ATLAS uses lower case; I don't remember how ASO filled out this part, but you should at least be consistent.

> "time": "1495113054" - need to clarify to which event this timestamp belongs. We could use a name format similar to the FTS fields: "tr_timestamp_complete": 1496428892196, "tr_timestamp_start": 1496428886331.

Time should be a long data type, not a string, and it should always be given as seconds since the epoch, which can be converted to any other format.

> "user": "phedex" - the name of the local user running the daemon does not have much value for transfers monitoring. What people actually want to know is the "activity", or group, of the requested transfer; see the related issue #1041. However, this will require substantial changes to the current TMDB schema and central agents to propagate this info down to the download agent, which actually submits the transfer jobs.

Probably more important is the DN of the user rather than the user name itself, since we can resolve user attributes from the DN via SiteDB.
Best,
Valentin.
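Pulling these suggestions together, a sketch of what the improved metadata could look like (field names and values are illustrative only, not an agreed schema):
```python
import time

# Hypothetical job_metadata following the comments above: a consistent
# lower-case issuer, time as a long in seconds since epoch, and the
# submitter's DN instead of the local unix user.
job_metadata = {
    "issuer": "phedex",                      # lower case, consistent with rucio
    "client": "fts-client-3.6.8",
    "time": int(time.time()),                # long, seconds since epoch
    "dn": "/DC=ch/DC=cern/OU=Users/CN=...",  # resolvable via SiteDB
}
```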
@nataliaratnikova could you please update me on where you stand on this issue? Did you implement the required features? Did you propagate them to the agents? Did you verify that these features now appear in the FTS logs?
@vkuznet The code is released; however, we are not yet asking the sites to upgrade, because CERN reported high load on the servers when they upgraded to the new version of the agents, and this is not yet fully understood.
OTOH I see the T2_GR_Ioannina site has upgraded and is successfully using PHEDEX_4_2_2: both new features are propagated as expected, see e.g. b622c26c-bd83-11e7-bdb4-02163e01811c on CERN FTS:
```
INFO Mon Oct 30 16:05:26 2017; Job metadata: {"client": "fts-client-3.6.10", "issuer": "PHEDEX"}
```
Natalia,
thanks for the update. I'd be glad if you'd notify me when all agents have been upgraded. I only need this to know that we can start doing analysis with the FTS data on HDFS.
Thanks,
Valentin.
Hi all, I'm curious as to the status of this effort? It naively looks like few if any of the non-ASO CMS transfers have a job_metadata field (well, 0 of the first 1000 I looked at).
Hi David, where did you look this up? PhEDEx has been sending the metadata starting from 4.2.2, see my previous post. Only a few sites have upgraded to 4.2.2, but I can already see the stats for PhEDEx-submitted transfers, and also Rucio transfers from our evaluation tests, in the CERN MONIT grafana dashboard. Natalia.
Hi @nataliaratnikova, a belated reply: I'm looking at the FTS records in hadoop. The ones from ATLAS or from CRAB have useful metadata; the ones from PhEDEx do not. But I was looking at T2s and could easily have missed the two(!!) of them that were using the new version. That is certainly far below a useful threshold for me. Is there a planned time scale for completing this?
Hi @davidlange6, the development part is complete. For deployment I do not have any particular goal within the PhEDEx project, as this is not for our internal use. If you have a dependent milestone, we can bring this up with the site support team and ask for their help with the upgrade.
I'm not sure what a dependent milestone is, sorry. It would be great if this were all deployed before routine data taking starts this year...
Valentin requests that PhEDEx metadata be propagated to FTS. The FTS logs will appear on HDFS; he can then run a Spark job over the FTS logs to extract some context for the global transfers monitoring. Here is the metadata Valentin is interested in:
How such metadata should be structured is a subject of the FTS metadata format/schema. I can't tell you more; you'll need to coordinate with the FTS developers. It would be nice to use the same schema among the different CMS FTS users, e.g. PhEDEx and ASO. This is why I CC'ed Diego: according to him, the FTS logs already carry ASO metadata, e.g. job_metadata.issuer=ASO. Since we're talking about metadata, its structure may change or be adjusted over time.
Here are typical examples. If I'm placing a request to PhEDEx, then the metadata would be something like:
```
{
    "issuer": "PHEDEX/user",
    "time": 123456,
    "client": "PHEDEX-client-version-1",
    "dn": "/my/DN/here",
    "request": {
        "dataset": "/a/b/c",
        "source": "T1_XXX",
        "destination": "T2_XXX"
    }
}
```
If DDM is placing a request, you'd create:
```
{
    "issuer": "PHEDEX/DDM",
    ...
}
```