exact sentence which caused 'end_idx = -1' issue #12

Open
yaof20 opened this issue Dec 10, 2019 · 6 comments

Comments

yaof20 commented Dec 10, 2019

Hi there!
Sorry to bother you again.
I am using the ace_2005_td_v7_LDC2006T06.tgz dataset with the latest version of this GitHub repo.

While processing the training data, an assertion error occurred:
assert end_idx != -1, "end_idx: {}, end_pos: {}, phrase: {}, tokens: {}, chars:{}".format(end_idx, end_pos, phrase, tokens, chars)
AssertionError: end_idx: -1, end_pos: 133, phrase: Doctors Without Borders/Médecins Sans Frontières (MSF, tokens: [{'index': 1, 'word': '', 'originalText': '"', 'lemma': '', 'characterOffsetBegin': 0,

I simply commented out the assertion, and main.py then ran to completion without raising an exception.

Here is what I found in the output file:

"sentence": ""Doctors Without Borders/M\u8305decins Sans Fronti\u732bres (MSF) has received an extraordinary outpouring of support for the people of South Asia and we are extremely grateful.",
"golden-entity-mentions": [

  {
    "text": "Doctors Without Borders/M\u00e9decins Sans Fronti\u00e8res (MSF",
    "entity-type": "ORG:Non-Governmental",
    "start": 12,
    **"end": -1**
  },...]
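
One detail in this excerpt: the two fields disagree on the accented characters. The sentence has \u8305 and \u732b where the mention text has \u00e9 and \u00e8. As a small check (an assumption about the cause, not something confirmed in this thread), that pattern is what UTF-8 bytes look like when they are decoded as GBK, which would suggest that one of the inputs is being read with a platform default encoding rather than UTF-8:

```python
# Assumption: the text was stored as UTF-8 but decoded as GBK at some point.
original = "M\u00e9decins Sans Fronti\u00e8res"

# Encode as UTF-8, then decode the same bytes with the wrong codec.
garbled = original.encode("utf-8").decode("gbk")

# ascii() prints non-ASCII characters as \uXXXX escapes, which makes the
# result easy to compare with the "sentence" field in the output file.
print(ascii(original))
print(ascii(garbled))
```

If the escapes printed for `garbled` match the ones in the "sentence" field, passing an explicit `encoding="utf-8"` wherever the preprocessing opens files may be worth trying before editing the raw data.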

How can I solve this "end": -1 problem?
With the assertion commented out, the entity annotations in the output could be incomplete.
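
To see how many mentions are affected after disabling the assertion, the output can be scanned for mentions whose end is -1. This is only a sketch: the field names come from the excerpt above, while the assumption that the output file is a JSON list of sentence records, the function names, and the sample data are hypothetical.

```python
import json


def load_records(path):
    """Load the preprocessed output (assumed to be a JSON list of records)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_unresolved_mentions(records):
    """Yield (sentence, mention) pairs for entity mentions with end == -1."""
    for record in records:
        for mention in record.get("golden-entity-mentions", []):
            if mention.get("end") == -1:
                yield record.get("sentence", ""), mention


# Hypothetical sample shaped like the excerpt above.
sample = [
    {
        "sentence": "Doctors Without Borders (MSF) has received support.",
        "golden-entity-mentions": [
            {"text": "Doctors Without Borders (MSF", "start": 12, "end": -1},
            {"text": "support", "start": 20, "end": 21},
        ],
    }
]

for sentence, mention in find_unresolved_mentions(sample):
    print(repr(mention["text"]), "in:", sentence[:40])
```

For a real run, replace `sample` with `load_records(...)` pointed at the file that main.py wrote.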

Hanlard commented Dec 30, 2019

I ran into the same problem!

@scarydemon2

I hit the same problem with the same data.

@scarydemon2

You can work around this by editing the raw data in English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml and alt.vacation.las-vegas_20050109.0133.sgm.
In these two files, search for "Doctors Without" and change the é that follows to e. That solves the problem.

@daviddongkc

Hi,

I am doing research on information extraction and urgently need to use the ACE2005 dataset. Unfortunately, my university does not have an LDC licence for ACE2005.
May I ask whether you could, by any chance, share the dataset for research purposes?

Many thanks,
Regards,
kc

yaof20 commented Feb 26, 2021

> Hi,
>
> I am doing research on information extraction and urgently need to use the ACE2005 dataset. Unfortunately, my university does not have an LDC licence for ACE2005.
> May I ask whether you could, by any chance, share the dataset for research purposes?
>
> Many thanks,
> Regards,
> kc

Hi there,

Sorry for the late response. I am wondering whether you still need the dataset. If you are still interested, contact me by email (fengya0@outlook.com).

Regards,
Feng Yao

zyz0000 commented Aug 7, 2022

> You can work around this by editing the raw data in English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml and alt.vacation.las-vegas_20050109.0133.sgm. In these two files, search for "Doctors Without" and change the é that follows to e. That solves the problem.

In addition to changing é to e, one should also change è to e to solve the problem.
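
The manual edit described in this thread could also be scripted. The following is a minimal sketch, not part of the repo: the function names are hypothetical, it assumes the two files are UTF-8 encoded, and it replaces every é and è in a file rather than only the ones after "Doctors Without". It works on raw bytes so that line endings are left untouched, and each replacement swaps one character for one character, so character-based offsets in the annotation file should stay aligned.

```python
from pathlib import Path

# The two characters reported in this thread as triggering end_idx == -1.
REPLACEMENTS = {"\u00e9": "e", "\u00e8": "e"}  # é -> e, è -> e


def replace_accents(text, replacements=REPLACEMENTS):
    """Return text with each listed character swapped for its replacement."""
    for src, dst in replacements.items():
        text = text.replace(src, dst)
    return text


def patch_file(path, encoding="utf-8"):
    """Rewrite one file in place; return True if its content changed.

    Reads and writes bytes so that newlines are not translated.
    """
    path = Path(path)
    original = path.read_bytes().decode(encoding)
    patched = replace_accents(original)
    if patched == original:
        return False
    path.write_bytes(patched.encode(encoding))
    return True


# Usage (hypothetical corpus location; adjust to where the .tgz was extracted):
# base = Path("path/to/English/un/timex2norm")
# for name in ("alt.vacation.las-vegas_20050109.0133.apf.xml",
#              "alt.vacation.las-vegas_20050109.0133.sgm"):
#     print(name, patch_file(base / name))
```

Keeping a backup copy of the two original files before running something like this makes it easy to revert.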
