Some characters in schema.org documents cause them not be indexed #4

Open
taojing2002 opened this issue Jul 18, 2021 · 6 comments

@taojing2002
Collaborator

When indexing objects from BCODMO, we saw errors like the following:

[ERROR] 2021-07-18 19:12:04,664 (HTTPService:writeError:241) URL: http://localhost:8983/solr/search_core/update?commit=true
[ERROR] 2021-07-18 19:12:04,664 (HTTPService:writeError:242) Post: 
[ERROR] 2021-07-18 19:12:04,664 (HTTPService:writeError:245) <?xml version="1.0" encoding="utf-8"?>
<add><doc><field name="id">sha256:26a061c8f8177d417d5ed8b29d8c6cf62f0d9a96bbd08a6afa3e7bc309bc9624</field><field name="seriesId">http://lod.bco-dmo.org/id/dataset/3782</field><field name="fileName">tmpb4ps2wqk</field><field name="mediaType">application/ld+json</field><field name="formatId">science-on-schema.org/Dataset;ld+json</field><field name="formatType">METADATA</field><field name="size">37672</field><field name="checksum">b40b52ac2651f915f0a7b29da8b20bf2</field><field name="submitter">http://orcid.org/0000-0002-6513-4996</field><field name="checksumAlgorithm">MD5</field><field name="rightsHolder">urn:node:BCODMO</field><field name="replicationAllowed">true</field><field name="numberReplicas">3</field><field name="dateUploaded">2019-05-29T20:24:00.000Z</field><field name="dateModified">2021-07-17T20:50:37.000Z</field><field name="datasource">urn:node:BCODMO</field><field name="authoritativeMN">urn:node:BCODMO</field><field name="replicaMN">urn:node:BCODMO</field><field name="replicaMN">urn:node:CN</field><field name="replicationStatus">completed</field><field name="replicationStatus">completed</field><field name="replicaVerifiedDate">2021-07-18T00:17:18.899Z</field><field name="replicaVerifiedDate">2021-07-18T00:17:18.939Z</field><field name="readPermission">public</field><field name="isPublic">true</field><field name="dataUrl">https://cn.dataone.org/cn/v2/resolve/sha256%3A26a061c8f8177d417d5ed8b29d8c6cf62f0d9a96bbd08a6afa3e7bc309bc9624</field><field name="abstract">&lt;p&gt;CTD measurements at water sample depths and Niskin bottle water samples from the Bermuda Atlantic Time-series Study (BATS) and from Station S, located 25 km SE of Bermuda (32°10&#4;N, 64°30&#4;W)&amp;nbsp;Measurements have been collected since 1988 and include nutrients, biogeochemical concentration, bacterial enumeration, and cyanobacteria.&lt;/p&gt;
</field><field name="title">Niskin bottle water samples and CTD measurements at water sample depths collected at Bermuda Atlantic Time-Series sites in the Sargasso Sea ongoing from 1955-01-29 (BATS project)</field><field name="label">Niskin bottle samples</field><field name="awardNumber">http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0752366</field><field name="awardTitle">OCE-0752366</field><field name="author">Rodney Johnson</field><field name="pubDate">2019-05-29T00:00:00.000Z</field><field name="funderIdentifier">https://doi.org/10.13039/100000141</field><field name="funderName">NSF Division of Ocean Sciences</field><field name="origin">Rodney Johnson</field><field name="keywords">oceans</field><field name="southBoundCoord">19.225</field><field name="westBoundCoord">-74.6</field><field name="northBoundCoord">39.455</field><field name="eastBoundCoord">-59.649</field><field name="beginDate">1955-01-29T00:00:00.000Z</field><field name="endDate">2016-12-18T00:00:00.000Z</field><field name="parameter">Sigma-Theta</field><field name="parameter">Nitrite-1</field><field name="parameter">pig7 (19-Hexfu ng/kg)</field><field name="parameter">pig20 (a-Carotene ng/kg)</field><field name="parameter">Nitrite-1</field><field name="parameter">pig5 (19-Butfu ng/kg)</field><field name="parameter">cast number Cast number; 1-80=CTD casts; 81-99=Hydrocasts (i.e. 
83 = Data from Hydrocast number 3)</field><field name="parameter">Prochlorococcus</field><field name="parameter">Oxygen-1</field><field name="parameter">Cruise type; 1=BATS core; 2=BATS Bloom a; 3=BATS Bloom b; 5=BATS Validation cruise; 6=Hydrostation</field><field name="parameter">longitude with positive values East</field><field name="parameter">Bacteria enumeration</field><field name="parameter">pig12 (Zea+lut ng/kg)</field><field name="parameter">Salinity-1</field><field name="parameter">Nitrate+Nitrite-1</field><field name="parameter">pig19 (Zeax ng/kg)</field><field name="parameter">Nitrate+Nitrite-1</field><field name="parameter">pig4 (peri ng/kg)</field><field name="parameter">cruise number</field><field name="parameter">Particulate lithogenic silica</field><field name="parameter">date and time represented in ISO 8601 format</field><field name="parameter">Latitude with positive values North</field><field name="parameter">TN NOTE: Prior to BATS 121; DON is reported instead of TON</field><field name="parameter">pig11 (Diat ng/kg)</field><field name="parameter">CTD Salinity</field><field name="parameter">A unique bottle id which identifies cruise; cast; and Nisken number</field><field name="parameter">Nanoeukaryotes</field><field name="parameter">pig18 (Lutein ng/kg)</field><field name="parameter">Alkalinity</field><field name="parameter">pig3 (chl c1+c2 ng/kg)</field><field name="parameter">Particulate biogenic silica</field><field name="parameter">pig16 (Turn Chl a ug/kg)</field><field name="parameter">Pressure</field><field name="parameter">pig21 (b-Carotene ng/kg)</field><field name="parameter">pig17 (Turn Phaeo ug/kg)</field><field name="parameter">Phosphate-1</field><field name="parameter">dissolved inorganic carbon</field><field name="parameter">Synechococcus</field><field name="parameter">pig9 (Diad ng/kg)</field><field name="parameter">Phosphate-1</field><field name="parameter">PON</field><field name="parameter">pig1 (Chl3 c3 ng/kg)</field><field 
name="parameter">Temperature ITS-90</field><field name="parameter">pig14 (Chl a ng/kg)</field><field name="parameter">POC</field><field name="parameter">pig15 (a+b Carotene ng/kg)</field><field name="parameter">Oxy Anomaly-1</field><field name="parameter">pig8 (Pras ng/kg)</field><field name="parameter">textual description of the cruise type</field><field name="parameter">Total dissolved Phosphorus</field><field name="parameter">Niskin number</field><field name="parameter">Low-level phosphorus</field><field name="parameter">Picoeukaryotes</field><field name="parameter">depth</field><field name="parameter">name of the originators file</field><field name="parameter">pig13 (Chl b ng/kg)</field><field name="parameter">Silicate-1</field><field name="parameter">Decimal Year</field><field name="parameter">pig10 (Allox ng/kg)</field><field name="parameter">Oxygen Fix Temp</field><field name="parameter">Silicate-1</field><field name="parameter">POP</field><field name="parameter">TOC</field><field name="parameter">pig6 (fuco ng/kg)</field><field name="parameter">pig2 (chlidea ng/kg)</field><field name="edition">1</field><field name="serviceEndpoint">https://www.bco-dmo.org/dataset/3782</field></doc></add>
[ERROR] 2021-07-18 19:12:04,665 (HTTPService:writeError:246) 


Response: 

[ERROR] 2021-07-18 19:12:04,665 (HTTPService:writeError:249) <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">1</int></lst><lst name="error"><str name="msg">[com.ctc.wstx.exc.WstxLazyException] Illegal character entity: expansion character (code 0x4) not a valid XML character
 at [row,col {unknown-source}]: [2,1686]</str><str name="trace">[com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x4) not a valid XML character
 at [row,col {unknown-source}]: [2,1686]
	at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
	at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:671)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3505)
	at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:804)
	at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:403)
	at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:249)
	at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
	at org.eclipse.jetty.server.Server.handle(Server.java:497)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x4) not a valid XML character
 at [row,col {unknown-source}]: [2,1686]
	at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:451)
	at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2342)
	at com.ctc.wstx.sr.StreamScanner.checkAndExpandChar(StreamScanner.java:2288)
	at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1147)
	at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4492)
	at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:3964)
	at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3543)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3503)
	... 32 more
</str><int name="code">500</int></lst>
</response>

[ERROR] 2021-07-18 19:12:04,665 (HTTPService:writeError:238) Unable to write to stream
java.io.IOException: unable to update solr, non 200 response code.
	at org.dataone.cn.indexer.solrhttp.HTTPService.sendUpdate(HTTPService.java:139)
	at org.dataone.cn.indexer.solrhttp.HTTPService.sendUpdate(HTTPService.java:117)
	at org.dataone.cn.indexer.SolrIndexService.sendCommand(SolrIndexService.java:343)
	at org.dataone.cn.indexer.SolrIndexService.insertIntoIndex(SolrIndexService.java:307)
	at org.dataone.cn.index.processor.IndexTaskUpdateProcessor.process(IndexTaskUpdateProcessor.java:50)
	at org.dataone.cn.index.processor.IndexTaskProcessor.processTask(IndexTaskProcessor.java:288)
	at org.dataone.cn.index.processor.IndexTaskProcessor.access$000(IndexTaskProcessor.java:80)
	at org.dataone.cn.index.processor.IndexTaskProcessor$1.run(IndexTaskProcessor.java:265)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

It looks like the document contains special characters that we need to escape.

@taojing2002 taojing2002 changed the title from "Some schema.org document can't be indexed" to "Some characters in schema.org document cause it not be indexed" Jul 18, 2021
@taojing2002 taojing2002 changed the title from "Some characters in schema.org document cause it not be indexed" to "Some characters in schema.org documents cause them not be indexed" Jul 18, 2021
@mbjones mbjones transferred this issue from DataONEorg/d1_cn_index_processor Jun 22, 2022
@mbjones mbjones added this to the 2.4.0 milestone Aug 5, 2022
@mbjones mbjones modified the milestones: 2.4.0, 3.0.0 Sep 6, 2022
@taojing2002 taojing2002 self-assigned this Sep 29, 2022
@taojing2002
Collaborator Author

taojing2002 commented Sep 29, 2022

It seems the description in the document contains a string like (32\u00b010\u0004N, 64\u00b030\u0004W), and the parser can't handle one of these Unicode characters.

@mbjones
Member

mbjones commented Sep 30, 2022

Can you show those characters in the context of the schema.org document, please?

@taojing2002
Collaborator Author

The original string is:

(32\u00b010\u0004N, 64\u00b030\u0004W)

After expansion (adding context):

(32°10\u0004N, 64°30\u0004W)

In the Solr doc, before it is sent to the Solr server:

(32°10&#4;N, 64°30&#4;W)

@taojing2002
Collaborator Author

taojing2002 commented Oct 11, 2022

Another description has the value:

32\u00b0 10'N, 64\u00b0 30'W

after expansion and compaction:

32° 10'N, 64° 30'W

The solr doc is:

32° 10&apos;N, 64° 30&apos;W

It works well.

@taojing2002
Collaborator Author

taojing2002 commented Oct 11, 2022

It seems the author used \u0004 (EOT, End of Transmission) in place of the apostrophe (\u0027). After I replaced \u0004 with \u0027, everything works. But I am not sure why Solr can't handle EOT (&#4;).
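
A likely explanation for the Solr error: XML 1.0 only allows tab, line feed, carriage return, and code points from U+0020 upward (excluding surrogates, U+FFFE and U+FFFF), and a character reference such as &#4; must also resolve to one of those characters. U+0004 is outside that set, which matches the "not a valid XML character" message in the trace above. A minimal Java sketch of that check follows; the class and method names are hypothetical and not part of dataone-indexer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of dataone-indexer. */
final class Xml10CharCheck {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the char index of every code point that XML 1.0 forbids. */
    static List<Integer> findInvalid(String s) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValidXml10(cp)) {
                positions.add(i);
            }
            i += Character.charCount(cp);
        }
        return positions;
    }

    public static void main(String[] args) {
        // The coordinate string from the BCODMO description, with U+0004 in place of the apostrophe.
        String description = "(32\u00b010\u0004N, 64\u00b030\u0004W)";
        System.out.println(findInvalid(description)); // prints [6, 15]
    }
}
```

The degree sign (U+00B0) and the apostrophe (U+0027) both pass this check, which is consistent with the second description indexing without errors.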

@taojing2002
Collaborator Author

We need to escape the special characters in dataone-indexer.
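
One caveat on escaping: the numeric reference &#4; is already an escaped form and is exactly what Solr rejected, because XML 1.0 does not permit U+0004 in any form. A possible approach is therefore to drop or replace such characters in each field value before the Solr document is built. The sketch below assumes a hypothetical `SolrFieldSanitizer` helper; where it would hook into dataone-indexer, and whether to drop the character or substitute something else, are open choices and not something this thread decides.

```java
/** Hypothetical helper; not part of dataone-indexer. */
final class SolrFieldSanitizer {

    /**
     * Returns a copy of value in which every code point that XML 1.0 forbids
     * is replaced by replacement (pass "" to drop the character).
     */
    static String sanitize(String value, String replacement) {
        if (value == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            // XML 1.0 "Char" production: tab, LF, CR, and the ranges below.
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String description = "(32\u00b010\u0004N, 64\u00b030\u0004W)";
        // Dropping the control character entirely.
        System.out.println(sanitize(description, ""));
        // Substituting a space so adjacent tokens stay separated.
        System.out.println(sanitize(description, " "));
    }
}
```

Dropping the character is the most conservative choice because the indexer cannot know that EOT was meant as an apostrophe in this particular document; a fixed substitution such as a space keeps neighboring tokens apart for search.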

@mbjones mbjones removed this from the 3.0.0 milestone Dec 7, 2022