https://pypi.org/project/langdetect/
Run Command: pip install langdetect
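Once langdetect is installed, a minimal sketch like the one below can tag each post with a language code. The helper name `safe_detect` and the `"unknown"` fallback are illustrative choices, not part of langdetect; the blank-text guard is there because langdetect raises an exception on input with no detectable features.

```python
def safe_detect(text, default="unknown"):
    """Return a language code for text, or `default` when detection is not possible."""
    if not text or not text.strip():
        # Blank input has no features to detect; skip the library call entirely.
        return default
    # Imported lazily so the blank-text guard works even before langdetect is installed.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    try:
        return detect(text)
    except LangDetectException:
        # Raised for inputs such as digits or punctuation only.
        return default


if __name__ == "__main__":
    print(safe_detect("This is a short English sentence."))
    print(safe_detect("   "))
```

Seeding `DetectorFactory` keeps results stable across runs, which matters when the detected language is stored as a dataset column.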
https://gowda.ai/posts/2021/04/mtdata-nlcodec-rtg-many-english/ The result will be a REST service running on port 6000
MTData A tool to download machine translation datasets https://github.com/thammegowda/mtdata/
pip install mtdata
pip install nlcodec
pip install rtg
http://github.com/chrismattmann/tika-python
- With RTG running, tika’s translate module will work automatically (it will pick up the RTG server)
- The Tika language module (its language detector) should also work fine
Run Command: pip install tika
- To check which version of tika is installed, run command:
pip show tika
- Refer to DSCI550 Slack thread: https://uscdatascience.slack.com/archives/C04JM790KHS/p1679846853060489
https://cwiki.apache.org/confluence/display/tika/GeoTopicParser
First you will need to download and install the Lucene Geo Gazetteer project. You can do so by:
- Change directory to the Repo directory:
cd $HOME/src
- Clone the project:
git clone https://github.com/chrismattmann/lucene-geo-gazetteer.git
- Change into the cloned directory:
cd lucene-geo-gazetteer
- Install Maven if it is not already installed (refer to "How to Install Apache Maven MVN"):
brew install maven
- Build the project:
mvn install assembly:assembly
- Change directory to src/main/bin/ inside of dir lucene-geo-gazetteer and add it to your PATH:
cd src/main/bin
export PATH=$PWD:$PATH
- The result of this should be the Lucene GeoGazetteer REST server running as specified here: https://github.com/chrismattmann/lucene-geo-gazetteer
Note: Run these commands inside the lucene-geo-gazetteer directory
curl -O http://download.geonames.org/export/dump/allCountries.zip
unzip allCountries.zip
java -cp target/lucene-geo-gazetteer-0.3-SNAPSHOT-jar-with-dependencies.jar edu.usc.ir.geo.gazetteer.GeoNameResolver -i geoIndex -b allCountries.txt
java -cp target/lucene-geo-gazetteer-0.3-SNAPSHOT-jar-with-dependencies.jar edu.usc.ir.geo.gazetteer.GeoNameResolver -i geoIndex -s Pasadena Texas
- Test with (e.g. Pasadena, Texas):
lucene-geo-gazetteer -s Pasadena Texas -json | python -mjson.tool
- Test Service Mode (with e.g. Pasadena, Texas) in base terminal:
- Launch Server: $
lucene-geo-gazetteer -server
- Query:
$ curl "localhost:8765/api/search?s=Pasadena&s=Texas&c=2"
curl "http://localhost:8765/api/search?s=Pasadena&s=Texas" | python -mjson.tool
- You can connect the GeoGazetteer to Tika-Python using the instructions here: https://github.com/chrismattmann/tika-python#changing-the-tika-classpath
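The service-mode query above can also be issued from Python with only the standard library. This is a sketch that assumes the server is on localhost:8765 as in the curl examples; the function names are hypothetical, and reading `c` as a result count is an assumption based on the curl example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GAZETTEER = "http://localhost:8765/api/search"  # host and port taken from the curl examples


def build_search_url(terms, count=None, base=GAZETTEER):
    """Build the search URL: one `s` parameter per term, plus an optional `c`."""
    params = [("s", term) for term in terms]
    if count is not None:
        params.append(("c", count))  # assumed to limit the number of results
    return f"{base}?{urlencode(params)}"


def search(terms, count=None):
    """Query a running gazetteer server and return the decoded JSON response."""
    with urlopen(build_search_url(terms, count)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(build_search_url(["Pasadena", "Texas"], count=2))
    # Requires `lucene-geo-gazetteer -server` to be running:
    # print(json.dumps(search(["Pasadena", "Texas"]), indent=2))
```

Building the URL with `urlencode` keeps multi-word place names correctly escaped, which is easy to get wrong when pasting them into a curl command by hand.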
Once the Lucene GeoGazetteer server is installed and working, download and set up the NER model, and then link it to Tika [SKIP_IF_location-ner-model_dir_CREATED]
- Create new directory in repo:
mkdir location-ner-model
cd location-ner-model
- Run curl command inside location-ner-model directory:
curl -O https://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
mkdir -p org/apache/tika/parser/geo/
mv en-ner-location.bin org/apache/tika/parser/geo/
ls org
pwd
ls -alR
- Should see something like this:
total 0
drwxr-xr-x@ 3 toddgavin staff 96 Mar 26 19:06 .
drwxr-xr-x@ 19 toddgavin staff 608 Mar 26 19:04 ..
drwxr-xr-x@ 3 toddgavin staff 96 Mar 26 19:06 or
Now we have to create the new application/geotopic MIME type, and map it to Tika. [SKIP_IF_geotopic-mime_dir_CREATED]
mkdir geotopic-mime
cd geotopic-mime
mkdir -p org/apache/tika/mime
curl -O https://raw.githubusercontent.com/chrismattmann/geotopicparser-utils/master/mime/org/apache/tika/mime/custom-mimetypes.xml
mv custom-mimetypes.xml org/apache/tika/mime
ls org/apache/tika/mime/
Now you need to grab an example of a file that you want to run the GeoTopicParser on... [SKIP_IF_polar.geot_CREATED]
curl -LO https://raw.githubusercontent.com/chrismattmann/geotopicparser-utils/master/geotopics/polar.geot
cat polar.geot
Final step is that you need a copy of Tika App, Tika Server, and also the Tika NLP-ML module (which has the GeoTopicParser in it). You can build all of these by building Tika, but there's an easier way. Just grab the JAR files. [SKIP_IF_tika-build_dir_CREATED]
mkdir tika-build
cd tika-build
- Tika Parser NLP Package:
curl -LO https://repo1.maven.org/maven2/org/apache/tika/tika-parser-nlp-package/2.7.0/tika-parser-nlp-package-2.7.0.jar
- Tika App:
curl -LO https://repo1.maven.org/maven2/org/apache/tika/tika-app/2.7.0/tika-app-2.7.0.jar
- Tika Server:
curl -LO https://repo1.maven.org/maven2/org/apache/tika/tika-server-standard/2.7.0/tika-server-standard-2.7.0.jar
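Before moving on, it can help to confirm that every file from the previous steps landed where the classpath will look for it. The sketch below assumes that location-ner-model, geotopic-mime, tika-build and polar.geot all sit side by side in one working directory; `missing_resources` is a hypothetical helper name.

```python
from pathlib import Path

TIKA_VERSION = "2.7.0"

# Relative paths the earlier steps are expected to have created
# (assumes all directories are siblings in the same working directory).
EXPECTED = [
    "location-ner-model/org/apache/tika/parser/geo/en-ner-location.bin",
    "geotopic-mime/org/apache/tika/mime/custom-mimetypes.xml",
    f"tika-build/tika-parser-nlp-package-{TIKA_VERSION}.jar",
    f"tika-build/tika-app-{TIKA_VERSION}.jar",
    f"tika-build/tika-server-standard-{TIKA_VERSION}.jar",
    "polar.geot",
]


def missing_resources(root="."):
    """Return the expected paths that do not exist as files under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_resources()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files are in place.")
```

A file downloaded into the wrong directory is a likely cause when the parser runs but returns no geographic metadata, so this check is worth running before debugging anything else.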
> Now, we can run the command to test out the GeoTopicParser, first from the TikaApp / Command line interface (CLI). Then we'll run a Tika REST server, and try it there too.
I created a simple script called geotopic-parser, pasted below, that wraps the Java command and classpaths and lets you run it on a single file. [SKIP_IF_geotopic-parser_CREATED]
- Create the file geotopic-parser:
#!/bin/bash
# Run the Tika CLI in metadata mode (-m) on the file given as the first argument.
TIKA_VERSION=2.7.0
f="$1"
java -classpath "tika-build/tika-app-${TIKA_VERSION}.jar:tika-build/tika-parser-nlp-package-${TIKA_VERSION}.jar:${PWD}/location-ner-model:${PWD}/geotopic-mime" \
org.apache.tika.cli.TikaCLI -m "$f"
- Give executable permission to the file geotopic-parser by running command:
chmod +x geotopic-parser
- Now, I will run it on the polar.geot file:
./geotopic-parser polar.geot
- Similarly, create a simple script called geotopic-server, pasted below, that starts the Tika REST server (don't forget to chmod +x geotopic-server before running it).
- Note that once you run it, it will take control of the terminal unless you put it in the background, so you'll need a new terminal to test it out.
- Create the file geotopic-server:
#!/bin/bash
# Start the Tika REST server with the NER model, MIME mapping and NLP package on the classpath.
TIKA_VERSION=2.7.0
java -classpath "${PWD}/location-ner-model:${PWD}/geotopic-mime:tika-build/tika-server-standard-${TIKA_VERSION}.jar:tika-build/tika-parser-nlp-package-${TIKA_VERSION}.jar" \
org.apache.tika.server.core.TikaServerCli
- Give executable permission to the file geotopic-server by running command:
chmod +x geotopic-server
- Run command:
./geotopic-server
- If it's working, test out the GeoTopic REST server by opening a new base terminal and running:
curl -T polar.geot -H "Content-Disposition: attachment; filename=polar.geot" http://localhost:9998/rmeta | python -mjson.tool
curl -T polar.geot -H "Content-Type: application/geotopic; filename=polar.geot" http://localhost:9998/rmeta | python -mjson.tool
OUTPUT:
[
{
"Geographic_LONGITUDE": "105.0",
"Geographic_NAME": "People\u2019s Republic of China",
"X-TIKA:Parsed-By-Full-Set": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.geo.GeoParser"
],
"resourceName": "polar.geot",
"Optional_NAME1": "United States",
"Optional_LATITUDE1": "39.76",
"Optional_LONGITUDE1": "-98.5",
"X-TIKA:Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.geo.GeoParser"
],
"X-TIKA:parse_time_millis": "1300",
"X-TIKA:embedded_depth": "0",
"Geographic_LATITUDE": "35.0",
"Content-Length": "881",
"Content-Type": "application/geotopic"
}
]
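The same upload can be done from Python without curl. This is a sketch under the assumption that the server listens on localhost:9998 as above and that a plain `application/geotopic` content type is enough to route the file to the GeoTopicParser; the function names are hypothetical.

```python
import json
from urllib.request import Request, urlopen

TIKA_RMETA = "http://localhost:9998/rmeta"  # endpoint taken from the curl examples


def build_rmeta_request(data, url=TIKA_RMETA):
    """Build a PUT request that uploads `data` (bytes) as application/geotopic."""
    return Request(
        url,
        data=data,
        method="PUT",  # curl -T uploads with PUT
        headers={"Content-Type": "application/geotopic"},
    )


def rmeta(path):
    """Send a file to a running geotopic-server and return the decoded JSON list."""
    with open(path, "rb") as fh:
        request = build_rmeta_request(fh.read())
    with urlopen(request) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires ./geotopic-server to be running in another terminal.
    for entry in rmeta("polar.geot"):
        print(entry.get("Geographic_NAME"), entry.get("Content-Type"))
```

Keeping the request construction separate from the network call makes it possible to inspect the method and headers without a server running.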
- Kill all java processes, including the tika server if it is already running (refer to the Errors section of the README):
killall java
- List current processes:
jps
- Kill all tika processes with:
kill <process number>
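Since killall java also stops the gazetteer and any unrelated Java programs, a narrower option is to pick the Tika process IDs out of the jps listing. The sketch below assumes jps prints one "<pid> <name>" pair per line and that the Tika server's name contains "Tika"; the sample process names in the comments are illustrative, not captured output.

```python
import subprocess


def tika_pids(jps_output, marker="Tika"):
    """Return the PIDs from jps output whose process name contains `marker`."""
    pids = []
    for line in jps_output.splitlines():
        # Each line is expected to look like "12345 SomeMainClass".
        pid, _, name = line.strip().partition(" ")
        if pid.isdigit() and marker in name:
            pids.append(int(pid))
    return pids


if __name__ == "__main__":
    listing = subprocess.run(
        ["jps"], capture_output=True, text=True, check=True
    ).stdout
    for pid in tika_pids(listing):
        # Print the command instead of running it, so nothing is killed by accident.
        print(f"kill {pid}")
```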
- Navigate to directory /3_Tika_GeoTopic_Parser and run command to start lucene server:
lucene-geo-gazetteer -server
- Should see this if working:
INFO: Starting ProtocolHandler ["http-nio-8765"]
- If it's not working, re-run the command to set up the path inside of /lucene-geo-gazetteer/src/main/bin:
export PATH=$PWD:$PATH
- In new terminal window, navigate to directory /3_Tika_GeoTopic_Parser and run command to start geotopic server:
./geotopic-server
- Should see this if working:
INFO [main] 16:25:04,222 org.apache.tika.server.core.TikaServerProcess Started Apache Tika server ff835cb6-9aa1-4817-ba8d-d035eb174c87 at http://localhost:9998/
- In new terminal window again, navigate to directory /3_Tika_GeoTopic_Parser and run command to test servers:
curl -T polar.geot -H "Content-Type: application/geotopic; filename=polar.geot" http://localhost:9998/rmeta | python -mjson.tool
- You should get the following if everything is running correctly:
United States, 39.76, -98.5
If you want to call your new GeoTopic server from Python using Tika-Python, it's simple: drop into Python and run Tika on a *.geot file.
from tika import parser
parsed = parser.from_file('polar.geot', headers={ 'Content-Type' : 'application/geotopic'})
print(parsed)
OUTPUT:
{'metadata': {'Geographic_LONGITUDE': '-98.5', 'Geographic_NAME': 'United States', 'X-TIKA:Parsed-By-Full-Set': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.geo.GeoParser'], 'resourceName': "b'polar.geot'", 'Optional_NAME1': 'People's Republic of China', 'Optional_LATITUDE1': '35.0', 'Optional_LONGITUDE1': '105.0', 'X-TIKA:Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.geo.GeoParser'], 'X-TIKA:parse_time_millis': '61', 'X-TIKA:embedded_depth': '0', 'Geographic_LATITUDE': '39.76', 'Content-Length': '881', 'Content-Type-Override': 'application/geotopic', 'Content-Type': 'application/geotopic'}, 'content': None, 'status': 200}
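To turn that metadata dictionary into something easier to store in a dataset, the place fields can be collected into tuples. This sketch assumes the extra locations are numbered consecutively as Optional_NAME1, Optional_NAME2, and so on (only the first appears in the output above); `extract_places` is a hypothetical helper name.

```python
def extract_places(metadata):
    """Collect (name, latitude, longitude) tuples from GeoTopicParser metadata."""
    places = []
    if "Geographic_NAME" in metadata:
        places.append((
            metadata["Geographic_NAME"],
            float(metadata["Geographic_LATITUDE"]),
            float(metadata["Geographic_LONGITUDE"]),
        ))
    n = 1
    # Assumption: optional places are numbered 1, 2, 3, ... without gaps.
    while f"Optional_NAME{n}" in metadata:
        places.append((
            metadata[f"Optional_NAME{n}"],
            float(metadata[f"Optional_LATITUDE{n}"]),
            float(metadata[f"Optional_LONGITUDE{n}"]),
        ))
        n += 1
    return places


if __name__ == "__main__":
    # Subset of the metadata shown in the output above.
    sample = {
        "Geographic_NAME": "United States",
        "Geographic_LATITUDE": "39.76",
        "Geographic_LONGITUDE": "-98.5",
        "Optional_NAME1": "People's Republic of China",
        "Optional_LATITUDE1": "35.0",
        "Optional_LONGITUDE1": "105.0",
    }
    for place in extract_places(sample):
        print(place)
```

With Tika-Python, the input would be `parsed['metadata']` from the call shown above.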
https://pypi.org/project/detoxify/
Run Command: pip install detoxify
- Note: if you are on a Mac using Python through pyenv and run into issues installing Detoxify and torch with pip, see pytorch/pytorch#53601 (comment) for an easy workaround
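Once Detoxify installs cleanly, scoring a post could look roughly like the sketch below. It assumes the `original` checkpoint and that `predict` on a single string returns a dictionary of label scores; the 0.5 threshold and the helper names are arbitrary illustrative choices.

```python
def flag_labels(scores, threshold=0.5):
    """Return the labels whose score is at or above `threshold`, sorted by name."""
    return sorted(label for label, score in scores.items() if score >= threshold)


def load_model(model_name="original"):
    """Load a Detoxify model once so it can be reused for every post."""
    # Imported lazily: detoxify pulls in torch and downloads a checkpoint on first use.
    from detoxify import Detoxify
    return Detoxify(model_name)


if __name__ == "__main__":
    model = load_model()
    scores = model.predict("You are a wonderful person.")
    print(scores)
    print(flag_labels(scores))
```

Loading the model once and reusing it avoids repeating the slow checkpoint load for each row of the dataset.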
- To access the images, use the URL from the post and give it the URL prefix “/optimized”, such as: https://image.pixstory.com/optimized/Pixstory-image-164416629024955.jpeg
- Download all 95k images associated with the posts
- Write a simple python script to do this
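A download script along these lines might work. It is a sketch that assumes each post stores a full image URL without the prefix (for example https://image.pixstory.com/Pixstory-image-164416629024955.jpeg); the function names and the `images` output directory are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlretrieve


def optimized_url(url, prefix="/optimized"):
    """Insert the /optimized prefix at the start of the URL path, if not already there."""
    parts = urlsplit(url)
    if parts.path.startswith(prefix + "/"):
        return url
    return urlunsplit(parts._replace(path=prefix + parts.path))


def download(url, out_dir="images"):
    """Download one image into `out_dir`, skipping files that already exist."""
    target = Path(out_dir) / Path(urlsplit(url).path).name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(optimized_url(url), str(target))
    return target


if __name__ == "__main__":
    example = "https://image.pixstory.com/Pixstory-image-164416629024955.jpeg"
    print(optimized_url(example))
    # download(example)  # uncomment to fetch the file
```

Skipping files that already exist lets the script be restarted after a failure without re-downloading all 95k images.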
- Install Tika Dockers package for Image Captioning and Object Recognition
- Run Commands:
git clone https://github.com/USCDataScience/tika-dockers.git
- Open "Docker Desktop" on Mac and ensure it is running before executing this command.
docker pull uscdatascience/im2txt-rest-tika
- Read and test out: https://cwiki.apache.org/confluence/display/TIKA/TikaAndVisionDL4J
- Read and test out: apache/tika#189
- Iterate through all the Pixstory posts and add the generated image caption and the detected object(s) as columns in your dataset
- Kill all java processes:
killall java
- Kill tika server:
- To view running processes:
jps
- Then kill the process by its number:
kill <process>