This repository produces nball embeddings for the Hindi language, taking into account both the word embeddings and the hypernym relations among words.

fnc11/nball4treehindi

 
 


Install the package

  • On Ubuntu, first install python3-tk:
$ sudo apt-get install python3-tk
  • On Ubuntu or macOS, run:
$ git clone https://github.com/gnodisnait/nball4tree.git
$ cd nball4tree
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Experiment 1: Training and evaluating nball embeddings

Experiment 1.1: Training nball embeddings

  • For Hindi data generation, follow the instructions in the hindinballs directory.
  • Please also read this Informative Report on how the Hindi data is structured and how to process it for this experiment.
  • The files used for Hindi data generation are taken from this GitHub repo, which mainly drew its data from IIT Bombay.
  • Download the w2v file from this website and remove its first line, which contains only the number of words and the number of dimensions.
% you need to create an empty file nball.txt for output

$ python nball.py --train_nball /Users/<user-name>/data/nball.txt --w2v /Users/<user-name>/data/cc.hi.300.vec  --ws_child /Users/<user-name>/data/wordSenseChildren.txt  --ws_catcode /Users/<user-name>/data/glove/catCodes.txt  --log log.txt
% --train_nball: output file of nball embeddings
% --w2v: file of pre-trained word embeddings
% --ws_child: file of parent-children relations among word-senses
% --ws_catcode: file of the parent location code of a word-sense in the tree structure
% --log: log file, shall be located in the same directory as the file of nball embeddings

The training process can take around 3 days.
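The two preparation steps above (dropping the header line of the w2v file and creating an empty output file) can be scripted. The following is a minimal sketch, not part of this repository: the function names `strip_vec_header` and `create_empty_file` are hypothetical, and the header check is a heuristic that assumes the first line consists of exactly two integers (word count and dimension).

```python
def strip_vec_header(src_path, dst_path):
    """Copy src_path to dst_path, dropping a leading "<count> <dim>" line if present.

    Returns True if a header line was found and dropped.
    Assumption: a header is a first line made of exactly two integers.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        first = src.readline()
        parts = first.split()
        has_header = len(parts) == 2 and all(p.isdigit() for p in parts)
        if not has_header:
            # The first line is already a vector line, so keep it.
            dst.write(first)
        for line in src:
            dst.write(line)
    return has_header


def create_empty_file(path):
    """Create (or truncate) an empty file, e.g. the nball.txt output file."""
    open(path, "w", encoding="utf-8").close()
```

For example, `strip_vec_header("cc.hi.300.vec", "cc.hi.300.noheader.vec")` writes a cleaned copy next to the download, which can then be passed to `--w2v`; the second file name here is only an illustration.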

Experiment 1.2: Checking whether tree structures are perfectly embedded into word-embeddings

  • The main input is the output directory of the nballs created in Experiment 1.1.
  • Shell command for running the zero-energy check:
$ python nball.py --zero_energy <output-path> --ball <output-file> --ws_child /Users/<user-name>/data/wordSenseChildren.txt
% --zero_energy <output-path> : output path of the nballs of Experiment 1.1, e.g. /Users/<user-name>/data/data_out
% --ball <output-file> : the name of the output nball-embedding file
% --ws_child /Users/<user-name>/data/wordSenseChildren.txt: file of parent-children relations among word-senses

The checking process can take a long time, around 3-4 hours.

  • result

If zero energy is achieved, a large nball-embedding file is created at <output-path>/<output-file>; otherwise, the failed relations and word-senses are printed.
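To illustrate what such a check verifies, here is a minimal sketch under stated assumptions. It assumes, following the cited paper, that a tree is perfectly embedded when every child ball lies inside its parent ball and sibling balls do not overlap. The data layout (a ball as a `(center, radius)` pair, a dict from parent to children) and all function names are hypothetical and do not reflect how nball.py stores or checks its data.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contains(parent, child, eps=1e-9):
    """True if the child ball lies entirely inside the parent ball."""
    (p_center, p_radius), (c_center, c_radius) = parent, child
    return distance(p_center, c_center) + c_radius <= p_radius + eps


def disjoint(a, b, eps=1e-9):
    """True if the two balls do not overlap."""
    (a_center, a_radius), (b_center, b_radius) = a, b
    return distance(a_center, b_center) + eps >= a_radius + b_radius


def failed_relations(balls, children):
    """Return the violated relations as (kind, name_a, name_b) tuples.

    balls:    dict mapping a word-sense to (center, radius)   [assumed layout]
    children: dict mapping a parent word-sense to its children [assumed layout]
    An empty result corresponds to a perfectly embedded tree.
    """
    failures = []
    for parent, kids in children.items():
        for kid in kids:
            if not contains(balls[parent], balls[kid]):
                failures.append(("contain", parent, kid))
        for i, a in enumerate(kids):
            for b in kids[i + 1:]:
                if not disjoint(balls[a], balls[b]):
                    failures.append(("disjoint", a, b))
    return failures
```

The pairwise sibling comparison is quadratic in the number of children per parent, which is one plausible reason a full check over a large tree takes hours.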

** Test result on the Ubuntu platform: [figure not shown]

Experiment 2: Observe neighbors of word-sense using nball embeddings

$ python nball.py --neighbors दिल्ली.n.01 फिलीपीन्स.n.01 मंगलवार.n.01 --ball /Users/<user-name>/data/nball.txt  --num 6
% --neighbors: list of word-senses
% --ball: file location of the nball embeddings
% --num: number of neighbors
  • The nearest-neighbor results look like this:

{ 'दिल्ली.n.01':
[ 'पटना.n.01',
'देहली.n.01',
'कोलकाता.n.01',
'बंगलूर.n.01',
'त्रिवेंद्रम.n.01',
'बंगलुरु.n.01'],
'फिलीपीन्स.n.01':
[ 'फिलीपींस.n.01',
'फिलिपीन्स.n.01',
'फिलिपींस.n.01',
'बोसनिया.n.01',
'बोट्सवाना.n.01',
'मलयेशिया.n.01'],
'मंगलवार.n.01':
[ 'बुधवार.n.01',
'सोमवार.n.01',
'शुक्रवार.n.01',
'शनिवार.n.01',
'गुरुवार.n.01',
'रविवार.n.01']}

English Translation:

{ 'Delhi.n.01':
[ 'Patna.n.01',
'Delhi.n.01', <----- Different written form of Delhi in Hindi
'Kolkata.n.01',
'Bangalur.n.01',
'Trivandrum.n.01',
'Bangaluru.n.01'],
'Philippines.n.01':
[ 'Philippines.n.01', <----- Different written form of Philippines in Hindi
'Philippines.n.01', <----- Different written form of Philippines in Hindi
'Philippines.n.01', <----- Different written form of Philippines in Hindi
'Bosnia.n.01',
'Botswana.n.01',
'Malaysia.n.01'],
'Tuesday.n.01':
[ 'Wednesday.n.01',
'Monday.n.01',
'Friday.n.01',
'Saturday.n.01',
'Thursday.n.01',
'Sunday.n.01']}
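
As a rough illustration of how a neighbor query of this kind can work, the sketch below ranks word-senses by the Euclidean distance between their vectors. This is an assumption made for illustration only: nball.py may rank neighbors differently (for example by also using ball radii), and the function name `nearest_neighbors` and the dict layout are hypothetical. It requires Python 3.8+ for `math.dist`.

```python
import heapq
import math


def nearest_neighbors(vectors, query, num):
    """Return the `num` word-senses closest to `query`, nearest first.

    vectors: dict mapping a word-sense (e.g. 'दिल्ली.n.01') to a sequence of floats.
    Closeness is plain Euclidean distance, an assumption of this sketch.
    """
    q = vectors[query]
    candidates = (
        (math.dist(q, vec), name)
        for name, vec in vectors.items()
        if name != query  # a word-sense is not its own neighbor
    )
    return [name for _, name in heapq.nsmallest(num, candidates)]
```

Calling `nearest_neighbors(vectors, 'मंगलवार.n.01', 6)` on a dict loaded from an embedding file would return a list shaped like the ones above.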

Cite

If you use the code, please cite the following paper:

Tiansi Dong, Christian Bauckhage, Hailong Jin, Juanzi Li, Olaf Cremers, Daniel Speicher, Armin B. Cremers, Joerg Zimmermann (2019). Imposing Category Trees Onto Word-Embeddings Using A Geometric Construction. ICLR-19 The Seventh International Conference on Learning Representations, May 6 – 9, New Orleans, Louisiana, USA.
