Add initial load functionality for xri_etl script #474

Merged
rminsil merged 1 commit into master from issue-473-introduce-xri-etl-script
Aug 15, 2024

Conversation

rminsil
Collaborator

@rminsil rminsil commented Aug 3, 2024

Implements only the loading of a TSV file into internal models.

Addresses #473

Damien asked for the work to be split into smaller PRs, so the logic for creating the extract files is not implemented yet.


If you want to play with it, here is some data you can test with:

id	source	target	split
0	Wakati wa kupanda, watu waligundua ingine dhaifu mlimani, ikiwa imejeruhiwa na kuachwa.	Enkagha yekebhusuro ,abhanto bhamanyire, obhurosu bhamorukereghe ,urabhone bhaikarwiri nokoghetigha.	train
1	Nyaraka za kale za ufafanuzi zilitumiwa mapema alfajiri.	!	train
2	Lengo lake lilikuwa wazi, lakini safari yake ilikatishwa ghafla wakati mgeni alipompa sadaka iliyowakilisha kila kitu alichokuwa akitafuta.	Obhusemi bhooche bhware harabhu, ghasi bhono orughendo rooche ndwakubhire hari remwe enkagha yumugheni amuhaire obhuhani bhunobhwerekiri eghento ghiinsi kino areghutuna .	train
3	Wote walikubaliana kuingia kwenye banda kuona kama lilikuwa mbaya kama kila mtu alivyosema.	Bhansi bhaitabheraine oghusoha kwing'ando okuhenehi obhobhe bho hase hayo kebhore bhahaghambhire nahasinenu.	train
4	Unapopimia matokeo, ongeza marekebisho, na kutambua kila kosa, unaelewa kweli nini maana ya taarifa.	Hano okurenghi amarore ,engheri ubhuchochi ,nokumanya amasari ghaansi nke okumanya kebhore yorubhoti	train
5	Aliweza kufundisha wanafunzi kwa kina na walifunguliwa akili zao, wakastaajabu sana.	Nanaghiri okweghi abheegha kobhutambe na mbhabhwine amang'aini ,nghabharoghori bhukong'u	train
6	Bei za vitu vya wazawa wa Edomu huhifadhiwa vizuri, na hivyo kulifanya tukio hilo kuwa tukio lisilopaswa kutohudhuriwa na wapenzi wa historia.	Obhughori bho ebhinto  momoghi ghuno ghwarekubherekeru Etomu, bhihabhechibhu bhwahene ,kwikore riyo yagheri abhanto abhaaru bhanghe ghuucha abhaseighi bhamang'ana ghakare.	train
7	Mwalimu alikasirika sana na akaeleza kwa kina kutuliza hali hiyo.	Omweghi nakaraire bhukonghu akaghambha ghokwangharara oghokiri ahase hayo.	train
8	Miongoni mwa maonyesho, tochi ya zamani na mto wa kichwa vilikuwa vimewashwa na mwanga wa radi, vikionyesha jinsi vitu hivi vya zamani vilivyokuwa sehemu ya maisha ya kila siku.	Mumaerekeri, amakore ghakare ghabhaire ghetara ino yarekorusi emireghari ghino ghetubhire enkobha ne  etara bhyerekiri kebhore ebhinto bhekare mnabhyare hamwe kwimenya resingherere.	train
9	Jamii ilikuwa tayari imepogolewa na kifo cha watu watatu ambao hawakuweza kulipa madeni yao,baada ya kukopeshana na hatimae kuuana.	Ehamate yotorru bhwahene, nerusiribhu koroku robhanto bhatato bhono bhatanaghiri okureha amasire ghaabho, hano bhamare okuhanerana amaasire nokumaruho okoyiita.	train

Save that to a TSV file, then cd into the xri directory and run:

python xri_etl.py [PATH TO YOUR FILE] swa kcz XRI-2024-08-12

You'll need to add some print statements to see the data.
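
For anyone who wants to inspect the sample outside the script, here is a minimal standalone sketch of reading a TSV with the header shown above (id, source, target, split) into simple records and printing them. It is not the code in xri_etl.py: the `SentencePair` class and `load_pairs` function are hypothetical names chosen for illustration, and turning quoting off is an assumption so that any quote characters in the sentences are kept literally.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass(frozen=True)
class SentencePair:
    # Hypothetical model; the fields mirror the TSV header in the sample data.
    id: int
    source: str
    target: str
    split: str


def load_pairs(handle: TextIO) -> List[SentencePair]:
    """Read tab-separated rows with an id/source/target/split header."""
    # QUOTE_NONE keeps quote characters in the text as-is (an assumption).
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        SentencePair(int(row["id"]), row["source"], row["target"], row["split"])
        for row in reader
    ]


if __name__ == "__main__":
    # One row taken from the sample data above, inlined for a quick check.
    sample = (
        "id\tsource\ttarget\tsplit\n"
        "1\tNyaraka za kale za ufafanuzi zilitumiwa mapema alfajiri.\t!\ttrain\n"
    )
    for pair in load_pairs(io.StringIO(sample)):
        print(pair)
```

To run it against a saved file instead, open the file with `encoding="utf-8"` and `newline=""` and pass the handle to `load_pairs`.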


After this I'll implement generating the output files, and then add more docs if you think that's necessary.



@rminsil rminsil linked an issue Aug 3, 2024 that may be closed by this pull request
Collaborator

@ddaspit ddaspit left a comment


This looks good. The code is clean and has good use of type hints.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mmartin9684-sil)

@rminsil rminsil merged commit 6faff4f into master Aug 15, 2024
1 check passed
@rminsil rminsil deleted the issue-473-introduce-xri-etl-script branch September 5, 2024 04:23
Successfully merging this pull request may close these issues.

Create initial xri_etl script
2 participants