feat: added download from clinvar #4

PauliusPreiksaCode · 2024-02-19T07:49:03Z

No description provided.

data_collection/tools.py

Strexas

Needs changes

Strexas · 2024-02-19T16:09:35Z

data_collection/tools.py

+
+def get_file_from_clinvar(override=False):
+
+    file_name = 'clinvar_result.txt'


Make this an argument with a default value.

This file has a static name when it is downloaded. So if we want a different name, we would download the file as "clinvar_result" and then make an os call to rename it to the desired name. Renaming can only be done once the file is fully downloaded.

Provide functionality to rename it as the user wants.
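A minimal sketch of the rename step discussed above. It assumes the browser always saves the file as `clinvar_result.txt` (the name in the diff); the constant `CLINVAR_DEFAULT_NAME` and the helper `rename_downloaded_file` are hypothetical and not part of the PR.

```python
import os

# Fixed name the download arrives under (taken from the diff above).
CLINVAR_DEFAULT_NAME = "clinvar_result.txt"


def rename_downloaded_file(download_dir, file_name=CLINVAR_DEFAULT_NAME):
    """Rename the finished download to file_name and return the final path.

    Hypothetical helper: call it only after the download has completed.
    """
    source = os.path.join(download_dir, CLINVAR_DEFAULT_NAME)
    target = os.path.join(download_dir, file_name)
    if source != target:
        # os.replace overwrites an existing target file.
        os.replace(source, target)
    return target
```

With the default value the file keeps its original name, so existing callers would not need to change.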

Strexas · 2024-02-19T16:23:37Z

data_collection/tools.py

+    firefox_options.set_preference("browser.download.dir", download_dir)
+    firefox_options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/octet-stream")
+
+    file_exist = file_exists_and_not_empty(download_dir, file_name)


Does it matter if it is not empty?

Well, if there's no data in the file, either the file is not fully downloaded yet or it is corrupted. I reckon it should always contain data.

The user accidentally deleted all data from the file and saved it. They want to download it again.
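The case above could be covered by treating an empty file the same as a missing one. This is only a sketch: the body of `file_exists_and_not_empty` is not shown in the diff, so the version here is an assumption, and `should_download` is a hypothetical helper.

```python
import os


def file_exists_and_not_empty(download_dir, file_name):
    """Assumed behaviour of the PR's helper: True for a regular, non-empty file."""
    path = os.path.join(download_dir, file_name)
    return os.path.isfile(path) and os.path.getsize(path) > 0


def should_download(download_dir, file_name, override=False):
    """Download when forced, or when the file is missing or empty."""
    return override or not file_exists_and_not_empty(download_dir, file_name)
```

An emptied file then triggers a fresh download without the user having to pass `override=True`.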

Strexas · 2024-02-19T16:26:59Z

data_collection/tools.py

+    driver = webdriver.Firefox(options=firefox_options)
+    driver.get(url)
+    driver.execute_script("document.getElementsByName(\"EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_DisplayBar.SendToSubmit\")[0].click()")
+    WebDriverWait(driver, 30).until(lambda driver: file_exists_and_not_empty(download_dir, file_name))


What if it takes longer than 30 seconds to download?

We may need to increase the value, but from testing it seemed fine to me. From what I have read, Selenium's WebDriverWait is synchronous, so the whole script would freeze and wait for the download to finish. That's why this timeout is included. I guess we could run it in a separate thread or process, but that could also just hang.

OK, I tested it a bit more. It will download anyway, whether the program is running or not (the script can crash, but the file will still continue downloading). So it's a matter of when to break out of waiting (the file has been downloading for 30 seconds, we break out, and it continues to download until it's done).

You are not using driver.quit(); that's why it doesn't stop. It also means that geckodriver keeps running and using the computer's resources.

Add driver.quit() as Nojus suggested.
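One way to guarantee that `driver.quit()` always runs is a `try`/`finally` wrapper. The sketch below is an illustration only: `run_with_driver` is a hypothetical helper, and the driver is passed in as a factory so the pattern can be shown without starting a browser.

```python
def run_with_driver(driver_factory, action):
    """Create a driver, run action(driver), and always quit the driver.

    Hypothetical helper. driver_factory is any zero-argument callable that
    returns an object with a quit() method, such as a Selenium WebDriver.
    """
    driver = driver_factory()
    try:
        return action(driver)
    finally:
        # Runs on success, on timeout, and on any other exception,
        # so the browser and geckodriver do not stay alive.
        driver.quit()
```

In the PR this might be called as `run_with_driver(lambda: webdriver.Firefox(options=firefox_options), download_action)`, where `download_action` is a placeholder for the click-and-wait steps.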

Strexas · 2024-02-19T16:33:47Z

data_collection/tools.py

+        os.remove(os.path.join(download_dir, file_name))
+
+    driver = webdriver.Firefox(options=firefox_options)
+    driver.get(url)


What will it throw if there's no internet connection?

It will raise selenium.common.exceptions.WebDriverException.
I don't know why we are testing this, since this problem can occur with every data retrieval in the pipeline.

We are handling the same error for get_file_from_url; the only difference is that for that one we were using requests. Better to catch this error.
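Catching the error could look roughly like the sketch below. `open_page` is a hypothetical helper, and the `ImportError` fallback exists only so the example can be run without Selenium installed; real code would import `WebDriverException` directly.

```python
import logging

try:
    from selenium.common.exceptions import WebDriverException
except ImportError:
    # Assumption for this sketch only: a stand-in class so the example
    # still runs when Selenium is not installed.
    class WebDriverException(Exception):
        pass

logger = logging.getLogger(__name__)


def open_page(driver, url):
    """Load url with the given driver; return False if the page cannot be opened."""
    try:
        driver.get(url)
    except WebDriverException as error:
        # Raised, for example, when there is no internet connection.
        logger.error("Could not open %s: %s", url, error)
        return False
    return True
```

Returning a boolean leaves it to the caller to decide whether to retry or abort, in the same spirit as the handling described for get_file_from_url.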

Strexas · 2024-02-19T16:35:24Z

data_collection/tools.py

+    driver.get(url)
+    driver.execute_script("document.getElementsByName(\"EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_DisplayBar.SendToSubmit\")[0].click()")
+    WebDriverWait(driver, 30).until(lambda driver: file_exists_and_not_empty(download_dir, file_name))
+    print("File downloaded")


Better to check whether the file was downloaded and return True/False, so the user can decide whether to print this info or not.

From testing, it will always download. Maybe we can just check whether it's downloaded at the time the function ends.

The server may crash, the internet connection may disappear, unpredictable things may happen. Therefore:

Make sure it doesn't leave the function until we are sure that the download either succeeded or failed.

Use the logging module instead of print.

Specify which file was downloaded.
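The three requests above could be combined roughly as follows. This sketch swaps `WebDriverWait` for a plain polling loop so that it needs only the standard library. It also assumes that Firefox keeps a `<name>.part` file next to the target while the download is in progress. `wait_for_download` and its parameters are hypothetical.

```python
import logging
import os
import time

logger = logging.getLogger(__name__)


def wait_for_download(download_dir, file_name, timeout=300.0, poll_interval=0.5):
    """Poll until the file is complete or the timeout expires.

    Returns True if the download finished and False if it did not.
    """
    path = os.path.join(download_dir, file_name)
    partial = path + ".part"  # assumption: in-progress marker written by Firefox
    deadline = time.monotonic() + timeout
    while True:
        finished = (
            os.path.isfile(path)
            and os.path.getsize(path) > 0
            and not os.path.exists(partial)
        )
        if finished:
            # Log the full path so it is clear which file was downloaded.
            logger.info("Downloaded %s", path)
            return True
        if time.monotonic() >= deadline:
            logger.error("Download of %s did not finish within %s seconds", path, timeout)
            return False
        time.sleep(poll_interval)
```

The function returns only once the outcome is known, and the caller controls whether messages are shown by configuring the logging module.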

Strexas · 2024-02-26T20:54:17Z

data_collection/tools.py

+    driver.execute_script("document.getElementsByName(\"EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_DisplayBar.SendToSubmit\")[0].click()")
+    WebDriverWait(driver, 30).until(lambda driver: file_exists_and_not_empty(download_dir, file_name))
+
+    return file_exists_and_not_empty(download_dir, file_name)


Add a newline at the end of the file.

Strexas · 2024-02-26T20:55:41Z

Rename the PR according to the standards.

Strexas · 2024-04-03T14:20:52Z

Closed, as the issues were fixed in a different PR.

feat: added download from clinvar

31754d5

PauliusPreiksaCode self-assigned this Feb 19, 2024

Strexas reviewed Feb 19, 2024


Strexas requested changes Feb 19, 2024


PauliusPreiksaCode added 2 commits February 20, 2024 18:03

fix: extracted constant

3031bf7

fix: added return value

8f9a69a

PauliusPreiksaCode requested a review from Strexas February 26, 2024 18:38

Strexas requested review from N3UR0515 and removed request for N3UR0515 February 26, 2024 19:27

Strexas requested changes Feb 26, 2024


Strexas unassigned PauliusPreiksaCode Mar 4, 2024

merge

9018cd2

Strexas closed this Apr 3, 2024

Strexas deleted the clinvar_web_scraper branch April 3, 2024 14:20


PauliusPreiksaCode commented Feb 19, 2024

Strexas left a comment

Strexas Feb 19, 2024

PauliusPreiksaCode Feb 20, 2024

Strexas Feb 26, 2024

Strexas Feb 19, 2024

PauliusPreiksaCode Feb 20, 2024

Strexas Feb 26, 2024

Strexas Feb 19, 2024

PauliusPreiksaCode Feb 20, 2024

PauliusPreiksaCode Feb 20, 2024

N3UR0515 Feb 22, 2024

Strexas Feb 26, 2024

Strexas Feb 19, 2024

PauliusPreiksaCode Feb 20, 2024

Strexas Feb 26, 2024

Strexas Feb 19, 2024

PauliusPreiksaCode Feb 20, 2024

Strexas Feb 26, 2024

Strexas Feb 26, 2024

Strexas commented Feb 26, 2024

Strexas commented Apr 3, 2024

