Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added download from clinvar #4

Closed
wants to merge 4 commits into from

Conversation

PauliusPreiksaCode
Copy link
Collaborator

No description provided.

@PauliusPreiksaCode PauliusPreiksaCode self-assigned this Feb 19, 2024
data_collection/tools.py Outdated Show resolved Hide resolved
Copy link
Owner

@Strexas Strexas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs changes

data_collection/tools.py Outdated Show resolved Hide resolved
data_collection/tools.py Outdated Show resolved Hide resolved

def get_file_from_clinvar(override=False):

file_name = 'clinvar_result.txt'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make an argument with default value

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file has a static name when downloading. So if we want to have it with another name, we would download the file "clinvar_result" and then make os call to rename it with desired name, renaming only can be done once the file is fully downloaded.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provide functionality to rename it as user wants

firefox_options.set_preference("browser.download.dir", download_dir)
firefox_options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/octet-stream")

file_exist = file_exists_and_not_empty(download_dir, file_name)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter, if it is not empty?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well if there's no data in the file its either file is not fully downloaded yet or the file is corrupted. I reckon it should always contain data

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User accidently deleted all data from file and saved it. He wants to download it again

driver = webdriver.Firefox(options=firefox_options)
driver.get(url)
driver.execute_script("document.getElementsByName(\"EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_DisplayBar.SendToSubmit\")[0].click()")
WebDriverWait(driver, 30).until(lambda driver: file_exists_and_not_empty(download_dir, file_name))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if it will take longer than 30 seconds to download?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to increase the value, but from testing it seemed fine to me. From what I have read, selenium WebDriverWait is synchronous based so the whole script would freeze and wait for it to stop downloading. That's why this timeout is included. I guess we can create a new process on another thread to execute it, but that could also just hang

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, tested it a bit more. It will download anyway ether the program is running or not (script can crash, but the file will still to continue downloading). So it's the matter of time when to break from waiting (file is downloading for 30 sec, we break out, it continues to download further until it's done)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not using driver.quit() that's why it doesn't stop. It also means that the geckodriver keeps running and using computer's resources.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add driver.quit() as Nojus suggested.

data_collection/tools.py Outdated Show resolved Hide resolved
data_collection/tools.py Outdated Show resolved Hide resolved
os.remove(os.path.join(download_dir, file_name))

driver = webdriver.Firefox(options=firefox_options)
driver.get(url)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will it throw if there's no internet connection?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will get selenium.common.exceptions.WebDriverExceptio
I don't know why are we testing this since this problem can occur with every data retrieval in pipeline

image

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are handling the same error for get_file_from_url except the only difference that for that one we were using requests. Better to catch this error.

driver.get(url)
driver.execute_script("document.getElementsByName(\"EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_DisplayBar.SendToSubmit\")[0].click()")
WebDriverWait(driver, 30).until(lambda driver: file_exists_and_not_empty(download_dir, file_name))
print("File downloaded")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to check, if file is downloaded and return True/False, so user could decide if he needs to print this info or no

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of testing, it will always download. Maybe just can check if it's downloaded now (as time of function ending).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Server may crash, internet may dissapear, unpredicted things may happen. Therefore

  1. Make sure it doesn't leave function until we are sure that download was either successful or failed
  2. Use logging module instead of print
  3. Specify which file was downloaded

@Strexas Strexas requested review from N3UR0515 and removed request for N3UR0515 February 26, 2024 19:27
driver.execute_script("document.getElementsByName(\"EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_DisplayBar.SendToSubmit\")[0].click()")
WebDriverWait(driver, 30).until(lambda driver: file_exists_and_not_empty(download_dir, file_name))

return file_exists_and_not_empty(download_dir, file_name)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add new line at the end of file

@Strexas
Copy link
Owner

Strexas commented Feb 26, 2024

Rename PR according to standarts

@Strexas Strexas closed this Apr 3, 2024
@Strexas
Copy link
Owner

Strexas commented Apr 3, 2024

Closed as issues were fixed in different PR

@Strexas Strexas deleted the clinvar_web_scraper branch April 3, 2024 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants