API Migration Automation script - patch 2 #1031
Conversation
niccolopaganini commented on Sep 13, 2023
- Converted the .ipynb file to Python scripts (12 in total, each serving its own purpose; explained in README.md)
- Added a shell script which executes the aforementioned Python scripts
- Included a README.md file explaining the whole process
- Included a Screenshots directory for the README
- Added an archive folder containing the .ipynb file (with updated code) and the YML file, kept for future reference
Mentioned in the issue; also noting here for reference: spotty internet connection (I couldn't connect my work computer to campus wifi). I will try to update the error logs sometime tonight.
This is a lot better than the previous notebook, but still needs some cleanup.
High level comments below.
- Why is `environment.yml` in `_archive`? It is still valid and needs to be activated for the script to run, correct?
- Please make sure to pin the dependencies for all packages!
- The e-mission environment already has several of these dependencies; have you considered just adding on to the existing environment with a separate file, similar to https://github.com/e-mission/e-mission-server/blob/master/setup/environment36.notebook.additions.yml? What are the pros and cons of that approach?
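To make the suggestion concrete, an additions file layered on top of the existing e-mission environment might look like the sketch below. The file name, package list, and version pins are illustrative only; the actual packages and pins would come from this PR's dependencies.

```
# environment.api_migration.additions.yml (hypothetical name)
# Extra packages for the API migration scripts, applied on top of the
# base e-mission environment with something like:
#   conda env update --name emission --file environment.api_migration.additions.yml
dependencies:
  - requests=2.31.0
  - beautifulsoup4=4.12.2
```

The advantage of this approach is that shared dependencies stay pinned in one place; the cost is that the scripts can no longer be run against a standalone environment file.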
```
- _anaconda_depends
- python=3.11
- os
- csv
```
why do you need csv when you have pandas?
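For illustration, the csv-module read loop elsewhere in this PR can collapse into a single pandas call. This is a hedged sketch, not the PR's code: the file layout (one package name per line, as in `unique_packages.csv`) and the column name are assumptions.

```python
# Sketch: replacing a csv.reader loop with pandas.
# Assumes a one-column file of package names; "package" is an illustrative label.
import io
import pandas as pd

def read_unique_packages(handle):
    # read_csv parses the file in one call; str.strip removes stray whitespace
    df = pd.read_csv(handle, header=None, names=["package"], skip_blank_lines=True)
    return set(df["package"].str.strip())

sample = io.StringIO("android.app \n android.net\nandroid.app\n")
print(read_unique_packages(sample))  # {'android.app', 'android.net'}
```

Besides dropping an import, this also keeps the data as a DataFrame, which fits the later suggestion of passing frames between stages.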
```
- requests
- bs4
- re
- subprocess
```
why do you need subprocess?
bin/API_migration_scripts/README.md
Outdated
```
links.py -> links = get_links("https://developer.android.com/sdk/api_diff/33/changes/alldiffs_index_changes")
```
_Link should look something like this:_
```
https://developer.android.com/sdk/api_diff/33/changes/alldiffs_index_changes
```
It seems like it would be much better for this to be a command line argument instead of requiring code edits?
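One way to do what the reviewer suggests is an `argparse` flag with the current URL as the default. This is a sketch under assumptions: the flag name `--diff-url` and the parser wiring are illustrative, not part of the PR.

```python
# Sketch: take the diff-index URL as a command-line argument
# instead of requiring an edit to links.py. Flag name is hypothetical.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="API migration helper")
    parser.add_argument(
        "--diff-url",
        default="https://developer.android.com/sdk/api_diff/33/changes/alldiffs_index_changes",
        help="URL of the Android API diff index to scrape",
    )
    return parser.parse_args(argv)

args = parse_args(["--diff-url", "https://example.com/diff"])
print(args.diff_url)  # https://example.com/diff
```

With a default in place, the common case still needs no arguments, but a new API level needs no code edit.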
bin/API_migration_scripts/README.md
Outdated
**3. Update locations for ```output.py``` accordingly**
```
match_csv_java(".../e-mission-phone/plugins", ".../e-mission-phone/bin/API_Migration_scripts/simplify_classes.csv")
```
why do we need to update this location given that the default value is a relative path? Are you thinking that people may have different names for the `plugins` directory? If so, why and how?
bin/API_migration_scripts/README.md
Outdated
```
chmod +x run_them_py_and_delete_em_csvs.sh
```
Why do you need this if you are going to use `bash` to run the script anyway?
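The reviewer's point can be demonstrated directly: invoking a script through the interpreter does not require the execute bit, which `chmod +x` sets. The sketch below uses a throwaway temp file rather than the PR's actual script.

```shell
# Demonstration: a non-executable file still runs via `bash file`.
# Direct invocation (./file) is what would require chmod +x.
tmp=$(mktemp)
echo 'echo hello' > "$tmp"
chmod -x "$tmp"      # explicitly clear the execute bit
bash "$tmp"          # prints: hello
rm -f "$tmp"
```

So the `chmod +x` step in the README is only needed if the docs tell users to run `./run_them_py_and_delete_em_csvs.sh` directly.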
This won't actually run with the current dependencies (which don't include any of the jupyter notebook modules).
And it is actually possible to edit the python scripts directly.
So is it actually worth keeping the notebook around?
I have not encountered this kind of script structure before, where you essentially run 10 different python files, each of which creates a new csv. It almost seems like you took every cell from the notebook and made it a separate python file. While modularity is important, this seems extreme, and I'm concerned that the fragmentation will make it harder for people to understand the code and the flow.
Have you considered making each of these a function in a shared file, and then creating a `__main__` that calls all the functions? You can then pass the dataframe around the functions instead of writing and reading csvs at every stage.
If you have, please explain your rationale for this approach.
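As a concrete sketch of the suggested structure: each former script becomes a function in one module, and the DataFrame flows between stages in memory instead of through intermediate CSVs. All function names and the sample data below are illustrative, not the PR's actual script names.

```python
# Sketch: one module, one function per stage, DataFrame passed between stages.
import pandas as pd

def fetch_changes():
    # stage 1: would scrape the API diff pages; stubbed with sample data here
    return pd.DataFrame({"cls": ["android.net.Foo", "android.app.Bar"]})

def simplify_classes(df):
    # stage 2: derive the package from the fully qualified class name
    out = df.copy()
    out["package"] = out["cls"].str.rsplit(".", n=1).str[0]
    return out

def unique_packages(df):
    # stage 3: collapse to the distinct packages
    return sorted(df["package"].unique())

def main():
    df = fetch_changes()
    df = simplify_classes(df)
    return unique_packages(df)

if __name__ == "__main__":
    print(main())  # ['android.app', 'android.net']
```

This also removes the need for the shell script's CSV cleanup steps, since nothing is written to disk between stages.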
If something can be represented as text, it should be, to make it easier to search, copy/paste etc
In this case, this can be replaced by:
```
$ ls -al
...
```
Ditto
What is your rationale for even needing 4 scripts? Why not just combine them into one script?
Not sure what happened, but when I deleted the repo on GitHub, the local dir was also deleted. I restored it from a recent HEAD and continued working on it afterward. I have now (hopefully) addressed all the comments. With this push,
README.md
Outdated
why is this file included here?
```
links = []
for a in soup.find_all("a"):
    if a.has_attr("href") and a["href"].startswith("/sdk/api_diff/33/changes/"):
```
this is still hardcoded
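One way to remove the hardcoding is to build the href prefix from an API-level parameter (which could come from the same command-line argument as the diff URL). The function names below are hypothetical, for illustration only.

```python
# Sketch: derive the href prefix from an API-level parameter instead of
# hardcoding "/sdk/api_diff/33/changes/". Function names are illustrative.
def diff_href_prefix(api_level):
    return f"/sdk/api_diff/{api_level}/changes/"

def is_change_link(href, api_level):
    # the scraper's startswith() check, parameterized by API level
    return href.startswith(diff_href_prefix(api_level))

print(is_change_link("/sdk/api_diff/34/changes/pkg_android.app", 34))  # True
```

With this, migrating to the next Android API level is a one-argument change rather than a code edit.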
bin/API_migration_scripts/API_mcf.sh
Outdated
```
if [ -e classes.csv ]; then
    rm classes.csv
fi

if [ -e modified_csv.csv ]; then
    rm modified_csv.csv
fi
```
why do we have to write csv files any more given that all the functions are in the same file? We can just pass the dataframes around and avoid the I/O overhead.
```
@@ -0,0 +1,248 @@
#API changes identifier
import os
import csv
```
I made this same point in an earlier comment: why do we need `csv` instead of just using `pandas`?
#1031 (comment)
Also, where is the `environment.yml` with the new packages needed by this code?
```
unique_packages = set()
with open("unique_packages.csv", "r") as unique_file:
    reader = csv.reader(unique_file)
    for row in reader:
        if row:
            unique_packages.add(row[0].strip())  # Remove leading/trailing whitespace
```
Suggested change:
```diff
-unique_packages = set()
-with open("unique_packages.csv", "r") as unique_file:
-    reader = csv.reader(unique_file)
-    for row in reader:
-        if row:
-            unique_packages.add(row[0].strip())  # Remove leading/trailing whitespace
+output_df = match_csv_java(....)
+unique_packages = get_unique_packages(output_df)
```
```
with open("unique_packages.csv", "w") as f:
    csvwriter = csv.writer(f)
    for package in packages:
        csvwriter.writerow([package])
```
Suggested change:
```diff
-with open("unique_packages.csv", "w") as f:
-    csvwriter = csv.writer(f)
-    for package in packages:
-        csvwriter.writerow([package])
+return packages
```
```
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(get_changed_content, links)
```
Good job on experimenting with python concurrency.
Although from a design perspective, this is a bit of overkill for what should be a super simple script.
Four minutes is not that long for a script that is run once every year, and concurrency can add complexity that makes it harder to maintain simple scripts.
However, this appears to be embarrassingly parallel, so the complexity seems manageable for now.
So I will not request a change for now, but will just note that simplicity is also a virtue, especially for simple scripts that are not part of the production system.
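For reference, the "embarrassingly parallel" pattern in question is `executor.map`: each link is processed independently, and results come back in input order. The worker below is a stub standing in for the PR's `get_changed_content`, which actually fetches and parses pages.

```python
# Sketch of the map pattern: apply an independent worker to each item
# concurrently; results preserve the input order.
import concurrent.futures

def get_changed_content(link):
    # real version would fetch and parse the page; stub returns the link length
    return len(link)

links = ["/changes/a", "/changes/bb", "/changes/ccc"]
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(get_changed_content, links))
print(results)  # [10, 11, 12]
```

The sequential equivalent is just `results = [get_changed_content(l) for l in links]`, which is the simpler fallback if the concurrency ever becomes a maintenance burden.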
Force-pushed from 5566a8b to b8a5cce.