API Migration Automation script - patch 2 #1031

Closed

Conversation

@niccolopaganini
Contributor

  1. Converted the .ipynb file to Python scripts (12 in total, each serving its own purpose; explained in README.md)
  2. Added a shell script that executes the aforementioned Python scripts
  3. Included a README.md file explaining the whole process
  4. Included a screenshots directory for the README
  5. Included an archive folder containing the .ipynb file (with updated code) and the YML file, kept in case it is needed for future reference

@niccolopaganini
Contributor Author

Mentioned in the issue, but also noting here for reference: spotty internet connection (I couldn't connect my work computer to the campus wifi); I will try to update the error logs sometime tonight.

@shankari (Contributor) left a comment


This is a lot better than the previous notebook, but it still needs some cleanup.
High-level comments below.

Contributor

  1. why is environment.yml in the archive folder? It is still valid and needs to be activated for the script to run, correct?
  2. Please make sure to pin the dependencies for all packages!
  3. The e-mission environment already has several of these dependencies; have you considered just adding on to the existing environment with a separate file, similar to https://github.com/e-mission/e-mission-server/blob/master/setup/environment36.notebook.additions.yml? What are the pros and cons of that approach?

```
- _anaconda_depends
- python=3.11
- os
- csv
```
Contributor

why do you need csv when you have pandas?

```
- requests
- bs4
- re
- subprocess
```
Contributor

why do you need subprocess?

Comment on lines 72 to 78
```
links.py -> links = get_links("https://developer.android.com/sdk/api_diff/33/changes/alldiffs_index_changes")
```
_Link should look something like this:_
```
https://developer.android.com/sdk/api_diff/33/changes/alldiffs_index_changes
```
Contributor

It seems like it would be much better for this to be a command line argument instead of requiring code edits?
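For illustration, a minimal sketch of what that could look like with `argparse` (`get_links` is taken from the snippet above; the import and argument name are assumptions, not the PR's actual code):

```
import argparse

from links import get_links  # assumed helper from the PR's links.py

# Hypothetical CLI wrapper so the diff URL no longer requires a code edit.
parser = argparse.ArgumentParser(description="Fetch Android API diff links")
parser.add_argument(
    "diff_url",
    help="e.g. https://developer.android.com/sdk/api_diff/33/changes/alldiffs_index_changes",
)
args = parser.parse_args()
links = get_links(args.diff_url)
```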

Comment on lines 80 to 83
**3. Update locations for `output.py` accordingly**
```
match_csv_java(".../e-mission-phone/plugins", ".../e-mission-phone/bin/API_Migration_scripts/simplify_classes.csv")
```
Contributor

why do we need to update this location given that the default value is a relative path? Are you thinking that people may have different names for the plugins directory? If so, why and how?
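One way to avoid the edit entirely would be to give the function defaults relative to the repo root; a rough sketch (the paths and behavior here are assumptions, not the PR's actual implementation):

```
import os

def match_csv_java(plugins_dir="plugins",
                   csv_path="bin/API_Migration_scripts/simplify_classes.csv"):
    # Defaults are relative to the repo root, so nothing needs to be
    # edited as long as the script is run from e-mission-phone/.
    plugins_dir = os.path.abspath(plugins_dir)
    csv_path = os.path.abspath(csv_path)
    print(f"Matching {csv_path} against Java sources in {plugins_dir}")
```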

Comment on lines 89 to 91
```
chmod +x run_them_py_and_delete_em_csvs.sh
```
Contributor

Why do you need this if you are going to use bash to run the script anyway?

Contributor

This won't actually run with the current dependencies (which don't include any of the jupyter notebook modules).
And it is possible to edit the Python scripts directly anyway.

So is it worth keeping the notebook around?

Contributor

I have not encountered this kind of script structure before, where you essentially run 10 different python files, each of which creates a new csv. It almost seems like you took every cell from the notebook and made it a separate python file. While modularity is important, this seems extreme, and I'm concerned that the fragmentation will make it harder for people to understand the code and the flow.

Have you considered making each of these a function in a shared file, and then creating a `__main__` that calls all the functions? You can then pass the dataframe around the functions instead of writing and reading csvs at every stage.

If you have, please explain your rationale for this approach.
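A rough sketch of that shape, with invented stage names purely for illustration (the real stages would come from the existing scripts):

```
import pandas as pd

def scrape_changes(url):
    # Stage 1: fetch the API diff page and return the raw rows.
    return pd.DataFrame({"class": [], "change": []})

def simplify_classes(raw_df):
    # Stage 2: normalize class names; takes and returns a DataFrame,
    # so no intermediate CSV is written.
    return raw_df

def match_against_plugins(simplified_df, plugins_dir):
    # Stage 3: cross-reference the changes against the Java sources.
    return simplified_df

if __name__ == "__main__":
    url = "https://developer.android.com/sdk/api_diff/33/changes/alldiffs_index_changes"
    result = match_against_plugins(simplify_classes(scrape_changes(url)), "plugins")
    print(result)
```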

Contributor

If something can be represented as text, it should be, to make it easier to search, copy/paste, etc.
In this case, this can be replaced by

```
$ ls -al
...
```

Contributor

Ditto

@niccolopaganini
Contributor Author

  1. Addressed the issues regarding fragmentation (going from 12 scripts to 4 "head" python scripts), which keeps the modularity without the fragmentation
  2. Paths have been changed to be relative
  3. The API changes link previously had to be edited in the code; now it can be supplied as input by the user.
  4. Pending: README needs to be updated & setup script needs to be created.

@niccolopaganini
Contributor Author

4. README needs to be updated ✅

@shankari
Contributor

shankari commented Sep 24, 2023

Addressed the issues regarding fragmentation (going from 12 scripts to 4 "head" python scripts), which keeps the modularity without the fragmentation

What is your rationale for even needing 4 scripts? Why not just combine into one script?
As I pointed out, you can still have separate methods in the one script.

@niccolopaganini
Contributor Author

niccolopaganini commented Oct 3, 2023

Not sure what happened, but when I deleted the repo on GitHub, the local dir was also deleted. I restored it from a recent HEAD and worked on it from there. I have now (hopefully) addressed all the comments. With this push:

  1. I have boiled it down to one Python script
  2. Parallelized the process, so instead of taking 4 minutes to run, it now takes 28 seconds on average to execute everything
  3. Two bash files
    • One to execute everything
    • One to remove the rest once the work is done
  4. Updated README
  5. Also removed all the CSV nonsense, so a person using this will just have to refer to one file

README.md Outdated
Contributor

why is this file included here?


```
links = []
for a in soup.find_all("a"):
    if a.has_attr("href") and a["href"].startswith("/sdk/api_diff/33/changes/"):
```
Contributor

this is still hardcoded
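For illustration, the prefix could be derived from the user-supplied API level instead of baking in "33" (the `api_level` prompt is an assumption; `soup` is the parsed page from the surrounding code):

```
# Build the href prefix from the API level supplied by the user,
# rather than hardcoding 33.
api_level = input("Android API level (e.g. 33): ").strip()
prefix = f"/sdk/api_diff/{api_level}/changes/"

links = []
for a in soup.find_all("a"):
    if a.has_attr("href") and a["href"].startswith(prefix):
        links.append(a["href"])
```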

Comment on lines 9 to 15
```
if [ -e classes.csv ]; then
    rm classes.csv
fi

if [ -e modified_csv.csv ]; then
    rm modified_csv.csv
fi
```
Contributor

why do we have to write csv files anymore, given that all the functions are in the same file? We can just pass the dataframes around and avoid the I/O overhead.

@@ -0,0 +1,248 @@
```
#API changes identifier
import os
import csv
```
Contributor

I had this in an earlier comment. Why do we need csv instead of just using pandas?
#1031 (comment)
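For comparison, a pandas equivalent of the csv.reader loop quoted later in this review, assuming the same single-column unique_packages.csv layout:

```
import pandas as pd

# Read the one-column CSV and strip whitespace in a single vectorized
# call, replacing the csv.reader loop.
unique_packages = set(
    pd.read_csv("unique_packages.csv", header=None)[0].str.strip()
)
```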

Contributor

Also, where is the environment.yml with the new packages needed by this code?

Comment on lines 209 to 214
```
unique_packages = set()
with open("unique_packages.csv", "r") as unique_file:
    reader = csv.reader(unique_file)
    for row in reader:
        if row:
            unique_packages.add(row[0].strip())  # Remove leading/trailing whitespace
```
Contributor

Suggested change
```diff
-unique_packages = set()
-with open("unique_packages.csv", "r") as unique_file:
-    reader = csv.reader(unique_file)
-    for row in reader:
-        if row:
-            unique_packages.add(row[0].strip())  # Remove leading/trailing whitespace
+output_df = match_csv_java(....)
+unique_packages = get_unique_packages(output_df)
```
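The `get_unique_packages` helper in the suggestion doesn't exist yet; a minimal sketch, assuming the output DataFrame has a `package` column:

```
def get_unique_packages(output_df):
    # Dedupe and strip whitespace in one pass; the "package" column
    # name is an assumption for illustration.
    return set(output_df["package"].str.strip())
```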

Comment on lines 118 to 121
with open("unique_packages.csv", "w") as f:
csvwriter = csv.writer(f)
for package in packages:
csvwriter.writerow([package])
Contributor

Suggested change
```diff
-with open("unique_packages.csv", "w") as f:
-    csvwriter = csv.writer(f)
-    for package in packages:
-        csvwriter.writerow([package])
+return packages
```

Comment on lines 175 to 177
```
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(get_changed_content, links)
```
Contributor

Good job on experimenting with Python concurrency.

Although from a design perspective, this is a bit of overkill for what should be a super simple script.
Four minutes is not that long for a script that is run once a year, and concurrency can add complexity that makes simple scripts harder to maintain.

However, this appears to be embarrassingly parallel, so the complexity seems manageable. I will not request a change for now, but will just note that simplicity is also a virtue, especially for simple scripts that are not part of the production system.
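For reference, the serial equivalent is a one-liner (names taken from the snippet above), which is part of why the tradeoff is close:

```
# Serial version: one fetch at a time; slower, but the simplest to read and debug.
results = [get_changed_content(link) for link in links]
```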
