-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Courageous Comets #3
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: thijsfranck <thijsfranck@gmail.com>
Co-authored-by: thijsfranck <thijsfranck@gmail.com>
Co-authored-by: thijsfranck <thijsfranck@gmail.com>
Co-authored-by: isaa-ctaylor <isaac@isaactaylor.xyz>
BREAKING CHANGE: dependencies are no longer updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This project was huge. It was a lot of fun to see what it was doing, go through NLTK functions and see them used and applied here, and see how the data was managed and tricks you used to keep things fast.
Overall I liked it a lot. My comments are really critical and are written as if you had way more time to work on this project, so I want you know they're coming from someone who really enjoyed stepping through the code, and are written from the perspective of what I'd like to see in the next steps.
There was a lot I didn't comment on just because I'm not familiar enough with the format/linting/ci/cd/etc. From all I can tell it looks solid and stable, but the nlp side was more my speed so my comments focus more on that.
Things I really liked
The professionalism of the project is impressive. The commit history with leading words indicating what the commit was suppose achieve, feature vs fix, etc, was really nice. The documentation was also exceptional and helped with a lot of figuring things out.
I liked the 'avoid hammering the api' features of the bot. Using the db and cache is an important step past simple call/response loops that make scaling as restarting, changes, and multiple users begin adding load.
Also I love NLP projects, I liked the stemming and search features with embeddings. That stuff is just fun and discord is a great place to implement and learn more about them while having fun with data that makes a lot more sense than academia data can at first blush.
If you had more time
The project was huge. There was so much boilerplate, and I think it added to the difficulty to the organization of the project overall.
A lot of the boilerplate could be condensed into a family of functions to just check error conditions and kept in a single location. Then on a command from a user, it could run through it's checks, then a function for things like semantics could just be the semantics with a single function call for the error checks at the very top rather than the column of checks in so many functions.
Similarly there were points where you were building thresholds to grab within a radius of an embedding in the 'grab from redis function'. that prep should probably be further from the redis layer that way the only thing the redis file would handle is the database interaction itself. Mapping the text to an embedding and saying how similar you want your matches to be all belongs to the same process, and the db call just be very minimal.
Additionally there's a lot of nlp functions spread everywhere. You have stuff about the word frequency in a cog, in that you have another directory for keywords and sentiment, then in nltk you have functionality, then you've got a hugging face transformer directory. The UI view has a lot of functions for each of the nlp functions. The main level has a sentiment file as well.
It got to be a wild chase across a lot of tabs to figure out what went where. I think I'd like to have a large refactor, and rather than having it broken out into cogs, nltk, redis, ui, and main level, I feel like you should have a class for each feature (sentiment/frequency/etc), and the class has a ui attribute, a query redis attribute, a process text attribute, etc.
I feel like the project grew very wide very fast, and that in combination with the boilerplate made it really hard to see how best to group things and refactoring became technical debt very rapidly.
A lot of the function groups do make a ton of sense--redis is great, discord interactions make sense. But the nlp was spread across so many places.
Future development
Ok if you were to keep building this, here's added bells and whistles I'd love in the theme of "information overload". And this is absolutely out of scope, but since you've built a great platform and a fun bot I thought you might enjoy hearing about possible ways to extend it.
Guild Similarity Scores: If a user brings the bot into two guilds, how similar are they? Latent Semantic Analysis and Latent Dirichlet Allocation are great ways to process the semantic similarity of servers.
MORE PLOTS: I love visualization, and yours are great but there's so much you can do. For each channel in a guild, you can parse the messages, do the aforementioned LSA or LDA, and then plot processed messages into some point space and reduce it with something like SVD or UMAP.
- https://scikit-learn.org/stable/modules/decomposition.html#truncated-singular-value-decomposition-and-latent-semantic-analysis
- https://umap-learn.readthedocs.io/en/latest/exploratory_analysis.html#open-syllabus-galaxy
No Deep Learning:
There's no reason not to use transformers or deep learning, but there are really powerful ways to make embeddings that don't rely on them. TF-IDF combined with dimension reduction works on smaller training sets than a lot of deep learning models so if you find some specific servers are not performing well, you can build your own embedding and use that without the need for anything from pytorch.
In all, this was an awesome project! I really enjoyed stepping through it and seeing what features you added and how you processed calls. Well done!
If a critical error occurs, attempt to shut down gracefully. | ||
""" | ||
# Override logging configuration by dependencies | ||
settings.setup_logging() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the amount of logging--it is very helpful to trace what broke if a version of something isn't working, but when you have bots interact with other users it might be helpful to think of being able to give those logs to various levels of users.
The logs you've got working right now are great dev logs, but moderators who have your bot in their community will also want to be able to query the bot to understand why the bot performed a behavior even if the bot isn't running on their hardware, and users will need to be able to understand what broke when they have a broken command/response.
A fair amount of the logging calls made are pieces of info that shouldn't be held to only the devs. This project team has a great grasp on what kind of information is helpful to log, but to level it up I'd like to see the team split the logged information into blocks for different groups of users. That'll go the needed extra step--having information is good, but until you get it to the right people it doesn't do nearly as much as it could.
interaction.id, | ||
) | ||
|
||
if self.bot.redis is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of these boilerplate safety checks should be moved into their own function.
With it kept in every call, each function is very bloated and takes a lot of time for the reader to parse, and it's so many repetitive tasks. Collecting all of the typical, 'check if redis is up, check if we're in a guild' etc tasks in one location each function is a lot easier to review and to make changes to.
The cost of the abstraction is added ambiguity with what threw the error, so to match that there will also probably need to be bot-state flags added to the call to process each of these checks. While it's a small amount of overhead, it makes tracing errors faster and clearer, and that is generally helpful even without this move.
query_processed = preprocessing.process(query) | ||
embedding = await self.bot.vectorizer.aencode(query_processed) | ||
|
||
messages = await get_messages_by_semantics_similarity( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like how this order of operations avoids putting many api calls on Discord. processing from the local db first and paring down the message space is a really good practice
""" | ||
search_scope = build_search_scope(guild_id, ids, scope) | ||
|
||
query = VectorQuery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was about to comment on how queries will get bloated and the embedding should be managed on their own database, that way the space can be subdivided instead of iterated over, but this seems to do exactly that. I've learned something pretty nifty today.
redis=self.bot.redis, | ||
vectorizer=self.bot.vectorizer, | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to either see comments here about each step, or a bit in the docstrings about the steps taken, that way if a changes the behavior of a function, you have some comment explaining what the function 'should' be doing. Unit tests are great to catch when a change broke something, but then you still are often left trying to decide what the function was originally suppose to be doing.
As it stands as an outside reviewer, I'm scanning trying to ignore the safety checks as those are repetitive, then seeing you're preprocessing a message, to pass it to an embedding, etc. A heads up that,
This function uses the context menu,
takes a message,
preprocesses then vectorizes it,
compares the embedding of the input to the embedded in the db
and returns up to 6 messages with similar semantics
Or comments ahead of each processing step would make it easier to follow. I had to click through the project to get the bot vectorizer function on display to actually check what was happening in this step, and that's a lot of work to just follow along.
|
||
# Boilerplate cog code. | ||
# To build a cog: | ||
# 1. Copy this boilerplate to another file in the cogs directory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cog needed in the codebase at this point?
""" | ||
The `test__example` module contains example tests for a Python project. | ||
|
||
For guidelines on writing good tests, see the contributor guide. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, this example probably is either unneeded, or it should live somewhere else. A top level directory with this sitting in it feels like it's a forgotten artifact.
from courageous_comets.redis.schema import MESSAGE_SCHEMA | ||
from courageous_comets.transformers import init_transformers | ||
from courageous_comets.vectorizer import Vectorizer | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that this configuration split out, it's easy to read and helps setup the rest of the tests.
pos=mocker.ANY, | ||
compound=mocker.ANY, | ||
) | ||
result = calculate_sentiment("I love this product!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very nice little test string. :)
As the feature set grows, it might be convenient to have all of the strings you test, and the expected outcome over each kind of test, held in their own file rather than in each test.
from courageous_comets.sentiment import calculate_sentiment | ||
from courageous_comets.vectorizer import Vectorizer | ||
from courageous_comets.words import tokenize_sentence, word_frequency | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are good overall and are way more than usually seen, but if a specific class of functions have a common failure mode, it's be nice to have blurbs that explain it in the test.
Having something fail, vs having it tell you what needs to change is a big difference. With NLP processes, "Success" can be ambiguous, so a commentary about what a failure means and what could make it a success could help a lot when it comes to integrating upgrades..
Here in the redis tests, there can be a lot of weird db failure modes and a bit explaining what a failure could be caused by might help during upgrade attempts versions down the line.
No description provided.