Description
Currently, in preprocess_text() within preprocess.py, we preserve a fixed set of ASCII-based emoticons by replacing them with EMOJI_i placeholders before cleaning the text. The replacement uses text.replace(emoji, placeholder), which replaces every substring match, even when the “emoji” is actually part of a word or punctuation pattern.
For example:
• "recap:" triggers a false match on "p:"
• "):" is a plausible emoticon at the end of "I'm sorry):", but not in "(text): means something", where it is just a closing parenthesis followed by a colon
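The false matches above can be reproduced with a minimal sketch of the substring-replacement approach. The emoticon list and the helper name naive_preserve are illustrative assumptions, not the actual contents of preprocess.py:

```python
# Illustrative subset only; the real list in preprocess.py may differ.
EMOTICONS = [":)", "):", "p:"]

def naive_preserve(text):
    """Replace every substring occurrence of an emoticon with EMOJI_i."""
    for i, emoticon in enumerate(EMOTICONS):
        text = text.replace(emoticon, f"EMOJI_{i}")
    return text

print(naive_preserve("recap: the plan"))          # recaEMOJI_2 the plan
print(naive_preserve("(text): means something"))  # (textEMOJI_1 means something)
```

In both inputs no emoticon was intended, yet a placeholder is inserted in the middle of ordinary text.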
What's more, count_emojis() in reddit_tags.py uses a different detection strategy. In addition, Unicode-based emojis are not detected at all. We should consider supporting them, since they definitely appear in chat data from Reddit.
Possible Solution
Detecting text-based emojis has been a challenge. There is no guarantee of 100% accuracy unless we use LLMs, so we need to decide whether to accept false positives or false negatives:
- What we're doing now: allow false positives but keep the regex simple and consistent. If the noise is tolerable, we can accept that it will catch things like "):" in "(text):" and rely on frequency and position to balance it out.
- If we're fine with missing things like "Hello:)", we can restrict matches to emoticons that are set off by whitespace or followed by punctuation. This requires a more comprehensive regex pattern.
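A rough sketch of the second option, assuming an emoticon only counts when it starts at the beginning of the string or after whitespace and is followed by whitespace, basic punctuation, or the end of the string. The emoticon list, the punctuation set, and the name preserve_strict are hypothetical choices for illustration:

```python
import re

# Illustrative subset only; the real list in preprocess.py may differ.
EMOTICONS = [":)", "):", "p:"]

# Longest emoticons first so a longer one is never shadowed by a shorter prefix.
_alternatives = "|".join(
    re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True)
)

# (?<!\S): not preceded by a non-space character (start of string or whitespace).
# (?=\s|[.,!?]|$): followed by whitespace, basic punctuation, or end of string.
STRICT_RE = re.compile(rf"(?<!\S)(?:{_alternatives})(?=\s|[.,!?]|$)")

_INDEX = {e: i for i, e in enumerate(EMOTICONS)}

def preserve_strict(text):
    """Replace only standalone emoticons with EMOJI_i placeholders."""
    return STRICT_RE.sub(lambda m: f"EMOJI_{_INDEX[m.group()]}", text)

print(preserve_strict("recap: the plan"))  # recap: the plan
print(preserve_strict("I'm sorry ):"))     # I'm sorry EMOJI_1
print(preserve_strict("Hello:)"))          # Hello:)
```

The last call shows the accepted false negative: an emoticon glued to the preceding word is left untouched. If this route is taken, the same compiled pattern could be shared with count_emojis() so both code paths agree.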
As for Unicode-based emojis, the emoji package can easily detect them.
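For reference, a minimal sketch of Unicode emoji detection, assuming version 2.x of the emoji package, where emoji.emoji_list() returns dicts with match_start, match_end, and emoji keys. The helper name find_unicode_emojis is hypothetical, and the regex fallback is only a rough stand-in covering a few common code point ranges so the snippet still runs when the package is not installed:

```python
import re

try:
    import emoji  # third-party: pip install emoji (2.x assumed)
except ImportError:
    emoji = None

# Rough fallback for a few common emoji blocks; NOT equivalent to the package.
_FALLBACK_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def find_unicode_emojis(text):
    """Return (start, end, emoji) tuples for Unicode emojis found in text."""
    if emoji is not None:
        return [
            (m["match_start"], m["match_end"], m["emoji"])
            for m in emoji.emoji_list(text)
        ]
    return [(m.start(), m.end(), m.group()) for m in _FALLBACK_RE.finditer(text)]

found = find_unicode_emojis("nice 😂 work 👍")
print([e for _, _, e in found])  # ['😂', '👍']
```

The returned positions could be used to swap Unicode emojis for placeholders in the same pass that handles ASCII emoticons.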