Improve Emoji recognition and counting #360

@sundy1994

Description

Currently, in preprocess_text() within preprocess.py, we preserve a fixed set of ASCII-based emoticons by replacing them with EMOJI_i placeholders before cleaning the text. The replacement uses text.replace(emoji, placeholder), which blindly replaces every substring match, even when the matched characters are actually part of a word or an ordinary punctuation pattern.

For example:
• "recap:" triggers a false match on "p:"
• "):" may be a valid emoticon at the end of "I'm sorry):", but it is a false match in "(text): means something"
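
The two false matches above can be reproduced with a minimal sketch of the replace-based approach. The EMOTICONS list and the naive_preserve() helper are hypothetical stand-ins, not the actual contents of preprocess.py; only the text.replace(emoji, placeholder) call and the EMOJI_i placeholder naming come from the description.

```python
# Hypothetical subset of the preserved emoticons; the real list in
# preprocess.py may differ.
EMOTICONS = [":)", "):", "p:"]


def naive_preserve(text: str) -> str:
    """Replace every substring occurrence of each emoticon with EMOJI_i."""
    for i, emoticon in enumerate(EMOTICONS):
        # str.replace has no notion of word boundaries, so it also
        # rewrites emoticon-like substrings inside ordinary tokens.
        text = text.replace(emoticon, f"EMOJI_{i}")
    return text


print(naive_preserve("recap: it went well"))     # recaEMOJI_2 it went well
print(naive_preserve("(text): means something"))  # (textEMOJI_1 means something)
```

In both cases the placeholder swallows part of a normal word or a closing parenthesis, which is the behaviour this issue proposes to fix.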

In addition, count_emojis() in reddit_tags.py uses a different matching strategy, so the two functions do not agree on what counts as an emoji. Unicode-based emojis are also not detected at all. We should consider supporting them, as they definitely exist in chat data from Reddit.

Possible Solution

Detecting text-based emojis has been a challenge. There's no guarantee of 100% accuracy unless we use LLMs, so we need to decide whether to accept false positives or false negatives:

  1. What we're doing now: allow false positives but keep the regex simple and consistent. If the noise is tolerable, we can accept that it will catch things like "):" in "(text):" and rely on frequency and position to balance it out.
  2. If we're fine with missing things like "Hello:)", we can restrict matches to emoticons that are set off by whitespace or followed by punctuation. This requires a more comprehensive regex pattern.
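
As a rough illustration of option 2, the sketch below only matches an emoticon when it is preceded by whitespace (or the start of the text) and followed by whitespace, common punctuation, or the end of the text. The EMOTICONS list, the pattern, and find_strict() are assumptions made for this example; the exact boundary rules would still need to be agreed on.

```python
import re

# Hypothetical subset of emoticons, for illustration only.
EMOTICONS = [":)", "):", "p:", ":("]

# Escape each emoticon so characters like ")" are matched literally,
# and try longer emoticons first in case one is a prefix of another.
_alternatives = "|".join(
    re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True)
)

# (?<!\S)  : preceded by whitespace or start of string
# (?=...)  : followed by whitespace, common punctuation, or end of string
STRICT_EMOTICON_RE = re.compile(
    rf"(?<!\S)(?:{_alternatives})(?=\s|[.,!?]|$)"
)


def find_strict(text: str) -> list[str]:
    """Return emoticons that stand alone as their own token."""
    return STRICT_EMOTICON_RE.findall(text)


print(find_strict("great :) thanks"))          # [':)']
print(find_strict("recap: done"))              # []
print(find_strict("(text): means something"))  # []
print(find_strict("Hello:)"))                  # [] (accepted false negative)
```

This removes the false positives from the examples above at the cost of missing attached emoticons such as "Hello:)" and "I'm sorry):", which is the trade-off described in option 2.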

As for Unicode-based emojis, the emoji package can easily detect them.
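
A minimal sketch of Unicode emoji counting, assuming the third-party emoji package from PyPI is installed and using its emoji_count() function. The count_unicode_emojis() helper is hypothetical, and the fallback branch is only a rough code-point range check added so the example still runs without the package; it is not a replacement for the package's emoji data.

```python
import re

try:
    import emoji  # third-party: pip install emoji
except ImportError:  # keep the sketch runnable without the package
    emoji = None

# Rough fallback covering common emoji blocks only; it does not handle
# multi-code-point sequences (skin tones, ZWJ families, flags).
_FALLBACK_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def count_unicode_emojis(text: str) -> int:
    """Count Unicode emojis in text (hypothetical helper)."""
    if emoji is not None:
        # emoji_count counts every occurrence, not just distinct emojis.
        return emoji.emoji_count(text)
    return len(_FALLBACK_RE.findall(text))


print(count_unicode_emojis("hi 😀 there 🎉"))  # 2
print(count_unicode_emojis("plain text :)"))   # 0
```

ASCII emoticons such as ":)" are not counted here, so this would complement the text-based detection rather than replace it.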
