Improve Emoji recognition and counting #360

## Description

Currently, `preprocess_text()` in `preprocess.py` preserves a fixed set of ASCII-based emoticons by replacing them with `EMOJI_i` placeholders before cleaning the text. The replacement uses `text.replace(emoji, placeholder)`, which blindly replaces any substring match, even when the “emoji” is actually part of a word or punctuation pattern.

For example:
- "recap:" triggers a false match on "p:".
- "):" at the end of a token, as in "I'm sorry):", may be a valid emoticon, but it is not one in "(text): means something".
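The false matches above can be reproduced with a minimal sketch of the substring-replacement approach. The `EMOTICONS` list and the `naive_preserve()` helper are hypothetical stand-ins for illustration; the real emoticon set and logic live in `preprocess_text()` and may differ.

```python
# Hypothetical subset of the emoticon list; the real one is in preprocess.py.
EMOTICONS = [":)", "):", "p:", ":D"]


def naive_preserve(text: str) -> str:
    """Replace every substring match with an EMOJI_i placeholder."""
    for i, emoticon in enumerate(EMOTICONS):
        # str.replace has no notion of token boundaries, so it also
        # rewrites emoticon-like substrings inside ordinary words.
        text = text.replace(emoticon, f"EMOJI_{i}")
    return text


print(naive_preserve("quick recap: it works"))    # quick recaEMOJI_2 it works
print(naive_preserve("(text): means something"))  # (textEMOJI_1 means something
```

In both cases the placeholder replaces ordinary text, so later cleaning and counting steps see an emoticon that was never there.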

What's more, `count_emojis()` in `reddit_tags.py` actually uses a different strategy, so recognition and counting are not consistent with each other. Besides, Unicode-based emojis are not detected at all. We should consider supporting them, as they definitely exist in chat data from Reddit.

## Possible Solution

Detecting text-based emojis is a challenge. No approach is guaranteed to be 100% accurate unless we use LLMs. We need to decide whether to accept false positives or false negatives:

1. What we're doing now: allow false positives but keep the regex simple and consistent. If the noise is tolerable, we can accept that it will catch things like "):" in "(text):" and rely on frequency and position to balance it out.
2. If we're fine with missing things like "Hello:)", we can restrict matches to emojis that are set off by whitespace or followed by punctuation. This requires a more comprehensive regex pattern.
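The trade-off between the two options can be sketched with two patterns built from the same emoticon list. This is only one possible reading of "spaced or followed by punctuation". The `EMOTICONS` subset, the `LOOSE` and `STRICT` names, and the punctuation class are assumptions, not the project's actual patterns.

```python
import re

# Hypothetical subset of the emoticon list; longest first so that longer
# emoticons win over shorter ones sharing a prefix.
EMOTICONS = [":)", "):", "p:", ":D"]
_ALTERNATION = "|".join(
    re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True)
)

# Option 1: match anywhere (simple, but allows false positives).
LOOSE = re.compile(_ALTERNATION)

# Option 2: require whitespace or start-of-string before the emoticon,
# and whitespace, basic punctuation, or end-of-string after it.
STRICT = re.compile(rf"(?<!\S)(?:{_ALTERNATION})(?=\s|[.,!?]|$)")


def count_emoticons(text: str, pattern: re.Pattern) -> int:
    """Count non-overlapping emoticon matches for the given pattern."""
    return len(pattern.findall(text))


for sample in ["Hello:)", "(text): means something", "I'm sorry ): really"]:
    print(sample, count_emoticons(sample, LOOSE), count_emoticons(sample, STRICT))
```

With these assumptions, `LOOSE` counts one emoticon in each of the three samples. `STRICT` counts one only in the last sample. It avoids the "(text):" false positive and misses "Hello:)".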

As for Unicode-based emojis, the `emoji` package can easily detect them.
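A minimal sketch of Unicode emoji counting, assuming the third-party `emoji` package is installed and using its `emoji_count()` function. The `count_unicode_emojis()` helper and the regex fallback are hypothetical additions so the snippet still runs without the package. The fallback covers only a few common code-point ranges and is not a substitute for the package.

```python
import re

try:
    import emoji  # third-party: pip install emoji
except ImportError:  # keep the sketch runnable without the dependency
    emoji = None

# Rough fallback over a few common emoji blocks; far from exhaustive and
# unaware of multi-code-point sequences (skin tones, ZWJ, flags).
_FALLBACK = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def count_unicode_emojis(text: str) -> int:
    """Count Unicode emojis, preferring the emoji package when available."""
    if emoji is not None:
        return emoji.emoji_count(text)
    return len(_FALLBACK.findall(text))


print(count_unicode_emojis("great job \U0001F600\U0001F389"))
print(count_unicode_emojis("no emoji here :)"))
```

ASCII emoticons such as ":)" are not counted here, so this would complement the text-based detection rather than replace it.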
