GitHub - phoenorg/zoryatrace: ZoryaTrace is a project designed to analyse texts and determine whether the content is AI-generated or not, training on individual data for each case to achieve high efficiency.

ZoryaTrace is a powerful artificial intelligence algorithm designed to analyze texts and determine whether the content is AI-generated or not. ZoryaTrace leverage individual user data to determinate if LLM was used to generate text. To get into the mathematical details, ZoryaTrace uses the Naive Bayes classifier, which is based on the Bayes' theorem to achieve this goal.

🚧 Currently in HEAVY development – many features are still being built and refined.

📚 Project progress :

📜 Informations (13 april 2025) : The data extractor, the algorithm (backend) and the interface (zorya.py) have been developed. I've also written the full documentation. I still have to implement default ai-generated training data.

Features

AI Detection - Analyze text to classify AI-generated content.
User-Based Analysis - Adapt detection based on individual user patterns.
Scalability - Designed to handle large-scale text processing.

🆚 Comparison Table: Why ZoryaTrace ?

Feature	ZoryaTrace	Other Solutions
Open Source	✅ Transparent and modifiable	❌ Often proprietary and closed-source
Privacy-focused	✅ No user tracking, fully local processing	❌ Often cloud-based, collects user data
Security	✅ Local execution, no external data leaks	❌ Data sent to third-party servers
Efficiency	✅ Optimized TF-IDF & Naive Bayes, few seconds analysis	⚠️ Can be slow due to complex models or API calls
Ease of Use	✅ Easy setup	⚠️ Usually easy, but sometimes requires API keys or dependencies
Customization	✅ Fully customizable training dataset	❌ Limited or no customization options
Lightweight	✅ Minimal dependencies, runs on low-end devices	❌ Heavy dependencies, requires cloud infrastructure
No API Limits	✅ Works offline, no request limits	❌ API-based, limited free requests
Modern LLMs detection	❌ May struggle with hightly creative AI texts from modern LLMs	✅ Hightly creative AI text is often better detected
Human/AI merged content	❌ May struggle with human-edited AI content	✅ Human/AI merged content may be betters supported

📖 Why the name "Zorya" ?

Zorya refers to two (sometimes three) deities in Slavic mythology — Zorya Utrennyaya (Morning Star) and Zorya Vechernyaya (Evening Star), occasionally joined by Zorya Polunochnaya (Midnight Star). These celestial sisters serve as guardians of Simargl, a cosmic hound chained to the star Polaris. If the chain ever breaks, it is said that the universe would be destroyed.

The Zoryas are tasked with watching the sky, opening the gates of the Sun each morning and closing them each night. They represent constant vigilance, protection against unseen threats, and balance between light and darkness.

ZoryaTrace draws inspiration from this mythos: just as the Zorya monitor the heavens for signs of cosmic disruption, the tool monitors digital texts to detect traces of artificial generation. The goal is not to judge, but to provide early signals, protect informational integrity, and offer clarity.

Disclaimer

Please Read Carefully

ZoryaTrace is provided as an experimental tool for content analysis and educational research purposes. By using this software, you acknowledge and agree to the following:

The results produced by ZoryaTrace are probabilistic in nature and do not constitute definitive or authoritative assessments.
The tool may generate false positives or false negatives, and must not be solely relied upon for critical decisions.
ZoryaTrace is not certified for legal, regulatory, compliance, or forensic use.
The accuracy and reliability of outputs depend heavily on the quality and relevance of the underlying data.
The developers, contributors, and associated entities assume no liability for any direct or indirect damages, losses, or consequences resulting from the use or interpretation of the software’s outputs.
Use of this tool is entirely at the user's own risk.

Recommended Best Practices:

Always conduct a human review of any content flagged or analyzed by the tool.
For professional or domain-specific usage, consider retraining the model on relevant datasets.
Maintain proper audit logs and documentation for transparency and accountability.
Ensure that your use of the software complies with applicable laws and ethical guidelines in your jurisdiction.

By proceeding, you acknowledge your understanding of these limitations and agree to use ZoryaTrace responsibly.

Limitations

🪫 As explained above, ZoryaTrace is still under development. This means that you can't download the executables directly at the moment, you have to run it in a python environment, as explained below. In addition, some databases are not currently provided; you need to provide your own to run ZoryaTrace. This will be temporary, as the project is evolving very quickly.

Installation

Option 1: From Source (Recommended)

git clone https://github.com/Malwprotector/ZoryaTrace.git
cd zoryatrace
pip install -r requirements.txt

Then run it with

python3 zorya.py

Option 2: Pre-built Executable

Windows, Linux and Mac binaries will be available here soon !

System Requirements

Python 3.8+
1GB RAM minimum (4GB recommended)
200MB disk space

Usage

🖥️ Application Interface Overview

Main application window with tab navigation

📚 1. First, we need to create training data !

You'll need to create training data before analyzing things. But don't worry, everything will be explained simply ! ✨

Step 1: Add Training Files

Click "Add human written PDF(s)" to import your human written data. You can import several : the aim is to import as many PDFs written by a human as possible. If you want to base yourself on a particular subject or langage, you can import the data for that subject.
Toggle checkbox to use default suspicious data or add custom data. So in fact, you don't have to import data written by an Artificial Intelligence, because ZoryaTrace comes with this data. But as mentioned above, if you're working on a particular subject and you have text data generated by Artificial Intelligence, you can import that too.
View loaded files in the listbox.

Step 2: Create Training Data

Click "Create Training Data".
The training data file is named data.csv, and it will be saved in the same directory of ZoryaTrace.
Monitor progress in status bar (It's often extremely fast).
Completion notification will appear !
Now that you have created the training data, you can use the algorithm freely. Make sure you don't delete or move your data.csv training data file.

🗺️ Exploring tabs

📄 4. Analyze PDF Tab

This is ZoryaTrace's main function. Here, you submit written files, and the algorithm analyses them to determine the percentage of content generated by AI.

Step 1: Load PDF

Click "Select PDF File".
Choose document to analyze.
Filename appears in status.

Step 2: Run Analysis

Click "Analyze PDF".
Watch real-time progress bar (often very fast).
Results display with:
- 🔴 Red highlighted suspicious sentences.
- 🟢 Green neutral text.
- Summary statistics.

🔍 2. Classify Text Tab

Paste or type text in the input box.
Click "Classify Text".
View results:
- 🟢 Green = Neutral.
- 🔴 Red = Suspicious.

🧪 3. Test Algorithm Tab

Click "Run Algorithm Test"
View metrics:
- Precision
- Recall
- F-score
- Accuracy
Sample test sentences with predictions

⚠️ Troubleshooting Common Issues

Problem: "Training data not loaded"
Solution:

Navigate to Data Preparation tab.
Ensure both human written and ai-generated samples are loaded.
Verify data.csv exists in working directory.

Problem: Slow PDF processing
Solution:

Split large documents (>50 pages).
Close other memory-intensive applications.
Use simpler PDFs (avoid scanned documents).

Algorithm

ZoryaTrace employs a hybrid NLP pipeline combining:

1. Text Processing Layer

Sentence Segmentation: Advanced regex patterns handle complex punctuation
Tokenization: NLTK's word_tokenize with custom modifications
Normalization:
- Case folding (lowercasing)
- Porter stemming
- Stop word removal (customizable list)

2. Feature Engineering

TF-IDF Vectorization:
- Term Frequency-Inverse Document Frequency weighting
- Adaptive n-gram range (1-3 grams)
- Sublinear TF scaling
Lexical Features:
- Sentence length analysis
- Word rarity scoring
- Syntactic pattern matching

3. Classification Core

Naive Bayes Classifier with TF-IDF features
Dual Probability Models:
- Neutral content profile
- Suspicious content profile
Adaptive Thresholding:
- Dynamic classification boundaries
- Confidence scoring

Resources

I aven't added my ressources yet I will probably write something about ZoryaTrace.

Contributing

Contributions are welcome! Feel free to make :

Bug reports
Feature requests
Performance optimizations
Additional language support

Thank you !

License

ZoriaTrace is licensed under CC BY-NC-SA 4.0

Credits

Developed with <3 by me. Special thanks to contributors and testers who help improve ZoryaTrace (there are none yet).

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
screenshots		screenshots
.gitignore		.gitignore
README.md		README.md
cython.sh		cython.sh
dataset_protocol.txt		dataset_protocol.txt
logo.jpg		logo.jpg
logo_noback.png		logo_noback.png
requirements.txt		requirements.txt
zorya.py		zorya.py
zoryaface.c		zoryaface.c
zoryaface.cpp		zoryaface.cpp

Folders and files

Latest commit

History

Repository files navigation

Table of Contents

Features

🆚 Comparison Table: Why ZoryaTrace ?

📖 Why the name "Zorya" ?

Disclaimer

Recommended Best Practices:

Limitations

Installation

Option 1: From Source (Recommended)

Option 2: Pre-built Executable

System Requirements

Usage

🖥️ Application Interface Overview

📚 1. First, we need to create training data !

Step 1: Add Training Files

Step 2: Create Training Data

🗺️ Exploring tabs

📄 4. Analyze PDF Tab

Step 1: Load PDF

Step 2: Run Analysis

🔍 2. Classify Text Tab

🧪 3. Test Algorithm Tab

⚠️ Troubleshooting Common Issues

Algorithm

1. Text Processing Layer

2. Feature Engineering

3. Classification Core

Resources

Contributing

License

Credits

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages