13 changes: 13 additions & 0 deletions .env.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Input directory - where the producer will monitor for .txt files
# This can be any directory on your host machine
# Examples:
# - Relative path: ./data/input
# - Absolute path: /home/user/documents/text-files
# - Windows path: C:/Users/username/Documents/text-files
INPUT_DIR=./data/input

# Output directory - where JSON results will be saved
OUTPUT_DIR=./data/output

# Configuration file path
CONFIG_FILE=./config.json
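
Review note: a minimal sketch of how a service might consume the three variables above. The function names `load_paths` and `list_text_files` are hypothetical and do not come from this PR; the fallback values simply mirror the defaults shown in `.env.example`.

```python
import os
from pathlib import Path


def load_paths(env=None):
    """Resolve the settings from .env.example, falling back to its defaults."""
    env = os.environ if env is None else env
    return {
        "input_dir": Path(env.get("INPUT_DIR", "./data/input")),
        "output_dir": Path(env.get("OUTPUT_DIR", "./data/output")),
        "config_file": Path(env.get("CONFIG_FILE", "./config.json")),
    }


def list_text_files(input_dir):
    """Return the .txt files currently present in input_dir, sorted by name."""
    return sorted(p for p in Path(input_dir).glob("*.txt") if p.is_file())
```

Passing a plain dict as `env` keeps the lookup testable without touching the real process environment.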
26 changes: 26 additions & 0 deletions .gitignore
@@ -0,0 +1,26 @@
.idea
.gradle
build/
*.iml
*.class
*.jar
*.log

# Data directories (exclude output, keep input with samples)
data/output/*.json

# Go
vendor/

# Python
__pycache__/
*.pyc
*.pyo
venv/
.venv/

# Docker
.docker/

# Environment variables
.env
Binary file removed .gradle/8.4/checksums/checksums.lock
Binary file not shown.
Binary file removed .gradle/8.4/checksums/md5-checksums.bin
Binary file not shown.
Binary file removed .gradle/8.4/checksums/sha1-checksums.bin
Binary file not shown.
Binary file not shown.
Binary file removed .gradle/8.4/executionHistory/executionHistory.bin
Binary file not shown.
Binary file removed .gradle/8.4/executionHistory/executionHistory.lock
Binary file not shown.
Binary file removed .gradle/8.4/fileChanges/last-build.bin
Binary file not shown.
Binary file removed .gradle/8.4/fileHashes/fileHashes.bin
Binary file not shown.
Binary file removed .gradle/8.4/fileHashes/fileHashes.lock
Binary file not shown.
Binary file removed .gradle/8.4/fileHashes/resourceHashesCache.bin
Binary file not shown.
Empty file removed .gradle/8.4/gc.properties
Empty file.
Binary file removed .gradle/buildOutputCleanup/buildOutputCleanup.lock
Binary file not shown.
2 changes: 0 additions & 2 deletions .gradle/buildOutputCleanup/cache.properties

This file was deleted.

Binary file removed .gradle/buildOutputCleanup/outputFiles.bin
Binary file not shown.
Binary file removed .gradle/file-system.probe
Binary file not shown.
Empty file removed .gradle/vcs-1/gc.properties
Empty file.
21 changes: 21 additions & 0 deletions Dockerfile.go
@@ -0,0 +1,21 @@
FROM golang:1.21-alpine AS builder

WORKDIR /app

COPY src/go.mod src/go.sum ./
RUN go mod download

COPY src/ ./

ARG SERVICE_NAME
RUN CGO_ENABLED=0 GOOS=linux go build -o /app/service ./${SERVICE_NAME}

FROM alpine:latest

RUN apk --no-cache add ca-certificates

WORKDIR /root/

COPY --from=builder /app/service .

CMD ["./service"]
11 changes: 11 additions & 0 deletions Dockerfile.python
@@ -0,0 +1,11 @@
FROM python:3.10-slim

WORKDIR /app

COPY src/workers/sentiment/requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY src/workers/sentiment/worker.py .

CMD ["python", "-u", "worker.py"]
1 change: 1 addition & 0 deletions README.md
@@ -1,3 +1,4 @@
[![Review Assignment Due Date](https://classroom.github.com/assets/deadline-readme-button-22041afd0340ce965d47ae6ef1cefeee28c7c493a6346c4f15d667ab976d596c.svg)](https://classroom.github.com/a/QODoQuhO)
# Distributed Text Data Processing Using a Message Broker

## Assignment goal:
30 changes: 30 additions & 0 deletions config.json
@@ -0,0 +1,30 @@
{
"rabbitmq": {
"url": "amqp://guest:guest@rabbitmq:5672/",
"exchange_name": "text_processing",
"exchange_type": "topic",
"tasks_routing_key_prefix": "task",
"results_routing_key": "results"
},
"workers": {
"word_count_workers": 2,
"top_n_workers": 2,
"sentence_sort_workers": 2,
"sentiment_workers": 2,
"name_replacement_workers": 2
},
"processing": {
"top_n_words": 10,
"section_size_chars": 5000
},
"producer": {
"monitor_interval_seconds": 3,
"file_ready_check_delay_ms": 500
},
"input": {
"data_dir": "/data/input/articles"
},
"output": {
"results_dir": "/data/output"
}
}
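
Review note: a rough sketch of how the `rabbitmq` and `processing` settings above might be used. Everything here is an assumption rather than the PR's code: the helper names are hypothetical, the `<prefix>.<task_name>` routing-key shape is a guess that fits a topic exchange, and splitting on raw character counts (which can cut a word in half) is only one possible reading of `section_size_chars`.

```python
import re
from collections import Counter

# Subset of config.json, inlined so the sketch runs on its own.
CONFIG = {
    "rabbitmq": {"tasks_routing_key_prefix": "task", "results_routing_key": "results"},
    "processing": {"top_n_words": 10, "section_size_chars": 5000},
}


def task_routing_key(config, task_name):
    """Build a routing key; the '<prefix>.<task_name>' shape is an assumption."""
    return f'{config["rabbitmq"]["tasks_routing_key_prefix"]}.{task_name}'


def split_sections(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def top_n_words(text, n):
    """Return the n most frequent lowercase words with their counts."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(n)
```

With these helpers, a producer could call `split_sections(text, CONFIG["processing"]["section_size_chars"])` and publish each section under a key such as `task_routing_key(CONFIG, "word_count")`.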
1 change: 1 addition & 0 deletions data/input/performance-test.txt
@@ -0,0 +1 @@
This is a test file for performance tracking. The system works great and processes text efficiently. Every component performs well. The distributed architecture enables parallel processing. Workers handle tasks quickly and reliably.
10 changes: 10 additions & 0 deletions data/input/sample-russian.txt
@@ -0,0 +1,10 @@
#####
Иванов Иван Иванович
Смирнов Семён Семёнович
#####

Иванов Иван Иванович работает в крупной технологической компании. Работа Иванова связана с распределёнными системами. Ивану Ивановичу нравится программирование на Go и Python.
Вчера Иванов встретился с коллегой для обсуждения нового проекта. Идеи Иванова о внедрении RabbitMQ были хорошо восприняты командой. Все были впечатлены техническими знаниями Ивана.
Руководитель попросил Иванова Ивана Ивановича возглавить новую инициативу. Иван будет отвечать за архитектуру системы. Все согласны, что Иванов - идеальный выбор для этой роли.
В свободное время Иван Иванович любит читать технические книги. Недавно Иванов выступил с докладом на конференции разработчиков. Презентация Ивана была хорошо принята аудиторией.
Коллеги Иванова высоко ценят его опыт. Работа с Ивановым всегда продуктивна. Ивану удаётся находить решения для сложных задач.
11 changes: 11 additions & 0 deletions data/input/sample-with-name-replacement.txt
@@ -0,0 +1,11 @@
#####
Ivanov Ivan Ivanovich
Smirnov Semyon Semyonovich
#####

Ivanov Ivan Ivanovich is a talented software engineer who works at a major technology company. He has been developing distributed systems for many years. Ivan Ivanovich is known for his expertise in message queue systems and microservices architecture.
Yesterday, Ivanov met with his colleague to discuss the new project requirements. Mr. Ivanov presented his ideas about implementing a RabbitMQ-based solution. The team was impressed by Ivan's technical knowledge and problem-solving skills.
IVANOV has contributed significantly to open-source projects. His GitHub profile shows numerous repositories related to distributed computing. I. I. Ivanov is respected in the developer community for his clear documentation and helpful code reviews.
The project manager asked Ivanov I. I. to lead the new initiative. Ivan will be responsible for architecting the system and mentoring junior developers. Everyone agrees that Ivanov Ivan is the perfect choice for this challenging role.
In his personal life, Ivan Ivanovich enjoys reading technical books and attending conferences. He recently gave a talk about scalable architectures at a major developer conference. The presentation by Ivanov was well-received by the audience.
Looking forward, I.I. Ivanov plans to continue contributing to the field of distributed systems. His passion for technology and dedication to excellence make him an invaluable team member. The future looks bright for both Ivanov and his projects.
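
Review note: a minimal sketch of how the `#####` header in this sample might be parsed. It assumes, without confirmation from the PR, that the first name in the header is the one to find and the second is its replacement. The function names are hypothetical. The replacement shown handles only the exact full name; variants that appear in the sample, such as `IVANOV`, `I. I. Ivanov`, or `Ivanov I. I.`, would need extra matching rules.

```python
def parse_name_header(text, marker="#####"):
    """Split a document into (names, body); names is empty when no header exists."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != marker:
        return [], text
    # Find the closing marker line after the opening one.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == marker), None)
    if end is None:
        return [], text
    names = [ln.strip() for ln in lines[1:end] if ln.strip()]
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return names, body


def replace_full_name(body, original, replacement):
    """Replace exact, case-sensitive occurrences of the full name only."""
    return body.replace(original, replacement)
```

Keeping parsing separate from replacement lets the header be stripped once by the producer while each section is processed independently.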
7 changes: 7 additions & 0 deletions data/input/sample1.txt
@@ -0,0 +1,7 @@
The distributed text processing system is an amazing achievement in modern software engineering. It demonstrates the power of microservices architecture and message-driven design patterns. RabbitMQ enables efficient communication between different components, allowing them to work together seamlessly.
This system processes text data in parallel using multiple workers. Each worker specializes in a specific task, making the overall system more efficient and scalable. The word count worker analyzes the number of words in each section. The top-N words worker identifies the most frequently occurring terms.
Sentiment analysis is crucial for understanding the emotional tone of text. Using advanced machine learning models from Hugging Face, we can determine whether the text expresses positive, negative, or neutral sentiment. This capability has numerous applications in business intelligence and social media monitoring.
The aggregator component plays a vital role in combining results from all workers. It waits for all sections to be processed before merging the data. The merge operation ensures that results are combined in the correct order, maintaining data integrity throughout the pipeline.
Modern distributed systems must be fault-tolerant and resilient. Docker containers provide isolation and consistency across different environments. Docker Compose orchestrates multiple services, making deployment simple and repeatable. This architecture supports horizontal scaling by adding more worker instances.
Performance optimization is essential for handling large datasets. The system splits text into manageable sections, enabling parallel processing. Load balancing ensures that work is distributed evenly among available workers. Monitoring and logging help identify bottlenecks and improve system performance.
The future of text processing lies in combining traditional algorithms with deep learning models. Natural language processing continues to evolve with new techniques and approaches. Cloud-native architectures enable systems to scale dynamically based on workload demands. Innovation in this field drives progress across many industries.
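
Review note: the sample text above describes an aggregator that waits for every section before merging results in order. A small sketch of that idea follows. The class `SectionAggregator` and its methods are hypothetical and are not taken from the PR's aggregator.

```python
class SectionAggregator:
    """Collect per-section results and merge them once all have arrived."""

    def __init__(self, total_sections):
        self.total = total_sections
        self.results = {}

    def add(self, index, result):
        """Store the result for one section; sections may arrive in any order."""
        if not 0 <= index < self.total:
            raise IndexError(f"section index {index} out of range")
        self.results[index] = result

    def is_complete(self):
        """True once a result has been stored for every section."""
        return len(self.results) == self.total

    def merge(self):
        """Return the results ordered by section index."""
        if not self.is_complete():
            raise ValueError("not all sections have been processed")
        return [self.results[i] for i in range(self.total)]
```

Keying results by section index means arrival order does not matter, and for additive results such as word counts the merged list can simply be summed.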
5 changes: 5 additions & 0 deletions data/input/sample2.txt
@@ -0,0 +1,5 @@
Software development is a challenging but rewarding field. Every day brings new problems to solve and opportunities to learn. Developers must constantly adapt to changing technologies and methodologies. The pace of innovation never slows down.
Unfortunately, not all projects succeed. Many fail due to poor planning or communication issues. Technical debt accumulates when shortcuts are taken. Maintenance becomes increasingly difficult over time. These challenges test the resilience of development teams.
However, success is achievable with the right approach. Good documentation helps team members understand complex systems. Code reviews improve quality and share knowledge. Automated testing catches bugs early in the development cycle. These practices lead to better outcomes.
Collaboration tools have transformed how teams work together. Version control systems like Git enable parallel development. Issue trackers organize tasks and priorities. Communication platforms facilitate discussion and decision-making. Remote work has become the norm for many organizations.
The satisfaction of solving difficult problems makes the effort worthwhile. Seeing users benefit from your work provides motivation. Building something that improves people's lives is incredibly fulfilling. This sense of purpose drives developers to excel in their craft.
7 changes: 7 additions & 0 deletions data/input/test-performance.txt
@@ -0,0 +1,7 @@
The distributed text processing system is an amazing achievement in modern software engineering. It demonstrates the power of microservices architecture and message-driven design patterns. RabbitMQ enables efficient communication between different components, allowing them to work together seamlessly.
This system processes text data in parallel using multiple workers. Each worker specializes in a specific task, making the overall system more efficient and scalable. The word count worker analyzes the number of words in each section. The top-N words worker identifies the most frequently occurring terms.
Sentiment analysis is crucial for understanding the emotional tone of text. Using advanced machine learning models from Hugging Face, we can determine whether the text expresses positive, negative, or neutral sentiment. This capability has numerous applications in business intelligence and social media monitoring.
The aggregator component plays a vital role in combining results from all workers. It waits for all sections to be processed before merging the data. The merge operation ensures that results are combined in the correct order, maintaining data integrity throughout the pipeline.
Modern distributed systems must be fault-tolerant and resilient. Docker containers provide isolation and consistency across different environments. Docker Compose orchestrates multiple services, making deployment simple and repeatable. This architecture supports horizontal scaling by adding more worker instances.
Performance optimization is essential for handling large datasets. The system splits text into manageable sections, enabling parallel processing. Load balancing ensures that work is distributed evenly among available workers. Monitoring and logging help identify bottlenecks and improve system performance.
The future of text processing lies in combining traditional algorithms with deep learning models. Natural language processing continues to evolve with new techniques and approaches. Cloud-native architectures enable systems to scale dynamically based on workload demands. Innovation in this field drives progress across many industries.
File renamed without changes.