simons-hub/rust-word-analyzer

rust-word-analyzer

A CLI tool that reads .txt and .docx files, counts word frequencies, and displays results sorted alphabetically and by count — built with custom linked list data structures in Rust.

This is an educational project that solves a real problem (word frequency analysis) using deliberately low-level data structures to explore Rust's ownership model, unsafe code, and pointer semantics.

Why Linked Lists?

A HashMap<String, usize> counts words in 5 lines. This project uses hand-rolled linked lists instead — not because it's practical, but because linked lists are the canonical hard problem in Rust.
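
For comparison, the HashMap baseline mentioned above might look like the sketch below. The function name count_words is made up for illustration, and splitting on whitespace without any case folding or punctuation stripping is an assumption, not confirmed project behavior.

```rust
use std::collections::HashMap;

/// Count how often each whitespace-separated word occurs in `text`.
/// (Hypothetical helper; tokenization rules are an assumption.)
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // `entry` inserts 0 on first sight, then we bump the counter.
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}
```

Calling count_words("one two one") yields a map with "one" mapped to 2 and "two" mapped to 1. Everything the linked-list version does by hand (lookup, insertion, ordering) is either handled by the map or left to a later sort.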

Rust's ownership system makes linked lists genuinely difficult: each node owns the next, safe code can't express cycles or back-pointers without Rc/RefCell, and mutation requires careful management of borrows. This project tackles that head-on with real, working implementations.

Learning Goals

Concept                            Where it appears
Box<T> heap allocation             Node storage: each node owns the next via Box<WordNode>
Raw pointer manipulation           Tail pointer as *mut WordNode for O(1) append
unsafe blocks                      Dereferencing raw pointers for tail updates
Ownership transfer                 Moving nodes between positions during sort
Insertion sort on a linked list    Alphabetical ordering during initial word collection
Merge sort on a linked list        Re-sorting by word count after collection
Error handling with Result         File I/O, XML parsing, ZIP extraction
Custom macros                      gprintln! and rprintln! for colored output
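
To make the "insertion sort during collection" row concrete, here is a minimal sketch of alphabetical insertion into a Box-based list. The field names (word, count, next) and the helpers insert_sorted and to_pairs are assumptions for illustration; the project's actual WordNode layout and method names may differ.

```rust
// Hypothetical node layout; the real WordNode fields are not shown in this README.
struct WordNode {
    word: String,
    count: usize,
    next: Option<Box<WordNode>>,
}

/// Insert `word` at its alphabetical position, or bump its count if present.
fn insert_sorted(head: &mut Option<Box<WordNode>>, word: &str) {
    let mut cursor = head;
    // Advance while the current node sorts strictly before `word`.
    while cursor.as_deref().map_or(false, |node| node.word.as_str() < word) {
        cursor = &mut cursor.as_mut().unwrap().next;
    }
    // `cursor` now points at the link where `word` belongs.
    if let Some(node) = cursor.as_mut() {
        if node.word == word {
            node.count += 1;
            return;
        }
    }
    // Ownership transfer: take the rest of the list, then splice a new node in front of it.
    let rest = cursor.take();
    *cursor = Some(Box::new(WordNode { word: word.to_string(), count: 1, next: rest }));
}

/// Read the list back as (word, count) pairs, front to back.
fn to_pairs(head: &Option<Box<WordNode>>) -> Vec<(&str, usize)> {
    let mut out = Vec::new();
    let mut cur = head.as_deref();
    while let Some(node) = cur {
        out.push((node.word.as_str(), node.count));
        cur = node.next.as_deref();
    }
    out
}
```

The cursor is a mutable reference to a link (an Option<Box<WordNode>>) rather than to a node, which is what lets the same code handle insertion at the head, in the middle, and at the end without special cases and without unsafe.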

What makes this tricky in Rust

// This pattern — keeping a raw tail pointer alongside an owned head — is
// the core tension. Box gives you ownership, but the tail needs to mutate
// a node that Box already owns. You end up in unsafe territory:

struct WordList {
    head: Option<Box<WordNode>>,   // Owns the list
    tail: *mut WordNode,           // Points into it (unsafe)
}

Other languages let you do this trivially with garbage collection. Rust forces you to reason about who owns what, and this project is a worked example of navigating that.
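
A minimal sketch of how an O(1) append could work with this layout follows. The method names (new, push_back, pairs) and the node fields are assumptions, not the project's confirmed API. One caveat worth stating: mixing Box ownership with a raw pointer into the same allocation works in practice, but checkers such as Miri may flag it under their aliasing model, so treat this as an illustration of the tension rather than a blessed pattern.

```rust
use std::ptr;

// Hypothetical node layout, assumed for this sketch.
struct WordNode {
    word: String,
    count: usize,
    next: Option<Box<WordNode>>,
}

struct WordList {
    head: Option<Box<WordNode>>, // Owns the list
    tail: *mut WordNode,         // Null while the list is empty
}

impl WordList {
    fn new() -> Self {
        WordList { head: None, tail: ptr::null_mut() }
    }

    /// Append a node in O(1) by writing through the raw tail pointer.
    fn push_back(&mut self, word: &str) {
        let mut node = Box::new(WordNode { word: word.to_string(), count: 1, next: None });
        // Moving the Box does not move its heap allocation, so this address
        // still refers to the node after it is linked into the list.
        let raw: *mut WordNode = &mut *node;
        if self.tail.is_null() {
            self.head = Some(node);
        } else {
            // SAFETY: `tail` is non-null here and points at the last node,
            // which is kept alive by the chain of Boxes starting at `head`.
            unsafe { (*self.tail).next = Some(node); }
        }
        self.tail = raw;
    }

    /// Read the list back as (word, count) pairs, front to back.
    fn pairs(&self) -> Vec<(&str, usize)> {
        let mut out = Vec::new();
        let mut cur = self.head.as_deref();
        while let Some(node) = cur {
            out.push((node.word.as_str(), node.count));
            cur = node.next.as_deref();
        }
        out
    }
}
```

Any operation that removes the last node would also have to reset tail, otherwise it dangles. That bookkeeping is exactly what the compiler cannot check for you once a raw pointer is involved.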

Features

  • Reads .txt files (line-by-line word extraction)
  • Reads .docx files (ZIP archive extraction + XML content parsing)
  • Singly linked list with alphabetical insertion sort
  • Doubly linked list variant (experimental)
  • Merge sort to re-order by word count
  • Colored terminal output
  • Integration tests with known expected output
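
The merge sort step from the list above can be sketched in safe Rust as follows. This is an illustration under assumptions, not the project's code: the node fields, the Link alias, and the function names are all hypothetical, and ties are resolved by keeping the original order, which may not match the project's output.

```rust
// Hypothetical node layout and alias, assumed for this sketch.
struct WordNode {
    word: String,
    count: usize,
    next: Option<Box<WordNode>>,
}
type Link = Option<Box<WordNode>>;

/// Detach and return the back half of the list; `head` keeps the front half.
fn split_half(head: &mut Link) -> Link {
    let mut len = 0;
    let mut cur = head.as_deref();
    while let Some(node) = cur {
        len += 1;
        cur = node.next.as_deref();
    }
    let mut cursor = head;
    for _ in 0..(len + 1) / 2 {
        cursor = &mut cursor.as_mut().unwrap().next;
    }
    cursor.take()
}

/// Merge two lists that are already sorted by descending count.
fn merge_by_count(mut a: Link, mut b: Link) -> Link {
    let mut merged: Link = None;
    let mut tail = &mut merged;
    loop {
        // Larger count goes first; on a tie, `a` wins so the sort is stable.
        let take_from_a = match (a.as_ref(), b.as_ref()) {
            (Some(x), Some(y)) => x.count >= y.count,
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => break,
        };
        let source = if take_from_a { &mut a } else { &mut b };
        let mut node = source.take().unwrap();
        *source = node.next.take(); // pop the node off its list
        tail = &mut tail.insert(node).next; // append it and advance the tail link
    }
    merged
}

/// Sort a list by descending count using top-down merge sort.
fn sort_by_count(mut head: Link) -> Link {
    if head.as_ref().map_or(true, |node| node.next.is_none()) {
        return head; // zero or one node is already sorted
    }
    let back = split_half(&mut head);
    merge_by_count(sort_by_count(head), sort_by_count(back))
}

/// Build a list from (word, count) pairs, preserving their order.
fn from_pairs(pairs: &[(&str, usize)]) -> Link {
    let mut head = None;
    for &(word, count) in pairs.iter().rev() {
        head = Some(Box::new(WordNode { word: word.to_string(), count, next: head }));
    }
    head
}

/// Read the list back as (word, count) pairs, front to back.
fn to_pairs(head: &Link) -> Vec<(&str, usize)> {
    let mut out = Vec::new();
    let mut cur = head.as_deref();
    while let Some(node) = cur {
        out.push((node.word.as_str(), node.count));
        cur = node.next.as_deref();
    }
    out
}
```

Merge sort suits linked lists because splitting and merging only relink nodes: every node is moved by ownership transfer and nothing is copied or reallocated.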

Usage

cargo run -- path/to/file.txt
cargo run -- path/to/document.docx

Example Output

Printing list sorted alphabetically:
Node 1: Count: 5 Word: eight
Node 2: Count: 1 Word: five
Node 3: Count: 1 Word: four
...

Printing list sorted by word count:
Node 1: Count: 13 Word: one
Node 2: Count: 9 Word: two
Node 3: Count: 5 Word: ten
...

Total non-unique words: 50
Total unique words: 10

Project Structure

src/
├── main.rs                            # CLI entry point, file type routing
├── readfile.rs                        # .txt and .docx file parsers
├── word_tracker_single_linkedlist.rs  # Singly linked list (primary)
├── word_tracker_double_linkedlist.rs  # Doubly linked list (experimental)
└── utilities/
    ├── mod.rs                         # Module declarations
    └── print_utils.rs                 # Colored terminal output macros

tests/
├── sort_by_word_count_txt_test.rs     # Integration test for .txt files
├── sort_by_word_count_docx_test.rs    # Integration test for .docx files
└── data/
    ├── input.txt                      # Test fixture
    └── input.docx                     # Test fixture (same content as .txt)
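
As a rough idea of what the colored output macros in print_utils.rs could look like, here is a dependency-free sketch. It is a guess in several respects: the project uses the colored crate, whereas this version writes raw ANSI escape codes; the mapping of gprintln! to green and rprintln! to red is assumed from the names; and the paint helper is hypothetical.

```rust
/// Wrap `text` in an ANSI SGR color sequence (32 = green, 31 = red).
/// Hypothetical helper; the project relies on the `colored` crate instead.
fn paint(code: u8, text: &str) -> String {
    format!("\x1b[{}m{}\x1b[0m", code, text)
}

/// println!-style macro that prints its formatted message in green (assumed color).
macro_rules! gprintln {
    ($($arg:tt)*) => {
        println!("{}", paint(32, &format!($($arg)*)))
    };
}

/// println!-style macro that prints its formatted message in red (assumed color).
macro_rules! rprintln {
    ($($arg:tt)*) => {
        println!("{}", paint(31, &format!($($arg)*)))
    };
}
```

Both macros accept the same arguments as println!, for example gprintln!("Total unique words: {}", 10), because they forward their token tree to format! unchanged.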

Testing

cargo test

Tests run the full binary against known input files and validate:

  • Alphabetical sort order and word counts
  • Word-count sort order (descending)
  • Correct handling of multiple whitespace and line breaks

Dependencies

Crate        Purpose
zip          Extract .docx ZIP archives
quick-xml    Parse Word document XML
colored      Colored terminal output

License

MIT License. See LICENSE for details.
