Document Similarity Tool Documentation

Repository Structure

document-similarity-tool/
├── Makefile
├── pyproject.toml
├── README.md
├── similarity_dashboard.html
├── similarity_results.csv
├── src/
│   ├── docsim/
│   └── docsim.egg-info/
└── tests/
    ├── __init__.py
    ├── test_file_handling.py
    └── test_similarity.py

Package Description

The Document Similarity Tool compares textual content between PDF and DOCX documents across folders.

Key Features

  • Compare documents within and between two folders

  • Group documents by subfolder (author)

  • Generate HTML dashboard with interactive results

  • Create CSV reports

Installation

pip install .

For development:

pip install -e ".[dev]"

Usage

Basic command:

docsim folder1 folder2 [options]

Options:

  • --threshold FLOAT - Similarity threshold (0.0-1.0, default: 0.85)

  • --csv FILE - Output CSV filename

  • --html FILE - Output HTML dashboard filename

  • --workers INT - Number of parallel processes

Example:

docsim ./theses ./papers --threshold 0.9 --csv my_results.csv

Development

Makefile targets:

make install-dev   # Install development dependencies
make test         # Run tests
make lint         # Run linters
make format       # Format code
make clean        # Remove temporary files
make all          # Run all checks

Testing:

pytest --cov=docsim tests/

File Descriptions

pyproject.toml

Package configuration with dependencies and build settings.

Makefile

Automation of development tasks (testing, linting, formatting).

src/docsim/

Main package containing:

  • core.py: Main comparison logic

  • file_handling.py: Document processing

  • similarity.py: Similarity calculations

  • visualization.py: Report generation

  • cli.py: Command line interface

tests/

Unit tests for package functionality.

Folder Structure Processing Folder Structure Processing —————————

The tool is designed to work with hierarchical folder structures containing document files.

Supported Structure

The expected folder structure is:

submissions_root/
├── group_1/
│   ├── participant_identifier_1/
│   │   ├── document1.pdf
│   │   ├── document2.docx
│   │   └── notes.txt (ignored)
│   └── participant_identifier_2/
│       └── submission.pdf
└── group_2/
    └── participant_identifier_3/
        ├── file_a.pdf
        └── file_b.docx

Example Structures

Minimal structure:

submissions_root/
└── participant_1/
    └── submission.pdf

Multiple groups:

course_work/
├── physics_lab/
│   └── student_01/
│       ├── report.pdf
│       └── appendix.docx
└── math_project/
    └── student_02/
        └── solution.pdf

File Handling

  1. Directory Scanning: - Processes all subdirectories recursively - Only analyzes .pdf and .docx files - Skips other file types and empty directories

  2. Participant Identification: - Extracts identifiers from folder names - Supports common naming patterns:

    • name_id

    • lastname_firstname

    • identifier_additionalinfo

  3. Content Processing: - Combines all PDF/DOCX files per participant - Removes identifiers from extracted text - Normalizes whitespace and formatting

Technical Implementation

Key functions in file_handling.py:

  1. find_files(): - Uses os.walk() for directory traversal - Case-insensitive file extension check - Returns list of absolute file paths

  2. group_files_by_subfolder(): - Groups files by immediate parent directory - Special cases handled:

    • __MACOSX folders ignored

    • Hidden files (starting with .) skipped

    • Handles nested submission folders

  3. Text extraction: - PDF: PyMuPDF (fitz) with Tesseract OCR fallback - DOCX: python-docx library - All text converted to UTF-8 encoding

Example Structures

Minimal structure: submissions_root/

├── group_1/ │ ├── participant_identifier_1/ │ │ ├── document1.pdf │ │ ├── document2.docx │ │ └── notes.txt (ignored) │ └── participant_identifier_2/ │ └── submission.pdf └── group_2/

└── participant_identifier_3/

├── file_a.pdf └── file_b.docx

Output Files

similarity_dashboard.html

Interactive HTML dashboard showing similarity results.

similarity_results.csv

CSV file containing detailed similarity comparisons.

Dependencies

Core:

  • pymupdf

  • python-docx

  • scikit-learn

  • pdf2image

  • pytesseract

  • reportlab

  • tqdm

Development:

  • pytest

  • black

  • flake8

  • mypy

  • isort

License

MIT License

Acknowledgements

This project was developed with assistance from AI tools.