Document Similarity Tool Documentation¶
Repository Structure¶
document-similarity-tool/
├── Makefile
├── pyproject.toml
├── README.md
├── similarity_dashboard.html
├── similarity_results.csv
├── src/
│   ├── docsim/
│   └── docsim.egg-info/
└── tests/
    ├── __init__.py
    ├── test_file_handling.py
    └── test_similarity.py
Package Description¶
The Document Similarity Tool compares textual content between PDF and DOCX documents across folders.
Key Features¶
Compare documents within and between two folders
Group documents by subfolder (author)
Generate HTML dashboard with interactive results
Create CSV reports
Installation¶
pip install .
For development:
pip install -e ".[dev]"
Usage¶
Basic command:
docsim folder1 folder2 [options]
Options:
--threshold FLOAT - Similarity threshold (0.0-1.0, default: 0.85)
--csv FILE - Output CSV filename
--html FILE - Output HTML dashboard filename
--workers INT - Number of parallel processes
Example:
docsim ./theses ./papers --threshold 0.9 --csv my_results.csv
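The documented options could be wired up roughly as follows. This is a minimal sketch using the standard-library argparse module, not the actual contents of cli.py: the names build_parser and parse_args are hypothetical, and every default other than the documented 0.85 threshold is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented docsim options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="docsim")
    parser.add_argument("folder1", help="First folder of documents")
    parser.add_argument("folder2", help="Second folder of documents")
    parser.add_argument("--threshold", type=float, default=0.85,
                        help="Similarity threshold (0.0-1.0, default: 0.85)")
    # Assumption: None means "fall back to whatever default the tool uses".
    parser.add_argument("--csv", metavar="FILE", default=None,
                        help="Output CSV filename")
    parser.add_argument("--html", metavar="FILE", default=None,
                        help="Output HTML dashboard filename")
    parser.add_argument("--workers", type=int, default=None,
                        help="Number of parallel processes")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse argv and reject thresholds outside the documented 0.0-1.0 range."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not 0.0 <= args.threshold <= 1.0:
        parser.error("--threshold must be between 0.0 and 1.0")
    return args


if __name__ == "__main__":
    example = parse_args(["./theses", "./papers", "--threshold", "0.9",
                          "--csv", "my_results.csv"])
    print(example.threshold, example.csv)
```

Validating the threshold right after parsing keeps out-of-range values from reaching the comparison step; whether the real CLI performs this check is not stated in this documentation.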
Development¶
Makefile targets:
make install-dev # Install development dependencies
make test # Run tests
make lint # Run linters
make format # Format code
make clean # Remove temporary files
make all # Run all checks
Testing:
pytest --cov=docsim tests/
File Descriptions¶
pyproject.toml¶
Package configuration with dependencies and build settings.
Makefile¶
Automation of development tasks (testing, linting, formatting).
src/docsim/¶
Main package containing:
core.py: Main comparison logic
file_handling.py: Document processing
similarity.py: Similarity calculations
visualization.py: Report generation
cli.py: Command line interface
tests/¶
Unit tests for package functionality.
Folder Structure Processing¶
The tool is designed to work with hierarchical folder structures containing document files.
Supported Structure¶
The expected folder structure is:
submissions_root/
├── group_1/
│   ├── participant_identifier_1/
│   │   ├── document1.pdf
│   │   ├── document2.docx
│   │   └── notes.txt (ignored)
│   └── participant_identifier_2/
│       └── submission.pdf
└── group_2/
    └── participant_identifier_3/
        ├── file_a.pdf
        └── file_b.docx
Example Structures¶
Minimal structure:
submissions_root/
└── participant_1/
    └── submission.pdf
Multiple groups:
course_work/
├── physics_lab/
│   └── student_01/
│       ├── report.pdf
│       └── appendix.docx
└── math_project/
    └── student_02/
        └── solution.pdf
File Handling¶
Directory Scanning:
- Processes all subdirectories recursively
- Only analyzes .pdf and .docx files
- Skips other file types and empty directories
Participant Identification:
- Extracts identifiers from folder names
- Supports common naming patterns:
  name_id
  lastname_firstname
  identifier_additionalinfo
Content Processing:
- Combines all PDF/DOCX files per participant
- Removes identifiers from extracted text
- Normalizes whitespace and formatting
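The content-processing steps can be illustrated with a short sketch. It assumes identifiers are obtained by splitting the folder name on underscores, which matches the naming patterns listed above but is not confirmed by the package source. The helper names identifiers_from_folder, normalize_text, and combine_participant_text are hypothetical.

```python
import re


def identifiers_from_folder(folder_name: str) -> list[str]:
    """Split a folder name such as 'lastname_firstname' into candidate identifiers."""
    return [part for part in folder_name.split("_") if part]


def normalize_text(text: str, identifiers: list[str]) -> str:
    """Remove whole-word identifier matches, then collapse all whitespace runs."""
    for ident in identifiers:
        # Word boundaries avoid deleting an identifier that is part of a longer word.
        pattern = r"\b" + re.escape(ident) + r"\b"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


def combine_participant_text(folder_name: str, extracted_texts: list[str]) -> str:
    """Join the text of all documents of one participant and normalize the result."""
    combined = "\n".join(extracted_texts)
    return normalize_text(combined, identifiers_from_folder(folder_name))


if __name__ == "__main__":
    print(combine_participant_text("doe_jane", ["Report by Jane  Doe\n\n", "Results:\tgood"]))
```

Stripping identifiers before comparison keeps names or IDs repeated in headers from inflating or masking similarity scores.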
Technical Implementation¶
Key functions in file_handling.py:
find_files():
- Uses os.walk() for directory traversal
- Case-insensitive file extension check
- Returns list of absolute file paths
group_files_by_subfolder():
- Groups files by immediate parent directory
- Special cases handled:
  __MACOSX folders ignored
  Hidden files (starting with .) skipped
  Handles nested submission folders
Text extraction:
- PDF: PyMuPDF (fitz) with Tesseract OCR fallback
- DOCX: python-docx library
- All text converted to UTF-8 encoding
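The behavior described for find_files() and group_files_by_subfolder() can be sketched as below. This is an illustration under assumptions, not the code in file_handling.py: the signatures, the SUPPORTED_EXTENSIONS constant, and the use of the parent folder name as the grouping key are guesses, and nested submission folders are not handled here.

```python
import os
from collections import defaultdict

# Assumed constant; the documentation only states that .pdf and .docx are analyzed.
SUPPORTED_EXTENSIONS = (".pdf", ".docx")


def find_files(root: str) -> list[str]:
    """Walk root recursively and return absolute paths of .pdf/.docx files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Lower-casing the name makes the extension check case-insensitive.
            if name.lower().endswith(SUPPORTED_EXTENSIONS):
                found.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(found)


def group_files_by_subfolder(paths: list[str]) -> dict[str, list[str]]:
    """Group paths by the name of their immediate parent directory."""
    groups = defaultdict(list)
    for path in paths:
        if "__MACOSX" in path.split(os.sep):
            continue  # archive metadata folders are ignored
        if os.path.basename(path).startswith("."):
            continue  # hidden files are skipped
        parent = os.path.basename(os.path.dirname(path))
        groups[parent].append(path)
    return dict(groups)


if __name__ == "__main__":
    for participant, files in group_files_by_subfolder(find_files(".")).items():
        print(participant, len(files))
```

Keying on the folder name alone means two participants with identical folder names in different groups would be merged; a real implementation might key on the full parent path instead.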
Output Files¶
similarity_dashboard.html¶
Interactive HTML dashboard showing similarity results.
similarity_results.csv¶
CSV file containing detailed similarity comparisons.
Dependencies¶
Core:
pymupdf
python-docx
scikit-learn
pdf2image
pytesseract
reportlab
tqdm
Development:
pytest
black
flake8
mypy
isort
License¶
MIT License
Acknowledgements¶
This project was developed with assistance from AI tools.