Advanced Grid Extraction Pipeline for Scanned Documents with handwritten Arabic text
Computer Vision based grid extraction tool for scanned documents and forms with Arabic handwritten text.
Version: 1.0
Overview
This project provides an interactive tool, built with Streamlit, for extracting table grid structures from scanned documents, particularly targeting challenging cases like:
- Documents containing tabular data with Arabic handwritten text (where text strokes can be confused with grid lines).
- Scans with potentially faint or broken grid lines.
The primary goal is to accurately identify and extract the coordinates of table cells (bounding boxes) to facilitate subsequent Optical Character Recognition (OCR) on a cell-by-cell basis, ultimately reconstructing a structured digital table from the scanned image.
Table of Contents
- Overview
- Features
- Technology Stack
- Setup and Installation
- Usage
- Pipeline Explained
- Parameter Guide
- Current Status & Limitations
Features
- Interactive Parameter Tuning: Uses Streamlit sliders and widgets to adjust key parameters of the image processing pipeline in real-time.
- Step-by-Step Visualization: Displays intermediate results (preprocessing, line detection, filtering, intersections, clustering, final boxes) to aid understanding and debugging.
- Robust Grid Line Detection: Employs morphological operations specifically designed to distinguish long grid lines from shorter text strokes.
- Line Filtering & Continuity Checks: Includes steps to filter potential lines based on shape (length, thickness) and pixel continuity.
- Intersection Clustering & Refinement: Groups nearby raw intersection points and refines their positions (using Median or experimental RANSAC) to handle imperfections and warping.
- Bounding Box Generation: Computes cell bounding boxes from the refined grid coordinates.
- Merged Cell Indication: Flags potential merged rows/columns based on gap analysis between grid lines.
- Cell Filtering: Optionally filters out unlikely cell boxes based on aspect ratio.
- Optional Deskewing: Includes a feature to automatically detect and correct the overall skew of the document image.
Technology Stack
- Python 3: Version 3.10 or higher.
- Poetry: For dependency management and packaging.
- Streamlit: For the interactive web application interface.
- OpenCV (`opencv-python`): For core image processing tasks.
- PyMuPDF (`fitz`): For efficient PDF-to-image conversion.
- NumPy: For numerical operations.
- Scikit-learn (`sklearn`): Used for the optional RANSAC cluster refinement.
- Pillow: For image handling.
- Matplotlib: (Used implicitly/potentially).
Setup and Installation
Prerequisites
- Python: Version 3.10 or higher (Needed mainly if running locally without Docker).
- Poetry: (Needed mainly if running locally without Docker). Installation instructions at https://python-poetry.org/docs/#installation.
- Git: For cloning the repository.
- Docker: Required if running via Docker. Installation instructions at https://docs.docker.com/engine/install/.
Installation Steps (for Local Development)
These steps are needed if you intend to run the application directly using Poetry, without Docker.
- Clone the Repository:
      git clone https://github.com/rihal-om/grid-extraction-tool.git
      cd grid-extraction-tool
- Install Dependencies using Poetry:
  - Ensure you have Poetry installed.
      poetry install --no-dev
Usage
There are two primary ways to run this application:
Option 1: Running with Docker (Recommended for Deployment/Consistency)
This method uses the container defined in the Dockerfile, ensuring all dependencies (including system ones) are correctly managed.
- Build the Docker Image:
  - Ensure Docker Desktop or Docker Engine is running.
  - Navigate to the project's root directory (containing the `Dockerfile`).
  - Choose an image name (e.g., `grid-extractor-app`).
      docker build -t grid-extractor-app .
- Run the Docker Container:
  - Map a port on your host machine (e.g., 8501) to the container's exposed port (8501).
      docker run -p 8501:8501 --name grid-app-container grid-extractor-app
  - (`--name grid-app-container` is optional; it provides a convenient name for managing the container.)
- Access the App: Open your web browser and navigate to `http://localhost:8501`.
- Stopping the Container: Press `Ctrl+C` in the terminal where `docker run` is active, or open a new terminal and run `docker stop grid-app-container`. You can remove the stopped container with `docker rm grid-app-container`.
Option 2: Running Locally with Poetry (Useful for Development)
This method runs the application directly using your local Python environment managed by Poetry. Make sure you have completed the "Installation Steps (for Local Development)".
- Activate Environment (Optional but Recommended):
  - Navigate to the project directory.
      poetry shell
- Run the Streamlit App:
  - If you activated the shell (`poetry shell`): `streamlit run main.py`
  - If you did not activate the shell, use `poetry run`: `poetry run streamlit run main.py`
- Access the App: Open your browser to the URL provided by Streamlit in the terminal output (usually `http://localhost:8501`).
Interacting with the App
- A browser window will open with the application.
- Use the sidebar to **upload a PDF** document.
- Adjust the **parameters** using the sliders and options in the sidebar to optimize the grid detection for your document.
- Observe the **intermediate and final results** displayed in the main panel. Expand sections like "Show Preprocessing Results", "Show Morphological Results", etc., to understand how the parameters affect each stage.
- The final image shows the detected bounding boxes (Green = Regular, Orange = Potentially Merged). Discarded boxes (due to aspect ratio) can be viewed in an expandable section.
Pipeline Explained
This section details the main image processing steps the application takes to find the table grid.
1. Get the Image Ready
- Load from PDF: Converts the first page of the uploaded PDF into a high-quality digital image. High resolution is used to retain detail, especially for faint lines.
- (Optional) Straighten Up: If the "Deskewing" option is enabled, the code attempts to detect and correct any overall tilt in the scanned page, aligning the grid more closely to horizontal and vertical axes.
- Improve Contrast & Go Black/White: Applies Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance local contrast, making lines clearer across different brightness levels. Then, uses Adaptive Thresholding to convert the image to black and white, carefully trying to make grid lines white and background/text black.
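The black/white conversion can be illustrated with a small pure-Python sketch of inverted adaptive mean thresholding. This is not the application's code (which presumably relies on OpenCV); the function name `adaptive_threshold_inv`, the border handling, and the toy image are assumptions made for illustration. In this toy version every sufficiently dark pixel turns white, whether it belongs to a grid line or a text stroke; separating the two is left to the later steps.

```python
def adaptive_threshold_inv(gray, block_size, c):
    """Inverted adaptive mean threshold on a list-of-lists grayscale image.

    A pixel becomes 255 (white) when it is darker than the mean of its
    block_size x block_size neighbourhood minus c, otherwise 0 (black).
    Neighbourhoods are clipped at the image border (a simplification).
    """
    if block_size % 2 == 0:
        raise ValueError("block_size must be odd")
    h, w = len(gray), len(gray[0])
    r = block_size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            window = [gray[j][i] for j in ys for i in xs]
            mean = sum(window) / len(window)
            if gray[y][x] < mean - c:
                out[y][x] = 255
    return out


# Toy 5x5 page: bright paper (200) with one dark horizontal line (50) in row 2.
page = [[200] * 5 for _ in range(5)]
page[2] = [50] * 5
binary = adaptive_threshold_inv(page, block_size=3, c=10)
```

Here only row 2 of `binary` ends up white. Raising `c` makes the test stricter (fewer white pixels), while lowering it lets fainter marks through.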
2. Find Potential Grid Lines (Separating from Text)
- Look for Long Horizontal Lines: Uses Morphological Opening with a wide, short kernel. This acts like a horizontal "eraser" that removes short vertical elements (like text) but preserves long horizontal lines.
- Look for Long Vertical Lines: Uses Morphological Opening with a tall, thin kernel. This acts like a vertical "eraser" removing short horizontal elements while preserving long vertical lines.
- Fix Broken Lines (Bridging): Applies Morphological Closing with a small kernel. This helps to connect small gaps or breaks that might exist in faint or damaged grid lines, improving their continuity.
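For a binary image, opening with a 1 x k horizontal kernel keeps exactly the horizontal runs of white pixels that are at least k long, and closing with a small horizontal kernel fills short gaps between runs. The sketch below shows both ideas on a single image row in pure Python. It is an illustration under stated assumptions, not the application's implementation: the helper names `open_row` and `bridge_row` are hypothetical, and gaps touching the image border are left untouched, which may differ from a library implementation.

```python
def runs(row, value):
    """Yield (start, end) pairs (end exclusive) of consecutive `value` pixels."""
    start = None
    for i, v in enumerate(row):
        if v == value and start is None:
            start = i
        elif v != value and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(row)


def open_row(row, kernel_len):
    """Horizontal opening: keep only runs of 1s at least kernel_len long."""
    out = [0] * len(row)
    for s, e in runs(row, 1):
        if e - s >= kernel_len:
            out[s:e] = [1] * (e - s)
    return out


def bridge_row(row, kernel_len):
    """Horizontal closing: fill interior gaps of 0s shorter than kernel_len."""
    out = list(row)
    for s, e in runs(row, 0):
        interior = s > 0 and e < len(row)
        if interior and e - s < kernel_len:
            out[s:e] = [1] * (e - s)
    return out


# Two short text strokes (lengths 2 and 1) around a grid line broken by a 2-pixel gap.
row = [1, 1, 0, 0] + [1] * 8 + [0, 0] + [1] * 7 + [0, 0, 1, 0]
opened = open_row(row, kernel_len=5)       # text strokes removed
bridged = bridge_row(opened, kernel_len=3)  # 2-pixel gap closed
```

After both steps the row contains a single unbroken run of 17 pixels. Vertical lines can be handled the same way by applying the functions to image columns.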
3. Clean Up the Detected Lines
- Filter by Shape & Length: Examines the connected components (potential line segments) remaining after the morphological steps. It removes components that are too short relative to the page dimensions or components that are too thick to be typical grid lines.
- Check for Solid Lines (Continuity): Filters the remaining line candidates based on how "solid" they are. It ensures a significant portion of the line's bounding box actually contains line pixels, removing dashed lines or sparse artifacts.
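The two filters above can be sketched as simple checks on each connected component's bounding box and pixel count. The `Component` class, the function `keep_horizontal`, and the threshold values are hypothetical and chosen only for illustration; the exact formulas used by the application may differ.

```python
from dataclasses import dataclass


@dataclass
class Component:
    x: int
    y: int
    w: int     # bounding-box width in pixels
    h: int     # bounding-box height in pixels
    area: int  # number of white pixels inside the bounding box


def keep_horizontal(c, page_width, min_len_frac, max_thick_ratio, min_continuity):
    """Return True if the component looks like a horizontal grid line."""
    long_enough = c.w >= min_len_frac * page_width
    thin_enough = c.h / c.w <= max_thick_ratio
    solid_enough = c.area / (c.w * c.h) >= min_continuity
    return long_enough and thin_enough and solid_enough


candidates = [
    Component(x=50, y=100, w=900, h=4, area=3400),    # solid grid line
    Component(x=60, y=300, w=40, h=6, area=200),      # short text stroke
    Component(x=50, y=500, w=800, h=60, area=30000),  # thick blob
    Component(x=50, y=700, w=900, h=4, area=1200),    # sparse, dashed artifact
]
kept = [
    c for c in candidates
    if keep_horizontal(c, page_width=1000, min_len_frac=0.1,
                       max_thick_ratio=0.05, min_continuity=0.6)
]
```

Only the first candidate survives: the second is too short, the third too thick, and the fourth covers only a third of its bounding box.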
4. Pinpoint the Grid Corners
- Find Where Lines Cross: Takes the cleaned-up horizontal line mask and the cleaned-up vertical line mask and performs a logical AND operation. The resulting white pixels represent the locations where horizontal and vertical lines intersect – the corners of the grid cells.
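A minimal sketch of the intersection step, using nested lists as binary masks. The function name `intersect_masks` and the toy masks are assumptions for illustration.

```python
def intersect_masks(h_mask, v_mask):
    """Return (x, y) positions where both binary masks are set."""
    points = []
    for y, (h_row, v_row) in enumerate(zip(h_mask, v_mask)):
        for x, (a, b) in enumerate(zip(h_row, v_row)):
            if a and b:
                points.append((x, y))
    return points


# 5x5 toy masks: horizontal lines on rows 1 and 3, vertical lines on columns 0 and 4.
h_mask = [[1 if y in (1, 3) else 0 for x in range(5)] for y in range(5)]
v_mask = [[1 if x in (0, 4) else 0 for x in range(5)] for y in range(5)]
corners = intersect_masks(h_mask, v_mask)
```

On a real scan each crossing yields a small blob of pixels rather than a single point, which is why the raw intersections are clustered afterwards.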
5. Figure Out the Grid Structure
- Group Nearby Corners (Clustering): Due to scanning imperfections or warping, intersection points belonging to the same logical grid line might not be perfectly aligned. This step groups nearby intersection points along the X and Y axes, identifying clusters that likely represent single grid lines.
- Refine Line Positions: Calculates a precise, robust central coordinate for each cluster (representing the final grid line position). It uses the Median of the points in the cluster (default, good against outliers) or optionally RANSAC (fits a line model, potentially better for noise/curves).
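The clustering and median refinement can be sketched in one dimension as follows. This is an illustrative assumption about how such grouping might work, not the application's code: the functions `cluster_1d` and `refine` are hypothetical, and the threshold is given directly in pixels (`max_gap`) rather than derived from the Cluster Threshold Factor.

```python
from statistics import median


def cluster_1d(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def refine(values, max_gap):
    """Return one representative coordinate (the median) per cluster."""
    return [median(c) for c in cluster_1d(values, max_gap)]


# X coordinates of raw intersections; 107 is an outlier near the first grid line.
xs = [99, 100, 101, 100, 107, 298, 300, 302, 499, 501]
grid_x = refine(xs, max_gap=10)
```

The first cluster's median stays at 100 despite the outlier at 107, whereas its mean would be pulled to 101.4. The same functions can be applied to the Y coordinates to obtain the horizontal grid lines.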
6. Draw the Table Cells
- Create Bounding Boxes: Uses the refined X (vertical) and Y (horizontal) grid line coordinates. For every adjacent pair of X coordinates and adjacent pair of Y coordinates, it defines a rectangular bounding box representing a table cell.
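Pairing adjacent coordinates can be sketched in a few lines. The function name `make_cells` and the `(x0, y0, x1, y1)` box layout are assumptions for illustration.

```python
def make_cells(xs, ys):
    """Return (x0, y0, x1, y1) boxes for adjacent grid-line pairs, row by row."""
    xs, ys = sorted(xs), sorted(ys)
    return [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]


# Three vertical and three horizontal grid lines give a 2 x 2 table.
cells = make_cells([100, 300, 500], [50, 120, 190])
```

With n vertical and m horizontal grid lines this produces (n - 1) x (m - 1) boxes.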
7. Final Touches & Display
- Filter Weird Cells: Applies an aspect ratio filter to the generated bounding boxes. It removes cells that are extremely tall/skinny or short/wide, as these are unlikely to be valid table cells.
- Mark Big Gaps (Merged Cells): Compares the distance between adjacent refined grid lines to the median distance. If a gap is significantly larger (controlled by the `Merge Factor`), the cells spanning this gap are marked visually (e.g., orange) to indicate potential merged rows or columns in the original table.
- Show Results: The Streamlit application displays the final processed image with the valid bounding boxes overlaid, along with options to view various intermediate processing steps.
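The aspect ratio filter and the merged-gap check can be sketched as below. Both functions and all threshold values are hypothetical; in particular, comparing each gap against `merge_factor` times the median gap is an assumption based on the description above.

```python
from statistics import median


def merged_gap_flags(coords, merge_factor):
    """Flag gaps between adjacent grid lines that exceed merge_factor x median gap."""
    coords = sorted(coords)
    gaps = [b - a for a, b in zip(coords, coords[1:])]
    typical = median(gaps)
    return [g > merge_factor * typical for g in gaps]


def aspect_ok(box, min_ratio, max_ratio):
    """Keep a box only if its width/height ratio lies within the allowed range."""
    x0, y0, x1, y1 = box
    ratio = (x1 - x0) / (y1 - y0)
    return min_ratio <= ratio <= max_ratio


# Horizontal grid lines with one unusually tall row between y=80 and y=200.
flags = merged_gap_flags([0, 40, 80, 200, 240], merge_factor=1.8)

normal_cell = aspect_ok((0, 0, 200, 40), min_ratio=0.2, max_ratio=10)  # ratio 5
sliver_cell = aspect_ok((0, 0, 2000, 4), min_ratio=0.2, max_ratio=10)  # ratio 500
```

The gaps are 40, 40, 120 and 40 pixels, so only the third exceeds 1.8 times the median gap of 40 and is flagged as potentially merged.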
Parameter Guide
The Streamlit sidebar provides sliders and options to adjust how the pipeline processes the image. Tuning these can significantly improve results on different documents.
1. PDF Loading & Initial Setup
- `PDF Zoom Factor`:
  - What it does: Controls image resolution when converting from PDF.
  - Effect: Higher values = clearer image, better for faint lines, but slower & more memory. Lower values = faster, less detail.
  - Why adjust: Increase for blurry/broken lines; decrease for speed if quality is sufficient.
- `Apply Global Deskewing` (Checkbox):
  - What it does: Toggles automatic page rotation to straighten the grid.
  - Effect: If checked, attempts to level a tilted scan.
  - Why adjust: Enable if the document was scanned crookedly.
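To get a feel for the zoom factor's cost, the arithmetic below estimates the rendered image size. PDF page sizes are expressed in points (1/72 inch), so a zoom of z corresponds to roughly 72 x z DPI. The sketch assumes the zoom is applied as a uniform scale (as with PyMuPDF's `fitz.Matrix(zoom, zoom)`); the function name `rendered_size` is hypothetical, and actual pixel dimensions may differ by a pixel due to rounding.

```python
PDF_POINTS_PER_INCH = 72


def rendered_size(width_pt, height_pt, zoom):
    """Approximate pixel size and DPI of a page rendered at a given zoom."""
    width_px = round(width_pt * zoom)
    height_px = round(height_pt * zoom)
    dpi = PDF_POINTS_PER_INCH * zoom
    return width_px, height_px, dpi


# An A4 page is about 595 x 842 points.
a4_at_zoom_3 = rendered_size(595, 842, 3)

# Doubling the zoom quadruples the pixel count (and thus memory and processing time).
w2, h2, _ = rendered_size(595, 842, 2)
w4, h4, _ = rendered_size(595, 842, 4)
pixel_ratio = (w4 * h4) / (w2 * h2)
```

At zoom 3 an A4 page comes out at about 1785 x 2526 pixels (216 DPI), which is roughly 13.5 MB as an uncompressed 8-bit RGB image.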
2. Preprocessing & Thresholding (Making Lines Stand Out)
- `CLAHE Clip Limit`:
  - What it does: Controls local contrast enhancement.
  - Effect: Higher values increase local contrast more, potentially amplifying noise.
  - Why adjust: Increase slightly for faint lines in dark/light areas; decrease if output looks noisy.
- `Adaptive Thresh Block Size`:
  - What it does: Sets the neighbourhood size for black/white conversion. (Must be odd).
  - Effect: Smaller = more local detail, noise sensitive. Larger = smoother, might miss fine details.
  - Why adjust: Tune with `Constant C` if line/background separation is poor.
- `Adaptive Thresh Constant (C)`:
  - What it does: Fine-tunes the black/white threshold decision.
  - Effect: Lower = more white pixels (lines/noise). Higher = fewer white pixels (less noise, might miss faint lines).
  - Why adjust: Key parameter. Decrease to find faint lines; increase if too much noise/text becomes white.
3. Morphological Operations (Finding Long Lines, Fixing Breaks)
- `Horizontal Kernel Length` / `Vertical Kernel Length`:
  - What it does: Length of the "eraser" for removing short elements.
  - Effect: Larger = keeps only very long lines (better vs text, worse vs wavy lines). Smaller = keeps shorter segments (might include text).
  - Why adjust: Increase if text strokes survive; decrease if parts of wavy grid lines are erased.
- `Hor/Ver Bridging Kernel Size`:
  - What it does: Size of the "connector" for fixing broken lines.
  - Effect: Larger = bridges bigger gaps (can wrongly connect to text). Smaller = bridges only small gaps.
  - Why adjust: Increase slightly for lines with small gaps; use cautiously.
- `Bridging Iterations`:
  - What it does: How many times bridging is applied.
  - Effect: More iterations = bridge larger gaps via multiple small jumps.
  - Why adjust: Increase if lines remain broken after one bridging pass (often better than one large kernel).
4. Line Filtering (Cleaning Up Detected Lines)
- `Min Line Length Fraction`:
  - What it does: Minimum required line length relative to page size.
  - Effect: Higher = stricter, removes shorter noise. Lower = less strict.
  - Why adjust: Increase if short noisy segments remain; decrease if valid short segments (e.g., near edges) are removed.
- `Max Line Thickness Ratio`:
  - What it does: Maximum allowed line thickness relative to length.
  - Effect: Lower = requires very thin lines. Higher = allows thicker components.
  - Why adjust: Lower if thick text strokes are mistaken for grid lines.
- `Min Line Continuity Ratio`:
  - What it does: Required "solidness" of a line segment (pixel coverage).
  - Effect: Higher = requires solid lines. Lower = allows dashed/broken lines.
  - Why adjust: Increase if dashed artifacts survive; decrease if valid faint/broken lines are filtered out.
5. Intersection Clustering & Refinement (Finding Grid Coordinates)
- `Cluster Refinement Method`:
  - What it does: Algorithm (Median or RANSAC) to find the precise grid line position from scattered intersection points.
  - Effect: Median is robust to outliers. RANSAC attempts line fitting, potentially better for noise/curves but experimental.
  - Why adjust: Try RANSAC if Median struggles with alignment on noisy/wavy grids.
- `Cluster Threshold Factor`:
  - What it does: Controls how close points must be (relative to typical spacing) to be grouped into one line cluster.
  - Effect: Lower = requires tight grouping. Higher = allows looser grouping (better for waves, risk of merging distinct lines).
  - Why adjust: Increase if wavy lines are split; decrease if close lines are merged.
6. Cell Definition & Filtering (Drawing and Cleaning Cells)
- `Merged Cell Factor`:
  - What it does: Threshold for flagging gaps between lines as "merged" (relative to median gap size).
  - Effect: Lower = flags smaller deviations as merged. Higher = requires a very large gap.
  - Why adjust: Tune based on visual inspection of merged cells in your documents and the orange highlighting.
- `Min/Max Cell Aspect Ratio (W/H)`:
  - What it does: Allowed range for cell width/height ratio.
  - Effect: Discards cells outside this shape range.
  - Why adjust: Widen range if valid but unusually shaped cells are discarded; narrow if artifacts are kept.
Current Status & Limitations
- The pipeline provides a robust baseline for grid extraction on challenging documents.
- Performance heavily depends on parameter tuning; optimal values may vary between different document types or scan qualities.
- The RANSAC refinement method is marked as experimental.
- Handling extremely severe, non-linear warping might require more advanced unwarping techniques beyond simple deskewing.