Last updated: 4/16/2026

iqra-ml

This contains the ML code for Iqra.

Prepare datasets

  1. Clone PaddleOCR inside `src/IqraData`.
  2. Go to paddlepaddle.org and install the PaddlePaddle build that matches your setup (CPU, CUDA, etc.). Installing the CPU version is straightforward, but CUDA needs extra work; see "Important notes to set up a local conda environment for a different CUDA version" below.
  3. Install the PaddleOCR requirements with `pip install -r src/IqraData/PaddleOCR/requirements.txt`, then install scikit-learn.
  4. To prepare the dataset for PaddleOCR finetuning, see the notebook example in `src/data_preparation_example.ipynb`.
  5. TODO: write guides for finetuning PaddleOCR.

Adding a custom dataset to support a new labeling tool

The code currently supports data labeled with Label Studio and exported in JSON-MIN format. If your data is labeled with another tool, you need to extend the OCRDataset class inside src/IqraData/ocr_datasets.

It is recommended to have a look at OCRDatasetLabelStudio while writing your subclass.

  • Override the build method of the class; it takes a list of directories containing the raw data as its argument.

  • Inside build, create a LabeledFile object for each labeled file. LabeledFile takes the following arguments:

    • image_path (str): the path to the image file.
    • bboxes (List[List[Tuple]]): the list of bounding-box coordinates for the annotated objects in the image. Each item is the bounding box of one annotated object, represented as a list of tuples for the four corners of the bbox, starting from the top-left corner: (top_left, top_right, bottom_right, bottom_left).
    • transcriptions (List[str]): the list of transcriptions of the annotated objects in the image.
    • labels (List[str]): the list of labels of the annotated objects in the image.
    • annotator (Optional[str], optional): the name of the person who annotated the image, by default None.
    • company (Optional[str], optional): the name of the company that owns the image, by default None.

    Note: transcriptions, bboxes, and labels must be lists of the same length, ordered correspondingly.
    Note: If a bbox does not have a label, pass an empty string as the corresponding label.

  • The new LabeledFile created for each file should be appended to self.files.

  • Add useful logger statements as necessary using self.logger.

It is highly recommended to follow these steps: they let you easily visualize the dataset, merge annotations if needed, and directly build the data for PaddleOCR finetuning as in the notebook example, without much extra work.
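As a sketch, the steps above might look like the following. The LabeledFile fields match the documentation above, but the class name OCRDatasetMyTool, the raw JSON layout, and the glob pattern are hypothetical stand-ins for whatever your labeling tool actually exports:

```python
import json
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple


@dataclass
class LabeledFile:
    """Fields as documented above; redefined here only so the sketch is self-contained."""
    image_path: str
    bboxes: List[List[Tuple[float, float]]]
    transcriptions: List[str]
    labels: List[str]
    annotator: Optional[str] = None
    company: Optional[str] = None


class OCRDatasetMyTool:
    """Hypothetical stand-in for a subclass of OCRDataset in src/IqraData/ocr_datasets."""

    def __init__(self):
        self.files: List[LabeledFile] = []
        self.logger = logging.getLogger(self.__class__.__name__)

    def build(self, data_dirs: List[str]) -> None:
        for data_dir in data_dirs:
            for ann_path in Path(data_dir).glob("*.json"):
                record = json.loads(ann_path.read_text())
                # Hypothetical raw layout; replace with your tool's actual export format:
                # {"image": "...", "objects": [{"box": [[x, y] * 4], "text": "...", "label": "..."}]}
                objects = record.get("objects", [])
                bboxes = [[tuple(pt) for pt in obj["box"]] for obj in objects]
                transcriptions = [obj.get("text", "") for obj in objects]
                # Unlabeled boxes get an empty string, as required.
                labels = [obj.get("label", "") for obj in objects]
                assert len(bboxes) == len(transcriptions) == len(labels)
                self.files.append(
                    LabeledFile(
                        image_path=record["image"],
                        bboxes=bboxes,
                        transcriptions=transcriptions,
                        labels=labels,
                    )
                )
                self.logger.info("Loaded %d objects from %s", len(objects), ann_path)
```

After build runs, self.files holds one LabeledFile per annotation file, ready for visualization, merging, or PaddleOCR export.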


Important notes to set up a local conda environment for a different CUDA version

At the time of writing, PaddlePaddle supports CUDA versions only up to 11.7. If your system has a different version, you need to set up a conda environment that provides a compatible one. Note: you need to install the appropriate CUDA version using conda.

  • Create a conda environment with the compatible versions required by the link above:

    conda create --prefix paddle_env python=3.10 cudatoolkit=<CUDA_TOOLKIT_VERSION> cudnn=<CUDNN_VERSION>
    

    The necessary versions are found here.

    --prefix makes the environment files local to the project. The PaddlePaddle build I worked with runs with cudatoolkit=11.7 and cudnn=8.4.1.

  • Next, install the appropriate nvcc version from here.

  • I needed to run Jupyter notebooks inside this environment, so I created a kernel for it with `python -m ipykernel install --user --name=paddle_kernel`. (This is needed if you are working on the Rihal server.)

  • The next challenge was setting the environment variable paths to CUDA: surprisingly, even if you set the paths from the terminal inside the environment, the change is not reflected in a notebook running in the same env.

    To fix this, add the following entry to /home/jovyan/.local/share/jupyter/kernels/paddle_kernel/kernel.json (the created kernel spec):

        "env": {"LD_LIBRARY_PATH":"/home/jovyan/work/IqraOCR/paddle_env/lib","PATH":"/home/jovyan/work/IqraOCR/paddle_env/bin:${PATH}"},

    Notice that the paths point inside the local conda environment, paddle_env.

    Then restart or shut down the kernel (I had to do this several times before the change showed up in os.getenv("LD_LIBRARY_PATH")). After that, paddle finally recognized the CUDA 11.7 files.
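The kernel.json edit above can also be scripted. This is a sketch; the helper name set_kernel_cuda_env is made up, and the commented-out paths are the ones from the example above, so adjust them to your own prefix and kernel name:

```python
import json
from pathlib import Path


def set_kernel_cuda_env(kernel_json_path: str, env_prefix: str) -> dict:
    """Point a Jupyter kernel spec's env at a local conda environment's lib/ and bin/ dirs."""
    path = Path(kernel_json_path)
    spec = json.loads(path.read_text())
    spec["env"] = {
        "LD_LIBRARY_PATH": f"{env_prefix}/lib",
        # ${PATH} is left literal here, to be expanded when the kernel starts.
        "PATH": f"{env_prefix}/bin:${{PATH}}",
    }
    path.write_text(json.dumps(spec, indent=2))
    return spec


# Paths from the example above; adjust to your setup:
# set_kernel_cuda_env(
#     "/home/jovyan/.local/share/jupyter/kernels/paddle_kernel/kernel.json",
#     "/home/jovyan/work/IqraOCR/paddle_env",
# )
```

This keeps the rest of the kernel spec (argv, display_name, etc.) untouched and only rewrites the "env" block.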

Project Organization

The general folder structure follows the popular cookiecutter template for organizing machine learning projects. I recommend following this structure as much as you can.


├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience