# iqra-ml

This repository contains the ML code for Iqra.
## Prepare datasets
- Clone PaddleOCR inside `src/IqraData`.
- Go to paddlepaddle.org and install the specific `paddlepaddle` build for your setup (CPU, CUDA, etc.). Installing the CPU version is straightforward, but CUDA needs extra work; see "Important notes to set a local conda environment for a different CUDA version" below.
- Install the requirements for PaddleOCR using `pip install -r src/IqraData/PaddleOCR/requirements.txt`. Also install `scikit-learn`.
- If you want to prepare the dataset for PaddleOCR finetuning, look at the notebook example in `src/data_preparation_example.ipynb`.
- TODO: write guides for finetuning PaddleOCR.
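The steps above can be sketched as a shell session. The PaddleOCR repository URL is an assumption (the official GitHub repo), and the plain `pip install paddlepaddle` is the CPU wheel; for CUDA, use the exact command paddlepaddle.org generates for your setup instead:

```shell
# Clone PaddleOCR into src/IqraData (official repo URL, assumed here)
git clone https://github.com/PaddlePaddle/PaddleOCR.git src/IqraData/PaddleOCR

# Install paddlepaddle -- CPU build shown; for CUDA, copy the command
# that paddlepaddle.org generates for your CUDA version instead
pip install paddlepaddle

# Install PaddleOCR requirements plus scikit-learn
pip install -r src/IqraData/PaddleOCR/requirements.txt
pip install scikit-learn
```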
## Adding a custom dataset to support a new labeling tool
The code currently supports data labeled using Label Studio and exported in the JSON-MIN format. If the data is labeled using another tool, you need to extend the `OCRDataset` class inside `src/IqraData/ocr_datasets`. It is recommended to have a look at `OCRDatasetLabelStudio` while writing your subclass.
- Extend the `build` method of the class, which takes as an argument a list of directories where the raw data is located.
- Inside `build`, for each labeled file, create a `LabeledFile` object, which has the following arguments:
  - `image_path` (str): the path to the image file.
  - `bboxes` (List[List[Tuple]]): the list of bounding-box coordinates for the annotated objects in the image. Each item corresponds to one annotated object, represented as a list of tuples for the four corners of the bbox, starting from the top-left corner: (top_left, top_right, bottom_right, bottom_left).
  - `transcriptions` (List[str]): the list of transcriptions of the annotated objects in the image.
  - `labels` (List[str]): the list of labels of the annotated objects in the image.
  - `annotator` (Optional[str], optional): the name of the person who annotated the image; defaults to None.
  - `company` (Optional[str], optional): the name of the company that owns the image; defaults to None.

  Note: `transcriptions`, `bboxes`, and `labels` must be lists of the same length, ordered correspondingly.

  Note: if a bbox does not have a label, pass an empty string as the corresponding label.
- Append the `LabeledFile` created for each file to `self.files`.
- Add useful logger statements as necessary using `self.logger`.
Following these steps is highly recommended, since it allows you to easily visualize the dataset, merge annotations if needed, and directly build the data for PaddleOCR finetuning as in the notebook example, without much extra work.
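The steps above can be sketched as follows. The `OCRDataset` and `LabeledFile` definitions below are minimal stand-ins for the real classes in `src/IqraData/ocr_datasets`, included only so the sketch is self-contained; `OCRDatasetMyTool` and its JSON layout (`objects`, `corners`, `text`, `label` keys) are entirely hypothetical and stand for whatever your labeling tool exports:

```python
import json
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple

# Minimal stand-ins for the real classes in src/IqraData/ocr_datasets,
# included here only to make the sketch self-contained.
@dataclass
class LabeledFile:
    image_path: str
    bboxes: List[List[Tuple[float, float]]]
    transcriptions: List[str]
    labels: List[str]
    annotator: Optional[str] = None
    company: Optional[str] = None

class OCRDataset:
    def __init__(self):
        self.files: List[LabeledFile] = []
        self.logger = logging.getLogger(self.__class__.__name__)

    def build(self, data_dirs: List[str]) -> None:
        raise NotImplementedError

class OCRDatasetMyTool(OCRDataset):
    """Example subclass for a hypothetical tool exporting one JSON file per image."""

    def build(self, data_dirs: List[str]) -> None:
        for data_dir in data_dirs:
            for ann_path in Path(data_dir).glob("*.json"):
                record = json.loads(ann_path.read_text())
                bboxes, transcriptions, labels = [], [], []
                for obj in record["objects"]:
                    # Four corners, starting from the top-left:
                    # (top_left, top_right, bottom_right, bottom_left)
                    bboxes.append([tuple(pt) for pt in obj["corners"]])
                    transcriptions.append(obj["text"])
                    # A bbox without a label gets an empty string.
                    labels.append(obj.get("label", ""))
                # transcriptions, bboxes, and labels are built in lockstep,
                # so they stay the same length and correspondingly ordered.
                self.files.append(
                    LabeledFile(
                        image_path=record["image_path"],
                        bboxes=bboxes,
                        transcriptions=transcriptions,
                        labels=labels,
                        annotator=record.get("annotator"),
                    )
                )
                self.logger.info("Loaded %s (%d boxes)", ann_path.name, len(bboxes))
```

In the real code you would import `OCRDataset` and `LabeledFile` instead of redefining them; only the subclass and its `build` method are yours to write.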
## Important notes to set a local conda environment for a different CUDA version
As of this writing, paddlepaddle supports CUDA versions up to 11.7. If you have a different version installed system-wide, you need to set up a conda environment that includes a compatible version.

Note: you need to install the appropriate CUDA toolkit version using conda.
- Create a `conda` environment using the compatible versions required by the link given above: `conda create --prefix paddle_env python==3.10 cudatoolkit=<CUDA_TOOLKIT_VERSION> cudnn=<CUDNN_VERSION>`. The necessary versions are found here. `--prefix` makes the environment files local. The paddlepaddle build I worked with runs with `cudatoolkit=11.7` and `cudnn=8.4.1`.
- Next, install the appropriate `nvcc` version from here.
- Create a kernel for this environment so that you can run notebooks in it: `python -m ipykernel install --user --name=paddle_kernel`. (This is needed if you are working on the Rihal server, where notebooks must run inside this environment.)
The remaining challenge is setting the environment variable paths to CUDA: surprisingly, even if you set the paths from the terminal inside the environment, the change is not reflected in a notebook running in the same environment. To fix this, add the following line to the created kernel's spec at `/home/jovyan/.local/share/jupyter/kernels/paddle_kernel/kernel.json`: `"env": {"LD_LIBRARY_PATH": "/home/jovyan/work/IqraOCR/paddle_env/lib", "PATH": "/home/jovyan/work/IqraOCR/paddle_env/bin:${PATH}"}`. Notice that the paths point inside the local conda environment, `paddle_env`. Then restart or shut down the kernel (I needed to do this several times before the change showed up in `os.getenv("LD_LIBRARY_PATH")`). After that, paddle could finally recognize the CUDA 11.7 files.
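For reference, the resulting `kernel.json` might look like the sketch below. Only the `env` entry comes from the steps above; the `argv`, `display_name`, and `language` fields shown are typical ipykernel defaults and may differ on your machine:

```json
{
  "argv": [
    "/home/jovyan/work/IqraOCR/paddle_env/bin/python",
    "-m", "ipykernel_launcher",
    "-f", "{connection_file}"
  ],
  "display_name": "paddle_kernel",
  "language": "python",
  "env": {
    "LD_LIBRARY_PATH": "/home/jovyan/work/IqraOCR/paddle_env/lib",
    "PATH": "/home/jovyan/work/IqraOCR/paddle_env/bin:${PATH}"
  }
}
```

After restarting the kernel, confirm the variables are visible by running `os.getenv("LD_LIBRARY_PATH")` from inside a notebook cell.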
## Project Organization
The general folder structure tries to follow the popular cookiecutter template for organizing machine learning projects. I recommend that you try to follow this structure as much as you can.
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
└── tox.ini <- tox file with settings for running tox; see tox.readthedocs.io
Project based on the cookiecutter data science project template. #cookiecutterdatascience