tcis-v2 / README.md

Text Classification

Last updated: 4/16/2026

Overview

Problem Statement

Organisations generate gigabytes of structured and unstructured data, stored in different locations and formats, during their ongoing operations. As a result, it becomes challenging to find the right content, in the correct version, to support business decision-making.

TCIS Workflow

Workflow diagram: the happy path.

Crawler Server

Please refer to the ./airflow/docker-compose.yaml file for the full definition.

Airflow is used to orchestrate the crawler and ML workflow.

  • RabbitMQ is used as the queueing system for Airflow.
  • Postgres is used as the database for storing Airflow-related information, e.g., DAG definitions, task instances, scheduler state, logs, etc.

The crawler server contains the Airflow setup, which includes the following DAGs:

  • crawler_DAG - Fetching files from different data sources, e.g., SharePoint, Network drives, Livelink, etc.

    • Contains one task which is responsible for checking if the file extension is valid.
    • If the extension is valid, the file's path gets passed to the metadata_DAG.
  • metadata_DAG - Extracts basic information from files, e.g., extension, size, author, thumbnail, etc.

    • Contains one task which is responsible for extracting the basic metadata.
    • The extracted metadata gets inserted into Elasticsearch:
      • A POST request is sent to the /crawler/file-uuid endpoint.
    • The file then gets stored in the files MinIO bucket.
    • The MinIO file path then gets passed to the ml_DAG.
  • ml_DAG - Runs the ML processes, e.g., OCR, keywords extraction, classification, etc.

    • Contains multiple tasks.
      • download_file_task:
        • Receives the MinIO file path and passes the file path to the ocr_task.
      • ocr_task:
        • Receives the file path and runs the OCR processes.
        • Stores the text in the ocred MinIO bucket.
        • Passes the text file's MinIO path to the classification_task, security_classification_task, unsupervised_keywords_extraction_task, and special_keywords_extraction_task.
      • classification_task, security_classification_task, unsupervised_keywords_extraction_task, special_keywords_extraction_task
        • These tasks run concurrently in separate processes.
        • Each is responsible for the named ML task.
        • Once a process finishes execution, its output gets inserted into Elasticsearch by making a POST request to the /crawler/file-uuid endpoint.
      • delete_file_from_minio_task
        • Runs after the last ML task finishes and deletes the file from MinIO.
  • filewatcher_DAG - Runs the filewatcher to watch and update any changes made to the files.

    • Contains one task which triggers the filewatcher.
    • The filewatcher handles the on_created, on_deleted, on_modified, and on_moved events.
    • See ./airflow/dags/watcher/event_handler.py.
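
The ml_DAG's fan-out/fan-in flow described above can be sketched in plain Python. This is an illustration only, not the actual DAG code: the real tasks are Airflow operators, and every function name below is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real Airflow tasks.
def download_file(minio_path: str) -> str:
    # download_file_task: would fetch the file from the files bucket
    return minio_path

def ocr(file_path: str) -> str:
    # ocr_task: would run OCR and store the text in the ocred bucket
    return f"{file_path}.txt"

def classification(text_path):
    return {"task": "classification", "input": text_path}

def security_classification(text_path):
    return {"task": "security_classification", "input": text_path}

def unsupervised_keywords_extraction(text_path):
    return {"task": "unsupervised_keywords_extraction", "input": text_path}

def special_keywords_extraction(text_path):
    return {"task": "special_keywords_extraction", "input": text_path}

def delete_file_from_minio(minio_path: str) -> None:
    # delete_file_from_minio_task: would delete the file once all ML tasks finish
    pass

def run_ml_pipeline(minio_path: str) -> list:
    file_path = download_file(minio_path)
    text_path = ocr(file_path)
    ml_tasks = [classification, security_classification,
                unsupervised_keywords_extraction, special_keywords_extraction]
    # The four ML tasks run concurrently, mirroring the parallel Airflow tasks.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda task: task(text_path), ml_tasks))
    delete_file_from_minio(minio_path)  # fan-in: runs after the last ML task
    return results
```

In the real pipeline, each of the four results would be POSTed to the /crawler/file-uuid endpoint rather than returned.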

Web App Server

Please refer to the ./docker-compose.yaml file for the full definition.

This server hosts:

  • The web service.
  • The server service.
  • The elasticsearch service.
  • The keycloak service.
  • The db service.
  • The kibana service (dev only).

The web service:

  • The frontend application implemented using TypeScript and React.

The server service:

  • The backend server implemented using Go and Fiber.

The elasticsearch service:

  • Used to store and search the files' content and metadata.

The keycloak service:

  • Used to store and authenticate users.

The db service:

  • Used to store basic user information, search history, bookmarks, etc.
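
As a sketch of how the server might query the elasticsearch service for file content and metadata, the snippet below builds an Elasticsearch _search request body. The index name (files) and field names (content, metadata.author, metadata.extension) are assumptions for illustration, not taken from the codebase.

```python
import json

def build_file_search(term: str, size: int = 10) -> dict:
    """Build an Elasticsearch _search request body that matches a term
    against file content or metadata. Index/field names are hypothetical."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["content", "metadata.author", "metadata.extension"],
            }
        },
    }

# The body would be sent as, e.g.:
#   POST http://localhost:9200/files/_search
body = build_file_search("quarterly report")
print(json.dumps(body, indent=2))
```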

Contribution

Prerequisites

Make sure you're running the commands from the root of the application.

Before contributing (or running) this project, make sure that you have done the following:

Increase the virtual memory map count for Elasticsearch:

$ sudo sysctl -w vm.max_map_count=262144

Download the dependencies:

$ make install-dep

Running the app

Start the Web App first:

$ docker compose up -d --build

Once the app is running, start the Crawler app (Airflow):

$ docker compose -f ./airflow/docker-compose.yaml up -d --build

To stop the containers:

$ docker compose down

Credentials

Web App

URL - http://localhost:3000

There are three user types:

  • User

    • username: tcis-user
    • password: 123
  • Admin

    • username: tcis-admin
    • password: 123
  • Top Admin

    • username: tcis-top-admin
    • password: 123

Airflow

URL - http://localhost:3001

  • User
    • username: airflow
    • password: airflow

MinIO

URL - http://localhost:9001

  • User
    • username: minio
    • password: minio123

Kibana

URL - http://localhost:5601

  • User
    • username: elastic
    • password: klghjrtjklhjkjgh63745#$%

Design

The design system can be found at the following links:

Interfaces:

https://projects.invisionapp.com/share/J612SVYB8EU4#/screens/467667265_Search_Home

Whiteboard:

https://projects.invisionapp.com/freehand/document/ddKkqCyJN

Common Issues

If any problem arises when running the project, the first thing you should do is check the logs:

$ docker compose logs -f <service-name>

Example:

$ docker compose logs -f server

Elasticsearch

Issue - the elasticsearch container is unhealthy, or its logs tell you to increase vm.max_map_count.

Solution - increase the virtual memory map count:

$ sudo sysctl -w vm.max_map_count=262144

To make the change persist across reboots, add vm.max_map_count=262144 to /etc/sysctl.conf.