# Text Classification
## Overview

### Problem Statement
Organisations generate gigabytes of structured and unstructured data during their ongoing operations, stored in different locations and formats. As a result, challenges arise when searching for the right content, with the correct version, to support business decision-making.
### TCIS Workflow

## Crawler Server
Please refer to the `./airflow/docker-compose.yaml` file for the full definition.

Airflow is used to orchestrate the crawler and ML workflow:

- RabbitMQ is used as the queueing system for Airflow.
- Postgres is used as the database for storing Airflow-related information, e.g., DAG definitions, task instances, scheduler state, logs, etc.
The crawler server contains the Airflow setup, which defines the following DAGs:
- `crawler_DAG` - Fetches files from different data sources, e.g., SharePoint, network drives, Livelink, etc.
  - Contains one task which is responsible for checking whether the file extension is valid.
  - If valid, the file's path is passed to the `metadata_DAG`.
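The extension check can be sketched as a simple allow-list lookup. This is a minimal illustration only: the helper name and the set of accepted extensions are assumptions, not taken from the project's code.

```python
from pathlib import Path

# Hypothetical allow-list; the real crawler_DAG may accept a different set.
VALID_EXTENSIONS = {".pdf", ".doc", ".docx", ".txt", ".pptx", ".xlsx"}

def has_valid_extension(file_path: str) -> bool:
    """Return True if the file's extension is in the allow-list."""
    return Path(file_path).suffix.lower() in VALID_EXTENSIONS

# Only paths that pass the check are handed on to the metadata_DAG.
print(has_valid_extension("reports/q3-summary.PDF"))
print(has_valid_extension("backup/image.iso"))
```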
- `metadata_DAG` - Extracts basic information from files, e.g., extension, size, author, thumbnail, etc.
  - Contains one task which is responsible for extracting the basic metadata.
  - The extracted metadata is inserted into Elasticsearch:
    - A `POST` request is sent to the `/crawler/file-uuid` endpoint.
  - The file is then stored in the `files` MinIO bucket.
  - The MinIO file path is then passed to the `ml_DAG`.
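A stdlib-only sketch of the kind of metadata the task gathers. The field names and helper are illustrative assumptions; the real task also extracts author and thumbnail, which require format-specific libraries and are omitted here.

```python
import os
from pathlib import Path

def extract_basic_metadata(file_path: str) -> dict:
    """Collect basic file metadata, roughly what metadata_DAG gathers."""
    stat = os.stat(file_path)
    return {
        "name": Path(file_path).name,
        "extension": Path(file_path).suffix.lower(),
        "size_bytes": stat.st_size,
        "modified_at": stat.st_mtime,
        # Author/thumbnail extraction is format-specific and omitted here.
    }
```

A dict like this is what would be sent in the `POST` request to the `/crawler/file-uuid` endpoint.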
- `ml_DAG` - Runs the ML processes, e.g., OCR, keyword extraction, classification, etc. Contains multiple tasks:
  - `download_file_task`:
    - Receives the MinIO file path and passes it to the `ocr_task`.
  - `ocr_task`:
    - Receives the file path and runs the OCR process.
    - Stores the extracted text in the `ocred` MinIO bucket.
    - Passes the text file's MinIO path to the `classification_task`, `security_classification_task`, `unsupervised_keywords_extraction_task`, and `special_keywords_extraction_task`.
  - `classification_task`, `security_classification_task`, `unsupervised_keywords_extraction_task`, `special_keywords_extraction_task`:
    - These tasks run concurrently in separate processes.
    - Each is responsible for its named ML task.
    - Once a process finishes execution, the output is inserted into Elasticsearch by sending a `POST` request to the `/crawler/file-uuid` endpoint.
  - `delete_file_from_minio_task`:
    - Runs after the last ML task finishes and deletes the file from MinIO.
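The fan-out/fan-in ordering of the `ml_DAG` can be sketched in plain Python standing in for the Airflow operators. The task bodies here are placeholders, and the sketch uses threads where the real DAG uses separate processes, purely to stay self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder task bodies; the real tasks run OCR, classifiers, etc.
def ocr_task(file_path: str) -> str:
    return f"text-of:{file_path}"

def classification_task(text_path: str) -> str:
    return "classified"

def security_classification_task(text_path: str) -> str:
    return "security-classified"

def unsupervised_keywords_extraction_task(text_path: str) -> str:
    return "keywords"

def special_keywords_extraction_task(text_path: str) -> str:
    return "special-keywords"

def run_ml_pipeline(minio_path: str) -> list[str]:
    # Download and OCR happen first, sequentially.
    text_path = ocr_task(minio_path)
    ml_tasks = [
        classification_task,
        security_classification_task,
        unsupervised_keywords_extraction_task,
        special_keywords_extraction_task,
    ]
    # The four ML tasks fan out concurrently (threads here; the real DAG
    # runs them in separate processes).
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda task: task(text_path), ml_tasks))
    # delete_file_from_minio_task would run only after all four finish.
    return results

print(run_ml_pipeline("files/doc-123.pdf"))
```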
- `filewatcher_DAG` - Runs the filewatcher to watch for and apply any changes made to the files.
  - Contains one task which triggers the filewatcher.
  - The filewatcher handles the `on_created`, `on_deleted`, `on_modified`, and `on_moved` events.
  - See `./airflow/dags/watcher/event_handler.py`.
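The general shape of such a handler can be sketched as below. This is a stdlib-only stand-in, not the contents of `event_handler.py`: the class name is hypothetical, and the `log` list stands in for the real index-update requests.

```python
class FileEventHandler:
    """Maps filesystem events to index updates (sketch only)."""

    def __init__(self) -> None:
        # Stands in for the real calls that update Elasticsearch.
        self.log: list[str] = []

    def on_created(self, path: str) -> None:
        self.log.append(f"index:{path}")

    def on_modified(self, path: str) -> None:
        self.log.append(f"reindex:{path}")

    def on_deleted(self, path: str) -> None:
        self.log.append(f"remove:{path}")

    def on_moved(self, src: str, dest: str) -> None:
        self.log.append(f"move:{src}->{dest}")

handler = FileEventHandler()
handler.on_created("docs/a.pdf")
handler.on_moved("docs/a.pdf", "archive/a.pdf")
```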
## Web App Server
Please refer to the `./docker-compose.yaml` file for the full definition.

This server hosts:

- The `web` service.
- The `server` service.
- The `elasticsearch` service.
- The `keycloak` service.
- The `db` service.
- The `kibana` service (dev only).
The `web` service:

- The frontend application, implemented using `TypeScript` and `React`.

The `server` service:

- The backend server, implemented using Go and Fiber.

The `elasticsearch` service:

- Used to store and search the files' content and metadata.

The `keycloak` service:

- Used to store and authenticate users.

The `db` service:

- Used to store basic user information, search history, bookmarks, etc.
Note:

- The `kibana` service (dev only):
  - Used to test Elasticsearch queries during development.
  - Visit http://localhost:5601/app/dev_tools#/console.
  - The credentials can be found under the Credentials section.
- The `nginx` service (prod only):
  - Used to serve the application.
## Contribution

### Prerequisites
Make sure you're running the commands from the root of the application.

Before contributing to (or running) this project, make sure that you have done the following:

1. Increase the virtual memory for Elasticsearch:

   ```shell
   $ sudo sysctl -w vm.max_map_count=262144
   ```

2. Download the dependencies:

   ```shell
   $ make install-dep
   ```
### Running the app

Start the Web App first:

```shell
$ docker compose up -d --build
```

Once the app is running, start the Crawler app (Airflow):

```shell
$ docker compose -f ./airflow/docker-compose.yaml up -d --build
```

To stop the containers:

```shell
$ docker compose down
```
## Credentials

### Web App

URL - http://localhost:3000

There are three user types:

- User
  - username: tcis-user
  - password: 123
- Admin
  - username: tcis-admin
  - password: 123
- Top Admin
  - username: tcis-top-admin
  - password: 123
### Airflow

URL - http://localhost:3001

- User
  - username: airflow
  - password: airflow

### MinIO

URL - http://localhost:9001

- User
  - username: minio
  - password: minio123

### Kibana

URL - http://localhost:5601

- User
  - username: elastic
  - password: klghjrtjklhjkjgh63745#$%
## Design

The design system can be found here:

- Interfaces: https://projects.invisionapp.com/share/J612SVYB8EU4#/screens/467667265_Search_Home
- Whiteboard: https://projects.invisionapp.com/freehand/document/ddKkqCyJN
## Common Issues

If any problem arises when running the project, the first thing you should do is check the logs:

```shell
$ docker compose logs -f <service-name>
```

Example:

```shell
$ docker compose logs -f server
```

### Elasticsearch

Issue - The `elasticsearch` container is unhealthy, or the logs mention increasing `vm.max_map_count`.

Solution - Increase the virtual memory:

```shell
$ sudo sysctl -w vm.max_map_count=262144
```