
Text Classification and Intelligent Search

Last updated: 4/16/2026

Overview

Problem Statement

PDO generates terabytes of structured and unstructured data stored in different locations and formats during its ongoing operations. As a result, challenges arise when searching for the right content with the correct version to support the process of making business decisions.

TCIS Workflow

(diagram: TCIS happy-path workflow)

System Architecture (Simplified)

(diagram: simplified system architecture)

Crawler Logic

(diagram: crawler logic)

The crawler server is responsible for fetching files from different data sources (e.g. SharePoint, network drives, Livelink, etc.).

When the crawler fetches a file, it extracts some metadata, such as author, size, thumbnail, etc. It then makes two requests:

  1. To the ocr service to run the ML processes.
  2. To the server service to upsert the basic metadata extracted into ElasticSearch.
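The two requests above can be sketched as follows. This is an illustrative sketch only: the endpoint URLs and payload field names are assumptions, not taken from the actual crawler code.

```python
def build_crawler_requests(path, metadata):
    """Given a crawled file's path and its extracted metadata, build
    the two requests the crawler sends: one to the ocr service and
    one to the server for the metadata upsert."""
    ocr_request = {
        "url": "http://ocr:5000/process",    # hypothetical ocr endpoint
        "body": {"path": path},
    }
    upsert_request = {
        "url": "http://server:8080/upsert",  # hypothetical server endpoint
        "body": {"path": path, "metadata": metadata},
    }
    return ocr_request, upsert_request
```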

File locations:

  • In the production environment, the different data sources are mounted to the servers using rClone.
  • In the dev environment, the files are crawled from the ./data directory.
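In the dev environment, the crawl over ./data could look like the following sketch, which collects a couple of basic metadata fields per file (the exact fields the real crawler extracts are assumptions here):

```python
import os


def walk_data_dir(root="./data"):
    """Yield (path, basic metadata) for every file under root,
    mimicking what the dev crawler might collect."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            yield path, {"name": name, "size": stat.st_size}
```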

ocr

The ocr server is implemented with Python and Flask.

Once the ocr server receives a file, it runs the following processes:

  • OCR: extracting the content from all files
  • Special keywords extraction: extracts field, well, and reservoir names from the content.
  • Unsupervised keywords extraction: extracts keywords from the files' content.
  • Classification: suggests some classification for a file.

In some cases the following processes are executed:

  • Image classification: suggests some classification for an image.
  • Incident report metadata: extracts metadata from Incident Reports.
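The process selection described above can be sketched as a small dispatch function. The process names, image-type check, and incident-report flag are illustrative assumptions, not the ocr service's actual logic:

```python
IMAGE_TYPES = {".png", ".jpg", ".jpeg", ".tiff"}


def processes_for(filename, is_incident_report=False):
    """Return the list of ML processes the ocr service would run
    for a given file: the four core processes always, plus the
    conditional ones depending on the file."""
    processes = [
        "ocr",                     # content extraction, all files
        "special_keywords",        # fields, wells, reservoirs
        "unsupervised_keywords",
        "classification",
    ]
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in IMAGE_TYPES:
        processes.append("image_classification")
    if is_incident_report:
        processes.append("incident_report_metadata")
    return processes
```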

Once a process is finished:

  1. The ocr sends the output to the crawler.
  2. The crawler forwards the request to the server.
  3. The server upserts the process's output into ElasticSearch.
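For step 3, Elasticsearch's update API supports `doc_as_upsert`, which inserts the document if it is missing and merges the fields otherwise. A sketch of the request body the server might build per process output (the field layout is an assumption):

```python
def build_es_upsert(process_name, output):
    """Build an Elasticsearch update-API body that stores a process's
    output under its own field, inserting the document if absent."""
    return {
        "doc": {process_name: output},
        "doc_as_upsert": True,  # insert-or-update in a single call
    }
```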

server

The main server is hosted on a separate server (mus-n-v00289).

  • It is also implemented using Go and the Gin framework.

The server is connected to the following services:

  • The ocr service. See the previous section.
  • The web service. See the next section.
  • The elastic service. See ElasticSearch.
  • The database service.
    • An MS SQL Server which is primarily used to store user-related data, such as search history.

The server is responsible for the following:

  • Upserting file metadata into the elastic service.
  • Acting as a search engine by searching through the ElasticSearch indices.
  • Communicating with the FE to serve users.
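As a sketch of the search-engine role, a `multi_match` query is a common way to match a user's query text against several document fields at once. The field names here are assumptions for illustration, not the server's actual mapping:

```python
def build_search_query(text, size=10):
    """Build an Elasticsearch search body that matches the query
    text against several document fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["content", "title", "keywords"],
            }
        },
    }
```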

web

This is the FE for TCIS and is implemented using TypeScript and React.

livelink

This service is implemented in Go and is responsible for crawling files from Livelink.

It works by making direct requests to the Livelink API.

local_importer

This service is responsible for:

  1. Importing data into the database to simulate the views provided by PDO.
  2. Importing the chatbot Q&A's into the database.
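Task 2 might look like the following sketch, shown here against SQLite for a self-contained example; the real target is the MS SQL Server database, and the table schema is an illustrative assumption:

```python
import sqlite3


def import_qa_pairs(conn, pairs):
    """Insert chatbot Q&A pairs into a qa table
    (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qa (question TEXT, answer TEXT)"
    )
    conn.executemany("INSERT INTO qa VALUES (?, ?)", pairs)
    conn.commit()
```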

Contribution

Prerequisites

Before contributing (or running) this project, make sure that you have done the following:

  1. Set up the ~/.netrc file to pull the rihal.tech/foundation library. Click here for instructions.
  2. Add export GITHUB_LOGIN=YOUR_GITHUB_USERNAME and export GITHUB_TOKEN=YOUR_GITHUB_TOKEN to ~/.bashrc.
  3. Set up the ~/.npmrc file to pull private NPM packages. Click here for instructions.
  4. Run make install-dep from the root of the repo to install all dependencies required.
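The ~/.netrc entry from step 1 typically follows the standard machine/login/password layout. The machine name below is an assumption based on the library host mentioned above; substitute your own credentials:

```
machine rihal.tech
  login YOUR_GITHUB_USERNAME
  password YOUR_GITHUB_TOKEN
```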

Running the app for the first time

Run the following commands from the root of the repo:

To run everything: make run-all

To run the crawler only: make run-dev-temporal

To run the web: make run-dev-web

To test the crawler or populate data: curl localhost:8881/walk, then curl localhost:8881/crawl

Credentials

These are the credentials for the 3 user roles:

User

username: tcis-user
password: 123

Admin

username: tcis-admin
password: 123

Top Admin

username: tcis-top-admin
password: 123

Design

The design assets can be found at the following links:

Interfaces:

https://projects.invisionapp.com/share/J612SVYB8EU4#/screens/467667265_Search_Home

Whiteboard:

https://projects.invisionapp.com/freehand/document/ddKkqCyJN

Common Issues

If any problem arises when running the project, the first thing you should do is check the logs:

$ docker-compose logs -f <service-name>

Example:

$ docker-compose logs -f server

Notes