# Text Classification and Intelligent Search (TCIS)
## Overview
### Problem Statement
During its ongoing operations, PDO generates terabytes of structured and unstructured data stored in different locations and formats. As a result, finding the right content, in the correct version, to support business decisions is difficult.
### TCIS Workflow

### System Architecture (Simplified)

### Crawler Logic

## `crawler`
The crawler server is responsible for fetching files from different data sources (e.g. SharePoint, network drives, Livelink, etc.).
- This service is implemented using Go and the Gin framework.
When the crawler fetches a file, it extracts some metadata, such as author, size, thumbnail, etc. It then makes two requests:
- To the `ocr` service, to run the ML services.
- To the `server` service, to upsert the extracted basic metadata into ElasticSearch.
The files:
- In the production environment, the different data sources are mounted to the servers using rClone.
- In the dev environment, the files are crawled from the `./data` directory.
## `ocr`
The ocr server is implemented with Python and Flask.
Once the ocr server receives a file, it runs the following processes:
- OCR: extracts the content from all files.
- Special keywords extraction: extracts field, well, and reservoir names from the content.
- Unsupervised keywords extraction: extracts keywords from the files' content.
- Classification: suggests classifications for a file.
In some cases, the following processes are also executed:
- Image classification: suggests classifications for an image.
- Incident report metadata: extracts metadata from Incident Reports.
Once a process is finished:
- The `ocr` sends the output to the `crawler`.
- The `crawler` forwards the request to the `server`.
- The `server` upserts the process's output into ElasticSearch.
## `server`
The main server is hosted on a separate server (mus-n-v00289).
- It is also implemented using Go and the Gin framework.
The server is connected to the following services:
- The `ocr` service. See the previous section.
- The `web` service. See the next section.
- The `elastic` service. See ElasticSearch.
- The `database` service: an MS SQL Server which is primarily used to store user-related data, such as search history.
The server is responsible for the following:
- Upserting file metadata into the `elastic` service.
- Acting as a search engine by searching through the ElasticSearch indices.
- Communicating with the FE to serve users.
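Serving a search amounts to translating the user's query into an ElasticSearch request body. A minimal sketch, assuming a `content` field in the index mapping:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildSearchBody returns a minimal ElasticSearch query that matches the user's
// text against the extracted file content. The "content" field name is an
// assumption about the index mapping, not the real one.
func buildSearchBody(query string, size int) map[string]any {
	return map[string]any{
		"size": size,
		"query": map[string]any{
			"match": map[string]any{
				"content": query,
			},
		},
	}
}

func main() {
	// This JSON would be POSTed to the index's _search endpoint.
	body, _ := json.Marshal(buildSearchBody("well test report", 10))
	fmt.Println(string(body))
}
```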
## `web`
This is the front end (FE) for TCIS and is implemented using TypeScript and React.
## `livelink_crawler`
This service is implemented in Go and is responsible for crawling files from Livelink.
It works by making direct requests using the Livelink API.
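A minimal sketch of such a direct request. The `/api/v1/nodes` path and `OTCSTicket` header follow the OpenText Content Server (Livelink) REST API, but treat them as assumptions to verify against the deployed Livelink version:

```go
package main

import (
	"fmt"
	"net/http"
)

// nodeURL builds the URL for fetching a Livelink node's metadata.
// The /api/v1/nodes path is an assumption borrowed from the OpenText
// Content Server REST API.
func nodeURL(base string, nodeID int) string {
	return fmt.Sprintf("%s/api/v1/nodes/%d", base, nodeID)
}

// fetchNode issues the authenticated request. Livelink typically expects the
// session ticket in the OTCSTicket header (again, an assumption to verify).
func fetchNode(client *http.Client, base, ticket string, nodeID int) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, nodeURL(base, nodeID), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("OTCSTicket", ticket)
	return client.Do(req)
}

func main() {
	fmt.Println(nodeURL("https://livelink.example.com", 12345))
}
```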
## `local_importer`
This service is responsible for:
- Importing data into the `database` to simulate the views provided by PDO.
- Importing the chatbot Q&As into the `database`.
## Contribution
### Prerequisites
Before contributing (or running) this project, make sure that you have done the following:
- Set up the `~/.netrc` file to pull the `rihal.tech/foundation` library. Click here for instructions.
- Add `export GITHUB_LOGIN=YOUR_GITHUB_USERNAME` and `export GITHUB_TOKEN=YOUR_GITHUB_TOKEN` to `~/.bashrc`.
- Set up the `~/.npmrc` file to pull private NPM packages. Click here for instructions.
- Run `make install-dep` from the root of the repo to install all required dependencies.
### Running the app for the first time
Run the following commands from the root of the repo.

To run everything:

```shell
make run-all
```

To run the crawler only:

```shell
make run-dev-temporal
```

To run the web:

```shell
make run-dev-web
```

To test the crawler or populate data:

```shell
curl localhost:8881/walk
# then
curl localhost:8881/crawl
```
## Credentials
These are the credentials for the three user roles:

| Role | Username | Password |
| --- | --- | --- |
| User | `tcis-user` | `123` |
| Admin | `tcis-admin` | `123` |
| Top Admin | `tcis-top-admin` | `123` |
## Design
The design system can be found here:
- Interfaces: https://projects.invisionapp.com/share/J612SVYB8EU4#/screens/467667265_Search_Home
- Whiteboard: https://projects.invisionapp.com/freehand/document/ddKkqCyJN
## Common Issues
If any problem arises when running the project, the first thing you should do is check the logs:

```shell
docker-compose logs -f <service-name>
```

Example:

```shell
docker-compose logs -f server
```
## Notes
- The up-to-date production docs are in the README in the production folder.
- Other docs (which may be outdated) can be found in the READMEs in the dev docs folder.
- Please note that the project is complex and has undergone various fundamental changes, so don't hesitate to contact the team.