Enriching data with BigConnect

BigConnect enriches the data stored in the system through a processing pipeline: named entity extraction on text, object detection, machine learning, OCR, speech-to-text, and custom plugins.

BigConnect has a complex data processing pipeline that can accommodate even the toughest tasks. We call it the Data Worker pipeline, and it uses… Data Workers :)

Data Workers are plugins that run in no particular order whenever something changes in the data. They can run on a single machine or on hundreds of machines for large deployments and resource-intensive tasks. A Data Worker can do three things:

Initialize itself when the system starts up (the prepare method)
Say whether it can process an element (entity or relationship) or a property of an element (the isHandled method)
Execute processing logic (the execute method)
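
The three responsibilities above can be sketched in a few lines of Java. This is a minimal illustration only: the Worker interface, the Element class and the TextLengthWorker below are hypothetical stand-ins written for this example, and the method signatures are assumptions rather than the real signatures of com.mware.core.ingest.dataworker.DataWorker.

```java
import java.util.HashMap;
import java.util.Map;

public class DataWorkerSketch {

    /** Hypothetical stand-in for an element (entity or relationship); not the BigConnect type. */
    static final class Element {
        final String id;
        final Map<String, String> properties = new HashMap<>();

        Element(String id) {
            this.id = id;
        }
    }

    /** Hypothetical contract mirroring the three responsibilities; signatures are assumed. */
    interface Worker {
        void prepare();

        boolean isHandled(Element element, String propertyName);

        void execute(Element element, String propertyName);
    }

    /** Example worker: stores the character count of the "text" property. */
    static final class TextLengthWorker implements Worker {
        private boolean prepared;

        @Override
        public void prepare() {
            // One-time initialization (loading models, opening connections, ...).
            prepared = true;
        }

        @Override
        public boolean isHandled(Element element, String propertyName) {
            // Only react when the "text" property changed and is present.
            return "text".equals(propertyName) && element.properties.containsKey("text");
        }

        @Override
        public void execute(Element element, String propertyName) {
            if (!prepared) {
                throw new IllegalStateException("prepare() was not called");
            }
            int length = element.properties.get("text").length();
            element.properties.put("textLength", String.valueOf(length));
        }
    }

    /** Runs one worker against one change; returns true if the worker executed. */
    static boolean process(Worker worker, Element element, String changedProperty) {
        if (!worker.isHandled(element, changedProperty)) {
            return false;
        }
        worker.execute(element, changedProperty);
        return true;
    }

    public static void main(String[] args) {
        Worker worker = new TextLengthWorker();
        worker.prepare();

        Element doc = new Element("doc-1");
        doc.properties.put("text", "hello");

        System.out.println(process(worker, doc, "text"));      // true
        System.out.println(doc.properties.get("textLength"));  // 5
        System.out.println(process(worker, doc, "title"));     // false
    }
}
```

The worker declines the "title" change because isHandled only accepts the "text" property, so execute is never called for it.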

Every change to a data element or property inside BigConnect notifies the Data Worker pipeline, which in turn runs all available Data Workers, passing along the changed element and, where applicable, its changed property. The pipeline works using queues, so it can be easily distributed to any number of machines. Some Data Workers can run on some machines, while others can run on other machines, depending on resource consumption and volume requirements. It’s completely up to the developer to properly architect the pipeline setup.
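
To make the queue-based design more concrete, here is a small sketch in which change notifications are placed on a queue and drained by several consumers. It is an illustration under stated assumptions, not BigConnect's implementation: threads stand in for machines, an in-memory BlockingQueue stands in for whatever queueing system a deployment uses, and the Change class and run method are hypothetical names.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueuePipelineSketch {

    /** Hypothetical change notification: which element changed, and which property. */
    static final class Change {
        final String elementId;
        final String propertyName;

        Change(String elementId, String propertyName) {
            this.elementId = elementId;
            this.propertyName = propertyName;
        }
    }

    // Sentinel object that tells a consumer to stop (compared by identity).
    private static final Change STOP = new Change("", "");

    /** Drains the changes with the given number of consumers; returns the processed element ids. */
    static Set<String> run(List<Change> changes, int consumers) {
        BlockingQueue<Change> queue = new LinkedBlockingQueue<>();
        Set<String> processed = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        Change change = queue.take();
                        if (change == STOP) {
                            return;
                        }
                        // A real node would run its locally installed Data Workers here.
                        processed.add(change.elementId);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Publish all changes, then one stop signal per consumer.
        queue.addAll(changes);
        for (int i = 0; i < consumers; i++) {
            queue.add(STOP);
        }

        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("consumers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for consumers", e);
        }
        return processed;
    }

    public static void main(String[] args) {
        List<Change> changes = List.of(
                new Change("doc-1", "text"),
                new Change("img-7", "raw"),
                new Change("vid-3", "raw"));
        System.out.println(run(changes, 2).size()); // 3
    }
}
```

Because consumers only share the queue, adding capacity means adding consumers; nothing in the producer changes.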

Data Workers don’t run in a particular order. For each change, the pipeline first calls a Data Worker’s isHandled method to see whether it can process the change and, if so, calls its execute method. It’s completely up to the developer to establish when a Data Worker will actually execute on a specific element. By not enforcing a specific order, the pipeline gives developers maximum flexibility.
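
One way to get a predictable sequence without a fixed order is to let each worker's isHandled check depend on a property that another worker produces. The sketch below assumes a simplified model in which an element is a plain map and a worker reports the property it wrote so that the change can be queued again. The Worker interface, TEXT_EXTRACTOR and WORD_COUNTER are hypothetical and are not the bundled BigConnect workers.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ChainedWorkersSketch {

    /** Hypothetical worker: reacts to one property change and may write another property. */
    interface Worker {
        boolean isHandled(Map<String, String> element, String changedProperty);

        /** Returns the name of the property it wrote, if any. */
        Optional<String> execute(Map<String, String> element, String changedProperty);
    }

    /** Pretend text extractor: turns "raw" into "text". */
    static final Worker TEXT_EXTRACTOR = new Worker() {
        @Override
        public boolean isHandled(Map<String, String> element, String changedProperty) {
            return "raw".equals(changedProperty);
        }

        @Override
        public Optional<String> execute(Map<String, String> element, String changedProperty) {
            element.put("text", element.get("raw").trim());
            return Optional.of("text");
        }
    };

    /** Pretend word counter: only runs once a "text" change arrives. */
    static final Worker WORD_COUNTER = new Worker() {
        @Override
        public boolean isHandled(Map<String, String> element, String changedProperty) {
            return "text".equals(changedProperty);
        }

        @Override
        public Optional<String> execute(Map<String, String> element, String changedProperty) {
            int words = element.get("text").split("\\s+").length;
            element.put("wordCount", String.valueOf(words));
            return Optional.of("wordCount");
        }
    };

    /** Feeds a "raw" change through the workers and re-queues every property they write. */
    static Map<String, String> run(List<Worker> workers, String raw) {
        Map<String, String> element = new HashMap<>();
        element.put("raw", raw);

        Deque<String> changes = new ArrayDeque<>();
        changes.add("raw");

        while (!changes.isEmpty()) {
            String changed = changes.poll();
            for (Worker worker : workers) {
                if (worker.isHandled(element, changed)) {
                    worker.execute(element, changed).ifPresent(changes::add);
                }
            }
        }
        return element;
    }

    public static void main(String[] args) {
        Map<String, String> a = run(List.of(TEXT_EXTRACTOR, WORD_COUNTER), "  two words  ");
        Map<String, String> b = run(List.of(WORD_COUNTER, TEXT_EXTRACTOR), "  two words  ");
        System.out.println(a.equals(b));         // true
        System.out.println(a.get("wordCount"));  // 2
    }
}
```

Both registration orders produce the same element, because the word counter is triggered by the "text" change rather than by its position in the list.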

We provide quite a few Data Workers out of the box to get you started. The most notable ones are:

audio-metadata – Extracts audio metadata like file size and duration
audio-mp4-encoder – Converts an audio file to the MP4 format to be playable in the browser
audio-ogg-encoder – Converts an audio file to the OGG format to be playable in the browser
azure-image-ocr – Extracts text from image files using Azure Computer Vision and stores it as a property of the entity
azure-image-tags – Predicts tags found in an image using Azure Computer Vision and stores them as properties of the entity
groovy – Executes custom Groovy scripts
image-metadata-extractor – Extracts image metadata like file size, date taken, device, geo-location, heading, width and height. It also detects if images need to be rotated and/or flipped in order to be displayed in the correct orientation.
mime-type-ontology-mapper – Maps the object to an ontology concept using its mime type
opennlp-me-extractor – Extracts Named Entities from the provided text and creates Term Mentions for them. A user can then resolve the Term Mentions to different entities.
phone-number-extractor – Extracts US phone numbers from text and stores them as Term Mentions for the entity.
regex-extractor – Creates new entities by applying regular expressions to text data
tika-mime-type – Uses Apache Tika to determine the MIME type of an object
tika-text-extractor – Uses Apache Tika to extract text from a document. It also identifies the language of the document and stores it as a property on the entity.
video-audio-extract – Extracts the audio track of a video file and pushes it down the pipeline for further processing
video-frame-extract – Extracts every frame from a video file and pushes them down the pipeline for further processing
video-metadata – Extracts video metadata like duration, geo-location, date taken, device, width, height, rotation and file size and stores them as properties of the video entity.
video-mp4-encoder – Converts video data to the MP4 format to be playable in the browser
video-poster-frame – Extracts a frame to create an image to represent the video
video-webm-encoder – Converts video data to the WEBM format to be playable in the browser

You can easily develop your own Data Workers for specific tasks by extending the com.mware.core.ingest.dataworker.DataWorker class. Package your class in a jar file along with a META-INF/services/com.mware.core.ingest.dataworker.DataWorker service file and place it under the $BIGCONNECT_DIR/lib/ext folder.
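
As a rough illustration of the packaging step, and assuming the service file follows the standard Java service-provider convention (one fully qualified implementation class name per line), a jar for a hypothetical worker class com.example.MyDataWorker might be laid out as follows. The jar name and the class name are placeholders, not names taken from BigConnect.

```text
my-data-worker.jar
  com/example/MyDataWorker.class
  META-INF/services/com.mware.core.ingest.dataworker.DataWorker

# contents of META-INF/services/com.mware.core.ingest.dataworker.DataWorker
com.example.MyDataWorker
```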

Check out the source code for the provided Data Workers to learn how you can create your own.