Sponge

Sponge is a data crawler utility, part of the bigConnect product suite. It is included in the bigConnect Enterprise Edition, but it can also be used as a standalone product.


The real Sponge power

Sponge helps you transform the limitless amount of unstructured content into valuable information that can be used everywhere. Whether the source is web sites, blogs, forums, social sites or documents, Sponge is a full-featured, flexible and extensible crawler that runs on any platform and will help you crawl whatever information you want, how you want it.

Using Sponge

Sponge helps you do three things: crawl data from web sites, extract data from social networks, and collect documents from file systems. All three crawlers share a common platform that lets you manage data, crawling jobs and security in a unified way.

The interface

Web sites are crawled with configurable HTTP spiders. We provide a very simple web user interface where users can load any web page and define how to extract their data of interest. Social network data is extracted using the specialized APIs made available by each vendor. We currently support Facebook, Twitter, Google+ and YouTube, and are working to integrate more platforms.
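
As a rough illustration, a rule defined in the web UI boils down to mapping field names to page locations. The Python sketch below shows what such a mapping might compile to; the selectors and field names are hypothetical, not Sponge's actual rule format.

    # Illustrative only: roughly what a UI-defined extraction rule
    # compiles to. Selectors and field names are hypothetical.
    from bs4 import BeautifulSoup

    EXTRACTION_RULES = {
        "title":  "h1.article-title",   # CSS selector picked in the UI
        "author": "span.byline a",
        "body":   "div.article-body",
    }

    def extract(html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        record = {}
        for field, selector in EXTRACTION_RULES.items():
            node = soup.select_one(selector)
            record[field] = node.get_text(strip=True) if node else None
        return record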

Sponge reach

bigConnect can ingest and process any kind of information. Whether it is stored in databases, office documents, text files, XML files, HTML files, images, audio files, video files or other streams, or lives on Facebook, Twitter, YouTube, Google+ and web sites, our content extraction engines and crawlers will automatically extract the relevant information as objects, relations and attributes.
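
A minimal sketch of that objects/relations/attributes model; the type names and fields below are hypothetical, shown only to make the shape of the output concrete.

    from dataclasses import dataclass, field

    # Hypothetical shape of the extracted data model: objects carry
    # attributes, relations connect two objects.
    @dataclass
    class ExtractedObject:
        id: str
        type: str                       # e.g. "person", "document"
        attributes: dict = field(default_factory=dict)

    @dataclass
    class Relation:
        source_id: str
        target_id: str
        label: str                      # e.g. "authored", "mentions"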

Job Scheduling

Using a unified management console, Sponge allows users to run, schedule and monitor crawling jobs, configure crawling parameters and see the execution progress.
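
For illustration, a crawling job can be thought of as a handful of parameters plus a schedule. The field names below are hypothetical; the real parameters are set in the management console.

    # Hypothetical job definition; real parameters are configured in the
    # management console, and these field names are illustrative.
    job = {
        "name": "news-site-daily",
        "crawler": "web",
        "seeds": ["https://example.com/news"],
        "schedule": "0 2 * * *",   # cron syntax: every day at 02:00
        "max_depth": 3,            # maximum crawling depth
        "delay_ms": 500,           # politeness delay between hits
    }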

Structured files ingestion

Just drag an XLS or CSV file onto the workspace and bigConnect will automatically parse the content and let you configure the mappings in our mapping editor.
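
The mapping editor's output can be pictured as a column-to-attribute table. A minimal sketch of CSV ingestion under that assumption; the column and attribute names are hypothetical.

    import csv

    # Illustrative column-to-attribute mapping, similar in spirit to
    # what the mapping editor produces. Names are hypothetical.
    COLUMN_MAP = {"Name": "person.name", "E-mail": "person.email"}

    def ingest(path: str):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # emit one attribute dict per CSV row
                yield {attr: row.get(col) for col, attr in COLUMN_MAP.items()}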

Security

The content of a workspace, where data is collected, is completely secure and can be shared with others under different access rights. For example, you might want some users to see the job status but restrict access to the actual data that was extracted.
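
A minimal sketch of that access-rights idea; the permission names are hypothetical, not Sponge's actual ones.

    # Illustrative access-rights model for a shared workspace.
    workspace_acl = {
        "analyst@example.com": {"VIEW_JOB_STATUS"},               # status only
        "lead@example.com":    {"VIEW_JOB_STATUS", "READ_DATA"},  # full access
    }

    def can_read_data(user: str) -> bool:
        return "READ_DATA" in workspace_acl.get(user, set())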

Data output

Once the information is collected and processed, it needs to be delivered in the location and format of your choice. A pluggable data output mechanism sends data to bigConnect or to other stores such as SQL databases, ElasticSearch or Solr.
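
A pluggable output mechanism usually means one small interface with several implementations. A sketch under that assumption; this is not Sponge's actual extension API.

    from abc import ABC, abstractmethod

    # Hypothetical sink interface: each destination implements write().
    class OutputSink(ABC):
        @abstractmethod
        def write(self, record: dict) -> None: ...

    class ElasticsearchSink(OutputSink):
        def __init__(self, client, index: str):
            self.client, self.index = client, index

        def write(self, record: dict) -> None:
            # e.g. with elasticsearch-py 8.x: index one document
            self.client.index(index=self.index, document=record)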

Monitoring

Running crawlers report their status to the management console for easy monitoring of crawling jobs. Users can review job execution status, crawling errors or intermediate data that was collected.
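
If you prefer to watch jobs programmatically rather than through the console, the idea reduces to polling a status endpoint. The URL and response fields below are hypothetical, shown only to illustrate the pattern.

    import requests

    # Hypothetical status endpoint; the real console URL and response
    # fields will differ.
    def job_status(job_id: str) -> dict:
        resp = requests.get(f"http://sponge.local/api/jobs/{job_id}")
        resp.raise_for_status()
        return resp.json()   # e.g. {"state": "RUNNING", "errors": 2, ...}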

Data Processing

Crawler-extracted data can be tagged with additional information, translated, split, merged, trimmed, filtered and more. We ship many predefined taggers and transformers, and users can even run scripts to tag and transform content according to their requirements.
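
A minimal sketch of such a tag-and-transform pipeline; the two trivial steps below stand in for the predefined taggers and transformers.

    # Each step takes a record and returns it, possibly modified.
    def trim(record):
        record["text"] = record["text"].strip()
        return record

    def tag_language(record):
        # stand-in for real language detection
        record["tags"] = record.get("tags", []) + ["lang:en"]
        return record

    PIPELINE = [trim, tag_language]

    def process(record):
        for step in PIPELINE:
            record = step(record)
        return record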

Web Crawler

Any web site is a valuable source of information. Sponge lets you easily decode any raw web site by highlighting certain areas of the page and mapping them to the structure you want.

With a user-friendly web interface, data mapping can be done in a couple of clicks. Sponge identifies the unique HTML landmarks that pin down your selections using smart matching algorithms.

  • Web UI for easy configuration.
  • Multi-threaded.
  • Language detection.
  • URL normalization (see the sketch after this list).
  • Supports pages rendered with JavaScript.
  • Configurable crawling speed.
  • Detects modified and deleted documents.
  • Various web site authentication schemes.
  • Supports sitemap.xml.
  • Supports robot rules.
  • Supports canonical URLs.
  • Document filters based on URL, HTTP headers, content, or metadata.
  • Can re-process or delete URLs no longer linked by other crawled pages.
  • Different URL extraction strategies for different content types.
  • Reference XML/HTML elements using simple DOM tree navigation.
  • Configurable hit intervals according to different schedules.
  • Customizable user agents used for crawling.
  • Configurable maximum crawling depth.
  • Different frequencies for re-crawling certain pages.
  • Crawls millions of records on average hardware.
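
As an example of what the URL normalization step involves, here is a minimal sketch; Sponge's real normalizer handles many more cases (duplicate slashes, query-parameter ordering and so on).

    from urllib.parse import urlsplit, urlunsplit

    # Minimal URL normalization: lower-case the host, drop the fragment,
    # strip default ports, ensure a non-empty path.
    def normalize(url: str) -> str:
        parts = urlsplit(url)          # urlsplit already lower-cases the scheme
        netloc = parts.netloc.lower()
        if parts.scheme == "http" and netloc.endswith(":80"):
            netloc = netloc[:-3]
        elif parts.scheme == "https" and netloc.endswith(":443"):
            netloc = netloc[:-4]
        path = parts.path or "/"
        return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

    # normalize("HTTP://Example.COM:80/news?id=1#top")
    #   -> "http://example.com/news?id=1"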

Social Crawler

Social Networks have a lot of information that can be used for various purposes: social analysis, customer voice, influencers or hot topics.

All important social networks, such as Facebook, Twitter, Google+ and VK, expose specialized APIs through which third-party applications can connect and extract information such as user profiles, pages, comments and friends. Our Social Crawler uses these APIs to retrieve information that is much more accurate than normal web scraping.

With a user-friendly interface, the Social Crawler lets users define the starting point of the crawl process (a user profile, a search result, a page, a group and so on) and then go deeper, identifying objects of interest as results come in.

The crawling process uses a tree of queries. At each step you define the starting query (all posts, all comments, all likes, all friends, all shares and so on) and then go deeper to fetch other information related to each returned object.

For example, you can start from one page and query all its posts. Then, for each post, you can query its shares on one side and its comments on the other. From the comments you can go deeper and query the reactions and the sub-comments.
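
The example above maps naturally to a nested configuration. A minimal sketch of such a query tree; the query names are hypothetical.

    # Mirrors the example: page -> posts, posts branch into shares and
    # comments, comments branch into reactions and sub-comments.
    QUERY_TREE = {
        "query": "page.posts",
        "children": [
            {"query": "post.shares"},
            {"query": "post.comments",
             "children": [
                 {"query": "comment.reactions"},
                 {"query": "comment.replies"},
             ]},
        ],
    }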

  • Supports Facebook, Twitter, Google+ and YouTube.
  • Web UI for easy configuration.
  • Uses the native APIs published by the social networks.
  • Language detection.
  • Publishes data to bigConnect as objects and relationships for advanced analysis.

Filesystem Crawler

A lot of valuable data still resides in local content silos: local disks, external drives, network shares, archives or even HDFS. Sponge includes a crawler that can effectively collect, parse, manipulate and store this information.
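
At its core, a filesystem crawl is a recursive walk with a pluggable extraction stage. A minimal sketch, where matching files are simply yielded to a downstream parser/OCR stage; the extension list is illustrative.

    import os

    # Walk a directory tree and yield paths of files worth extracting.
    def crawl(root: str, extensions=(".pdf", ".docx", ".html", ".txt")):
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if os.path.splitext(name)[1].lower() in extensions:
                    # hand off to a format-specific parser or OCR stage
                    yield os.path.join(dirpath, name)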

  • Web UI for easy configuration.
  • Multi-threaded.
  • Extracts text out of many file formats (HTML, PDF, MS Office, OpenOffice, images etc.).
  • Extracts metadata associated with documents.
  • Supports pages rendered with JavaScript.
  • Language detection.
  • OCR support on images and PDFs.
  • Translation support.
  • Easy text transformation and metadata manipulation.
  • Filters unwanted documents.
  • Detects modified and deleted documents.
  • Various date parsers and formatters.
  • Extracts document ACLs from SMB/CIFS file systems.

Sponge Users

Intelligence

Sponge is used to collect and correlate information from websites and social networks in order to identify potential threats to national security.

Anti Fraud and Law enforcement

Sponge is used to ingest information from news sites and social profiles to discover relationships between people, actions, facts and locations, and to interpret them in order to prevent fraud and corruption.

Banking & Telecom

Sponge can extract information from social networks, forums and blogs to identify the customer voice and the general sentiment regarding customer services. It can also act as an anti-churn and prevention tool, strengthening the client base and capturing market feedback.

Social Network Analysis

Sponge extracts complete information from social networks and is the key data provider for complex social network analysis.

Customer 360

Extracting information from relevant web sites, forums, blogs and other public information sources, and correlating it with social network data, enables organizations to create complete, 360-degree people profiles.