SAIL LABS Announces New Release of Text Mining Indexer 6.0

SAIL LABS Technology, one of the world’s leading speech technology solution providers, today announced the availability of Text Mining Indexer 6.0 (TMI), the latest version of its flagship product for text-based multi-lingual information retrieval.

The Text Mining Indexer is part of a powerful suite of integrated multimedia indexing and mining technologies providing actionable intelligence. It produces a configurable automatic index from various text and media sources such as:

• Plain text documents
• Microsoft Word
• PDF documents
• Press releases
• E-mails (from configurable POP3 and IMAP user accounts)
• Web pages
• Feeds

“Direct feedback from a worldwide clientele and network of partners has resulted in functionality and customizability enhancements in TMI 6.0 to meet their needs,” said Mark Pfeiffer, Chief Visionary Officer at SAIL LABS. “With this new version featuring multiple ingestion formats and a variety of set-up combinations, even more users can now benefit from greater flexibility in terms of use and usability,” Pfeiffer concluded.


Description of Components:

  • WebCollector: an application that automatically analyzes configurable sites on the internet for RSS feeds, web pages and updates to these resources.
  • E-mail Collector: the content of e-mails is harvested and analyzed.
  • Electronic Documents: an electronic document converter allows indexing of electronic documents in diverse formats.


  • Story Segmentation: the text output is segmented into coherent stories.
  • Topic Detection: the Topic Detection module is equipped with a wide variety of topics, ranging from general to specific. This helps locate stories relating to user-specific interests.
  • Named Entity Detection: words are tagged with entity categories such as persons, locations and organizations.
  • Instant Access/Real-Time Indexing: indexed results are available as soon as content is available.


  • Web crawling
  • Language identification (for multilingual web pages)
  • Semantic key story content extraction (link/story segmentation)
  • Ad & link suppression
  • Selective include/exclude patterns
  • Offline mode (with web page snapshot in PDF format)
  • RSS & Atom support
  • Social Network support (Facebook, IMDb, LinkedIn)
  • Auto login (access to restricted websites)
  • Twitter support
  • Blogs
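
To illustrate what selective include/exclude patterns can mean in practice, here is a minimal Python sketch. It is not taken from TMI or the WebCollector: the function name `should_collect`, the glob-style pattern syntax and the example URLs are all assumptions made for illustration. In this sketch an exclude match always wins over an include match.

```python
from fnmatch import fnmatchcase


def should_collect(url, include_patterns, exclude_patterns):
    """Decide whether a URL should be collected (hypothetical helper).

    Assumption: patterns are shell-style globs, where '*' matches any
    run of characters, including '/'.
    """
    # An exclude match always wins, so unwanted pages are dropped first.
    if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
        return False
    # Otherwise the URL must match at least one include pattern.
    return any(fnmatchcase(url, pattern) for pattern in include_patterns)


# Example patterns (assumed values, not from the product documentation).
include = ["https://example.com/news/*"]
exclude = ["*/ads/*"]

for url in (
    "https://example.com/news/story.html",
    "https://example.com/news/ads/banner.html",
    "https://example.org/news/story.html",
):
    print(url, should_collect(url, include, exclude))
```

Checking exclusions first keeps the rule easy to reason about: a page is collected only if it is explicitly included and not explicitly excluded.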


  • POP3 server support
  • IMAP server support

File Format:

  • Plain Text
  • Microsoft Word (.doc, .docx)
  • Microsoft PowerPoint (.ppt, .pptx)
  • Microsoft Excel (.xls, .xlsx)
  • Portable Document Format (.pdf)
  • Open Document Text (.odt)
  • Rich Text Format (.rtf)
  • HyperText Markup Language (.html)
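
The format list above can be read as a lookup from file extension to document type. The following Python sketch shows one way such a lookup might work; it is an assumption for illustration, not the converter's actual logic. The names `SUPPORTED_FORMATS` and `detect_format` are hypothetical, and the `.txt` extension for plain text is assumed because the list gives no extension for it.

```python
from pathlib import Path

# Extension-to-format table built from the list above.
# ".txt" is an assumption; the source names "Plain Text" without an extension.
SUPPORTED_FORMATS = {
    ".txt": "Plain Text",
    ".doc": "Microsoft Word",
    ".docx": "Microsoft Word",
    ".ppt": "Microsoft PowerPoint",
    ".pptx": "Microsoft PowerPoint",
    ".xls": "Microsoft Excel",
    ".xlsx": "Microsoft Excel",
    ".pdf": "Portable Document Format",
    ".odt": "Open Document Text",
    ".rtf": "Rich Text Format",
    ".html": "HyperText Markup Language",
}


def detect_format(filename):
    """Return the format name for a file, or None if it is not listed."""
    # Compare extensions case-insensitively so "Report.DOCX" is recognized.
    extension = Path(filename).suffix.lower()
    return SUPPORTED_FORMATS.get(extension)


for name in ("Report.DOCX", "slides.ppt", "clip.mp3"):
    print(name, "->", detect_format(name))
```

A real converter would more likely inspect file contents than trust the extension, so this lookup should be read only as a compact restatement of the list.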


  • Seamless integration with audio-visual sources
  • Extensible named-entity categories
  • Support of multiple language models
  • “Stop/Start/Resume” for WebCollector
  • Improved scalability from collection to server upload
  • Flexibility in set-up combinations of all components involved
  • WebCollector screenshots in PDF format, including all graphical content, for later reference
  • Customizability of web-collection
  • User-friendly interface
  • Flexibility in model structures (language features)

Available Languages:
Currently Arabic, Catalan, English (international and US), Farsi, French, German, Greek, Hebrew, Italian, Mandarin, Norwegian, Polish, Portuguese (Brazilian), Russian and Spanish are supported.
