SAIL LABS Technology, one of the world’s leading speech technology solution providers, today announced the availability of Text Mining Indexer 6.0 (TMI), the latest version of its flagship product for text-based multilingual information retrieval.
The Text Mining Indexer is part of a powerful suite of integrated multimedia indexing and mining technologies providing actionable intelligence. It produces a configurable automatic index from various text and media sources such as:
• Plain text documents
• Microsoft Word
• PDF documents
• Press Releases
• E-mails (from configurable POP3 and IMAP user accounts)
FEATURES AND BENEFITS OF THE TEXT MINING INDEXER INCLUDE:
Description of Components:
- WebCollector: an application that automatically analyzes configurable internet sites for RSS feeds, web pages, and updates to these resources.
- E-mail Collector: the content of e-mails is harvested and analyzed.
- Electronic Documents: an electronic document converter allows indexing of electronic documents in diverse formats.
- Story Segmentation: the text output is segmented into coherent stories.
- Topic Detection: the Topic Detection module is equipped with a wide variety of topics, ranging from general to specific. This helps locate stories relating to user-specific interests.
- Named Entity Detection: words are tagged with entity categories such as persons, locations, and organizations.
- Instant Access/real-time Indexing: The indexed results are obtained as soon as content is available.
WebCollector features:
- Web crawling
- Language identification (for multilingual web pages)
- Semantic key story content extraction (link/story segmentation)
- Ad & link suppression
- Selective include/exclude patterns
- Offline mode (with web page snapshot in PDF format)
- RSS & Atom support
- Social Network support (Facebook, IMDb, LinkedIn)
- Auto login (access to restricted websites)
- Twitter support
E-mail Collector features:
- POP3 server support
- IMAP server support
Supported electronic document formats:
- Plain Text
- Microsoft Word (.doc, .docx)
- Microsoft PowerPoint (.ppt, .pptx)
- Microsoft Excel (.xls, .xlsx)
- Portable Document Format (.pdf)
- Open Document Text (.odt)
- Rich Text Format (.rtf)
- HyperText Markup Language (.html)
SOME OF THE EXCITING FEATURES OF THE NEW VERSION INCLUDE:
- Seamless integration with audio-visual sources
- Extensible named-entity categories
- Support of multiple language models
- “Stop/Start/Resume” control for the WebCollector
- Improved scalability from collection to server upload
- Flexible set-up combinations of all components involved
- PDF screen snapshots produced by the WebCollector, including all graphical content, for later reference
- Customizability of web-collection
- User-friendly interface
- Flexibility in model structures (language features)
Currently Arabic, Catalan, English (international and US), Farsi, French, German, Greek, Hebrew, Italian, Mandarin, Norwegian, Polish, Portuguese (Brazilian), Russian and Spanish are supported.