Natural Language Processing

Introduction

GTCOM established the 2020 Cognitive Intelligence Research Institute in collaboration with the world's foremost universities, research institutions and scientists. Drawing on cutting-edge artificial-intelligence and deep-learning technologies, such as convolutional neural networks, recurrent neural networks, deep-belief networks, conditional random fields, random forests and word-embedding models, combined with hundreds of billions of global multilingual corpus data resources, GTCOM has built a unique algorithm cloud platform and a machine-translation training system.


The multilingual natural-language-processing algorithm cloud platform brings developers highly stable, efficient cloud services that meet the needs of global information-processing applications. Additionally, the self-developed multilingual neural-network machine-translation system supports multilingual, multi-field translation and can be customized to a given customer's needs, thereby offering effective solutions for next-generation cross-language deep information-processing services. It currently supports 10 languages and 53 algorithm services.


For more information: business@gtcom.com.cn

Applications

  • Word segmentation and part-of-speech tagging
  • The term "word segmentation" refers to the process of dividing a string in a written language into its component words according to grammatical norms. The term "part-of-speech tagging" refers to the process of finding, for a given word sequence, the most likely sequence of part-of-speech tags.
  • Named-entity recognition
  • Named-entity recognition is an important tool for applications such as information extraction, question-answering systems, semantic understanding and machine translation; thus, it plays a fundamental role in natural-language processing. We employ a statistical machine-learning method trained on massive corpora, which has achieved good results in a variety of Chinese and English application scenarios.
  • Sentiment analysis
  • The text sentiment analysis algorithm automatically analyzes and recognizes the opinions or attitudes expressed in a text and provides sentiment-tendency indicators that express the polarity and intensity of sentiments. Because it analyzes sentiment polarity, it plays a crucial role in public-opinion monitoring, topic monitoring and reputation analysis. The algorithm uses a deep-learning model trained on 100,000 manually annotated corpora. Its accuracy reaches 70% when the five indicators of very positive, positive, neutral, negative and very negative are used to indicate sentiment polarity. It currently supports 10 languages: Chinese, English, Japanese, Korean, Russian, Portuguese, Spanish, French, German and Arabic.
  • Keyword extraction
  • The keyword-extraction algorithm extracts text subjects and helps users quickly obtain the desired core content. It integrates a variety of machine-learning methods and large corpus resources, and currently supports 10 languages: Chinese, English, Japanese, Korean, Russian, Portuguese, Spanish, French, German and Arabic. It also facilitates the quick construction of keyword-extraction tools for other languages using open data.
  • Text summarization
  • Automatic summarization refers to the process of automatically generating a short, coherent text that expresses the core content of the original document. It efficiently compresses the original text and helps users read efficiently. We employ a data-driven machine-learning method that adapts to the characteristics of Internet big data and offers such advantages as no field restrictions, high computing efficiency, rapid generation and controllable summary length, thereby satisfying the needs of search engines, intelligent Q&A and other applications.
  • Language recognition
  • Language recognition refers to the process of automatically determining the language of input text. Based on N-grams and Bayes' theorem, we have developed a set of language-recognition technologies that support dozens of languages. Recognition of Simplified Chinese, Traditional Chinese, English, Japanese, Korean, Russian, Portuguese, Spanish, French, German and Arabic has been further optimized for greater accuracy.
  • Text classification
  • Text classification refers to the process of automatically labeling texts with categories according to a classification system or standard, allowing unstructured information to be organized under a given system. It serves as the basis for the application and management of massive data and accommodates a wide variety of application scenarios. We've established a classification standard that meets both industry standards and users' behavioral habits by incorporating and mapping the secondary classification system of GB/T 20093-2013, Classification and Code of News in Chinese, in combination with our data and product characteristics. Currently, the text-classification algorithm supports Chinese and English.
  • Sensitivity determination
  • The sensitivity-determination algorithm is mainly used to filter sensitive information, including that of a reactionary, pornographic and/or violent nature. Based on a statistical machine-learning model, we've implemented a sensitivity-analysis system that combines statistics and rules, using massive manually annotated corpus resources together with multilingual sensitive-word lexicons built on linguistic knowledge and word embedding. The algorithm currently supports Chinese and English.
  • Text-quality assessment
  • The text-quality assessment algorithm is used to filter and clean data collected by Internet users, thereby improving information quality and enhancing user experience. It can quickly identify noisy data containing garbled characters, codes and scripts as well as casually written, syntactically chaotic "useless" data using technologies such as machine learning and intelligent recognition.
  • Event element extraction
  • The event-element extraction algorithm structures unstructured natural-language texts and can be used for in-depth analysis and mining of news events. We use an unsupervised learning method to extract the most important time, place, person and event features, as well as other information, from an article without massive manually annotated corpora. It thus better accommodates the needs of open data processing in the age of big data.
  • Multilingual word embedding
  • The term "word embedding" is often used in the field of deep learning. A word embedding expresses not simply the word itself but also its semantic relationships with other words, making it an important means for the efficient, quantitative representation of natural-language vocabulary. Accordingly, we use a neural-network model to build a multilingual word-embedding library based on massive parallel corpora, with Chinese or English as the bridge language and monolingual and sentence-aligned corpora as training data. The library can quickly solve a variety of cross-language tasks, including multilingual text classification, multilingual text clustering, multilingual sentiment analysis and cross-language search.
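To make the word-segmentation idea above concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline for Chinese word segmentation. The toy dictionary and function names are illustrative assumptions, not GTCOM's production system.

```python
# Forward maximum matching: at each position, greedily take the longest
# dictionary word; unknown characters fall back to single-character tokens.
def fmm_segment(text, dictionary, max_len=4):
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback token.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

toy_dict = {"研究", "深度", "学习", "深度学习"}
print(fmm_segment("深度学习研究", toy_dict))  # ['深度学习', '研究']
```

Greedy longest-match is fast but cannot resolve genuine ambiguity; statistical models layered on top handle those cases.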
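The five-way sentiment polarity described above can be illustrated with a simple lexicon-scoring sketch. The word weights and label mapping here are invented for the example; the production system uses a deep-learning model trained on annotated corpora.

```python
# Toy polarity lexicon: positive words score > 0, negative words < 0.
POLARITY = {"excellent": 2, "good": 1, "bad": -1, "terrible": -2}
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

def classify_sentiment(tokens):
    """Sum word polarities and map the clamped score to a five-way label."""
    score = sum(POLARITY.get(t.lower(), 0) for t in tokens)
    # Clamp the summed score into the label range [-2, 2], then shift to an index.
    return LABELS[max(-2, min(2, score)) + 2]

print(classify_sentiment("the service was good".split()))           # positive
print(classify_sentiment("terrible food and bad service".split()))  # very negative
```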
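A common baseline behind keyword extraction is TF-IDF scoring: words frequent in a document but rare across the corpus are likely keywords. The tiny corpus below is a stand-in; the actual system combines several machine-learning methods.

```python
import math
from collections import Counter

def keywords(doc, corpus, top_k=2):
    """Rank the document's terms by term frequency times smoothed inverse
    document frequency over the reference corpus."""
    tf = Counter(doc.lower().split())
    n_docs = len(corpus)

    def idf(term):
        df = sum(term in d.lower().split() for d in corpus)
        return math.log((1 + n_docs) / (1 + df)) + 1

    scored = {t: c * idf(t) for t, c in tf.items()}
    return [t for t, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:top_k]]

corpus = ["machine translation systems",
          "translation quality metrics",
          "deep learning models"]
print(keywords("neural machine translation translation", corpus))
# ['translation', 'neural']
```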
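The N-gram-plus-Bayes approach named in the language-recognition description can be sketched as character-bigram naive Bayes: each language model counts character bigrams, and the classifier picks the language with the highest smoothed log-likelihood. The two training snippets are toy data, not the production corpora.

```python
import math
from collections import Counter

def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    """samples: {language: training text} -> per-language bigram counts."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}

def identify(text, models):
    def log_prob(counts):
        total = sum(counts.values())
        vocab = len(counts) + 1
        # Laplace-smoothed log-likelihood of the text's bigrams.
        return sum(math.log((counts[b] + 1) / (total + vocab))
                   for b in bigrams(text))
    return max(models, key=lambda lang: log_prob(models[lang]))

models = train({
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
})
print(identify("the lazy fox", models))  # en
```

Real systems use far larger n-gram profiles per language, but the decision rule is the same Bayesian argmax.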
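Finally, the basic operation behind the multilingual word-embedding library is nearest-neighbor lookup by cosine similarity in the shared vector space. Real embeddings are trained by a neural network on parallel corpora; these 3-dimensional vectors are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embedding table; in a multilingual library, words from different
# languages would share this space via a bridge language.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def nearest(word, table):
    """Most similar other word in the embedding table."""
    return max((w for w in table if w != word),
               key=lambda w: cosine(table[word], table[w]))

print(nearest("king", embeddings))  # queen
```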

Contact Us