
Natural Language Processing


Introduction

GTCOM has long been engaged in the R&D and application of natural language processing (NLP) technologies. Adopting advanced machine-learning techniques such as recurrent neural networks, convolutional neural networks, conditional random fields, support vector machines, and random forests, GTCOM makes good use of its 100-billion language database to build highly accurate and efficient NLP algorithms.

There are a total of 65 algorithms in about a dozen categories, covering functions such as word segmentation, part-of-speech tagging, named-entity recognition, sensitivity determination, sentiment analysis, automatic summarization, keyword extraction, text classification, text quality assessment, hot topic clustering, event element extraction, and knowledge graph building. They support more than 30 languages and can process texts in vertical fields such as finance, science and technology, and engineering.


For more information: Globalbiz@gtcom.com.cn

Applications

  • Word segmentation and part-of-speech tagging
  • The term "word segmentation" refers to the process of dividing a string of written language into its component words according to grammatical norms. The term "part-of-speech tagging" refers to the process of assigning each word in a sequence its most likely part of speech.
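As a minimal illustration of the segmentation task (not GTCOM's production method), a dictionary-based forward-maximum-matching segmenter can be sketched in a few lines; the lexicon and maximum word length below are toy assumptions:

```python
# Forward maximum matching: a classic baseline for segmenting unspaced
# text such as Chinese. The lexicon is a toy illustration.
LEXICON = {"自然", "语言", "处理", "自然语言", "自然语言处理"}
MAX_WORD_LEN = 6

def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedily match the longest dictionary word at each position,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

print(segment("自然语言处理"))  # longest match wins: ['自然语言处理']
print(segment("语言处理"))      # ['语言', '处理']
```

Greedy longest-match is only a baseline; statistical segmenters resolve ambiguities that a fixed dictionary cannot.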
  • Named-entity recognition
  • Named-entity recognition is an important tool for applications such as information extraction, question-answering systems, semantic understanding and machine translation. Thus, it plays a fundamental role in natural-language processing. We employ a statistical machine-learning method trained on massive corpora, which has achieved good results in a variety of Chinese and English application scenarios.
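To make the task concrete, the sketch below shows a gazetteer-and-pattern baseline for entity recognition; the entity list is a toy assumption, and the statistical models described above replace this lookup with learned decisions:

```python
import re

# Gazetteer-based NER baseline: look up known names and report their
# label and character offset. Purely illustrative; statistical NER
# systems generalize to unseen names.
GAZETTEER = {
    "GTCOM": "ORG",
    "Beijing": "LOC",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return (surface form, label, start offset) tuples, sorted by offset."""
    entities = []
    for name, label in gazetteer.items():
        for m in re.finditer(re.escape(name), text):
            entities.append((name, label, m.start()))
    return sorted(entities, key=lambda e: e[2])

print(find_entities("GTCOM is headquartered in Beijing."))
```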
  • Sentiment analysis
  • The text sentiment analysis algorithm automatically analyzes and recognizes the opinions or attitudes expressed in articles and provides indicators that express the polarity and intensity of sentiment. Because it analyzes sentiment polarity, the algorithm plays a crucial role in public-opinion monitoring, topic monitoring and reputation analysis.
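A minimal lexicon-based scorer illustrates how polarity and intensity indicators can be produced; the word weights and negator handling below are illustrative assumptions, not the statistical model described above:

```python
# Lexicon-based sentiment scoring: sum per-word polarity weights,
# flipping the sign of the word that follows a negator. Toy weights.
POLARITY = {"good": 1.0, "excellent": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATORS = {"not", "never"}

def sentiment(tokens, lexicon=POLARITY):
    """Return a signed score: sign = polarity, magnitude ~ intensity."""
    score, flip = 0.0, 1
    for tok in tokens:
        w = tok.lower()
        if w in NEGATORS:
            flip = -1          # negate the next sentiment-bearing word
            continue
        score += flip * lexicon.get(w, 0.0)
        flip = 1
    return score

print(sentiment("the service was not good".split()))  # -1.0
```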
  • Keyword extraction
  • The keyword-extraction algorithm is used to extract text subjects and help users quickly obtain the desired core content. It integrates a variety of machine-learning methods and a large amount of corpus resources, and it currently supports 10 languages: Chinese, English, Japanese, Korean, Russian, Portuguese, Spanish, French, German and Arabic. It also facilitates the quick generation of keyword-extraction tools for other languages by using open data.
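One standard building block for keyword extraction is TF-IDF scoring, sketched below over a toy corpus (an illustrative baseline, not GTCOM's integrated method): words frequent in one document but rare across the corpus score highest.

```python
from collections import Counter
import math

def keywords(docs, doc_index, top_k=3):
    """Rank words of docs[doc_index] by TF-IDF with add-one smoothing."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(w for doc in tokenized for w in set(doc))  # document frequency
    tf = Counter(tokenized[doc_index])                      # term frequency
    n = len(docs)
    scores = {w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

docs = [
    "machine translation systems translate text",
    "keyword extraction finds the core text terms",
    "speech recognition converts audio to text",
]
print(keywords(docs, 1))  # "text" appears everywhere, so it scores 0
```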
  • Text summarization
  • The automatic summarization algorithm automatically generates a concise, coherent summary that expresses the core content of the original document. It efficiently compresses the original text and assists users in reading efficiently. We employ a data-driven machine-learning method that adapts to the characteristics of Internet big data and has such advantages as no field restrictions, high computing efficiency, rapid generation and controllable summary length, thereby satisfying the needs of search engines, intelligent Q&A and other applications.
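A frequency-based extractive summarizer illustrates one simple way to compress text with a controllable output length; this is a common baseline sketch, not the data-driven method described above:

```python
from collections import Counter
import re

def summarize(text, max_sentences=1):
    """Score each sentence by the average corpus frequency of its words;
    keep the top sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))
    def score(s):
        words = re.findall(r"\w+", s.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)
    ranked = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in ranked)

text = ("Translation quality depends on data. "
        "Data volume and data quality both matter. "
        "The weather was pleasant.")
print(summarize(text))  # picks the sentence densest in frequent words
```

The `max_sentences` parameter is the hook for the "controllable summary length" property mentioned above.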
  • Language recognition
  • The language recognition algorithm refers to the process of automatically determining the language of the input text. Based on N-gram models and Bayes' theorem, we have developed a set of language recognition technologies that support dozens of languages. Recognition of Simplified Chinese, Traditional Chinese, English, Japanese, Korean, Russian, Portuguese, Spanish, French, German and Arabic has been optimized for greater accuracy.
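The N-gram/Bayes approach described above can be sketched with character bigrams and a smoothed Naive Bayes decision; the two training samples below are tiny illustrations, whereas real language profiles are built from large corpora:

```python
from collections import Counter
import math

def bigrams(text):
    """Character bigrams of the padded, lowercased text."""
    t = f" {text.lower()} "
    return [t[i:i + 2] for i in range(len(t) - 1)]

def train(samples):
    """samples: {language: training text} -> per-language bigram counts."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}

def identify(text, models):
    """Pick the language maximizing the add-one-smoothed log-likelihood."""
    def loglik(counts):
        total = sum(counts.values())
        vocab = len(counts) + 1
        return sum(math.log((counts[b] + 1) / (total + vocab))
                   for b in bigrams(text))
    return max(models, key=lambda lang: loglik(models[lang]))

models = train({
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
})
print(identify("the brown dog", models))  # en
```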
  • Text classification
  • The text-classification algorithm refers to the process of automatically marking text categories according to a classification system or standard. It can be used to classify unstructured information according to a given system; it serves as the basis for the application and management of massive data and accommodates a wide variety of application scenarios. We've established a classification standard that meets both industry standards and users' behavioral habits by means of incorporation and mapping based on the secondary classification system in GB/T 20093-2013, Classification and Code of News in Chinese, in combination with data and product characteristics. Currently, the text-classification algorithm supports Chinese and English.
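A multinomial Naive Bayes classifier with add-one smoothing is the standard textbook baseline for this task; the labels and training texts below are toy assumptions:

```python
from collections import Counter, defaultdict
import math

def train(labeled_docs):
    """labeled_docs: [(label, text)] -> per-class word counts and priors."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for label, text in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the highest smoothed log-posterior."""
    vocab = {w for c in word_counts.values() for w in c}
    best, best_lp = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        lp = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab) + 1))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([
    ("finance", "stocks bonds market earnings"),
    ("technology", "software neural network model"),
])
print(classify("the market earnings report", model[0], model[1]))  # finance
```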
  • Sensitivity determination
  • The sensitivity-determination algorithm is mainly used to filter sensitive information, including that of a reactionary, pornographic and/or violent nature. Based on a statistical machine-learning model, we've implemented a sensitivity-analysis system that combines statistics and rules, using massive manually annotated corpus resources together with multilingual sensitive-word lexicons built from linguistic knowledge and word embeddings. The sensitivity-determination algorithm currently supports Chinese and English.
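The rule-based half of such a system can be sketched as a lexicon match; the word list below is a toy placeholder, and in the system described above this layer is combined with a statistical model:

```python
import re

# Lexicon rule layer of a sensitivity filter. The word list is a toy
# illustration of the multilingual sensitive-word lexicons mentioned above.
SENSITIVE = {"violence", "gore"}

def flag_sensitive(text, lexicon=SENSITIVE):
    """Return (is_sensitive, sorted list of matched lexicon words)."""
    words = set(re.findall(r"\w+", text.lower()))
    hits = sorted(words & lexicon)
    return (len(hits) > 0, hits)

print(flag_sensitive("This scene contains graphic violence."))
```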
  • Text-quality assessment
  • The text-quality assessment algorithm is used to filter and clean data collected by Internet users, thereby improving information quality and enhancing user experience. It can quickly identify noisy data containing garbled characters, codes and scripts as well as casually written, syntactically chaotic "useless" data using technologies such as machine learning and intelligent recognition.
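One simple heuristic in this family flags text dominated by non-word characters, which catches garbled bytes and code fragments; the threshold below is an illustrative assumption, not the algorithm's actual parameters:

```python
import re

def is_noisy(text, max_symbol_ratio=0.3):
    """Flag text whose share of punctuation/symbol characters is high,
    a cheap proxy for garbled characters, code and scripts."""
    if not text.strip():
        return True
    symbols = len(re.findall(r"[^\w\s]", text))
    return symbols / len(text) > max_symbol_ratio

print(is_noisy("var x={a:1,b:[2,3]};//"))       # True: code fragment
print(is_noisy("A clean, readable sentence."))  # False
```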
  • Event element extraction
  • The event element extraction algorithm can structure unstructured natural-language texts and can be used for in-depth analysis and mining of news events. We use an unsupervised learning method to extract the most important time, place, person and event features, as well as other information, from an article without massive manually annotated corpora. Thus, it more fully accommodates the needs of open data processing in the age of big data.
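To show the shape of the output, the sketch below pulls time and place elements with regular expressions; the patterns are illustrative assumptions, whereas the unsupervised method described above learns such elements from data:

```python
import re

def extract_elements(text):
    """Extract a structured dict of event elements from raw text.
    Toy patterns: an ISO-style date or clock time, and 'in <Placename>'."""
    elements = {}
    time = re.search(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}:\d{2})\b", text)
    place = re.search(r"\bin ([A-Z][a-z]+)", text)
    if time:
        elements["time"] = time.group(1)
    if place:
        elements["place"] = place.group(1)
    return elements

print(extract_elements("The summit opened in Beijing on 2019-05-16."))
```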
  • Multilingual word embedding
  • The term "word embedding" comes from the field of deep learning. A word embedding encodes not just the word itself but also its semantic relationships with other words, making embeddings an important means of efficiently and quantitatively representing natural-language vocabulary. Accordingly, we use a neural network model to build a multilingual word-embedding library from massive parallel corpora, with Chinese or English as the bridge language and monolingual and sentence-aligned corpora as training data. The library can quickly solve a variety of cross-language tasks, including multilingual text classification, multilingual text clustering, multilingual sentiment analysis and cross-language search engines.
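Once words from different languages share one vector space, cross-language lookup reduces to cosine similarity. The vectors below are fabricated 3-dimensional illustrations, not real embeddings from the library described above:

```python
import math

# Toy shared embedding space keyed by (language, word).
EMBEDDINGS = {
    ("en", "cat"):  [0.90, 0.10, 0.00],
    ("zh", "猫"):   [0.88, 0.12, 0.02],
    ("en", "bank"): [0.10, 0.90, 0.30],
    ("zh", "银行"): [0.12, 0.88, 0.28],
}

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nearest(query, embeddings):
    """Most similar entry in any language other than the query's."""
    qv = embeddings[query]
    candidates = [k for k in embeddings if k[0] != query[0]]
    return max(candidates, key=lambda k: cosine(qv, embeddings[k]))

print(nearest(("en", "cat"), EMBEDDINGS))  # ('zh', '猫')
```

This same nearest-neighbor pattern underlies the cross-language search and multilingual classification uses listed above.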
