Introduction

Text mining is the automated analysis of a single document or a collection of documents (a corpus) to extract non-trivial information. It transforms unstructured text into a structured representation by analyzing patterns derived from the text. The results can then be analyzed to discover knowledge that would otherwise be found only by carefully reading and analyzing the text. Widely used text mining tasks include automatic text classification/categorization, topic extraction, concept extraction, document/term clustering, sentiment analysis, and frequency-based analysis. Some of these tasks are impractical to perform manually on large collections, which makes text mining a particularly useful tool in modern data science.

Analytic Solver Data Science takes an integrated approach to text mining: it does not fully separate the analysis of unstructured data from the traditional data mining techniques that apply to structured information. While Analytic Solver Data Science is a powerful tool for analyzing text alone, it also offers automated treatment of mixed data (i.e., a combination of multiple unstructured and structured fields). This feature has many real-world applications, such as analyzing maintenance reports, evaluation forms, and insurance claims.

Analytic Solver Data Science uses the bag-of-words model, a simplified representation of text in which the precise grammatical structure and exact word order are disregarded. Syntactic, frequency-based information is preserved and used to represent the text. This approach has been shown to work well for applications such as text categorization and concept extraction, which are the areas addressed by Analytic Solver Data Science's text mining capabilities. Many theoretical and empirical studies have shown that syntactic similarity often implies semantic similarity. One way to access syntactic relationships is to represent text in terms of the Generalized Vector Space Model (GVSM). The advantage of such a representation is a meaningful mapping of text to a numeric space. The disadvantage is that some semantic elements (e.g., word order) are lost.
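The bag-of-words idea can be illustrated with a minimal sketch in Python. This is not how Analytic Solver Data Science is implemented; the function names `bag_of_words` and `cosine_similarity` and the sample sentences are assumptions made up for illustration, and the similarity shown is the plain vector space cosine rather than the generalized model mentioned above.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample documents (not taken from the product documentation).
doc1 = bag_of_words("The dog bit the man")
doc2 = bag_of_words("The man bit the dog")
doc3 = bag_of_words("Engine oil needs replacing")

# Word order is discarded, so the first two documents map to the same vector.
print(doc1 == doc2)
# Documents that share terms are close in the numeric space ...
print(cosine_similarity(doc1, doc2))
# ... and documents with no terms in common are not.
print(cosine_similarity(doc1, doc3))
```

The first two sentences mean different things but receive identical representations, which is the loss of word order described above. The third sentence shares no terms with the others, so its similarity to them is zero.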

Input to Text Miner can be of two main types: a few relatively large documents (e.g., several books), or a relatively large number of smaller documents (e.g., a collection of emails, news articles, product reviews, comments, tweets, or Facebook posts). While Analytic Solver Data Science is capable of analyzing large text documents, it is particularly effective for large corpora of relatively small documents. This functionality has many applications, such as email spam detection, topic extraction in articles, automatic rerouting of correspondence, and sentiment analysis of product reviews.

The input for text mining is a data set on a worksheet, with at least one column that contains free-form text (or file paths to documents in a file system containing free-form text), and other columns that contain traditional structured data. In the Data Source tab of the Text Mining dialog, the user selects the text variable(s) and the other variable(s) to be processed.

The output of text mining is a set of reports that contain general exploratory information about the collection of documents, together with structured representations of the text. Free-form text columns are expanded into a set of new numeric columns. Each new column corresponds to either 1) a single term (word) found in the corpus of documents, or 2) a concept extracted from the corpus through Latent Semantic Indexing (LSI), also called Latent Semantic Analysis (LSA). Each concept represents an automatically derived combination of terms/words that have been identified as related to a particular topic in the corpus.
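Both kinds of output column can be sketched in a few lines of Python. The sketch builds a document-term matrix (one numeric column per term) and then extracts a single concept as the dominant singular vector of that matrix, using power iteration. It is a simplified illustration under stated assumptions: the five-document corpus and the helper name `leading_concept` are hypothetical, raw counts are used with no weighting, only one concept is computed, and nothing here describes the product's own LSI implementation, which would normally retain several concepts.

```python
import math
import re

# Hypothetical toy corpus: three documents about cars, two about electronics.
docs = [
    "engine brake engine tire",
    "brake tire engine",
    "engine tire",
    "screen battery",
    "battery screen charger",
]

tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]
terms = sorted({t for doc in tokens for t in doc})
# Document-term matrix: one row per document, one numeric column per term.
matrix = [[doc.count(t) for t in terms] for doc in tokens]


def leading_concept(a, iterations=100):
    """Power iteration on A^T A; returns the dominant right singular vector."""
    n_docs, n_terms = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(iterations):
        # av = A v (one value per document)
        av = [sum(a[i][j] * v[j] for j in range(n_terms)) for i in range(n_docs)]
        # w = A^T (A v) (one value per term)
        w = [sum(a[i][j] * av[i] for i in range(n_docs)) for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Term loadings: how strongly each term contributes to the concept.
loadings = leading_concept(matrix)
# Concept column: the projection of each document onto the concept.
scores = [sum(row[j] * loadings[j] for j in range(len(terms))) for row in matrix]

for term, weight in zip(terms, loadings):
    print(f"{term:10s} {weight:.3f}")
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

Because the two topics in this toy corpus share no vocabulary, the extracted concept loads only on the car-related terms. The car documents receive large scores in the new concept column, while the electronics documents score near zero.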

The structured representation of text can serve as input to any of the traditional data science techniques available in Analytic Solver Data Science: unsupervised/supervised, affinity, and visualization techniques. In addition, Analytic Solver Data Science presents visual representations of text mining results for interactively exploring information that would be extremely hard to analyze manually. Typical visualizations produced by Analytic Solver Data Science that aid in understanding text mining outputs include:

  • Zipf plot - for visual/interactive exploration of frequency-based information extracted by Text Miner
  • Scree Plot, Term-Concept, and Document-Concept 2D scatter plots - for visual/interactive exploration of Concept Extraction results
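The data behind a Zipf plot is a table of terms ranked by frequency. The following sketch computes that table for a hypothetical three-document corpus and prints a rough text rendering; the corpus and the variable name `zipf_points` are assumptions for illustration, and the plots produced by Analytic Solver Data Science are interactive charts rather than text output.

```python
import re
from collections import Counter

# Hypothetical toy corpus.
corpus = [
    "the engine and the brake",
    "the tire and the engine",
    "the screen",
]

counts = Counter(
    word for doc in corpus for word in re.findall(r"[a-z]+", doc.lower())
)

# Rank 1 is the most frequent term; a Zipf plot charts frequency against rank.
zipf_points = [
    (rank, term, freq)
    for rank, (term, freq) in enumerate(counts.most_common(), start=1)
]

for rank, term, freq in zipf_points:
    print(f"{rank:>2}  {term:<8} {'#' * freq}")
```

In real corpora the frequency typically falls off steeply with rank, so a few very common terms dominate the top of the plot.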

To visualize specific parts of the text mining outputs, Analytic Solver Data Science also provides charting functionality that can be used to explore text mining results and supplement the standard charts.

The following example illustrates how to use Text Miner in Analytic Solver Data Science to process and analyze approximately 1,000 text files and use the results for automatic topic categorization. This is achieved by passing the structured representation of the text to Logistic Regression to build the classification model.
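Before working through the example in the product, it may help to see the overall idea in miniature. The sketch below converts a handful of documents into term-count vectors and fits a logistic regression classifier with plain gradient descent. Everything in it is an assumption made for illustration: the six training sentences, the labels, the learning rate and epoch count, and the function names are hypothetical, and it does not reproduce the algorithm or settings used by Analytic Solver Data Science.

```python
import math
import re

# Hypothetical labeled corpus: 1 = autos, 0 = electronics.
training = [
    ("engine brake tire", 1),
    ("brake engine oil", 1),
    ("tire oil engine", 1),
    ("screen battery charger", 0),
    ("battery screen pixel", 0),
    ("charger pixel screen", 0),
]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


vocabulary = sorted({t for text, _ in training for t in tokenize(text)})


def vectorize(text):
    """Structured representation: one term count per vocabulary entry."""
    tokens = tokenize(text)
    return [tokens.count(term) for term in vocabulary]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, learning_rate=0.5, epochs=200):
    """Batch gradient descent on the logistic (log-loss) objective."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def predict(text, weights, bias):
    """Return 1 (autos) or 0 (electronics); unknown words are ignored."""
    x = vectorize(text)
    return 1 if sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0


weights, bias = train([vectorize(t) for t, _ in training], [y for _, y in training])
print(predict("engine oil", weights, bias))
print(predict("battery pixel", weights, bias))
```

The classifier never sees the raw text, only the numeric columns derived from it, which is the sense in which the structured representation is presented to the model.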