Product Description
This conceptual introduction to data mining within the context of business and marketing research provides an eclectic approach to the field. Using worked examples and business case studies, the volume answers four questions: why is data mining important to business and marketing research; how is data mining different from other types of research; what do we learn from data mining; and how do we do data mining? The book explains data mining, traditional methods…
Data and Text Mining: A Business Applications Approach
One of the main reasons I bought the book was the promise of case data and sample code, especially in R (and S-PLUS). However, the Prentice Hall site had only presentation slides (PDF files) and no data or code. Moreover, their own tech support had no clue as to why these files were missing.
This is a classic example of overpromise and underdeliver. I would avoid this book in the future.
Rating: 1 / 5
I appreciate a book that lands in my “zone of proximal development”–that efficiently reviews familiar material to establish context, then extends my knowledge into new areas. Thomas Miller’s five-chapter introduction to data mining is appropriate continuing education for researchers who want to learn data-adaptive methods and text analysis. It was “in the zone” for me when it first came out. I keep it around as a model of how to explain data and text mining to others.
In Chapter 1 (“What is Data Mining?”), Miller compares data mining and traditional research, discussing not only specific statistical procedures, but different approaches to selecting a statistical model and the requirements for testing an emergent model with new data. Miller warns researchers about claims made by software vendors–this type of analysis is not automatic. We should not underestimate resources needed to identify appropriate data sources, restructure data, and clean up missing and miscoded data.
Chapter 2 (“Traditional Methods”) reviews multiple regression, logistic regression, principal components analysis and cluster analysis. Miller illustrates principles of data preparation and reduction. He emphasizes the need to partition data into training sets to develop models, validation sets to compare models, and test sets to evaluate a selected model. He stresses the role of parsimony and goodness of fit in selecting the best model.
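For readers who want to see what such a partition looks like in practice, here is a minimal sketch in Python (not the book's R or S-PLUS code). The function name `partition`, the 60/20/20 proportions, and the fixed seed are my own assumptions for illustration; the book does not prescribe them.

```python
import random

def partition(rows, train=0.6, valid=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets.

    The proportions and seed are illustrative assumptions, not values
    taken from the book.
    """
    if train <= 0 or valid <= 0 or train + valid >= 1:
        raise ValueError("train and valid must be positive and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    training = shuffled[:n_train]                     # develop candidate models
    validation = shuffled[n_train:n_train + n_valid]  # compare candidate models
    test = shuffled[n_train + n_valid:]               # evaluate the selected model
    return training, validation, test

train_set, valid_set, test_set = partition(range(100))
```

The point of keeping the test set untouched until the end is that it gives an honest estimate of how the selected model performs on data it has never seen.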
Chapter 3 overviews “Data Adaptive Methods” appropriate for large data sets with many variables. These techniques produce models which emerge from the data. Challenges include choosing between many possible models and prioritizing relationships between variables when large numbers of observations make most relationships statistically significant. Miller reviews data visualization techniques, decision tree procedures, smoothing methods that make patterns more interpretable, and association-driven neural networks which “learn” to find patterns.
The fourth chapter (“Text Mining”) presents procedures and resources needed to prepare text for quantitative analysis. It introduces issues ranging from quick-and-dirty text data “munging” (reformatting) using Perl scripts to the core concepts of natural language processing. Miller shows how to transform text data into a “term by document” matrix that can be analyzed with statistical procedures. He explains how (and why) to capture information about sentence syntax, root words, word frequencies, phrases, and other text features.
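To make the “term by document” idea concrete, here is a small sketch in Python rather than the Perl used in the book. The function name `term_document_matrix`, the simple lowercase tokenizer, and the two sample documents are assumptions of mine; a real project would add stemming, stop-word removal, and similar steps.

```python
import re
from collections import Counter

def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    terms = sorted({token for tokens in tokenized for token in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix

# Hypothetical sample documents, not data from the book.
docs = ["The service was good", "Good price, good service"]
terms, matrix = term_document_matrix(docs)
```

Each row is a term and each column a document, so the matrix can be handed to ordinary statistical procedures such as clustering or regression.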
This chapter explores the potential of creating “text measures” by “scoring documents based on predefined measurement categories” (p. 120). The most basic text measure uses the frequency of specific words to produce a score on a predefined dimension, such as Realism or Optimism. Content “dictionaries” or lists of carefully selected words can be constructed to measure text on many such dimensions. Miller sees promise for text measures in marketing research. There is also potential for those interested in resumes (see Kathryn Troutman’s Federal Resume Guidebook) and other employment documents.
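A basic text measure of this kind can be sketched in a few lines of Python. The word list `OPTIMISM_WORDS`, the function name `text_measure`, and the choice to normalize by document length are hypothetical choices of mine, not the dictionaries or scoring rules used in the book.

```python
import re

# Hypothetical content dictionary for an "Optimism" dimension.
OPTIMISM_WORDS = {"hope", "improve", "growth", "confident"}

def text_measure(document, dictionary):
    """Score a document as the share of its words found in the dictionary."""
    tokens = re.findall(r"[a-z']+", document.lower())
    if not tokens:
        return 0.0  # an empty document scores zero by convention here
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)

score = text_measure("We are confident that growth will improve", OPTIMISM_WORDS)
```

Dividing by the number of words keeps long documents from scoring higher simply because they are long; raw counts are the simpler alternative.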
The fifth chapter (“And in Conclusion…”) describes project management strategies that help data mining projects succeed. Two appendices contain information about the example data sets and caution researchers about data mining privacy concerns. The author implements most statistical and text manipulation algorithms in freely available open-source tools such as R and Perl, although he refers to the commercial tools S-PLUS and Insightful Miner as well.
Additional topics could have been included in a longer version of the book. Some statistical procedures could have been treated in greater depth, and research design issues from content analysis (see Krippendorff’s Content Analysis: An Introduction to Its Methodology) could have been included in the discussion of text mining. But overall the author has made reasonable compromises for an introductory text. It’s a good and useful read.
Rating: 4 / 5