A Small Space for Ideas, Notes, Data Visualization, Collecting Resources, and Remembering What You’ve Read

Wednesday, January 7, 2015

Artificial Intelligence for Machine Learning: Data Mining Notes

Reading Assignment: DM Ch. 1; ISL Ch 1, 2.1

Data Mining

What’s It All About?

Making sense of enormous amounts of data-- minimizing the gap between the generation of data and our understanding of it. Looking for patterns!

Data- a set of recorded facts
Information- the set of patterns, or expectations, that underlie the data
Knowledge- the accumulation of your set of expectations
Wisdom- the value attached to your knowledge.

1.1 Data Mining and Machine Learning

Data mining- Searching pre-existing databases for patterns. The data is stored electronically and then searched by computer.
Machine Learning- techniques to find patterns in data

Describing Structural Patterns

Patterns that describe trends in the data- it could be a set of rules that determine an outcome, or it could be structured as a more elaborate decision tree to determine an outcome. More simply, it could be that women tend to be shorter than men, or that it is more likely to be sunny in the summer.  

Machine Learning

My vague definition: Purposeful electronic work that gains and uses knowledge to improve an outcome. 

Data Mining

Finding and describing patterns in data. The output may also include an actual description of a structure that can be used to classify unknown examples.

1.2 Simple Examples: The Weather Problem and Others

The Weather Problem

Numeric-attribute problem- A problem in which the learning scheme must create inequalities rather than simple Boolean truths (yes/no outcomes).
Mixed-attribute problem- A problem in which some attributes are numeric and others are not (for example, nominal or Boolean).
Classification rules- rules that predict the classification of the example (play, or not play?)
Association rules- rules that are derived from correlations between attributes in a data set (if it is sunny, it might also be more likely to be hot)
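The difference between the two kinds of rules can be sketched in a few lines of Python. The records and the specific rules below are hypothetical, made up only for illustration: classification rules always predict the play outcome, while an association rule can relate any attributes to each other and is judged by how often it holds.

```python
# Toy weather records (hypothetical values, for illustration only).
records = [
    {"outlook": "sunny", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "windy": True, "play": "yes"},
    {"outlook": "overcast", "humidity": "high", "windy": False, "play": "yes"},
    {"outlook": "rainy", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "rainy", "humidity": "normal", "windy": False, "play": "yes"},
]


def classify(record):
    """Classification rules: tried in order, they always predict `play`."""
    if record["outlook"] == "sunny" and record["humidity"] == "high":
        return "no"
    if record["outlook"] == "rainy" and record["windy"]:
        return "no"
    return "yes"  # default when no earlier rule fires


def confidence(records, antecedent, consequent):
    """Association rule check: of the records matching `antecedent`,
    what fraction also match `consequent`?"""
    matching = [r for r in records
                if all(r[k] == v for k, v in antecedent.items())]
    if not matching:
        return 0.0
    hits = [r for r in matching
            if all(r[k] == v for k, v in consequent.items())]
    return len(hits) / len(matching)


predictions = [classify(r) for r in records]
sunny_implies_humid = confidence(records, {"outlook": "sunny"}, {"humidity": "high"})
```

On this toy data the classification rules reproduce every play value, while the association rule "if sunny then humid" holds for only half of the sunny days.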

Contact Lenses: An Idealized Problem

A very complicated problem that asks questions about how simple our "set of rules" for making decisions should be. 

Irises: A Classic Numeric Dataset

Looks at plants, a data set often used for data mining training. All attributes are numeric, and you are trying to predict the type of Iris flower the measurements come from.

CPU Performance: Introducing Numeric Prediction

Essentially introducing regression-- "continuous prediction" via a sum of attributes with appropriate weights based on how much they matter.
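A weighted sum like this is short enough to write out. This is only a sketch: the attribute names, weights, and constant term are hypothetical placeholders, not fitted values from the CPU data set.

```python
def predict_performance(attributes, weights, bias):
    """Numeric prediction as a weighted sum of attributes plus a constant."""
    return bias + sum(weights[name] * value for name, value in attributes.items())


# Hypothetical weights: a larger weight means the attribute matters more.
weights = {"cache_kb": 0.5, "max_memory_mb": 4.0, "channels": 2.0}
bias = -10.0

# One hypothetical machine described by its numeric attributes.
machine = {"cache_kb": 64, "max_memory_mb": 8, "channels": 4}

predicted = predict_performance(machine, weights, bias)
```

In a real regression the weights are not chosen by hand; they are estimated from the training data so that the predictions match the known outcomes as closely as possible.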

Labor Negotiations: A More Realistic Example

Looks at the outcome of a proposed contract between a union and a business, public, or private service: acceptable, or unacceptable.

Soybean Classification: A Classic Machine Learning Success

Domain knowledge- Basically, prior information that guides how you structure your "rules"

1.3 Fielded Applications

Projects that were actually done in the real world-- other data sets are "toy", or "test" data sets. 

Web Mining

What shows up first in a Google query? Figuring out page "prestige"- how often is it visited? How many other pages link to it? How relevant is its content to the query?
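One well-known way to turn "how linked is it?" into a number is PageRank, sketched here with simple power iteration. The link graph is a hypothetical four-page site, the damping factor of 0.85 is a conventional choice rather than anything from these notes, and this covers only the link-based part of prestige, not visits or query relevance.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate link-based prestige by repeatedly passing rank along links.

    `links` maps each page to the list of pages it links to; every linked
    page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site: nothing links to "orphan", many pages link to "home".
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(links)
most_prestigious = max(ranks, key=ranks.get)
```

A page ends up with high rank when many pages, or a few highly ranked pages, link to it.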

Decisions Involving Judgment

A very basic set of rules determines the clear-cut cases of loan offers, but the borderline cases are often left to human evaluation. Machine learning refines these rules, using past loan defaults as the outcome, to reduce risk to the loan company.

Screening Images

Remote sensing and image differentiation-- how can you tell where an oil slick is on the surface of the ocean?

Load Forecasting

Increasingly complex data were fed into a regression to predict electrical load. 

Diagnosis

Domain experts in the field often review and approve the rules that machine learning produces.

Marketing and Sales

Churn- turnover in the customer base.
Market Basket Analysis- the use of association techniques to find groups of items that tend to occur together in transactions, especially at checkout.
Loyalty cards are basically a way of tracking you as an individual; they let retailers make their business analysis more refined.
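Market basket analysis can be sketched with two standard measures: support (how often items appear together) and confidence (how often one item appears given another). The transactions below are hypothetical, and this brute-force pair count is only an illustration, not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout transactions, one set of items per basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_support(transactions):
    """Fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in transactions:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction with `consequent`."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(transactions)
bread_and_milk = support[("bread", "milk")]
diapers_then_beer = confidence(transactions, "diapers", "beer")
```

Pairs with high support and high confidence are the "groups of items" a retailer would look at first.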

Other Applications

There are lots of other examples, usually involving continuous monitoring or tasks that are tedious or time-consuming for humans.

1.4 Machine Learning and Statistics

Statistics plus marketing? In practice there is a continuum of data analysis techniques, some from statistics and some from computer science. Machine learning is often seen as a method for hypothesis generation, while statistics tends to deal with hypothesis testing.

1.5 Generalization as Search

This is the intellectual framework.
Imagining the problem of "learning" as a search problem- through some kind of variable space. 
Concept descriptions- the result of learning, expressed through rules or decision trees.
Keep in mind that the number of possible rules is finite: it is determined by the number of distinct values in each variable field (column). However, the number of rule combinations gets incredibly large, even when you restrict the number of rules to the number of examples in the data set. Also keep in mind that the accuracy of your predictions is never going to be better than the accuracy of the data you are using.
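The counting argument above can be made concrete. The sketch below assumes the attribute value counts of the textbook's weather problem (3 outlooks, 3 temperatures, 2 humidity levels, 2 windy values, 2 outcomes) and a data set of 14 examples; those figures are not in these notes, so treat them as assumptions.

```python
from math import prod

# Assumed number of distinct values per attribute in the weather problem.
attribute_values = {"outlook": 3, "temperature": 3, "humidity": 2, "windy": 2}
class_values = 2  # play: yes or no


def count_rules(attribute_values, class_values):
    """Count possible rules: each attribute is either left out of the rule
    or tested against one of its values (hence the +1), and the rule
    predicts one of the class values."""
    return prod(n + 1 for n in attribute_values.values()) * class_values


rules = count_rules(attribute_values, class_values)  # 4 * 4 * 3 * 3 * 2

# Assumed cap: as many rules as there are examples in the data set (14).
max_rules = 14
rule_lists = rules ** max_rules  # ordered lists of exactly 14 rules

print(f"{rules} possible rules, about {rule_lists:.1e} lists of {max_rules} rules")
```

Even for this tiny problem, a few hundred possible rules turn into roughly 10^34 candidate rule lists, which is why learning is framed as a guided search rather than an exhaustive one.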

1.6 Data Mining and Ethics

There are obvious ethical implications for data mining based on data from humans.

Reidentification

Basically, you can use patterns and rules to discover who individuals are in an "anonymous" data set. However, if you remove all possible identifying information from the database, you will probably have eliminated the trends in the data, and be left with a useless dataset.
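The trade-off can be illustrated by counting how many records are unique on a handful of seemingly harmless fields. The records and field names here are hypothetical; the point is only that a combination of ordinary attributes can single someone out even when names are removed.

```python
from collections import Counter

# Hypothetical "anonymous" records: no names, but several ordinary fields.
records = [
    {"zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "10001", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "10002", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "10002", "birth_year": 1990, "sex": "M", "diagnosis": "asthma"},
]


def unique_records(records, fields):
    """Return the records whose combination of `fields` appears only once."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return [record for record, key in zip(records, keys) if counts[key] == 1]


# With three fields, most people are singled out.
exposed_detailed = unique_records(records, ["zip", "birth_year", "sex"])

# With only the zip code, nobody is unique, but much of the detail is gone.
exposed_coarse = unique_records(records, ["zip"])
```

Dropping fields protects individuals but also throws away the attributes that patterns would have been built from, which is the tension described above.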

Using Personal Information

Introduces the issue that "discoveries" based on correlations between personal choices can have real-world outcomes (like the increase of insurance premiums). Personal choice data can be easily exploited to leverage purchasing decisions. 

Wider Issues

In essence, don't data mine in a vacuum. Realize that what you are doing has a real-world impact, and that

1.7 Further Reading