A Small Space for Ideas, Notes, Data Visualization, Collecting Resources, and Remembering What You’ve Read

Wednesday, January 7, 2015

Artificial Intelligence for Machine Learning: Data Mining Notes

Reading Assignment: DM Ch. 1; ISL Ch 1, 2.1

Data Mining

What’s It All About?

Making sense of enormous amounts of data-- minimizing the gap between the generation of data and our understanding of it. Looking for patterns!

Data- a set of recorded facts
Information- the set of patterns, or expectations, that underlie the data
Knowledge- the accumulation of your set of expectations
Wisdom- the value attached to your knowledge.

1.1 Data Mining and Machine Learning

Data mining- Searching pre-existing databases for patterns. Data are stored, then electronically searched and processed.
Machine Learning- techniques to find patterns in data

Describing Structural Patterns

Patterns that describe trends in the data- it could be a set of rules that determine an outcome, or it could be structured as a more elaborate decision tree to determine an outcome. More simply, it could be that women tend to be shorter than men, or that it is more likely to be sunny in the summer.  

Machine Learning

My vague definition: Purposeful electronic work that gains and uses knowledge to improve an outcome. 

Data Mining

Finding and describing patterns in data. The output may also include an actual description of a structure that can be used to classify unknown examples.

1.2 Simple Examples: The Weather Problem and Others

The Weather Problem

Numeric-attribute problem- A problem in which the learning scheme must create inequalities rather than simple Boolean truths (yes/no outcomes).
Mixed-attribute problem- Not all attributes are numeric, or Boolean. 
Classification rules- rules that predict the classification of the example (play, or not play?)
Association rules- rules that are derived from correlations between attributes in a data set (if it is sunny, it might also be more likely to be hot)
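
A minimal sketch of what such rules might look like in code for the classic weather dataset (attributes: outlook, temperature, humidity, windy; class: play). The specific rules here are illustrative examples, not the book's full rule set.

```python
# Illustrative classification rules for the weather problem.
def classify(example):
    """Tiny rule-based classifier: predict whether to play."""
    if example["outlook"] == "sunny" and example["humidity"] == "high":
        return "no"
    if example["outlook"] == "overcast":
        return "yes"
    return "yes"

# An association rule would instead relate attributes to each other,
# e.g. "outlook = sunny -> temperature = hot (more likely)".
print(classify({"outlook": "sunny", "temperature": "hot",
                "humidity": "high", "windy": False}))   # -> "no"
```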

Contact Lenses: An Idealized Problem

A very complicated problem that asks questions about how simple our "set of rules" for making decisions should be. 

Irises: A Classic Numeric Dataset

A classic plant dataset, often used for teaching data mining. All attributes are numeric, and you are trying to predict which type of iris flower the measurements come from.

CPU Performance: Introducing Numeric Prediction

Essentially introducing regression-- "continuous prediction" via a sum of attributes with appropriate weights based on how much they matter.
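
A minimal sketch of this kind of numeric prediction, using made-up attribute data and scikit-learn's linear regression to recover the attribute weights:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up numeric attributes (think cache size, memory, cycle time) and a
# performance outcome generated as a weighted sum plus a little noise.
rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 3))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # recovered weights for each attribute
```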

Labor Negotiations: A More Realistic Example

Looks at the outcome of a proposed contract between a union and a business, public, or private service: acceptable, or unacceptable.

Soybean Classification: A Classic Machine Learning Success

Domain knowledge- Basically, prior information that guides how you structure your "rules"

1.3 Fielded Applications

Projects that were actually done in the real world-- other data sets are "toy", or "test" data sets. 

Web Mining

What shows up first in a Google query? Figuring out page "prestige"- how visited is it? How linked is it from other pages? How relevant is the query to the actual content?
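
Page "prestige" is the idea behind link-analysis algorithms such as PageRank. A minimal power-iteration sketch on a made-up three-page link graph (the graph and damping factor are illustrative):

```python
import numpy as np

# Tiny made-up link graph: adjacency[i, j] = 1 means page i links to page j.
adjacency = np.array([[0, 1, 1],
                      [1, 0, 0],
                      [0, 1, 0]], dtype=float)

n = adjacency.shape[0]
out_degree = adjacency.sum(axis=1, keepdims=True)
M = adjacency / out_degree            # row-stochastic transition matrix
d = 0.85                              # damping factor

rank = np.full(n, 1 / n)
for _ in range(100):                  # power iteration
    rank = (1 - d) / n + d * M.T @ rank
print(rank / rank.sum())              # pages with more inbound links score higher
```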

Decisions Involving Judgment

A very basic set of rules determines the clear-cut cases of loan offers, but the borderline cases are often left to human evaluation. Machine learning refines these rules based on defaults as an outcome to reduce risk to the loan company.

Screening Images

Remote sensing and image differentiation-- how can you tell where an oil slick is on the surface of the ocean?

Load Forecasting

Increasingly complex data were fed into a regression to predict electrical load. 

Diagnosis

"Domain expertise" in the field often approves rules from machine learning.

Marketing and Sales

Churn- turnover in the customer base.
Market Basket Analysis- the use of association techniques to find groups of items that tend to occur in transactions, especially at checkout. 
Loyalty cards are basically a way of tracking you as an individual; they make the retailer's business analysis more refined. 

Other Applications

There are lots of other examples, usually involving continuous monitoring or tasks that are tedious or time-consuming for humans. 

1.4 Machine Learning and Statistics

Statistics plus marketing? There is a continuum of data analysis techniques spanning statistics and computer science. Often, machine learning is seen as a method for hypothesis generation, while statistics deals with hypothesis testing.

1.5 Generalization as Search

This is the intellectual framework.
Imagining the problem of "learning" as a search problem- through some kind of variable space. 
Concept descriptions- the result of learning, expressed through rules or decision trees.
Keep in mind that the number of possible rule sets is finite- it is determined by the number of different values in each attribute field (column). However... the number of rule combinations gets incredibly large, even when you restrict the number of rules to the number of examples in the data set. Keep in mind that the accuracy of your predictions is never going to be better than the accuracy of the data you are using.

1.6 Data Mining and Ethics

There are obvious ethical implications for data mining based on data from humans.

Reidentification

Basically, you can use patterns and rules to discover who individuals are in an "anonymous" data set. However, if you remove all possible identification information from the database, you will probably have eliminated the trends in the data, and now have a useless dataset. 

Using Personal Information

Introduces the issue that "discoveries" based on correlations between personal choices can have real-world outcomes (like the increase of insurance premiums). Personal choice data can be easily exploited to leverage purchasing decisions. 

Wider Issues

In essence, don't data mine in a vacuum. Realize that what you are doing has a real-world impact.

1.7 Further Reading

Friday, October 3, 2014

Censored Variables

Right Censored Data

For some observations, it is only known that the true value exceeds some threshold.
Example: Subjects are still alive at the end of the study- we don't know their time of death; we only know that it falls outside the time period of interest.

Left Censored Data

For some observations, it is only known that the true value is below some threshold. Example: a measurement with a lower limit of sensitivity, such as a pregnancy test. A negative result doesn't tell you that you aren't pregnant; it only tells you that the chemical the test looks for was below the test's sensitivity, either because you aren't pregnant or because the level is just below what the test can pick up (it's too early in the term).
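
A minimal sketch of how right-censored observations are usually represented and summarized (a hand-rolled Kaplan-Meier estimate on made-up data; left-censored data would need a different treatment):

```python
import numpy as np

# Each subject: (time, observed). observed=False means right-censored at that
# time, e.g. the subject was still alive when the study ended.
times    = np.array([5., 8., 8., 12., 15., 20., 20., 24.])
observed = np.array([True, True, False, True, False, True, False, False])

def kaplan_meier(times, observed):
    """Return (event_times, survival) for a simple Kaplan-Meier curve."""
    order = np.argsort(times)
    times, observed = times[order], observed[order]
    surv, out_t, out_s = 1.0, [], []
    for t in np.unique(times[observed]):           # distinct event times
        at_risk = np.sum(times >= t)                # still being followed at t
        deaths  = np.sum((times == t) & observed)   # events exactly at t
        surv *= 1 - deaths / at_risk
        out_t.append(t); out_s.append(surv)
    return out_t, out_s

print(kaplan_meier(times, observed))
```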




Monday, September 15, 2014

Statistics Bootcamp: Part 1


Two main aspects of statistics:
     Descriptive
     Inferential

Descriptive Statistics

Statistics Notation: 
In statistics, (parentheses) denote ordered sets, while [brackets] describe rounding to the nearest integer. 

_____________________________________________________

KEY TERMS


Measures of central tendency
-- Looking for ways to describe the majority of values
Robustness- A term describing how sensitive a statistic is to changes in the data set. 
Finite Sample Breakdown (FSB)- A theoretical measure of the robustness of a statistic: the smallest proportion of the data that must be distorted to make the statistic arbitrarily large or small.
Masking- A situation where the very presence of outliers hinders them from being detected.
Outlier- Although we use this term casually to mean an "extreme value," there are many ways of mathematically determining which data points are outliers within a set; calling a data point an "outlier" classifies it statistically by a specific definition. Its opposite? The outside rate per observation (you basically cap the proportion of observations that you would be willing to label as outliers; usually 20 or 25%).
_____________________________________________________


Measures of Central Tendency

Which measurement should you use? Well, it depends on what you are trying to "get at", and how sensitive you want your measurement to be to extreme data points.
  1. Mean
    • Sum of all #s in data set, divided by N of data set
    • FSB= 1/n (only 1 data point)
  2. Median 
    • "middle" data point in ordered set (or average of 2 middle points)
    • FSB= roughly 50%
  3. Mode
    • Most common value
  4. Trimmed mean
    • Mean, but calculated in a way that excludes extreme values; it cuts out a certain percentage of values (usually around 10-20%) from each side of the distribution, then calculates the mean
    • For a 20% trimmed mean, FSB= roughly 20%
      • What about small data sets? Does this mean trimmed means shouldn't be used?
          • No-- it means it's actually more important to deal with outliers, because one data point can have an even greater influence on the mean when the sample is small. Sure, you are cutting out precious data, but a median is essentially a mean where you are cutting out all but 1-2 data points. 
             

So far, these are only univariate measures. Some multivariate measures might include correlation.
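
A minimal sketch comparing these univariate measures on made-up data containing one wild value, using SciPy's trim_mean for the trimmed mean:

```python
import numpy as np
from scipy import stats

data = np.array([2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 2.9, 3.0, 3.1, 95.0])  # one wild outlier

print(np.mean(data))               # dragged way up by the single extreme point (FSB = 1/n)
print(np.median(data))             # barely affected (FSB ~ 50%)
print(stats.trim_mean(data, 0.2))  # 20% trimmed mean: cuts 20% from each tail before averaging
```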

Monday, September 8, 2014

Disability Weights

Disability Weights

Why do we want to do this, anyway?
     Assessment of disease burden and evaluations of comparative cost-effectiveness require outcome measures expressed in a common “currency” or unit
DALYs are used for burden, and quality-adjusted life years (QALYs) have become the standard unit for cost-effectiveness.

% reduction from perfect health- multiply that by the number of prevalent cases in any year.
E.g., if the weight for blindness were 0.2, then 5 people with blindness would account for 1 year of DALYs.

Visual Analogue Scale- give a scale between 0 and 100; 0 is death, 100 is perfect health.
·      Highly familiar to people from a variety of everyday experiences
·      Cognitive burden is relatively low.
However…
·      People tend to avoid the extremes of the scale.
o   For example, even a mild condition like the common cold tends to be rated well below perfect health, which exaggerates its weight.

The Standard Gamble
·      You are told you have a disabling condition, like blindness. There is a magic cure available that will restore your sight, but there is a risk of dying. Would you accept a 10% risk of dying? What about 20%? The weight is set at the "point of indifference".
o   If you’ve been deaf since birth, you have a greater problem in terms of acquisition of language, etc-
·      Advantage:
o   Related to choices under uncertainty (at least that is what (some) economists insist is a ‘must’)
·      Disadvantage:
o   More cognitively demanding- for vision there is some grounding in reality, but what about dementia? We don't have a magical surgical cure for dementia.
o   Does not correspond to typical decision making, as choices between life and death are not real scenarios
o   Variation between individuals in propensity to take risk

Time Trade-Off
·      Respondents determine how much time they would be willing to give up to be in a better rather than a poorer state of health.
o   There is an alternative in which you could live a shorter number of years with full sight: would you rather live 10 years with blindness, or 5 years with full sight?
o   Advantage: fits in with concept of measuring health loss in units of time
o   Disadvantages: enormously influenced by time preference- young people will give up more time than older people, and young parents will almost never give up any time ("I must be there for my kids!")

Person Trade-Off
You are a decision maker- you have one bag of money, and you can spend it in only one way. You can prevent death in 1,000 people, or you can prevent the onset of blindness in 2,000 people…
·      Advantage- closely related to resource allocation question
·      Disadvantage- probably the most cognitively demanding method; some people won't even answer the questions (they feel they are being asked to play God)

The most severe weights come from the visual analogue scale.
The Standard Gamble and Person Trade-Off give less severe weights: people are reluctant to accept deaths in exchange for preventing disability, and many people tend to be risk averse.
Time Trade-Off: somewhere in between.
·      There is no systematic pattern- it is very difficult to find a way to translate between the methods.

Furthermore… who do you ask?
·      Individuals in health state
o   Adaptation- people can improve their functioning based on doing things in a different way.
o   Coping- Mental resilience- reframing; no change in ability to do anything, but you change your expectations
o   Adjustment- You reconstruct your notion of what health is- you put a higher weight on ability to think rather than run, etc…
·      Health care providers
o   Medical training warps you- tend to rate things higher
o   Selection bias- they tend to see the most serious cases
·      General public
·      Patient Families
o   They suffer the consequences- they tend to give the worst values.
Expert panel used “person trade-off”  to assign values to 22 indicator conditions.

Disability Weights Measurement Study
GBD 2010- derive weights for 220 unique sequelae
Address criticisms of previous approaches by:
·      Focusing on valuations from community respondents
·      In a diverse range of settings
·      Using suitable measurement methods

Specific research aims
·      Population-based household surveys in 5 primary sites
o   Tanzania, Bangladesh, Indonesia, Peru- face-to-face
o   Telephone survey in the US
o   Key objectives included comparative analysis across diverse countries and benchmarking internet survey against community samples
o   Open access internet surveys including all 220 sequelae
§  Available in English, Spanish, and Mandarin
§  Paired comparisons
Paired Comparisons- 2 descriptions of hypothetical people, each with a randomly selected condition- respondents indicate which person is healthier.
·      Literacy and numeracy are not essential
·      Health comparisons not tied to external “calibrators” like risk
·       Appealing intuitive basis and established strategies for analysis.
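
A simplified sketch of how paired-comparison responses can be turned into an ordering of conditions, using a Bradley-Terry-style logistic regression on made-up responses (illustrative only, not the study's actual analysis):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up paired-comparison responses over 3 hypothetical conditions (0, 1, 2).
# Each row: (condition A, condition B, 1 if the respondent said A is healthier).
pairs = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (1, 0, 0),
         (2, 0, 0), (2, 1, 1), (0, 1, 1), (2, 1, 0)]

n_conditions = 3
X = np.zeros((len(pairs), n_conditions))
y = np.zeros(len(pairs))
for row, (a, b, a_healthier) in enumerate(pairs):
    X[row, a], X[row, b] = 1, -1     # difference coding for the two conditions shown
    y[row] = a_healthier

# Bradley-Terry-style fit: higher coefficient = judged healthier more often,
# which gives an ordering of conditions from least to most severe.
model = LogisticRegression(fit_intercept=False).fit(X, y)
print(model.coef_)
```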

The basis for all comparisons is lay descriptions of sequelae, which highlight the major functional consequences and symptoms associated with each sequela
o   must be brief: restricted to less than 35 words based on pretest results
o   must use simple, non-clinical vocabulary
What about conditions that were (probably) worse to have 30 years ago than they are now?
·      You shouldn’t capture that in your disability weight, you should capture that in your severity distribution.

Results: comparison of household and web surveys- even Tanzania, the site with the lowest educational attainment, isn't that different from the web survey.

Population Health equivalent questions- finding “grounding points” for the scale.
Imagine two different health programs- first prevented 1,000 people from getting an illness that causes rapid death. Second prevented 2,000 people from getting an illness that is not fatal but causes the following lifelong health problems: “the person is completely deaf”. Which program would you say produced the greater overall population health benefit?

Lowest weight- 0.004
Schizophrenia- 0.77
Untreated AIDS- 0.55
Vision and hearing weights are some of the most contentious.


But… severe intellectual disability (people who can't even dress themselves) still isn't fully treated as a "health" loss- the weight is only around 0.2!!


Comorbidity- can’t just add them up- they can stack up to more than being dead. Multiplicative function, simulated populations, what proportion must have mroethan 1 function at the same time, make sure that it never exceeds 1, reduce amount of burden by age, sex, year, etc-

Friday, September 5, 2014

The Global Burden of Disease: Generating Evidence, Guiding Policy

How does GBD work?

  1. GBD counts the total number of deaths in a year
  2. Researchers work to assign a single cause to each death using a variety of methods.
  3. Estimates of cause-specific mortality are then compared to the estimates of total deaths as a double-check (they should add up to the right number)
GBD also assesses the disease burden attributable to different risk factors via comparative risk assessment- considering both the prevalence of the risk factor and the relative harm caused by that risk factor. Premature death and disability due to:
high BP
Tobacco and alcohol use
lack of exercise
air pollution
poor diet


Key Terms
Years of life lost (YLLs): Years of life lost due to premature mortality
Years lived with disability (YLDs): Years of life lived with any short-term or long-term health loss.
Disability-adjusted life years (DALYs): the sum of YLLs and YLDs- years of healthy life lost
Healthy life expectancy, or health-adjusted life expectancy (HALE): The number of years that a person at a given age can expect to live in good health, taking into account mortality and disability.
Sequelae: consequences of diseases and injuries
Health states: groupings of sequelae that reflect key differences in symptoms and functioning
Disability weights: numbers on a scale from 0 to 1 that represent the severity of health loss associated with a health state.
Uncertainty intervals: A range of values that is likely to include the correct estimate of health loss for a given cause. Limited data create substantial uncertainty. 
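
A minimal sketch of the DALY arithmetic implied by these key terms; all numbers are made up for illustration:

```python
# YLL: deaths multiplied by the standard remaining life expectancy at the age of death.
deaths_at_age_60      = 100
life_expectancy_at_60 = 25        # standard life-table years remaining (made up)
yll = deaths_at_age_60 * life_expectancy_at_60

# YLD (prevalence-based): prevalent cases multiplied by the disability weight.
prevalent_cases   = 5
disability_weight = 0.2           # 0 = perfect health, 1 = equivalent to death
yld = prevalent_cases * disability_weight

daly = yll + yld                  # years of healthy life lost
print(yll, yld, daly)             # 2500, 1.0, 2501.0
```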


Thursday, September 4, 2014

Measuring Indicators and Health- Principles to Live By


Approach
  • 1. Comprehensive Comparisons
    • We measure everything, for everybody. Every disease, risk factor, etc, over time, for all places.
    • Some things don’t matter in some countries, but measuring them is still worthwhile- by measuring it, we can safeguard ourselves from a veil of ignorance and make sure we don’t ignore a problem.
    • Least amount of bias- measure everything!
    • Lots of data gap issues- probably not very many data points for rural areas, less developed countries, etc… Ex: North Korea- confidence is very low. Highlights the gap in information! People react to numbers, even if the numbers are HIGHLY uncertain. Still put out a number- the experience with GBD is that people say "You're wrong! It's not that bad!", and then we ask "Why? What data is there to convince us?"
  • 2. Uncertainty
    • Every estimate should ALWAYS be published with an uncertainty/confidence interval. Uncertainty is ONE way of conveying the strength of the evidence- it's comparable and easily interpretable. 
  • 3. Internal Consistency
    • If you are producing an estimate of incidence of a disease, and an estimate of prevalence of disease…they should be consistent.
    • All the hypothetical deaths from each cause SHOULD add up to 100% of the deaths; see the rescaling sketch after this list. (WHO does not do this- if you take malaria, TB, HIV, etc. and add up the deaths, we die about 50% more than we should.)
    • Can be plausible within the uncertainty.
    •  http://www.statsmakemecry.com/smmctheblog/confusing-stats-terms-explained-internal-consistency.html
    • How do you deal with comorbidity?
      • Usually there are underlying causes of death, not JUST stroke, etc…
      • We know it's a problem; we can't let each of those causes of death claim those deaths- we need some approach to deal with comorbidities. 
  • 4. Iterative approach to estimation
    • Don’t get “locked into” your methods- as things improve, go back and CHANGE your estimates for the past, not just the present, based on newer estimations, etc…
    • Computing power has increased- it’s irresponsible to not use the tools we have now to change our retrospective estimations as well.
    • It is our moral obligation to use all better data and methods as soon as we have them.
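
As a rough illustration of the internal-consistency point above, here is a crude proportional rescaling of cause-specific deaths to an all-cause total (made-up numbers; not the actual algorithm used in GBD):

```python
# Made-up cause-specific death estimates that overshoot the all-cause total.
cause_deaths    = {"malaria": 600, "TB": 450, "HIV": 450}   # sums to 1500
all_cause_total = 1000

# Squeeze each cause proportionally so the causes add up to 100% of deaths.
scale = all_cause_total / sum(cause_deaths.values())
consistent = {cause: deaths * scale for cause, deaths in cause_deaths.items()}
print(consistent)   # now sums exactly to 1000
```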


Some Principles for Data Synthesis

  • 1. We use all possible data
    • What about quality?
    • What does "all possible data" mean? We will take any and all data we can identify- "quality of data" is highly subjective. There is a belief in the field that DHS surveys are really high quality, but since they ask 2,000 questions, how can you say that the last question is as high quality as the first? We don't always include all data in the estimation, because there are some outliers, but you want to start off with the most complete picture.
    • Makes it more difficult to interpret/reproduce (not as bad if you have a good catalogue)
    • Where does qualitative data fit in?
      • Fits into your assessment of data quality. Hard to put qualitative data on a graph, but it can inform your models, what you think is implausible, etc.
    • How do you reconcile two conflicting sources of data?
      •   Multiple ways to solve it- advanced line fitting between different surveys, etc- you need to collect the dirt on the data sources
2. Assess the various forms of error:

  • Sampling
    • Difference between the sample population and the population you're trying to study. Happens in sample recruitment.
    • Can sometimes correct for this

  • Different definitions
    • How do you identify the disease? What does it mean to "have pneumonia"? You might use X-rays, or maybe WHO diagnostic criteria, etc.- in the same case, in the same child, you are measuring different things.
    • Difficult because the testing methods might make seeing trends over time very difficult.
    • You can sometimes "crosswalk" between different measurements if you know the ratios of the different measurement methods' sensitivities and their relationships with one another (there is a standard way of making your confidence interval reflect this); see the crosswalk sketch after this list.
    • Better not to use data that isn't comparable than to include it incorrectly.
    • There are 18 different, good study-based definitions of how to measure diabetes.

  • Sources of bias

  • Comparability
    • Make sure you are comparing apples to apples in your model

  • 4. Statistical Inference
    • Handle both sampling and non-sampling error
    • Use covariates: Other variables that might be relevant or might influence the relationships and your estimates
      • Estimating diabetes? Include obesity in your model. Include as much information as you have.
    • Model that always reflects uncertainty
      • Reflect sample size
      • If your input sources have high uncertainty, your estimate will have high uncertainty; lower-uncertainty data sources yield lower estimate uncertainty
      • Amount of data / gaps in data- how many years are you inferring data for? How many areas?
      • Parameter uncertainty
      • Lots of cross walking? More uncertainty.
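
A rough sketch of the "crosswalk" idea mentioned above: convert a data point measured with an alternative definition onto the reference definition's scale, and widen its uncertainty to reflect the extra conversion step (all numbers made up):

```python
import math

# Suppose studies using definition B systematically find 1.25x the prevalence found
# with reference definition A (ratio estimated from studies reporting both).
ratio_b_to_a = 1.25
ratio_log_se = 0.08            # uncertainty in the estimated ratio (log scale)

prevalence_b = 0.12            # a data point measured with definition B
prevalence_a = prevalence_b / ratio_b_to_a     # crosswalked to the A scale

se_log_b = 0.10                                # original uncertainty (log scale)
se_log_a = math.sqrt(se_log_b**2 + ratio_log_se**2)   # crosswalked SE is wider
print(prevalence_a, se_log_a)
```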

Model and Estimate Validation
How do we know that we have a good model?

Out of Sample Predictive Validity
·      Do some sanity checks- pretend you don't have some of the information, and see whether the model recovers what you know to be true. Cross-validate your data.
·      Basically- hold out part of the data, and see how well the model predicts it.
·      Debate: how do you select the hold-out? Out of 100 data points, which ones do you validate against? Randomly dropping 20% is too easy; you still have plenty of data to predict it.
·      You can cut off by time- pretend you don’t have it for the early stages, or the later stages.
·      Look at the data you have, look at the missing patterns, and use that to see that your model is valid.
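
A minimal hold-out sketch with made-up covariates and scikit-learn; in practice, as the notes above say, the hold-out would more often be chosen by time or by place rather than purely at random:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up covariates (think obesity, income) and an outcome (think diabetes prevalence).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hold out 20%, fit on the rest, and check predictions on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(rmse)
```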

Health Indicators- The Good, The Bad, and the Ugly

Good indicators are comparable across time and places, and are based on well-defined events (e.g., birth and death).

Good Attributes of Health Indicators
·      Simple to measure
·      Good proxy? Or in general, is important
·      Easy to collect
·      Comparable across time and regions
·      Easy to Interpret

Bad Attributes
·      Not specific
·      Not clear
·      Scale properties are not clear
·      Assumptions
·      Lack of comparability
·      Confusing
·      Ideal not known

Use of Indicator
·      Proxy

Definitions of Attributes Health Indicators Might Have

Validity-
·      It actually measures what we want to measure
Reliability-
·      Every time you measure it, you’re measuring the same thing.
Comparability-
·      How well you can relate measurements to other similar measurements- across places AND time

Can you do better than a proxy?

“SMART” Indicators
  Specific – target a specific area for improvement.
  Measurable – quantify or at least suggest an indicator of progress.
  Assignable – specify who will do it.
  Realistic – state what results can realistically be achieved, given available resources.
  Time-related – specify when the result(s) can be achieved.



Health Indicators- Some Examples
Infant Mortality Rate
·      Infants under 1 year of age that die, usually expressed per number of live births

Gini Coefficient
·      Indicator of inequality- how far away you are from an even distribution of wealth.
·      Ranges between 0 and 1.
·      Has become a standard for inequality- health doesn't have a comparable single indicator, and a single number can't capture every aspect of the reality
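
A minimal sketch of computing a Gini coefficient from a made-up wealth distribution:

```python
import numpy as np

def gini(x):
    """Gini coefficient: 0 = perfectly even distribution, 1 = maximal inequality."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    # Equivalent to the mean absolute difference between all pairs,
    # normalised by twice the mean.
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

print(gini([1, 1, 1, 1]))    # 0.0  (perfect equality)
print(gini([0, 0, 0, 10]))   # 0.75 (one person holds everything)
```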

Maternal Mortality- How many women die from pregnancy, birth, or after-birth
·      Comparable across time and geography
·      Sometimes can be hard to differentiate
·      Ratio: usually per 100,000 live births.
·      Rate: per year
·      Proxy for women’s health

Incidence of HIV
·      Extremely difficult to measure (similar with TB, etc)
·      If you wanted to measure this….
o   Nobody has the $$ to pay for it.

Prevalence of Child Malnutrition

Number of doses of ART prophylaxis delivered in antenatal care in HIV+ mothers
·      Not clear what is being measured- what is a dose? Medicine? Medicine taken, or just given to the patient? Difficult because it doesn’t specify whether it’s an initial appointment, etc…
·      0 to infinity- scale properties are not clear
E(O) Life expectancy at birth
·      It takes the rates of death for each age class at the current moment, and makes a prediction
·      For an individual born today, if they were exposed to age-specific mortality rates currently observed in their place of birth, how long would they expect to live?
·      Can't necessarily be used directly for policy because of its assumptions- it can obscure many underlying differences
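
A minimal period life-table sketch: given made-up age-specific death probabilities, compute the life expectancy at birth they imply:

```python
import numpy as np

# Made-up age-specific death probabilities q_x for single years of age 0..100.
ages = np.arange(0, 101)
qx = np.clip(0.0005 + 0.00005 * np.exp(0.09 * ages), 0, 1)
qx[0] = 0.02          # higher infant mortality
qx[-1] = 1.0          # everyone dies in the last interval

# Period life table: survivors l_x, deaths d_x, person-years L_x (deaths assumed
# to happen mid-interval), and life expectancy at birth e0 = sum(L_x) / l_0.
lx = np.concatenate(([1.0], np.cumprod(1 - qx)[:-1]))   # probability of surviving to age x
dx = lx * qx
Lx = lx - 0.5 * dx
e0 = Lx.sum() / lx[0]
print(round(e0, 1))
```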

C-section rate
·      Not clear what the “ideal” c-section rate is
·      Doesn’t indicate whether it was a choice or medically necessitated
·      What might be better? Things like…
o   Proportion of women needing emergency c-section
o   Rate of infant death due to emergency c-section
o   Rate of maternal death due to emergency c-section


Contraceptive use in population
·      Can be a proxy for women’s empowerment
·      Doesn’t address need-
·      What is a good rate? 100% Probably not… 0%? Probably not… Do you take into account whether the population is above or below replacement?

Prevalence of obesity
5q0- probability of dying between birth and age 5
·      Specific
·      Clear idea- it should be 0- hopefully, NO children die before their 5th birthday!
·      Comparable across time and regions