A Small Space for Ideas, Notes, Data Visualization, Collecting Resources, and Remembering What You’ve Read

Tuesday, April 15, 2014

Google BigQuery

Interesting....

https://developers.google.com/bigquery/

  • " Analyze terabytes of data in seconds. Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google's infrastructure.
    Load data with ease. Bulk load your data using Google Cloud Storage or stream it in.
    Easy access. Access BigQuery by using a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries such as Java, PHP or Python."

Thursday, April 10, 2014

Interactive Charting Options for R and Python: Rcharts and Python Equivalent

Sadly, you can't see the interactivity- this is just a screenshot. Click the links below for the actually cool stuff.

Visualizing fertility trends in Europe - An example of Rcharts in action! Check out the clicking/unclicking of data viewing/options/binning- so cool. 

(also check out RNotebook for the interactive scripting piece). 


It exists for Python, too, kind of (with fewer options)!
Again, no dice on the interactivity; this is just a screencap from the links below.

Documentation-- http://python-nvd3.readthedocs.org/en/latest/

A cool application of this would be to test out the average of Nvac, Nplac, cases, etc over time in the final data set for the Pneumonia project.

Numba: Fast C-Compiling for Python

Apparently, Numba's @jit decorator really helps speed up loop-heavy Python scripts.

"Numba is an just-in-time specializing compiler which compiles annotated Python and NumPy code to LLVM (through decorators). Its goal is to seamlessly integrate with the Python scientific software stack and produce optimized native code, as well as integrate with native foreign languages."
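A minimal sketch of the kind of loop the decorator is aimed at. The function name and data are made up for illustration, and the try/except fallback is my own assumption so the snippet still runs (uncompiled) where Numba isn't installed.

```python
import numpy as np

try:
    from numba import njit  # compiles the decorated function on its first call
except ImportError:
    # Assumption: fall back to plain Python so the sketch runs without Numba.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    # An explicit element-by-element loop: slow in plain Python,
    # and the sort of code a JIT compiler can turn into native code.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


data = np.arange(1000, dtype=np.float64)
result = sum_of_squares(data)
```

The first call pays the compilation cost, so any speedup only shows up on repeated calls or large inputs.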

MiniSom: Stripped down Self-Organizing Maps


"MiniSom is a minimalistic and Numpy based implementation of the Self Organizing Maps (SOM). SOM is a type of Artificial Neural Networks able to convert complex, nonlinear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display."


Wednesday, April 9, 2014

Pandas Data Analysis 3 Hour Tutorial Video

In essence, the pandas library is an answer to the fact that dictionaries, lists, and arrays are simply too low-level (even NumPy!) for true statistical or data-crunching analysis. To better understand the pandas library...


And below is a 3 hour video tutorial detailing how Pandas works by one of its creators (the primary author), Wes McKinney. His Blog:  http://blog.wesmckinney.com



Relevant time marks:

20:15 Wes McKinney begins lecture and gives overview
28:20 Lecture meat begins
40:20 GroupBy slides
50:00 begins to talk about Series
60:00 begins to talk about DataFrames
1:13:00 starts to talk about stock data case study
1:28:00 begins to talk about baby names data

LECTURE NOTES

Basic Terms

Index
Index values generally need to be unique.
Hierarchical indexes-- you can select out groups of data without writing a for loop.
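A small sketch of what "selecting groups without a for loop" can look like with a two-level index. The country/year data is hypothetical, and this uses the current pandas spelling (pd.MultiIndex, .loc, .xs), which may differ from what the lecture showed.

```python
import pandas as pd

# Hypothetical data: a two-level index of (country, year).
index = pd.MultiIndex.from_tuples(
    [("DE", 2010), ("DE", 2011), ("FR", 2010), ("FR", 2011)],
    names=["country", "year"],
)
values = pd.Series([3.0, 4.0, 1.0, 2.0], index=index)

# Select every row under one outer label, no loop needed.
france = values.loc["FR"]

# Select one inner label across all outer labels.
year_2010 = values.xs(2010, level="year")
```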

Series
A 1-D labeled array, built on a NumPy array. A Series represents a single DataFrame column.
Index: array of labels.

DataFrame
 2D table with row and column labels, potentially heterogeneous columns (columns can be of different types). You can refer to the data by column or row labeling.

GroupBy
"I want to group on these values": it splits the data into groups, and then you can apply a function to each group. The result is a smaller object, with a unique set of labels and the aggregated values.
Ways to group by:

  • splitting axis into groups:
  • DataFrame columns
  • Array of labels
  • Functions, applied to axis.
df.groupby([key1, key2], axis=0)
#0 for rows, 1 for columns. 
#Keys can be functions, column names, or arrays. 

What do you get back from this?
Nothing immediately- it just knows how to split up the data. From there, you decide what you want to do:

  • Iterate
    • "for key, group in grouped"
  • Transform
    • grouped.transform(f)
      • will alter values, but not their size
  • Aggregate
    • grouped.agg(f)
      • agg: produce a single aggregated value per column, per group
  • Apply
    • grouped.apply(f)
      • completely generic, but slower
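The four options above, sketched on a tiny made-up DataFrame. The column names and lambdas are just for illustration; the calls themselves (groupby, agg, transform, apply) are standard pandas.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "b", "a", "b"],
        "x": [1.0, 2.0, 3.0, 4.0],
    }
)
grouped = df.groupby("key")  # nothing is computed yet, it only knows the split

# Iterate: each step yields the group label and its sub-DataFrame.
sizes = {key: len(group) for key, group in grouped}

# Aggregate: one value per column, per group.
totals = grouped["x"].agg("sum")

# Transform: same size as the input, values changed group by group.
demeaned = grouped["x"].transform(lambda col: col - col.mean())

# Apply: any function of the group; here the spread of x within each group.
ranges = grouped["x"].apply(lambda col: col.max() - col.min())
```

Note that transform returns four values (one per original row) while agg and apply here return two (one per group).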



His startup block:

from pandas import * 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rc('figure', figsize=(10, 6))

Playing with Series

labels = ['a', 'b', 'c', 'd', 'e']
s = Series(np.random.randn(5), index=labels)
'b' in s

s.index 
array of labels in series

s['b']
looks up value at location 'b' in index

mapping= s.to_dict() -> converted the series to a dictionary

s=Series(mapping, index= ['b', 'e', 'a', 'd'])
creates a new series, this time with only the values whose labels appear in the "index=..." part of the above line, almost like slicing. If you put something in the "index=..." list that isn't actually in the original index, you get NaN for that entry.

s[notnull(s)] selects out data that is not null.
s.dropna() drops NA values

Can also slice using traditional python slicing-
s[:3] returns the first three records
s[-2:] returns the last two records
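The Series notes above, gathered into one runnable sketch. The fixed values and the deliberately missing label "z" are my own choices to make the NaN behavior visible, and I use the explicit pd. prefix and .iloc rather than the star import.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=["a", "b", "c", "d", "e"])

has_b = "b" in s      # membership tests the index labels, not the values
value_b = s["b"]      # label lookup

mapping = s.to_dict()
# "z" is deliberately absent from the mapping, so it comes back as NaN.
subset = pd.Series(mapping, index=["b", "e", "z"])

clean = subset[subset.notnull()]   # boolean mask keeps non-null entries
same = subset.dropna()             # shortcut for the same thing

first_three = s.iloc[:3]   # positional slices, spelled explicitly
last_two = s.iloc[-2:]
```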

DataFrame Manipulation
A 2d Collection of Series.

df['d']=5 sets every value in column 'd' of the DataFrame to 5 (creating the column if it doesn't exist yet).

Selecting subsets of the rows:
df.xs(0) gives me the 0th row- column labels become the index of what's returned.
df.ix allows you to index the DataFrame like a NumPy array. (label indexing facility)
df.ix[0, 'b'] gives me the value at index 0, column b.
df.get_value(2, 'b') returns the value at index 2, column b.

df.ix[2:4, 'b':'c'] now returns a slice: rows 2 thru 4, columns b thru c.

df.ix[df['c']>0, 'b':'d'] selects out the rows where the "c" column is greater than 0, for columns b thru d

In these cases, the indexes have been integers. However, the index can also be something like index=DateRange('1/1/2000', periods=6),
which creates dates as the row labels.
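For when I re-check this against newer pandas: .ix and get_value were removed in later releases, and DateRange became date_range. A rough sketch of the same selections using .loc, .iloc, and .at, on a made-up six-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [10, 20, 30, 40, 50, 60],
        "c": [-1, 1, -1, 1, -1, 1],
        "d": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    }
)

row0 = df.iloc[0]                          # 0th row; column labels become its index
cell = df.at[2, "b"]                       # single value at row label 2, column "b"
block = df.loc[2:4, "b":"c"]               # label slices include both endpoints
positive = df.loc[df["c"] > 0, "b":"d"]    # rows where c > 0, columns b thru d

# The same data with dates as row labels instead of integers.
dated = df.copy()
dated.index = pd.date_range("1/1/2000", periods=6)
jan_third = dated.loc[pd.Timestamp("2000-01-03"), "b"]
```

Unlike ordinary Python slices, the label slice 2:4 includes row 4.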

After importing a DataFrame...
s1.add(s2, fill_value=0) adds two objects element-wise after lining up their labels; where only one side has a value, the missing side is filled in as zero (double check this, around 1:15:)
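A quick check of the fill_value behavior on two made-up Series with partly overlapping labels:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

plain = s1 + s2                      # labels missing on one side become NaN
filled = s1.add(s2, fill_value=0)    # the missing side counts as 0 instead
```

Here plain has NaN at "a" and "c", while filled keeps 1.0 and 20.0 there; both give 12.0 at "b".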


df.mean() computes means of columns
df.apply(np.mean, axis=1) computes means of rows
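The column-versus-row distinction on a two-by-two frame (values made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 6.0]})

col_means = df.mean()                    # one mean per column
row_means = df.apply(np.mean, axis=1)    # one mean per row
also_rows = df.mean(axis=1)              # same result without apply
```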

....Note to self: around 1:10:00 I probably start listening-skimming and should re-watch it.

....CHECK FOR UPDATES IN PANDAS!!!





Tuesday, April 8, 2014

Learning SQL resources and PostGIS


A blog post geared towards a pay-for service called CartoDB, but it roughly explains joining/relating geometry in databases using SQL and the ST_Intersects function.


The Boundless Geo Tutorial Module Index

http://workshops.boundlessgeo.com/postgis-intro/index.html
An introduction to PostGIS, using the OpenGeo suite, which isn’t important, but the text might still be useful and it has a ton of recipes that could be adapted.

A rough intro to the way SQL “thinks” compared to other programming languages. The “complete understanding” claim is a bit of an exaggeration.

A blog post about SQL joins and getting Right, Left, Inner, and Outer joins straight.
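To keep inner and left joins straight, here is a tiny sketch using Python's built-in sqlite3 module. The city and visit tables are invented for illustration; right and full outer joins are left out because older SQLite versions don't support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE city (id INTEGER, name TEXT);
    CREATE TABLE visit (city_id INTEGER, year INTEGER);
    INSERT INTO city VALUES (1, 'Oslo'), (2, 'Lima');
    INSERT INTO visit VALUES (1, 2013), (3, 2014);
    """
)

# INNER JOIN keeps only rows that have a match on both sides.
inner = conn.execute(
    "SELECT city.name, visit.year FROM city "
    "INNER JOIN visit ON visit.city_id = city.id ORDER BY city.id"
).fetchall()

# LEFT JOIN keeps every row of the left table; unmatched columns are NULL.
left = conn.execute(
    "SELECT city.name, visit.year FROM city "
    "LEFT JOIN visit ON visit.city_id = city.id ORDER BY city.id"
).fetchall()
conn.close()
```

Lima has no visit, so it drops out of the inner join but survives the left join with a NULL (None in Python) year.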

Monday, April 7, 2014

WikiBrains: Interactive Network/Relationship Maps

Circos: Taking Data Vis Inspiration from the world of Genomics


http://circos.ca/software/roadmap/

Could be especially useful for visualizing migration, or exceptionally large data.

As a whole it came to my attention from something a little more stripped down- this “map” of human migration in the United States. http://www.wired.com/2013/11/mapping-migration-without-a-map/

It seems to be suited to analysis of flows, and of relationships that can be described as an n x n matrix with a value for each intersection.
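A sketch of what such an n x n matrix might look like when built from a list of flows. The regions and counts are invented, and this is only the general shape of the data, not a claim about the input format Circos itself expects.

```python
import pandas as pd

# Hypothetical migration records: origin, destination, number of people.
moves = pd.DataFrame(
    {
        "origin": ["CA", "CA", "NY", "TX"],
        "dest": ["NY", "TX", "CA", "CA"],
        "people": [120, 80, 95, 60],
    }
)
regions = sorted(set(moves["origin"]) | set(moves["dest"]))

# Rows are origins, columns are destinations; pairs with no flow get 0.
matrix = moves.pivot_table(
    index="origin", columns="dest", values="people", aggfunc="sum", fill_value=0
).reindex(index=regions, columns=regions, fill_value=0)
```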

This link talks a little about the kind of data that might make sense, and it has an example at the bottom of it being used in conjunction with some mapping. The sub-sections of the circular charts that are a little more detailed and less of a fine web might be a better example of how it could be used.

http://circos.ca/tutorials/images/small/ And here’s a section that’s a tutorial that deals with creating different kinds of images.

I haven’t looked into it deeply, but it looks like it’s based on Perl, which I’m not very familiar with. It appears that someone has come up with an R package that mimics it, which is an interesting thought.
Some documentation below:


This is a similarly beautiful, and apparently useful visualization tool for networks by the same designer:
 [ Hive Plots - Rational Network Visualization - A Simple, Informative and Pretty Linear Layout for Network Analytics - Martin Krzywinski ]
http://mkweb.bcgsc.ca/linnet/

Poster!


Template Post

This is a template for a reading summary/commentary.

Paper Name:

Paper Authors:

Link to Finding Article

Commentary:


Summary of Main Points:

  • points
  • points 
  • points

Summary of Methods:

  1. method
    1. sub-commentary
  2. method
  3. method