A Small Space for Ideas, Notes, Data Visualization, Collecting Resources, and Remembering What You’ve Read

Wednesday, April 9, 2014

Pandas Data Analysis 3 Hour Tutorial Video

In essence, the pandas library is an answer to the fact that dictionaries, lists, and arrays (even NumPy's) are simply too low-level for true statistical or data-crunching analysis. To better understand the pandas library...


And below is a 3-hour video tutorial detailing how pandas works, given by its creator and primary author, Wes McKinney. His blog:  http://blog.wesmckinney.com



Relevant time marks:

20:15 - Wes McKinney begins the lecture and gives an overview
28:20 - The meat of the lecture begins
40:20 - GroupBy slides
50:00 - Begins to talk about Series
1:00:00 - Begins to talk about DataFrames
1:13:00 - Starts the stock data case study
1:28:00 - Begins the baby names data

LECTURE NOTES

Basic Terms

Index
Index labels generally need to be unique.
Hierarchical indexes: you can select out groups of data without writing a for loop.
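A minimal sketch of that hierarchical-index selection (the labels and values here are illustrative, not from the lecture):

```python
import pandas as pd

# Series with a two-level (hierarchical) MultiIndex
idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2)],
    names=['outer', 'inner'])
s = pd.Series([10, 20, 30, 40], index=idx)

# Select the entire 'a' group without any for loop
print(s.loc['a'])   # rows ('a', 1) and ('a', 2)
```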

Series
A 1-D NumPy array plus an index (an array of labels). A Series represents a single DataFrame column.

DataFrame
A 2-D table with row and column labels and potentially heterogeneous columns (columns can be of different types). You can refer to the data by column or row label.
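A tiny sketch of both points: mixed column dtypes and access by column or row label (the names `name`, `count`, `r0`, `r1` are made up for illustration):

```python
import pandas as pd

# Columns hold different dtypes; rows and columns both have labels
df = pd.DataFrame({'name': ['x', 'y'], 'count': [3, 7]},
                  index=['r0', 'r1'])

print(df['count'])     # select a column by its label
print(df.loc['r1'])    # select a row by its label
```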

GroupBy
"I want to group on these values": it splits the data into groups, then you apply a function to those groups. The result is a smaller object with a unique set of group labels and the aggregated values.
Ways to group by:

  • Splitting an axis into groups by:
  • DataFrame columns
  • An array of labels
  • Functions, applied to the axis

df.groupby([key1, key2], axis=0)
#0 for rows, 1 for columns.
#Keys can be functions, column names, or arrays.

What do you get back from this?
Nothing immediately; the grouped object just knows how to split up the data. From there, you decide what you want to do:

  • Iterate
    • "for key, group in grouped"
  • Transform
    • grouped.transform(f)
      • will alter values, but not their size
  • Aggregate
    • grouped.agg(f)
      • agg: produce a single aggregated value per column, per group
  • Apply
    • grouped.apply(f)
      • completely generic, but slower
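The four options above in one runnable sketch (the frame, `key`, and `val` are illustrative, not from the lecture):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, 2.0, 3.0, 4.0]})
grouped = df.groupby('key')['val']

# Iterate: each group comes back as a (key, sub-series) pair
for key, group in grouped:
    print(key, list(group))

# Aggregate: one value per group
means = grouped.agg('mean')            # a -> 1.5, b -> 3.5

# Transform: same length as the input, values replaced group-wise
filled = grouped.transform('mean')     # [1.5, 1.5, 3.5, 3.5]

# Apply: fully generic per-group function (slower)
spans = grouped.apply(lambda g: g.max() - g.min())
```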



His startup block:

from pandas import * 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rc('figure', figsize=(10, 6))

Playing with Series

labels = ['a', 'b', 'c', 'd', 'e']
s = Series(np.random.randn(5), index=labels)
'b' in s  # membership tests check the index

s.index
the array of labels in the series

s['b']
looks up the value at label 'b' in the index

mapping = s.to_dict() -> converts the series to a dictionary

s = Series(mapping, index=['b', 'e', 'a', 'd'])
creates a new series containing only the values whose labels appear in the "index=..." list, almost like slicing. If you put a label in the "index=..." list that isn't in the original index, that entry comes back as NaN.

s[notnull(s)] selects the data that is not null.
s.dropna() drops NA values.

Can also slice using traditional Python slicing:
s[:3] returns the first three records
s[-2:] returns the last two rows

DataFrame Manipulation
A 2-D collection of Series.

df['d'] = 5 sets (or creates) a column 'd' in that dataframe, filled with the value 5.

Selecting subsets of the rows:
df.xs(0) gives the row labeled 0; the column labels become the index of what's returned.
df.ix allows you to index the DataFrame like a NumPy array (a label-indexing facility).
df.ix[0, 'b'] gives the value at index 0, column 'b'.
df.get_value(2, 'b') returns the value at index 2, column 'b'.

df.ix[2:4, 'b':'c'] now returns a slice: rows 2 through 4, columns 'b' through 'c'.

df.ix[df['c']>0, 'b':'d'] selects the rows where the 'c' column is greater than 0 (and columns 'b' through 'd')
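Worth noting for anyone reading these notes later: `.ix` was deprecated and eventually removed from pandas; the modern equivalents are `.loc` (label-based) and `.iloc` (position-based). The same selections, sketched on an illustrative frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=['b', 'c', 'd'])

v = df.loc[0, 'b']                    # scalar at row label 0, column 'b'
block = df.loc[2:3, 'b':'c']          # label slices include both endpoints
filtered = df.loc[df['c'] > 4, 'b':'d']  # boolean row filter + column slice
```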

In these cases the indexes have been integers. However, you can pass an index=DateRange('1/1/2000', periods=6)
that creates dates as the row labels.
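`DateRange` is the old API from the era of this lecture; modern pandas spells it `pd.date_range`. A sketch with the same six daily periods:

```python
import pandas as pd
import numpy as np

dates = pd.date_range('1/1/2000', periods=6)  # six consecutive days
df = pd.DataFrame(np.random.randn(6, 2),
                  columns=['a', 'b'], index=dates)
print(df.index[0])   # first row label is a Timestamp
```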

After importing a DataFrame...
s1.add(s2, fill_value=0) combines two objects; where a value is missing on one side, it fills in zero (double check this, around 1:15:)
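A quick check of that `fill_value` behavior on mismatched indexes (the values here are made up):

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])

# Plain + leaves non-overlapping labels as NaN;
# fill_value=0 treats the missing side as 0 instead.
plain = s1 + s2                      # a: NaN, b: 12, c: NaN
filled = s1.add(s2, fill_value=0)    # a: 1.0, b: 12.0, c: 20.0
```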


df.mean() computes the mean of each column
df.apply(np.mean, axis=1) computes the mean of each row
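The axis keyword is what flips the direction of the reduction; a small illustrative frame makes it concrete:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, 3.0], 'b': [2.0, 4.0]})

col_means = df.mean()                  # per column: a -> 2.0, b -> 3.0
row_means = df.mean(axis=1)            # per row: 1.5, 3.5
# df.apply(np.mean, axis=1) gives the same per-row result
apply_means = df.apply(np.mean, axis=1)
```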

....Note to self: around 1:10:00 I probably start listening-skimming and should re-watch it.

....CHECK FOR UPDATES IN PANDAS!!!




