In essence, the pandas library exists because dictionaries, lists, and arrays (even NumPy arrays!) are simply too low-level for real statistical or data-crunching analysis. To better understand the pandas library...
Below is a 3-hour video tutorial detailing how pandas works, given by one of its creators (the primary author), Wes McKinney. His blog: http://blog.wesmckinney.com
Relevant time marks:
20:15 Wes McKinney begins the lecture and gives an overview
28:20 Lecture meat begins
40:20 GroupBy slides
50:00 Begins to talk about Series
1:00:00 Begins to talk about DataFrames
1:13:00 Starts to talk about stock data case study
1:28:00 Begins to talk about baby names data
LECTURE NOTES
Basic Terms
Index
Index labels generally need to be unique.
Hierarchical indexes: you can select out groups of data without writing a for loop.
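To make the "no for loop" point concrete, here is a minimal sketch of a hierarchical index. The city/year data is made up for illustration and is not from the lecture.

```python
import pandas as pd

# Hypothetical data: a two-level index of (city, year).
index = pd.MultiIndex.from_tuples(
    [("LA", 2010), ("LA", 2011), ("NYC", 2010), ("NYC", 2011)],
    names=["city", "year"],
)
s = pd.Series([3.8, 3.9, 8.2, 8.3], index=index)

# Selecting on the outer level returns the whole group at once.
nyc = s["NYC"]

# xs selects on an inner level, here every city's value for 2010.
year_2010 = s.xs(2010, level="year")
```

Both selections come back as ordinary Series, indexed by whichever level is left over.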
Series
A 1-D array of data (backed by a NumPy array) with labels attached. A Series also represents a single DataFrame column.
Index: the array of labels.
DataFrame
2D table with row and column labels, potentially heterogeneous columns (columns can be of different types). You can refer to the data by column or row labeling.
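A small sketch of what "heterogeneous columns" and "row or column labeling" look like in practice. The column names and values are invented for illustration, and .loc is the current pandas spelling for label-based row selection.

```python
import pandas as pd

# Hypothetical table: one string column, one integer column, one float column.
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "count": [1, 2, 3], "score": [0.5, 1.5, 2.5]},
    index=["x", "y", "z"],
)

col = df["count"]   # selecting by column label gives a Series
row = df.loc["y"]   # selecting by row label also gives a Series
```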
GroupBy
You say "I want to group on these values": GroupBy splits the data into groups, and then you can apply a function to each group. The result is a smaller object, with a unique set of labels and the aggregated values.
Ways to group by:
- Splitting an axis into groups (0 for rows, 1 for columns) by:
    - DataFrame columns
    - Array of labels
    - Functions, applied to the axis labels
- Keys can be functions, column names, or arrays.
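The three kinds of keys can be sketched as follows. The data and the names by_column, by_array, and by_function are made up for illustration, and the sketch only groups along the rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"key": ["a", "b", "a", "b"], "x": [1, 2, 3, 4]},
    index=["r1", "r2", "s1", "s2"],
)

# Key is a column name.
by_column = df.groupby("key")["x"].sum()

# Key is an array of labels, one per row.
labels = np.array(["u", "u", "v", "v"])
by_array = df.groupby(labels)["x"].sum()

# Key is a function, called on each row label ("r1" -> "r", and so on).
by_function = df.groupby(lambda label: label[0])["x"].sum()
```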
What do you get back from this?
Nothing immediately- it just knows how to split up the data. From there, you decide what you want to do:
- Iterate
    - "for key, group in grouped"
- Transform
    - grouped.transform(f)
    - will alter values, but not their size
- Aggregate
    - grouped.agg(f)
    - produces a single aggregated value per column, per group
- Apply
    - grouped.apply(f)
    - completely generic, but slower
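Here is a minimal sketch of the four options on one grouped column. The data and the lambda functions are invented for illustration, not taken from the lecture.

```python
import pandas as pd

# Hypothetical data: two groups, "a" and "b".
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "x": [1.0, 2.0, 3.0, 4.0]})
grouped = df.groupby("key")["x"]

# Iterate: each step yields the group key and that group's data.
sizes = {key: len(group) for key, group in grouped}

# Transform: the result has the same length as the input.
demeaned = grouped.transform(lambda g: g - g.mean())

# Aggregate: one value per group.
totals = grouped.agg("sum")

# Apply: any function of the group, here max minus min.
ranges = grouped.apply(lambda g: g.max() - g.min())
```

Note that transform keeps the original row order and index, while agg and apply here return one row per group key.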
His startup block:
from pandas import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
Playing with Series
labels = ['a', 'b', 'c', 'd', 'e']
s = Series(np.random.randn(5), index=labels)
'b' in s
checks whether 'b' is one of the labels in the index
s.index
the array of labels in the Series
s['b']
looks up the value at label 'b' in the index
mapping = s.to_dict()
converts the Series to a dictionary
s = Series(mapping, index=['b', 'e', 'a', 'd'])
creates a new Series, this time with only the values whose labels are listed in the "index=..." part of the above line, almost like slicing. If you put a label in the "index=..." list that isn't actually in the original index, you get NaN for that entry.
s[notnull(s)] selects out the data that is not null.
s.dropna() drops NA values.
You can also slice using traditional Python slicing:
s[:3] returns the first three records
s[-2:] returns the last two records
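The Series steps above can be strung together into one runnable sketch. It uses np.arange instead of random numbers so the values are predictable, adds a made-up label 'z' to show the NaN behavior, and spells the positional slices with .iloc, the explicit positional form in current pandas.

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c", "d", "e"]
s = pd.Series(np.arange(5.0), index=labels)  # values 0.0 .. 4.0

has_b = "b" in s          # membership tests the index labels
value_b = s["b"]          # label lookup
mapping = s.to_dict()     # {'a': 0.0, 'b': 1.0, ...}

# Rebuild from the dict with a chosen index; 'z' is a hypothetical
# label that is not in the mapping, so it comes back as NaN.
s2 = pd.Series(mapping, index=["b", "e", "a", "d", "z"])

not_null = s2[s2.notnull()]  # boolean selection of non-null entries
dropped = s2.dropna()        # same result via dropna

first_three = s.iloc[:3]     # positional slice, first three records
last_two = s.iloc[-2:]       # positional slice, last two records
```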
DataFrame Manipulation
A 2-D collection of Series.
df['d'] = 5 sets every value in column 'd' to 5 (creating the column if it doesn't exist yet).
Selecting subsets of the rows:
df.xs(0) gives me the 0th row; the column labels become the index of what's returned.
df.ix allows you to index the DataFrame like a NumPy array (label indexing facility).
df.ix[0, 'b'] gives me the value at index 0, column b.
df.get_value(2, 'b') returns the value at index 2, column b.
df.ix[2:4, 'b':'c'] now returns a slice: rows 2 thru 4, columns b thru c.
df.ix[df['c']>0, 'b':'d'] selects out the rows where the "c" column is greater than 0, columns b thru d.
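A hedged sketch of the same selections in current pandas. df.ix and get_value were removed in pandas 1.0, so this uses .loc and .at as the closest label-based equivalents. The DataFrame itself is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x4 frame with an integer index and values from -10 to 13.
df = pd.DataFrame(
    np.arange(24.0).reshape(6, 4) - 10,
    columns=["a", "b", "c", "d"],
)
df["e"] = 5                        # scalar assignment fills the whole column

row0 = df.loc[0]                   # like df.xs(0): row labeled 0 as a Series
cell = df.loc[0, "b"]              # like df.ix[0, 'b']
cell_fast = df.at[2, "b"]          # like df.get_value(2, 'b')
block = df.loc[2:4, "b":"c"]       # label slices include both endpoints
positive_c = df.loc[df["c"] > 0, "b":"d"]  # boolean row mask plus column slice
```

Unlike plain Python slicing, the label slice 2:4 includes row 4, so block has three rows.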
In these cases, the index labels have been integers. However, you can also pass index=DateRange('1/1/2000', periods=6) (spelled date_range in current pandas), which creates dates as the row labels.
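A small sketch of a date-labeled DataFrame, assuming the current pd.date_range function in place of the older DateRange. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Six consecutive daily dates starting on 1 January 2000.
dates = pd.date_range("1/1/2000", periods=6)

df = pd.DataFrame(
    np.arange(12).reshape(6, 2),
    index=dates,
    columns=["a", "b"],
)

# Rows are now looked up by date instead of by integer.
row = df.loc[pd.Timestamp("2000-01-03")]
```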
After importing a DataFrame...
s1.add(s2, fill_value=0) adds the two objects label by label; where only one of them has a value for a label, the missing side is treated as zero (double check this, around 1:15:)
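To check the fill_value behavior, here is a minimal sketch with two made-up Series whose labels only partly overlap.

```python
import pandas as pd

# Hypothetical Series: only label 'b' appears in both.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "d"])

# Plain addition aligns on labels and gives NaN wherever one side is missing.
plain = s1 + s2

# With fill_value=0 the missing side counts as zero instead.
filled = s1.add(s2, fill_value=0)
```

In this sketch plain has a real number only at 'b', while filled has a value for every label in either Series.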
df.mean() computes the mean of each column
df.apply(np.mean, axis=1) computes the mean of each row
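A quick sketch of the two directions, on a made-up two-column DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [3.0, 4.0, 5.0]})

col_means = df.mean()                   # one mean per column
row_means = df.apply(np.mean, axis=1)   # one mean per row
row_means_direct = df.mean(axis=1)      # same row means without apply
```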
....Note to self: around 1:10:00 I probably start listening-skimming and should re-watch it.
....CHECK FOR UPDATES IN PANDAS!!!