3. DataFrame¶
3.1. Group by¶
Group a dataframe by a group of columns with different aggregations columns.
dataframe.groupby(['Name', 'Fruit'])['Number'].agg('sum')
You can also use different aggregation functions on different columns.
dataframe.groupby(['Name', 'Fruit'])['Number, att1, att2'].agg({'Name': "count", 'att1': "sum",'att2': 'mean'})
You can also use custom aggregation functions
dataframe.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
3.1.1. Aggregation Methods¶
sum
mean
count
nunique
3.2. MultiIndex¶
3.2.1. Remove MultiIndex¶
To split multi index to columns :
dataframe.reset_index()
Turn that :
att1 |
att2 |
||
---|---|---|---|
departement |
day |
||
1 |
01/01/2020 |
0.083 |
0.083 |
02/01/2020 |
0.083 |
0.083 |
|
03/01/2020 |
0.083 |
0.083 |
To that :
departement |
day |
att1 |
att2 |
|
---|---|---|---|---|
index |
||||
1 |
1 |
01/01/2020 |
0.083 |
0.083 |
2 |
1 |
02/01/2020 |
0.083 |
0.083 |
3 |
1 |
03/01/2020 |
0.083 |
0.083 |
3.3. Sort On¶
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’)
df.sort_values(by=['col1'])
3.4. Parallelize apply¶
from multiprocessing import Pool
from functools import partial
import numpy as np
def parallelize(data, func, num_of_processes=8):
data_split = np.array_split(data, num_of_processes)
pool = Pool(num_of_processes)
data = pd.concat(pool.map(func, data_split))
pool.close()
pool.join()
return data
def run_on_subset(func, data_subset):
return data_subset.apply(func, axis=1)
def parallelize_on_rows(data, func, num_of_processes=8):
return parallelize(data, partial(run_on_subset, func), num_of_processes)