3. DataFrame

3.1. Group by

  • Group a dataframe by one or more columns, then aggregate another column.

dataframe.groupby(['Name', 'Fruit'])['Number'].agg('sum')
  • You can also use different aggregation functions on different columns.

dataframe.groupby(['Name', 'Fruit'])[['Number', 'att1', 'att2']].agg({'Number': 'count', 'att1': 'sum', 'att2': 'mean'})
  • You can also use custom aggregation functions, for example a lambda that receives each group's column as a Series.

dataframe.groupby("date").agg({"duration": "sum", "user_id": lambda x: x.nunique()})
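The three forms above can be tried on a small frame. This is a minimal sketch: the data is made up for illustration, and only the column names (Name, Fruit, Number, att1, att2) come from the snippets above. Note that selecting several columns after groupby takes a list inside the brackets.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the notes.
dataframe = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob", "Bob", "Bob"],
    "Fruit": ["apple", "apple", "pear", "pear", "kiwi"],
    "Number": [1, 2, 3, 4, 5],
    "att1": [10, 20, 30, 40, 50],
    "att2": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One aggregation on one column: the result is a Series indexed by (Name, Fruit).
totals = dataframe.groupby(["Name", "Fruit"])["Number"].agg("sum")

# A different aggregation per column: the result is a DataFrame.
summary = dataframe.groupby(["Name", "Fruit"])[["Number", "att1", "att2"]].agg(
    {"Number": "count", "att1": "sum", "att2": "mean"}
)

# A custom aggregation: the lambda receives each group's values as a Series.
spread = dataframe.groupby("Name")["Number"].agg(lambda s: s.max() - s.min())

print(totals)
print(summary)
print(spread)
```

The grouping columns end up in the index of each result, which is where the MultiIndex handled in section 3.2 comes from.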

3.1.1. Aggregation Methods

  • sum

  • mean

  • count

  • nunique
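As a rough illustration of how these four methods differ, the sketch below runs each one on the same groups. The data is invented; the column names (date, duration, user_id) are borrowed from the custom aggregation snippet in 3.1. The point to notice is that sum, mean and count skip missing values, and that count and nunique answer different questions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing duration.
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"],
    "user_id": [1, 1, 2, 3],
    "duration": [10.0, np.nan, 30.0, 5.0],
})

grouped = df.groupby("date")

result = pd.DataFrame({
    "total": grouped["duration"].sum(),       # NaN values are skipped
    "average": grouped["duration"].mean(),    # mean of the non-null values
    "n_values": grouped["duration"].count(),  # number of non-null values
    "n_users": grouped["user_id"].nunique(),  # number of distinct values
})

print(result)
```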

3.2. MultiIndex

3.2.1. Remove MultiIndex

To turn the levels of a MultiIndex back into regular columns:

dataframe.reset_index()

Turn this:

                         att1   att2
departement  day
1            01/01/2020  0.083  0.083
             02/01/2020  0.083  0.083
             03/01/2020  0.083  0.083

Into this:

index  departement  day         att1   att2
0      1            01/01/2020  0.083  0.083
1      1            02/01/2020  0.083  0.083
2      1            03/01/2020  0.083  0.083
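The transformation above can be reproduced with the sketch below. It rebuilds the example table by hand (the set_index call is only there to create the MultiIndex) and also shows two optional parameters of reset_index, level and drop, which the notes do not mention.

```python
import pandas as pd

# Rebuild the example table with a (departement, day) MultiIndex.
dataframe = pd.DataFrame({
    "departement": [1, 1, 1],
    "day": ["01/01/2020", "02/01/2020", "03/01/2020"],
    "att1": [0.083, 0.083, 0.083],
    "att2": [0.083, 0.083, 0.083],
}).set_index(["departement", "day"])

# Move every index level into a column; a fresh integer index replaces it.
flat = dataframe.reset_index()

# Move only one level into a column and keep the other as the index.
only_day = dataframe.reset_index(level="day")

# Discard the index levels entirely instead of keeping them as columns.
dropped = dataframe.reset_index(drop=True)

print(flat)
```

reset_index returns a new dataframe, so the result has to be assigned unless inplace=True is passed.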

3.3. Sort values

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

df.sort_values(by=['col1'])
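A short sketch of the most commonly used parameters, on invented data: sorting on several columns with a direction per column, controlling where missing values go, and renumbering the index after the sort with ignore_index (a parameter not listed in the signature above, available in recent pandas versions).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; col3 contains one missing value.
df = pd.DataFrame({
    "col1": ["b", "a", "b", "a"],
    "col2": [2, 4, 1, 3],
    "col3": [2.0, np.nan, 1.0, 3.0],
})

# Sort by col1 ascending, then by col2 descending within each col1 value.
by_two = df.sort_values(by=["col1", "col2"], ascending=[True, False])

# Put missing values first instead of last.
nan_first = df.sort_values(by="col3", na_position="first")

# Sort descending and renumber the index from 0.
renumbered = df.sort_values(by="col2", ascending=False, ignore_index=True)

print(by_two)
```

The original row labels are kept by default, which is why by_two still carries the labels of df in their new order.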

3.4. Parallelize apply

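from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd

def parallelize(data, func, num_of_processes=8):
    # Split the row positions into chunks and slice the dataframe with iloc.
    positions = np.array_split(np.arange(len(data)), num_of_processes)
    data_split = [data.iloc[pos] for pos in positions]
    # pool.map pickles func, so func must be defined at module level (no lambda).
    with Pool(num_of_processes) as pool:
        return pd.concat(pool.map(func, data_split))

def run_on_subset(func, data_subset):
    # Apply func to every row of one chunk.
    return data_subset.apply(func, axis=1)

def parallelize_on_rows(data, func, num_of_processes=8):
    # Bind func first so that each worker only receives its chunk.
    return parallelize(data, partial(run_on_subset, func), num_of_processes)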
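The split, apply and concatenate pattern can be checked on a small frame. The sketch below is a variation, not the code above: it swaps multiprocessing.Pool for multiprocessing.pool.ThreadPool, which exposes the same map() interface, so that it runs as a plain script without an `if __name__ == "__main__":` guard. Threads will not speed up CPU-bound row functions; the goal here is only to show that the chunked result matches a direct apply. The function row_total and the columns a and b are hypothetical.

```python
from functools import partial
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd


def row_total(row):
    # Hypothetical row-wise function: adds two columns of a single row.
    return row["a"] + row["b"]


def run_on_subset(func, data_subset):
    # Apply func to every row of one chunk.
    return data_subset.apply(func, axis=1)


def parallelize_on_rows(data, func, num_of_workers=4):
    # Split the row positions into chunks, then slice the frame with iloc.
    positions = np.array_split(np.arange(len(data)), num_of_workers)
    chunks = [data.iloc[pos] for pos in positions]
    # ThreadPool stands in for Pool here; both provide the same map() method.
    with ThreadPool(num_of_workers) as pool:
        parts = pool.map(partial(run_on_subset, func), chunks)
    # map() preserves chunk order, so concat restores the original row order.
    return pd.concat(parts)


df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
chunked = parallelize_on_rows(df, row_total)
direct = df.apply(row_total, axis=1)

print(chunked.equals(direct))
```

When switching back to Pool for real parallelism, the row function has to be importable by the worker processes, and on platforms that spawn new interpreters the call should sit under an `if __name__ == "__main__":` guard.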