Newsportal USENET - Re: help: pandas and 2d table

jak <nospam@please.ty> wrote or quoted:

Stefan Ram wrote:
df = df.where( df == 'zz' ).stack().reset_index()
result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
Since I don't know Pandas, I will need a month at least to understand
these 2 lines of code. Thanks again.

Here's a technique to better understand such code:

Transform it into a program with small statements and small
expressions with no more than one call per statement if possible.
(After each little change, check that the output stays the same.)

import pandas as pd

# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' ) as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )
# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

selection = df.where( df == 'zz' )
selection_stack = selection.stack()
df = selection_stack.reset_index()
df0 = df.iloc[ :, 0 ]
df1 = df.iloc[ :, 1 ]
z = zip( df0, df1 )
l = list( z )
result ={ 'zz': l }
print( result )

Next, I suggest inserting print statements that show each
intermediate value:

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
print( 'df = \n', type( df ), ':\n"', df, '"\n' )

selection = df.where( df == 'zz' )
print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
selection, '"\n' )

selection_stack = selection.stack()
print( 'result of stack() = \n', type( selection_stack ), ':\n"',
selection_stack, '"\n' )

df = selection_stack.reset_index()
print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

df0 = df.iloc[ :, 0 ]
print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

df1 = df.iloc[ :, 1 ]
print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

z = zip( df0, df1 )
print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

l = list( z )
print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

result ={ 'zz': l }
print( "value of { 'zz': l }= \n", type( result ), ':\n"',
result, '"\n' )

print( result )

Now you can see what each single step does!

df =
<class 'pandas.core.frame.DataFrame'> :
" foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1   aa   ab   zz   ad   ae   af
foo2   ba   bb   bc   bd   zz   bf
foo3   ca   zz   cc   cd   ce   zz
foo4   da   db   dc   dd   de   df
foo5   ea   eb   ec   zz   ee   ef
foo6   fa   fb   fc   fd   fe   ff "

result of where( df == 'zz' ) =
<class 'pandas.core.frame.DataFrame'> :
"      foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1  NaN  NaN   zz  NaN  NaN  NaN
foo2  NaN  NaN  NaN  NaN   zz  NaN
foo3  NaN   zz  NaN  NaN  NaN   zz
foo4  NaN  NaN  NaN  NaN  NaN  NaN
foo5  NaN  NaN  NaN   zz  NaN  NaN
foo6  NaN  NaN  NaN  NaN  NaN  NaN "

result of stack() =
<class 'pandas.core.series.Series'> :
" obj
foo1  foo3    zz
foo2  foo5    zz
foo3  foo2    zz
      foo6    zz
foo5  foo4    zz
dtype: object "

result of reset_index() =
<class 'pandas.core.frame.DataFrame'> :
"    obj level_1   0
0  foo1    foo3  zz
1  foo2    foo5  zz
2  foo3    foo2  zz
3  foo3    foo6  zz
4  foo5    foo4  zz "

value of .iloc[ :, 0 ]=
<class 'pandas.core.series.Series'> :
" 0 foo1
1 foo2
2 foo3
3 foo3
4 foo5
Name: obj, dtype: object "

value of .iloc[ :, 1 ] =
<class 'pandas.core.series.Series'> :
" 0 foo3
1 foo5
2 foo2
3 foo6
4 foo4
Name: level_1, dtype: object "

result of zip( df0, df1 )=
<class 'zip'> :
" <zip object at 0x000000000B3B9548> "

result of list( z )=
<class 'list'> :
" [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')] "

value of { 'zz': l }=
<class 'dict'> :
" {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]} "

{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

The script reads a CSV file and stores the data in a Pandas
DataFrame object named "df". The "index_col=0" parameter tells
Pandas to use the first column as the index of the DataFrame,
that is, as the row labels (the row counterpart of the column
headers).
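
To see what "index_col=0" changes, here is a minimal sketch. It
reads a made-up two-row excerpt of the table from a string via
io.StringIO instead of a file; the names "csv_text", "plain" and
"labeled" are only illustrative.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt of the table, kept in a string
# so that no file needs to be written.
csv_text = "obj,foo1,foo2\nfoo1,aa,ab\nfoo2,ba,zz\n"

# Without index_col, "obj" is an ordinary data column and the
# rows get the default integer labels.
plain = pd.read_csv( io.StringIO( csv_text ))

# With index_col=0, the "obj" column becomes the row labels.
labeled = pd.read_csv( io.StringIO( csv_text ), index_col=0 )

print( list( plain.columns ))    # column names, including 'obj'
print( list( labeled.columns ))  # column names, without 'obj'
print( list( labeled.index ))    # row labels taken from 'obj'
```

With the row labels in place, a cell can be addressed by name,
as in labeled.loc[ 'foo2', 'foo2' ].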

The "where" creates a new DataFrame selection that contains
the same data as df, but with all values replaced by NaN (Not
a Number) except for the values that are equal to 'zz'.
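
The "where" step can also be tried on its own. The following
sketch uses a tiny made-up DataFrame (the names "small" and
"mask" are mine) and shows the boolean mask first, then the
masked result.

```python
import pandas as pd

# Tiny made-up table with two 'zz' cells.
small = pd.DataFrame(
    { 'foo1': [ 'aa', 'zz' ], 'foo2': [ 'zz', 'bb' ]},
    index=[ 'foo1', 'foo2' ])

# Comparing a DataFrame with a scalar gives a DataFrame of
# booleans with the same shape.
mask = small == 'zz'

# "where" keeps the cells where the mask is True and puts a
# missing value into all the other cells.
selection = small.where( mask )

print( mask )
print( selection )
```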

"stack" returns a Series with a multi-level index created
by pivoting the columns. Here it gives a Series indexed by the
row-column addresses of all the non-NaN values. In its general
form, "stack" is probably the most complex operation of this
script. It's explained in the pandas manual (see there).
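
A small sketch of the "stack" step, again on a made-up table.
As far as I know, newer pandas versions may keep the NaN
entries when stacking, so the sketch adds an explicit
"dropna()" to be safe; with versions that already drop them,
that call simply changes nothing.

```python
import pandas as pd

# Tiny made-up table with two 'zz' cells.
small = pd.DataFrame(
    { 'foo1': [ 'aa', 'zz' ], 'foo2': [ 'zz', 'bb' ]},
    index=[ 'foo1', 'foo2' ])

selection = small.where( small == 'zz' )

# "stack" moves the column labels into a second index level,
# so each remaining value is addressed by a (row, column) pair.
# The dropna() is an addition of this sketch (see above).
stacked = selection.stack().dropna()

print( stacked )
print( list( stacked.index ))
```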

"reset_index" then just transforms this Series back into a
DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
first and second column, respectively, of that DataFrame. These
then are zipped to get the desired form as a list of pairs.
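
The last steps can be sketched in isolation, too. Here the
stacked Series is built by hand from two made-up addresses, so
the names "index", "stacked", "flat", "rows", "cols" and "pairs"
are only illustrative.

```python
import pandas as pd

# A hand-made Series with a two-level index, standing in for
# the result of "stack". The second level has no name, as in
# the output shown above.
index = pd.MultiIndex.from_tuples(
    [( 'foo1', 'foo3' ), ( 'foo3', 'foo2' )], names=[ 'obj', None ])
stacked = pd.Series( [ 'zz', 'zz' ], index=index )

# reset_index turns both index levels into ordinary columns.
flat = stacked.reset_index()

rows = flat.iloc[ :, 0 ]  # first column: the row labels
cols = flat.iloc[ :, 1 ]  # second column: the column labels

pairs = list( zip( rows, cols ))
print( pairs )

# The same pairs are also available directly as index entries.
print( list( stacked.index ))
```

If only the pairs are wanted, the last line suggests that
"reset_index" and "zip" could be skipped, though the longer
route shows more clearly where each part of a pair comes from.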
