Subject : Re: help: pandas and 2d table
From : nospam (at) *nospam* please.ty (jak)
Newsgroups : comp.lang.python
Date : 15. Apr 2024, 08:05:18
Organisation : A noiseless patient Spider
Message-ID : <uvig30$4mnd$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9
User-Agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0 SeaMonkey/2.53.18.2
Stefan Ram wrote:
jak <nospam@please.ty> wrote or quoted:
Stefan Ram wrote:
df = df.where( df == 'zz' ).stack().reset_index()
result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
Since I don't know Pandas, I will need a month at least to understand
these 2 lines of code. Thanks again.
Here's a technique to better understand such code:
Transform it into a program with small statements and small
expressions with no more than one call per statement if possible.
(After each little change, check that the output stays the same.)
import pandas as pd
# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' ) as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )
# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
selection = df.where( df == 'zz' )
selection_stack = selection.stack()
df = selection_stack.reset_index()
df0 = df.iloc[ :, 0 ]
df1 = df.iloc[ :, 1 ]
z = zip( df0, df1 )
l = list( z )
result ={ 'zz': l }
print( result )
Next, I suggest inserting print statements to print each intermediate
value:
# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
print( 'df = \n', type( df ), ':\n"', df, '"\n' )
selection = df.where( df == 'zz' )
print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
selection, '"\n' )
selection_stack = selection.stack()
print( 'result of stack() = \n', type( selection_stack ), ':\n"',
selection_stack, '"\n' )
df = selection_stack.reset_index()
print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )
df0 = df.iloc[ :, 0 ]
print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )
df1 = df.iloc[ :, 1 ]
print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )
z = zip( df0, df1 )
print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )
l = list( z )
print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )
result ={ 'zz': l }
print( "value of { 'zz': l }= \n", type( result ), ':\n"',
result, '"\n' )
print( result )
Now you can see what each single step does!
df =
<class 'pandas.core.frame.DataFrame'> :
"      foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1   aa   ab   zz   ad   ae   af
foo2   ba   bb   bc   bd   zz   bf
foo3   ca   zz   cc   cd   ce   zz
foo4   da   db   dc   dd   de   df
foo5   ea   eb   ec   zz   ee   ef
foo6   fa   fb   fc   fd   fe   ff "
result of where( df == 'zz' ) =
<class 'pandas.core.frame.DataFrame'> :
"      foo1 foo2 foo3 foo4 foo5 foo6
obj
foo1  NaN  NaN   zz  NaN  NaN  NaN
foo2  NaN  NaN  NaN  NaN   zz  NaN
foo3  NaN   zz  NaN  NaN  NaN   zz
foo4  NaN  NaN  NaN  NaN  NaN  NaN
foo5  NaN  NaN  NaN   zz  NaN  NaN
foo6  NaN  NaN  NaN  NaN  NaN  NaN "
result of stack() =
<class 'pandas.core.series.Series'> :
" obj
foo1  foo3    zz
foo2  foo5    zz
foo3  foo2    zz
      foo6    zz
foo5  foo4    zz
dtype: object "
result of reset_index() =
<class 'pandas.core.frame.DataFrame'> :
"     obj level_1   0
0  foo1    foo3  zz
1  foo2    foo5  zz
2  foo3    foo2  zz
3  foo3    foo6  zz
4  foo5    foo4  zz "
value of .iloc[ :, 0 ]=
<class 'pandas.core.series.Series'> :
" 0 foo1
1 foo2
2 foo3
3 foo3
4 foo5
Name: obj, dtype: object "
value of .iloc[ :, 1 ] =
<class 'pandas.core.series.Series'> :
" 0 foo3
1 foo5
2 foo2
3 foo6
4 foo4
Name: level_1, dtype: object "
result of zip( df0, df1 )=
<class 'zip'> :
" <zip object at 0x000000000B3B9548>"
result of list( z )=
<class 'list'> :
" [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]"
value of { 'zz': l }=
<class 'dict'> :
" {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}"
{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}
The script reads a CSV file and stores the data in a Pandas
DataFrame object named "df". The "index_col=0" parameter tells
Pandas to use the first column as the index for the DataFrame,
that is, as the row labels, which are the row-wise counterpart
of the column headers.
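
To see what "index_col=0" changes, here is a minimal sketch that reads
the same text twice. The two-row excerpt and the names csv_text,
with_index and without_index are made up for this illustration, and
io.StringIO merely stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt of the table (not the full file from the thread).
csv_text = "obj,foo1,foo2\nfoo1,aa,zz\nfoo2,zz,bb\n"

# First column becomes the row labels (the index).
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# No index_col: 'obj' stays an ordinary data column and rows are numbered.
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.index), list(with_index.columns))
print(list(without_index.index), list(without_index.columns))
```

Without "index_col=0" the later steps would report row numbers instead
of the names from the "obj" column.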
The "where" creates a new DataFrame, "selection", that has
the same shape and labels as df, but with every value replaced
by NaN (Not a Number) except for the values that are equal to 'zz'.
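
The "where" step can be tried in isolation on a tiny frame. This is
only a sketch: the 2x2 data and the name mask are assumptions made for
the illustration.

```python
import pandas as pd

# Hypothetical 2x2 excerpt; labels mimic the table from the thread.
df = pd.DataFrame(
    {"foo1": ["aa", "zz"], "foo2": ["zz", "bb"]},
    index=["foo1", "foo2"],
)

# Element-wise comparison gives a boolean frame of the same shape.
mask = df == "zz"

# where() keeps values where the mask is True and puts NaN elsewhere.
selection = df.where(mask)

print(mask)
print(selection)
```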
"stack" returns a Series with a multi-level index created
by pivoting the columns. Here it gives a Series with the
row and column labels of all the non-NaN values. In general,
"stack" is probably the most complex operation in this
script; it is explained in the pandas manual (see there).
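
A small sketch of what "stack" does to such a selection follows. The
2x2 data is made up. The explicit dropna() is an assumption added here
and is not in the original script: depending on the pandas version,
stack() may or may not drop the NaN entries by itself, and dropna()
makes the outcome the same either way.

```python
import pandas as pd

# Hypothetical 2x2 excerpt with a named index, as read_csv(index_col=0) would give.
df = pd.DataFrame(
    {"foo1": ["aa", "zz"], "foo2": ["zz", "zz"]},
    index=pd.Index(["foo1", "foo2"], name="obj"),
)
selection = df.where(df == "zz")

# stack() moves the column labels into a second index level, so every
# remaining value is addressed by a (row label, column label) pair.
# dropna() is a precaution (see the note above), not part of the original.
stacked = selection.stack().dropna()

print(stacked)
print(stacked.index.tolist())
```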
"reset_index" then just transforms this Series back into a
DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
first and second column, respectively, of that DataFrame. These
then are zipped to get the desired form as a list of pairs.
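
The last steps can be sketched on a hand-built Series. The three
entries below are an excerpt chosen for this illustration, and the name
flat is hypothetical; the script itself reuses "df" for it.

```python
import pandas as pd

# Hand-built stand-in for the stacked Series (excerpt, not the full result).
stacked = pd.Series(
    ["zz", "zz", "zz"],
    index=pd.MultiIndex.from_tuples(
        [("foo1", "foo3"), ("foo3", "foo2"), ("foo3", "foo6")],
        names=["obj", None],
    ),
)

# reset_index() turns both index levels into ordinary columns; the
# unnamed level is called 'level_1' and the values column is called 0.
flat = stacked.reset_index()
print(list(flat.columns))

# Pair up the first two columns position by position.
pairs = list(zip(flat.iloc[:, 0], flat.iloc[:, 1]))
result = {"zz": pairs}
print(result)
```

Under the same assumptions, stacked.index.tolist() appears to give the
same list of pairs directly, without going through reset_index.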
And this is a technique very similar to reverse engineering. Thanks for
the explanation and examples. All this is really clear and I was able to
follow it easily because I have already written a version of this code
in C without any kind of external library that uses the .CSV version of
the table as data ( 234 code lines :^/ ).
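
For comparison, a library-free approach can also be sketched in Python
with only the standard csv module. This is not a translation of the C
program mentioned above, just an illustration of the same idea; the
helper name find_cells is hypothetical, and the table is embedded as a
string instead of being read from the file.

```python
import csv
import io

# The table from the thread, embedded as text for the sake of the example.
CSV_TEXT = """obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff
"""


def find_cells(text, target):
    """Return (row label, column label) pairs of all cells equal to target."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = header[1:]  # skip the 'obj' corner cell
    return [
        (row[0], column)
        for row in reader
        for column, value in zip(columns, row[1:])
        if value == target
    ]


result = {"zz": find_cells(CSV_TEXT, "zz")}
print(result)
```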