Re: help: pandas and 2d table

Subject: Re: help: pandas and 2d table
From: ram (at) *nospam* zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Date: 14. Apr 2024, 10:58:16
Organization: Stefan Ram
Message-ID: <pandas-20240414094956@ram.dialup.fu-berlin.de>
jak <nospam@please.ty> wrote or quoted:
> Stefan Ram wrote:
>> df = df.where( df == 'zz' ).stack().reset_index()
>> result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
> Since I don't know Pandas, I will need a month at least to understand
> these 2 lines of code. Thanks again.

  Here's a technique to better understand such code:

  Transform it into a program with small statements and small
  expressions with no more than one call per statement if possible.
  (After each little change, check that the output stays the same.)

import pandas as pd

# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' ) as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )
# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

selection = df.where( df == 'zz' )
selection_stack = selection.stack()
df = selection_stack.reset_index()
df0 = df.iloc[ :, 0 ]
df1 = df.iloc[ :, 1 ]
z = zip( df0, df1 )
l = list( z )
result ={ 'zz': l }
print( result )

  Next, I suggest inserting print statements to print each
  intermediate value:

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
print( 'df = \n', type( df ), ':\n"', df, '"\n' )

selection = df.where( df == 'zz' )
print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
  selection, '"\n' )

selection_stack = selection.stack()
print( 'result of stack() = \n', type( selection_stack ), ':\n"',
  selection_stack, '"\n' )

df = selection_stack.reset_index()
print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

df0 = df.iloc[ :, 0 ]
print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

df1 = df.iloc[ :, 1 ]
print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

z = zip( df0, df1 )
print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

l = list( z )
print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

result ={ 'zz': l }
print( "value of { 'zz': l }= \n", type( result ), ':\n"',
  result, '"\n' )

print( result )

  Now you can see what each single step does!

df =
 <class 'pandas.core.frame.DataFrame'> :
"      foo1 foo2 foo3 foo4 foo5 foo6
obj                              
foo1   aa   ab   zz   ad   ae   af
foo2   ba   bb   bc   bd   zz   bf
foo3   ca   zz   cc   cd   ce   zz
foo4   da   db   dc   dd   de   df
foo5   ea   eb   ec   zz   ee   ef
foo6   fa   fb   fc   fd   fe   ff "

result of where( df == 'zz' ) =
 <class 'pandas.core.frame.DataFrame'> :
"      foo1 foo2 foo3 foo4 foo5 foo6
obj                              
foo1  NaN  NaN   zz  NaN  NaN  NaN
foo2  NaN  NaN  NaN  NaN   zz  NaN
foo3  NaN   zz  NaN  NaN  NaN   zz
foo4  NaN  NaN  NaN  NaN  NaN  NaN
foo5  NaN  NaN  NaN   zz  NaN  NaN
foo6  NaN  NaN  NaN  NaN  NaN  NaN "

result of stack() =
 <class 'pandas.core.series.Series'> :
" obj      
foo1  foo3    zz
foo2  foo5    zz
foo3  foo2    zz
      foo6    zz
foo5  foo4    zz
dtype: object "

result of reset_index() =
 <class 'pandas.core.frame.DataFrame'> :
"     obj level_1   0
0  foo1    foo3  zz
1  foo2    foo5  zz
2  foo3    foo2  zz
3  foo3    foo6  zz
4  foo5    foo4  zz "

value of .iloc[ :, 0 ]=
 <class 'pandas.core.series.Series'> :
" 0    foo1
1    foo2
2    foo3
3    foo3
4    foo5
Name: obj, dtype: object "

value of .iloc[ :, 1 ] =
 <class 'pandas.core.series.Series'> :
" 0    foo3
1    foo5
2    foo2
3    foo6
4    foo4
Name: level_1, dtype: object "

result of zip( df0, df1 )=
 <class 'zip'> :
" <zip object at 0x000000000B3B9548> "

result of list( z )=
 <class 'list'> :
" [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')] "

value of { 'zz': l }=
 <class 'dict'> :
" {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]} "

{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

  The script reads a CSV file and stores the data in a Pandas
  DataFrame object named "df". The "index_col=0" parameter tells
  Pandas to use the first column as the index for the DataFrame,
  that is, as the row labels, much as the first line of the file
  supplies the column headers.
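
  To see what "index_col=0" changes, one can read the same text
  with and without it. This is only a small sketch: the two-row
  excerpt "csv_text" is made up for illustration, and it is read
  from memory via io.StringIO instead of from a file.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt of the table, kept in memory (no file is written).
csv_text = 'obj,foo1,foo2\nfoo1,aa,ab\nfoo2,ba,zz\n'

# With index_col=0 the "obj" column becomes the row labels.
with_index = pd.read_csv( io.StringIO( csv_text ), index_col=0 )

# Without it, "obj" stays an ordinary data column and the rows are numbered.
without_index = pd.read_csv( io.StringIO( csv_text ))

print( list( with_index.index ))
print( list( without_index.index ))
print( list( without_index.columns ))
```

  Without "index_col=0", the later steps would report row numbers
  instead of the names from the "obj" column.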

  The "where" creates a new DataFrame selection that contains
  the same data as df, but with all values replaced by NaN (Not
  a Number) except for the values that are equal to 'zz'.
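
  The argument of "where" is itself a DataFrame of booleans. A
  small sketch with a made-up 2x2 table (the names "small", "mask",
  "r1", "c1" and so on are only for illustration) shows both the
  mask and what "where" makes of it:

```python
import pandas as pd

# Hypothetical 2x2 table, just for illustration.
small = pd.DataFrame(
    { 'c1': [ 'aa', 'zz' ], 'c2': [ 'zz', 'bb' ] },
    index=[ 'r1', 'r2' ] )

# The comparison yields a DataFrame of the same shape holding True/False.
mask = small == 'zz'
print( mask )

# where() keeps a value where the mask is True and puts NaN elsewhere.
selection = small.where( mask )
print( selection )
```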

  "stack" returns a Series with a multi-level index created
  by pivoting the columns. Here it gives a Series with the
  row-col-addresses of all the non-NaN values. In its general
  form, "stack" might be the most complex operation of
  this script. It's explained in the pandas manual (see there).
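
  A tiny sketch of "stack" on a made-up 2x2 table with two missing
  cells follows. One caveat, as far as I know: older pandas versions
  drop the missing values in "stack" by default, while the newer
  implementation announced with pandas 2.1 keeps them. The explicit
  "dropna" below is my addition to get the same result either way.

```python
import pandas as pd

# Hypothetical 2x2 table with two missing cells, just for illustration.
small = pd.DataFrame(
    { 'c1': [ 'aa', None ], 'c2': [ None, 'bb' ] },
    index=[ 'r1', 'r2' ] )

# stack() moves the column labels into a second index level,
# so every remaining value is addressed by a (row, column) pair.
stacked = small.stack()

# Assumption: depending on the pandas version, missing cells may still
# be present at this point, so they are dropped explicitly here.
stacked = stacked.dropna()

print( stacked )
print( list( stacked.index ))
```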

  "reset_index" then just transforms this Series back into a
  DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
  first and second column, respectively, of that DataFrame. These
  then are zipped to get the desired form as a list of pairs.
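
  Since the index of the stacked Series already consists of
  (row, column) pairs, the last steps could possibly be shortened.
  The following is only a sketch of that idea, not the code from
  the thread: the helper name "find_cells" is hypothetical, the
  table is read from memory instead of from a file, and the explicit
  "dropna" is an assumption to guard against pandas versions whose
  "stack" keeps NaN values.

```python
import io
import pandas as pd

# The same table as above, kept in memory instead of written to a file.
csv_text = '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff'''

def find_cells( df, value ):
    '''Hypothetical helper: (row label, column label) pairs of cells equal to value.'''
    selection = df.where( df == value )       # NaN everywhere except matches
    stacked = selection.stack().dropna()      # one entry per matching cell
    return list( stacked.index )              # the index holds the pairs

df = pd.read_csv( io.StringIO( csv_text ), index_col=0 )
result = { 'zz': find_cells( df, 'zz' ) }
print( result )
```

  This should give the same dictionary as the step-by-step version,
  without the detour through "reset_index", ".iloc" and "zip".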

