Article <pandas-20240414094956@ram.dialup.fu-berlin.de>

Deutsch English Français Italiano
<pandas-20240414094956@ram.dialup.fu-berlin.de>

View for Bookmarking (what is this?)
Look up another Usenet article
Path: ...!2.eu.feeder.erje.net!feeder.erje.net!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Subject: Re: help: pandas and 2d table
Date: 14 Apr 2024 08:58:16 GMT
Organization: Stefan Ram
Lines: 167
Expires: 1 Feb 2025 11:59:58 GMT
Message-ID: <pandas-20240414094956@ram.dialup.fu-berlin.de>
References: <uvbv6a$2gmc4$1@dont-email.me> <pandas-20240412202220@ram.dialup.fu-berlin.de> <uvdvlj$30soq$1@dont-email.me> <8b63c74a-d8e5-4c3a-ac3e-b240c88b7dcb@wichmann.us> <CAO39LaTUr5_KC5PLqb9_PZs4cQMombFUVf21K9o5HhkrWJfKnw@mail.gmail.com> <mailman.104.1713028490.3468.python-list@python.org> <pandas-20240413193824@ram.dialup.fu-berlin.de> <uvetru$375r4$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de Lod5LRV7nSkg/FvapUzQVACIaE0CeYwptMjIr+RlyAcjN8
Cancel-Lock: sha1:UaFsmj7t9Lu1R/B0cxBzYjSXeNM= sha256:RE1CUuICWGzfBpurQldzINVL5Dn1OtE+rFkPYIzeq4A=
X-Copyright: (C) Copyright 2024 Stefan Ram. All rights reserved.
	Distribution through any means other than regular usenet
	channels is forbidden. It is forbidden to publish this
	article in the Web, to change URIs of this article into links,
        and to transfer the body without this notice, but quotations
        of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
	services to mirror the article in the web. But the article may
	be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Bytes: 7033

jak <nospam@please.ty> wrote or quoted:
>Stefan Ram ha scritto:
>>df = df.where( df == 'zz' ).stack().reset_index()
>>result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
>Since I don't know Pandas, I will need a month at least to understand
>these 2 lines of code. Thanks again.

  Here's a technique to better understand such code:

  Transform it into a program with small statements and small
  expressions with no more than one call per statement if possible.
  (After each litte change check that the output stays the same.)

import pandas as pd

# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' )as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )
# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

selection = df.where( df == 'zz' )
selection_stack = selection.stack()
df = selection_stack.reset_index()
df0 = df.iloc[ :, 0 ]
df1 = df.iloc[ :, 1 ]
z = zip( df0, df1 )
l = list( z )
result ={ 'zz': l }
print( result )

  I suggest to next insert print statements to print each intermediate
  value:

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
print( 'df = \n', type( df ), ':\n"', df, '"\n' )

selection = df.where( df == 'zz' )
print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
  selection, '"\n' )

selection_stack = selection.stack()
print( 'result of stack() = \n', type( selection_stack ), ':\n"',
  selection_stack, '"\n' )

df = selection_stack.reset_index()
print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

df0 = df.iloc[ :, 0 ]
print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

df1 = df.iloc[ :, 1 ]
print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

z = zip( df0, df1 )
print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

l = list( z )
print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

result ={ 'zz': l }
print( "value of { 'zz': l }= \n", type( result ), ':\n"',
  result, '"\n' )

print( result )

  Now you can see what each single step does!

df = 
 <class 'pandas.core.frame.DataFrame'> :
"      foo1 foo2 foo3 foo4 foo5 foo6
obj                               
foo1   aa   ab   zz   ad   ae   af
foo2   ba   bb   bc   bd   zz   bf
foo3   ca   zz   cc   cd   ce   zz
foo4   da   db   dc   dd   de   df
foo5   ea   eb   ec   zz   ee   ef
foo6   fa   fb   fc   fd   fe   ff "

result of where( df == 'zz' ) = 
 <class 'pandas.core.frame.DataFrame'> :
"      foo1 foo2 foo3 foo4 foo5 foo6
obj                               
foo1  NaN  NaN   zz  NaN  NaN  NaN
foo2  NaN  NaN  NaN  NaN   zz  NaN
foo3  NaN   zz  NaN  NaN  NaN   zz
foo4  NaN  NaN  NaN  NaN  NaN  NaN
foo5  NaN  NaN  NaN   zz  NaN  NaN
foo6  NaN  NaN  NaN  NaN  NaN  NaN "

result of stack() = 
 <class 'pandas.core.series.Series'> :
" obj       
foo1  foo3    zz
foo2  foo5    zz
foo3  foo2    zz
      foo6    zz
foo5  foo4    zz
dtype: object "

result of reset_index() = 
 <class 'pandas.core.frame.DataFrame'> :
"     obj level_1   0
0  foo1    foo3  zz
1  foo2    foo5  zz
2  foo3    foo2  zz
3  foo3    foo6  zz
4  foo5    foo4  zz "

value of .iloc[ :, 0 ]= 
 <class 'pandas.core.series.Series'> :
" 0    foo1
1    foo2
2    foo3
3    foo3
4    foo5
Name: obj, dtype: object "

value of .iloc[ :, 1 ] = 
 <class 'pandas.core.series.Series'> :
" 0    foo3
1    foo5
2    foo2
3    foo6
4    foo4
Name: level_1, dtype: object "

result of zip( df0, df1 )= 
 <class 'zip'> :
" <zip object at 0x000000000B3B9548> "

result of list( z )= 
 <class 'list'> :
" [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')] "

value of { 'zz': l }= 
 <class 'dict'> :
" {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]} "

{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

  The script reads a CSV file and stores the data in a Pandas
  DataFrame object named "df". The "index_col=0" parameter tells
  Pandas to use the first column as the index for the DataFrame,
  which is kinda like column headers.

  The "where" creates a new DataFrame selection that contains
  the same data as df, but with all values replaced by NaN (Not
  a Number) except for the values that are equal to 'zz'.

  "stack" returns a Series with a multi-level index created
  by pivoting the columns. Here it gives a Series with the
  row-col-addresses of a all the non-NaN values. The general
  meaning of "stack" might be the most complex operation of
  this script. It's explained in the pandas manual (see there).

  "reset_index" then just transforms this Series back into a
  DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
  first and second column, respectively, of that DataFrame. These
  then are zipped to get the desired form as a list of pairs.