Path: ...!2.eu.feeder.erje.net!feeder.erje.net!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Subject: Re: help: pandas and 2d table
Date: 14 Apr 2024 08:58:16 GMT
Organization: Stefan Ram
Lines: 167
Expires: 1 Feb 2025 11:59:58 GMT
Message-ID: <pandas-20240414094956@ram.dialup.fu-berlin.de>
References: <uvbv6a$2gmc4$1@dont-email.me> <pandas-20240412202220@ram.dialup.fu-berlin.de> <uvdvlj$30soq$1@dont-email.me> <8b63c74a-d8e5-4c3a-ac3e-b240c88b7dcb@wichmann.us> <CAO39LaTUr5_KC5PLqb9_PZs4cQMombFUVf21K9o5HhkrWJfKnw@mail.gmail.com> <mailman.104.1713028490.3468.python-list@python.org> <pandas-20240413193824@ram.dialup.fu-berlin.de> <uvetru$375r4$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de Lod5LRV7nSkg/FvapUzQVACIaE0CeYwptMjIr+RlyAcjN8
Cancel-Lock: sha1:UaFsmj7t9Lu1R/B0cxBzYjSXeNM= sha256:RE1CUuICWGzfBpurQldzINVL5Dn1OtE+rFkPYIzeq4A=
X-Copyright: (C) Copyright 2024 Stefan Ram. All rights reserved. Distribution through any means other than regular usenet channels is forbidden. It is forbidden to publish this article in the Web, to change URIs of this article into links, and to transfer the body without this notice, but quotations of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some services to mirror the article in the web. But the article may be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Bytes: 7033

jak <nospam@please.ty> wrote or quoted:
>Stefan Ram wrote:
>>df = df.where( df == 'zz' ).stack().reset_index()
>>result ={ 'zz': list( zip( df.iloc[ :, 0 ], df.iloc[ :, 1 ]))}
>Since I don't know Pandas, I will need a month at least to understand
>these 2 lines of code. Thanks again.

  Here's a technique to better understand such code:

  Transform it into a program with small statements and small
  expressions, with no more than one call per statement if possible.
  (After each little change, check that the output stays the same.)

import pandas as pd

# Warning! Will overwrite the file 'file_20240412201813_tmp_DML.csv'!
with open( 'file_20240412201813_tmp_DML.csv', 'w' ) as out:
    print( '''obj,foo1,foo2,foo3,foo4,foo5,foo6
foo1,aa,ab,zz,ad,ae,af
foo2,ba,bb,bc,bd,zz,bf
foo3,ca,zz,cc,cd,ce,zz
foo4,da,db,dc,dd,de,df
foo5,ea,eb,ec,zz,ee,ef
foo6,fa,fb,fc,fd,fe,ff''', file=out )

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )

selection = df.where( df == 'zz' )

selection_stack = selection.stack()

df = selection_stack.reset_index()

df0 = df.iloc[ :, 0 ]
df1 = df.iloc[ :, 1 ]

z = zip( df0, df1 )

l = list( z )

result ={ 'zz': l }

print( result )

  Next, I suggest inserting print statements that show each
  intermediate value:

# Note the "index_col=0" below, which is important here!
df = pd.read_csv( 'file_20240412201813_tmp_DML.csv', index_col=0 )
print( 'df = \n', type( df ), ':\n"', df, '"\n' )

selection = df.where( df == 'zz' )
print( "result of where( df == 'zz' ) = \n", type( selection ), ':\n"',
  selection, '"\n' )

selection_stack = selection.stack()
print( 'result of stack() = \n', type( selection_stack ), ':\n"',
  selection_stack, '"\n' )

df = selection_stack.reset_index()
print( 'result of reset_index() = \n', type( df ), ':\n"', df, '"\n' )

df0 = df.iloc[ :, 0 ]
print( 'value of .iloc[ :, 0 ]= \n', type( df0 ), ':\n"', df0, '"\n' )

df1 = df.iloc[ :, 1 ]
print( 'value of .iloc[ :, 1 ] = \n', type( df1 ), ':\n"', df1, '"\n' )

z = zip( df0, df1 )
print( 'result of zip( df0, df1 )= \n', type( z ), ':\n"', z, '"\n' )

l = list( z )
print( 'result of list( z )= \n', type( l ), ':\n"', l, '"\n' )

result ={ 'zz': l }
print( "value of { 'zz': l }= \n", type( result ), ':\n"', result, '"\n' )

print( result )

  Now you can see what each single step does!

df =
 <class 'pandas.core.frame.DataFrame'> :
"       foo1  foo2  foo3  foo4  foo5  foo6
obj
foo1    aa    ab    zz    ad    ae    af
foo2    ba    bb    bc    bd    zz    bf
foo3    ca    zz    cc    cd    ce    zz
foo4    da    db    dc    dd    de    df
foo5    ea    eb    ec    zz    ee    ef
foo6    fa    fb    fc    fd    fe    ff "

result of where( df == 'zz' ) =
 <class 'pandas.core.frame.DataFrame'> :
"       foo1  foo2  foo3  foo4  foo5  foo6
obj
foo1   NaN   NaN    zz   NaN   NaN   NaN
foo2   NaN   NaN   NaN   NaN    zz   NaN
foo3   NaN    zz   NaN   NaN   NaN    zz
foo4   NaN   NaN   NaN   NaN   NaN   NaN
foo5   NaN   NaN   NaN    zz   NaN   NaN
foo6   NaN   NaN   NaN   NaN   NaN   NaN "

result of stack() =
 <class 'pandas.core.series.Series'> :
" obj
foo1  foo3    zz
foo2  foo5    zz
foo3  foo2    zz
      foo6    zz
foo5  foo4    zz
dtype: object "

result of reset_index() =
 <class 'pandas.core.frame.DataFrame'> :
"     obj  level_1   0
0  foo1     foo3  zz
1  foo2     foo5  zz
2  foo3     foo2  zz
3  foo3     foo6  zz
4  foo5     foo4  zz "

value of .iloc[ :, 0 ]=
 <class 'pandas.core.series.Series'> :
" 0    foo1
1    foo2
2    foo3
3    foo3
4    foo5
Name: obj, dtype: object "

value of .iloc[ :, 1 ] =
 <class 'pandas.core.series.Series'> :
" 0    foo3
1    foo5
2    foo2
3    foo6
4    foo4
Name: level_1, dtype: object "

result of zip( df0, df1 )=
 <class 'zip'> :
" <zip object at 0x000000000B3B9548> "

result of list( z )=
 <class 'list'> :
" [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')] "

value of { 'zz': l }=
 <class 'dict'> :
" {'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]} "

{'zz': [('foo1', 'foo3'), ('foo2', 'foo5'), ('foo3', 'foo2'), ('foo3', 'foo6'), ('foo5', 'foo4')]}

  The script reads a CSV file and stores the data in a Pandas
  DataFrame object named "df". The "index_col=0" parameter tells
  Pandas to use the first column as the index of the DataFrame.
  The index holds the row labels, much as the header line holds
  the column labels.

  The "where" creates a new DataFrame "selection" with the same
  shape and labels as df, but with every value replaced by NaN
  (Not a Number) except for the values that are equal to 'zz'.

  "stack" returns a Series with a multi-level index created by
  pivoting the columns. Here it gives a Series whose index holds
  the (row, column) addresses of all the non-NaN values.
  "stack" is probably the most complex operation in this script.
  Its general meaning is explained in the pandas manual (see there).

  "reset_index" then just transforms this Series back into a
  DataFrame, and ".iloc[ :, 0 ]" and ".iloc[ :, 1 ]" are the
  first and second column, respectively, of that DataFrame. These
  are then zipped to get the desired form as a list of pairs.
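
  The steps described above can also be tried on a smaller table
  without writing a file. What follows is only a sketch under some
  assumptions of mine: the names "CSV_TEXT" and "find_cells" are made
  up for illustration, the table is read from an "io.StringIO" object
  instead of a file, and I added a ".dropna()" after "stack()"
  because, as far as I know, newer pandas versions may keep the NaN
  entries there. It also takes the pairs directly from the index of
  the stacked Series instead of going through "reset_index" and "zip".

```python
import io

import pandas as pd

# Hypothetical miniature of the table from the script above.
CSV_TEXT = """obj,foo1,foo2,foo3
foo1,aa,zz,ac
foo2,zz,bb,zz
foo3,ca,cb,cc
"""


def find_cells(df, value):
    """Return the (row label, column label) pairs of cells equal to value."""
    # Cells that do not match become NaN; matching cells keep their value.
    masked = df.where(df == value)
    # Pivot the columns into a second index level; dropna() is an
    # assumption-driven safeguard in case stack() keeps NaN entries.
    stacked = masked.stack().dropna()
    # The MultiIndex of the result already holds the (row, column) addresses.
    return stacked.index.tolist()


df = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)
pairs = find_cells(df, "zz")
print({"zz": pairs})
```

  Printing "masked" and "stacked" inside "find_cells" shows the same
  intermediate shapes as in the long listing above, just smaller.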