String Manipulation with pandas

String Object Methods

import pandas as pd

import numpy as np

val='a,b, guido'

val.split(',') # Python's built-in str.split method

['a', 'b', ' guido']

pieces=[x.strip() for x in val.split(',')];pieces  # strip whitespace

['a', 'b', 'guido']

'::'.join(pieces)

'a::b::guido'

val.count(',')

2

val.count('guido')

1

val.replace(',',':')

'a:b: guido'

val.swapcase()

'A,B, GUIDO'

val[::-1]

'odiug ,b,a'
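The split, strip, and join steps above can be combined into one small helper. This is only a sketch: the function name `normalize_fields` and its parameters are made up for illustration and do not come from any library.

```python
def normalize_fields(raw, sep=',', joiner='::'):
    """Split raw on sep, strip whitespace from each piece, rejoin with joiner."""
    pieces = [piece.strip() for piece in raw.split(sep)]
    return joiner.join(pieces)


val = 'a,b, guido'
result = normalize_fields(val)
print(result)  # a::b::guido
```

Keeping the separator and the joiner as parameters lets the same helper clean other delimited strings.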

Regular Expressions

The re module functions fall into three categories: pattern matching, substitution, and splitting.

import re

text='foo   bar\t baz  \t qux'

re.split(r'\s+',text)

['foo', 'bar', 'baz', 'qux']

regex=re.compile(r'\s+')

regex.split(text)

['foo', 'bar', 'baz', 'qux']

regex.findall(text)

['   ', '\t ', '  \t ']

To avoid unwanted escaping with \ in a regular expression, use raw string literals such as r'\s+'.
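A short illustration of why the raw prefix matters. The sample strings here are invented for the demonstration. In a normal literal Python interprets `\b` as a backspace character before the re module ever sees it, so the word-boundary pattern silently stops working.

```python
import re

# In a normal literal '\t' is one tab character; in a raw literal it is
# a backslash followed by the letter t.
plain = '\t'
raw = r'\t'
print(len(plain), len(raw))  # 1 2

# '\b' becomes a backspace in a normal literal, so nothing matches.
# With the raw prefix, re receives \b and treats it as a word boundary.
word_plain = re.findall('\bcat\b', 'cat concat cat')
word_raw = re.findall(r'\bcat\b', 'cat concat cat')
print(word_plain)  # []
print(word_raw)    # ['cat', 'cat']
```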

text="""Dave dave@google.com
Steve steve@mail.com
Rob rob@mail.com
Ryan ryan@yahoo.com
"""

pattern=r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

regex=re.compile(pattern,re.I)

Using findall() produces a list of the email addresses.

regex.findall(text)

['dave@google.com', 'steve@mail.com', 'rob@mail.com', 'ryan@yahoo.com']

regex.findall(r' J.onepy+@w-m.co')

['J.onepy+@w-m.co']

search() returns a special match object for the first email address in the text.

m=regex.search(text);m

<re.Match object; span=(5, 20), match='dave@google.com'>

regex.match(text)

text[m.start():m.end()]

'dave@google.com'

regex.match(text) returns None, as it will only match if the pattern occurs at the start of the string.

sub() will return a new string with occurrences of the pattern replaced by a new string.
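The difference between search() and match() can be checked side by side, together with fullmatch(), which requires the pattern to cover the whole string. This is a minimal sketch; the sample strings are made up for the comparison.

```python
import re

regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.I)

samples = ['dave@google.com',
           'Dave dave@google.com',
           'dave@google.com and more']

results = []
for s in samples:
    found = regex.search(s) is not None      # pattern anywhere in s
    at_start = regex.match(s) is not None    # pattern at position 0 only
    whole = regex.fullmatch(s) is not None   # pattern must span all of s
    results.append((found, at_start, whole))
    print(repr(s), found, at_start, whole)
```

Only the first sample passes all three checks. The second is found by search() alone, and the third fails fullmatch() because of the trailing text.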

print(regex.sub('REDACTED',text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
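sub() can also reuse parts of each match when the pattern contains groups. The sketch below assumes a grouped variant of the email pattern (the parentheses are an addition for illustration and are not in the pattern above) and masks only the user name.

```python
import re

text = """Dave dave@google.com
Steve steve@mail.com
"""

# Same character classes as the email pattern, with parentheses added
# around the user name, the domain, and the suffix.
grouped = re.compile(r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})', re.I)

# \2 and \3 in the replacement refer to the second and third captured groups.
masked = grouped.sub(r'***@\2.\3', text)
print(masked)
```

The replacement string is also written as a raw literal so that `\2` reaches the re module as a group reference.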

Vectorized String Functions in pandas

data={'Dave':'dave@google.com','Steve':'steve@gmeil.com','Rob':'rob@gmail.com','Wes':np.nan}

data=pd.Series(data);data

Dave     dave@google.com
Steve    steve@gmeil.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

data.str.contains('gmail')

Dave     False
Steve    False
Rob       True
Wes        NaN
dtype: object
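Because the missing entry yields NaN, the result above cannot be used directly as a filter. One common option, sketched here, is the `na` parameter of `str.contains`, which sets the value used for missing entries.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmeil.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# na=False treats missing entries as "does not contain", so the result is a
# plain boolean mask that can be used for filtering.
mask = data.str.contains('gmail', na=False)
gmail_users = data[mask]
print(gmail_users)
```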

data

Dave     dave@google.com
Steve    steve@gmeil.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

data.map(lambda x:x[:2],na_action='ignore')  # x is each value in data; the returned Series has the same index as the caller (data here).

Dave      da
Steve     st
Rob       ro
Wes      NaN
dtype: object
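The na_action='ignore' argument matters here because NaN is a float and cannot be sliced. The sketch below uses a shortened version of the Series, with an explicit object dtype as an assumption to keep the behavior the same across pandas versions, and compares map with the equivalent .str slicing.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Wes': np.nan}, dtype=object)

# Without na_action, the lambda is also called on NaN and fails.
try:
    data.map(lambda x: x[:2])
    failed = False
except TypeError:
    failed = True  # 'float' object is not subscriptable

# With na_action='ignore', NaN is passed through untouched.
prefix_map = data.map(lambda x: x[:2], na_action='ignore')

# The .str accessor skips missing values on its own.
prefix_str = data.str[:2]

print(failed)
print(prefix_map)
print(prefix_str)
```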

help(data.map)

Help on method map in module pandas.core.series:

map(arg, na_action=None) method of pandas.core.series.Series instance
    Map values of Series using input correspondence (a dict, Series, or
    function).
    
    Parameters
    ----------
    arg : function, dict, or Series
        Mapping correspondence.
    na_action : {None, 'ignore'}
        If 'ignore', propagate NA values, without passing them to the
        mapping correspondence.
    
    Returns
    -------
    y : Series
        Same index as caller.
    
    Examples
    --------
    
    Map inputs to outputs (both of type `Series`):
    
    >>> x = pd.Series([1,2,3], index=['one', 'two', 'three'])
    >>> x
    one      1
    two      2
    three    3
    dtype: int64
    
    >>> y = pd.Series(['foo', 'bar', 'baz'], index=[1,2,3])
    >>> y
    1    foo
    2    bar
    3    baz
    
    >>> x.map(y)
    one   foo
    two   bar
    three baz
    
    If `arg` is a dictionary, return a new Series with values converted
    according to the dictionary's mapping:
    
    >>> z = {1: 'A', 2: 'B', 3: 'C'}
    
    >>> x.map(z)
    one   A
    two   B
    three C
    
    Use na_action to control whether NA values are affected by the mapping
    function.
    
    >>> s = pd.Series([1, 2, 3, np.nan])
    
    >>> s2 = s.map('this is a string {}'.format, na_action=None)
    0    this is a string 1.0
    1    this is a string 2.0
    2    this is a string 3.0
    3    this is a string nan
    dtype: object
    
    >>> s3 = s.map('this is a string {}'.format, na_action='ignore')
    0    this is a string 1.0
    1    this is a string 2.0
    2    this is a string 3.0
    3                     NaN
    dtype: object
    
    See Also
    --------
    Series.apply : For applying more complex functions on a Series.
    DataFrame.apply : Apply a function row-/column-wise.
    DataFrame.applymap : Apply a function elementwise on a whole DataFrame.
    
    Notes
    -----
    When `arg` is a dictionary, values in Series that are not in the
    dictionary (as keys) are converted to ``NaN``. However, if the
    dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
    provides a method for default values), then this default is used
    rather than ``NaN``:
    
    >>> from collections import Counter
    >>> counter = Counter()
    >>> counter['bar'] += 1
    >>> y.map(counter)
    1    0
    2    1
    3    0
    dtype: int64

pattern

'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'

data.str.findall(pattern,flags=re.I)

Dave     [dave@google.com]
Steve    [steve@gmeil.com]
Rob        [rob@gmail.com]
Wes                    NaN
dtype: object

matches=data.str.match(pattern,flags=re.I);matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

matches.str.get(1)

Dave    NaN
Steve   NaN
Rob     NaN
Wes     NaN
dtype: float64

matches.str[0]

Dave    NaN
Steve   NaN
Rob     NaN
Wes     NaN
dtype: float64
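In the output shown above, str.match yields booleans, so indexing into its result does not recover any part of the address. To pull out the pieces, one option is `str.extract` with a grouped pattern. This is a sketch under two assumptions: the parentheses are added to the pattern for illustration, and the column names are arbitrary.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmeil.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Email pattern with capture groups for user name, domain, and suffix.
grouped = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; rows with missing input stay missing.
parts = data.str.extract(grouped, flags=re.I)
parts.columns = ['user', 'domain', 'suffix']
print(parts)
```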

data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

posted @ 2020-04-16 20:21 JohnYang819
