pandas: String Processing
import numpy as np
import pandas as pd
Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.
Common String Object Methods
In many string munging and scripting applications, built-in methods are sufficient. As an example, a comma-separated string can be broken into pieces with split:
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including line breaks):
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
These substrings could be concatenated together with a two-colon delimiter using addition:
first, second, third = pieces  # tuple unpacking
first + "::" + second + "::" + third
'a::b::guido'
But this isn't a practical generic method. A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':
'::'.join(pieces)
'a::b::guido'
Other methods are concerned with locating substrings. Using Python's in keyword is the best way to detect a substring, though index and find can also be used:
"guido" in val
True
val.index(',')  # position of the first occurrence
1
val.find(":")  # position of the first occurrence, or -1 if not found
-1
Note the difference between find and index is that index raises an exception if the string isn't found (versus find, which returns -1):
val.index(':')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-37-2c016e7367ac> in <module>
----> 1 val.index(':')
ValueError: substring not found
val.find(":")
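When a missing substring is an expected case rather than an error, the two behaviors can be combined. The following is a minimal sketch; `safe_index` is a hypothetical helper name used only for illustration and is not part of Python's standard library.

```python
val = 'a,b, guido'

def safe_index(s, sub):
    """Return the position of sub in s, or None if it is absent.

    Hypothetical helper: it calls str.index and converts the
    ValueError raised for a missing substring into None.
    """
    try:
        return s.index(sub)
    except ValueError:
        return None

comma_pos = safe_index(val, ',')   # ',' is present
colon_pos = safe_index(val, ':')   # ':' is absent, so None is returned
found = val.find(':')              # find signals absence with -1 instead
print(comma_pos, colon_pos, found)
```

Returning None rather than -1 avoids accidentally using -1 as a valid negative index later on.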
Relatedly, count returns the number of occurrences of a particular substring:
val.count(',')
replace will substitute occurrences of one pattern for another. It is commonly used to delete patterns, too, by passing an empty string:
val
val.replace(',', ':')  # returns a new string; strings are immutable
'a:b: guido'
val  # the original is unchanged
'a,b, guido'
val.replace(',', '')  # replace with an empty string to delete
'ab guido'
See Table 7-3 for a listing of some of Python's string methods.
Regular expressions can also be used with many of these operations, as you'll see.
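Method | Description |
---|---|
count | Return the number of non-overlapping occurrences of a substring |
endswith | Return True if the string ends with the given suffix |
startswith | Return True if the string starts with the given prefix |
join | Use the string as a delimiter for concatenating a sequence of other strings |
index | Return the position of the first occurrence of a substring; raise ValueError if not found |
find | Return the position of the first occurrence of a substring; return -1 if not found |
rfind | Return the position of the last occurrence of a substring; return -1 if not found |
replace | Replace occurrences of one substring with another |
strip | Trim whitespace, including newlines, from both sides |
rstrip | Trim whitespace, including newlines, from the right side only |
lstrip | Trim whitespace, including newlines, from the left side only |
split | Break the string into a list of substrings using the given delimiter |
lower | Convert alphabetic characters to lowercase |
upper | Convert alphabetic characters to uppercase |
casefold | Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form |
ljust | Left justify; pad the right side with spaces (or another fill character) to reach a minimum width |
rjust | Right justify; pad the left side with spaces (or another fill character) to reach a minimum width |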
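Several methods in the table are not exercised in the examples above. The short sketch below tries a few of them; the sample strings are made up for illustration.

```python
s = '  Guido van Rossum \n'

trimmed = s.strip()                  # whitespace removed on both sides
left_only = s.lstrip()               # only leading whitespace removed
right_only = s.rstrip()              # only trailing whitespace removed

starts = trimmed.startswith('Guido')
ends = trimmed.endswith('Rossum')
last_space = trimmed.rfind(' ')      # position of the last space

# casefold is a more aggressive lower(): the German sharp s becomes 'ss'
folded = 'Straße'.casefold()

padded_left = 'ab'.rjust(5, '.')     # pad on the left to width 5
padded_right = 'ab'.ljust(5, '.')    # pad on the right to width 5

print(repr(trimmed), starts, ends, last_space, folded,
      padded_left, padded_right)
```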
Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in re module is responsible for applying regular expressions to strings; I'll give a number of examples of its use here.
The art of writing regular expressions could be a chapter of its own and thus is outside the book's scope. There are many excellent tutorials and references available on the internet and in other books.
The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related; a regex describes a pattern to locate in the text, which can then be used for many purposes. Let's look at a simple example:
Suppose we want to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is "\s+":
import re
text = "foo bar\t baz \tqux"
re.split(r"\s+", text)  # split on runs of whitespace
['foo', 'bar', 'baz', 'qux']
When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text. You can compile the regex yourself with re.compile, forming a reusable regex object:
regex = re.compile(r'\s+')  # a compiled pattern is convenient for reuse
regex.split(text)
['foo', 'bar', 'baz', 'qux']
If, instead, you want to get a list of all patterns matching the regex, you can use the findall method:
regex.findall(text)  # return every match as a list
[' ', '\t ', ' \t']
To avoid unwanted escaping with \ in a regular expression, use raw string literals like r'C:\x' instead of the equivalent 'C:\\x'.
Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles and makes the pattern easier to reuse.
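Both points can be checked directly. The sketch below compares a raw string with its escaped equivalent and reuses one compiled pattern across several strings; the `lines` list is made-up sample data.

```python
import re

# A raw string keeps the backslash literally, so both spellings are equal.
same = r'C:\x' == 'C:\\x'

# Compile once, then reuse the regex object for every string.
whitespace = re.compile(r'\s+')
lines = ['foo   bar', 'baz\tqux', 'one \n two']   # made-up sample data
tokens = [whitespace.split(line) for line in lines]

print(same)
print(tokens)
```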
match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match. More rigidly, match only matches at the beginning of the string. As a less trivial example, let's consider a block of text and a regular expression capable of identifying most email addresses:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
# Match all email addresses
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
Using findall on the text produces a list of the email addresses:
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
search returns a special match object for the first email address in the text. For the preceding regex, the match object can only tell us the start and end position of the pattern in the string:
m = regex.search(text)  # returns only the first match
m  # a Match object
<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
regex.match returns None, as it will only match if the pattern occurs at the start of the string:
# match returns None when the pattern is not found at the start
print(regex.match(text))
None
Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:
# re.sub takes (pattern, repl, string, count); the compiled form is regex.sub(repl, string, count)
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:
pattern = r'([a-z0-9+_.%-]+)@([a-z0-9+-._]+)\.([a-z0-9]{2,4})'  # parentheses define groups
regex = re.compile(pattern, flags=re.IGNORECASE)
A match object produced by this modified regex returns a tuple of the pattern components with its groups method:
m = regex.match("wesm@bring.net")
m.groups()
('wesm', 'bring', 'net')
findall returns a list of tuples when the pattern has groups:
regex.findall(text)  # regular expressions are very useful for data cleaning
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
sub also has access to groups in each match using special symbols like \1 and \2. The symbol \1 corresponds to the first matched group, \2 corresponds to the second, and so forth:
# Group references make sub a handy tool for data cleaning
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
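There is much more to regular expressions in Python, most of which is outside the book's scope. Table 7-4 provides a brief summary.
Method | Description |
---|---|
findall | Return all non-overlapping matches in a string as a list |
finditer | Like findall, but returns an iterator |
match | Match the pattern at the start of the string; return a match object if it matches, otherwise None |
search | Scan the string for the pattern at any position; return a match object for the first match, otherwise None |
split | Break the string into pieces at each occurrence of the pattern |
sub, subn | Replace occurrences of the pattern with a replacement expression and return the new string (subn also returns the number of replacements); use \1, \2, ... to refer to match groups in the replacement |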
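finditer and subn appear in the table but not in the examples above. A minimal sketch, reusing the grouped email pattern and sample text from this section:

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'([a-z0-9+_.%-]+)@([a-z0-9+-._]+)\.([a-z0-9]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# finditer yields one match object per match, lazily.
spans = [(m.group(1), m.start(), m.end()) for m in regex.finditer(text)]

# subn behaves like sub but also reports how many replacements were made.
redacted, n = regex.subn('REDACTED', text)

print(spans[0])
print(n)
```

finditer is useful when the positions of the matches matter, since findall returns only the matched text or groups.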
Vectorized String Processing in pandas
Cleaning up a messy dataset for analysis often requires a lot of string munging and regularization. To complicate matters, a column containing strings will sometimes have missing data:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool
You can apply string and regular expression methods (passing a lambda or other function) to each value using data.map, but it will fail on the NA values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series's str attribute; for example, we could check whether each email address has 'gmail' in it with str.contains:
data.str.contains("gmail") # like 'in'
Dave False
Steve True
Rob True
Wes NaN
dtype: object
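The failure on missing values is easy to reproduce. The sketch below rebuilds the same Series (forcing dtype=object to match the output shown in these notes, which is an assumption about the pandas version) and compares a plain map, a map with na_action='ignore', and the str accessor.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan}, dtype=object)

# A plain lambda breaks on the missing value: NaN is a float,
# and the `in` operator is not defined for a float on the right side.
try:
    data.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True

# Series.map can be told to leave missing values untouched.
guarded = data.map(lambda x: 'gmail' in x, na_action='ignore')

# The str accessor skips missing values without extra arguments.
vectorized = data.str.contains('gmail')

print(map_failed)
print(guarded)
print(vectorized)
```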
Regular expressions can be used, too, along with any re option like IGNORECASE:
pattern
'([a-z0-9+_.%-]+)@([a-z0-9+-._]+)\\.([a-z0-9]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)  # applied to each element
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
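There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute. Note that str.match returns a Boolean Series indicating whether each value matches the pattern from its start, not the matched groups: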
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave True
Steve True
Rob True
Wes NaN
dtype: object
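To access elements in embedded lists, we can pass an index to either of these functions. Because matches holds Booleans rather than lists here, both calls return only NaN in the pandas version used for these notes: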
matches.str.get(1)
Dave NaN
Steve NaN
Rob NaN
Wes NaN
dtype: float64
matches.str[0]
Dave NaN
Steve NaN
Rob NaN
Wes NaN
dtype: float64
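To retrieve the captured groups themselves, one option is to start from str.findall, which returns a list of group tuples per element, and index into that. This is a sketch of one possible approach rather than the method used in the notes above; str.extract is shown as an alternative that returns the groups as DataFrame columns. The Series is built with dtype=object to match the output shown earlier.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan}, dtype=object)
pattern = r'([a-z0-9+_.%-]+)@([a-z0-9+-._]+)\.([a-z0-9]{2,4})'

# findall gives a list of tuples per element; keep the first tuple.
first_match = data.str.findall(pattern, flags=re.IGNORECASE).str[0]

# Each element is now a tuple, so str.get and str[...] index into it.
domains = first_match.str.get(1)
users = first_match.str[0]

# str.extract returns one column per capture group.
parts = data.str.extract(pattern, flags=re.IGNORECASE)

print(domains)
print(users)
print(parts)
```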
You can similarly slice strings using this syntax:
data.str[:5]
Dave dave@
Steve steve
Rob rob@g
Wes NaN
dtype: object
See Table 7-5 for more pandas string methods.
- cat
- contains
- count
- extract Extract captured groups using a regular expression
- endswith
- startswith
- findall
- get Index into each element
- isalnum Check whether all characters are alphanumeric
- isalpha
- isdecimal
- isdigit
- islower
- isupper
- isnumeric
- join
- len
- lower/ upper
- match
- pad Add whitespace to left, right or both sides of strings
- repeat
- replace
- slice
- split
- strip
- rstrip
- lstrip
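A few of the listed methods in action, as a brief sketch on the same email Series (built with dtype=object to match the output shown earlier):

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan}, dtype=object)

lengths = data.str.len()                    # characters per element
upper = data.str.upper()                    # uppercase each element
is_gmail = data.str.endswith('gmail.com')   # suffix test per element

pieces = data.str.split('@')                # list of pieces per element
users = pieces.str.get(0)                   # first piece of each list

# Pad on the right with '.' up to a width of 8 characters.
padded = users.str.pad(8, side='right', fillchar='.')

print(lengths)
print(users)
print(padded)
```

In each case the missing value for 'Wes' is passed through as missing rather than raising an error.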
Summary
Effective data preparation can significantly improve productivity by enabling you to spend more time analyzing data and less time getting it ready for analysis.
We have explored a number of tools in this chapter, but the coverage here is by no means comprehensive. In the next chapter, we will explore pandas's joining and grouping functionality.