8 Python Snippets for Data Cleaning: Copy-and-Paste Ready, the Longest Only 11 Lines | Resources

Original author: Kin Lim Lee

Compiled and edited by Qian Ming (乾明)

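A while ago, big data engineer Kin Lim Lee published an article on Medium introducing 8 Python snippets for data cleaning.

Data cleaning is a necessary step before analyzing data or using it to train models, and it is also where data scientists and programmers spend the most effort.

These snippets have two advantages. First, they are written as functions, so they can be used directly without modifying parameters. Second, they are simple: even with comments, the longest is only 11 lines.

For each snippet, Lee explains what it is for and also annotates the code with comments.

You can bookmark this article and use it as a toolbox.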

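Data cleaning code covering 8 scenarios

The snippets cover 8 scenarios in total:

dropping multiple columns, changing data types, converting categorical variables to numerical variables, checking for missing data, removing strings in a column, removing whitespace in a column, concatenating two columns with strings (with a condition), and converting timestamps (from string to datetime format).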

Dropping multiple columns

Not every column is useful in an analysis. df.drop conveniently removes the columns you specify.

def drop_multiple_col(col_names_list, df):
    '''
    AIM    -> Drop multiple columns based on their column names
    INPUT  -> List of column names, df
    OUTPUT -> updated df with dropped columns
    ------
    '''
    df.drop(col_names_list, axis=1, inplace=True)
    return df
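A minimal usage sketch for drop_multiple_col, with the function restated so the block runs on its own. The sample DataFrame and its column names ("tmp", "debug") are made up for illustration and are not from the original article.

```python
import pandas as pd

def drop_multiple_col(col_names_list, df):
    """Drop the named columns from df in place and return df."""
    df.drop(col_names_list, axis=1, inplace=True)
    return df

# Hypothetical sample data; column names are assumptions for this example.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "tmp": [0, 0], "debug": [9, 9]})

df = drop_multiple_col(["tmp", "debug"], df)
remaining = list(df.columns)
print(remaining)
```

Because the drop happens in place, the caller's DataFrame is modified as well; pass df.copy() if the original should be preserved.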

Changing data types

When a dataset grows large, converting columns to smaller data types saves memory.

def change_dtypes(col_int, col_float, df):
    '''
    AIM    -> Changing dtypes to save memory
    INPUT  -> List of column names (int, float), df
    OUTPUT -> updated df with smaller memory
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')
    return df
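To see the memory effect, the sketch below measures a DataFrame before and after the conversion using DataFrame.memory_usage. The sample data and column names ("count", "price") are assumptions for illustration; the function is restated so the block is self-contained.

```python
import pandas as pd

def change_dtypes(col_int, col_float, df):
    """Downcast the listed integer and float columns to 32-bit types."""
    df[col_int] = df[col_int].astype("int32")
    df[col_float] = df[col_float].astype("float32")
    return df

# Hypothetical sample data: pandas defaults to 64-bit ints and floats here.
df = pd.DataFrame({"count": range(1000), "price": [0.5] * 1000})

before = df.memory_usage(deep=True).sum()
df = change_dtypes(["count"], ["price"], df)
after = df.memory_usage(deep=True).sum()

print(before, after)
```

Downcasting is only safe when the values fit the smaller type: int32 holds roughly plus or minus 2.1 billion, and float32 keeps about 7 significant decimal digits.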

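Converting categorical variables to numerical variables

Some machine learning models require variables in numerical format, which means converting categorical variables to numerical ones first. You can also keep the categorical variables for data visualization.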

def convert_cat2num(df):
    # Convert categorical variable to numerical variable
    num_encode = {'col_1': {'YES': 1, 'NO': 0},
                  'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}}
    df.replace(num_encode, inplace=True)
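If you want to keep the categorical column for visualization, as mentioned above, one option is to write the encoded values to a new column instead of replacing in place. The sketch below uses Series.map rather than the article's df.replace; the new column name "col_1_num" is an assumption made for this example.

```python
import pandas as pd

# Hypothetical sample data using the article's placeholder column name.
df = pd.DataFrame({"col_1": ["YES", "NO", "YES"]})

# Map each category to a number and store the result in a separate column,
# leaving the original categorical column untouched.
df["col_1_num"] = df["col_1"].map({"YES": 1, "NO": 0})

print(df)
```

Note that map turns any category missing from the dictionary into NaN, which makes unexpected values easy to spot with a missing-data check.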

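Checking for missing data

If you want to count the missing values in each column, the code below is a quick way to do it. It shows which columns have the most missing data, so you can decide how to proceed with cleaning and analysis.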

def check_missing_data(df):
    # check for any missing data in the df (display in descending order)
    return df.isnull().sum().sort_values(ascending=False)
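A small usage sketch for check_missing_data, restated here so the block runs on its own. The sample DataFrame is invented for illustration.

```python
import pandas as pd

def check_missing_data(df):
    """Return the count of missing values per column, largest first."""
    return df.isnull().sum().sort_values(ascending=False)

# Hypothetical sample data: column "a" has two gaps, "c" has one, "b" has none.
df = pd.DataFrame({
    "a": [1, None, None],
    "b": [1, 2, 3],
    "c": [None, 1.0, 2.0],
})

missing = check_missing_data(df)
print(missing)
```

Dividing the result by len(df) gives the fraction of missing values per column, which is often easier to compare across datasets of different sizes.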

Removing strings in a column

Sometimes newline characters or other odd symbols show up in a string column. df['col_1'].replace handles them easily.

def remove_col_str(df):
    # remove a portion of string in a dataframe column: col_1
    # (assigning the result back avoids relying on inplace changes to a column selection)
    df['col_1'] = df['col_1'].replace('\n', '', regex=True)
    # remove all the characters after &# (including &#) for column: col_1
    df['col_1'] = df['col_1'].replace(' &#.*', '', regex=True)
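The sketch below applies both replacements to two invented strings, so the effect of each pattern is visible. It assigns the result back to the column rather than using inplace=True on a column selection, which is a choice made for this example; the sample values are assumptions.

```python
import pandas as pd

# Hypothetical sample data using the article's placeholder column name.
df = pd.DataFrame({"col_1": ["line one\nline two", "price &#36;5 off"]})

# Remove newline characters anywhere in the string.
df["col_1"] = df["col_1"].replace("\n", "", regex=True)

# Remove " &#" and everything after it.
df["col_1"] = df["col_1"].replace(" &#.*", "", regex=True)

print(df["col_1"].tolist())
```

With regex=True the first argument is treated as a regular expression, so characters such as "." or "(" in a literal search string need to be escaped.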

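Removing whitespace in a column

When data is messy, anything can happen, and strings often start with stray whitespace. The code below is useful for removing whitespace at the beginning of the strings in a column.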

def remove_col_white_space(df, col):
    # remove white space at the beginning of string
    # (col is passed in as a parameter so the function does not depend on a global name)
    df[col] = df[col].str.lstrip()
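A short sketch of what str.lstrip does and does not remove. The column name "city" and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data with leading spaces, a leading tab, and a trailing space.
df = pd.DataFrame({"city": ["  Paris", "\tRome ", "Oslo"]})

# lstrip removes whitespace only at the start of each string.
df["city"] = df["city"].str.lstrip()

print(df["city"].tolist())
```

The trailing space in "Rome " survives because lstrip only touches the left side; str.rstrip() or str.strip() would be the calls to use for trailing or two-sided whitespace.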

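Concatenating two columns with strings (with a condition)

This code helps when you want to concatenate two string columns conditionally. For example, you can mark rows with certain letters at the end of the first column and concatenate only those rows with the second column.

If needed, the trailing letters can be removed after the concatenation.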

def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df[mask]['col_1'] + df[mask]['col_2']
    col_new.replace('pil', ' ', regex=True, inplace=True)  # replace the 'pil' with empty space
    return col_new
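The sketch below walks through the conditional concatenation on three invented rows, including a missing value, to show which rows the mask selects. The sample data is an assumption, and the marker is replaced with a single space here; pass an empty string instead to drop it entirely.

```python
import pandas as pd

# Hypothetical sample data using the article's placeholder column names.
df = pd.DataFrame({
    "col_1": ["abcpil", "xyz", None],
    "col_2": ["123", "456", "789"],
})

# na=False makes missing values count as "does not end with pil".
mask = df["col_1"].str.endswith("pil", na=False)

# Concatenate only the rows selected by the mask.
col_new = df.loc[mask, "col_1"] + df.loc[mask, "col_2"]

# Replace the marker with a space after concatenation.
col_new = col_new.replace("pil", " ", regex=True)

print(col_new.tolist())
```

The result keeps the index of the matching rows, so it can be written back with df.loc[mask, "some_column"] = col_new if the combined values belong in the DataFrame.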

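Converting timestamps (from string to datetime format)

When working with time series data, you will likely run into timestamp columns stored as strings.

That means converting the string format to datetime format (or another format your needs dictate) so the data can be analyzed meaningfully.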

import pandas as pd

def convert_str_datetime(df):
    '''
    AIM    -> Convert datetime(String) to datetime(format we want)
    INPUT  -> df
    OUTPUT -> updated df with new datetime format
    ------
    '''
    df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))
    return df
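A usage sketch of the same conversion on invented data. The "transdate" column name comes from the article's code; the other columns and all values are assumptions. It also shows where loc=2 places the new column.

```python
import pandas as pd

# Hypothetical sample data; "transdate" holds timestamps as strings.
df = pd.DataFrame({
    "id": [1, 2],
    "amount": [9.5, 3.0],
    "transdate": ["2019-01-31 08:15:00.000", "2019-02-01 17:45:30.250"],
})

# Parse the strings and insert the result as the third column (position 2).
df.insert(
    loc=2,
    column="timestamp",
    value=pd.to_datetime(df["transdate"], format="%Y-%m-%d %H:%M:%S.%f"),
)

print(df.dtypes)
```

df.insert requires loc to be no larger than the current number of columns and raises an error if a column named "timestamp" already exists, so the call is not safe to run twice on the same DataFrame.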

Finally, here is the link to the original article:

https://towardsdatascience.com/the-simple-yet-practical-data-cleaning-codes-ad27c4ce0a38

- End -
