Pandas 练习 75 题
编辑日期: 2024-11-28 文章阅读: 次
https://www.machinelearningplus.com/python/101-pandas-exercises-python/
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
1. How to import pandas and check the version?
import pandas as pd
print(pd.__version__)
print(pd.show_versions(as_json=True))
1.0.1
{'system': {'commit': None, 'python': '3.7.6.final.0', 'python-bits': 64, 'OS': 'Linux', 'OS-release': '5.4.0-29-generic', 'machine': 'x86_64', 'processor': 'x86_64', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'zh_CN.UTF-8', 'LOCALE': 'zh_CN.UTF-8'}, 'dependencies': {'pandas': '1.0.1', 'numpy': '1.18.1', 'pytz': '2019.3', 'dateutil': '2.8.1', 'pip': '20.0.2', 'setuptools': '45.2.0.post20200210', 'Cython': '0.29.15', 'pytest': '5.3.5', 'hypothesis': '5.5.4', 'sphinx': '2.4.0', 'blosc': None, 'feather': None, 'xlsxwriter': '1.2.7', 'lxml.etree': '4.6.2', 'html5lib': '1.0.1', 'pymysql': None, 'psycopg2': None, 'jinja2': '2.11.1', 'IPython': '7.12.0', 'pandas_datareader': None, 'bs4': '4.8.2', 'bottleneck': '1.3.2', 'fastparquet': None, 'gcsfs': None, 'matplotlib': '3.1.3', 'numexpr': '2.7.1', 'odfpy': None, 'openpyxl': '3.0.3', 'pandas_gbq': None, 'pyarrow': None, 'pytables': None, 'pyxlsb': None, 's3fs': None, 'scipy': '1.4.1', 'sqlalchemy': '1.3.13', 'tables': '3.6.1', 'tabulate': None, 'xarray': None, 'xlrd': '1.2.0', 'xlwt': '1.3.0', 'numba': '0.48.0'}}
None
2. How to create a series from a list, numpy array and dict?
Create a pandas series from each of the items below: a list, numpy and a dictionary
import numpy as np
a_list = list("abcdefg")
numpy_array = np.arange(1, 10)
dictionary = {"A": 0, "B":1, "C":2, "D":3, "E":5}
series1 = pd.Series(a_list)
print(series1)
series2 = pd.Series(numpy_array)
print(series2)
series3 = pd.Series(dictionary)
print(series3)
3. How to convert the index of a series into a column of a dataframe?
Convert the series ser into a dataframe with its index as another column on the dataframe.
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)
print(ser[:5])
ser_df = pd.DataFrame(ser)
ser_df.reset_index()
ser_df = ser.to_frame().reset_index()
ser_df
4. How to combine many series to form a dataframe?
Combine ser1 and ser2 to form a dataframe.
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))
ser_df = pd.DataFrame(ser1, ser2).reset_index()
ser_df.head()
ser_df = pd.DataFrame({"col1":ser1, "col2":ser2})
ser_df.head(5)
ser_df = pd.concat([ser1, ser2], axis = 1)
ser_df.head()
5. How to assign name to the series’ index?
Give a name to the series ser calling it ‘alphabets’.
ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser.rename("alphabets")
ser.name = "other_name"
ser
6. How to get the items of series A not present in series B?
Get all items of ser1 and ser2 not common to both.
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
ser1[~ser1.isin(ser2)]
7. How to get the items not common to both series A and series B?
Get all items of ser1 and ser2 not common to both.
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
a_not_b = ser1[~ser1.isin(ser2)]
b_not_a = ser2[~ser2.isin(ser1)]
a_not_b.append(b_not_a, ignore_index = True)
ser_u = pd.Series(np.union1d(ser1, ser2))
ser_i = pd.Series(np.intersect1d(ser1, ser2))
ser_u[~ser_u.isin(ser_i)]
8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?
Compute the minimum, 25th percentile, median, 75th, and maximum of ser.
state = np.random.RandomState(100)
ser = pd.Series(state.normal(10, 5, 25))
ser.describe()
np.percentile(ser, q = [0, 25, 50, 75, 100])
9. How to get frequency counts of unique items of a series?
Calculate the frequency counts of each unique value ser.
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
ser.value_counts()
10. How to keep only top 2 most frequent values as it is and replace everything else as ‘Other’?
From ser, keep the top 2 most frequent items as it is and replace everything else as ‘Other’.
np.random.RandomState(100)
ser = pd.Series(np.random.randint(1, 5, [12]))
ser
ser.value_counts()
ser[~ser.isin(ser.value_counts().index[:2])] = 'Other'
ser
11. How to bin a numeric series to 10 groups of equal size?
Bin the series ser into 10 equal deciles and replace the values with the bin name.
ser = pd.Series(np.random.random(20))
ser
pd.qcut(ser, q = 10)
pd.qcut(ser, q = [0, .10, .20, .30, .40, .50, .60, .70, .80, .90, 1], labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']).head()
12. How to convert a numpy array to a dataframe of given shape? (L1)####
Reshape the series ser into a dataframe with 7 rows and 5 columns
ser = pd.Series(np.random.randint(1, 10, 35))
ser
pd.DataFrame(np.array(ser).reshape(7, 5))
pd.DataFrame(ser.values.reshape(7, 5))
13. How to find the positions of numbers that are multiples of 3 from a series?
Find the positions of numbers that are multiples of 3 from ser.
np.random.RandomState(100)
ser = pd.Series(np.random.randint(1, 5, 10))
ser
ser.where(lambda x: x%3 == 0).dropna()
#pd.Series(np.argwhere(ser%3 == 0).reshape(4))
np.argwhere(ser%3 == 0)
14. How to extract items at given positions from a series####
From ser, extract the items at positions in list pos.
ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]
ser.loc[pos]
ser.take(pos)
15. How to stack two series vertically and horizontally ?
Stack ser1 and ser2 vertically and horizontally (to form a dataframe).
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))
ser1.append(ser2)
pd.concat([ser1, ser2], axis = 0)
pd.concat([ser1, ser2], axis = 1)
16. How to get the positions of items of series A in another series B?
Get the positions of items of ser2 in ser1 as a list.
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])
list(ser1[ser1.isin(ser2)].index)
[np.where(i == ser1)[0].tolist()[0] for i in ser2]
[pd.Index(ser1).get_loc(i) for i in ser2]
17. How to compute the mean squared error on a truth and predicted series?
Compute the mean squared error of truth and pred series.
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
(np.mean([(truth_i - pred_i)#### 2 for truth_i, pred_i in zip(truth, pred)]))
np.mean((truth-pred)#### 2)
from sklearn.metrics import mean_squared_error
mean_squared_error(truth, pred)
18. How to convert the first character of each element in a series to uppercase?
Change the first character of each word to upper case in each word of ser.
ser = pd.Series(['just', 'a', 'random', 'list'])
ser
[i.title() for i in ser]
ser.map(lambda x: x.title())
ser.map(lambda x: x[0].upper() + x[1:])
19. How to calculate the number of characters in each word in a series?
ser = pd.Series(['just', 'a', 'random', 'list'])
[len(i) for i in ser]
ser.map(len)
ser.apply(len)
20. How to compute difference of differences between consequtive numbers of a series?
Difference of differences between the consequtive numbers of ser.
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])
ser.diff(periods = 1).tolist()
ser.diff(periods = 1).diff(periods = 1).tolist()
21. How to convert a series of date-strings to a timeseries?
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
'''
Desired Output
0 2010-01-01 00:00:00
1 2011-02-02 00:00:00
2 2012-03-03 00:00:00
3 2013-04-04 00:00:00
4 2014-05-05 00:00:00
5 2015-06-06 12:20:00
'''
pd.to_datetime(ser)
from dateutil.parser import parse
ser.map(lambda x: parse(x))
22. How to get the day of month, week number, day of year and day of week from a series of date strings?
Get the day of month, week number, day of year and day of week from ser.
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
'''
Desired output
Date: [1, 2, 3, 4, 5, 6]
Week number: [53, 5, 9, 14, 19, 23]
Day num of year: [1, 33, 63, 94, 125, 157]
Day of week: ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']
'''
pd.to_datetime(ser).dt.day.to_list()
pd.to_datetime(ser).dt.week.to_list()
pd.to_datetime(ser).dt.weekofyear.to_list()
pd.to_datetime(ser).dt.dayofyear.to_list()
week_dict = {0:"Monday", 1:"Tuesday", 2:"Wednesday", 3:"Thursday", 4:"Friday", 5:"Saturday", 6:"Sunday"}
pd.to_datetime(ser).dt.dayofweek.map(week_dict).to_list()
pd.to_datetime(ser).dt.weekday_name.to_list()
23. How to convert year-month string to dates corresponding to the 4th day of the month?
Change ser to dates that start with 4th of the respective months.
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])
'''
Desired Output
0 2010-01-04
1 2011-02-04
2 2012-03-04
dtype: datetime64[ns]
'''
from dateutil.parser import parse
ser.map(lambda x: parse('04 ' + x))
from dateutil.parser import parse
ser_ts = ser.map(lambda x: parse(x))
ser_datestr = ser_ts.dt.year.astype('str') + '-' + ser_ts.dt.month.astype('str') + '-' + '04'
[parse(i).strftime('%Y-%m-%d') for i in ser_datestr]
24. How to filter words that contain atleast 2 vowels from a series?
From ser, extract words that contain atleast 2 vowels.
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])
'''
Desired Output
0 Apple
1 Orange
4 Money
dtype: object
'''
vowels = list("aeiou")
list_ = []
for w in ser:
c = 0
for l in list(w.lower()):
if l in vowels:
c += 1
if c >= 2:
print(w)
list_.append(w)
ser[ser.isin(list_)]
from collections import Counter
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
ser[mask]
25. How to filter valid emails from a series?
Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference.
emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'
'''
Desired Output
1 rameses@egypt.com
2 matt@t.co
3 narendra@modi.com
dtype: object
'''
import re
re_ = re.compile(pattern)
emails[emails.str.contains(pat = re_, regex = True)]
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'
mask = emails.map(lambda x: bool(re.match(pattern, x)))
emails[mask]
emails.str.findall(pattern, flags=re.IGNORECASE)
[x[0] for x in [re.findall(pattern, email) for email in emails] if len(x) > 0]
26. How to get the mean of a series grouped by another series?
Compute the mean of weights of each fruit.
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
fruit
weights = pd.Series(np.linspace(1, 10, 10))
weights
#print(weights.tolist())
#print(fruit.tolist())
'''
Desired output
apple 6.0
banana 4.0
carrot 5.8
dtype: float64
'''
df = pd.concat([fruit, weights], axis = 1)
df
df.groupby(0).mean()
weights.groupby(fruit).mean()
27. How to compute the euclidean distance between two series?
Compute the euclidean distance between series (points) p and q, without using a packaged formula.
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
'''
Desired Output
18.165
'''
suma = np.sqrt(np.sum([(p - q)#### 2 for p, q in zip(p, q)]))
suma
sum((p - q)#### 2)#### .5
np.linalg.norm(p-q)
28. How to find all the local maxima (or peaks) in a numeric series?
ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
'''
Desired output
array([1, 5, 7])
'''
local_max = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
local_max.index
dd = np.diff(np.sign(np.diff(ser)))
dd
peak_locs = np.where(dd == -2)[0] + 1
peak_locs
29. How to replace missing spaces in a string with the least frequent character?
Replace the spaces in my_str with the least frequent character.
Go back to the table of contents
my_str = 'dbc deb abed ggade'
'''
Desired Output
'dbccdebcabedcggade' # least frequent is 'c'
'''
from collections import Counter
my_str_ = my_str
Counter_ = Counter(list(my_str_.replace(" ", "")))
Counter_
minimum = min(Counter_, key = Counter_.get)
print(my_str.replace(" ", minimum))
ser = pd.Series(list(my_str.replace(" ", "")))
ser.value_counts()
minimum = list(ser.value_counts().index)[-1]
minimum
print(my_str.replace(" ", minimum))
30. How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (saturdays) after that having random numbers as values?
'''
Desired Output
values can be random
2000-01-01 4
2000-01-08 1
2000-01-15 8
2000-01-22 4
2000-01-29 4
2000-02-05 2
2000-02-12 4
2000-02-19 9
2000-02-26 6
2000-03-04 6
'''
dti = pd.Series(pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
random_num = pd.Series([np.random.randint(1, 10) for i in range(10)])
df = pd.concat({"Time":dti, "Numbers":random_num}, axis = 1)
df
ser = pd.Series(np.random.randint(1,10,10), pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
ser
31. How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?
ser has missing dates and values. Make all missing dates appear and fill up with value from previous date.
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
'''
Desired Output
2000-01-01 1.0
2000-01-02 1.0
2000-01-03 10.0
2000-01-04 10.0
2000-01-05 10.0
2000-01-06 3.0
2000-01-07 3.0
2000-01-08 NaN
'''
indx = pd.date_range("2000-01-01", "2000-01-08")
ser = ser.reindex(indx)
ser.fillna(method = "ffill")
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
ser.resample('D').ffill() # fill with previous value
ser.resample('D').bfill() # fill with next value
ser.resample('D').bfill().ffill() # fill next else prev value
32. How to compute the autocorrelations of a numeric series?
Compute autocorrelations for the first 10 lags of ser. Find out which lag has the largest correlation.
ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))
'''
Desired Output
[0.29999999999999999, -0.11, -0.17000000000000001, 0.46000000000000002, 0.28000000000000003, -0.040000000000000001, -0.37, 0.41999999999999998, 0.47999999999999998, 0.17999999999999999]
Lag having highest correlation: 9
'''
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]
print(autocorrelations[1:])
print('Lag having highest correlation: ', np.argmax(np.abs(autocorrelations[1:]))+1)
33. How to import only every nth row from a csv file to create a dataframe?
Import every 50th row of BostonHousing dataset as a dataframe.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
data = f.read()
nth_rows = []
for i, rows in enumerate(data.split("\n")):
if i%50 == 0:
nth_rows.append(rows)
nth_rows[0]
data_ = [nth_rows[i].split() for i in range(len(nth_rows))]
df = pd.DataFrame(data_, columns=names)
df
34. How to change column values when importing csv to a dataframe?
Import the boston housing dataset, but while importing change the 'medv' (median house value) column so that values < 25 becomes ‘Low’ and > 25 becomes ‘High’.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
data = f.read()
nth_rows = []
for i, rows in enumerate(data.split("\n")):
nth_rows.append(rows)
data_ = [nth_rows[i].split() for i in range(len(nth_rows))]
df = pd.DataFrame(data_, columns=names)
df.head()
df.to_csv("housing_preprocessed.csv")
del df
df = pd.read_csv("housing_preprocessed.csv", index_col = 0, skipfooter=1, converters = {"MEDV": lambda x: "HIGH" if float(x) >= 25 else "LOW"})
df
35. How to create a dataframe with rows as strides from a given series?
L = pd.Series(range(15))
'''
Desired Output
array([[ 0, 1, 2, 3],
[ 2, 3, 4, 5],
[ 4, 5, 6, 7],
[ 6, 7, 8, 9],
[ 8, 9, 10, 11],
[10, 11, 12, 13]])
'''
index_ = np.arange(0, 15, 2)
index_
my_list = []
for i in range(6):
my_list.append(list(L[index_[i]:index_[i+2]]))
np.array(my_list)
np.array([L[index_[i]:index_[i+2]] for i in range(6)])
def gen_strides(a, stride_len=5, window_len=5):
n_strides = ((a.size-window_len)//stride_len) + 1
return np.array([a[s:(s+window_len)] for s in np.arange(0, a.size, stride_len)[:n_strides]])
gen_strides(L, stride_len=2, window_len=4)
36. How to import only specified columns from a csv file?
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
data = f.read()
nth_rows = []
for i, rows in enumerate(data.split("\n")):
nth_rows.append(rows)
data_ = [nth_rows[i].split() for i in range(len(nth_rows))]
df = pd.DataFrame(data_, columns=names)
df.to_csv("housing_preprocessed.csv")
del df
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
file = "housing_preprocessed.csv"
df = pd.read_csv(file, usecols = [1, 2, 4], skipfooter=1)
df.head()
df = pd.read_csv(file, usecols = ["CRIM", "ZN", "CHAS"])
df.head()
37. How to get the nrows, ncolumns, datatype, summary stats of each column of a dataframe? Also get the array and list equivalent.####
df = pd.read_csv("housing_preprocessed.csv", index_col=0 ,skipfooter=1)
df.shape
df.dtypes
df.info()
df.get_dtype_counts()
df.dtypes.value_counts()
df.describe()
38. How to extract the row and column number of a particular cell with given criterion?
df = pd.read_csv("housing_preprocessed.csv", skipfooter=1, index_col=0)
max_tax = df["TAX"].max()
max_tax
df[df["TAX"] == max_tax]
df.loc[df["TAX"] == np.max(df["TAX"]), ["CRIM", "ZN", "TAX"]]
row, col = np.where(df.values == np.max(df["TAX"]))
for i, j in zip(row, col):
print(i , j)
df.iat[row[0], col[0]]
df.iloc[row[0], col[0]]
df.at[row[0], 'TAX']
df.get_value(row[0], 'TAX')
39. How to rename a specific columns in a dataframe?
cars93 = pd.read_csv("../input/cars93/Cars93.csv", index_col=0)
cars93.head()
'''
Desired Output
Index(['Manufacturer', 'Model', 'CarType', 'Min_Price', 'Price', 'Max_Price',
'MPG_city', 'MPG_highway', 'AirBags', 'DriveTrain', 'Cylinders',
'EngineSize', 'Horsepower', 'RPM', 'Rev_per_mile', 'Man_trans_avail',
'Fuel_tank_capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',
'Turn_circle', 'Rear_seat_room', 'Luggage_room', 'Weight', 'Origin',
'Make'],
dtype='object')
'''
cars93 = pd.read_csv("../input/cars93/Cars93.csv", index_col=0)
cars93.rename(columns={"Type":"CarType"}, inplace = True)
cols = cars93.columns
df.columns.values[2] = "CarType"
cols = list(map(lambda x: x.replace(".", "_"), cols))
cars93.columns = cols
cars93.head()
cars93 = pd.read_csv("../input/cars93/Cars93.csv", index_col=0)
cols = cars93.columns
cols = list(map(lambda x: x.replace(".", "_"), cols))
cols[cols.index("Type")] = "CarType"
cars93.columns = cols
cars93.head()
40. How to check if a dataframe has any missing values?
df = pd.read_csv("../input/cars93/Cars93.csv")
df
print("Our df has a total of {} null values".format(df.isnull().sum().sum()))
print()
df.isnull().values.any()
print()
def report_nulls(df):
'''
Show a fast report of the DF.
'''
rows = df.shape[0]
columns = df.shape[1]
null_cols = 0
list_of_nulls_cols = []
for col in list(df.columns):
null_values_rows = df[col].isnull().sum()
null_rows_pcn = round(((null_values_rows)/rows)*100, 2)
col_type = df[col].dtype
if null_values_rows > 0:
print("The column {} has {} null values. It is {}% of total rows.".format(col, null_values_rows, null_rows_pcn))
print("The column {} is of type {}.\n".format(col, col_type))
null_cols += 1
list_of_nulls_cols.append(col)
null_cols_pcn = round((null_cols/columns)*100, 2)
print("The DataFrame has {} columns with null values. It is {}% of total columns.".format(null_cols, null_cols_pcn))
return list_of_nulls_cols
report_nulls(df)
41. How to count the number of missing values in each column?
Count the number of missing values in each column of df. Which column has the maximum number of missing values?
df = pd.read_csv("../input/cars93/Cars93.csv")
df_null = pd.DataFrame(df.isnull().sum())
df_null[df_null[0] > 0][0].argmax()
df_null[df_null[0] > 0][0].idxmax()
n_missings_each_col = df.apply(lambda x: x.isnull().sum())
n_missings_each_col.argmax()
n_missings_each_col.idxmax()
42. How to replace missing values of multiple numeric columns with the mean?
Replace missing values in Luggage.room columns with their respective mean.
df = pd.read_csv("../input/cars93/Cars93.csv")
beg_null = df.isnull().sum().sum()
print(beg_null)
df[["Luggage.room"]] = df[["Luggage.room"]].apply(lambda x: x.fillna(x.mean()))
end_null = df.isnull().sum().sum()
print(end_null)
print("We have got rid of {} null values, filling them with the mean.".format(beg_null - end_null))
43. How to use apply function on existing columns with global variables as additional arguments?
In df, use apply method to replace the missing values in Rear.seat.room with mean Luggage.room with median by passing an argument to the function.
df = pd.read_csv("../input/cars93/Cars93.csv")
df = pd.read_csv("../input/cars93/Cars93.csv")
print("We have a total of {} nulls".format(df.isnull().sum().sum()))
d = {'Rear.seat.room': np.nanmean, 'Luggage.room': np.nanmedian}
df[['Rear.seat.room', 'Luggage.room']] = df[['Rear.seat.room', 'Luggage.room']].apply(lambda x, d: x.fillna(d[x.name](x)), args=(d, ))
print("We have a total of {} nulls".format(df.isnull().sum().sum()))
df["Rear.seat.room"].sum()
df["Luggage.room"].sum()
df = pd.read_csv("../input/cars93/Cars93.csv")
print("We have a total of {} nulls".format(df.isnull().sum().sum()))
def num_inputer(x, strategy):
if strategy.lower() == "mean":
x = x.fillna(value = np.nanmean(x))
if strategy.lower() == "median":
x = x.fillna(value = np.nanmedian(x))
return x
df['Rear.seat.room'] = df[['Rear.seat.room']].apply(num_inputer, args = ["mean"])
df['Luggage.room'] = df[['Luggage.room']].apply(num_inputer, args = ["median"])
print("We have a total of {} nulls".format(df.isnull().sum().sum()))
df["Rear.seat.room"].sum()
df["Luggage.room"].sum()
44. How to select a specific column from a dataframe as a dataframe instead of a series?
Get the first column (a) in df as a dataframe (rather than as a Series).
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
type(df["a"].to_frame())
type(pd.DataFrame(df["a"]))
type(df[['a']])
type(df.loc[:, ['a']])
type(df.iloc[:, [0]])
type(df.a)
type(df['a'])
type(df.loc[:, 'a'])
type(df.iloc[:, 1])
45. How to change the order of columns of a dataframe?
Actually 3 questions.
-
In df, interchange columns 'a' and 'c'.
-
Create a generic function to interchange two columns, without hardcoding column names.
-
Sort the columns in reverse alphabetical order, that is colume 'e' first through column 'a' last.
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df[["c", "b", "a", "d", "e"]]
df
df = df[["c", "b", "a", "d", "e"]]
df
def change_cols(df, col1, col2):
df_columns = df.columns.to_list()
index1 = df_columns.index(col1)
index2 = df_columns.index(col2)
df_columns[index1], df_columns[index2] = col1, col2
return df[df_columns]
df = change_cols(df, "b", "e")
df
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
col_list = list(df.columns)
col_list_reversed = col_list[::-1]
col_list
col_list_reversed
df = df[col_list_reversed]
df
print("Solution from the website")
print("-------------------------")
df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))
df[list('cbade')]
def switch_columns(df, col1=None, col2=None):
colnames = df.columns.tolist()
i1, i2 = colnames.index(col1), colnames.index(col2)
colnames[i2], colnames[i1] = colnames[i1], colnames[i2]
return df[colnames]
df1 = switch_columns(df, 'a', 'c')
df[sorted(df.columns)]
df.sort_index(axis=1, ascending=False, inplace=True)
46. How to set the number of rows and columns displayed in the output?
Change the pandas display settings on printing the dataframe df it shows a maximum of 10 rows and 10 columns.
df = pd.read_csv("../input/cars93/Cars93.csv")
pd.set_option("display.max_columns",10)
pd.set_option("display.max_rows",10)
df
47. How to format or suppress scientific notations in a pandas dataframe?
Suppress scientific notations like ‘e-03’ in df and print upto 4 numbers after decimal.
df = pd.DataFrame(np.random.random(5)#### 10, columns=['random'])
'''
Desired Output
#> random
#> 0 0.0035
#> 1 0.0000
#> 2 0.0747
#> 3 0.0000
'''
print("Initial DF")
df
print("Using solution 1")
df.round(4)
df
pd.reset_option('^display.', silent=True)
print("Using solution 2")
df.apply(lambda x: '%.4f' %x, axis=1).to_frame()
df
pd.reset_option('^display.', silent=True)
print("Using solution 3")
pd.set_option('display.float_format', lambda x: '%.4f'%x)
df
pd.reset_option('^display.', silent=True)
df
48. How to format all the values in a dataframe as percentages?
Format the values in column 'random' of df as percentages.
df = pd.DataFrame(np.random.random(4), columns=['random'])
df
out = df.style.format({
'random': '{0:.2%}'.format,
})
out
pd.options.display.float_format = '{:,.2f}%'.format
df*100
pd.reset_option('^display.', silent=True)
49. How to filter every nth row in a dataframe?
From df, filter the 'Manufacturer', 'Model' and 'Type' for every 20th row starting from 1st (row 0).
df = pd.read_csv("../input/cars93/Cars93.csv")
df
df = pd.read_csv("../input/cars93/Cars93.csv", usecols=["Manufacturer", "Model", "Type"])
df[::20]
df = pd.read_csv("../input/cars93/Cars93.csv", usecols=["Manufacturer", "Model", "Type"])
df.iloc[::20, :][['Manufacturer', 'Model', 'Type']]
50. How to create a primary key index by combining relevant columns?
In df, Replace NaNs with ‘missing’ in columns 'Manufacturer', 'Model' and 'Type' and create a index as a combination of these three columns and check if the index is a primary key.
df = pd.read_csv("../input/cars93/Cars93.csv")
df
df = pd.read_csv("../input/cars93/Cars93.csv", usecols=["Manufacturer", "Model", "Type", "Min.Price", "Max.Price"])
df.isnull().sum().sum()
df.fillna("missing")
df["new_index"] = df["Manufacturer"] + df["Model"] + df["Type"]
df.set_index("new_index", inplace = True)
df
51. How to get the row number of the nth largest value in a column?
Find the row position of the 5th largest value of column 'a' in df.
df = pd.DataFrame(np.random.randint(1, 30, 30).reshape(10,-1), columns=list('abc'))
df
arg_sort = df["a"].argsort()
#arg_sort.to_frame()
#arg_sort[0]
#df
df = df.iloc[arg_sort]
df["arg_sort"] = arg_sort
df
n_largest = 5
print("The {} largest values in our DF is at row/index {} and the value is {}".format(n_largest, (df[df["arg_sort"] == (n_largest-1)].index[0]), df[df["arg_sort"] == (n_largest-1)]["a"].iloc[0]))
n = 5
df['a'].argsort()[::-1][n]
52. How to find the position of the nth largest value greater than a given value?
In ser, find the position of the 2nd largest value greater than the mean.
ser = pd.Series(np.random.randint(1, 100, 15))
#ser
sorted_ser = ser[ser.argsort()[::-1]]
#sorted_ser
sorted_ser[sorted_ser > sorted_ser.mean()].index[1]
print('ser: ', ser.tolist(), 'mean: ', round(ser.mean()))
np.argwhere(ser > ser.mean())[1]
ser[ser > ser.mean()].index[1]
53. How to get the last n rows of a dataframe with row sum > 100?
Get the last two rows of df whose row sum is greater than 100.
df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
df1 = df.copy(deep = True)
df["sum"] = df.sum(axis = 1)
df
print("The index of the rows that are greater than 100 are {}".format((df[df["sum"] > 100].index).to_list()[-2:]))
rowsums = df1.apply(np.sum, axis=1)
last_two_rows = df1.iloc[np.where(rowsums > 100)[0][-2:], :]
last_two_rows
54. How to find and cap outliers from a series or dataframe column?
Replace all values of ser in the lower 5%ile and greater than 95%ile with respective 5th and 95th %ile value.
ser = pd.Series(np.logspace(-2, 2, 30))
ser1 = ser.copy(deep = True)
ser2 = ser.copy(deep = True)
quantiles = np.quantile(ser, [0.05, 0.95])
ser
ser.iloc[np.where(ser < quantiles[0])] = quantiles[0]
ser.iloc[np.where(ser > quantiles[1])] = quantiles[1]
ser1[ser1 < quantiles[0]] = quantiles[0]
ser1[ser1 > quantiles[1]] = quantiles[1]
ser1
def cap_outliers(ser, low_perc, high_perc):
low, high = ser.quantile([low_perc, high_perc])
print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)
ser[ser < low] = low
ser[ser > high] = high
return(ser)
capped_ser = cap_outliers(ser2, .05, .95)
ser2
capped_ser
55. How to reshape a dataframe to the largest possible square after removing the negative values?
Reshape df to the largest possible square with negative values removed. Drop the smallest values if need be. The order of the positive numbers in the result should remain the same as the original.
df = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1))
my_array = np.array(df.values.reshape(-1, 1)) # convert to numpy
my_array = my_array[my_array > 0] # filter only positive values
lar_square = int(np.floor(my_array.shape[0]#### 0.5)) # find the largest square
arg_sort = np.argsort(my_array)[::-1][0:lar_square#### 2] # eliminate the smallest values that will prevent from converting to a square
my_array = np.take(my_array, sorted(arg_sort)).reshape(lar_square, lar_square) # filter the array and reshape back
my_array
arr = df[df > 0].values.flatten()
arr_qualified = arr[~np.isnan(arr)]
n = int(np.floor(arr_qualified.shape[0]#### .5))
top_indexes = np.argsort(arr_qualified)[::-1]
output = np.take(arr_qualified, sorted(top_indexes[:n#### 2])).reshape(n, -1)
print(output)
56. How to swap two rows of a dataframe?
Swap rows 1 and 2 in df.
df = pd.DataFrame(np.arange(25).reshape(5, -1))
df
print("Original DataFrame")
df
temp_col = df[1].copy(deep = True)
df[1], df[2] = df[2], temp_col
print("Swapped Columns DataFrame")
df
print("Original DataFrame")
df
temp_row = df.iloc[1].copy(deep = True)
df.iloc[1], df.iloc[2] = df.iloc[2], temp_row
print("Swapped Rows DataFrame")
df
def swap_rows(df, i1, i2):
a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
df.iloc[i1, :], df.iloc[i2, :] = b, a
return df
print(swap_rows(df, 1, 2))
57. How to reverse the rows of a dataframe?
Reverse all the rows of dataframe df.
df = pd.DataFrame(np.arange(25).reshape(5, -1))
df
df.iloc[df.index.to_list()[::-1]]
df.iloc[::-1, :]
print(df.loc[df.index[::-1], :])
58. How to create one-hot encodings of a categorical variable (dummy variables)?
Get one-hot encodings for column 'a' in the dataframe df and append it as columns.
df = pd.DataFrame(np.arange(25).reshape(5,-1), columns=list('abcde'))
'''
Desired Output
0 5 10 15 20 b c d e
0 1 0 0 0 0 1 2 3 4
1 0 1 0 0 0 6 7 8 9
2 0 0 1 0 0 11 12 13 14
3 0 0 0 1 0 16 17 18 19
4 0 0 0 0 1 21 22 23 24
'''
dummies = pd.get_dummies(df["a"])
df = pd.concat([dummies, df], axis = 1)
df
df_onehot = pd.concat([pd.get_dummies(df['a']), df[list('bcde')]], axis=1)
df_onehot
59. Which column contains the highest number of row-wise maximum values?
Obtain the column name with the highest number of row-wise maximum’s in df.
df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1))
def get_col(df):
columns = list(df.columns)
df["col_index_with_max"] = ""
for i in range(len(df)):
row_values = list(df.iloc[i, :-1].values)
max_value = np.max(row_values)
col_index = row_values.index(max_value)
df["col_index_with_max"].iloc[i] = col_index
get_col(df)
df
print("The col with maximum amont of maximun per row if {} with a total of {} maximus".format(df.groupby("col_index_with_max").size()[::-1].index[0], \
df.groupby("col_index_with_max").size()[::-1].values[0]))
print('Column with highest row maxes: ', df.apply(np.argmax, axis=1).value_counts().index[0])
60. How to create a new column that contains the row number of nearest column by euclidean distance?
Create a new column such that, each row contains the row number of nearest row-record by euclidean distance.
df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1), columns=list('pqrs'), index=list('abcdefghij'))
'''
Desired Output
df
'''
#######################################################################################################################################
df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1), columns=list('pqrs'), index=list('abcdefghij'))
corr_list = []
index_list = []
max_corr = 0
current_index = ""
for i in range(len(df)):
for j in range(len(df)):
if i == j:
pass
else:
curr_corr = sum((df.iloc[i] - df.iloc[j])#### 2)#### .5
#curr_corr = df.iloc[i].corr(df.iloc[j])
if curr_corr >= max_corr:
max_corr = curr_corr
current_index = list(df.index)[j]
corr_list.append(max_corr)
index_list.append(current_index)
max_corr = 0
current_index = ""
df["nearest_row"] = index_list
df["dist"] = corr_list
df
df.drop(["nearest_row", "dist"], axis = 1, inplace = True)
#######################################################################################################################################
nearest_rows = []
nearest_distance = []
for i, row in df.iterrows():
curr = row
rest = df.drop(i)
e_dists = {} # init dict to store euclidean dists for current row.
for j, contestant in rest.iterrows():
e_dists.update({j: round(np.linalg.norm(curr.values - contestant.values))})
nearest_rows.append(max(e_dists, key=e_dists.get))
nearest_distance.append(max(e_dists.values()))
df['nearest_row'] = nearest_rows
df['dist'] = nearest_distance
df
61. How to know the maximum possible correlation value of each column against other columns?
For each column get the maximum possible correlation with other columns (only 1 value)
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1), columns=list('pqrstuvwxy'), index=list('abcdefgh'))
df_corr = np.abs(df.corr())
max_corr = df_corr.apply(lambda x: sorted(x)[-2], axis = 0)
max_corr
62. How to create a column containing the minimum by maximum of each row?
Compute the minimum-by-maximum for every row of df.
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df1 = df.copy(deep = True)
df2 = df.copy(deep = True)
df["min_by_max"] = (df.apply(min, axis = 1)/df.apply(max, axis = 1))
df
min_by_max = df1.apply(lambda x: np.min(x)/np.max(x), axis=1)
min_by_max
min_by_max = np.min(df2, axis=1)/np.max(df2, axis=1)
min_by_max
63. How to create a column that contains the penultimate value in each row?
Create a new column 'penultimate' which has the second largest value of each row of df.
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df["penultimate"] = df.apply(lambda x: np.partition(x, -2)[-2], axis = 1)
df
df.drop("penultimate", inplace = True, axis = 1)
df["penultimate"] = df.apply(lambda x: sorted(list(x))[-2], axis = 1)
df
df.drop("penultimate", inplace = True, axis = 1)
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
df
64. How to normalize all columns in a dataframe?
-
Normalize all columns of df by subtracting the column mean and divide by standard deviation.
-
Range all columns of df such that the minimum value in each column is 0 and max is 1.
Don’t use external packages like sklearn####
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df1 = df.copy(deep = True)
df = df.apply(lambda x: ((x-np.mean(x))/np.std(x)), axis = 0)
df
df1 = df1.apply(lambda x: ((x.max() - x)/(x.max() - x.min())).round(2))
df1
65. How to compute the correlation of each row with the suceeding row?
Compute the correlation of each row of df with its succeeding row.
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df["corr"] = 0
for i in range(len(df)-1):
values1 = df.iloc[i, :-1].astype('float64')
values2 = df.iloc[i+1, :-1].astype('float64')
corr = values1.corr(values2)
df["corr"].iloc[i] = corr
df
df.drop("corr", inplace = True, axis = 1)
[df.iloc[i].corr(df.iloc[i+1]).round(2) for i in range(df.shape[0])[:-1]]
66. How to replace both the diagonals of dataframe with 0?
Replace both values in both diagonals of df with 0.
df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))
df1 = df.copy(deep = True)
'''
Desired Output (might change because of randomness)
'''
df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))
df1 = df.copy(deep = True)
print("Original DF")
df
for i in range(len(df)):
for j in range(len(df)):
if i == j:
df.iloc[i ,j] = 0
df[::-1].iloc[i, j] = 0
print("DF from the solution 1")
df
for i in range(df1.shape[0]):
df1.iat[i, i] = 0
df1.iat[df1.shape[0]-i-1, i] = 0
print("DF from the solution 2")
df1
67. How to get the particular group of a groupby dataframe by key?
This is a question related to understanding of grouped dataframe. From df_grouped, get the group belonging to 'apple' as a dataframe.
df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,
'col2': np.random.rand(9),
'col3': np.random.randint(0, 15, 9)})
df_grouped = df.groupby(['col1'])
pd.DataFrame(df_grouped)
df_grouped.groups["apple"]
df_grouped.get_group("apple")
for i, dff in df_grouped:
if i == 'apple':
print(dff)
68. How to get the n’th largest value of a column when grouped by another column?
In df, find the second largest value of 'rating' for 'banana'
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
'rating': np.random.rand(9),
'price': np.random.randint(0, 15, 9)})
grouped_by = df["rating"].groupby(df["fruit"])
grouped_by.get_group("banana")
list(grouped_by.get_group("banana"))[1]
df_grpd = df['rating'].groupby(df.fruit)
df_grpd.get_group('banana')
df_grpd.get_group('banana').sort_values().iloc[-2]
69. How to compute grouped mean on pandas dataframe and keep the grouped column as another column (not index)?
In df, Compute the mean price of every fruit, while keeping the fruit as another column instead of an index.
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
'rating': np.random.rand(9),
'price': np.random.randint(0, 15, 9)})
df
df_grouped = pd.pivot_table(df[["fruit", "price"]], index = ["fruit"], aggfunc = np.mean ).reset_index()
df_grouped
out = df.groupby('fruit', as_index=False)['price'].mean()
out
70. How to join two dataframes by 2 columns so they have only the common rows?
Join dataframes df1 and df2 by ‘fruit-pazham’ and ‘weight-kilo’.
df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
'weight': ['high', 'medium', 'low'] * 3,
'price': np.random.randint(0, 15, 9)})
df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
'kilo': ['high', 'low'] * 3,
'price': np.random.randint(0, 15, 6)})
df1
df2
merge_df = pd.merge(df1, df2, left_on=["fruit", "weight"], right_on=["pazham", "kilo"])
merge_df
pd.merge(df1, df2, how='inner', left_on=['fruit', 'weight'], right_on=['pazham', 'kilo'], suffixes=['_left', '_right'])
71. How to remove rows from a dataframe that are present in another dataframe?
From df1, remove the rows that are present in df2. All three columns must be the same.
df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
'weight': ['high', 'medium', 'low'] * 3,
'price': np.random.randint(0, 10, 9)})
df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
'kilo': ['high', 'low'] * 3,
'price': np.random.randint(0, 10, 6)})
df1
df2
#df1.merge(df2, how = "inner", left_on = ["fruit", "weight", "price"], right_on = ["pazham", "kilo", "price"])
df1["concat"] = df1["fruit"].astype(str) + df1["weight"].astype(str) + df1["price"].astype(str)
#df1
df2["concat"] = df2["pazham"].astype(str) + df2["kilo"].astype(str) + df2["price"].astype(str)
#df2
df1 = df1[~df1["concat"].isin(df2["concat"])]
df1.drop("concat", inplace = True, axis = 1)
df1
#df1[~df1.isin(df2).all(1)]
72. How to get the positions where values of two columns match?
Find the index where col fruit1 and fruit2 match
df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 10),
'fruit2': np.random.choice(['apple', 'orange', 'banana'], 10)})
df
np.where(df.fruit1 == df.fruit2)
73. How to create lags and leads of a column in a dataframe?
Create two new columns in df, one of which is a lag1 (shift column a down by 1 row) of column ‘a’ and the other is a lead1 (shift column b up by 1 row).
df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))
df
'''
Desired Output
a b c d a_lag1 b_lead1
0 66 34 76 47 NaN 86.0
1 20 86 10 81 66.0 73.0
2 75 73 51 28 20.0 1.0
3 1 1 9 83 75.0 47.0
4 30 47 67 4 1.0 NaN
'''
df["lag1"] = df["a"].shift(1)
df["lead1"] = df["b"].shift(-1)
df
74. How to get the frequency of unique values in the entire dataframe?
Get the frequency of unique values in the entire dataframe df.
df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))
pd.value_counts(df.values.ravel())
75. How to split a text column into two separate columns?
Split the string column in df to form a dataframe with 3 columns as shown.
df = pd.DataFrame(["STD, City State",
"33, Kolkata West Bengal",
"44, Chennai Tamil Nadu",
"40, Hyderabad Telengana",
"80, Bangalore Karnataka"], columns=['row'])
df
'''
Desired Output
0 STD City State
1 33 Kolkata West Bengal
2 44 Chennai Tamil Nadu
3 40 Hyderabad Telengana
4 80 Bangalore Karnataka
'''
df["re"] = df["row"].apply(lambda x: " ".join(x.split()).split(None, 2, ))
new_header = df["re"][0]
values = df["re"][1:]
d = {new_header[0]:[int(values.iloc[i][0].replace(",", "")) for i in range(len(values))], \
new_header[1]:[values.iloc[i][1].replace(",", "") for i in range(len(values))], \
new_header[2]:[values.iloc[i][2].replace(",", "") for i in range(len(values))]}
new_df = pd.DataFrame(d)
new_df