
10 minutes to pandas

Object creation

import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
dates = pd.date_range("20130101", periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

df2.dtypes
Out[11]:
A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
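As a rough illustration of the constructor call above, the sketch below rebuilds df2 and checks that the scalar entries ("A", "B", "F") are broadcast to the length of the array-like entries. The resolution reported for the datetime column and the dtype reported for the string column can differ between pandas versions, so the comments only mention the dtypes that are expected to be stable.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,                                  # scalar, broadcast to every row
        "B": pd.Timestamp("20130102"),             # scalar timestamp, broadcast
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",                                # scalar string, broadcast
    }
)

print(df2.shape)   # four rows (from "C", "D", "E") and six columns
print(df2.dtypes)  # "A" float64, "C" float32, "D" int32, "E" category
```

The row count comes from the array-like entries, which must all agree in length; the scalars simply repeat to match.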

Viewing data

df.head()

df.tail(3)

df.index

df.columns

df.to_numpy()

df.describe()

df.T

df.sort_index(axis=1, ascending=False)

df.sort_values(by="B")
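The viewing calls above are easier to follow on a tiny frame with known values. The following is a minimal sketch using a hypothetical three-row DataFrame (not the random df from this page); it shows that the sorting methods and df.T return new objects and leave the original untouched.

```python
import pandas as pd

# Hypothetical small frame, chosen so the sort results are easy to verify by eye.
df = pd.DataFrame({"A": [3, 1, 2], "B": [9, 7, 8]}, index=["x", "y", "z"])

by_columns = df.sort_index(axis=1, ascending=False)  # columns reordered to B, A
by_values = df.sort_values(by="B")                   # rows reordered to y, z, x
transposed = df.T                                    # shape becomes (2, 3)
values = df.to_numpy()                               # plain NumPy array, no labels

print(by_columns)
print(by_values)
print(transposed)
print(values.shape)
```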

Selection

Getitem ([])

For a DataFrame, passing a single label selects a column and yields a Series equivalent to df.A:

df["A"]

For a DataFrame, passing a slice : selects matching rows

df[0:3]

df["20130102":"20130104"]
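The three forms of [] shown above behave differently, which the sketch below tries to make concrete. It assumes a deterministic stand-in for df (values from np.arange rather than random numbers) so the results can be checked by hand.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic stand-in for the random frame used on this page.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

col = df["A"]                          # single label: one column as a Series
by_pos = df[0:3]                       # integer slice: rows by position, end excluded
by_label = df["20130102":"20130104"]   # label slice: rows by label, both ends included

print(type(col).__name__)
print(by_pos.index[-1])    # last row is 2013-01-03 (position 2)
print(by_label.index[-1])  # last row is 2013-01-04 (endpoint kept)
```

Both slices return three rows here, but for different reasons: the positional slice stops before position 3, while the label slice keeps its right endpoint.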

Selection by label

Selecting a row matching a label

df.loc[dates[0]]

Selecting all rows (:) with the selected column labels

df.loc[:, ["A", "B"]]

For label slicing, both endpoints are included

df.loc["20130102":"20130104", ["A", "B"]]
Out[29]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771

Selecting a single row and column label returns a scalar

df.loc[dates[0], "A"]

For getting fast access to a scalar (equivalent to the prior method)

df.at[dates[0], "A"]
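To tie the label-based selections together, here is a small sketch on a deterministic stand-in for df (np.arange values, an assumption made so the scalar is predictable). It checks what type each call returns and that .loc and .at agree on the scalar.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic stand-in for the random frame used on this page.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.loc[dates[0]]                               # one row label: a Series
subset = df.loc["20130102":"20130104", ["A", "B"]]   # label slice keeps both endpoints
value_loc = df.loc[dates[0], "A"]                    # row label + column label: scalar
value_at = df.at[dates[0], "A"]                      # same scalar via the faster accessor

print(subset.shape)          # three rows, two columns
print(value_loc, value_at)   # both are the top-left value of the frame
```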

Selection by position

Select via the position of the passed integers

df.iloc[3]

Integer slices act similarly to NumPy/Python:

df.iloc[3:5, 0:2]

Lists of integer position locations:

df.iloc[[1, 2, 4], [0, 2]]

For slicing rows explicitly:

df.iloc[1:3, :]

For slicing columns explicitly:

df.iloc[:, 1:3]

For getting a value explicitly:

df.iloc[1, 1]

For getting fast access to a scalar (equivalent to the prior method):

df.iat[1, 1]
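The positional calls above can be checked on a frame with known contents. This sketch again assumes a deterministic stand-in for df; with np.arange(24) reshaped to 6 by 4, the value at row 1, column 1 is 5.0.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic stand-in for the random frame used on this page.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

block = df.iloc[3:5, 0:2]              # rows 3-4, columns 0-1; slice ends are excluded
picked = df.iloc[[1, 2, 4], [0, 2]]    # explicit row and column positions
scalar_iloc = df.iloc[1, 1]            # single cell by position
scalar_iat = df.iat[1, 1]              # same cell via the faster scalar accessor

print(block)
print(picked)
print(scalar_iloc, scalar_iat)
```

Unlike label slicing with .loc, positional slices follow the usual half-open Python convention.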

Boolean indexing

Select rows where df.A is greater than 0

df[df["A"] > 0]

Using isin() method for filtering:

df2[df2["E"].isin(["train"])]
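Both filters above work by building a boolean Series and passing it to []. The sketch below makes the intermediate mask explicit; the four-row frame and its values are hypothetical, chosen only so the selected rows are obvious.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "A": [-1.0, 0.5, 2.0, -0.3],
        "E": ["one", "two", "three", "four"],
    }
)

positive = df[df["A"] > 0]             # keeps rows 1 and 2
mask = df["E"].isin(["two", "four"])   # boolean Series, True for rows 1 and 3
selected = df[mask]

print(mask.tolist())
print(positive)
print(selected)
```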

Setting

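Setting a new column automatically aligns the data by the indexes:

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1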

Setting by assigning with a NumPy array:

df.loc[:, "D"] = np.array([5] * len(df))
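The difference between assigning a Series and assigning a NumPy array is worth seeing side by side. In this sketch the Series index is assumed to start one day later than the frame's index, so index alignment leaves the first row empty and discards the Series value that has no matching row; the array, having no index, is assigned purely by position.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.zeros((6, 2)), index=dates, columns=list("AB"))

# The Series index is shifted by one day relative to df.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1                               # aligned on index labels

df.loc[:, "B"] = np.array([5] * len(df))   # no index: assigned by position

print(df)
```

The first entry of "F" is NaN because 2013-01-01 does not appear in s1, and the value labelled 2013-01-07 is dropped because df has no such row.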

Missing data

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

DataFrame.dropna() drops any rows that have missing data:

df1.dropna(how="any")

DataFrame.fillna() fills missing data:

df1.fillna(value=5)
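A short sketch of the three calls above, assuming a deterministic stand-in for df. Two cells of the new column "E" are filled in first so that dropna() and fillna() have something to distinguish; all three methods return new objects rather than modifying their input.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic stand-in for the random frame used on this page.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Reindex to the first four rows and add a new, initially empty column "E".
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1   # fill the first two rows only

dropped = df1.dropna(how="any")       # keeps only rows without any NaN
filled = df1.fillna(value=5)          # replaces the remaining NaN with 5

print(df1)
print(dropped)
print(filled)
```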

Operations

Stats

df.mean()
df.mean(axis=1)
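The axis argument decides which direction is collapsed. As a minimal sketch on a hypothetical 3 by 2 frame: the default produces one mean per column, while axis=1 produces one mean per row.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": [0.0, 2.0, 4.0], "B": [1.0, 3.0, 5.0]})

col_means = df.mean()         # indexed by column label: A -> 2.0, B -> 3.0
row_means = df.mean(axis=1)   # indexed by row label: 0.5, 2.5, 4.5

print(col_means)
print(row_means)
```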

Merge

Concat

Concatenating pandas objects together row-wise with concat()

pieces = [df[:2], df[2:4], df[4:]]

pd.concat(pieces)
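One way to convince yourself that row-wise concatenation simply stacks the pieces in order is to split a frame and put it back together. The sketch assumes a ten-row frame so that each of the three slices is non-empty.

```python
import numpy as np
import pandas as pd

# Ten rows, so the slices [:3], [3:7] and [7:] each contain data.
df = pd.DataFrame(np.arange(40.0).reshape(10, 4))

pieces = [df[:3], df[3:7], df[7:]]
rebuilt = pd.concat(pieces)   # stacks the pieces top to bottom, keeping their index labels

print([len(p) for p in pieces])
print(rebuilt.equals(df))
```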
Note

Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
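The recommendation can be sketched as follows. The record fields ("id", "square") are hypothetical; the point is only the contrast between collecting plain Python records first and growing a DataFrame one row at a time, where every step copies the rows accumulated so far.

```python
import pandas as pd

# Hypothetical records for illustration.
records = [{"id": i, "square": i * i} for i in range(5)]

# Recommended: build the list first, then construct the DataFrame once.
df_fast = pd.DataFrame(records)

# Discouraged: grow the DataFrame row by row; each concat copies all existing rows.
df_slow = pd.DataFrame([records[0]])
for rec in records[1:]:
    df_slow = pd.concat([df_slow, pd.DataFrame([rec])], ignore_index=True)

print(df_fast)
print(df_slow)
```

Both approaches end with the same rows, but the loop performs a full copy per iteration, so its cost grows roughly quadratically with the number of records.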