Intro to data structures
Series
s = pd.Series(data, index=index)
From ndarray
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
pd.Series(np.random.randn(5))
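The length rule above can be checked directly. The following is a minimal sketch with fixed values instead of random ones; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])

# No index passed: pandas creates the default integer index 0..len(data) - 1.
default_indexed = pd.Series(values)

# An explicit index needs exactly one label per element.
labelled = pd.Series(values, index=["a", "b", "c"])

# An index of the wrong length is rejected with a ValueError.
try:
    pd.Series(values, index=["a", "b"])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```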
From dict
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)
pd.Series(d, index=["b", "c", "d", "a"])
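When an index is passed together with a dict, the values are pulled out by label, and a label with no matching key is filled with NaN. A small sketch of that behavior, reusing the dict from the example above:

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}

# The result follows the order of the passed index, not the dict order.
reindexed = pd.Series(d, index=["b", "c", "d", "a"])

# "d" is not a key of the dict, so its entry is NaN, and the integer
# values are upcast to float64 to make room for that missing value.
missing_d = pd.isna(reindexed["d"])
```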
From scalar value
pd.Series(5.0, index=["a", "b", "c", "d", "e"])
Series is ndarray-like
Series acts very similarly to an ndarray and is a valid argument to most NumPy functions.
s.iloc[0]
s.iloc[:3]
s[s > s.median()]
s.iloc[[4, 3, 1]]
np.exp(s)
s.to_numpy()
s.dtype
This is often a NumPy dtype. However, pandas and third-party libraries extend NumPy’s type system in a few places, in which case the dtype would be an ExtensionDtype. Some examples within pandas are Categorical data and Nullable integer data type.
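To make the distinction concrete, the sketch below builds one Series with a plain NumPy dtype and two with extension dtypes, then checks which kind each one has. It assumes `ExtensionDtype` is imported from `pandas.api.extensions`.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype

plain = pd.Series([1.0, 2.0, 3.0])                          # NumPy float64
categorical = pd.Series(["low", "high", "low"], dtype="category")
nullable_int = pd.Series([1, None, 3], dtype="Int64")       # note the capital "I"

plain_is_numpy = isinstance(plain.dtype, np.dtype)
categorical_is_extension = isinstance(categorical.dtype, ExtensionDtype)
nullable_is_extension = isinstance(nullable_int.dtype, ExtensionDtype)

# The nullable integer dtype keeps integers while still holding a missing value.
missing_count = int(nullable_int.isna().sum())
```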
s.array
Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one or more concrete arrays like a numpy.ndarray.
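The difference between the two accessors can be seen by inspecting the types they return. This is a minimal sketch; the concrete wrapper class behind `.array` varies between pandas versions, so only the `ExtensionArray` base class is checked here.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

wrapped = s.array       # an ExtensionArray wrapping the underlying data
raw = s.to_numpy()      # a real numpy.ndarray, without the index

wrapped_is_extension = isinstance(wrapped, ExtensionArray)
raw_is_ndarray = isinstance(raw, np.ndarray)
```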
Series is dict-like
A Series is also like a fixed-size dict in that you can get and set values by index label:
s["a"]
s["e"] = 12.0
Using the Series.get() method, a missing label will return None or the specified default:
s.get("f")
s.get("f", np.nan)
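For contrast, the sketch below compares `get()` with plain bracket access on a missing label. The Series and its labels are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0], index=["a", "b"])

missing = s.get("f")            # no default given, so None comes back
fallback = s.get("f", np.nan)   # the supplied default comes back instead

# Bracket access on the same missing label raises KeyError.
try:
    s["f"]
    key_error_raised = False
except KeyError:
    key_error_raised = True
```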
Vectorized operations and label alignment with Series
s + s
s * 2
np.exp(s)
A key difference between Series and ndarray is that operations between Series automatically align the data based on label, so the operands do not need to share the same labels:
s.iloc[1:] + s.iloc[:-1]
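With fixed values the effect of alignment is easier to follow. In this sketch the two operands share only the labels "b" and "c"; the result covers the union of the labels, and labels found in just one operand come out as NaN.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])

# Left operand has labels b, c, d; right operand has labels a, b, c.
total = s.iloc[1:] + s.iloc[:-1]

# "a" and "d" each appear in only one operand, so they are NaN in the result.
# dropna() discards them if only the overlapping labels are wanted.
overlap_only = total.dropna()
```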
Name attribute
s = pd.Series(np.random.randn(5), name="something")
s2 = s.rename("different")
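A short check that `rename()` leaves the original Series alone and hands back a separate object, sketched with fixed values:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="something")
s2 = s.rename("different")

# s keeps its name; s2 is a distinct Series object carrying the new one.
names = (s.name, s2.name)
distinct_objects = s2 is not s
```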
DataFrame
From dict of Series or dicts
The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.
d = {
"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
pd.DataFrame(d, index=["d", "b", "a"])
pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
df.index
df.columns
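The union behavior described above shows up as NaN wherever a Series lacks a label. The sketch below repeats the construction and inspects the gaps, including a requested column ("three") that does not exist in the dict.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# The row index is the union a, b, c, d; "one" has no value for "d".
df = pd.DataFrame(d)
gap = pd.isna(df.loc["d", "one"])

# Passing index and columns selects and orders the data; a column name
# that is not a key of the dict is created and filled with NaN.
subset = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
```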
From dict of ndarrays / lists
All ndarrays must share the same length.
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
pd.DataFrame(d)
pd.DataFrame(d, index=["a", "b", "c", "d"])
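Unlike the dict-of-Series case, lists and ndarrays carry no labels to align on, so unequal lengths are treated as an error rather than padded. A minimal sketch of that failure mode, with made-up data:

```python
import pandas as pd

# Equal lengths: construction succeeds and the default index is 0..n-1.
ok = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0, 2.0]})

# Unequal lengths: pandas raises ValueError instead of filling with NaN.
try:
    pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0]})
    length_error = False
except ValueError:
    length_error = True
```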
From structured or record array
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])
data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]
pd.DataFrame(data)
pd.DataFrame(data, index=["first", "second"])
pd.DataFrame(data, columns=["C", "A", "B"])
From a list of dicts
data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pd.DataFrame(data2)
pd.DataFrame(data2, index=["first", "second"])
pd.DataFrame(data2, columns=["a", "b"])
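With a list of dicts, each dict becomes one row and the columns are the union of the keys. The sketch below checks what happens to the key "c", which is present only in the second dict.

```python
import pandas as pd

data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

frame = pd.DataFrame(data2)

# The first row has no "c", so that cell is NaN; the second row keeps its value.
first_c_missing = pd.isna(frame.loc[0, "c"])

# Restricting columns simply drops the keys that are not listed.
narrow = pd.DataFrame(data2, columns=["a", "b"])
```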
From a Series
ser = pd.Series(range(3), index=list("abc"), name="ser")
pd.DataFrame(ser)
DataFrame.from_dict
In [68]: pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))
Out[68]:
   A  B
0  1  4
1  2  5
2  3  6
In [69]: pd.DataFrame.from_dict(
   ....:     dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]),
   ....:     orient="index",
   ....:     columns=["one", "two", "three"],
   ....: )
Out[69]:
   one  two  three
A    1    2      3
B    4    5      6
DataFrame.from_records
In [70]: data
Out[70]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
In [71]: pd.DataFrame.from_records(data, index="C")
Out[71]:
          A    B
C
b'Hello'  1  2.0
b'World'  2  3.0
Column selection, addition, deletion
Getting, setting, and deleting columns work with the same syntax as the analogous dict operations:
df["one"]
df["three"] = df["one"] * df["two"]
df["flag"] = df["one"] > 2
del df["two"]
three = df.pop("three")
df.insert(1, "bar", df["one"])
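The same sequence of operations can be traced on a small, fixed frame to see how the column list changes at each step. The data here is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

df["three"] = df["one"] * df["two"]   # new columns are appended at the end
df["flag"] = df["one"] > 2            # boolean column from a comparison
del df["two"]                         # remove a column, like a dict key
three = df.pop("three")               # remove a column and return it as a Series
df.insert(1, "bar", df["one"])        # place a new column at position 1

final_columns = list(df.columns)
```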
Assigning new columns in method chains
iris.assign(sepal_ratio=iris["SepalWidth"] / iris["SepalLength"])
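The line above presumes an existing `iris` DataFrame. As a self-contained sketch, the block below uses a tiny hypothetical stand-in with the same two column names (the values are invented) and shows that `assign()` returns a new frame and leaves the original untouched. It also shows the callable form, which is convenient in a method chain where the intermediate frame has no name.

```python
import pandas as pd

# Hypothetical stand-in for the iris data; only the column names matter here.
iris = pd.DataFrame({"SepalLength": [5.0, 4.0], "SepalWidth": [3.5, 3.0]})

# assign() returns a copy with the extra column.
with_ratio = iris.assign(sepal_ratio=iris["SepalWidth"] / iris["SepalLength"])

# Callable form: the function receives the frame being assigned to.
with_ratio_callable = iris.assign(
    sepal_ratio=lambda x: x["SepalWidth"] / x["SepalLength"]
)

original_unchanged = "sepal_ratio" not in iris.columns
```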
Indexing / selection
df[col]
df.loc[label]
df.iloc[loc]
df[5:10]
df[bool_vec]
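Each of the forms listed above can be tried on a small frame to see what it selects and what type comes back. This is a sketch with invented data; `col`, `label`, `loc`, and `bool_vec` are replaced by concrete values.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]},
    index=["a", "b", "c", "d"],
)

column = df["one"]              # select a column       -> Series
row_by_label = df.loc["b"]      # select a row by label -> Series
row_by_position = df.iloc[2]    # select a row by integer position -> Series
row_slice = df[1:3]             # slice rows by position -> DataFrame
masked = df[df["one"] > 2]      # select rows with a boolean vector -> DataFrame
```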
DataFrame interoperability with NumPy functions
Series implements __array_ufunc__, which allows it to work with NumPy’s universal functions.
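A brief sketch of what this means in practice, assuming a reasonably recent pandas release: a unary ufunc returns a Series with the index preserved, and a binary ufunc applied to two Series aligns them by label before computing. The data is invented.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser2 = pd.Series([1, 3, 5], index=["b", "a", "c"])   # same labels, different order

# Unary ufunc: the result is still a Series with the original labels.
exp_result = np.exp(ser1)

# Binary ufunc: operands are matched by label, not by position,
# so "a" is computed as 1 % 3 and "b" as 2 % 1.
remainder = np.remainder(ser1, ser2)
```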