
Introduction¶
Data is everywhere. From the smallest photon interactions to galaxy collisions, from mouse movements on a screen to the economic development of countries, we are surrounded by a sea of information. The human mind cannot comprehend this data in all its complexity; since ancient times people have found it much easier to reduce the dimensionality, to impose a strict order, to arrange the data points neatly on a rectangular grid: to make a data table.
But once the data has been collected into a table, it has been tamed. It may still need some grooming and exercise, but it is no longer scary. Even if it is really Big Data, with the right tools you can approach it, play with it, bend it to your will, master it.
The Python datatable module is the right tool for the task. It is a library that implements a wide (and growing) range of operators for manipulating two-dimensional data frames. It focuses on big data support, high performance, both in-memory and out-of-memory datasets, and multi-threaded algorithms. In addition, datatable strives to achieve a good user experience, helpful error messages, and a powerful API similar to that of R's data.table.
Getting started¶
Install datatable¶
Let’s begin by installing the latest stable version of datatable from PyPI:
$ pip install datatable
If this didn’t work for you, or if you want to install the bleeding edge version of the library, please check the Installation page.
Assuming the installation was successful, you can now import the library in a JupyterLab notebook or in a Python console:
import datatable as dt
print(dt.__version__)
0.7.0
Loading data¶
The fundamental unit of analysis in datatable is a data Frame. It is the same notion as a pandas DataFrame or an SQL table: data arranged in a two-dimensional array with rows and columns.
You can create a Frame object from a variety of data sources: from a Python list or dictionary, from a numpy array, or from a pandas DataFrame.
import math

DT1 = dt.Frame(A=range(5), B=[1.7, 3.4, 0, None, -math.inf],
               stypes={"A": dt.int64})
DT2 = dt.Frame(pandas_dataframe)
DT3 = dt.Frame(numpy_array)
You can also load a CSV/text/Excel file, or open a previously saved binary .jay file:
DT4 = dt.fread("~/Downloads/dataset_01.csv")
DT5 = dt.open("data.jay")
The fread() function shown above is both powerful and extremely fast. It can automatically detect parse parameters for the majority of text files, load data from .zip archives or URLs, read Excel files, and much more.
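For instance, here is a small sketch of some of these capabilities; the file names and URL below are placeholders, not real datasets:
import datatable as dt

# parse parameters (separator, header, column types) are detected automatically
DT_local = dt.fread("~/Downloads/dataset_01.csv")

# fread can also read directly from a URL or from an archive (placeholder paths)
DT_url = dt.fread("https://example.com/data/dataset_02.csv")
DT_zip = dt.fread("archive_with_one_csv.zip")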
Data manipulation¶
Once the data is loaded into a Frame, you may want to do certain operations with it: extract/remove/modify subsets of the data, perform calculations, reshape, group, join with other datasets, etc. In datatable, the primary vehicle for all these operations is the square-bracket notation, inspired by traditional matrix indexing but super-charged with power. (This notation was pioneered in R data.table and is the main axis of intersection between the two libraries.)
In short, almost all operations with a Frame can be expressed as
DT[i, j, ...]
where i is the row selector, j is the column selector, and ... indicates that additional modifiers might be added. If this looks familiar, that’s because it is: exactly the same DT[i, j] notation is used in mathematics when indexing matrices, in C/C++, in R, in pandas, in numpy, etc.
The only difference that datatable introduces is that it allows i to be anything that can conceivably be interpreted as a row selector: an integer to select just one row, a slice, a range, a list of integers, a list of slices, an expression, a boolean-valued Frame, an integer-valued Frame, an integer numpy array, a generator, and so on.
The j column selector is even more versatile. In the simplest case, you can select a single column by its index or name. Also accepted are a list of columns, a slice, a string slice (of the form "A":"Z"), a list of booleans indicating which columns to pick, an expression, a list of expressions, and a dictionary of expressions (the keys will be used as new names for the columns being selected). The j expression can even be a python type (such as int or dt.float32), selecting all columns matching that type.
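To make this concrete, here is a small illustrative sketch; the frame and its column names are made up for the example:
import datatable as dt
from datatable import f

DT = dt.Frame(A=[1, 2, 3], B=[4.0, 5.5, 6.1], C=["x", "y", "z"])

DT[:, "A"]                    # a single column by name
DT[:, [0, 2]]                 # a list of column indices
DT[:, "A":"B"]                # a string slice of columns
DT[:, [f.A, f.B * 2]]         # a list of expressions
DT[:, {"total": f.A + f.B}]   # a dictionary: the key becomes the new column name
DT[:, int]                    # all integer columns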
In addition to the selector expression shown above, we support the update and delete statements too:
DT[i, j] = r
del DT[i, j]
The first expression replaces values in the subset [i, j] of Frame DT with the values from r, which can be either a constant, a suitably-sized Frame, or an expression that operates on frame DT.
The second expression deletes values in the subset [i, j]. This is interpreted as follows: if i selects all rows, then the columns given by j are removed from the Frame; if j selects all columns, then the rows given by i are removed; if neither i nor j span all rows/columns of the Frame, then the elements in the subset [i, j] are replaced with NAs.
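A brief sketch of both statements, using a throwaway frame invented for the example:
import datatable as dt
from datatable import f

DT = dt.Frame(A=[1, -2, 3, -4], B=[10, 20, 30, 40], C=["a", "b", "c", "d"])

DT[f.A < 0, "A"] = 0    # replace negative values of A with a constant
del DT[:, "C"]          # i selects all rows, so column C is removed
del DT[f.A == 0, :]     # j selects all columns, so the matching rows are removed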
What the f.?¶
You may have noticed that we mentioned several times the possibility of using expressions in i or j and in other places. In its simplest form an expression looks like
f.ColA
which indicates the column ColA in some Frame. Here f is a variable that has to be imported from the datatable module. This variable provides a convenient way to reference any column in a Frame. In addition to the notation above, the following is also supported:
f[3]
f["ColB"]
denoting the fourth column and the column ColB, respectively.
These f-expressions support arithmetic operations as well as various mathematical and aggregate functions. For example, in order to select the values from column A normalized to the range [0; 1], we can write the following:
from datatable import f, min, max
DT[:, (f.A - min(f.A))/(max(f.A) - min(f.A))]
This is equivalent to the following SQL query:
SELECT (f.A - MIN(f.A))/(MAX(f.A) - MIN(f.A)) FROM DT AS f
So, what exactly is f? We call it a “frame proxy”, as it becomes a simple way to refer to the Frame that we currently operate on. More precisely, whenever DT[i, j] is evaluated and we encounter an f-expression there, that f is replaced with the frame DT, and the columns are looked up on that Frame. The same expression can later be applied to a different Frame, and it will refer to the columns in that other Frame.
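For example, the same f-expression can be built once and evaluated against different frames; the frames below are invented for the illustration:
import datatable as dt
from datatable import f

expr = f.A * 2                 # an expression, not tied to any particular Frame

DT1 = dt.Frame(A=[1, 2, 3])
DT2 = dt.Frame(A=[10, 20, 30])

DT1[:, expr]                   # columns looked up on DT1 -> 2, 4, 6
DT2[:, expr]                   # the same expression applied to DT2 -> 20, 40, 60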
At some point you may notice that datatable also exports the symbol g. This g is also a frame proxy; however, it refers to the second frame in the evaluated expression. This second frame appears when you are joining two or more frames together (more on that later). When that happens, the symbol g is used to refer to the columns of the joined frame.
Groupbys / joins¶
In the Data Manipulation section we mentioned that the DT[i, j, ...] selector can take zero or more modifiers, which we denoted as .... The available modifiers are by(), join() and sort(). Thus, the full form of the square-bracket selector is
DT[i, j, ...]
where the ... slot holds any combination of these modifiers.
by(…)¶
This modifier splits the frame into groups by the provided column(s), and then applies i and j within each group. This mostly affects aggregator functions such as sum(), min() or sd(), but may also apply in other circumstances. For example, if i is a slice that takes the first 5 rows of a frame, then in the presence of the by() modifier it will take the first 5 rows of each group.
For example, in order to find the total amount of each product sold, write:
from datatable import f, by, sum
DT = dt.fread("transactions.csv")
DT[:, sum(f.quantity), by(f.product_id)]
sort(…)¶
This modifier controls the order of the rows in the result, much like the SQL clause ORDER BY. If used in conjunction with by(), it will order the rows within each group.
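For example, to order the hypothetical transactions frame loaded above by its quantity column (when combined with by(), the same modifier orders the rows within each group):
from datatable import f, sort

DT[:, :, sort(f.quantity)]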
join(…)¶
As the name suggests, this operator allows you to join another frame to the current one, equivalent to the SQL JOIN operator. Currently we support only left outer joins.
In order to join frame X, it must be keyed. A keyed frame is conceptually similar to an SQL table with a unique primary key. This key may be either a single column or several columns:
X.key = "id"
Once a frame is keyed, it can be joined to another frame DT, provided that DT has the column(s) with the same name(s) as the key in X:
DT[:, :, join(X)]
This has the semantics of a natural left outer join. The X frame can be considered a dictionary, where the key column contains the keys and all other columns are the corresponding values. Then, during the join, each row of DT is matched against the row of X with the same value of the key column, and if there is no such value in X, against an all-NA row.
The columns of the joined frame can be used in expressions with the g. prefix, for example:
DT[:, sum(f.quantity * g.price), join(products)]
Note
In the future, we will expand the syntax of the join operator to allow other kinds of joins and also to remove the limitation that only keyed frames can be joined.
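Putting the pieces together, here is a self-contained sketch with invented data and column names:
import datatable as dt
from datatable import f, g, join

products = dt.Frame(product_id=[1, 2, 3], price=[9.99, 5.49, 12.00])
products.key = "product_id"    # the frame to be joined must be keyed

sales = dt.Frame(product_id=[1, 1, 2, 3], quantity=[2, 1, 5, 3])

# natural left outer join: price is looked up from the keyed frame via g
revenue = sales[:, {"revenue": f.quantity * g.price}, join(products)]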
Offloading data¶
Just as our work started with loading some data into datatable, eventually you will want to do the opposite: store or move the data somewhere else. We support multiple mechanisms for this.
First, the data can be converted into a pandas DataFrame or into a numpy array. (Obviously, you have to have pandas or numpy libraries installed.):
DT.to_pandas()
DT.to_numpy()
A frame can also be converted into python native data structures: a dictionary, keyed by the column names; a list of columns, where each column is itself a list of values; or a list of rows, where each row is a tuple of values:
DT.to_dict()
DT.to_list()
DT.to_tuples()
You can also save a frame into a CSV file, or into a binary .jay file:
DT.to_csv("out.csv")
DT.save("data.jay")
Using datatable¶
This section describes common functionality and commands that you can run in datatable.
Create Frame¶
You can create a Frame from a variety of sources, including numpy arrays, pandas DataFrames, raw Python objects, etc.:
import datatable as dt
import numpy as np
np.random.seed(1)
dt.Frame(np.random.randn(1000000))
              C0
0          1.62435
1         −0.611756
2         −0.528172
3         −1.07297
4          0.865408
5         −2.30154
6          1.74481
7         −0.761207
8          0.319039
9         −0.24937
⋮             ⋮
999,995    0.0595784
999,996    0.140349
999,997   −0.596161
999,998    1.18604
999,999    0.313398
import pandas as pd
pf = pd.DataFrame({"A": range(1000)})
dt.Frame(pf)
         A
0        0
1        1
2        2
3        3
4        4
5        5
6        6
7        7
8        8
9        9
⋮        ⋮
995    995
996    996
997    997
998    998
999    999
dt.Frame({"n": [1, 3], "s": ["foo", "bar"]})
     n    s
0    1    foo
1    3    bar
Convert a Frame¶
Convert an existing Frame into a numpy array, a pandas DataFrame, or a pure Python object:
nparr = df1.to_numpy()
pddfr = df1.to_pandas()
pyobj = df1.to_list()
Parse Text (csv) Files¶
datatable provides fast and convenient parsing of text (csv) files:
df = dt.fread("train.csv")
The datatable parser:
- Automatically detects separators, headers, column types, quoting rules, etc.
- Reads from file, URL, shell, raw text, archives, glob
- Provides multi-threaded file reading for maximum speed
- Includes a progress indicator when reading large files
- Reads both RFC4180-compliant and non-compliant files
Write the Frame¶
Write the Frame’s content into a csv file (also multi-threaded):
df.to_csv("out.csv")
Save a Frame¶
Save a Frame into a binary format on disk, then open it later instantly, regardless of the data size:
df.save("out.jay")
df2 = dt.open("out.jay")
Basic Frame Properties¶
Basic Frame properties include:
print(df.shape) # (nrows, ncols)
print(df.names) # column names
print(df.stypes) # column types
Compute Per-Column Summary Stats¶
Compute per-column summary stats using:
df.sum()
df.max()
df.min()
df.mean()
df.sd()
df.mode()
df.nmodal()
df.nunique()
Select Subsets of Rows/Columns¶
Select subsets of rows and/or columns using:
df[:, "A"] # select 1 column
df[:10, :] # first 10 rows
df[::-1, "A":"D"]     # reverse row order, columns from A to D
df[27, 3] # single element in row 27, column 3 (0-based)
Delete Rows/Columns¶
Delete rows and/or columns using:
del df[:, "D"] # delete column D
del df[f.A < 0, :] # delete rows where column A has negative values
Filter Rows¶
Filter rows via an expression. In this example, mean, sd, and f are all symbols imported from datatable:
df[(f.x > mean(f.y) + 2.5 * sd(f.y)) | (f.x < -mean(f.y) - sd(f.y)), :]
Compute Columnar Expressions¶
Compute columnar expressions using:
df[:, {"x": f.x, "y": f.y, "x+y": f.x + f.y, "x-y": f.x - f.y}]
Append Rows/Columns¶
Append rows / columns to a Frame using:
df1.cbind(df2, df3)
df1.rbind(df4, force=True)
Installation¶
This section describes how to install H2O’s datatable.
Requirements¶
- Python 3.5+
Install on Mac OS X¶
Run the following command to install datatable on Mac OS X:
pip install datatable
Install on Linux¶
Run one of the following commands to retrieve the datatable whl file for Linux environments:
# Python 3.5
pip install https://s3.amazonaws.com/h2o-release/datatable/stable/datatable-0.3.2/datatable-0.3.2-cp35-cp35m-linux_x86_64.whl
# Python 3.6
pip install https://s3.amazonaws.com/h2o-release/datatable/stable/datatable-0.3.2/datatable-0.3.2-cp36-cp36m-linux_x86_64.whl
Build from Source¶
The key component needed for building the datatable package from source is the Clang/Llvm distribution. The same distribution is also required for building the llvmlite package, which is a prerequisite for datatable. Note that the clang compiler shipped with Mac OS X is too old; in particular, it does not support the OpenMP technology.
Installing the Clang/Llvm distribution¶
- Visit https://releases.llvm.org/download.html and download the most recent version of Clang/Llvm available for your platform (but no older than version 4.0.0).
- Extract the downloaded archive into any suitable location on your hard drive.
- Create one of the environment variables LLVM4/LLVM5/LLVM6 (depending on the version of Clang/Llvm that you installed). The variable should point to the directory where you placed the Clang/Llvm distribution. For example, on Ubuntu, after downloading clang+llvm-4.0.0-x86_64-linux-gnu-ubuntu-16.10.tar.xz the sequence of steps might look like:
$ mv clang+llvm-4.0.0-x86_64-linux-gnu-ubuntu-16.10.tar.xz /opt
$ cd /opt
$ sudo tar xvf clang+llvm-4.0.0-x86_64-linux-gnu-ubuntu-16.10.tar.xz
$ export LLVM4=/opt/clang+llvm-4.0.0-x86_64-linux-gnu-ubuntu-16.10
You probably also want to put the last export line into your ~/.bash_profile.
Building datatable¶
- Verify that you have Python 3.5 or above:
$ python --version
If you don’t have Python 3.5 or later, you may want to download and install the newest version of Python, and then create and activate a virtual environment for that Python. For example:
$ virtualenv --python=python3.6 ~/py36
$ source ~/py36/bin/activate
- Build datatable:
$ make build
$ make install
$ make test
- Additional commands you may find occasionally interesting:
# Uninstall previously installed datatable
make uninstall
# Build a debug version of datatable (for example, suitable for gdb debugging)
make debug
# Generate code coverage report
make coverage
Troubleshooting¶
- If you get an error like ImportError: This package should not be accessible on Python 3, then you may have a PYTHONPATH environment variable that causes conflicts. See this SO question for details.
- If you see errors such as "implicit declaration of function 'PyUnicode_AsUTF8' is invalid in C99", "unknown type name 'PyModuleDef'" or "void function 'PyInit__datatable' should not return a value", it means your current Python is Python 2. Please revisit step 1 in the build instructions above.
- If you are seeing the error 'Python.h' file not found, then you have an incomplete version of Python installed. This is known to sometimes happen on Ubuntu systems. The solution is to run apt-get install python-dev or apt-get install python3.6-dev.
- If you run into installation errors with the llvmlite dependency, then your best bet is to attempt to install it manually before trying to build datatable:
$ pip install llvmlite
Consult the llvmlite Installation Guide for additional information.
- On OS X, if you are getting an error fatal error: 'sys/mman.h' file not found or similar, this can be fixed by installing the Xcode Command Line Tools:
$ xcode-select --install
Contributing¶
datatable is an open source project released under the Mozilla Public Licence v2. Open source projects live by their user and developer communities. We welcome and encourage your contributions of any kind!
No matter what your skill set or level of engagement with datatable, you can help others by improving the ecosystem of documentation, bug report and feature request tickets, and code.
We invite anyone who is interested to contribute, whether through pull requests, tests, GitHub issues, API suggestions, or general discussion.
Have Questions?¶
If you have questions about using datatable, post them on Stack Overflow using the [datatable] [python] tags at http://stackoverflow.com/questions/tagged/datatable+python.
Frame¶
class datatable.Frame¶
Two-dimensional column-oriented table of data. Each column has its own name and type. Types may vary across columns (unlike in a numpy array) but cannot vary within a column (unlike in a pandas DataFrame).
Internally the data is stored as C primitives and processed using multi-threaded native C++ code.
This is the primary data structure of the datatable module.
cbind()¶
Append the columns of the given Frames to the current Frame.
This is equivalent to pandas.concat(axis=1): the Frames are combined by columns, i.e. cbinding a Frame of shape [n x m] to a Frame of shape [n x k] produces a Frame of shape [n x (m + k)].
As a special case, if you cbind a single-row Frame, then that row will be replicated as many times as there are rows in the current Frame. This makes it easy to create constant columns, or to append reduction results (such as min/max/mean/etc.) to the current Frame.
If the Frame(s) being appended have a different number of rows (with the exception of Frames having 1 row), then the operation will fail by default. You can force cbinding these Frames anyway by providing the option force=True: this will fill all ‘short’ Frames with NAs. Thus there is a difference in how Frames with 1 row are treated compared to Frames with any other number of rows.
Parameters:
- frames (sequence or list of Frames) – One or more Frames to append. They should have the same number of rows (unless the option force is also used).
- force (boolean) – If True, allows Frames to be appended even if they have an unequal number of rows. The resulting Frame will have a number of rows equal to the largest among all Frames. Frames which have fewer than the largest number of rows will be padded with NAs (with the exception of Frames having just 1 row, which will be replicated instead of filled with NAs).
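A short illustrative sketch, using toy frames with invented values:
import datatable as dt

DT = dt.Frame(A=[1, 2, 3])
DT.cbind(dt.Frame(B=[4, 5, 6]))   # same number of rows: plain column append
DT.cbind(dt.Frame(C=[0]))         # single-row frame: the row is replicated 3 times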
colindex()¶
Return the index of the column name.
Parameters: name – name of the column to find the index for. This can also be an index of a column, in which case the index is checked to make sure it does not go out of bounds, and a negative index is converted into a positive one.
Raises: ValueError – if the requested column does not exist.
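For example, a small sketch with an invented frame:
DT = dt.Frame(A=[1], B=[2], C=[3])
DT.colindex("B")    # -> 1
DT.colindex(-1)     # a negative index is converted to a positive one -> 2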
copy()¶
Make a copy of this Frame.
This method creates a shallow copy of the current Frame: only references are copied, not the data itself. However, due to copy-on-write semantics any changes made to one of the Frames will not propagate to the other. Thus, for all intents and purposes the copied Frame will behave as if it was deep-copied.
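A minimal sketch of the copy-on-write behaviour, using an invented frame:
DT = dt.Frame(A=[1, 2, 3])
DT2 = DT.copy()
DT2[0, "A"] = 99    # modify the copy...
DT[0, "A"]          # ...the original still holds 1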
countna()¶
Get the number of NA values in each column.
Returns: A new datatable of shape (1, ncols) containing the counted number of NA values in each column.
countna1()¶
head()¶
Return the first n rows of the Frame, same as self[:n, :].
key¶
Tuple of column names that serve as a primary key for this Frame.
If the Frame is not keyed, this will return an empty tuple.
Assigning to this property will make the Frame keyed by the specified column(s). The key columns will be moved to the front, and the Frame will be sorted. The values in the key columns must be unique.
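A brief sketch with an invented frame; the multi-column form is shown as a commented-out assumption:
X = dt.Frame(id=[3, 1, 2], name=["c", "a", "b"])
X.key = "id"              # values in the key column must be unique
X.key                     # -> ('id',)
# X.key = ("id", "name")  # a multi-column key is also possible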
ltypes¶
The tuple of each column’s ltypes (“logical types”).
materialize()¶
max()¶
Get the maximum value of each column.
Returns: A new datatable of shape (1, ncols) containing the computed maximum values for each column (or NA if not applicable).
max1()¶
mean()¶
Get the mean of each column.
Returns: A new datatable of shape (1, ncols) containing the computed mean values for each column (or NA if not applicable).
mean1()¶
min()¶
Get the minimum value of each column.
Returns: A new datatable of shape (1, ncols) containing the computed minimum values for each column (or NA if not applicable).
min1()¶
mode()¶
Get the modal value of each column.
Returns: A new datatable of shape (1, ncols) containing the modal (most frequent) value of each column.
mode1()¶
names¶
Tuple of column names.
You can rename the Frame’s columns by assigning a new list/tuple of names to this property. The length of the new list of names must be the same as the number of columns in the Frame.
It is also possible to rename just a few columns by assigning a dictionary {oldname: newname, ...}. Any column not listed in the dictionary will retain its name.
Examples
>>> d0 = dt.Frame([[1], [2], [3]])
>>> d0.names = ['A', 'B', 'C']
>>> d0.names
('A', 'B', 'C')
>>> d0.names = {'B': 'middle'}
>>> d0.names
('A', 'middle', 'C')
>>> del d0.names
>>> d0.names
('C0', 'C1', 'C2')
ncols¶
Number of columns in the Frame.
nmodal()¶
Get the number of modal values in each column.
Returns: A new datatable of shape (1, ncols) containing the counted number of most frequent values in each column.
nmodal1()¶
nrows¶
Number of rows in the Frame.
Assigning to this property will change the height of the Frame, either by truncating if the new number of rows is smaller than the current one, or by filling with NAs if the new number of rows is greater.
Increasing the number of rows of a keyed Frame is not allowed.
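A short sketch of resizing, using an invented frame:
DT = dt.Frame(A=[1, 2, 3])
DT.nrows = 5    # grow: the two new rows are filled with NAs
DT.nrows = 2    # shrink: the frame is truncated to its first 2 rows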
nunique()¶
Get the number of unique values in each column.
Returns: A new datatable of shape (1, ncols) containing the counted number of unique values in each column.
nunique1()¶
rbind(*frames, force=False, bynames=True)¶
Append rows of frames to the current Frame.
This is equivalent to list.extend() in Python: the Frames are combined by rows, i.e. rbinding a Frame of shape [n x k] to a Frame of shape [m x k] produces a Frame of shape [(m + n) x k].
This method modifies the current Frame in-place. If you do not want the current Frame modified, then append all Frames to an empty Frame: dt.Frame().rbind(frame1, frame2).
If the Frame(s) being appended have columns of types different from the current Frame, then these columns will be promoted to the larger of the two types: bool -> int -> float -> string.
If you need to append multiple Frames, then it is more efficient to collect them into an array first and then do a single rbind() than it is to append them one by one.
Appending data to a Frame opened from disk will force loading the current Frame into memory, which may fail with an OutOfMemory exception.
Parameters:
- frames (sequence or list of Frames) – One or more Frames to append. These Frames should have the same columnar structure as the current Frame (unless the option force is used).
- force (boolean, default False) – If True, then the Frames are allowed to have a mismatching set of columns. Any gaps in the data will be filled with NAs.
- bynames (boolean, default True) – If True, the columns in the Frames are matched by their names. For example, if one Frame has columns ["colA", "colB", "colC"] and the other ["colB", "colA", "colC"], then we will swap the order of the first two columns of the appended Frame before performing the append. However, if bynames is False, then the column names will be ignored, and the columns will be matched according to their order, i.e. the i-th column of the current Frame to the i-th column of each appended Frame.
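A brief sketch with toy frames and invented values:
DT1 = dt.Frame(A=[1, 2], B=["x", "y"])
DT2 = dt.Frame(B=["z"], A=[3])
DT1.rbind(DT2)                # columns are matched by name (bynames=True)

DT3 = dt.Frame(A=[4], C=[True])
DT1.rbind(DT3, force=True)    # mismatching columns allowed; gaps are filled with NAs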
replace()¶
Replace the given value(s) replace_what with replace_with in the entire Frame.
For each replace value, this method operates only on columns of types appropriate for that value. For example, if replace_what is a list [-1, math.inf, None, "??"], then the value -1 will be replaced in integer columns only, math.inf only in real columns, None in columns of all types, and finally "??" only in string columns.
The replacement value must match the type of the target being replaced, otherwise an exception will be thrown. That is, a bool must be replaced with a bool, an int with an int, a float with a float, and a string with a string. The None value (representing NA) matches any column type, and therefore can be used as either a replacement target or a replace value for any column. In particular, the following is valid: DT.replace(None, [-1, -1.0, ""]). This will replace NA values in int columns with -1, in real columns with -1.0, and in string columns with an empty string.
The replace operation never causes a column to change its logical type. Thus, an integer column will remain integer, a string column will remain string, etc. However, replacing may cause a column to change its stype, provided that the ltype remains constant. For example, replacing 0 with -999 within an int8 column will cause that column to be converted into the int32 stype.
Parameters:
- replace_what (None, bool, int, float, list, or dict) – Value(s) to search for and replace.
- replace_with (single value, or list) – The replacement value(s). If replace_what is a single value, then this must be a single value too. If replace_what is a list, then this could be either a single value or a list of the same length. If replace_what is a dict, then this value should not be passed.
Returns: Nothing; the replacement is performed in-place.
Examples
>>> df = dt.Frame([1, 2, 3] * 3)
>>> df.replace(1, -1)
>>> df.to_list()
[[-1, 2, 3, -1, 2, 3, -1, 2, 3]]
>>> df.replace({-1: 100, 2: 200, "foo": None})
>>> df.to_list()
[[100, 200, 3, 100, 200, 3, 100, 200, 3]]
save(dest=None, format='jay', _strategy='auto')¶
Save the Frame in binary NFF/Jay format.
Parameters:
- dest – destination where the Frame should be saved.
- format – either "nff" or "jay".
- _strategy – one of "mmap", "write" or "auto".
sd()¶
Get the standard deviation of each column.
Returns: A new datatable of shape (1, ncols) containing the computed standard deviation values for each column (or NA if not applicable).
sd1()¶
shape¶
Tuple with the (nrows, ncols) dimensions of the Frame.
stypes¶
The tuple of each column’s stypes (“storage types”).
sum()¶
Get the sum of each column.
Returns: A new datatable of shape (1, ncols) containing the computed sums for each column (or NA if not applicable).
sum1()¶
tail()¶
Return the last n rows of the Frame, same as self[-n:, :].
to_csv(path='', nthreads=0, hex=False, verbose=False, **kwargs)¶
Write the Frame into the provided file in CSV format.
Parameters:
- dt (Frame) – Frame object to write into CSV.
- path (str) – Path to the output CSV file that will be created. If the file already exists, it will be overwritten. If path is not given, then the Frame will be serialized into a string, and that string will be returned.
- nthreads (int) – How many threads to use for writing. The value of 0 means to use all available threads. Negative values mean to use that many threads fewer than the maximum available.
- hex (bool) – If True, then all floating-point values will be printed in hex format (equivalent to the %a format in C printf). This format is around 3 times faster to write/read compared to the usual decimal representation, so its use is recommended if you need maximum speed.
- verbose (bool) – If True, some extra information will be printed to the console, which may help to debug the inner workings of the algorithm.
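A small usage sketch; the file name is a placeholder:
DT.to_csv("out.csv", nthreads=4)   # write to a file using 4 threads
csv_text = DT.to_csv()             # no path given: the CSV is returned as a string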
to_dict()¶
Convert the Frame into a dictionary of lists, by columns.
Returns a dictionary with ncols entries, each being the colname: coldata pair, where colname is a string and coldata is an array of the column’s data.
Examples
>>> DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
>>> DT.to_dict()
{"A": [1, 2, 3], "B": ["aye", "nay", "tain"]}
to_list()¶
Convert the Frame into a list of lists, by columns.
Returns a list of ncols lists, each inner list representing one column of the Frame.
Examples
>>> DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
>>> DT.to_list()
[[1, 2, 3], ["aye", "nay", "tain"]]
to_numpy(stype=None)¶
Convert the Frame into a numpy array, optionally forcing it into a specific stype/dtype.
Parameters: stype (datatable.stype, numpy.dtype or str) – Cast the datatable into this dtype before converting it into a numpy array.
to_pandas()¶
Convert the Frame into a pandas DataFrame, or raise an error if the pandas module is not installed.
to_tuples()¶
Convert the Frame into a list of tuples, by rows.
Returns a list of nrows tuples, where each tuple has length ncols and contains data from each respective row of the Frame.
Examples
>>> DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
>>> DT.to_tuples()
[(1, "aye"), (2, "nay"), (3, "tain")]
Ftrl¶
class datatable.models.Ftrl¶
Follow the Regularized Leader (FTRL) model with hashing trick.
See this reference for more details: https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf
Parameters: - alpha (float) – alpha in per-coordinate learning rate formula.
- beta (float) – beta in per-coordinate learning rate formula.
- lambda1 (float) – L1 regularization parameter.
- lambda2 (float) – L2 regularization parameter.
- nbins (int) – Number of bins to be used after the hashing trick.
- nepochs (int) – Number of epochs to train for.
- interactions (bool) – Switch to enable second order feature interactions.
alpha¶
alpha in the per-coordinate FTRL-Proximal algorithm.
beta¶
beta in the per-coordinate FTRL-Proximal algorithm.
colname_hashes¶
Column name hashes.
feature_importances¶
One-column frame with the overall weight contributions calculated feature-wise during training and predicting. It can be interpreted as feature importance information.
fit()¶
Train an FTRL model on a dataset.
Parameters:
- X (Frame) – Training frame of shape (nrows, ncols).
- y (Frame) – Target frame of shape (nrows, 1).
Returns: None
interactions¶
Switch to enable second order feature interactions.
labels¶
List of labels for multinomial regression.
lambda1¶
L1 regularization parameter.
lambda2¶
L2 regularization parameter.
model¶
Tuple of model frames. Each frame has two columns, z and n, and nbins rows, where nbins is the number of bins for the hashing trick. Both column types are float64.
nbins¶
Number of bins to be used for the hashing trick.
nepochs¶
Number of epochs to train a model.
params¶
FTRL model parameters.
predict()¶
Make predictions for a dataset.
Parameters: X (Frame) – Frame of shape (nrows, ncols) to make predictions for. It must have the same number of columns as the training frame.
Returns: A new frame of shape (nrows, 1) with the predicted probability for each row of frame X.
reset()¶
Reset the FTRL model and feature importance information, i.e. initialize the model and importance frames with zeros.
Parameters: None
Returns: None
FTRL¶
This section describes the FTRL (Follow the Regularized Leader) model as implemented in datatable.
FTRL Model Information¶
The Follow the Regularized Leader (FTRL) model is a datatable implementation of the FTRL-Proximal online learning algorithm for binomial logistic regression. It uses a hashing trick for feature vectorization and the Hogwild approach for parallelization. FTRL for multinomial classification and continuous targets is implemented experimentally.
Create an FTRL Model¶
The FTRL model is implemented as the Ftrl Python class, which is part of datatable.models, so to use the model you should first do
from datatable.models import Ftrl
and then create a model as
ftrl_model = Ftrl()
FTRL Model Parameters¶
The FTRL model requires a list of parameters for training and making predictions, namely:
- alpha – learning rate, defaults to 0.005.
- beta – beta parameter, defaults to 1.0.
- lambda1 – L1 regularization parameter, defaults to 0.0.
- lambda2 – L2 regularization parameter, defaults to 1.0.
- nbins – the number of bins for the hashing trick, defaults to 1000000.
- nepochs – the number of epochs to train the model for, defaults to 1.
- interactions – whether to enable second order feature interactions, defaults to False.
If some parameters need to be changed, this can be done either when creating the model, as
ftrl_model = Ftrl(alpha = 0.1, nbins = 100, interactions = False)
or, if the model already exists, as
ftrl_model.alpha = 0.1
ftrl_model.nbins = 100
ftrl_model.interactions = False
If some parameters were not set explicitly, they will be assigned the default values.
Training a Model¶
Use the fit() method to train a model for a binomial logistic regression problem:
ftrl_model.fit(X, y)
where X is a frame of shape (nrows, ncols) to be trained on, and y is a frame of shape (nrows, 1) with a bool type for the target column. The following datatable column types are supported for the X frame: bool, int, real and str.
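A minimal end-to-end sketch; the data and column names here are invented for illustration:
import datatable as dt
from datatable.models import Ftrl

# toy training data: two feature columns and a boolean target
X = dt.Frame(height=[150, 160, 170, 180], color=["red", "blue", "red", "blue"])
y = dt.Frame(target=[True, False, True, False])

ftrl_model = Ftrl(nepochs=2)
ftrl_model.fit(X, y)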
Resetting a Model¶
Use the reset() method to reset a model:
ftrl_model.reset()
This will reset model weights, but it will not affect learning parameters. To reset parameters to default values, you can do
ftrl_model.params = Ftrl().params
Making Predictions¶
Use the predict() method to make predictions:
targets = ftrl_model.predict(X)
where X is a frame of shape (nrows, ncols) to make predictions for. X should have the same number of columns as the training frame. The predict() method returns a new frame of shape (nrows, 1) with the predicted probability for each row of frame X.
Feature Importances¶
To estimate feature importances, the overall weight contributions are calculated feature-wise during training and predicting. Feature importances can be accessed as
fi = ftrl_model.feature_importances
where fi will be a frame of shape (nfeatures, 2) containing feature names and their importances, normalized to the [0; 1] range.
Further Reading¶
For detailed help, please also refer to help(Ftrl).