The PDF FAQ may be moved to this wiki page, for easier and faster edits and additions by anyone.
Here are some new items to test the look, feel and anchors.
What are primary and secondary indexes in data.table?
Reading data.table from RDS or RData file
`setkey(DT, col1, col2)` orders the rows by column `col1`, then within each group of `col1` it orders by `col2`. The row order is changed by reference in RAM. Subsequent joins and groups on those key columns then take advantage of the sort order for efficiency. (Imagine how difficult looking for a phone number in a printed telephone directory would be if it wasn't sorted by surname then forename. That's literally all `setkey` does: it sorts the rows by the columns you specify.) This is the primary index. It doesn't use any extra RAM; it simply changes the row order in RAM and marks the key columns. It is analogous to a clustered index in SQL.
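As a minimal sketch of the behaviour described above, the following sets a two-column key on a small table and then looks up a row through it. The table contents and the `val` column are hypothetical and exist only for illustration.

```r
library(data.table)

# Hypothetical example data; the values are illustrative only.
DT <- data.table(
  col1 = c("b", "a", "b", "a"),
  col2 = c(2L, 2L, 1L, 1L),
  val  = c(10, 20, 30, 40)
)

# Sort rows by col1, then by col2 within col1, by reference (no copy of DT).
setkey(DT, col1, col2)

# The columns currently marked as the key.
key(DT)

# Rows are now physically stored in key order.
print(DT)

# A keyed lookup: the sorted key columns allow a binary search
# instead of a scan of the whole column.
res <- DT[.("a", 2L)]
print(res)
```

The `.()` form in `i` matches its arguments against the key columns in order, so `"a"` is matched on `col1` and `2L` on `col2`.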
However, you can only have one primary key, because data can only be physically sorted in RAM in one way at a time. Choose the primary index to be the one you use most often (e.g. `[id, date]`). Sometimes there isn't an obvious choice for the primary key, or you need to join and group by many different columns in different orders. Enter a secondary index. This does use memory (`4*nrow` bytes, regardless of the number of columns in the index) to store the order of the rows by the columns you specify, but it doesn't actually reorder the rows in RAM. Subsequent joins and groups take advantage of the secondary key's order but need to hop via that index, so they aren't as efficient as with a primary index. They are still a lot faster than a full vector scan. There is no limit to the number of secondary indexes, since each one is just a different ordering vector.

Typically you don't need to create secondary indexes yourself. They are created and used for you automatically when you use data.table normally; e.g. `DT[someCol == someVal, ]` and `DT[someCol %in% someVals, ]` will create, attach and then use the secondary index. This is faster in data.table than a vector scan, so automatic indexing is on by default, since there is no up-front penalty. There is an option to turn off automatic indexing, e.g. if somehow many indexes are being created and even the relatively small amount of extra memory becomes too large.
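The sketch below illustrates the secondary-index behaviour described above. It assumes a data.table version that provides `setindex()` and `indices()`, and it assumes the option referred to in the text is `datatable.auto.index`. None of these names appear in the original text, and the table contents are hypothetical.

```r
library(data.table)

# Hypothetical example data.
DT <- data.table(id = c(3L, 1L, 2L, 1L), x = c("c", "a", "b", "a"))

# Create a secondary index explicitly. Unlike setkey(), this only stores
# an ordering vector; the rows themselves are not reordered.
setindex(DT, id)
indices(DT)   # names of the secondary indexes attached to DT
DT$id         # row order is unchanged

# An ordinary subset using == (or %in%) on a column is the kind of query
# that can create and then reuse a secondary index automatically.
sub <- DT[x == "a"]
print(sub)
indices(DT)

# Assumed option name: switch automatic index creation off,
# then restore the previous setting.
old <- options(datatable.auto.index = FALSE)
options(old)
```

Calling `indices(DT)` before and after a subset is a simple way to check whether an index was attached in your version and configuration.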
We use the words index and key interchangeably.
**[Reading data.table from RDS or RData file](#loading-from-rds)**

`*.RDS` and `*.RData` are file types which can store in-memory R objects on disk efficiently. Storing a data.table in such a binary file loses its column over-allocation, though. This isn't a big deal: your data.table will be copied in memory on the next by-reference operation, which throws a warning. It is therefore recommended to call `alloc.col()` on each loaded data.table.
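A minimal sketch of that round trip, assuming `alloc.col()` and `truelength()` are exported by your data.table version (newer releases also offer `setalloccol()` under the same behaviour). The file path and table contents are hypothetical.

```r
library(data.table)

# Hypothetical example data, written to a temporary .rds file.
DT <- data.table(a = 1:3)
f <- tempfile(fileext = ".rds")
saveRDS(DT, f)

# Reading the object back gives a data.table without spare column slots.
DT2 <- readRDS(f)
truelength(DT2)   # allocated column slots right after loading

# Restore the over-allocation so columns can be added by reference.
# invisible() only suppresses printing of the returned table.
invisible(alloc.col(DT2))
truelength(DT2)   # allocated slots now exceed the number of columns

# Adding a column by reference now has spare slots to use.
DT2[, b := a * 2L]
print(DT2)

unlink(f)
```

The same call applies to each data.table restored by `load()` from an `*.RData` file.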