Skip to content

Read and write Stata, SAS and SPSS data files with Julia tables

License

Notifications You must be signed in to change notification settings

junyuan-chen/ReadStatTables.jl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

55 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ReadStatTables.jl

Read and write Stata, SAS and SPSS data files with Julia tables

CI-stable codecov PkgEval docs-stable docs-dev

ReadStatTables.jl is a Julia package for reading and writing Stata, SAS and SPSS data files with Tables.jl-compatible tables. It utilizes the ReadStat C library developed by Evan Miller for parsing and writing the data files. The same C library is also the backend of popular packages in other languages such as pyreadstat for Python and haven for R. As the Julia counterpart for similar purposes, ReadStatTables.jl leverages the state-of-the-art Julia ecosystem for usability and performance. Its read performance, especially when taking advantage of multiple threads, surpasses all related packages by a sizable margin based on the benchmark results here:


Features

ReadStatTables.jl provides the following features in addition to wrapping the C interface of ReadStat:

  • Fast multi-threaded data collection from ReadStat parsers to a Tables.jl-compatible ReadStatTable
  • Interface of file-level and variable-level metadata compatible with DataAPI.jl
  • Integration of value labels into data columns via a custom array type LabeledArray
  • Translation of date and time values into Julia time types Date and DateTime
  • Write support for Tables.jl-compatible tables (experimental)

Supported File Formats

ReadStatTables.jl recognizes data files with the following file extensions at this moment:

  • Stata: .dta
  • SAS: .sas7bdat and .xpt
  • SPSS: .sav and .por

Installation

ReadStatTables.jl can be installed with the Julia package manager Pkg. From the Julia REPL, type ] to enter the Pkg REPL and run:

pkg> add ReadStatTables

Quick Start

To read a data file located at data/sample.dta:

julia> using ReadStatTables

julia> tb = readstat("data/sample.dta")
5×7 ReadStatTable:
 Row │  mychar    mynum      mydate                dtime         mylabl           myord               mytime 
     │ String3  Float64       Date?            DateTime?  Labeled{Int8}  Labeled{Int8?}             DateTime 
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       a      1.1  2018-05-06  2018-05-06T10:10:10           Male             low  1960-01-01T10:10:10
   2 │       b      1.2  1880-05-06  1880-05-06T10:10:10         Female          medium  1960-01-01T23:10:10
   3 │       c  -1000.3  1960-01-01  1960-01-01T00:00:00           Male            high  1960-01-01T00:00:00
   4 │       d     -1.4  1583-01-01  1583-01-01T00:00:00         Female             low  1960-01-01T16:10:10
   5 │       e   1000.3     missing              missing           Male         missing  2000-01-01T00:00:00

To access a column from the above table:

julia> tb.myord
5-element LabeledVector{Union{Missing, Int8}, Vector{Union{Missing, Int8}}, Union{Char, Int32}}:
 1 => low
 2 => medium
 3 => high
 1 => low
 missing => missing

Notice that for data variables with value labels, both the original values and the value labels are preserved. For variables representing date/time, the translation to Julia Date/DateTime is lazy. One can access the underlying numerical values as follows:

julia> tb.mydate.data
5-element SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}:
   21310.0
  -29093.0
       0.0
 -137696.0
        missing

File-level and variable-level metadata can be retrieved and modified via methods compatible with DataAPI.jl:

julia> metadata(tb)
ReadStatMeta:
  row count           => 5
  var count           => 7
  modified time       => 2021-04-23T04:36:00
  file format version => 118
  file label          => A test file
  file extension      => .dta

julia> colmetadata(tb, :mylabl)
ReadStatColMeta:
  label         => labeled
  format        => %16.0f
  type          => READSTAT_TYPE_INT8
  value label   => mylabl
  storage width => 1
  display width => 16
  measure       => READSTAT_MEASURE_UNKNOWN
  alignment     => READSTAT_ALIGNMENT_RIGHT

For more details, please see the documentation.