

Some data formats do a better job at this than others. While there is no one true answer that works for everyone, this article will try to help you narrow down the field and make an informed decision.

“Best” is situation-specific

Different use cases imply different requirements. If you need to share data with other organizations, or even other teams within your organization, you need to limit yourself to data formats you know they will be able to process. That is very situation-specific, so it’s difficult to give a universal answer. If someone is handing you a file, they control the format. And if you will only ever process it once, changing the file format may not be worth the trouble.

If you are streaming data over the network and want to process it row-by-row as it arrives, this implies a very different data format: you want something that makes for easy row-based parsing. CSV is actually pretty good at this, even though, as we’ll see, it’s otherwise an annoying format to work with. If you’re using Pandas, you’re less likely to be doing this sort of processing.

The situation we’re considering: internal datasets

To help limit the scope of discussion, we’ll assume you’re using large datasets you write to disk in one internal process, then read the data later in one or more additional internal processes.

You might not be using Pandas forever, or you might want to use other libraries in certain situations. So as a stand-in for future flexibility, we’ll also add the requirement that you should be able to easily process your data with Polars, an up-and-coming Pandas replacement. That rules out many data formats that Polars doesn’t support yet, and formats that are too Pandas-specific, like pickling a Pandas DataFrame.

Given the above criteria, we’re going to consider three formats: CSV, JSON, and Parquet.

Types: CSV has no inherent notion of types; sometimes strings will be in quotes, but even that isn’t guaranteed.
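To make the point about CSV lacking types concrete, here is a minimal sketch using only Python’s standard-library `csv` module rather than Pandas or Polars. The sample rows and column names are made up for illustration. It writes integers, floats, booleans, and strings to CSV, reads them back, and checks what types survive the round trip.

```python
import csv
import io

# Hypothetical sample rows; the column names are invented for this example.
rows = [
    {"id": 1, "price": 9.5, "in_stock": True, "zip_code": "02139"},
    {"id": 2, "price": 12.0, "in_stock": False, "zip_code": "10001"},
]

# Write the rows as CSV into an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "in_stock", "zip_code"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the same data back from the CSV text.
loaded = list(csv.DictReader(io.StringIO(csv_text)))

# Collect the Python type name of every value that came back.
types_seen = {type(value).__name__ for row in loaded for value in row.values()}

print(csv_text)
print(loaded[0])
print(types_seen)
```

With the writer’s default settings, none of the values end up quoted, so the file alone does not say whether `02139` was meant to be a number or a piece of text. Whatever reads the file has to guess or be told, which is the kind of ambiguity a format that stores its schema, such as Parquet, avoids.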
