posachristian.blogg.se - Python json csv

Before you can process your data with Pandas, you need to load it (from disk or remote storage). There are plenty of data formats supported by Pandas, from CSV, to JSON, to Parquet, and many others as well.

You don’t want loading the data to be slow, or use lots of memory: that’s pure overhead.

Ideally you’d want a file format that’s fast, efficient, small, and broadly supported.

You also want to make sure the loaded data has all the right types: numeric types, datetimes, and so on.

Some data formats do a better job at this than others. While there is no one true answer that works for everyone, this article will try to help you narrow down the field and make an informed decision.

“Best” is situation-specific

Different use cases imply different requirements. If you need to share data with other organizations, or even other teams within your organization, you need to limit yourself to data formats you know they will be able to process. That is very situation-specific, so it’s difficult to give a universal answer.

If someone is handing you a file, they control the format. And if you will only ever process it once, changing the file format may not be worth the trouble.

If you are streaming data over the network and want to process it row-by-row as it arrives, this implies a very different data format: you want something that makes for easy row-based parsing. CSV is actually pretty good at this, even though, as we’ll see, it’s otherwise an annoying format to work with. If you’re using Pandas, you’re less likely to be doing this sort of processing.

The situation we’re considering: internal datasets

To help limit the scope of discussion, we’ll assume you’re using large datasets that you write to disk in one internal process, then read later in one or more additional internal processes.

We can think about multiple criteria we’d want for a data format:

Types: Some data is numeric, some data is composed of strings, other data might be time-based; it’s useful to be able to distinguish between these different types.

Efficient disk format: Minimize how much space is used on disk.

Efficient reading and writing: Writing to disk and loading from disk should be fast.

You might not be using Pandas forever, or you might want to use other libraries in certain situations. So as a stand-in for future flexibility, we’ll also add the requirement that you should be able to easily process your data with Polars, an up-and-coming Pandas replacement. That rules out many data formats that Polars doesn’t support yet, and formats that are too Pandas-specific, like pickling a Pandas DataFrame.

Given the above criteria, we’re going to consider three formats: CSV, JSON, and Parquet.

Types: CSV has no inherent notion of types; sometimes strings will be in quotes, but even that isn’t guaranteed.
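To make the point about types concrete, here is a minimal sketch that round-trips one row through CSV and through JSON. It uses only Python’s standard library rather than Pandas, and the column names and values are made up for illustration. Pandas’ own CSV reader guesses types by inspecting the values instead of returning raw strings, but it is still guessing, because the file itself carries no type information.

```python
import csv
import io
import json
from datetime import date

# Hypothetical sample row: a string that looks like a number,
# an integer, a float, and a date.
rows = [
    {"zip_code": "02139", "visits": 3, "score": 4.5, "day": date(2023, 1, 15)},
]

# Write the row as CSV: every value is converted to text on the way out.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back: the csv module hands every field back as a plain string.
csv_row = next(csv.DictReader(io.StringIO(csv_text)))
csv_types = {key: type(value).__name__ for key, value in csv_row.items()}

# JSON distinguishes strings from numbers, but has no date type,
# so the date has to be encoded by hand (here as an ISO 8601 string).
json_text = json.dumps([{**row, "day": row["day"].isoformat()} for row in rows])
json_row = json.loads(json_text)[0]
json_types = {key: type(value).__name__ for key, value in json_row.items()}

print("CSV: ", csv_types)
print("JSON:", json_types)
```

After the CSV round trip every column is a `str`, including `visits` and `score`. After the JSON round trip `visits` is an `int` and `score` is a `float`, while `day` comes back as a string that the reader has to know to parse as a date.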