Reading Text Files¶
In this notebook we show how to read text files that can be read by numpy's loadtxt function. These are essentially column-based text files.
The notebook also shows how Kosh can help by attaching metadata to the file, which in turn helps the loader (and potentially helps Kosh users pinpoint the text file they need).
Reading in the whole text file¶
We will be using the text files in this directory
Raw numpy¶
import numpy
filename = "../tests/baselines/npy/example_columns_no_header.txt"
data = numpy.loadtxt(filename)
print(data.shape)
(25, 6)
Kosh¶
Let's set up a Kosh store, create a dataset, and associate this file with it. Numpy's loadtxt is used via the numpy/txt mime_type.
import kosh
store = kosh.connect("numpy_loadtxt.sql", delete_all_contents=True)
dataset = store.create(name="example1")
dataset.associate(filename, mime_type="numpy/txt")
print("Features:", dataset.list_features())
print(dataset["features"][:].shape)
Features: ['features']
(25, 6)
Slicing¶
While it is nice to be able to read the whole file, doing so can be very time consuming if the file gets big, and the data may not even fit into memory.
Kosh's loader can slice the data appropriately and read only the necessary part of the file, which avoids both problems:
print(dataset["features"][2:4, 1:5].shape)
(2, 4)
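For comparison, a similar partial read can be sketched with plain numpy by combining loadtxt's skiprows, max_rows, and usecols keywords. This is only an illustration of the idea, not necessarily how Kosh's loader implements slicing; the in-memory 25x6 table below is made-up data standing in for the example file.

```python
import io

import numpy

# Stand-in for a 25x6 column-based text file (values are made up).
full = numpy.arange(150, dtype=float).reshape(25, 6)
buffer = io.StringIO()
numpy.savetxt(buffer, full)
buffer.seek(0)

# Rough equivalent of [2:4, 1:5]:
# skip the first 2 rows, read 2 rows, keep columns 1 through 4.
subset = numpy.loadtxt(buffer, skiprows=2, max_rows=2, usecols=(1, 2, 3, 4))
print(subset.shape)
```

Only contiguous row ranges map cleanly onto skiprows and max_rows; strided or negative slices would need extra handling.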
Header rows¶
The text file may also contain a few header lines.
A good example is example_non_hashed_header_rows.txt
In that case numpy's loadtxt cannot read the file as is (although you could pass the skiprows keyword):
filename = "../tests/baselines/npy/example_non_hashed_header_rows.txt"
try:
    data = numpy.loadtxt(filename)
except ValueError:
    print("Numpy cannot read this text file")
Numpy cannot read this text file
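The skiprows keyword mentioned above can be tried on a small in-memory stand-in. The six header lines and the numeric values here are invented for illustration and do not reproduce the actual contents of example_non_hashed_header_rows.txt.

```python
import io

import numpy

# Made-up file content: 6 header lines without a leading '#', then 25 rows of 6 numbers.
header = "\n".join(f"header line {i}" for i in range(6))
rows = "\n".join(
    " ".join(str(float(r * 6 + c)) for c in range(6)) for r in range(25)
)
text = header + "\n" + rows + "\n"

# Without skiprows, the header text cannot be converted to floats.
try:
    numpy.loadtxt(io.StringIO(text))
    read_without_skip = True
except ValueError:
    read_without_skip = False

# Skipping the 6 header lines lets loadtxt parse the numeric block.
data = numpy.loadtxt(io.StringIO(text), skiprows=6)
print(read_without_skip, data.shape)
```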
Similarly, Kosh's loader won't be able to read the file as is:
dataset = store.create(name="example_headers_rows")
associated = dataset.associate(filename, mime_type="numpy/txt", id_only=False)
try:
    print(dataset["features"][:].shape)
except ValueError:
    print("Cannot read as is")
Cannot read as is
Fortunately, we can add metadata to the Kosh-associated object to tell the loader what to do:
associated.skiprows = 6
print(dataset["features"][:].shape)
(25, 6)
Column Headers¶
Quite frequently, one of the header rows contains the column names.
Let's add some metadata informing the loader which line contains the feature names.
associated.features_line = 5
# we'll need to clear the features cache
print(dataset.list_features(use_cache=False))
['time', 'zeros', 'ones', 'twos', 'threes', 'fours']
We can now access each feature/column separately, by name. This can be useful if you're reading data from text files that are organized differently but contain the same column names.
zeros = dataset["zeros"][:4]
print(zeros)
[0.65485361 0.04917816 0.20506388 0.24302516]
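To make the role of the two metadata values more concrete, here is a minimal plain-numpy sketch of name-based column access. It assumes that features_line is a zero-based line index and that names are whitespace-separated; it is not Kosh's actual implementation, and the file content is made up.

```python
import io

import numpy

# Made-up file content: 5 free-text header lines, 1 line of column names, 25 data rows.
lines = [f"header line {i}" for i in range(5)]
lines.append("time zeros ones twos threes fours")  # zero-based line index 5
lines += [" ".join(str(float(r + c)) for c in range(6)) for r in range(25)]
text = "\n".join(lines) + "\n"

features_line = 5  # assumed: zero-based index of the line holding the names
skiprows = 6       # number of leading lines to skip before the numeric data

# Column names come from the designated header line.
names = text.splitlines()[features_line].split()

# The numeric block is read as usual, then a column is picked by name.
data = numpy.loadtxt(io.StringIO(text), skiprows=skiprows)
zeros = data[:, names.index("zeros")]
print(names)
print(zeros[:4])
```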
In some cases the column headers are laid out in fixed-width fields, which can cause two adjacent names to touch each other.
A good example is: ../tests/baselines/npy/example_column_names_in_header_via_constant_width.txt
filename = "../tests/baselines/npy/example_column_names_in_header_via_constant_width.txt"
dataset = store.create(name="example_constant_width")
associated = dataset.associate(filename, mime_type="numpy/txt", id_only=False)
associated.skiprows = 1
associated.features_line = 0
associated.columns_width = 10
print(dataset.list_features())
['time', 'zeros col', 'ones col', 'twos col', 'threes col', 'fours']
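The reason a column width is needed can be sketched in pure Python: once names contain spaces or fill their field completely, splitting on whitespace no longer recovers them, while cutting the line into fixed-width chunks does. The header string and the helper split_fixed_width below are hypothetical and written only for this illustration; they are not taken from Kosh or from the example file.

```python
names = ["time", "zeros col", "ones col", "twos col", "threes col", "fours"]
columns_width = 10

# Build a header where each name is padded to 10 characters.
# "threes col" fills its field exactly, so it touches "fours".
header = "".join(name.ljust(columns_width) for name in names)


def split_fixed_width(line, width):
    """Cut a line into consecutive chunks of `width` characters and strip padding."""
    return [line[i:i + width].strip() for i in range(0, len(line), width)]


parsed = split_fixed_width(header, columns_width)
print(parsed)

# Whitespace splitting breaks names apart and merges the touching ones.
print(header.split())
```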