Reading Text Files

This notebook shows how to read text files that numpy's loadtxt function can parse.

These are essentially column-based text files.

The notebook also shows how Kosh can help by attaching metadata to the file, which in turn helps the loader and can help Kosh users pinpoint the text file they need.

Reading in the whole text file

We will use the text files in the ../tests/baselines/npy directory.

Raw numpy

In [1]:

import numpy

filename = "../tests/baselines/npy/example_columns_no_header.txt"
data = numpy.loadtxt(filename)
print(data.shape)

(25, 6)
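
To make "column-based" concrete, here is a minimal sketch that feeds numpy's loadtxt an in-memory stand-in for such a file. The sample text and the variable names are made up for illustration; they are not part of the test baselines used in this notebook.

```python
import io

import numpy

# Hypothetical stand-in for a column-based text file:
# one row per line, whitespace-separated numeric columns.
text = """\
0.0 1.0 2.0
0.5 1.5 2.5
1.0 2.0 3.0
1.5 2.5 3.5
"""

# loadtxt accepts a file-like object as well as a file name.
sample = numpy.loadtxt(io.StringIO(text))
print(sample.shape)   # (number of rows, number of columns)
print(sample[:, 0])   # the first column
```

Every row must have the same number of columns, otherwise loadtxt raises an error.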

Kosh

Let's set up a Kosh store, create a dataset, and associate this file with it. Numpy's loadtxt is used via the numpy/txt mime_type.

In [2]:

import kosh

store = kosh.connect("numpy_loadtxt.sql", delete_all_contents=True)
dataset = store.create(name="example1")
dataset.associate(filename, mime_type="numpy/txt")
print("Features:", dataset.list_features())
print(dataset["features"][:].shape)

Features: ['features']
(25, 6)

Slicing

While it is nice to be able to read the whole file, doing so can be very time consuming when the file gets big, and the data may not even fit into memory.

Kosh's loader can slice the data appropriately and read only the necessary part of the file, which addresses both problems:

In [3]:

print(dataset["features"][2:4, 1:5].shape)

(2, 4)
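
For comparison, a partial read can also be sketched with raw numpy by combining the skiprows, max_rows, and usecols keywords of loadtxt. This only illustrates the idea of reading a sub-block; it is not necessarily how Kosh's loader is implemented, and the sample table below is hypothetical.

```python
import io

import numpy

# Hypothetical 5x4 table standing in for a much larger file.
text = "\n".join(
    " ".join(str(10 * row + col) for col in range(4)) for row in range(5)
)

# Equivalent of full[2:4, 1:3], without building the full array first.
part = numpy.loadtxt(
    io.StringIO(text),
    skiprows=2,       # skip rows 0 and 1
    max_rows=2,       # then read rows 2 and 3 only
    usecols=(1, 2),   # keep columns 1 and 2
)

# Read everything once, only to check the partial read against it.
full = numpy.loadtxt(io.StringIO(text))
print(part.shape)
print(numpy.array_equal(part, full[2:4, 1:3]))
```

The keyword approach only covers contiguous row ranges, so the slicing syntax shown above is more convenient when it is available.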

Header rows

A text file may start with a few header lines.

A good example is example_non_hashed_header_rows.txt.

numpy's loadtxt cannot read this file as is (although you could pass the skiprows keyword):

In [4]:

filename = "../tests/baselines/npy/example_non_hashed_header_rows.txt"
try:
    data = numpy.loadtxt(filename)
except ValueError:
    print("Numpy cannot read this text file")

Numpy cannot read this text file

Similarly, Kosh's loader cannot read the file as is:

In [5]:

dataset = store.create(name="example_headers_rows")
associated = dataset.associate(filename, mime_type="numpy/txt", id_only=False)
try:
    print(dataset["features"][:].shape)
except ValueError:
    print("Cannot read as is")

Cannot read as is

Fortunately, we can add metadata to our Kosh-associated object to tell the loader what to do:

In [6]:

associated.skiprows = 6
print(dataset["features"][:].shape)

(25, 6)
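
The raw numpy counterpart of this metadata is the skiprows keyword of loadtxt mentioned earlier. The sketch below uses a small hypothetical file with two non-hashed header lines to show the failure and the fix side by side.

```python
import io

import numpy

# Hypothetical file: two header lines without a leading '#', then data.
text = """\
Experiment 42
time value
0.0 1.0
0.5 2.0
1.0 3.0
"""

# Without skiprows, the header text cannot be converted to floats.
try:
    numpy.loadtxt(io.StringIO(text))
    readable_as_is = True
except ValueError:
    readable_as_is = False

# Skipping the two header lines leaves only the numeric rows.
data_rows = numpy.loadtxt(io.StringIO(text), skiprows=2)
print(readable_as_is, data_rows.shape)
```

Storing the value as metadata on the associated object has the advantage that users of the store do not need to know how many lines to skip.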

Column Headers

Quite often, one of the header rows contains the column names.

Let's add some metadata telling the loader which line contains the feature names.

In [7]:

associated.features_line = 5
# we'll need to clear the features cache
print(dataset.list_features(use_cache=False))

['time', 'zeros', 'ones', 'twos', 'threes', 'fours']

We can now access each feature (column) separately by name. This is useful when reading data from text files that are organized differently but contain the same column names.

In [8]:

zeros = dataset["zeros"][:4]
print(zeros)

[0.65485361 0.04917816 0.20506388 0.24302516]
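
As a rough plain-numpy analogue of name-based access, one can read the names from the header line and build a name-to-index mapping. This is only a sketch of the idea under assumed inputs, not Kosh's implementation; the sample text and the column_of helper are hypothetical.

```python
import io

import numpy

# Hypothetical file: line 0 is a title, line 1 holds the column names.
text = """\
Sample run
time zeros ones
0.0 0.0 1.0
0.5 0.0 1.0
1.0 0.0 1.0
"""

features_line = 1   # index of the line holding the column names
skiprows = 2        # number of header lines before the data

# Split the names line on whitespace and map each name to its column index.
names = text.splitlines()[features_line].split()
column_of = {name: index for index, name in enumerate(names)}

table = numpy.loadtxt(io.StringIO(text), skiprows=skiprows)
ones = table[:, column_of["ones"]]
print(names)
print(ones)
```

Because the lookup goes through the name, the same code works for a file that stores the same columns in a different order.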

In some cases the column headers are laid out in fixed-width fields, which can cause two names to touch each other.

A good example is ../tests/baselines/npy/example_column_names_in_header_via_constant_width.txt:

In [9]:

filename = "../tests/baselines/npy/example_column_names_in_header_via_constant_width.txt"
dataset = store.create(name="example_constant_width")
associated = dataset.associate(filename, mime_type="numpy/txt", id_only=False)
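associated.skiprows = 1          # one header line precedes the data
associated.features_line = 0     # that line holds the column names
associated.columns_width = 10    # each column name spans 10 characters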
print(dataset.list_features())

['time', 'zeros col', 'ones  col', 'twos col', 'threes col', 'fours']
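
The sketch below illustrates why a fixed width matters: splitting such a header on whitespace breaks names that contain spaces and merges names that touch, while cutting it into fixed-width chunks recovers them. The header line is reconstructed here from the names printed above and is an assumption about the file layout; this is not Kosh's actual parsing code.

```python
# Hypothetical header line in which every name occupies exactly 10 characters.
expected = ["time", "zeros col", "ones  col", "twos col", "threes col", "fours"]
header = "".join(name.ljust(10) for name in expected)

# Naive whitespace splitting yields the wrong number of names.
naive = header.split()

# Cutting the line every columns_width characters recovers each name.
columns_width = 10
features = [
    header[start:start + columns_width].strip()
    for start in range(0, len(header), columns_width)
]
print(len(naive), len(features))
print(features)
```

Note that "threes col" fills its 10-character field completely, so it runs directly into "fours" and only the fixed width tells the two names apart.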