datasetinsights.io¶

datasetinsights.io.bbox¶

class datasetinsights.io.bbox.BBox2D(label, x, y, w, h, score=1.0)¶

Bases: object

Canonical Representation of a 2D bounding box.

label¶

string representation of the label.

Type: str

x¶

x pixel coordinate of the upper left corner.

Type: float

y¶

y pixel coordinate of the upper left corner.

Type: float

w¶

width (number of pixels) of the bounding box.

Type: float

h¶

height (number of pixels) of the bounding box.

Type: float

score¶

detection confidence score. Defaults to 1.0, the value used for a ground-truth bounding box.

Type: float

Examples

Here is an example of how to use this class.

>>> gt_bbox = BBox2D(label='car', x=2, y=6, w=2, h=4)
>>> gt_bbox
"label='car'|score=1.0|x=2.0|y=6.0|w=2.0|h=4.0"
>>> pred_bbox = BBox2D(label='car', x=2, y=5, w=2, h=4, score=0.79)
>>> pred_bbox.area
8
>>> pred_bbox.intersect_with(gt_bbox)
True
>>> pred_bbox.intersection(gt_bbox)
6
>>> pred_bbox.union(gt_bbox)
10
>>> pred_bbox.iou(gt_bbox)
0.6

property area¶

Calculate area of this bounding box

Returns: width x height of the bounding box

intersect_with(other)¶

Check whether this box intersects with other bounding box

Parameters: other (BBox2D) – other bounding box object to check intersection
Returns: True if two bounding boxes intersect, False otherwise

intersection(other)¶

Calculate the intersection area with other bounding box

Parameters: other (BBox2D) – other bounding box object to calculate intersection
Returns: float of the intersection area for two bounding boxes

iou(other)¶

Calculate intersection over union area with other bounding box

\[IOU = \frac{intersection}{union}\]

Parameters: other (BBox2D) – other bounding box object to calculate iou
Returns: float of the intersection over union (IoU) of the two bounding boxes

union(other, intersection_area=None)¶

Calculate union area with other bounding box

Parameters

other (BBox2D) – other bounding box object to calculate union
intersection_area (float) – pre-calculated area of intersection

Returns

float of the union area for two bounding boxes
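The numbers in the class example can be reproduced with a small standalone sketch. It does not use datasetinsights: the names below (Box, intersection_area, union_area, iou) are hypothetical, and the only assumption is the (x, y, w, h) convention described above, with (x, y) as the upper-left corner.

```python
from collections import namedtuple

# Hypothetical stand-in for BBox2D: (x, y) is the upper-left corner.
Box = namedtuple("Box", ["x", "y", "w", "h"])


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes (0 if disjoint)."""
    overlap_w = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    overlap_h = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    if overlap_w <= 0 or overlap_h <= 0:
        return 0.0
    return float(overlap_w * overlap_h)


def union_area(a, b, inter=None):
    """Area covered by either box; reuses a pre-computed intersection if given."""
    if inter is None:
        inter = intersection_area(a, b)
    return a.w * a.h + b.w * b.h - inter


def iou(a, b):
    """Intersection over union; treated as 0 when the union is empty."""
    inter = intersection_area(a, b)
    union = union_area(a, b, inter)
    return inter / union if union > 0 else 0.0


gt = Box(x=2, y=6, w=2, h=4)
pred = Box(x=2, y=5, w=2, h=4)
print(intersection_area(pred, gt), union_area(pred, gt), iou(pred, gt))
```

For these two boxes the overlap is 2 pixels wide and 3 pixels high, which gives the intersection of 6, union of 10, and IoU of 0.6 shown in the example. Passing the intersection into union_area mirrors the optional intersection_area parameter and avoids computing it twice.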

class datasetinsights.io.bbox.BBox3D(translation, size, label, sample_token, score=1, rotation: pyquaternion.quaternion.Quaternion = Quaternion(1.0, 0.0, 0.0, 0.0), velocity=(nan, nan, nan))¶

Bases: object

Class for 3D bounding boxes, which can be either predictions or ground truths. This class is the primary representation of 3D bounding boxes in this repo and is based on the nuScenes-style dataset.

property back_left_bottom_pt¶

Back-left-bottom point.

Type: float

property back_left_top_pt¶

Back-left-top point.

Type: float

property back_right_bottom_pt¶

Back-right-bottom point.

Type: float

property back_right_top_pt¶

Back-right-top point.

Type: float

property front_left_bottom_pt¶

Front-left-bottom point.

Type: float

property front_left_top_pt¶

Front-left-top point.

Type: float

property front_right_bottom_pt¶

Front-right-bottom point.

Type: float

property front_right_top_pt¶

Front-right-top point.

Type: float

property p¶

Returns: list of all 8 corners of the box, beginning with the four bottom corners and then the four top corners, both in counterclockwise order (from a bird's-eye view) beginning with the back-left corner.

datasetinsights.io.bbox.group_bbox2d_per_label(bboxes)¶

Group 2D bounding boxes with same label.

Parameters: bboxes (list[BBox2D]) – a list of 2D bounding boxes
Returns: a dictionary of 2D bounding box groups. {label1: [bbox1, bbox2, …], label2: [bbox1, …]}
Return type: dict
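The grouping described above can be sketched in a few lines. This is an illustration only, not the library's implementation: Box and group_by_label are hypothetical names, and the sketch assumes nothing beyond each box exposing a label attribute.

```python
from collections import defaultdict, namedtuple

# Hypothetical minimal box type; only the `label` attribute matters here.
Box = namedtuple("Box", ["label", "x", "y", "w", "h"])


def group_by_label(bboxes):
    """Return {label: [boxes with that label]}, preserving input order."""
    groups = defaultdict(list)
    for box in bboxes:
        groups[box.label].append(box)
    return dict(groups)


boxes = [
    Box("car", 0, 0, 2, 2),
    Box("person", 1, 1, 1, 3),
    Box("car", 4, 4, 2, 2),
]
grouped = group_by_label(boxes)
print({label: len(items) for label, items in grouped.items()})
```

Grouping per label is typically the first step of per-class detection metrics, where predictions and ground truths are only compared within the same label.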

datasetinsights.io.checkpoint¶

Save estimator checkpoints

class datasetinsights.io.checkpoint.EstimatorCheckpoint(estimator_name, checkpoint_dir, distributed)¶

Bases: object

Saves and loads estimator checkpoints.

Assigns an estimator checkpoint writer according to log_dir; the writer is responsible for saving estimators and can be a GCS or local writer. Also assigns a loader, which is responsible for loading an estimator from a given path and can be a local, GCS, or HTTP loader.

Parameters

estimator_name (str) – name of the estimator
checkpoint_dir (str) – Directory where checkpoints are stored
distributed (bool) – boolean to determine distributed training

checkpoint_dir¶

Directory where checkpoints are stored

Type: str

distributed¶

boolean to determine distributed training

Type: bool

load(estimator, path)¶

Loads estimator from given path.

Path can be either a local path or GCS path or HTTP url.

Parameters

estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object
path (str) – path of estimator

save(estimator, epoch)¶

Save estimator to the log_dir.

Parameters

estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
epoch (int) – Epoch number.

class datasetinsights.io.checkpoint.GCSEstimatorWriter(cloud_path, prefix, *, suffix='estimator')¶

Bases: object

Writes (saves) estimator checkpoints on GCS.

Parameters

cloud_path (str) – GCS cloud path (e.g. gs://bucket/path/to/directory)
prefix (str) – filename prefix of the checkpoint files
suffix (str) – filename suffix of the checkpoint files

save(estimator, epoch=None)¶

Save estimator to checkpoint files on GCS.

Parameters

estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
epoch (int) – the current epoch number. Default: None

Returns

Full GCS cloud path to the saved checkpoint file.

class datasetinsights.io.checkpoint.LocalEstimatorWriter(dirname, prefix, *, suffix='estimator', create_dir=True)¶

Bases: object

Writes (saves) estimator checkpoints locally.

Parameters

dirname (str) – Directory where estimator is to be saved.
prefix (str) – Filename prefix of the checkpoint files.
suffix (str) – Filename suffix of the checkpoint files.
create_dir (bool) – Flag for creating new directory. Default: True.

dirname¶

directory name of where checkpoint files are stored

Type: str

prefix¶

filename prefix of the checkpoint files

Type: str

suffix¶

filename suffix of the checkpoint files

Type: str

save(estimator, epoch=None)¶

Save estimator locally to log_dir.

Parameters

estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
epoch (int) – The current epoch number. Default: None

Returns

Full path to the saved checkpoint file.

datasetinsights.io.checkpoint.load_from_gcs(estimator, full_cloud_path)¶

Load estimator from checkpoint files on GCS.

Parameters

estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
full_cloud_path – full path to the checkpoint file

datasetinsights.io.checkpoint.load_from_http(estimator, url)¶

Load estimator from a checkpoint file at an HTTP(S) URL.

Parameters

estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
url – URL of the checkpoint file

datasetinsights.io.checkpoint.load_local(estimator, path)¶

Loads estimator checkpoints from a local path.

datasetinsights.io.download¶

class datasetinsights.io.download.TimeoutHTTPAdapter(timeout, *args, **kwargs)¶

Bases: requests.adapters.HTTPAdapter

send(request, **kwargs)¶

Sends PreparedRequest object. Returns Response object.

Parameters

request – The PreparedRequest being sent.
stream – (optional) Whether to stream the request content.
timeout (float or tuple or urllib3 Timeout object) – (optional) How long to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple.
verify – (optional) Either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use
cert – (optional) Any user-provided SSL certificate to be trusted.
proxies – (optional) The proxies dictionary to apply to the request.

Return type

requests.Response

datasetinsights.io.download.checksum_matches(filepath, expected_checksum, algorithm='CRC32')¶

Check if the checksum matches

Parameters

filepath (str) – the downloaded file path
expected_checksum (int) – expected checksum of the file
algorithm (str) – checksum algorithm. Defaults to CRC32

Returns

True if the file checksum matches.

datasetinsights.io.download.compute_checksum(filepath, algorithm='CRC32')¶

Compute the checksum of a file.

Parameters

filepath (str) – the downloaded file path
algorithm (str) – checksum algorithm. Defaults to CRC32

Returns

the checksum value

Return type

int

datasetinsights.io.download.download_file(source_uri: str, dest_path: str, file_name: str = None)¶

Download a file from the specified source URI.

Parameters

source_uri (str) – source URI from which the file should be downloaded
dest_path (str) – destination path of the file
file_name (str) – file name of the file to be downloaded

Returns

String of destination path.

datasetinsights.io.download.get_checksum_from_file(filepath)¶

This method returns the checksum of the file whose filepath is given.

Parameters: filepath (str) – Path of the checksum file. Path can be an HTTP(S) URL or a local path.
Raises: ValueError – Raised if filepath is neither a local path nor an HTTP or HTTPS URL.

datasetinsights.io.download.validate_checksum(filepath, expected_checksum, algorithm='CRC32')¶

Validate checksum of the downloaded file.

Parameters

filepath (str) – the downloaded file path
expected_checksum (int) – expected checksum of the file
algorithm (str) – checksum algorithm. Defaults to CRC32

Raises

ChecksumError – if the file checksum does not match.
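The checksum helpers above can be illustrated with a standalone sketch built on the standard library's zlib.crc32. The names crc32_of_file, validate_crc32, and ChecksumMismatch are hypothetical stand-ins, and reading the file in chunks is an assumption about how a large download would be handled rather than a description of the library's code.

```python
import os
import tempfile
import zlib


class ChecksumMismatch(Exception):
    """Hypothetical stand-in for datasetinsights.io.exceptions.ChecksumError."""


def crc32_of_file(filepath, chunk_size=1 << 20):
    """Compute the CRC32 of a file incrementally, one chunk at a time."""
    checksum = 0
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            checksum = zlib.crc32(chunk, checksum)
    return checksum & 0xFFFFFFFF  # unsigned 32-bit integer


def validate_crc32(filepath, expected_checksum):
    """Raise ChecksumMismatch if the file's CRC32 differs from the expected value."""
    actual = crc32_of_file(filepath)
    if actual != expected_checksum:
        raise ChecksumMismatch(f"expected {expected_checksum}, got {actual}")


# Write a small temporary file and validate it against its known CRC32.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
    demo_path = tmp.name
validate_crc32(demo_path, zlib.crc32(b"hello world"))  # passes silently
os.remove(demo_path)
```

Feeding the running value back into zlib.crc32 keeps memory use constant regardless of file size, which matters for multi-gigabyte dataset archives.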

datasetinsights.io.exceptions¶

exception datasetinsights.io.exceptions.ChecksumError¶

Bases: Exception

Raised when the downloaded file checksum is not correct.

exception datasetinsights.io.exceptions.DownloadError¶

Bases: Exception

Raised when a file download fails.

exception datasetinsights.io.exceptions.InvalidTrackerError¶

Bases: Exception

Raised when an unknown tracker is requested.

datasetinsights.io.gcs¶

class datasetinsights.io.gcs.GCSClient(**kwargs)¶

Bases: object

This class is used to download data from a GCS location and to perform functions such as downloading the dataset and validating checksums.

GCS_PREFIX = '^gs://'¶

KEY_SEPARATOR = '/'¶

download(*, url=None, local_path=None, bucket=None, key=None)¶

This method is used to download the dataset from GCS.

Parameters

url (str) – This is the downloader-uri that indicates where the dataset should be downloaded from.
local_path (str) – This is the path to the directory where the download will store the dataset.
bucket (str) – gcs bucket name
key (str) – object key path

Examples –

>>> url = "gs://bucket/folder"  # or "gs://bucket/folder/data.zip"
>>> local_path = "/tmp/folder"
>>> bucket = "bucket"
>>> key = "folder/data.zip"  # or "folder"
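The relationship between url and the bucket/key pair can be sketched as follows. The function parse_gcs_url is hypothetical; it only assumes that the GCS_PREFIX and KEY_SEPARATOR constants listed above are used to strip the scheme and split off the bucket name.

```python
import re

GCS_PREFIX = "^gs://"  # same values as the class constants listed above
KEY_SEPARATOR = "/"


def parse_gcs_url(url):
    """Split 'gs://bucket/some/key' into ('bucket', 'some/key')."""
    if not re.match(GCS_PREFIX, url):
        raise ValueError(f"not a GCS url: {url}")
    path = re.sub(GCS_PREFIX, "", url)
    bucket, _, key = path.partition(KEY_SEPARATOR)
    return bucket, key


print(parse_gcs_url("gs://bucket/folder/data.zip"))
print(parse_gcs_url("gs://bucket/folder"))
```

Under this reading, passing url is equivalent to passing the bucket and key it decomposes into, which is why the example shows both forms side by side.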

get_most_recent_blob(url=None, bucket_name=None, key=None)¶

Get the last updated blob in a given bucket under given prefix

Parameters

bucket_name (str) – gcs bucket name
key (str) – object key path

upload(*, local_path=None, bucket=None, key=None, url=None, pattern='*')¶

Upload a file or list of files from directory to GCS

Parameters

url (str) – This is the GCS location that indicates where the dataset should be uploaded.
local_path (str) – This is the path to the directory or file where the data is stored.
bucket (str) – gcs bucket name
key (str) – object key path
pattern – Unix glob patterns. Use **/* for recursive glob.

Examples –

For file upload:

>>> url = "gs://bucket/folder/data.zip"
>>> local_path = "/tmp/folder/data.zip"
>>> bucket ="bucket"
>>> key ="folder/data.zip"

For directory upload:

>>> url = "gs://bucket/folder"
>>> local_path = "/tmp/folder"
>>> bucket ="bucket"
>>> key ="folder"
>>> pattern = "**/*"
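The effect of the glob pattern on a directory upload can be previewed locally with pathlib. This sketch only illustrates Unix glob semantics: select_files is a hypothetical helper, and whether the library selects files in exactly this way is an assumption.

```python
import tempfile
from pathlib import Path


def select_files(local_path, pattern="*"):
    """List regular files under local_path matching a Unix glob pattern."""
    root = Path(local_path)
    return sorted(p for p in root.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("top level")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("nested")

    top_only = [p.name for p in select_files(root, "*")]
    recursive = [p.relative_to(root).as_posix() for p in select_files(root, "**/*")]

print(top_only)
print(recursive)
```

With the default "*" only files directly inside the directory match, while "**/*" also descends into subdirectories.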

datasetinsights.io.kfp_output¶

class datasetinsights.io.kfp_output.KubeflowPipelineWriter(tb_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/memory-issue/checkouts/latest/runs/20210127-003502', kfp_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/memory-issue/checkouts/latest/kfp/20210127-003502', kfp_metrics_filename='mlpipeline-metrics.json', kfp_ui_metadata_filename='mlpipeline-ui-metadata.json')¶

Bases: object

KFP writer that serializes the metrics dictionary generated during model training/evaluation to JSON, stores it in a file, and creates the KFP dashboard visualizer JSON file for TensorBoard.

Parameters

filename (str) – Name of the file to which the writer will save metrics
kfp_log_dir (str) – Path where all files related to KFP will be stored
tb_log_dir (str) – Path where TensorBoard logs are saved

filename¶

Name of the file to which the writer will save metrics

Type: str

filepath¶

Path where the file will be stored

Type: str

data_dict¶

A dictionary to save metrics name and value pairs

Type: dict

data¶

Dictionary to be JSON serialized

add_metric(name, val)¶

Adds metric to the data dictionary of the writer

Note: Using the same name key will overwrite the previous value, because the current strategy is to save only the metrics generated in the last epoch.

Parameters

name (str) – Name of the metric
val (float) – Value of the metric

create_tb_visualization_json(tb_log_dir, kfp_ui_metadata_filename)¶

write_metric()¶

Saves all the metrics added previously to a file in the format required by Kubeflow
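The add-then-write flow can be sketched with a simplified stand-in. MetricsWriter is a hypothetical class, not the library's code, and the JSON layout (a top-level "metrics" list whose entries carry name, numberValue, and format) is an assumption based on the Kubeflow Pipelines v1 metrics file convention.

```python
import json
import os
import tempfile


class MetricsWriter:
    """Hypothetical, simplified stand-in for KubeflowPipelineWriter."""

    def __init__(self, log_dir, filename="mlpipeline-metrics.json"):
        self.filepath = os.path.join(log_dir, filename)
        self.data_dict = {}

    def add_metric(self, name, val):
        # Re-using a name overwrites the earlier value (last epoch wins).
        self.data_dict[name] = val

    def write_metric(self):
        # Assumed layout: a top-level "metrics" list of name/numberValue pairs.
        data = {
            "metrics": [
                {"name": name, "numberValue": val, "format": "RAW"}
                for name, val in self.data_dict.items()
            ]
        }
        os.makedirs(os.path.dirname(self.filepath), exist_ok=True)
        with open(self.filepath, "w") as f:
            json.dump(data, f)
        return self.filepath


with tempfile.TemporaryDirectory() as tmp:
    writer = MetricsWriter(tmp)
    writer.add_metric("iou", 0.55)
    writer.add_metric("iou", 0.60)  # overwrites the value from the earlier epoch
    with open(writer.write_metric()) as f:
        saved = json.load(f)

print(saved)
```

Keeping the metrics in a dictionary keyed by name is what makes the overwrite behaviour described in the note fall out naturally: only the most recent value per metric reaches the file.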

datasetinsights.io.loader¶

datasetinsights.io.loader.create_loader(dataset, *, dryrun=False, batch_size=1, num_workers=0, collate_fn=None)¶

Create data loader from dataset

Note: The data loader here is a PyTorch data loader object which does not assume tensor_type to be a PyTorch tensor. We only require the input dataset to support the __getitem__ and __len__ methods to iterate over items in the dataset.

Since the collate_fn method in torch.utils.data.DataLoader behaves differently when automatic batching is on, we might need to override this method. If the create_loader method becomes too complicated in order to support different estimators, we might expect different estimators to have their own create_loader method.

https://pytorch.org/docs/stable/data.html#working-with-collate-fn

Parameters

dataset (Dataset) – dataset object derived from datasetinsights.data.datasets.Dataset class.
dryrun (bool) – indicator whether to use a very small subset of the dataset. This subset is useful to make sure we can quickly run estimator without loading the whole dataset. (default: False)
batch_size (int) – how many samples per batch to load (default: 1)
num_workers (int) – number of parallel workers used for data loader. Set to 0 to run on a single thread (instead of 1 which might introduce overhead). (default: 0)

Returns

torch.utils.data.DataLoader object as data loader

datasetinsights.io.transforms¶

class datasetinsights.io.transforms.Compose(transforms)¶

Bases: object

class datasetinsights.io.transforms.RandomHorizontalFlip(flip_prob=0.5)¶

Bases: object

Flip the image horizontally (left to right) with probability flip_prob.

Parameters: flip_prob – the probability to flip the image
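A pure-Python sketch of a random horizontal flip is shown below. It assumes that "horizontal flip" means mirroring each row left to right, as the class name suggests, and that the transform is applied to an (image, target) pair like Resize. RandomHorizontalFlipSketch and its rng argument are hypothetical, and nested lists stand in for real image objects.

```python
import random


class RandomHorizontalFlipSketch:
    """Hypothetical pure-Python version operating on row-major nested lists."""

    def __init__(self, flip_prob=0.5, rng=None):
        self.flip_prob = flip_prob
        self.rng = rng or random.Random()

    def __call__(self, image, target):
        # Flip image and target together so they stay aligned.
        if self.rng.random() < self.flip_prob:
            image = [row[::-1] for row in image]
            target = [row[::-1] for row in target]
        return image, target


image = [[1, 2, 3], [4, 5, 6]]
target = [[0, 0, 1], [0, 1, 1]]

always = RandomHorizontalFlipSketch(flip_prob=1.0)
never = RandomHorizontalFlipSketch(flip_prob=0.0)
print(always(image, target))
print(never(image, target))
```

Drawing one random number per call and applying the same decision to both inputs is what keeps a segmentation mask or depth map consistent with its image.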

class datasetinsights.io.transforms.Resize(img_size=-1, target_size=-1)¶

Bases: object

Resize the (image, target) to the given sizes.

Parameters

img_size (tuple or int) – Desired output size. If size is a sequence like (h, w), the output size will be matched to it. If size is an int, the smaller edge of the image will be matched to this number, i.e., if height > width, then the image will be rescaled to (size * height / width, size).
target_size (tuple or int) – Desired output size. If size is a sequence like (h, w), the output size will be matched to it. If size is an int, the smaller edge of the target will be matched to this number, i.e., if height > width, then the target will be rescaled to (size * height / width, size).
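The size rule described above can be worked through with a short sketch. The function resized_shape is hypothetical, and truncating the scaled edge to an integer is an assumption; the text only gives the expression (size * height / width, size).

```python
def resized_shape(height, width, size):
    """Output (h, w) for an int or (h, w) size, following the rule described above."""
    if isinstance(size, (tuple, list)):
        return tuple(size)
    if height > width:
        # Width is the smaller edge: it becomes `size`, height scales with it.
        return (int(size * height / width), size)
    # Height is the smaller (or equal) edge.
    return (size, int(size * width / height))


print(resized_shape(400, 200, 100))
print(resized_shape(200, 400, 100))
print(resized_shape(200, 400, (64, 64)))
```

For a 400 x 200 input and size 100, the width (the smaller edge) becomes 100 and the height scales by the same factor to 200, so the aspect ratio is preserved. A tuple size is used as given and may change the aspect ratio.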

datasetinsights.io.config_handler¶

datasetinsights.io.config_handler.load_config(path)¶

Load config file from local or remote locations.

Parameters

path (str) – This is the file-uri that indicates where the YAML config should be loaded from.

Examples –

>>> path = "gs://thea-dev/config.yaml"
>>> path = "https://thea-dev/config.yaml"
>>> path = "http://thea-dev/config.yaml"
>>> path = "file:///root/config.yaml" # absolute path
>>> path = "/root/config.yaml" # absolute path
>>> path = "datasetinsights/config.yaml" # relative path

Returns

config object of type yacs.config.CfgNode
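The example paths above fall into three families, which can be told apart by URI scheme. The helper classify_config_path is hypothetical and only sketches how such a dispatch could look; it is not the library's loading logic and does not fetch or parse any YAML.

```python
from urllib.parse import urlparse


def classify_config_path(path):
    """Return 'gcs', 'http', or 'local' for a config path (hypothetical helper)."""
    scheme = urlparse(path).scheme
    if scheme == "gs":
        return "gcs"
    if scheme in ("http", "https"):
        return "http"
    if scheme in ("", "file"):
        return "local"
    raise ValueError(f"unsupported config location: {path}")


for p in [
    "gs://thea-dev/config.yaml",
    "https://thea-dev/config.yaml",
    "file:///root/config.yaml",
    "datasetinsights/config.yaml",
]:
    print(p, "->", classify_config_path(p))
```

Paths without a scheme, whether absolute or relative, are treated as local files in this sketch, matching the last two example paths.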

class datasetinsights.io.BBox2D(label, x, y, w, h, score=1.0)¶

Bases: object

Canonical Representation of a 2D bounding box.

label¶

string representation of the label.

Type: str

x¶

x pixel coordinate of the upper left corner.

Type: float

y¶

y pixel coordinate of the upper left corner.

Type: float

w¶

width (number of pixels) of the bounding box.

Type: float

h¶

height (number of pixels) of the bounding box.

Type: float

score¶

detection confidence score. Defaults to 1.0, the value used for a ground-truth bounding box.

Type: float

Examples

Here is an example of how to use this class.

>>> gt_bbox = BBox2D(label='car', x=2, y=6, w=2, h=4)
>>> gt_bbox
"label='car'|score=1.0|x=2.0|y=6.0|w=2.0|h=4.0"
>>> pred_bbox = BBox2D(label='car', x=2, y=5, w=2, h=4, score=0.79)
>>> pred_bbox.area
8
>>> pred_bbox.intersect_with(gt_bbox)
True
>>> pred_bbox.intersection(gt_bbox)
6
>>> pred_bbox.union(gt_bbox)
10
>>> pred_bbox.iou(gt_bbox)
0.6

property area¶

Calculate area of this bounding box

Returns: width x height of the bounding box

intersect_with(other)¶

Check whether this box intersects with other bounding box

Parameters: other (BBox2D) – other bounding box object to check intersection
Returns: True if two bounding boxes intersect, False otherwise

intersection(other)¶

Calculate the intersection area with other bounding box

Parameters: other (BBox2D) – other bounding box object to calculate intersection
Returns: float of the intersection area for two bounding boxes

iou(other)¶

Calculate intersection over union area with other bounding box

\[IOU = \frac{intersection}{union}\]

Parameters: other (BBox2D) – other bounding box object to calculate iou
Returns: float of the intersection over union (IoU) of the two bounding boxes

union(other, intersection_area=None)¶

Calculate union area with other bounding box

Parameters

other (BBox2D) – other bounding box object to calculate union
intersection_area (float) – pre-calculated area of intersection

Returns

float of the union area for two bounding boxes

class datasetinsights.io.EstimatorCheckpoint(estimator_name, checkpoint_dir, distributed)¶

Bases: object

Saves and loads estimator checkpoints.

Assigns an estimator checkpoint writer according to log_dir; the writer is responsible for saving estimators and can be a GCS or local writer. Also assigns a loader, which is responsible for loading an estimator from a given path and can be a local, GCS, or HTTP loader.

Parameters

estimator_name (str) – name of the estimator
checkpoint_dir (str) – Directory where checkpoints are stored
distributed (bool) – boolean to determine distributed training

checkpoint_dir¶

Directory where checkpoints are stored

Type: str

distributed¶

boolean to determine distributed training

Type: bool

load(estimator, path)¶

Loads estimator from given path.

Path can be either a local path or GCS path or HTTP url.

Parameters

estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object
path (str) – path of estimator

save(estimator, epoch)¶

Save estimator to the log_dir.

Parameters

estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
epoch (int) – Epoch number.

class datasetinsights.io.KubeflowPipelineWriter(tb_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/memory-issue/checkouts/latest/runs/20210127-003502', kfp_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/memory-issue/checkouts/latest/kfp/20210127-003502', kfp_metrics_filename='mlpipeline-metrics.json', kfp_ui_metadata_filename='mlpipeline-ui-metadata.json')¶

Bases: object

KFP writer that serializes the metrics dictionary generated during model training/evaluation to JSON, stores it in a file, and creates the KFP dashboard visualizer JSON file for TensorBoard.

Parameters

filename (str) – Name of the file to which the writer will save metrics
kfp_log_dir (str) – Path where all files related to KFP will be stored
tb_log_dir (str) – Path where TensorBoard logs are saved

filename¶

Name of the file to which the writer will save metrics

Type: str

filepath¶

Path where the file will be stored

Type: str

data_dict¶

A dictionary to save metrics name and value pairs

Type: dict

data¶

Dictionary to be JSON serialized

add_metric(name, val)¶

Adds metric to the data dictionary of the writer

Note: Using the same name key will overwrite the previous value, because the current strategy is to save only the metrics generated in the last epoch.

Parameters

name (str) – Name of the metric
val (float) – Value of the metric

create_tb_visualization_json(tb_log_dir, kfp_ui_metadata_filename)¶

write_metric()¶

Saves all the metrics added previously to a file in the format required by Kubeflow

datasetinsights.io.create_downloader(source_uri, **kwargs)¶

This function instantiates the dataset downloader after finding it with the provided source-uri.

Parameters

source_uri – URI used to look up the correct dataset downloader
**kwargs –

Returns: The dataset downloader instance matching the source-uri.