datasetinsights.io¶
datasetinsights.io.bbox¶
-
class
datasetinsights.io.bbox.BBox2D(label, x, y, w, h, score=1.0)¶ Bases:
object
Canonical representation of a 2D bounding box.
-
label¶ string representation of the label.
- Type
str
-
x¶ x pixel coordinate of the upper left corner.
- Type
float
-
y¶ y pixel coordinate of the upper left corner.
- Type
float
-
w¶ width (number of pixels) of the bounding box.
- Type
float
-
h¶ height (number of pixels) of the bounding box.
- Type
float
-
score¶ detection confidence score. Defaults to 1.0, the value used when this is a ground truth bounding box.
- Type
float
Examples
Here is an example of how to use this class.
>>> gt_bbox = BBox2D(label='car', x=2, y=6, w=2, h=4)
>>> gt_bbox
"label='car'|score=1.0|x=2.0|y=6.0|w=2.0|h=4.0"
>>> pred_bbox = BBox2D(label='car', x=2, y=5, w=2, h=4, score=0.79)
>>> pred_bbox.area
8
>>> pred_bbox.intersect_with(gt_bbox)
True
>>> pred_bbox.intersection(gt_bbox)
6
>>> pred_bbox.union(gt_bbox)
10
>>> pred_bbox.iou(gt_bbox)
0.6
-
property
area¶ Calculate area of this bounding box
- Returns
width x height of the bounding box
-
intersect_with(other)¶ Check whether this box intersects with other bounding box
- Parameters
other (BBox2D) – other bounding box object to check intersection
- Returns
True if two bounding boxes intersect, False otherwise
-
intersection(other)¶ Calculate the intersection area with other bounding box
- Parameters
other (BBox2D) – other bounding box object to calculate intersection
- Returns
float of the intersection area for two bounding boxes
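The area, intersection, union and IoU values in the example can be reproduced with a few lines of arithmetic. The sketch below is a minimal stand-in, not the library's implementation: the `Box` class and its method bodies are assumptions written only to illustrate the geometry, with `(x, y)` taken as the upper-left corner as documented above.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for BBox2D: upper-left corner plus width and height."""

    x: float
    y: float
    w: float
    h: float

    @property
    def area(self):
        return self.w * self.h

    def intersection(self, other):
        # Overlap along each axis, clamped at zero when the boxes are disjoint.
        dx = min(self.x + self.w, other.x + other.w) - max(self.x, other.x)
        dy = min(self.y + self.h, other.y + other.h) - max(self.y, other.y)
        return max(dx, 0) * max(dy, 0)

    def union(self, other):
        # Inclusion-exclusion: the overlap is counted once, not twice.
        return self.area + other.area - self.intersection(other)

    def iou(self, other):
        return self.intersection(other) / self.union(other)


gt = Box(x=2, y=6, w=2, h=4)
pred = Box(x=2, y=5, w=2, h=4)
print(pred.area, pred.intersection(gt), pred.union(gt), pred.iou(gt))
```

The two boxes share the full width (2 pixels) and overlap on 3 of their 4 rows, which gives the intersection of 6, the union of 10 and the IoU of 0.6 shown in the example.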
-
-
class
datasetinsights.io.bbox.BBox3D(translation, size, label, sample_token, score=1, rotation: pyquaternion.quaternion.Quaternion = Quaternion(1.0, 0.0, 0.0, 0.0), velocity=(nan, nan, nan))¶ Bases:
object
Class for 3D bounding boxes, which can be either predictions or ground truths. This class is the primary representation of 3D bounding boxes in this repo and is based on the nuScenes-style dataset.
-
property
back_left_bottom_pt¶ Back-left-bottom point.
- Type
float
-
property
back_left_top_pt¶ Back-left-top point.
- Type
float
-
property
back_right_bottom_pt¶ Back-right-bottom point.
- Type
float
-
property
back_right_top_pt¶ Back-right-top point.
- Type
float
-
property
front_left_bottom_pt¶ Front-left-bottom point.
- Type
float
-
property
front_left_top_pt¶ Front-left-top point.
- Type
float
-
property
front_right_bottom_pt¶ Front-right-bottom point.
- Type
float
-
property
front_right_top_pt¶ Front-right-top point.
- Type
float
-
property
p¶ Corners of the box.
- Returns
list of all 8 corners of the box, beginning with the bottom four corners and then the top four corners, both in counterclockwise order (from bird's-eye view) beginning with the back-left corner
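The corner ordering described for `p` can be illustrated on an unrotated box. This is only a sketch under stated assumptions: the axis convention (x forward, y left, z up), the `size = (width, length, height)` layout and the function name `box_corners` are all hypothetical, and the rotation quaternion is ignored.

```python
def box_corners(center, size):
    """Corners of an unrotated box.

    Assumes x forward, y left, z up and size = (width, length, height);
    these conventions are assumptions, not taken from the library.
    """
    cx, cy, cz = center
    width, length, height = size
    # Footprint seen from above, counterclockwise, starting at back-left:
    # back-left, back-right, front-right, front-left as (x sign, y sign).
    footprint = [(-1, 1), (-1, -1), (1, -1), (1, 1)]
    corners = []
    for z_sign in (-1, 1):  # bottom four corners first, then the top four
        for x_sign, y_sign in footprint:
            corners.append(
                (
                    cx + x_sign * length / 2,
                    cy + y_sign * width / 2,
                    cz + z_sign * height / 2,
                )
            )
    return corners


corners = box_corners(center=(0, 0, 0), size=(2, 4, 2))
```

With this layout, `corners[0]` is the back-left-bottom point and `corners[4]` is the back-left-top point directly above it.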
-
property
datasetinsights.io.checkpoint¶
Save estimator checkpoints
-
class
datasetinsights.io.checkpoint.EstimatorCheckpoint(estimator_name, checkpoint_dir, distributed)¶ Bases:
object
Saves and loads estimator checkpoints.
Assigns an estimator checkpoint writer, chosen according to log_dir, which is responsible for saving estimators. The writer can be a GCS or local writer. Assigns a loader, which is responsible for loading an estimator from a given path. The loader can be a local, GCS, or HTTP loader.
- Parameters
estimator_name (str) – name of the estimator
checkpoint_dir (str) – Directory where checkpoints are stored
distributed (bool) – boolean to determine distributed training
-
checkpoint_dir¶ Directory where checkpoints are stored
- Type
str
-
distributed¶ boolean to determine distributed training
- Type
bool
-
load(estimator, path)¶ Loads estimator from given path.
Path can be a local path, a GCS path, or an HTTP URL.
- Parameters
estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
path (str) – path of estimator
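Since `load` accepts a local path, a GCS path or an HTTP URL, some dispatch on the shape of the path has to happen. The following is a minimal sketch of one way to do that; the function name `pick_loader` and the returned labels are hypothetical and do not describe how the class chooses its loader internally.

```python
import re


def pick_loader(path):
    """Return which kind of loader a checkpoint path would need (illustrative only)."""
    if re.match(r"^gs://", path):
        return "gcs"
    if re.match(r"^https?://", path):
        return "http"
    # Anything without a recognised scheme is treated as a local path.
    return "local"


print(pick_loader("gs://bucket/checkpoints/model.estimator"))
```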
-
save(estimator, epoch)¶ Save estimator to the log_dir.
- Parameters
estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
epoch (int) – Epoch number.
-
class
datasetinsights.io.checkpoint.GCSEstimatorWriter(cloud_path, prefix, *, suffix='estimator')¶ Bases:
object
Writes (saves) estimator checkpoints on GCS.
- Parameters
cloud_path (str) – GCS cloud path (e.g. gs://bucket/path/to/directory)
prefix (str) – filename prefix of the checkpoint files
suffix (str) – filename suffix of the checkpoint files
-
save(estimator, epoch=None)¶ Save estimator to checkpoint files on GCS.
- Parameters
estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
epoch (int) – the current epoch number. Default: None
- Returns
Full GCS cloud path to the saved checkpoint file.
-
class
datasetinsights.io.checkpoint.LocalEstimatorWriter(dirname, prefix, *, suffix='estimator', create_dir=True)¶ Bases:
object
Writes (saves) estimator checkpoints locally.
- Parameters
dirname (str) – Directory where estimator is to be saved.
prefix (str) – Filename prefix of the checkpoint files.
suffix (str) – Filename suffix of the checkpoint files.
create_dir (bool) – Flag for creating new directory. Default: True.
-
dirname¶ directory name of where checkpoint files are stored
- Type
str
-
prefix¶ filename prefix of the checkpoint files
- Type
str
-
suffix¶ filename suffix of the checkpoint files
- Type
str
-
save(estimator, epoch=None)¶ Save estimator locally to log_dir.
- Parameters
estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
epoch (int) – The current epoch number. Default: None
- Returns
Full path to the saved checkpoint file.
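To make the roles of `dirname`, `prefix`, `suffix`, `create_dir` and `epoch` concrete, here is a small self-contained sketch of a local writer. Everything in it is an assumption for illustration: the class name `SketchLocalWriter`, the `prefix.epN.suffix` file naming, and the use of JSON for the saved state are not taken from the library.

```python
import json
import os
import tempfile


class SketchLocalWriter:
    """Hypothetical writer; the file naming scheme below is an assumption."""

    def __init__(self, dirname, prefix, *, suffix="estimator", create_dir=True):
        self.dirname = dirname
        self.prefix = prefix
        self.suffix = suffix
        if create_dir:
            os.makedirs(dirname, exist_ok=True)

    def save(self, state, epoch=None):
        # Build "<prefix>[.ep<epoch>].<suffix>" inside dirname.
        parts = [self.prefix]
        if epoch is not None:
            parts.append(f"ep{epoch}")
        parts.append(self.suffix)
        path = os.path.join(self.dirname, ".".join(parts))
        with open(path, "w") as f:
            json.dump(state, f)
        return path  # full path to the saved checkpoint file


writer = SketchLocalWriter(os.path.join(tempfile.mkdtemp(), "ckpt"), "FasterRCNN")
saved_path = writer.save({"lr": 0.01}, epoch=3)
```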
-
datasetinsights.io.checkpoint.load_from_gcs(estimator, full_cloud_path)¶ Load estimator from checkpoint files on GCS.
- Parameters
estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
full_cloud_path – full path to the checkpoint file
-
datasetinsights.io.checkpoint.load_from_http(estimator, url)¶ Load estimator from a checkpoint file at an HTTP(S) URL.
- Parameters
estimator (datasetinsights.estimators.Estimator) – datasetinsights estimator object.
url – URL of the checkpoint file
-
datasetinsights.io.checkpoint.load_local(estimator, path)¶ Loads estimator checkpoints from a local path.
datasetinsights.io.download¶
-
class
datasetinsights.io.download.TimeoutHTTPAdapter(timeout, *args, **kwargs)¶ Bases:
requests.adapters.HTTPAdapter
-
send(request, **kwargs)¶ Sends PreparedRequest object. Returns Response object.
- Parameters
request – The PreparedRequest being sent.
stream – (optional) Whether to stream the request content.
timeout (float or tuple or urllib3 Timeout object) – (optional) How long to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple.
verify – (optional) Either a boolean, in which case it controls whether we verify the server’s TLS certificate, or a string, in which case it must be a path to a CA bundle to use
cert – (optional) Any user-provided SSL certificate to be trusted.
proxies – (optional) The proxies dictionary to apply to the request.
- Return type
requests.Response
-
-
datasetinsights.io.download.checksum_matches(filepath, expected_checksum, algorithm='CRC32')¶ Check if the checksum matches
- Parameters
filepath (str) – the downloaded file path
expected_checksum (int) – expected checksum of the file
algorithm (str) – checksum algorithm. Defaults to CRC32
- Returns
True if the file checksum matches.
-
datasetinsights.io.download.compute_checksum(filepath, algorithm='CRC32')¶ Compute the checksum of a file.
- Parameters
filepath (str) – the downloaded file path
algorithm (str) – checksum algorithm. Defaults to CRC32
- Returns
the checksum value
- Return type
int
-
datasetinsights.io.download.download_file(source_uri: str, dest_path: str, file_name: str = None)¶ Download a file specified from a source uri
- Parameters
source_uri (str) – source URL from which the file should be downloaded
dest_path (str) – destination path of the file
file_name (str) – file name of the file to be downloaded
- Returns
String of destination path.
-
datasetinsights.io.download.get_checksum_from_file(filepath)¶ Returns the checksum read from the checksum file at the given filepath.
- Parameters
filepath (str) – Path of the checksum file. Path can be HTTP(s) url or local path.
- Raises
ValueError – if filepath is neither a local path nor an HTTP or HTTPS URL.
-
datasetinsights.io.download.validate_checksum(filepath, expected_checksum, algorithm='CRC32')¶ Validate checksum of the downloaded file.
- Parameters
filepath (str) – the downloaded file path
expected_checksum (int) – expected checksum of the file
algorithm (str) – checksum algorithm. Defaults to CRC32
- Raises
ChecksumError – if the file checksum does not match.
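The three checksum helpers above (compute, compare, validate) can be sketched with the standard library's `zlib.crc32`. This is a minimal illustration, not the library's code: the names `crc32_of_file`, `matches`, `validate` and the `ChecksumMismatch` exception are hypothetical stand-ins, and only the CRC32 algorithm is covered.

```python
import tempfile
import zlib


class ChecksumMismatch(Exception):
    """Hypothetical stand-in for ChecksumError."""


def crc32_of_file(filepath, chunk_size=1 << 20):
    """Compute CRC32 incrementally so large files are never fully in memory."""
    checksum = 0
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            checksum = zlib.crc32(chunk, checksum)
    return checksum


def matches(filepath, expected):
    return crc32_of_file(filepath) == expected


def validate(filepath, expected):
    if not matches(filepath, expected):
        raise ChecksumMismatch(f"checksum mismatch for {filepath}")


# Write a small file so the helpers can be exercised without a download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
demo_path = tmp.name

validate(demo_path, zlib.crc32(b"hello world"))  # passes silently
```

Reading in chunks and feeding the running value back into `zlib.crc32` gives the same result as hashing the whole file at once.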
datasetinsights.io.exceptions¶
-
exception
datasetinsights.io.exceptions.ChecksumError¶ Bases:
Exception
Raised when the downloaded file checksum is not correct.
-
exception
datasetinsights.io.exceptions.DownloadError¶ Bases:
Exception
Raised when a file download fails.
-
exception
datasetinsights.io.exceptions.InvalidTrackerError¶ Bases:
Exception
Raised when an unknown tracker is requested.
datasetinsights.io.gcs¶
-
class
datasetinsights.io.gcs.GCSClient(**kwargs)¶ Bases:
object
This class is used to download data from a GCS location and to perform functions such as downloading the dataset and checksum validation.
-
GCS_PREFIX= '^gs://'¶
-
KEY_SEPARATOR= '/'¶
-
download(*, url=None, local_path=None, bucket=None, key=None)¶ This method is used to download the dataset from GCS.
- Parameters
url (str) – This is the downloader-uri that indicates where the dataset should be downloaded from.
local_path (str) – This is the path to the directory where the download will store the dataset.
bucket (str) – gcs bucket name
key (str) – object key path
Examples
>>> url = "gs://bucket/folder or gs://bucket/folder/data.zip"
>>> local_path = "/tmp/folder"
>>> bucket = "bucket"
>>> key = "folder/data.zip" or "folder"
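The `url` form and the `bucket` plus `key` form describe the same location. A minimal sketch of how a `gs://` URL could be split using the two class constants shown above (`GCS_PREFIX` and `KEY_SEPARATOR`) follows; the function `parse_gcs_url` is hypothetical and is not claimed to match the client's internal parsing.

```python
import re

GCS_PREFIX = "^gs://"
KEY_SEPARATOR = "/"


def parse_gcs_url(url):
    """Split a gs:// URL into (bucket, key); illustrative only."""
    if not re.match(GCS_PREFIX, url):
        raise ValueError(f"not a GCS url: {url}")
    stripped = re.sub(GCS_PREFIX, "", url)
    # The bucket is everything up to the first separator; the rest is the key.
    bucket, _, key = stripped.partition(KEY_SEPARATOR)
    return bucket, key


print(parse_gcs_url("gs://bucket/folder/data.zip"))
```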
-
get_most_recent_blob(url=None, bucket_name=None, key=None)¶ Get the last updated blob in a given bucket under given prefix
- Parameters
bucket_name (str) – gcs bucket name
key (str) – object key path
-
upload(*, local_path=None, bucket=None, key=None, url=None, pattern='*')¶ Upload a file or list of files from directory to GCS
- Parameters
url (str) – This is the GCS location that indicates where the dataset should be uploaded.
local_path (str) – This is the path to the directory or file where the data is stored.
bucket (str) – gcs bucket name
key (str) – object key path
pattern – Unix glob patterns. Use **/* for recursive glob.
Examples
- For file upload:
>>> url = "gs://bucket/folder/data.zip"
>>> local_path = "/tmp/folder/data.zip"
>>> bucket = "bucket"
>>> key = "folder/data.zip"
- For directory upload:
>>> url = "gs://bucket/folder"
>>> local_path = "/tmp/folder"
>>> bucket = "bucket"
>>> key = "folder"
>>> pattern = "**/*"
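The effect of the glob `pattern` on a directory upload can be previewed locally. The sketch below only plans which local files would map to which object keys; `plan_upload` and the key layout (`key` joined with the relative path) are assumptions for illustration, and nothing is sent to GCS.

```python
import tempfile
from pathlib import Path


def plan_upload(local_path, key, pattern="*"):
    """Map local files to hypothetical object keys under `key`."""
    root = Path(local_path)
    if root.is_file():
        return {str(root): key}
    return {
        str(p): f"{key}/{p.relative_to(root).as_posix()}"
        for p in sorted(root.glob(pattern))
        if p.is_file()  # directories themselves are not uploaded
    }


# Small directory tree to compare "*" with the recursive "**/*".
demo_root = Path(tempfile.mkdtemp())
(demo_root / "sub").mkdir()
(demo_root / "a.txt").write_text("a")
(demo_root / "sub" / "b.txt").write_text("b")

flat = plan_upload(demo_root, "folder")
recursive = plan_upload(demo_root, "folder", pattern="**/*")
```

With the default pattern only the top-level file is selected, while `**/*` also picks up the file in the subdirectory.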
-
datasetinsights.io.kfp_output¶
-
class
datasetinsights.io.kfp_output.KubeflowPipelineWriter(tb_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/memory-issue/checkouts/latest/runs/20210127-003502', kfp_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/memory-issue/checkouts/latest/kfp/20210127-003502', kfp_metrics_filename='mlpipeline-metrics.json', kfp_ui_metadata_filename='mlpipeline-ui-metadata.json')¶ Bases:
object
KFP writer that serializes the metrics dictionary generated during model training/evaluation to JSON and stores it in a file, and creates the KFP dashboard visualizer JSON file for TensorBoard.
- Parameters
filename (str) – Name of the file to which the writer will save metrics
kfp_log_dir (str) – Path where all files related to KFP will be stored
tb_log_dir (str) – Path where TensorBoard logs are saved
-
filename¶ Name of the file to which the writer will save metrics
- Type
str
-
filepath¶ Path where the file will be stored
- Type
str
-
data_dict¶ A dictionary to save metrics name and value pairs
- Type
dict
-
data¶ Dictionary to be JSON serialized
-
add_metric(name, val)¶ Adds metric to the data dictionary of the writer
Note: Using the same name key overwrites the previous value, as the current strategy is to save only the metrics generated in the last epoch.
- Parameters
name (str) – Name of the metric
val (float) – Value of the metric
-
create_tb_visualization_json(tb_log_dir, kfp_ui_metadata_filename)¶
-
write_metric()¶ Saves all the metrics added previously to a file in the format required by kubeflow
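The overwrite behaviour of `add_metric` and the file produced by `write_metric` can be sketched as follows. The class `SketchMetricsWriter` is a hypothetical stand-in, and the JSON layout (a `metrics` list of `name` / `numberValue` / `format` entries) is an assumption about the Kubeflow metrics file format rather than something stated on this page.

```python
import json
import os
import tempfile


class SketchMetricsWriter:
    """Hypothetical stand-in for the metrics part of KubeflowPipelineWriter."""

    def __init__(self, log_dir, filename="mlpipeline-metrics.json"):
        self.filepath = os.path.join(log_dir, filename)
        self.data_dict = {}

    def add_metric(self, name, val):
        # A repeated name replaces the earlier value, so only the most
        # recently added value for each metric survives.
        self.data_dict[name] = val

    def write_metric(self):
        # Assumed layout of the metrics file; not confirmed by this page.
        data = {
            "metrics": [
                {"name": name, "numberValue": val, "format": "RAW"}
                for name, val in self.data_dict.items()
            ]
        }
        with open(self.filepath, "w") as f:
            json.dump(data, f)
        return data


metrics_writer = SketchMetricsWriter(tempfile.mkdtemp())
metrics_writer.add_metric("mAP", 0.41)
metrics_writer.add_metric("mAP", 0.47)  # later epoch replaces the earlier value
written = metrics_writer.write_metric()
```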
datasetinsights.io.loader¶
-
datasetinsights.io.loader.create_loader(dataset, *, dryrun=False, batch_size=1, num_workers=0, collate_fn=None)¶ Create data loader from dataset
Note: The data loader here is a PyTorch data loader object that does not assume tensor_type to be a PyTorch tensor. The input dataset only needs to support the __getitem__ and __len__ methods so that its items can be iterated over.
Since the collate_fn method in torch.utils.data.DataLoader behaves differently when automatic batching is on, we might need to override this method. If create_loader becomes too complicated in order to support different estimators, each estimator may need its own create_loader method.
https://pytorch.org/docs/stable/data.html#working-with-collate-fn
- Parameters
dataset (Dataset) – dataset object derived from datasetinsights.data.datasets.Dataset class.
dryrun (bool) – indicator whether to use a very small subset of the dataset. This subset is useful to make sure we can quickly run estimator without loading the whole dataset. (default: False)
batch_size (int) – how many samples per batch to load (default: 1)
num_workers (int) – number of parallel workers used for data loader. Set to 0 to run on a single thread (instead of 1 which might introduce overhead). (default: 0)
- Returns
torch.utils.data.DataLoader object as data loader
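The requirement that a dataset only supports `__getitem__` and `__len__`, and the role of `collate_fn` in turning a list of samples into a batch, can be shown without PyTorch. The sketch below is a plain-Python stand-in, not `torch.utils.data.DataLoader`; `Squares` and `iter_batches` are hypothetical names, and workers and the `dryrun` subset are left out.

```python
class Squares:
    """Toy dataset exposing only __len__ and __getitem__."""

    def __len__(self):
        return 5

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return index * index


def iter_batches(dataset, batch_size=1, collate_fn=list):
    """Yield batches in order; collate_fn decides how samples are combined."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        samples = [dataset[i] for i in range(start, stop)]
        yield collate_fn(samples)


batches = list(iter_batches(Squares(), batch_size=2))
summed = list(iter_batches(Squares(), batch_size=2, collate_fn=sum))
```

Swapping `collate_fn` changes the shape of each batch without touching the dataset, which is why a custom collate function is the usual hook when samples cannot be stacked automatically.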
datasetinsights.io.transforms¶
-
class
datasetinsights.io.transforms.Compose(transforms)¶ Bases:
object
-
class
datasetinsights.io.transforms.RandomHorizontalFlip(flip_prob=0.5)¶ Bases:
object
Flip the image horizontally (left to right) with the given probability.
- Parameters
flip_prob – the probability to flip the image
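A probabilistic flip can be sketched on a plain nested-list image, taking "horizontal" to mean mirroring each row left to right. The function `random_horizontal_flip` below is a hypothetical illustration and does not handle targets such as bounding boxes, which a real transform on (image, target) pairs would also need to update.

```python
import random


def random_horizontal_flip(image, flip_prob=0.5, rng=random):
    """Reverse each row of a row-major image with probability flip_prob."""
    # rng.random() is in [0, 1), so flip_prob=1.0 always flips and 0.0 never does.
    if rng.random() < flip_prob:
        return [row[::-1] for row in image]
    return image


image = [[1, 2, 3], [4, 5, 6]]
always = random_horizontal_flip(image, flip_prob=1.0)
never = random_horizontal_flip(image, flip_prob=0.0)
```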
-
class
datasetinsights.io.transforms.Resize(img_size=-1, target_size=-1)¶ Bases:
objectResize the (image, target) to the given sizes.
- Parameters
img_size (tuple or int) – Desired output size. If size is a sequence like (h, w), the output size will be matched to it. If size is an int, the smaller edge of the image will be matched to this number, i.e., if height > width, the image will be rescaled to (size * height / width, size)
target_size (tuple or int) – Desired output size. If size is a sequence like (h, w), the output size will be matched to it. If size is an int, the smaller edge of the image will be matched to this number, i.e., if height > width, the image will be rescaled to (size * height / width, size)
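The sizing rule for an int versus an (h, w) sequence can be written out as a small function. This is a sketch of the rule as described above, not the transform itself: `output_size` is a hypothetical helper, the results are truncated to integers as an assumption, and the default value of -1 is not handled.

```python
def output_size(height, width, size):
    """Return the (h, w) an image of (height, width) would be resized to."""
    if isinstance(size, (tuple, list)):
        return tuple(size)  # explicit (h, w) is used as given
    if height > width:
        # Width is the smaller edge: it becomes `size`, height keeps the ratio.
        return (int(size * height / width), size)
    # Height is the smaller (or equal) edge.
    return (size, int(size * width / height))


print(output_size(400, 200, 100))
```

In both int cases the aspect ratio is preserved and the smaller edge ends up equal to `size`.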
datasetinsights.io.config_handler¶
-
datasetinsights.io.config_handler.load_config(path)¶ Load config file from local or remote locations.
- Parameters
path (str) – This is the file-uri that indicates where the YAML config should be loaded from.
Examples
>>> path = "gs://thea-dev/config.yaml"
>>> path = "https://thea-dev/config.yaml"
>>> path = "http://thea-dev/config.yaml"
>>> path = "file:///root/config.yaml" # absolute path
>>> path = "/root/config.yaml" # absolute path
>>> path = "datasetinsights/config.yaml" # relative path
- Returns
config object of type yacs.config.CfgNode
-
class
datasetinsights.io.BBox2D(label, x, y, w, h, score=1.0)¶ Bases:
object
Canonical representation of a 2D bounding box.
-
class
datasetinsights.io.EstimatorCheckpoint(estimator_name, checkpoint_dir, distributed)¶ Bases:
object
Saves and loads estimator checkpoints.
Assigns an estimator checkpoint writer, chosen according to log_dir, which is responsible for saving estimators. The writer can be a GCS or local writer. Assigns a loader, which is responsible for loading an estimator from a given path. The loader can be a local, GCS, or HTTP loader.
-
class
datasetinsights.io.KubeflowPipelineWriter(tb_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/memory-issue/checkouts/latest/runs/20210127-003502', kfp_log_dir='/home/docs/checkouts/readthedocs.org/user_builds/memory-issue/checkouts/latest/kfp/20210127-003502', kfp_metrics_filename='mlpipeline-metrics.json', kfp_ui_metadata_filename='mlpipeline-ui-metadata.json')¶ Bases:
object
KFP writer that serializes the metrics dictionary generated during model training/evaluation to JSON and stores it in a file, and creates the KFP dashboard visualizer JSON file for TensorBoard.
-
datasetinsights.io.create_downloader(source_uri, **kwargs)¶ Instantiates the dataset downloader after finding it with the source-uri provided.
- Parameters
source_uri – URI used to look up the correct dataset downloader
**kwargs –
- Returns
The dataset downloader instance matching the source-uri.
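Looking up a downloader from a source URI is commonly done with a small registry keyed by URI prefix. The sketch below shows that pattern under stated assumptions: the registry, the `register` decorator, the two placeholder classes and `make_downloader` are all hypothetical and are not the library's lookup mechanism.

```python
DOWNLOADERS = {}  # maps a URI prefix to a downloader class


def register(prefix):
    """Class decorator that records a downloader under a URI prefix."""

    def decorator(cls):
        DOWNLOADERS[prefix] = cls
        return cls

    return decorator


@register("gs://")
class GCSSketch:
    def __init__(self, **kwargs):
        self.options = kwargs


@register("https://")
class HTTPSketch:
    def __init__(self, **kwargs):
        self.options = kwargs


def make_downloader(source_uri, **kwargs):
    """Instantiate the first registered downloader whose prefix matches."""
    for prefix, cls in DOWNLOADERS.items():
        if source_uri.startswith(prefix):
            return cls(**kwargs)
    raise ValueError(f"no downloader registered for {source_uri}")


downloader = make_downloader("gs://bucket/folder/data.zip", retries=3)
```

Extra keyword arguments are passed straight through to the matching class, mirroring the `**kwargs` in the documented signature.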