dataset

dataset()

Decorator function that wraps a dataset function. The wrapped function must output a Pandas dataframe.

The decorator ensures that:

  • If the decorated function is passed to layer.run(your_function), Layer runs the function remotely and stores its output as a dataset.
  • If you run the function locally by calling your_function(), Layer stores the output in the Layer backend as a dataset. This does not affect function execution.
  • Parameters

    • name (str) -- Name under which the dataset will be stored in the Layer backend.
    • dependencies (Optional[List[Union[layer.data_classes.Dataset, layer.data_classes.Model]]]) -- List of Datasets or Models that the Layer backend will build before building the current function. This tells Layer which entities the function depends on, so that it can optimize the build process.
  • Returns

    Function object being decorated.

  • Return type

    Callable[[...], Any]

import pandas as pd
import layer
from layer.client import Dataset
from layer.decorators import dataset

# Define a new function for dataset generation:
# - The dependencies list includes entities that need to be built before running the `create_my_titanic_dataset` code
# - `titanic` is a public dataset accessible to everyone using Layer
@dataset("my_titanic_dataset", dependencies=[Dataset("titanic")])
def create_my_titanic_dataset():
    df = layer.get_dataset("titanic").to_pandas()
    return df

# Run the function locally
df = create_my_titanic_dataset()
# The dataset is stored in the Layer backend and can be retrieved later.
# DataFrame.equals compares whole frames; `==` would compare element-wise.
assert df.equals(layer.get_dataset("my_titanic_dataset").to_pandas())

Here's another way to create a dataset. As long as the function outputs a Pandas dataframe, Layer does not care how you get the data into it.

import pandas as pd
from layer.decorators import dataset

@dataset("my_products")
def create_product_dataset():
    data = [[1, "product1", 15], [2, "product2", 20], [3, "product3", 10]]
    dataframe = pd.DataFrame(data, columns=["Id", "Product", "Price"])
    return dataframe

product_dataset = create_product_dataset()
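
As a further illustration of the same point, the dataframe could just as well be parsed from CSV text. The sketch below is an assumption-laden example rather than part of the Layer documentation: `CSV_TEXT` and `create_product_dataset_from_csv` are hypothetical names, and the `@dataset(...)` decorator is left out so that the snippet runs with Pandas alone.

```python
import io

import pandas as pd

# Hypothetical inline CSV; in practice this could be a file or another source.
CSV_TEXT = """Id,Product,Price
1,product1,15
2,product2,20
3,product3,10
"""


def create_product_dataset_from_csv() -> pd.DataFrame:
    # In a Layer project this function would carry a @dataset("...") decorator,
    # as in the example above; it is omitted here to keep the sketch standalone.
    return pd.read_csv(io.StringIO(CSV_TEXT))


product_df = create_product_dataset_from_csv()
```

The resulting dataframe has the same three columns as the list-based example, so either function body could sit under the decorator.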

If you use a dataset that is outside of your project, then you must explicitly call it out as a dependency. These dependencies are displayed in the External assets section in your Layer project.

import pandas as pd
import layer
from layer.decorators import dataset
from layer.client import Dataset

@dataset("raw_spam_dataset", dependencies=[Dataset("layer/spam-detection/datasets/spam_messages")])
def raw_spam_dataset():
    # Get the spam_messages dataset and convert it to a Pandas dataframe.
    df = layer.get_dataset("spam-detection/datasets/spam_messages").to_pandas()
    return df