Tutorial on the high-performance Vaex package and interactive dashboards with Vaex and Dash
Dashboards bring together the analytics and insights extracted from a dataset: they surface findings from data features, report business metrics, or track the performance of a model in production. In doing so, a dashboard shows the value of the data to the organization.
In this article, you will learn about the Vaex package and how to build interactive dashboards with Vaex and Dash:
(1) Introduction
(2) Walk through on Vaex
(3) Methods to handle large dataset in Vaex
(4) Dash & Vaex
(5) Hands-on experience with python code
(6) Dashboard Tutorial in Dash and Vaex package
(7) Diverse Visualization plots
Walk through on Vaex
Vaex is a Python library for processing and visualizing large tabular datasets. It provides out-of-core DataFrames (similar to Pandas). It computes statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion objects/rows per second. Vaex also supports interactive visualization of big data, such as histograms, density plots, and 3D volume rendering. To optimize performance, Vaex uses memory mapping, a zero memory copy policy, and lazy computations. Vaex can export data in HDF5 format, which is readable from other programming languages. Working with 1D and 2D subspaces of the columns (dimensions), Vaex helps extract richer information from the data.
DataFrame is the central class (data structure) in Vaex and is created from different kinds of data files. The open() function opens a file and returns a DataFrame; it can also connect to a remote server.
Alternatively, Vaex can read data remotely from Amazon's S3 when it is stored as an HDF5 file. The data is then lazily downloaded and cached on the local machine. "Lazily downloaded" means that only the portions of the data that are needed are downloaded. For instance, take a large dataset with 100 columns and 5000 rows. Calling print(df) downloads only the first and last 5 rows. If plots are then generated from only 5 columns, only those columns are downloaded and cached. By default, data streamed from S3 is cached in the directory $HOME/.vaex/file-cache/s3, so subsequent access is as fast as the native disk. The profile_name argument selects a specific S3 profile and is passed to s3fs.core.S3FileSystem.
S3 options can be given as query parameters in the S3 URL:
- anon: anonymous access or not (false by default). (Allowed values are: true,True,1,false,False,0)
- use_cache: Use the disk cache or not, only set to false if the data should be accessed once. (Allowed values are: true,True,1,false,False,0)
- profile_name and other arguments are passed to s3fs.core.S3FileSystem
These arguments can also be passed as kwargs, but then an argument such as anon must be a boolean, not a string.
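As a rough illustration of the URL form, the sketch below builds an S3 URL with the options as query parameters and reads them back, using only the Python standard library. The helper `build_s3_url`, the bucket name, and the file name are hypothetical; the resulting string is only the kind of URL one would hand to open().

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_s3_url(bucket, key, **options):
    """Append S3 options to an s3:// URL as query parameters (hypothetical helper)."""
    base = f"s3://{bucket}/{key}"
    if not options:
        return base
    # In a URL every option is a string, so booleans are written as true/false.
    encoded = {
        name: str(value).lower() if isinstance(value, bool) else str(value)
        for name, value in options.items()
    }
    return f"{base}?{urlencode(encoded)}"


# Hypothetical bucket and file name, for illustration only.
url = build_s3_url("example-bucket", "taxi/trips.hdf5", anon=True, use_cache=False)
print(url)

# Reading the options back out of the URL: they come back as strings.
options = {name: values[0] for name, values in parse_qs(urlsplit(url).query).items()}
print(options)
```

In the URL form every value is a string, which is why the list above enumerates allowed spellings such as true, True, and 1.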
Methods to handle large dataset in Vaex
A common challenge for many organizations is the sheer amount of data: datasets with millions of rows are overwhelming to process. Data scientists struggle to work with them because most tools are not built for data at that scale, and building an interactive dashboard on top of such a dataset is harder still. Vaex, an open-source DataFrame library in Python, makes it possible to work with large datasets. Vaex memory-maps the data, so it is not loaded into RAM all at once. With memory mapping, the same physical memory is shared amongst all processes. This is quite useful in Dash, which uses workers to scale vertically and Kubernetes to scale horizontally. In addition, Vaex processes large datasets with efficient, fully parallelized out-of-core algorithms. Its API follows the foundations set by Pandas.
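To make the idea of memory mapping concrete, here is a minimal sketch using Python's standard `mmap` module rather than Vaex itself. It writes a column of one million floats to a temporary file, maps the file, and reads a single value from the middle without reading the whole file into a Python object. The file name and sizes are arbitrary choices for the example.

```python
import mmap
import os
import struct
import tempfile
from array import array

n_values = 1_000_000  # one million 8-byte floats

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "column.bin")
    with open(path, "wb") as f:
        array("d", range(n_values)).tofile(f)

    with open(path, "rb") as f:
        # Map the file; the operating system loads pages only when they are touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            index = 500_000
            (value,) = struct.unpack_from("d", mapped, index * 8)
            size_mb = len(mapped) / 1e6

print(value, size_mb)
```

Because the mapping is backed by the operating system's page cache, several processes mapping the same file can share those pages, which is the property the paragraph above relies on.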
Dash & Vaex
Vaex works together with Dash to build simple, interactive analytical dashboards or web applications. Dash applications are reactive: when a user pushes a button or moves a slider, a callback runs on the server and updates the application with the result of its computation. Because the server is stateless, it keeps no memory of a user's previous interactions. Dash can therefore scale both vertically (more workers) and horizontally (more nodes). Vaex fits this stateless model well: it computes filters and aggregations for each request without modifying or copying the data. The results of these computations or group-by aggregations are small, which matters for visualization since they have to be transferred to the browser.
In addition, Vaex can process each request on a single node or worker within a short time, so there is no need to set up a cluster. Distributed computing remains an option for tackling even larger datasets. This article shows how to build an interactive web application on top of a large dataset that barely fits into RAM on most machines (12 GB). Data manipulation, aggregation, and statistical computations are done with Vaex, and the results are visualized interactively with Plotly and Dash.
Hands-on experience with python code
The dataset used for the dashboard is the New York Taxi dataset. Its public availability, relatability, and size have made it the de facto standard for benchmarking and showcasing approaches to manipulating large datasets. The following example uses a full year of YellowCab Taxi company data from its prime, numbering over 100 million trips. We use Plotly, Dash, and Vaex with the taxi data to build an interactive web application that informs prospective passengers of the likely cost and duration of their next trip, while also giving taxi company managers insight into some general trends.
Dashboard Tutorial in Dash and Vaex package
This tutorial walks through building a dashboard with Dash and Vaex on data that barely fits in memory. The application, a trip planner, lets users explore pick-up locations in New York City on an interactive heatmap. The map supports panning and zooming, and it is recomputed and updated after each action. The user can click on the map to select an origin and a destination, and the dashboard then shows the cost and duration for that route. The user can also specify a day and hour range to get more detailed information about the trip.
Vaex memory-maps the data and reads only the parts it needs, however large the file is. When the Dash application runs multiple workers, they all share the same memory-mapped file.
The next step is to set up the Dash application with a simple layout. In our case, these are the main components to consider:
- The components of the "control panel" that let the user select trips based on pick-up hour, dcc.RangeSlider(id='hours'), and day of week, dcc.Dropdown(id='days');
- The interactive map, dcc.Graph(id='heatmap_figure');
- The resulting visualizations based on the user input, which show the distributions of the trip costs and durations, and a markdown block showing some key statistics. The components are dcc.Graph(id='trip_summary_amount_figure'), dcc.Graph(id='trip_summary_duration_figure'), and dcc.Markdown(id='trip_summary_md') respectively;
- Several dcc.Store() components that track the user's state on the client side.
```python
import os

import dash
from dash import dcc, html
from flask_caching import Cache

app = dash.Dash(__name__)

# Set up the caching mechanism
cache = Cache(app.server, config={
    'CACHE_TYPE': 'filesystem',
    'CACHE_DIR': 'cache-directory'
})

# set negative to disable (useful for testing/benchmarking)
CACHE_TIMEOUT = int(os.environ.get('DASH_CACHE_TIMEOUT', '60'))

# trip_start_initial, trip_end_initial, heatmap_limits_initial and
# heatmap_data_initial are assumed to be defined earlier in the source file.
app.layout = html.Div(className='app-body', children=[
    # Stores
    dcc.Store(id='map_clicks', data=0),
    dcc.Store(id='trip_start', data=trip_start_initial),
    dcc.Store(id='trip_end', data=trip_end_initial),
    dcc.Store(id='heatmap_limits', data=heatmap_limits_initial),
    # Control panel
    html.Div(className="row", id='control-panel', children=[
        html.Div(className="four columns pretty_container", children=[
            html.Label('Select pick-up hours'),
            dcc.RangeSlider(id='hours',
                            value=[0, 23],
                            min=0, max=23,
                            marks={i: str(i) for i in range(0, 24, 3)})
        ]),
        html.Div(className="four columns pretty_container", children=[
            html.Label('Select pick-up days'),
            dcc.Dropdown(id='days',
                         placeholder='Select a day of week',
                         options=[{'label': 'Monday', 'value': 0},
                                  {'label': 'Tuesday', 'value': 1},
                                  {'label': 'Wednesday', 'value': 2},
                                  {'label': 'Thursday', 'value': 3},
                                  {'label': 'Friday', 'value': 4},
                                  {'label': 'Saturday', 'value': 5},
                                  {'label': 'Sunday', 'value': 6}],
                         value=[],
                         multi=True),
        ]),
    ]),
    # Visuals
    html.Div(className="row", children=[
        html.Div(className="seven columns pretty_container", children=[
            dcc.Markdown(children='_Click on the map to select trip start and destination._'),
            dcc.Graph(id='heatmap_figure',
                      figure=create_figure_heatmap(heatmap_data_initial,
                                                   heatmap_limits_initial,
                                                   trip_start_initial,
                                                   trip_end_initial))
        ]),
        html.Div(className="five columns pretty_container", children=[
            dcc.Graph(id='trip_summary_amount_figure'),
            dcc.Graph(id='trip_summary_duration_figure'),
            dcc.Markdown(id='trip_summary_md')
        ]),
    ]),
])
```
Now let's talk about how to make everything work. We organize our functions into three groups:
- compute_ functions calculate the relevant aggregations and statistics that are the basis for the visualizations;
- create_figure_ functions create the figures from those aggregations;
- Dash callback functions call the compute functions when the user changes something and pass the outputs to the figure creation functions.
Separating the functions into three groups keeps the dashboard functionality organized. It also allows the application to be pre-populated, which avoids triggering the callbacks on the initial page load.
Let's start by computing the heatmap. The initial step is selecting the relevant subset of the data that the user may have specified via the Range Slider and Dropdown elements, which control the pick-up hour and day of the week respectively:
```python
def create_selection(days, hours):
    df = df_original.copy()
    selection = None
    if hours:
        hour_min, hour_max = hours
        if hour_min > 0:
            df.select((hour_min <= df.pickup_hour), mode='and')
            selection = True
        if hour_max < 23:
            df.select((df.pickup_hour <= hour_max), mode='and')
            selection = True
    if 0 < len(days) < 7:
        df.select(df.pickup_day.isin(days), mode='and')
        selection = True
    return df, selection
```
The function above first makes a copy of the DataFrame, because selections are stateful objects stored in the DataFrame. Dash is multi-threaded, so working on a copy keeps the selections of one request from affecting another. Selections are used rather than filtering to boost performance. The code cell below computes the heatmap data:
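The rule that the selection encodes can be sketched in plain Python, applied to one row at a time. This is only an illustration of the logic, not how Vaex evaluates selections; `matches_selection` is a hypothetical helper, and the dictionary keys mirror the pickup_hour and pickup_day columns used in the listing.

```python
def matches_selection(row, days, hours):
    """Return True if a trip row passes the hour and day controls."""
    if hours:
        hour_min, hour_max = hours
        if not (hour_min <= row["pickup_hour"] <= hour_max):
            return False
    # An empty list, or all seven days, means "no restriction on the day".
    if 0 < len(days) < 7 and row["pickup_day"] not in days:
        return False
    return True


# Three made-up trips: (hour, day) with Monday = 0.
trips = [
    {"pickup_hour": 8, "pickup_day": 0},
    {"pickup_hour": 8, "pickup_day": 5},
    {"pickup_hour": 22, "pickup_day": 0},
]

weekday_mornings = [t for t in trips
                    if matches_selection(t, days=[0, 1, 2, 3, 4], hours=[6, 10])]
no_restriction = [t for t in trips if matches_selection(t, days=[], hours=[0, 23])]
print(len(weekday_mornings), len(no_restriction))
```

With the default controls (all hours, no days picked) nothing is excluded, which corresponds to the case where the original function returns selection = None.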
```python
@cache.memoize(timeout=CACHE_TIMEOUT)
def compute_heatmap_data(days, hours, heatmap_limits):
    df, selection = create_selection(days, hours)
    heatmap_data_array = df.count(binby=[df.pickup_longitude, df.pickup_latitude],
                                  selection=selection,
                                  limits=heatmap_limits,
                                  shape=256,
                                  array_type="xarray")
    return heatmap_data_array
```
All Vaex DataFrame methods work on data of any size thanks to parallelization and out-of-core execution. To compute the heatmap data, two columns are passed via the binby argument to the .count() method, which counts the number of samples in a grid defined by those axes. The grid is further specified by its shape (i.e. the number of bins per axis) and limits (or extent). Setting array_type="xarray" returns the result as an xarray data array, which is a numpy array with labeled dimensions and is convenient for plotting.
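What a binned count over two axes produces can be shown with a small pure-Python sketch. It is a conceptual stand-in, not Vaex's implementation: `count_on_grid` is a hypothetical function, and the points and limits are made up.

```python
def count_on_grid(xs, ys, limits, shape):
    """Count points per cell of a shape x shape grid spanning the given limits."""
    (x_min, x_max), (y_min, y_max) = limits
    grid = [[0] * shape for _ in range(shape)]
    for x, y in zip(xs, ys):
        # Points outside the limits are ignored.
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        i = min(int((x - x_min) / (x_max - x_min) * shape), shape - 1)
        j = min(int((y - y_min) / (y_max - y_min) * shape), shape - 1)
        grid[i][j] += 1
    return grid


xs = [0.5, 1.0, 3.0, 3.5, 5.0]
ys = [0.5, 1.0, 0.5, 3.5, 1.0]
grid = count_on_grid(xs, ys, limits=[[0, 4], [0, 4]], shape=2)
print(grid)
```

The result is always a shape x shape table of counts regardless of how many points went in, which is why such an aggregation is small enough to send to the browser.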
With the heatmap data computed, the code cell below creates the figure shown on the dashboard.
```python
def create_figure_heatmap(data_array, heatmap_limits, trip_start, trip_end):
    # Set up the layout of the figure
    legend = go.layout.Legend(orientation='h',
                              x=0.0,
                              y=-0.05,
                              font={'color': 'azure'},
                              bgcolor='royalblue',
                              itemclick=False,
                              itemdoubleclick=False)
    margin = go.layout.Margin(l=0, r=0, b=0, t=30)
    # if we don't explicitly set the width, we get a lot of autoresize events
    layout = go.Layout(height=600,
                       title=None,
                       margin=margin,
                       legend=legend,
                       xaxis=go.layout.XAxis(title='Longitude', range=heatmap_limits[0]),
                       yaxis=go.layout.YAxis(title='Latitude', range=heatmap_limits[1]),
                       **fig_layout_defaults)

    # add the heatmap
    # Use plotly express in combination with xarray - easy plotting!
    fig = px.imshow(np.log1p(data_array.T), origin='lower')
    fig.layout = layout

    # add markers for the points clicked
    def add_point(x, y, **kwargs):
        fig.add_trace(go.Scatter(x=[x], y=[y], marker_color='azure', marker_size=8,
                                 mode='markers', showlegend=True, **kwargs))

    if trip_start:
        add_point(trip_start[0], trip_start[1], name='Trip start', marker_symbol='circle')
    if trip_end:
        add_point(trip_end[0], trip_end[1], name='Trip end', marker_symbol='x')
    return fig
```
The function above uses Plotly Express to render the heatmap. If trip_start and trip_end coordinates are given, they are added to the figure as individual plotly.graph_objs.Scatter traces. The interactive Plotly figure supports zooming, panning, and clicking.
The Dash callback below updates the heatmap figure whenever the data selection or the map view changes.
```python
@app.callback(Output('heatmap_figure', 'figure'),
              [Input('days', 'value'),
               Input('hours', 'value'),
               Input('heatmap_limits', 'data'),
               Input('trip_start', 'data'),
               Input('trip_end', 'data')],
              prevent_initial_call=True)
def update_heatmap_figure(days, hours, heatmap_limits, trip_start, trip_end):
    data_array = compute_heatmap_data(days, hours, heatmap_limits)
    return create_figure_heatmap(data_array, heatmap_limits, trip_start, trip_end)
```
The function above is called whenever any of the Input values changes. Inside it, compute_heatmap_data performs the aggregation with the new input parameters, and a new heatmap figure is generated from the result. The prevent_initial_call argument of the decorator prevents the function from being called when the dashboard first loads. Although compute_heatmap_data does not take trip_start or trip_end as parameters, it is still called whenever either of them changes, because that triggers update_heatmap_figure. The memoize decorator attached to compute_heatmap_data avoids repeating the computation in that case. It comes from the flask_caching library, suggested by Plotly as a fast and simple way to cache old computations, here for 60 seconds.
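The effect of a memoizing decorator with a timeout can be sketched with the standard library alone. This is a simplified stand-in for illustration, not flask_caching's implementation: `memoize_with_timeout` and `expensive_count` are hypothetical names, the cache lives in process memory rather than on the filesystem, and arguments must be hashable.

```python
import time
from functools import wraps


def memoize_with_timeout(timeout, clock=time.monotonic):
    """Cache results per argument tuple and reuse them for `timeout` seconds."""
    def decorator(func):
        cache = {}

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < timeout:
                return hit[1]  # still fresh: skip the computation
            result = func(*args)
            cache[args] = (now, result)
            return result

        return wrapper
    return decorator


calls = []  # records every time the function body actually runs


@memoize_with_timeout(timeout=60)
def expensive_count(days, hours):
    calls.append((days, hours))
    return len(days) * (hours[1] - hours[0] + 1)


first = expensive_count((0, 1), (8, 10))
second = expensive_count((0, 1), (8, 10))  # same arguments: served from the cache
third = expensive_count((5,), (8, 10))     # new arguments: computed again
print(first, second, third, len(calls))
```

The second call returns without running the function body, which mirrors what happens when only trip_start or trip_end changes and the heatmap inputs stay the same.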
The Dash callback below captures the user's panning and zooming on the heatmap.
```python
@app.callback(
    Output('heatmap_limits', 'data'),
    [Input('heatmap_figure', 'relayoutData')],
    [State('heatmap_limits', 'data')],
    prevent_initial_call=True)
def update_limits(relayoutData, heatmap_limits):
    if relayoutData is None:
        raise dash.exceptions.PreventUpdate
    elif relayoutData is not None and 'xaxis.range[0]' in relayoutData:
        d = relayoutData
        heatmap_limits = [[d['xaxis.range[0]'], d['xaxis.range[1]']],
                          [d['yaxis.range[0]'], d['yaxis.range[1]']]]
    else:
        raise dash.exceptions.PreventUpdate
    if heatmap_limits is None:
        heatmap_limits = heatmap_limits_initial
    return heatmap_limits
```
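The core of this callback, turning a relayout event into new limits, can be tried outside Dash as a plain function. The sketch below is a variation rather than the original code: `limits_from_relayout` is a hypothetical helper that returns the current limits instead of raising PreventUpdate, and it checks for all four range keys instead of only the first. The coordinate values are made up.

```python
RANGE_KEYS = ["xaxis.range[0]", "xaxis.range[1]", "yaxis.range[0]", "yaxis.range[1]"]


def limits_from_relayout(relayout_data, current_limits):
    """Return [[x0, x1], [y0, y1]] from a relayout event, or the current limits."""
    if not relayout_data or not all(key in relayout_data for key in RANGE_KEYS):
        return current_limits  # nothing to update (e.g. an autosize event)
    x0, x1, y0, y1 = (relayout_data[key] for key in RANGE_KEYS)
    return [[x0, x1], [y0, y1]]


current = [[-74.05, -73.75], [40.58, 40.90]]
zoom_event = {
    "xaxis.range[0]": -74.00, "xaxis.range[1]": -73.90,
    "yaxis.range[0]": 40.70, "yaxis.range[1]": 40.80,
}

zoomed = limits_from_relayout(zoom_event, current)
unchanged = limits_from_relayout({"autosize": True}, current)
print(zoomed, unchanged)
```

Keeping this logic in a plain function makes it easy to test without starting the Dash server.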
The Dash callback below captures and responds to click events:
```python
@app.callback([Output('map_clicks', 'data'),
               Output('trip_start', 'data'),
               Output('trip_end', 'data')],
              [Input('heatmap_figure', 'clickData')],
              [State('map_clicks', 'data'),
               State('trip_start', 'data'),
               State('trip_end', 'data')],
              prevent_initial_call=True)
def click_heatmap_action(click_data_heatmap, map_clicks, trip_start, trip_end):
    if click_data_heatmap is not None:
        point = click_data_heatmap['points'][0]['x'], click_data_heatmap['points'][0]['y']
        new_location = point[0], point[1]
        # the 1st and 3rd and 5th click change the start point
        if map_clicks % 2 == 0:
            trip_start = new_location
            trip_end = None  # and reset the end point
        else:
            # the 2nd, 4th etc set the end point
            trip_end = new_location
        map_clicks += 1
    return map_clicks, trip_start, trip_end
```
Both callback functions above update key components that are inputs to the heatmap. Therefore, whenever a click or relayout (pan or zoom) event occurs, the updated components trigger update_heatmap_figure, which redraws the heatmap. Together, these functions make the heatmap fully interactive: it can be updated via external controls such as the RangeSlider and Dropdown menu, or by interacting with the figure itself.
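The alternating behaviour of the clicks is easiest to see by replaying a few of them through a plain function. The sketch below mirrors the branch logic of click_heatmap_action; `apply_click` is a hypothetical name and the coordinates are made up.

```python
def apply_click(point, map_clicks, trip_start, trip_end):
    """Odd-numbered clicks set the start (and clear the end); even ones set the end."""
    if map_clicks % 2 == 0:
        trip_start, trip_end = point, None
    else:
        trip_end = point
    return map_clicks + 1, trip_start, trip_end


state = (0, None, None)  # (map_clicks, trip_start, trip_end)
history = []
for point in [(-73.99, 40.75), (-73.78, 40.64), (-73.87, 40.77)]:
    state = apply_click(point, *state)
    history.append(state)

for entry in history:
    print(entry)
```

After the third click the start point moves and the end point is cleared, so the trip summary goes back to asking for a destination.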
Since the Dash application is stateless, reactive, and functional, the visualizations are created by functions. In the application, the user clicks to select trips that start at the "origin" and end at the "destination" point. For those trips, the dashboard shows the distributions of cost and duration and highlights the most likely value of each. This is computed in the function below.
```python
@cache.memoize(timeout=CACHE_TIMEOUT)
def compute_trip_details(days, hours, trip_start, trip_end):
    # Apply the selection to the dataframe
    df, selection = create_selection(days, hours)

    # Radius around which to select trips
    # One mile is ~0.0145 deg; and in NYC there are approx 20 blocks per mile
    # We will select a radius of 3 blocks
    r = 0.0145 / 20 * 3
    pickup_long, pickup_lat = trip_start
    dropoff_long, dropoff_lat = trip_end

    selection_pickup = (df.pickup_longitude - pickup_long)**2 + (df.pickup_latitude - pickup_lat)**2 <= r**2
    selection_dropoff = (df.dropoff_longitude - dropoff_long)**2 + (df.dropoff_latitude - dropoff_lat)**2 <= r**2
    df.select(selection_pickup & selection_dropoff, mode='and')
    selection = True  # after this the selection is always True

    return {'counts': df.count(selection=selection),
            'counts_total': df.count(binby=[df.total_amount], limits=[0, 50], shape=25, selection=selection),
            'counts_duration': df.count(binby=[df.trip_duration_min], limits=[0, 50], shape=25, selection=selection)
            }
```
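The radius arithmetic in the comments can be worked through on its own. The sketch below uses the same constants as the listing (0.0145 degrees per mile, 20 blocks per mile, 3 blocks) and the same squared-distance test; `within_radius` is a hypothetical helper and the coordinates are made up. Like the listing, it treats a degree of longitude and a degree of latitude as equal lengths, which is only an approximation.

```python
MILE_IN_DEGREES = 0.0145   # approximate, as stated in the listing
BLOCKS_PER_MILE = 20       # approximate, as stated in the listing
radius = MILE_IN_DEGREES / BLOCKS_PER_MILE * 3  # three blocks, in degrees


def within_radius(point, center, r=radius):
    """Squared-distance test, avoiding a square root."""
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    return dx * dx + dy * dy <= r * r


center = (-73.99, 40.75)
near = (-73.989, 40.751)   # about 0.0014 degrees away
far = (-73.985, 40.75)     # 0.005 degrees away

print(round(radius, 6), within_radius(near, center), within_radius(far, center))
```

The radius works out to roughly 0.0022 degrees, so only pick-ups and drop-offs within a few blocks of the clicked points contribute to the histograms.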
A helper function then creates the histogram figure from the aggregated data.
```python
def create_figure_histogram(x, counts, title=None, xlabel=None, ylabel=None):
    # settings
    color = 'royalblue'

    # list of traces
    traces = []

    # Create the figure
    line = go.scatter.Line(color=color, width=2)
    hist = go.Scatter(x=x, y=counts, mode='lines', line_shape='hv', line=line, name=title, fill='tozerox')
    traces.append(hist)

    # Layout
    title = go.layout.Title(text=title, x=0.5, y=1, font={'color': 'black'})
    margin = go.layout.Margin(l=0, r=0, b=0, t=30)
    legend = go.layout.Legend(orientation='h',
                              bgcolor='rgba(0,0,0,0)',
                              x=0.5,
                              y=1,
                              itemclick=False,
                              itemdoubleclick=False)
    layout = go.Layout(height=230,
                       margin=margin,
                       legend=legend,
                       title=title,
                       xaxis=go.layout.XAxis(title=xlabel),
                       yaxis=go.layout.YAxis(title=ylabel),
                       **fig_layout_defaults)

    # Now calculate the most likely value (peak of the histogram)
    peak = np.round(x[np.argmax(counts)], 2)

    return go.FigureWidget(data=traces, layout=layout), peak


def make_empty_plot():
    layout = go.Layout(plot_bgcolor='white', width=10, height=10,
                       xaxis=go.layout.XAxis(visible=False),
                       yaxis=go.layout.YAxis(visible=False))
    return go.FigureWidget(layout=layout)
```
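The "most likely value" is simply the x position of the tallest bin. A small pure-Python sketch of that step follows, under the assumption that x holds the left edge of each bin; whether the x values passed to the helper are edges or centers is not verified here. The helper names and the counts are hypothetical.

```python
def bin_left_edges(limits, shape):
    """Left edge of each of `shape` equal-width bins spanning `limits`."""
    low, high = limits
    width = (high - low) / shape
    return [low + i * width for i in range(shape)]


def most_likely_value(x, counts):
    """x position of the bin with the highest count, rounded to two decimals."""
    peak_index = max(range(len(counts)), key=counts.__getitem__)
    return round(x[peak_index], 2)


edges = bin_left_edges([0, 50], shape=25)  # 25 bins, each 2 units wide

counts = [0] * 25
counts[6] = 40   # made-up: 40 rides fall in the bin starting at 12
counts[7] = 25

peak = most_likely_value(edges, counts)
print(peak)
```

With limits of [0, 50] and 25 bins, as in the listing, each bin covers two dollars or two minutes, so the reported peak is only accurate to that bin width.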
With all the components, we can link them to the Dash application via a callback function:
```python
@app.callback([Output('trip_summary_amount_figure', 'figure'),
               Output('trip_summary_duration_figure', 'figure'),
               Output('trip_summary_md', 'children')],
              [Input('days', 'value'),
               Input('hours', 'value'),
               Input('trip_start', 'data'),
               Input('trip_end', 'data')]
              )
def trip_details_summary(days, hours, trip_start, trip_end):
    # Without both points selected there is nothing to compute yet
    if trip_start is None or trip_end is None:
        fig_empty = make_empty_plot()
        if trip_start is None:
            text = '''Please select a start location on the map.'''
        else:
            text = '''Please select a destination location on the map.'''
        return fig_empty, fig_empty, text

    trip_detail_data = compute_trip_details(days, hours, trip_start, trip_end)

    counts = trip_detail_data['counts']
    counts_total = np.array(trip_detail_data['counts_total'])
    counts_duration = np.array(trip_detail_data['counts_duration'])

    # The trip cost
    fig_amount, peak_amount = create_figure_histogram(
        df_original.bin_edges(df_original.total_amount, [0, 50], shape=25),
        counts_total,
        title=None,
        xlabel='Total amount [$]',
        ylabel='Number of rides')

    # The trip duration
    fig_duration, peak_duration = create_figure_histogram(
        df_original.bin_edges(df_original.trip_duration_min, [0, 50], shape=25),
        counts_duration,
        title=None,
        xlabel='Trip duration [min]',
        ylabel='Number of rides')

    trip_stats = f'''
**Trip statistics:**
- Number of rides: {counts}
- Most likely trip total cost: ${peak_amount}
- Most likely trip duration: {peak_duration} minutes
'''

    return fig_amount, fig_duration, trip_stats
```
The callback function above responds to changes in the control panel as well as to the selection of new origin or destination points. When one of these events is registered, the callback is triggered and calls the compute_trip_details and create_figure_histogram functions with the new parameters, and the visualizations are updated with the results.
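One special case arises when a user has selected a starting point but has not yet clicked on a destination. In that case the histograms are blanked out. The callback above does this with make_empty_plot, defined next to the histogram helper; the same function is shown below under the name create_figure_empty.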
```python
def create_figure_empty():
    layout = go.Layout(plot_bgcolor='white', width=10, height=10,
                       xaxis=go.layout.XAxis(visible=False),
                       yaxis=go.layout.YAxis(visible=False))
    return go.FigureWidget(layout=layout)
```
Finally, the code below runs the dashboard.
```python
if __name__ == '__main__':
    app.run_server(debug=True)
```
And there we have it: a simple yet powerful interactive Dash application! To run it locally, download the taxi data from the GitHub page and execute the python app.py command in your terminal, provided that you have named your source file "app.py". You can also review the entire source file via this GitHub Gist.
Diverse Visualization plots
Plotly offers many kinds of plots. Apart from the typical heatmaps and histograms, the dashboard includes several interactive but less common ways to visualize aggregated data. On the first tab, there is a geographical map of NYC zones colored by the number of taxi pick-ups. A user can select a pick-up and a destination on the map and learn about popular destinations (zones and boroughs) via the Sankey and Sunburst diagrams. This functionality is built in the same way as the Trip planner tab above. At its core are groupby operations that bring the data into the format Plotly requires. The code is available in the GitHub repository listed in the references.
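As an idea of what such a groupby produces, the sketch below counts trips per (pick-up zone, destination borough) pair in plain Python and arranges the result as node labels plus source, target, and value lists, which is the shape of data a Plotly Sankey trace takes. The trips and zone names are made up, and this is not the dashboard's actual code.

```python
from collections import Counter

# Made-up trips: (pick-up zone, destination borough).
trips = [
    ("Midtown", "Queens"),
    ("Midtown", "Queens"),
    ("Midtown", "Brooklyn"),
    ("Harlem", "Queens"),
]

# The "groupby": count trips per (origin, destination) pair.
flows = Counter(trips)

# Origins and destinations are kept as separate node sets.
origins = sorted({origin for origin, _ in flows})
destinations = sorted({destination for _, destination in flows})
labels = origins + destinations

links = {"source": [], "target": [], "value": []}
for (origin, destination), count in sorted(flows.items()):
    links["source"].append(origins.index(origin))
    links["target"].append(len(origins) + destinations.index(destination))
    links["value"].append(count)

print(labels)
print(links)
```

The aggregated result has one entry per distinct flow rather than one per trip, so it stays small even when the underlying table has many millions of rows.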
In Conclusion
- Vaex is a Python library for processing and visualizing large tabular datasets with out-of-core DataFrames (similar to Pandas). It supports interactive visualization of big data such as histograms, density plots, and 3D volume rendering. To optimize performance, Vaex uses memory mapping, a zero memory copy policy, and lazy computations, and it can export data in HDF5 format, which is readable from other programming languages.
- Vaex memory-maps the data, so it is not loaded into RAM all at once, and the same physical memory is shared amongst all processes. This is quite useful in Dash, which uses workers to scale vertically and Kubernetes to scale horizontally. In addition, Vaex processes large datasets with efficient, fully parallelized out-of-core algorithms.
- Vaex works together with Dash to build simple, interactive analytical dashboards or web applications. Dash applications are reactive: when a user pushes a button or moves a slider, a callback runs on the server and updates the application. Because the server is stateless, it keeps no memory of a user's previous interactions, and Dash can scale both vertically (more workers) and horizontally (more nodes). Vaex fits this model by computing filters and aggregations for each request without modifying or copying the data.
Reference
- GitHub - dash-120million-taxi-app
https://github.com/vaexio/dash-120million-taxi-app
- Interactive and scalable dashboards with Vaex and Dash
https://medium.com/plotly/interactive-and-scalable-dashboards-with-vaex-and-dash-9b104b2dc9f0