Using Fabric Python Notebooks, Daft & OrgApps
Simon Willison is one of my favorite bloggers. In fact, what I blog and how I blog & test are inspired by him. A couple of weeks ago he wrote a post about the Foursquare Places data that has been open-sourced. I was exploring this dataset and ended up creating a few maps.
I love OrgApps in Fabric and I truly believe that, as it matures, it will be THE way for analysts & data scientists to provide rich insights + traditional reports to business users. Notebooks can augment Power BI reports with insights that are otherwise not possible. I have submitted a session on this topic to FabCon ’25, let’s see. If it is selected, I hope to show how transformational it is and how businesses can use it.
I won’t go into much detail about the code below, but a few things to note:
- I used Daft to scan 104M rows from an S3 bucket in a Fabric Python notebook without downloading the entire dataset. Why Daft? Because it’s optimized for reading S3 data. If you run the notebook below, you will see minimal memory & CPU consumption. In his blog post mentioned above, Simon used DuckDB. I cleaned and transformed the data lazily using Daft.
- I also used Polars because it has a nice Altair integration.
- I used Folium for creating interactive maps and Plotly for the time series.
- The notebook is embedded in OrgApps for users to explore the data. You can also embed a Power BI report using QuickVisualize for users to explore the data (as long as it is a small dataset).
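To make the lazy scan in the first bullet concrete, here is a minimal sketch of how it could look with Daft. It is not the code from the notebook: the helper names `places_glob` and `lazy_places` are hypothetical, the column names (`country`, `name`, `latitude`, `longitude`) are my assumption about the schema, and anonymous access to the bucket is also assumed.

```python
def places_glob(release: str) -> str:
    """Build the S3 glob for one release of the Places Parquet files."""
    return f"s3://fsq-os-places-us-east-1/release/dt={release}/places/*.parquet"


def lazy_places(release: str, country: str):
    """Return a lazy Daft DataFrame; no data is read until .collect()."""
    # Imported here so places_glob() works even without Daft installed.
    import daft
    from daft.io import IOConfig, S3Config

    # Assumption: the bucket allows anonymous reads from us-east-1.
    io_config = IOConfig(s3=S3Config(region_name="us-east-1", anonymous=True))
    df = daft.read_parquet(places_glob(release), io_config=io_config)

    # Assumed column names; check df.schema() against the real dataset.
    return (
        df.where(daft.col("country") == country)
        .select("name", "latitude", "longitude")
    )


def preview_as_polars(release: str, country: str, rows: int = 1000):
    """Materialize a small sample and hand it to Polars via Arrow."""
    import polars as pl

    sample = lazy_places(release, country).limit(rows).collect()
    return pl.from_arrow(sample.to_arrow())
```

In a notebook cell, something like `preview_as_polars("2024-11-19", "US")` would trigger the read. Because the filter, the column selection and the limit are declared before `.collect()`, Daft can plan the scan before fetching data rather than loading every row first.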
Steps:
Just download this notebook, import it into your Fabric workspace and execute it.
To get a list of files at this S3 location:
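## list of files
from pyarrow import fs

# Public bucket; pass anonymous=True if no AWS credentials are configured
s3 = fs.S3FileSystem(region='us-east-1')
# Glob pattern matching the Parquet files of this release
path = "s3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/*.parquet"
# List everything under the release prefix, including subfolders
file_info = s3.get_file_info(fs.FileSelector(
    "fsq-os-places-us-east-1/release/dt=2024-11-19/",
    recursive=True
))
for info in file_info:
    print(info.path)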
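The recursive listing returns folders as well as files, so it can help to narrow it down to the Parquet files that the glob in `path` refers to. Below is a small sketch of one way to do that. `parquet_summary` is a hypothetical helper, and the `demo` entries are made-up stand-ins for the `FileInfo` objects returned by pyarrow (only their `path`, `is_file` and `size` attributes are used), not real file names or sizes.

```python
from fnmatch import fnmatchcase
from types import SimpleNamespace


def parquet_summary(file_info, pattern="*/places/*.parquet"):
    """Return (count, total_bytes) for listed files whose path matches pattern."""
    matches = [
        info for info in file_info
        if info.is_file and fnmatchcase(info.path, pattern)
    ]
    return len(matches), sum(info.size for info in matches)


# Made-up stand-ins for pyarrow FileInfo objects, for illustration only.
demo = [
    SimpleNamespace(path="bucket/release/dt=X/places", is_file=False, size=None),
    SimpleNamespace(path="bucket/release/dt=X/places/a.parquet", is_file=True, size=100),
    SimpleNamespace(path="bucket/release/dt=X/places/b.parquet", is_file=True, size=250),
    SimpleNamespace(path="bucket/release/dt=X/other/c.parquet", is_file=True, size=40),
]

print(parquet_summary(demo))  # prints (2, 350)
```

In the notebook, the same helper could be called as `parquet_summary(file_info)` on the listing above to see how many Parquet files there are and how large they are in total before scanning them.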
About the Author:
Sandeep Pawar
Data Science professional with experience in using Data Analytics, Statistics & Machine Learning to create business solutions. I primarily use Microsoft data stack (Power BI, Synapse Analytics, Azure ML) to create scalable data informed business solutions.
Reference:
Pawar, S (2024). Using Fabric OrgApps + Notebooks For Geospatial Data Exploration. Available at: Using Fabric OrgApps + Notebooks For Geospatial Data Exploration [Accessed: 11th December 2024].