Agenda


Attendees

Data Platforms of the Future

Presentation Deck

https://docs.google.com/presentation/d/16FHt9jcuHLkFVab5cCGKsmM0ucjtsBw2X0kmlYjRlWE/edit#slide=id.g10120aad5df_0_0

Rupert Berk, University of Washington

Shifting technical capabilities for Data Platforms

  • supporting real-time decision-making for needs like COVID vaccination attestation and student retention
  • this has not been a sudden shift!
  • noting the investments in data science and the use of technologies such as predictive analytics
  • Using Amazon S3 for object storage
  • Shift in modeling: data is moving from data warehouses (RDBMS) to storage that enables richer analytics
  • Conscious shift from ETL to ELT (techniques such as schema-on-read).
  • Also moving from batch to streams, from overnight to near-real-time
  • APIs are everywhere, and fundamental to all of this!
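The ETL-to-ELT shift above can be sketched minimally: raw records land untouched in object storage, and a schema is applied only when the data is read. All names here are hypothetical stand-ins, not UW's actual implementation.

```python
import json

# ELT sketch: land raw events untouched (the "raw zone"),
# and apply a schema only at read time (schema-on-read).
raw_zone = []  # stand-in for object storage such as S3


def land(raw_record: str):
    """Extract + Load: store the record exactly as received."""
    raw_zone.append(raw_record)


def read_with_schema(fields):
    """Transform on read: project only the fields a consumer asks for,
    tolerating records that lack some of them."""
    for rec in raw_zone:
        doc = json.loads(rec)
        yield {f: doc.get(f) for f in fields}


land('{"student_id": 1, "vaccinated": true, "extra": "ignored"}')
land('{"student_id": 2}')
rows = list(read_with_schema(["student_id", "vaccinated"]))
```

The point of the pattern is that the second, incomplete record still loads successfully; missing fields surface as nulls at read time instead of failing an upfront ETL job.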

Hybrid Architecture comes together like this:

  • This is one example of a new unified hybrid approach, driving events through a common event hub
  • Data persistence (raw zone → data lake to optimized zone)
  • Quite a nice continuum of data preparation in that Data Persistence Service, from raw (lake) through to optimized (conformed)
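The hub-and-zones idea above can be sketched as a tiny publish/subscribe fan-out: one event hub delivers each event both to a raw zone (persisted as-is) and to an optimized zone (conformed on the way in). The class and zone names are hypothetical, chosen only to mirror the diagram.

```python
# A common event hub fanning out to multiple persistence zones (hypothetical names).
class EventHub:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, fn):
        self.subscribers.append(fn)

    def publish(self, event):
        # Deliver the same event to every subscriber.
        for fn in self.subscribers:
            fn(event)


raw_zone, optimized_zone = [], []
hub = EventHub()
hub.subscribe(raw_zone.append)                      # raw zone: persist as-is
hub.subscribe(lambda e: optimized_zone.append(      # optimized zone: conform
    {"id": e["id"], "type": e.get("type", "unknown")}))

hub.publish({"id": 42, "type": "enrollment", "payload": "..."})
```

Because both zones hang off the same hub, the raw copy is always complete while the conformed copy can evolve independently.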

Ashish Pandit, University of California at San Diego

  • UCSD took on a project for enterprise renewal, migrating to a service-based infrastructure
  • "Each application is kind of a silo when it comes to its data storage, and there is a need to bring these data together for analytic and reporting."
  • Infrastructure supports streaming of data instead of the old batched data transfers
  • Democratized access to data for analytics and reporting
  • This is the next-generation analytics platform at UCSD:
  • Required a strong API platform to support streaming 
  • Eventually this data will connect to ML 
  • The new streaming techniques must be idempotent and support publication-subscription models.
  • 20–30 data sources, some of which are in the cloud, and some of which are on-premises, all feeding into the new "Activity Hub", which involves a couple of hundred endpoint interactions required to harvest the data:
  • Data is stored in SAP HANA
  • Goal is to stream data as much as possible; transformations are done in the activity table. End users only have access to final curated views (CVs)
  • All of the data coming in goes through the "Activity Table", a very wide table of some 3,500 columns... and it's all quite expensive to operate!
  • Apache NiFi plus Kafka plus Apache Airflow (for workflow management)
  • Data access is only allowed via Tableau, Cognos, and other reporting systems. Nobody gets direct access to the data.
  • What is the hierarchy manager? (Youssef says maybe Taxonomy Manager)
  • Some aspects of the architecture were determined by, or inherited from, historical drivers.
  • Security access control is ingrained in SAP HANA; they are looking to integrate it one step earlier in the pipeline.
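The idempotency requirement mentioned above (streaming must be idempotent and support publish/subscribe) can be sketched as a consumer that remembers which event IDs it has already processed, so a redelivered event does not produce a second row. The names (`activity_table`, `evt-1`) are illustrative only.

```python
# Idempotent consumption: processing the same event twice has no extra effect.
processed_ids = set()
activity_table = []  # stand-in for UCSD's very wide Activity Table


def consume(event):
    """Apply an event at most once, keyed by its unique ID."""
    if event["id"] in processed_ids:
        return False  # duplicate delivery: ignore
    processed_ids.add(event["id"])
    activity_table.append(event)
    return True


first = consume({"id": "evt-1", "source": "sis"})
second = consume({"id": "evt-1", "source": "sis"})  # redelivered: no second row
```

In a real deployment the deduplication state would live somewhere durable (e.g., keyed storage alongside the table), since at-least-once delivery is the norm for streaming systems like Kafka.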

Satya Kunta, New York University

  • Serverless functionality is where NYU is heading. 
  • Real-Time Streaming
  • 3Vs of Future Platforms:
    • Velocity: faster time to market; ingestion happens instantly and is event-driven, with no batched ETL processes.
    • Volume: faster 24-hour processing; NYU has campuses around the world.
    • Variety: non-relational data, with a focus on hierarchical data (JSON).

  • Institutional Data Store is where the heavy lifting happens.
  • (9th normal form?)
  • Data security and access control are built into the data.
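One way to read "security built into the data" (and the attribute-based access control raised later in the chat) is row-level tagging evaluated against a requester's attributes at query time. This is a minimal illustrative sketch with hypothetical names, not NYU's actual mechanism.

```python
# Attribute-based filtering sketch: each row carries the attributes it
# requires, and a curated view is computed per requesting user.
rows = [
    {"student": "A", "gpa": 3.7, "requires": {"registrar"}},
    {"student": "B", "gpa": 3.1, "requires": {"registrar", "advising"}},
]


def curated_view(user_attrs: set):
    """Return only the rows whose required attributes the user holds."""
    return [r for r in rows if r["requires"] <= user_attrs]


advisor_view = curated_view({"registrar", "advising"})  # both rows
clerk_view = curated_view({"registrar"})                # student A only
```

Because the policy travels with each row rather than living in the BI tool, every downstream consumer gets the same filtered view of the data.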

  • Using Collibra to assist with data cataloguing and data lineage management.
  • Using Snowflake extensively in the Landing & Staging; Data Processing; Data Warehouse pillars in the middle of that high-level technical-architecture diagram.

Ken Taylor, University of Illinois Urbana-Champaign

  • We have lots of users and lots of types of data but not a strong handle on who has access to or responsibility for what.
  • Traversing data topic areas like student, health, and finance requires different protections, different governance, etc.
  • Legal considerations around protecting the data (health-related, e.g., genomic) as it moves around the world.
  • Making extensive use of AWS organizational units to implement the regional considerations around access and management of data across regions... noting the need to also categorize activities by the risk of the data types at hand.


  • Focus on High Risk (right side of slides). Segmentation by region.
  • Also needing the ability to execute the managed exchange of data across regions (e.g., between EU and US).
  • Where possible, the preference is to leave the data in its "home" region, particularly when considering data related to biometrics, genomics, and other types of sensitive and indigenous data.

Session Chat

  • From jeff kennedy to Everyone:  08:28 AM
    @Ashish = expensive in terms of platform-as-a-service consumption costs? and i'm assuming it's expensive-but-worth-it-because-of-the-business-outcome-value!
    "Nobody gets direct access to the data" = not even your data scientists or analytics specialists?
  • From jeff kennedy to Everyone:  08:42 AM
    Q: to what extent are the security-and-access permissions defined in the source systems able to be transported with the data to be (re)applied in the data services layer and in the information products built with them --- rather than having to be (re)defined and (re)created in the BI & Analytic ecosystem?  Similar question for business logic such as accruals and GPAs...
  • From Ashish Pandit to Everyone:  08:46 AM
    @Jeff, The platform is expensive as a service in general. However, it offers lots of bells and whistles and the question is how fast can we convert those capabilities into business value that it promised and also to other campuses that are interested in similar values
    We are also new to AWS, hence there is a lot of effort going on to optimize our cloud expenditure not just for the data platform but also for the data logistics platform
  • From Dr. Mahmoud Youssef to Everyone:  08:48 AM
    I would love to know the answer to Jeff’s question around transportation of security and business logic
  • From Mary Stevens to Everyone:  08:48 AM
    I think it is always a challenge when you get tools is to leverage them and start showing value.  Bells and whistles are very nice, but it can  take time to get it all deployed
  • From Satya Kunta (he/him/his) to Everyone:  08:49 AM
    The key is to preserve the application layer security to be mapped on to the distribution layer but the challenges come in where there is a different domain data needs to be mashed up at the BI/data layer, this is where a attribute based access controls mechanism and specific business rules needs to be built in at the data/app layer as security policies
  • From Henry Pruitt to Everyone:  08:49 AM
    are there separate instances of the applications that leverage the data?  can you create an integrated view of aggregate information?
  • From Mary Stevens to Everyone:  08:50 AM
    One of the things we are grappling with is the amount that is cultural, business process, vs. making the right technology.
  • From Louis King - Yale University to Everyone:  08:52 AM
    What is the transformational driver around velocity? Are your organizations becoming exponentially more transactional and therefore need that velocity to operate on an hourly basis? Or are you at lifecycle events that are allowing you to rethink your approach? Or something else?
  • From jeff kennedy to Everyone:  08:55 AM
    i sense an anti-pattern around the tension between retro-fitting security and access rules into the aggregated/consolidated data platforms of the future [vs] restricting access to those data platforms only to the few highly-trusted and can-see-everything data-and-analytics team members (and something about subsequent constraints taking that approach will place on an institution's ability to create (and co-create) information products that are useful to its people) = for another day!
  • From Jim Phelps (UW - he/him) to Everyone:  08:57 AM
    @Jeff - I think there is a second anti-pattern between the culture of data jailor-think (by system owners) where they want a business case and to really slowly and carefully define every data element for each and every use; and future need for transaction (even wrong) data that can drive automation and discovery.
  • From Ashish Pandit to Everyone:  08:57 AM
    @Jeff, Regarding “data access”, users other than data analysts don’t have access currently. Eventually data scientists will need some access, and we need to figure out how to provide them proper access
  • From Mary Stevens to Everyone:  08:57 AM
    Jeff, Totally agree, there is a huge culture that wants to keep data tightly controlled, locked in drawers only to be accessed by a privileged few, and then there are some of us hoping that by freeing data (responsibly) the business can harness it and grow and succeed.
  • From Dr. Mahmoud Youssef to Everyone:  09:00 AM
    For GW, we are looking into user experience and personalization in addition to analytics
  • From Glenn Donaldson (OhioState - he/him) to Everyone:  09:00 AM
    Yes & yes. All the things.
    COVID-19 data is an example of the need across all areas.
  • From Mary Stevens to Everyone:  09:01 AM
    Totally agree; the COVID-19 data was very enlightening at our org

