A big data platform is a complex and sophisticated system that enables organizations to store, process, and analyze large volumes of data from a variety of sources.
It is composed of several components that work together within a secure and governed platform. As such, a big data platform must meet a variety of requirements to ensure that it can handle the diverse and evolving needs of the organization.
Note that, given the extensive nature of the domain, it is not feasible to provide a comprehensive and exhaustive list of requirements. We invite you to contact us to share additional enhancements.
Data ingestion
This area covers the ingestion of data from various sources, its processing, and its storage in a suitable format.
- Data sources: Ability to consume data from various sources, including databases, file systems, APIs, and data streams.
- Ingestion mode: Ability to consume data in both batch and streaming modes (a sketch follows this list).
- Data format: Support for reading and writing file and table formats such as JSON, CSV, XML, Avro, Parquet, Delta Lake, and Iceberg.
- Data quality: Definition of the quality requirements for the data, such as completeness, accuracy, and consistency, and assurance that the ingestion pipeline can validate and cleanse the data as needed.
- Data transformation: Determine whether the data needs to be transformed or enriched before it can be stored or analyzed.
- Data availability: Ensure that the ingestion pipeline can handle failures or outages of the data sources or of the pipeline itself, and can recover and resume ingestion without data loss.
- Volume: Provide solutions capable of addressing expected volume and throughput variations.
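As a minimal sketch of the two ingestion modes, the following PySpark example reads CSV files in batch and consumes a Kafka topic as a stream, landing both as Parquet. The paths, broker address, and topic name are hypothetical, and the streaming part assumes the Kafka connector for Spark is available in your deployment.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Batch ingestion: read CSV files from a landing zone and persist them as Parquet
batch_df = spark.read.option("header", "true").csv("/landing/sales/*.csv")
batch_df.write.mode("append").parquet("/raw/sales")

# Streaming ingestion: consume a Kafka topic and write it continuously to the lake
stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
             .option("subscribe", "sales-events")               # hypothetical topic
             .load())
query = (stream_df.writeStream
         .format("parquet")
         .option("path", "/raw/sales_events")
         .option("checkpointLocation", "/checkpoints/sales_events")
         .start())
```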
Data storage
This area covers the storage, management, and retrieval of large volumes of data.
- Availability: The ability to access the data reliably and with minimal downtime, ensuring high availability of the data.
- Durability: The ability to ensure data is not lost due to hardware failures or other errors, with data replication and backup strategies in place.
- Performance: The ability to store and retrieve data quickly and efficiently, with low latency and high throughput.
- Elasticity: Storage and management of growing volumes of data, with the ability to scale up and down as needed by acquiring and releasing additional resources.
- Data lifecycle: Data lifecycle management by applying changes, adding missing data, and reverting to a previous version when needed (see the sketch after this list).
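One way to illustrate the data lifecycle requirement is with Delta Lake, one of the table formats mentioned earlier: the sketch below upserts changes into a table and reads back an earlier version through time travel. The table path, column names, and version number are hypothetical, and the session assumes the delta-spark package is configured.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("lifecycle-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

table_path = "/lake/customers"  # hypothetical table location

# Apply changes and add missing rows with an upsert (MERGE)
updates = spark.read.parquet("/staging/customer_updates")  # hypothetical staging data
(DeltaTable.forPath(spark, table_path).alias("t")
 .merge(updates.alias("u"), "t.customer_id = u.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Revert to (or inspect) a previous version of the table with time travel
previous = spark.read.format("delta").option("versionAsOf", 3).load(table_path)
```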
Data processing in the data lake
This area includes the processes for preparing and exposing the data for further analysis.
- Flexibility: Ability to support multiple data types and formats, and to integrate with various distributed data processing and analysis tools.
- Data cleaning: Cleanse the data to remove or correct errors, inconsistencies, and missing values.
- Data integration: Combine and integrate multiple data sources into a single dataset, resolving any schema or format differences.
- Data transformation: Transform the data to prepare it for downstream processing or analysis, such as aggregating, filtering, sorting, or pivoting (see the sketch after this list).
- Data enrichment: Enhance the data with additional information to provide more context and insights.
- Data reduction: Reduce the volume of data by summarizing or sampling it, while preserving the essential characteristics and insights.
- Data normalization and denormalization: Normalize the data to remove redundancies and inconsistencies, ensuring that it is stored in a consistent format, and denormalize it where needed to improve performance.
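To make these preparation steps more concrete, here is a small PySpark sketch that chains cleaning, integration, enrichment, transformation, and reduction on hypothetical orders and customers datasets; the paths and column names are illustrative only.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("processing-sketch").getOrCreate()

orders = spark.read.parquet("/raw/orders")        # hypothetical input datasets
customers = spark.read.parquet("/raw/customers")

# Data cleaning: deduplicate, fill missing values, and drop invalid rows
cleaned = (orders
           .dropDuplicates(["order_id"])
           .na.fill({"discount": 0.0})
           .filter(F.col("amount") > 0))

# Data integration / enrichment: add customer attributes to each order
enriched = cleaned.join(customers, "customer_id", "left")

# Data transformation and reduction: aggregate orders into a daily summary
daily = (enriched
         .groupBy("country", F.to_date("order_ts").alias("order_date"))
         .agg(F.sum("amount").alias("revenue"),
              F.countDistinct("customer_id").alias("buyers")))

daily.write.mode("overwrite").parquet("/curated/daily_revenue")
```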
Data observability
This area covers the practice of monitoring and managing the quality, integrity, and performance of data as it flows through the platform.
- Data validation: Ensuring that the data is valid, accurate, and consistent, and meets the expected format and schema (a sketch follows this list).
- Data lineage: Tracking the path of data as it flows through the system to identify any issues or anomalies.
- Data quality monitoring: Continuously monitoring the quality of data and raising alerts when anomalies or errors are detected.
- Performance monitoring: Monitoring the performance of the system, including latency, throughput, and resource utilization, to ensure that the system is performing optimally.
- Metadata management: Managing the metadata associated with the data, including data schemas, data dictionaries, and the data catalog, to ensure that it is accurate and up to date.
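A minimal sketch of data validation and quality monitoring, assuming a PySpark DataFrame: each expectation is a boolean condition every row should satisfy, and failures are reported. The expectation names, columns, and alerting mechanism (a simple print here) are illustrative; a real platform would route failures to its monitoring stack.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-sketch").getOrCreate()
df = spark.read.parquet("/curated/daily_revenue")  # hypothetical dataset

# Declarative expectations: name -> condition that every row must satisfy
expectations = {
    "revenue_not_null": F.col("revenue").isNotNull(),
    "revenue_positive": F.col("revenue") >= 0,
    "country_present": F.col("country").isNotNull(),
}

total = df.count()
for name, condition in expectations.items():
    failed = df.filter(~condition).count()
    if failed > 0:
        # In a real platform this would raise an alert to the monitoring system
        print(f"[DQ ALERT] {name}: {failed}/{total} rows failed")
```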
Data usage
This area includes the requirements to access, transfer, analyze and visualize the data to extract insights and actionable information.
- User interfaces: CLI environments and graphical interfaces available to users for data processing and visualization.
- Communication interfaces: Provision of data access via REST, RPC, and JDBC/ODBC communication protocols (see the sketch after this list).
- Data mining: Perform exploratory data analysis to understand data characteristics and quality, and extract patterns, relationships, or insights from the data using statistical or machine learning algorithms.
- Data access: Ensure that the data is secure and protected from unauthorized access or breaches, by implementing appropriate security controls and protocols.
- Data visualization: Visualize the data to communicate insights and findings to stakeholders, using charts, graphs, or other visualizations.
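As an illustration of JDBC-based data access combined with basic exploration, the sketch below reads a table from a relational warehouse through Spark's JDBC source and runs two quick summaries. The connection URL, table, and credentials are hypothetical, and the appropriate JDBC driver must be available on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-sketch").getOrCreate()

# Consume curated data over JDBC (URL, table, and credentials are hypothetical)
report_df = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://warehouse:5432/analytics")
             .option("dbtable", "public.daily_revenue")
             .option("user", "analyst")
             .option("password", "***")
             .load())

# Simple exploratory analysis before handing the data to a visualization tool
report_df.describe("revenue").show()
report_df.groupBy("country").sum("revenue").orderBy("sum(revenue)", ascending=False).show(10)
```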
Platform security and operation
This area covers the security and the management of a big data platform.
- Data regulation and compliance: The ability to ensure compliance with data governance policies and regulations, such as data privacy laws, data usage practices, data retention policies, and data access controls.
- Fine-grained access control: Ability to control access and data sharing on all proposed services, with management policies taking into account the characteristics and specificities of each.
- Data filtering and masking: Filtering of data by row and by column, and application of masks on sensitive data (a sketch follows this list).
- Encryption: Encryption at rest and in transit with SSL/TLS.
- Integration into the information system: Integration of users and user groups with the corporate directory.
- Security perimeter: Isolation of the platform in the network and centralization of access through a single entry point.
- Admin interface: Provision of a graphical interface for the configuration and monitoring of services, the management of data access controls, and the governance of the platform.
- Monitoring and alerts: Exposing metrics and alerts that monitor and ensure the health and performance of the various services and applications.
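In practice, row filtering and column masking are usually enforced by platform-level access control policies rather than in application code; the PySpark sketch below simply illustrates their effect on a hypothetical customers table, with illustrative column names and a hypothetical country-based row filter.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("masking-sketch").getOrCreate()
customers = spark.read.parquet("/lake/customers")  # hypothetical table

# Column masking: hash the e-mail and keep only the last four digits of the phone number
masked = (customers
          .withColumn("email", F.sha2(F.col("email"), 256))
          .withColumn("phone", F.concat(F.lit("******"), F.substring("phone", -4, 4))))

# Row filtering: expose only the rows a given role is allowed to see (here: country = FR)
fr_view = masked.filter(F.col("country") == "FR")
fr_view.createOrReplaceTempView("customers_fr_masked")
```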
Hardware and maintenance
This area covers the acquisition of new resources as well as the maintenance requirements.
- Targeted infrastructure: Selection between a cloud or an on-premise infrastructure, taking into account that the cloud offers flexible and scalable storage and processing of large datasets with cost efficiencies, while an on-premise deployment provides greater control, security, and compliance over data but requires significant upfront investment and ongoing maintenance costs.
- Asymmetrical architecture: Dissociation between resources dedicated to storage and processing and, in some circumstances, co-location of processing and data.
- Storage: Provision of a storage infrastructure in line with the volumes expressed.
- Compute: Provision of a computing infrastructure capable of evolving with future usages brought by projects and users in the fields of data engineering, data analysis, and data science.
- Cost-effectiveness: The ability to store and manage data cost-effectively, with consideration of the cost of storage and the cost of managing and operating the storage solution.
- Cost management and total cost of ownership (TCO): Control and calculation of the total cost of the solution, taking into account all the factors and specificities of the platform, such as infrastructure, staff, license acquisition, deadlines, usage, team turnover, technical debt, etc. (a sketch follows this list).
- User support: Support for platform users with the aim of ensuring the acquisition of new skills by the teams, the validation of architecture choices, the deployment of patches and features, and the proper use of the available resources.
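As a trivial illustration of the TCO calculation mentioned above, the following sketch sums yearly cost factors; the factors and figures are purely illustrative, and every platform will have its own cost model.

```python
def yearly_tco(infrastructure: float, staff: float, licenses: float,
               support: float, training: float, hidden: float = 0.0) -> float:
    """Naive yearly total-cost-of-ownership estimate; the factors are illustrative only."""
    return infrastructure + staff + licenses + support + training + hidden

# Hypothetical yearly figures (in EUR) for a small on-premise cluster
total = yearly_tco(infrastructure=120_000, staff=300_000, licenses=40_000,
                   support=25_000, training=15_000, hidden=30_000)
print(f"Estimated yearly TCO: {total:,.0f} EUR")
```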
Conclusion
Overall, a big data platform must be able to handle the diverse and evolving needs of the organization while ensuring that the solution is highly flexible, resilient, and performant, that data is secure, compliant, and of high quality, that insights and findings are communicated effectively across the various stakeholders, and that the platform remains cost-effective to operate over time.