July 10, 2017

Key Considerations for a Cloud Data Warehouse

Mike Boyarski

Senior Director of Product Marketing

Data growth and diversity have put new pressure on traditional data warehouses, prompting a wave of new technology evaluations. The data warehouse landscape offers a variety of options, including popular cloud solutions that offer pay-as-you-go pricing in a package that is easy to use and easy to scale. Here are some considerations to help you select the best cloud data warehouse.

First, Identify Your Use Case

A cloud data warehouse supports numerous use cases for a variety of business needs. Here are some common use cases along with the notable capabilities required for each.

Ad Hoc Analysis

Ad hoc analysis provides guided or open queries to the data warehouse, giving the end user flexibility to explore deeper questions. Users work in native SQL or in an interactive visual analysis tool such as Tableau or Looker. Each query result often prompts the user to dive further into the data, moving from summary or aggregate views into distinct row-level detail. A data warehouse that is good at ad hoc analysis delivers fast, consistent responses across a variety of query types.

How does a data warehouse support ad hoc analysis?

Efficient query processing that can scan, join, and aggregate data in a variety of table structures.
Columnstore table format for optimized disk usage and accelerated aggregate query response.
Relational data format with ANSI SQL query syntax, which provides a familiar, easy-to-use structured language.
Built-in statistical functions such as MAX, MIN, SUM, COUNT, STD, NTILE, and RANK, to name a few, that make it easier to build sophisticated queries.
Data security that shields users from sensitive or unauthorized data through user authentication, role-based access control, and row-level security.
Scalable concurrency for supporting thousands of users running a variety of queries simultaneously.
Native connectivity to leading business intelligence tools for easier visual analysis and collaborative dashboards.
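
To make the summary-to-detail pattern concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for a data warehouse, and the orders table, its columns, and the sample rows are all hypothetical. It shows one aggregate query followed by a drill-down into row-level detail, not any particular product's behavior.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; the "orders" table and its
# columns are hypothetical names made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 300.0), ("west", 50.0), ("west", 25.0)],
)

# Step 1: summary view, one aggregate row per region.
summary = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), MIN(amount), MAX(amount) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()

# Step 2: drill down into row-level detail for the region with the highest total.
top_region = max(summary, key=lambda row: row[2])[0]
detail = conn.execute(
    "SELECT id, amount FROM orders WHERE region = ? ORDER BY amount DESC",
    (top_region,),
).fetchall()

print(summary)
print(top_region, detail)
```

In a real deployment the second query is usually generated by a BI tool when the user clicks into a chart, which is why consistent response time across both query shapes matters.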

Machine Learning and Data Science

Data science and machine learning use a data warehouse to identify trends, discover hidden data relationships, and predict future events with sophisticated algorithms. Machine learning is a technique in which systems learn and improve insight discovery without being explicitly programmed to do so. Data scientists often require large volumes of data to improve their predictions and correlations. Data is often enriched and cleaned, or packaged into sample data sets for faster experimentation. Experiments are commonly performed offline due to the intense processing power required for the analysis. Advances in algorithms, hardware, and machine learning and artificial intelligence tooling have led to more advanced data processing methods that can automatically identify hard-to-find events with relatively little human coordination.

How does a data warehouse support machine learning and data science?

Support a variety of data types including relational, CSV, JSON, and geospatial formats.
Provide native interoperability with data preparation and statistical tooling, such as Spark, SparkML, Python, R, SAS, and TensorFlow.
To maximize resource savings, offer rapid sandbox configuration for quick experimentation with easy spin-up and termination of databases as load requirements change.
To support collaboration and sharing of analyses, offer native connectivity with modern business intelligence tools such as Tableau, Zoomdata, and Looker.

Real-Time and Operational Analytics

Operational analytics applications often manage Key Performance Indicators (KPIs) by querying data continuously. The insights might be used several times a day by people or machines. The speed of response for an operational or real-time analytics solution can vary based on the systems in place and the organizational readiness. Gartner’s Roy Schulte said it best in his report, How to Move Analytics to Real Time:

“Business real time is about situation awareness; sensing and responding to what is happening in the world now, rather than to what happened a few hours or days ago, or what is predicted to happen based on historical data.”

How does a data warehouse support real-time analytics?

Streaming ingestion of data that can be immediately queried.
Fast processing of repeat queries, potentially by thousands of users or applications.
To reduce outages and maintain 24/7 operational support, high availability that includes redundancy and auto-failover.
To improve accuracy and decision speed, exactly-once semantics for real-time data de-duplication and enrichment.
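
One common way to get an exactly-once effect is to accept that a stream may redeliver events and make the write itself idempotent. The sketch below illustrates that idea in Python with an in-memory SQLite table. The events table, its columns, and the ingest helper are hypothetical, and this is one possible approach, not a description of how any specific data warehouse implements it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# event_id is the de-duplication key: the PRIMARY KEY constraint lets the
# database recognize a replayed event. Table and column names are hypothetical.
conn.execute(
    "CREATE TABLE events (event_id TEXT PRIMARY KEY, sensor TEXT, reading REAL)"
)


def row_count():
    """Number of events currently stored."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


def ingest(batch):
    """Insert a batch of events, skipping any event_id already stored.

    Returns the number of rows actually written.
    """
    before = row_count()
    with conn:  # commit the whole batch, or roll it back on error
        conn.executemany(
            "INSERT OR IGNORE INTO events (event_id, sensor, reading) "
            "VALUES (?, ?, ?)",
            batch,
        )
    return row_count() - before


first = ingest([("e1", "s1", 20.5), ("e2", "s1", 21.0)])
# The source redelivers e2 after a retry, along with one new event.
second = ingest([("e2", "s1", 21.0), ("e3", "s2", 19.8)])
print(first, second, row_count())
```

Because the duplicate is dropped at write time, queries that run immediately after ingestion never see a double-counted reading.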

Mixed Workload Analytics

Most organizations want a single source of data to improve decision accuracy and support a variety of workloads across ad hoc, machine learning, and real-time analytics. These expanded use cases place a strong emphasis on performance, security, and user or application concurrency. Due to the variety of applications requiring sub-second data access, mixed workloads can be a challenge to tune and govern.

How does a data warehouse support mixed workload analytics?

A robust, efficient, distributed query processor that can support a broad range of queries without overpaying for extra hardware resources or requiring hard-to-manage database configurations.
Rapid, easy-to-scale architecture that can address changes in workload complexity and user concurrency load.
Comprehensive security to shield users from seeing sensitive data without requiring custom database schemas or views.
Broad data ingestion to support real-time streaming and batch load requirements.

Next Up, Understanding Cloud Data Warehouse Capabilities

As you evaluate your next cloud data warehouse investment, it’s important to know the range of capabilities that are important for your project or business. Below is a list of capabilities organized by category to help you identify the right data warehouse:

Usability

Rapid provisioning: Setup should be self-service and take only a few minutes from sign-up to a running, functioning database
Accessibility: For easy query processing and integration with existing applications, tools, and skills, the environment should support relational data using ANSI SQL
Easy data loading: A guided or integrated data loading process should give users an easy way to deploy a real-time data pipeline or a bulk load
Optimized query processing: The database should have a distributed query optimizer that can process most queries with minimal specialized tuning
Simplified capacity management: As data or user growth expands, the data warehouse should provide managed or automated capacity adjustment to quickly address changing workloads

Performance

Ingest to analysis: Streaming data ingestion with simultaneous query processing ensures the fastest possible insights on live and historical data
Fast queries: Sub-second query response against billions of rows, using vectorized query processing and a columnstore structure, for ad hoc dashboards or operational reports
Operationally tuned: Compiled SQL queries accelerate query execution for added performance gains

Cost

On-demand pricing: A data warehouse does not always need to run 24/7; hourly billing ties payment closely to actual usage
Annual discounts: Reserved pricing discounts should be an option for operational deployments that are always available
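
The trade-off between these two pricing models comes down to simple arithmetic. The Python sketch below computes the yearly usage at which a reserved plan becomes cheaper than hourly billing. The hourly rate and the reserved annual price are hypothetical numbers chosen for illustration and are not any vendor's actual prices.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def on_demand_cost(hours_used, hourly_rate):
    """Yearly cost when paying only for the hours the warehouse runs."""
    return hours_used * hourly_rate


def break_even_hours(hourly_rate, reserved_annual_price):
    """Hours of use per year above which the reserved plan is cheaper."""
    return reserved_annual_price / hourly_rate


# Hypothetical prices for illustration only, not any vendor's actual rates.
HOURLY_RATE = 1.00
RESERVED_ANNUAL = 6000.00

threshold = break_even_hours(HOURLY_RATE, RESERVED_ANNUAL)
print(f"break-even: {threshold:.0f} hours per year")

for label, hours in [("business hours only", 8 * 5 * 52), ("always on", HOURS_PER_YEAR)]:
    cost = on_demand_cost(hours, HOURLY_RATE)
    cheaper = "on-demand" if cost < RESERVED_ANNUAL else "reserved"
    print(f"{label}: {hours} h, on-demand ${cost:,.2f}, cheaper plan: {cheaper}")
```

Under these assumed prices, a warehouse used only during business hours favors on-demand billing, while an always-on operational deployment favors the reserved plan.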

Flexibility

Multicloud: To maximize the proximity of your data and get the ultimate performance for your applications, you need the freedom to choose the cloud service provider you prefer or have standardized on
Hybrid cloud: Maintain existing investments by spanning data warehouse investments across on-premises and cloud on a single platform
Elastic: Driven by growth in data, users, or query sophistication, rapidly scale out or down for new capacity requirements
Interoperable: To ensure compatibility with existing tools, applications, and skills, support JDBC/ODBC connectivity, MySQL wire protocol, and ANSI SQL

Scalability

Concurrency support: Scale-out distributed architecture ensures that high volume ingest and write queries do not degrade dashboard or report performance
High availability: Efficient replication and a distributed architecture ensure no single point of failure for operational requirements
Durable: All data should reside on disk for audit or regulatory requirements along with expedited recovery from unexpected failures

Security

Comprehensive: Data should be secured across the analysis lifecycle with single sign-on (SSO) authentication, role-based access control (RBAC), SSL encryption of data over the wire, encryption of data at rest, granular audit logging, and separation of concerns for database administrators
Consistent: Ensure a consistent security model from on-premises to the cloud, with strong security capabilities across deployments

Conclusion: Considerations for Singlestore Helios

Singlestore Helios offers all the capabilities described above as a full-featured cloud data warehouse that is easy to set up and use, supporting a mix of workloads in a single integrated platform. The product delivers a fast, flexible, and secure environment that is capable of analyzing both live and historical data. The pay-as-you-go service gives organizations an affordable approach to real-time analytics. Try Singlestore Helios today and get a $300 free credit, good for up to 300 hours of free usage.
