Cloudera Accelerates Data Science Workloads for Apache Hadoop

PALO ALTO, Calif., Feb. 17, 2016 -- Cloudera, the global provider of the fastest, easiest, and most secure data management and analytics platform built on Apache Hadoop and the latest open source technologies, today announced new advancements to further Hadoop as a mainstream platform for data science. Building on recent announcements around Apache Spark and Python that better enable data engineering and data science workloads across big data, Cloudera and Continuum Analytics are making it easier to work with the Python ecosystem through seamless integration of the Anaconda platform with Hadoop. In addition, Cloudera, together with the open source community, announced Apache Arrow, a new open source in-memory columnar data format, to support interoperability and improved performance of Python in the Hadoop ecosystem. These efforts will help data scientists to better take advantage of Hadoop using their preferred skills and tools, and lay the foundation for native data interchange and efficient performance for data engineering and machine learning workloads.

Improving the Python Experience for Data Scientists on Hadoop

Python is the language of choice for data scientists and data engineers due to its power, elegance, and robust libraries and third-party integrations for expressing complex workflows. With frameworks like Apache Spark supporting Python, and new emerging tools like Ibis that better support Python natively for big data, Python has become an increasingly popular choice for data engineering and advanced analytics on Hadoop.

To make it easier for data scientists to get started with Python, Cloudera has partnered with Continuum Analytics, the creator and driving force behind Anaconda, a leading open source Python platform. The jointly developed Anaconda for Cloudera packaging provides a simple, fast experience for customers installing Python, including popular packages such as NumPy, pandas, and scikit-learn, on a Hadoop cluster. Users can deploy Anaconda seamlessly through Cloudera Manager and easily build and run Python-based solutions across Cloudera Enterprise, including under Spark.

"We are grateful to have worked with Cloudera to bring Anaconda to the Cloudera ecosystem," said Peter Wang, chief technology officer and co-founder of Continuum Analytics. "The integration of Anaconda and Cloudera’s platform allows enterprises to realize the full potential of their data by making it easier to get started and distribute Anaconda across Hadoop clusters to support critical data science workloads."

Additionally, Cloudera announced its community involvement with the new Apache Arrow project. Together with developers from Amazon, Databricks, Dremio, MapR, Trifacta, and Twitter, Cloudera is developing Arrow as a new in-memory columnar data structure to standardize in-memory processing and interchange across the ecosystem. Its efficient design will also accelerate analytic workloads across Hadoop frameworks (including Impala and Spark), and enable native interoperability for languages like Python and R for better data access and high-performance analytics.

“Cloudera has been paving the way for data scientists and engineers to become more deeply immersed in the Hadoop ecosystem,” said Wes McKinney, software engineer at Cloudera and the creator of Python pandas. “As the technology continues to mature, the vision of Python programmers leveraging the full-scale Hadoop ecosystem for complex data analysis becomes more tangible. We will continue to improve and expand data science capabilities across the platform, including ongoing development to make languages such as Python first-class citizens for the platform.”

These new advancements in making Hadoop more accessible and usable to the data science community are complemented by Cloudera’s recent development and leadership in this area, including:

Spark MLlib in Cloudera 5.5: In the latest Cloudera Enterprise 5.5 release, Cloudera added Spark MLlib, extending Spark’s ease of use and performance gains to machine learning applications within Hadoop. Cloudera also included Spark SQL, which expands the capabilities of Spark for developers and data scientists by allowing SQL to be embedded seamlessly within Spark applications.
Ibis in Cloudera Labs: As a new open source project incubating in Cloudera Labs, Ibis is aimed at enabling advanced data analysis on a 100 percent Python stack and bringing a native Python experience to Hadoop at scale.
SparkOnHBase in Cloudera Labs: Originating in Cloudera Labs and now committed to the Apache HBase 2.0 branch, SparkOnHBase provides more flexibility for building analytic applications that rely on Spark Streaming.
Spark Runner for Apache Beam (incubating) in Cloudera Labs: Originating in Cloudera Labs and now part of the Beam SDK (formerly Google Dataflow), this project helps data scientists more easily build practical, massive-scale data processing pipelines for execution on Spark.
Apache Spark Training: With unprecedented expertise and experience with Hadoop and its ecosystem, Cloudera brings a real-world approach to training and certifications for data scientists and developers to take full advantage of Spark as part of a complete Hadoop platform.

Enabling data scientists to leverage the full power of the Hadoop ecosystem means opening up new possibilities for enterprises looking to build faster, more intelligent data applications and predictive models that improve customer experiences and drive new revenue streams. Through this ongoing evolution, Cloudera is committed to offering seamless accessibility, productivity, and ease-of-use to the data science community.

Learn More at Spark Summit East 2016

Cloudera will be attending Spark Summit East 2016, February 16-18, in New York City. Additionally, Cloudera will be presenting at the show:

Wednesday, February 17 at 3:00 p.m. - “Time Series Analysis with Spark” with Sandy Ryza
Wednesday, February 17 at 6:30 p.m. - “Securing Apache Spark on Production Hadoop Clusters” with Kostas Sakellis at the Spark-NYC meetup (hosted by Collective Media)
Wednesday, February 17 at 7:00 p.m. - “Enabling Python to Become a Better Big Data Citizen” with Wes McKinney at the New York Python Meetup Group (hosted by ODSC)
Thursday, February 18 at 1:50 p.m. - “Top 5 Mistakes When Writing Spark Applications” with Mark Grover and Ted Malaska

For more information on how Cloudera is making Hadoop a primary platform for data science, stop by Booth #103 at the event.

About Cloudera

Cloudera delivers the modern data management and analytics platform built on Apache Hadoop and the latest open source technologies. The world’s leading organizations trust Cloudera to help solve their most challenging business problems with Cloudera Enterprise, the fastest, easiest and most secure data platform available for the modern world. Our customers efficiently capture, store, process and analyze vast amounts of data, empowering them to use advanced analytics to drive business decisions quickly, flexibly and at lower cost than has been possible before. To ensure our customers are successful, we offer comprehensive support, training and professional services. Learn more at http://cloudera.com.

Connect with Cloudera

Read our blogs: cloudera.com/engblog and vision.cloudera.com

Visit us on Facebook: facebook.com/cloudera

Join the Cloudera Community: cloudera.com/community

Cloudera, Cloudera's Platform for Big Data, Cloudera Enterprise Data Hub Edition, Cloudera Enterprise Flex Edition, Cloudera Enterprise Basic Edition, Cloudera Navigator Optimizer and CDH are trademarks or registered trademarks of Cloudera Inc. in the United States, and in jurisdictions throughout the world. All other company and product names may be trademarks of their respective owners.

###

Deborah Wiltshire
Cloudera
[email protected]
+1 (650) 644-3900
