
Day 2 of QCon San Francisco 2023: JVM Trends, Platform Engineering, Design for Resilience, and Modern ML


Day Two of the 17th annual QCon San Francisco conference was held on October 3rd, 2023, at the Hyatt Regency San Francisco in San Francisco, California. This five-day event, consisting of three days of presentations and two days of workshops, is organized by C4Media, a software media company focused on unbiased content and information in the enterprise development community and the creator of InfoQ and QCon. The day included a keynote address by Neha Narkhede and presentations from these four tracks:

  • Designing for Resilience
    • Hosted by Javier Fernandez-Ivern, staff software engineer at Netflix
    • Offers attendees the chance to delve into hardware failures, unreliable networks, accidents and malicious attacks, providing the tools they need to build resilient systems and empower operators
  • Platform Engineering Done Well
    • Hosted by Daniel Bryant, Java Champion, co-author of "Mastering API Architecture," independent technical consultant and InfoQ news manager
    • Offers attendees the opportunity to explore the people perspective of platform engineering in addition to the technical aspects
  • Modern ML
    • Hosted by Hien Luu, senior engineering manager at DoorDash, author of "Beginning Apache Spark 3," speaker and conference committee chair
    • Offers attendees the chance to delve into the latest trends and techniques for building modern Machine Learning (ML) systems and applications
  • JVM Trends
    • Hosted by Monica Beckwith, Java Champion, First Lego League coach, passionate about JVM performance at Microsoft
    • Offers attendees a deep dive into the transformative potential of the Java Virtual Machine (JVM) as it consistently reshapes the realm of high-performance applications

Wes Reisz, technical principal at Thoughtworks, creator/co-host of The InfoQ Podcast and QCon San Francisco 2023 program committee chair, kicked off the day two activities by welcoming the attendees and providing an overview of day one: there were 21 editorial presentations, three unconference sessions and ten presentations from sponsors. He highlighted a list of recommended day-one sessions based on attendee feedback, namely: Managing 238M Memberships at Netflix, presented by Surabhi Diwan, senior software engineer at Netflix; Sleeping at Scale - Delivering 10k Timers per Second per Node with Rust, Tokio, Kafka, and Scylla, presented by Lily Mara, engineering manager at OneSignal and author of "Refactoring to Rust," and Hunter Laine, software engineer at OneSignal; CI/CD Beyond YAML, presented by Conor Barber, senior software engineer, infrastructure at Airbyte; and Risk and Failure on the Path to Staff Engineer, presented by Caleb Hyde, site reliability engineer at Expel.

Daniel Bryant presented an overview of InfoQ editorial activities and core values. InfoQ editors are software engineers and professional developers, as opposed to professional writers. Approximately 40 pieces of content are published per week in the form of 18 news items, five feature-length articles, 15 presentations and two podcasts. InfoQ editorial content is geared toward the innovator and early adopter personas from the "crossing the chasm" model pioneered by Geoffrey Moore in his book of the same name. While the early majority and late majority categories are important, InfoQ focuses on covering innovation for the enterprise. This model is frequently used at InfoQ as a tool for content creation, especially for the InfoQ Trends Reports.

Pia von Beren, QCon product manager and diversity lead at C4Media, discussed the QCon Alumni Program and efforts to make QCon an experience for everyone, for example, by offering healthy foods and catering to dietary needs at QCon conferences.

Sid Anand, chief architect at Datazoom and committer/PMC member for Apache Airflow, introduced the keynote speaker, Neha Narkhede.

Keynote Address: Generative AI - Shaping a New Future for Fraud Prevention


Neha Narkhede, co-founder of Oscilar and Confluent, and co-creator of Apache Kafka, presented her keynote address entitled Generative AI: Shaping a New Future for Fraud Prevention. The inspiration for the keynote came from her own experience: she was intrigued by applications in the fraud and risk space and realized that companies still struggled with MLOps, so she wanted to share a vision of using generative AI to shape the future of fraud prevention.

Narkhede started her presentation by defining two reasons for increased fraud: an evolution in economic activity as economies shift and consumer behavior changes; and higher expectations of digital experiences in which consumers assume they are secure.

The top fraud trends in the industry include: automation that allows fraudsters to use software or bots to mount more scalable attacks; escalating fraud costs, with global annual fraud amounting to $5.4T; synthetic identity fraud, a fast-growing trend that accounts for 85% of all identity fraud; the balance between customer friction and fraud losses, in which consumers expect seamless digital experiences but the strictest verification procedures cannot always be deployed; and the proliferation of point solutions that focus on a subset of signals and lack a 360° view of a consumer's risk.

Narkhede discussed the evolution of fraud detection across three generations of fraud and risk technology: static rules-based systems, an if-this-then-do-that approach to detect previously known fraud challenges; rules + traditional ML, an approach in which traditional ML models are backstopped by rules; and traditional ML + Generative AI, an approach to detect complex and emerging forms of fraud that may not have been previously seen.

The drawbacks of existing fraud detection methods include: limited scalability due to the increasing complexity of transactions; human oversight, as resource-intensive methods require human intervention; data imbalance, because fraudulent transactions are typically rare compared to legitimate ones; lack of context, which limits effectiveness in identifying more complex or subtle fraud schemes; feature engineering overload due to time-consuming manual effort; and lack of adaptability due to the less agile nature of static rules-based systems and traditional ML models.

Narkhede then introduced the rapid advances of Generative AI, a significant leap forward in fraud detection that offers these advantages: adaptive learning, an adaptive and evolving solution for modern fraud detection; data augmentation, where Generative AI can create synthetic data that mimics real transactions to enrich training datasets and improve model performance; anomaly detection, where Generative AI trains on diverse datasets and "fraud world knowledge," an understanding of what is truly an anomaly; and reduced false positives through sophisticated algorithms that more accurately detect anomalies and fraud.

AI Risk Decisioning, a new foundation for protecting online transactions, is not an incremental improvement but a fundamental shift to combat fraud by aggregating + processing + understanding virtually unlimited risk data sources.

Narkhede explained the six pillars of Generative AI, illustrated with case studies and demos from Oscilar, the AI Risk Decisioning platform she co-founded: knowledge, a 360° cognitive core for risk management; creation, a natural language interface to create customized workflows, rules, models and integrations; recommendations that are proactive and automatic for effective risk mitigation; understanding, human-understandable reasoning to detect new patterns, build the necessary defenses and communicate fraud trends; guidance that allows risk experts to focus on informed and strategic decision-making based on reliable and comprehensive insights; and automation that enables consolidation and analysis of large amounts of data into automatically generated reports.

"Generative AI is not a silver bullet," said Narkhede, as she concluded by stating that the AI Risk Decisioning platform must integrate insights from traditional ML models and Generative AI to form a more complex understanding of fraud and risk.

Highlighted Presentations: Netflix, LinkedIn, JVM for the Cloud, Collaboration and Innovation with Java


How Netflix Really Uses Java was presented by Paul Bakker, a member of the Java Platform team at Netflix, Java Champion, and co-author of "Java 9 Modularity." Bakker put an end to the myth that "Netflix is all RxJava microservices with Hystrix and Spring Cloud and Chaos Monkeys running the show."

Bakker described the original architecture behind the familiar Netflix movie application, accessed via television and other devices, which connects to a Groovy-enabled API server that in turn uses REST and gRPC connections to the various backend services.

The first upgrades featured multiple remote calls, parallel computing and fault tolerance implemented with RxJava and Hystrix. However, there were limitations: a script was required for each endpoint; UI developers generally don't like writing Groovy and Java; and reactive programming is hard.

Bakker then introduced GraphQL Federation, an architectural model that allows multiple GraphQL services, known as subgraphs or federated services, to be combined into a single schema or API, and the concept behind GraphQL as an alternative to the over-fetching and under-fetching issues inherent in REST.

Their GraphQL Federated Gateway connecting to Domain Graph Services (DGS) essentially replaced the original API server to communicate with the various services via gRPC. Benefits included: no API duplication; no server-side development for the UI developers; a shared GraphQL schema; and no Java client libraries.
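To make the Domain Graph Services idea concrete, here is a minimal sketch of a subgraph resolver written with Netflix's open-source DGS framework. The Show type and shows query are illustrative placeholders in the style of the framework's getting-started examples, not Netflix's production schema.

```java
// Hypothetical DGS subgraph resolver; the federated gateway stitches this
// subgraph's schema into the overall GraphQL API.
import com.netflix.graphql.dgs.DgsComponent;
import com.netflix.graphql.dgs.DgsQuery;
import com.netflix.graphql.dgs.InputArgument;

import java.util.List;

@DgsComponent
public class ShowsDataFetcher {

    // Resolves the "shows" field on the Query type of this subgraph's schema.
    @DgsQuery
    public List<Show> shows(@InputArgument String titleFilter) {
        List<Show> all = List.of(
                new Show("Stranger Things", 2016),
                new Show("The Crown", 2016));
        if (titleFilter == null) {
            return all;
        }
        return all.stream()
                .filter(s -> s.title().contains(titleFilter))
                .toList();
    }

    public record Show(String title, int releaseYear) {}
}
```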

Java remains in active development at Netflix. They support Azul Zulu 17, Azul's downstream distribution of OpenJDK, with active testing on JDK 21, running approximately 2800 applications built with approximately 1500 libraries. Gradle, Nebula and IntelliJ IDEA are their preferred build tools.

Bakker provided a retrospective of their JDK 17 upgrade, which delivered performance benefits, especially since they were running JDK 8 as recently as this year. With their active testing on JDK 21, Bakker feels that a subsequent upgrade to JDK 21 will be much faster and that use of Generational ZGC will be a much better fit for a variety of workloads.

"Virtual Threads are not a free lunch," Bakker maintained, as he warned that simply adding virtual threads to an application can decrease performance if the libraries are CPU intensive.

Netflix also supports Spring Cloud with Spring Cloud Netflix, a subproject that provides Netflix open-source software integrations for Spring Boot apps.

The Journey to a Million Ops/Sec/Node in Venice was presented by Alex Dubrouski, technical lead of the Server Performance Team at LinkedIn, and Gaojie Liu, senior staff software engineer at LinkedIn and open-source contributor to Venice. Liu introduced Venice, an open-source derived-data platform and key-value storage system providing characteristics such as: high-throughput asynchronous ingestion; low-latency online reads; and active-active replication.

Liu also presented a graphical representation of the Venice architecture and its related characteristics: Cluster Management, comprising Apache Helix, the Venice Controller, Venice Server and Venice Router; Ingestion, which is eventually consistent and built on Apache Kafka, Apache Samza and RocksDB, an embedded key-value store; Reads, which align with different latency service-level agreements; Admin, a two-layer architecture that is eventually consistent and resilient; and Multi-Region, which provides active-active replication with timestamp-based deterministic conflict resolution and eventual consistency.

Venice Write consists of: Data Merge, which provides improved read latency and read resilience in the write path; the simple and straightforward Dedicated Pipeline (deprecated), which offered fast development but proved inefficient; Drainer Pool (deprecated), a shared data-processing service for improved control of processing resources, but only efficient with fewer data stores; Topic-Wise Shared Consumer Service (default), a shared consumer service featuring a fixed consumer pool, better GC behavior and a 1:1 ingestion task-to-consumer ratio; and Partition-Wise Shared Consumer Service (in development), a shared consumer service featuring a fixed consumer pool, improved throughput and a 1:N ingestion task-to-consumer ratio.

Venice Read consists of: Venice Thin Client, a three-layer architecture with complex components controlled by the backend and easy-to-roll-out routing optimizations; Venice Fast Client, a two-layer architecture with latency reduction and hardware savings; Da Vinci, a client for small use cases; and Read Compute, a DSL that can move some computations to the server via distributed compute for reduced data transfer.

Liu then described three transport-layer optimizations: JDK SSL to OpenSSL, which provides reduced GC overhead, improved throughput and latency, and a 10-15% improvement in end-to-end latency; streaming support that enables end-to-end streaming with minimal intermediate buffering to speed up processing, yielding a 15% reduction in end-to-end latency; and HTTP/2 adoption to prevent connection storms and remove connection warm-up.
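The talk summary does not show how Venice wires in OpenSSL, but a common way to make that swap on the JVM is Netty's SslContext with the OPENSSL provider (backed by the netty-tcnative bindings). The sketch below is a general illustration under that assumption, not Venice's actual code.

```java
// Hedged sketch: prefer the native OpenSSL TLS engine when the
// netty-tcnative bindings are on the classpath, otherwise fall back to JDK SSL.
import io.netty.handler.ssl.OpenSsl;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;

import javax.net.ssl.SSLException;

public class OpenSslClientContext {
    public static SslContext build() throws SSLException {
        SslProvider provider = OpenSsl.isAvailable() ? SslProvider.OPENSSL : SslProvider.JDK;
        return SslContextBuilder.forClient()
                .sslProvider(provider)
                .build();
    }
}
```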

Dubrouski then presented common optimizations and how they affect performance. "Premature optimization is the root of all evil," Dubrouski said, crediting Donald Knuth, Sir Tony Hoare and Edsger Dijkstra, as there are misconceptions in this area.

Dubrouski maintained that JDK version upgrades are the cheapest way to improve performance at scale. For Venice, the migration to JDK 11 improved latency and stop-the-world GC pauses by double-digit percentages. The subsequent migration to JDK 17 improved them even further thanks to thread-local handshakes and concurrent stack walking.

He then introduced RocksDB, an embeddable persistent key-value store for fast storage. Dubrouski said that "even code written in statically compiled languages can be tuned," as accepting the defaults is not always the best choice. A switch from the default block-based table format (with its block cache) to the PlainTable format reduced server-side compute latency by 25%.
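The talk did not include code, but in RocksJava such a switch would look roughly like the sketch below. The tuning is intentionally minimal and the settings are illustrative, not Venice's configuration.

```java
// Hedged sketch of switching RocksDB to the PlainTable format.
import org.rocksdb.Options;
import org.rocksdb.PlainTableConfig;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class PlainTableExample {
    static {
        RocksDB.loadLibrary(); // load the native RocksDB library
    }

    public static RocksDB open(String path) throws RocksDBException {
        Options options = new Options()
                .setCreateIfMissing(true)
                // PlainTable keeps an in-memory index and skips block decompression,
                // which is the kind of change behind the latency reduction described above.
                .setTableFormatConfig(new PlainTableConfig())
                .setAllowMmapReads(true); // required by the PlainTable format
        return RocksDB.open(options, path);
    }
}
```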

After a suspected memory leak, switching to jemalloc, a general-purpose malloc implementation, yielded a 30% reduction in JVM RSS. Dubrouski also introduced FastAvro, a.k.a. Avro-Util, a collection of LinkedIn utilities and libraries that allow Java projects to interoperate more efficiently with Apache Avro. Originally developed by RTBHouse, its runtime-generated serializers and deserializers yielded up to a 90% reduction in deserialization time. Further FastAvro optimizations to reduce memory allocation and improve performance include: primitive object collections, partial deserialization caching, object reuse and VarHandles.
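For context, the fast deserializer is intended as a drop-in replacement for Avro's generic reader; the sketch below shows the shape of that swap. The class and package names (FastGenericDatumReader in com.linkedin.avro.fastserde) are recalled from the avro-util project and should be treated as assumptions rather than verified API.

```java
// Hedged sketch: decode an Avro payload with avro-util's runtime-generated reader.
import com.linkedin.avro.fastserde.FastGenericDatumReader; // assumed class/package name
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

import java.io.IOException;

public class FastAvroExample {
    public static GenericRecord decode(byte[] payload, Schema writer, Schema reader)
            throws IOException {
        // Intended as a drop-in replacement for GenericDatumReader; early calls fall
        // back to vanilla Avro while a specialized deserializer is generated.
        FastGenericDatumReader<GenericRecord> datumReader =
                new FastGenericDatumReader<>(writer, reader);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return datumReader.read(null, decoder);
    }
}
```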

Based on their experience, Dubrouski maintained that observability can be "too much of a good thing": one latency metric was spending 5% of CPU on clock calls, and a large number of different gauges and counters caused significant memory overhead.

Optimizing JVM for the Cloud: Strategies for Success was presented by Tobi Ajila, Eclipse OpenJ9 JVM engineer at IBM. Ajila kicked off his presentation with some cloud statistics: 94% of all companies use cloud computing; and SMEs spent 47% of their technology budgets on cloud services in 2022, a 67% increase from 2021 to 2022. Network, storage and compute are the major costs in deploying to the cloud.

Ajila then provided some tips on how to save in the cloud, namely: scale to zero by not paying for what isn't being used and scale up and down with demand; and increase density, that is, do more with less and increase memory efficiency. He then focused on two main themes: how to improve JVM startup time; and how to improve memory density.

Tools for improving JVM startup time include: OpenJDK; Eclipse OpenJ9; qbicc, an experimental native image compiler for Java; and GraalVM. In particular, Ajila discussed the benefits and drawbacks of class-data sharing, static compilation, Coordinated Restore at Checkpoint (CRaC) and Checkpoint/Restore In Userspace (CRIU). He also introduced the CRIUSupport API via the CRIUSupport class in OpenJ9.
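The sketch below gives a rough sense of how the CRIUSupport API is used: warm up the application, take a checkpoint, and restore from the image later to skip startup work. The package and method names are recalled from the OpenJ9 documentation and should be treated as assumptions; this only runs on OpenJ9 builds with CRIU support enabled (e.g. via -XX:+EnableCRIUSupport).

```java
// Hedged sketch of checkpointing an OpenJ9 JVM with the CRIUSupport API.
import org.eclipse.openj9.criu.CRIUSupport; // assumed package/class name

import java.nio.file.Path;

public class CheckpointExample {
    public static void main(String[] args) {
        // ... warm up the application first: load classes, JIT-compile hot paths ...

        if (CRIUSupport.isCRIUSupportEnabled()) {
            // Write the process image to the given directory; on restore, the JVM
            // resumes from this point instead of repeating startup work.
            new CRIUSupport(Path.of("/tmp/checkpoint"))
                    .setLeaveRunning(false)
                    .checkpointJVM();
        }
        System.out.println("Running (possibly restored from a checkpoint)");
    }
}
```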

Tools for improving memory density include IBM Semeru Runtimes, OpenJ9 and Java itself. Classes, the Java heap, native memory and the just-in-time (JIT) compiler all contribute to the JVM footprint. Ajila described OpenJ9 shared classes for faster startup and a smaller footprint. He provided multiple demos of applications comparing traditional startup against enhanced startup and against InstantOn, in which an improvement of roughly a factor of ten was observed.

The -Xtune:virtualized OpenJ9 command-line option reduced memory consumption by 25%. In a demo of the IBM JITServer, Ajila performed a standard startup and then added load, demonstrating improved ramp-up time, container density and cost; the compilation memory is consumed by the JITServer rather than by the node.

Ajila discussed cloud compilers and compared them with a traditional JIT. He demonstrated how a cloud compiler works and stated that its benefits are provisioning, performance and resiliency, then provided a demo showing how a cloud compiler improves container density and cost. Ajila recommended using cloud compilers in latency-sensitive environments, resource-constrained environments and scale-out deployments.

Ajila summarized his presentation with these tips: using a cloud compiler can improve density in the cloud; features like the CRIUSupport API can boost JVM startup time and help with scale-to-zero policies; and with the right tools, running the JVM in the cloud yields a favorable cost-benefit trade-off.

The Keys to Developer Productivity: Collaborate and Innovate was presented by Heather VanCura, senior director of Standards, Strategy & Architecture at Oracle, director and chairperson of the Java Community Process (JCP) program, MySQL Community/DevRel, and board member.

VanCura started her presentation with a familiar African proverb: "If you want to go fast, go alone; if you want to go far, go together." There is more collaboration and more innovation in the Java community, and she discussed some of the more recent JVM trends as examples, namely: streaming systems at Netflix; high-performance data platforms at LinkedIn; harnessing exotic hardware with Project Panama; and optimized cloud-native transformations at IBM with Eclipse OpenJ9. "We achieve more when we work together," VanCura maintained.

2023 marks significant milestone anniversaries with the Java programming language at 28 years and the Java Community Process (JCP) at 25 years. Both were celebrated at a special event in New York City.

The five main tenets of Java are: performance, stability, security, compatibility and maintainability. Moving Java forward relies on trust, innovation and predictability. Java currently runs on 60 billion JVMs and 38 billion cloud-based JVMs. Since the advent of the six-month release cadence with the release of JDK 10, new Java features, in the form of JDK Enhancement Proposals (JEPs), have been predictable and consistent.

VanCura discussed the recent release of JDK 21, a long-term support release, and provided a more detailed overview of Record Patterns, Pattern Matching for switch, Virtual Threads, Sequenced Collections and Generational ZGC.
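As a brief, self-contained illustration of two of those JDK 21 features, the sketch below combines record patterns with pattern matching for switch; the Shape hierarchy is a hypothetical example, not taken from the talk.

```java
// Record patterns + pattern matching for switch, both finalized in JDK 21.
public class Jdk21Features {
    sealed interface Shape permits Circle, Rectangle {}
    record Circle(double radius) implements Shape {}
    record Rectangle(double width, double height) implements Shape {}

    static double area(Shape shape) {
        // The switch is exhaustive over the sealed hierarchy, so no default is needed,
        // and record patterns deconstruct each case directly into its components.
        return switch (shape) {
            case Circle(double r) -> Math.PI * r * r;
            case Rectangle(double w, double h) -> w * h;
        };
    }

    public static void main(String[] args) {
        System.out.println(area(new Circle(1.0)));     // ~3.14159
        System.out.println(area(new Rectangle(2, 3))); // 6.0
    }
}
```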

There were 2585 resolved issues delivered in JDK 21, 700 of which were contributed by the Java community outside of Oracle.

VanCura proclaimed "strength in numbers" as she enumerated the 28 years of Java, 10 million Java developers, over 360 Java User Groups, and 375 Java Champions.

VanCura provided an overview of the JCP activities that include: the members, the executive committee, the collaborative development process, the compatibility triangle, the Java Specification Request (JSR) development cycle, and examples of JSRs.

Oracle supports community outreach programs such as Java in Education, an initiative to promote Java in local educational institutions, and jDuchess, a program built on the vision of diversity and inclusion, ensuring a vibrant representation in the Java community.

VanCura maintained that all these evolutions and pivots are possible due to collaboration and innovation in the Java community ecosystem. The Java innovation pipeline is stronger than ever, and the collaboration is higher than ever before.

About the Author

Michael Redlich
