Taking On OpenAI Across the Board! Google Releases a Multimodal Large-Model Suite: From AI Assistant to Text-to-Video Model

One day after OpenAI stole a march on it, technology giant Google, not to be outdone, unveiled its latest multimodal AI (artificial intelligence) products.
On May 14, local time, in the keynote at the Google I/O developer conference, Google showed off Project Astra, an AI assistant project powered by an upgraded Gemini model; Veo, a text-to-video model positioned against OpenAI's Sora; and, on the hardware side, Trillium, its sixth-generation Tensor Processing Unit (TPU). By the official count released after the event, the keynote, devoted entirely to AI, mentioned "AI" 121 times.
The keynote, devoted entirely to AI, mentioned "AI" 121 times. Source: Google I/O keynote
Sundar Pichai, CEO of Google, said that all of Google's work now centers on the generative AI model Gemini: "We hope everyone can benefit from what Gemini does."
On the 14th, Google (Nasdaq: GOOGL) shares closed at $171.93, up 0.6%, for a total market value of $2.12 trillion.
AI search adds video input; Gemini and Gemma get updates
As a search-engine giant, Google naturally made AI search part of its showcase.
According to the presentation, powered by the latest Gemini, Google Search will gain multi-step reasoning: it can handle long queries with multiple constraints in one go, help users brainstorm, and support video search, letting users find solutions by filming the problem. These features will launch first in the United States, and Google expects to bring them to more than 1 billion people by the end of this year.
Google AI search will have multi-step reasoning ability. Source: Google
Google's Gemini is known for its long context window. At the conference, Google highlighted the multimodal and long-context capabilities of its large model Gemini 1.5 Pro and rolled out a series of updates for it. Gemini Advanced subscribers in more than 150 countries and regions will get the latest Gemini 1.5 Pro with a 1-million-token context window; the model supports more than 35 languages and is priced at $3.50 per 1 million tokens.
According to Pichai, Gemini 1.5 offers "the longest context window of any foundation model to date." Later this year, Gemini 1.5 Pro will expand the window further to 2 million tokens, pushing the boundary of how much multimodal information can be processed at once.
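For a concrete sense of what the long context window and the quoted pricing mean in practice, here is a minimal sketch using Google's google-generativeai Python SDK. The model name, file path, and API-key handling are illustrative, and the $3.50-per-million-token figure is simply the price quoted above applied to the input prompt.

```python
import google.generativeai as genai

# Illustrative setup; assumes the google-generativeai SDK and an API key.
genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# "gemini-1.5-pro-latest" is the long-context model discussed above.
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# A book-length document fits in a single prompt thanks to the
# 1-million-token context window.
with open("long_report.txt", encoding="utf-8") as f:  # hypothetical file
    document = f.read()

prompt = f"Summarize the key decisions in this document:\n\n{document}"

# Count tokens first to estimate cost at the quoted $3.50 per 1M tokens.
tokens = model.count_tokens(prompt).total_tokens
print(f"Prompt tokens: {tokens}, est. input cost: ${tokens / 1e6 * 3.50:.2f}")

response = model.generate_content(prompt)
print(response.text)
```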
Starting this summer, Gemini will support real-time voice interaction, with real-time video interaction to follow later this year. In the coming months, Google will also launch Gems, a custom AI assistant feature similar to OpenAI's GPTs, which can hook into the full suite of Google products.
In addition, for scenarios that demand fast responses, Google introduced the Gemini 1.5 Flash model. Flash will be the fastest Gemini model available in the API, optimized for high-volume, high-frequency tasks, and it retains the 1-million-token long context window.
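Because Flash targets latency-sensitive, high-frequency work, a common pattern is to stream partial output as it is generated rather than wait for the full response. A minimal sketch with the same SDK (again, the model name and prompt are illustrative):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Flash trades some capability for speed, suiting high-frequency tasks
# such as classification, extraction, or short chat turns.
flash = genai.GenerativeModel("gemini-1.5-flash-latest")

# Streaming yields chunks as they are produced, cutting perceived latency.
for chunk in flash.generate_content(
    "In one word, classify the sentiment of: 'The demos landed well.'",
    stream=True,
):
    print(chunk.text, end="", flush=True)
print()
```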
Google also previewed Gemma 2, the next generation of its open-source model Gemma, at the conference. Gemma 2 adopts a new architecture and scales to 27 billion parameters, which Google says delivers breakthrough performance and efficiency.
AI assistant Project Astra takes on GPT-4o
After OpenAI unveiled GPT-4o, an intelligent assistant capable of human-like real-time responses, Google's AI agent project Project Astra made its own high-profile debut.
In the demo video, Astra analyzes and responds to voice commands based on what a phone camera or smart glasses see. It correctly identifies code on a screen, suggests improvements to a circuit diagram, recognizes London's King's Cross through the lens, and reminds the user where their glasses were left.
Project Astra is Google’s vision for future AI assistants. Source: Google
Google's AI assistant will be able to observe the world alongside users through smart glasses. Source: Google
According to the presentation, Google built the Astra prototype on top of Gemini. It processes information faster by continuously encoding video frames and combining the video and voice input into a single timeline of events. Using its speech models, Google has also strengthened the assistant's spoken output, enabling faster responses.
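Google has not published Astra's implementation, but the description above (continuously encode frames, merge video and voice into one time-ordered stream) maps onto a familiar pattern: a bounded, timestamped event buffer the model can query. The sketch below is purely illustrative, and every name in it is hypothetical.

```python
import time
from collections import deque
from dataclasses import dataclass, field

# Illustrative only: Astra's internals are unpublished. This models the
# described pattern of merging encoded frames and speech into one timeline.

@dataclass
class Event:
    timestamp: float
    kind: str        # "frame" or "speech"
    payload: object  # stand-in for a frame embedding or transcribed text

@dataclass
class Timeline:
    max_events: int = 512  # bounded rolling context
    events: deque = field(default_factory=deque)

    def add(self, kind, payload):
        self.events.append(Event(time.time(), kind, payload))
        while len(self.events) > self.max_events:
            self.events.popleft()  # drop the oldest events first

    def recent(self, seconds):
        cutoff = time.time() - seconds
        return [e for e in self.events if e.timestamp >= cutoff]

def encode_frame(frame_bytes):
    # Placeholder encoder: a real system would run a vision model here.
    return hash(frame_bytes) % 10_000

timeline = Timeline()
timeline.add("frame", encode_frame(b"...jpeg bytes..."))
timeline.add("speech", "Where did I leave my glasses?")
# An assistant would answer by reasoning over the recent event window.
print([(e.kind, e.payload) for e in timeline.recent(60.0)])
```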
Even so, judging from the demo video, the Google assistant's responses still appear slightly slower than GPT-4o's, and its voice carries less emotional color.
Pichai said Google plans to add Astra's capabilities to the Gemini app and other products starting this year. He stressed, however, that while the ultimate goal is to have Astra "seamlessly embedded in the company's software," the product will be rolled out cautiously, and "the road to commercialization will be driven by quality."
Beyond the AI-assistant contest, Google also countered OpenAI's Sora with its text-to-video model Veo. Veo can generate high-quality 1080p video from text, image, and video prompts and create "consistent and coherent" shots; users can customize the lighting, camera language, and color style. Google did not, however, announce a launch date for Veo.
Source: Google I/O Keynote Speech
Google also announced a series of generative AI tools for images and music, including Imagen 3, an image model that renders a higher level of detail, and Music AI Sandbox, an AI music tool built in cooperation with YouTube and with musicians.
On the hardware side, Google will launch Trillium, its sixth-generation data-center AI chip (TPU), later this year. Pichai said each chip's compute performance will be 4.7 times that of the fifth generation, achieved by enlarging the chip's matrix multiply units (MXUs) and raising the overall clock speed. The sixth generation is also over 67% more energy-efficient than the fifth, and Google is doubling Trillium's memory bandwidth.
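Google did not break down the 4.7x figure, but peak matrix throughput scales roughly with MXU size times clock rate, so the two levers it names can be sanity-checked with back-of-the-envelope arithmetic. The individual factors below are hypothetical, chosen only so that their product matches the stated 4.7x:

```python
# Hypothetical decomposition of the stated 4.7x per-chip gain.
# Peak matmul throughput ~ (MXU multiply-accumulate units) x (clock rate).
mxu_scale = 4.0      # hypothetical: doubling the MXU in each dimension -> 4x MACs
clock_scale = 1.175  # hypothetical: ~17.5% higher clock
print(f"combined speedup ~ {mxu_scale * clock_scale:.2f}x")  # -> 4.70x

# Reading "over 67% more energy-efficient" as performance per watt:
perf_gain = 4.7
efficiency_gain = 1.67
print(f"implied power draw vs. prior gen: {perf_gain / efficiency_gain:.2f}x")
```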
After the conference, the prominent AI scholar Andrew Ng congratulated Google, saying he looked forward to "a 2-million-token context window and a Gemini with better support for on-device AI," which he believes will bring new opportunities for application builders.
Jim Fan, a senior research scientist at NVIDIA, commented: "One thing Google got right: they are finally getting serious about integrating AI into search ... Google's strongest moat is distribution. Gemini doesn't have to be the best model; it can be the most-used model in the world."
Earlier, in a program broadcast on May 9, Google CEO Sundar Pichai discussed the company's competition with Microsoft and OpenAI in an interview. He said that although Google was late to chatbots, he is not worried about the company's long-term competitiveness, as the AI wave is still in its early days.
On April 25, Google's parent company Alphabet released its results for the first quarter of 2024, ended March 31. Alphabet posted first-quarter revenue of $80.539 billion, up 15% year-on-year, its fastest quarterly revenue growth since early 2022. Net profit was $23.662 billion, up 57% year-on-year, and diluted earnings per share came to $1.89, above the market expectation of $1.51.
The Paper reporter Hu Hanyan