Data Science Central (2024)

18/02/2024

30 Python Libraries that I Often Use https://mltblog.com/3ONhMWi

This list covers well-known as well as specialized libraries that I use frequently. Applications include GenAI, data animations, LLM, synthetic data generation and evaluation, ML optimization, scientific computing, statistics, web crawling, APIs, SQL, and more. I also mention my own libraries, and issues that I faced with standard ones. In several cases, such as sound generation, I did not use any library. In addition, I included some functions that I regularly call. In many cases, I explain why I had to create my own home-made versions.

30 Python libraries to solve most AI problems, including GenAI, data videos, synthetization, model evaluation, computer vision and more.

15/02/2024

Gemini Ultra Unleashed: Google's Best LLM Now Available https://mltblog.com/3SBZzMz

A lot has changed for the better since the first announcement not long ago.

Hands-on workshop for developers and AI professionals, on state-of-the-art technology. Live demo and code-sharing session to see Gemini Ultra in action. Recording and GitHub material will be available to registrants who cannot attend the free 60-min session.

11/02/2024

Probabilistic ANN: The Swiss Army Knife of GenAI https://mltblog.com/48hQWfY

ANN — Approximate Nearest Neighbors — is at the core of fast vector search, itself central to GenAI, especially GPT and LLM. My new methodology, abbreviated as PANN, has many other applications: clustering, classification, measuring the similarity between two datasets (images, soundtracks, time series, and so on), tabular data synthetization (improving poor synthetizations), model evaluation, […]


10/02/2024

Actions in GPTs: Developer Tips, Tricks & Techniques https://mltblog.com/3utzlDZ

Hands-on workshop for developers and AI professionals, on state-of-the-art technology. Recording and GitHub material will be available to registrants who cannot attend the free 60-min session.

07/02/2024

How to Automate Data Cleaning, in a Nutshell

Issues and solutions to automate data cleaning. Free your data scientists from the most boring tasks, making them happier and reducing costs.

06/02/2024

Massively Speed-Up your Learning Algorithm, with Stochastic Thinning. Includes use case, Python code, regression and neural network illustrations.


06/02/2024

More Fun Math Problems for Machine Learning Practitioners

This is part of a series featuring the following aspects of machine learning: mathematics, simulations, benchmarking algorithms based on synthetic data (in short, experimental data science); opinions, for instance about the value of a PhD in our field, or the use of some techniques; methods, principle...

05/02/2024

Better, Faster, Less Expensive Synthetic Data Without Deep Learning

My talk at the ODSC Conference, San Francisco, October 2023. Includes Notebook demonstration, using our open-source Python libraries. View or download the PowerPoint presentation, here. I discuss NoGAN, an alternative to standard tabular data synthetization. It runs 1000x faster than GAN, consistent...

05/02/2024

AI-based Object/Image Detection for Inventory Management https://mltblog.com/3SMRJRC

Hands-on workshop for developers and AI professionals, on state-of-the-art technology. Recording and GitHub material will be available to registrants who cannot attend the free 60-min session.

This is one of the AI applications where many companies recognize the value and are ready to invest, with guaranteed return thanks to low costs, proven technology, and automation.

Many of the requests we get from potential enterprise clients, even brick-and-mortar companies, are actually focused on this topic: automated classification and management of inventory or digital content, with an interest in automated image labeling and classification, as well as creating document taxonomies and better search tools (sometimes with automated data analysis) to help internal customers quickly find what they need.

05/02/2024

NoGAN: Ultrafast Data Synthesizer and New Evaluation Metric - My Presentation at ODSC San Francisco

Our presentation/workshop about NoGAN at ODSC San Francisco, October 2023. Runs 1000x faster than GAN, consistently delivering better results according to th...

05/02/2024

The Riemann Hypothesis in One Picture

With a visual, simple, intuitive method for supervised classification

04/02/2024

Simple Introduction to Public-Key Cryptography and Cryptanalysis: Illustration with Random Permutations

In this article, I illustrate the concept of asymmetric key with a simple example. Rather than discussing algorithms such as RSA (still widely used, for instance to set up a secure website), I focus on a system easier to understand, based on random permutations. I discuss how to generate these rando...

03/02/2024

GenAI: Fast Vector Search at Scale (Demo on AWS)

Register at https://mltblog.com/3UGF0l5.

ANN stands for Approximate Nearest Neighbors, a faster yet high-quality alternative to exact but slow KNN, for vector search in GenAI contexts (LLM, GPT, multimodal, and so on). My team is actually developing proprietary technology on this topic, with a paper coming soon. In the meantime, if you want to see real enterprise case studies, and an existing fully scaled algorithm in action, this hands-on workshop is for you.

Intended for developers and AI professionals, featuring state-of-the-art GenAI technology. Recording and GitHub material will be available to registrants who cannot attend the free 60-min session.
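To make the exact-versus-approximate trade-off concrete, here is a minimal sketch. It is not the proprietary technology or the workshop's algorithm mentioned above: the function names are hypothetical, the data is random, and the approximate search is a toy random-hyperplane hashing scheme chosen only for illustration. Exact KNN scans every vector; the approximate version scans a single hash bucket and may miss some true neighbors.

```python
import numpy as np

def exact_knn(db, query, k):
    """Indices of the k rows of `db` closest to `query` (Euclidean), full scan."""
    dist = np.linalg.norm(db - query, axis=1)
    return np.argsort(dist)[:k]

def build_buckets(db, planes):
    """Hash each vector by the sign pattern of its projections on `planes`."""
    codes = db @ planes.T > 0
    buckets = {}
    for i, code in enumerate(codes):
        buckets.setdefault(code.tobytes(), []).append(i)
    return buckets

def approx_knn(db, buckets, planes, query, k):
    """Scan only the query's bucket; true neighbors in other buckets are missed."""
    code = (planes @ query > 0).tobytes()
    cand = np.array(buckets.get(code, []), dtype=int)
    if cand.size == 0:
        return cand
    dist = np.linalg.norm(db[cand] - query, axis=1)
    return cand[np.argsort(dist)[:k]]

# Illustrative data only: 2000 random vectors in 16 dimensions, 4 hyperplanes.
rng = np.random.default_rng(0)
db = rng.normal(size=(2000, 16))
planes = rng.normal(size=(4, 16))
buckets = build_buckets(db, planes)

query = db[0]
exact = exact_knn(db, query, 10)
approx = approx_knn(db, buckets, planes, query, 10)

# Recall: share of the true 10 nearest neighbors found by the approximate search.
recall = len(set(exact.tolist()) & set(approx.tolist())) / 10
print("recall:", recall)
```

With 4 hyperplanes the approximate search inspects roughly one sixteenth of the vectors on average. Adding hyperplanes makes the search faster but lowers recall, which is the trade-off that production ANN systems tune.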

02/02/2024

Synthetizing the Insurance Dataset Using Copulas - Towards Better Synthetization

This article is an extract from my book “Synthetic Data and Generative AI”, available here. In the context of synthetic data generation, I've been asked a few times to provide a case study focusing on real-life tabular data used in the finance or health industry. Here we go: this article fills t...

02/02/2024

A Simple Regression Problem

This article is part of a new series featuring problems with solutions, to help you hone your machine learning and pattern recognition skills. Try to solve this problem by yourself first, before looking at the solution. Today’s problem also has an intriguing mathematical appeal and solution: this a...

01/02/2024

Generative AI: Synthetic Data Vendor Comparison and Benchmarking Best Practices

The goal of data synthetization is to produce artificial data that mimics the patterns and features present in existing, real data. Many generation methods and evaluation techniques are available, depending on purposes, the type of data, and the application field. Everyone is familiar with synthetic...

01/02/2024

Book: Intuitive Machine Learning and Explainable AI

Intuitive Machine Learning with focus on explainable AI, human-friendly intelligence, powerful visualizations and applications.

31/01/2024

Machine Learning Cloud Regression: The Swiss Army Knife of Optimization

Entitled “Machine Learning Cloud Regression: The Swiss Army Knife of Optimization”, the full version in PDF format is accessible in the “Free Books and Articles” section, here. Also discussed in detail with Python code in chapter 1 of my book “Intuitive Machine Learning and Explainable AI...

31/01/2024

Better LLMs with Shorter Embeddings: Part 3 https://mltblog.com/3HGj6Xi

Variable Length Embeddings and fast ANN-like search (approximate nearest neighbors) for better, lighter and less expensive LLMs


31/01/2024

18 Differences Between Good and Great Data Scientists

machine learning, data science career, business analytics, data science lifecycle, data visualizations

30/01/2024

How to Choose the Best Machine Learning Technique: Comparison Table

30/01/2024

Creating Embeddings on Large, Real-Time Data with OpenAI https://mltblog.com/3SiMGXF

Hands-on workshop for developers and AI professionals, on state-of-the-art GenAI technology. Recording and GitHub material will be available to registrants who cannot attend the free 60-min session.

I recently showed how to optimize embeddings and RAG architecture in LLMs and GPT-like applications, with home-made systems. This webinar discusses a real business case, with much larger input data in real time, using efficient tools. Embeddings are the central piece.

30/01/2024

New Python Library to Evaluate AI-generated Data and Compare Models

Called GenAI-Evaluation, the library can be used, for instance, to assess the quality of tabular synthetic data. In this case, it measures how faithfully the synthetization mimics the real data it is derived from, by comparing the full joint empirical distributions (ECDF) attached to the two datasets. It works both...
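The idea of comparing joint ECDFs can be sketched in a few lines of NumPy. This is an illustration only and does not use or reproduce the library's API: `joint_ecdf` and `ks_distance` are hypothetical helpers, the datasets are simulated, and the maximum gap is taken over sampled evaluation points, so it approximates the true supremum from below.

```python
import numpy as np

def joint_ecdf(data, points):
    """Fraction of rows in `data` that are <= each point in every coordinate."""
    # Broadcast to (n_points, n_rows, n_features), then require all coordinates.
    below = (data[None, :, :] <= points[:, None, :]).all(axis=2)
    return below.mean(axis=1)

def ks_distance(real, synth, n_points=1000, seed=0):
    """Largest absolute gap between the two joint ECDFs at sampled locations."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([real, synth])
    size = min(n_points, len(pooled))
    points = pooled[rng.choice(len(pooled), size=size, replace=False)]
    gap = np.abs(joint_ecdf(real, points) - joint_ecdf(synth, points))
    return float(gap.max())

# Simulated data for illustration: one faithful and one poor "synthetization".
rng = np.random.default_rng(42)
real = rng.normal(size=(500, 2))
good = rng.normal(size=(500, 2))           # same distribution as `real`
bad = rng.normal(loc=1.5, size=(500, 2))   # shifted distribution

d_good = ks_distance(real, good)
d_bad = ks_distance(real, bad)
print("faithful:", d_good, "shifted:", d_bad)
```

The distance lies between 0 and 1, with values near 0 indicating that the synthetic data reproduces the joint distribution of the real data, including dependencies between features that per-column comparisons would miss.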

29/01/2024

A Synthetic Stock Exchange Played with Real Money. Includes Python code dealing with gigantic numbers using exact arithmetic.

Not only that, but you can predict -- more precisely compute with absolute certainty -- what the value of any stock will be tomorrow. Transaction fees are well below 0.05% and the market, at least in the version presented here, is fair: in other words, a zero-sum game if you play by luck. If instead

29/01/2024

Python Code and Material from the Book "Stochastic Processes and Simulations" - GitHub Repository

This repository contains the material (datasets, code, videos, spreadsheets) related to my book Stochastic Processes and Simulations - A Machine Learning Perspective. - GitHub - VincentGranville/Po...

29/01/2024

An Intriguing Job Interview Question for AI/ML Professionals

Intriguing technical job interview questions for candidates applying to machine learning and AI jobs, with 4 difficulty levels.

28/01/2024

Book: Interpretable Machine Learning

Intuitive Machine Learning with focus on explainable AI, human-friendly intelligence, powerful visualizations and applications. By Vincent Granville Ph.D, published in September 2022. PDF format, 156 pages. Version 1.0 with Python code. The book is available here. For my upcoming course based on thi...

27/01/2024

Build Document/Image Analytics with GPT-4 Vision https://mltblog.com/48Odh69

Showcasing a conceptual application demo that can analyze insurance claims data, interpret PDF documents and photos of car accidents to infer damage types and estimate payouts.

Hands-on workshop for developers and AI professionals, on state-of-the-art GenAI technology. Recording and GitHub material will be available to registrants who cannot attend the free 60-min session.

27/01/2024

New GenAI Evaluation Metric, Ultrafast Search, and Perfect Randomness

This article covers three different GenAI topics. First, I introduce one of the best random number generators (PRNG) with infinite period. Then I show how to evaluate the synthesized numbers using the full multivariate empirical distribution (same as KS that I used for NoGAN evaluation), but this ti...

26/01/2024

My Book on Poisson-binomial Stochastic Processes and Simulations

The book covers supervised classification, including fractal classification, as well as unsupervised clustering, using an innovative approach. Datasets are first mapped onto an image, then processed using image filtering techniques. I discuss the analogy with neural networks, comparing very deep but...
