Machine Learning Techniques

Machine Learning Techniques: a blog on machine learning, statistical engineering, data science, computational statistics, and more.

The Myth of Analytic Talent Shortage
20/08/2024

I tested the job market in the last two weeks, both as an applicant, and as a hiring manager. I share my experience here. It is radically different from what you read in the news, or from what most people say. Data scientists and machine learning engineers looking for a new job are out there.… Re...

Build Scalable AI-Powered Recommendation Systems with Fast Vector Search: Tutorial, Case Studies
19/08/2024

2-hour workshop featuring scalable, high-performance architecture, best practices, tips, case studies, coding sessions, exercises, real-time, unstructured big data, and more. Learn how to leverage the very tools that power the best LLMs.

Recording and GitHub material will be available to registrants who cannot attend the free workshop. Upon request, participants will receive a copy of my book "State of the Art in GenAI & LLMs — Creative Projects, with Solutions".

Register at https://mltblog.com/4cyc5og

How to dramatically improve GPT and related apps such as Google search and internal search boxes, with simple techniques and no training
19/08/2024

Common Errors in Machine Learning due to Poor Statistics Knowledge
19/08/2024

Probably the worst error is thinking there is a correlation when that correlation is purely artificial. Take a data set with 100,000 variables, say with 10 observations. Compute all the (99,999 * 100,000) / 2 cross-correlations. You are almost guaranteed to find one above 0.999. This is best illustr...
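This effect is easy to reproduce at a smaller scale. Below is a minimal sketch of my own (not code from the article), using 2,000 independent variables with 10 observations each; since every column is independent noise, any large cross-correlation is purely artificial:

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_vars = 10, 2000      # scaled down from the 100,000 variables in the text

# independent columns: any strong cross-correlation is an artifact
X = rng.standard_normal((n_obs, n_vars))
C = np.corrcoef(X, rowvar=False)   # 2000 x 2000 correlation matrix
np.fill_diagonal(C, 0.0)           # ignore trivial self-correlations

max_r = float(np.abs(C).max())
print(f"largest spurious |correlation|: {max_r:.3f}")
```

With about 2 million variable pairs and only 10 observations per variable, the largest absolute correlation reliably lands well above 0.9, despite the data being pure noise.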

Maximum runs in Bernoulli trials: simulations and results
18/08/2024

Bernoulli trials are random experiments with two possible outcomes: “yes” and “no” (in the case of polls), “success” and “failure” (in the case of gambling or clinical trials). The trials are independent from each other: for instance tossing a coin multiple times, or testing the succ...
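As a taste of the kind of simulation the post runs, here is a minimal sketch of my own (parameters assumed, not the article's code): simulate n fair Bernoulli trials and record the longest run of successes. For fair trials, the expected maximum run grows like log2(n).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
trials = rng.integers(0, 2, size=n)   # fair Bernoulli trials: 1 = success

# scan for the longest run of consecutive successes
longest = current = 0
for t in trials:
    current = current + 1 if t == 1 else 0
    longest = max(longest, current)

print(f"longest run of successes in {n} trials: {longest}")
# theory: the expected maximum run is close to log2(n), about 16.6 here
```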

Fuzzy Regression: A Generic, Model-free, Math-free Machine Learning Technique
18/08/2024

Some people climb Mount Everest solo in winter, with no oxygen. Some mathematicians prove difficult theorems using only elementary arithmetic. The proof, despite being labeled “elementary,” is typically far more complicated than those based on advanced mathematical theory. The people accomplishing thes...

NoGAN: Ultrafast Data Synthesizer and New Evaluation Metric - My Presentation at ODSC San Francisco
17/08/2024

Our presentation/workshop about NoGAN at ODSC San Francisco, October 2023. Runs 1000x faster than GAN, consistently delivering better results according to th...

Podcast: Explainable AI, Blackboxes and Synthetic Data
17/08/2024

In this 45-minute podcast hosted by Ben Cole, Executive Editor at TechTarget, Vincent Granville, founder of MLTechniques.com and Executive ML Scientist, discusses the following topics: synthetic data design techniques, and how to identify the business processes where they are most useful; how to test the quality...

LLMs as Recommendation Engines: Stock Market / Portfolio Management Case Study
16/08/2024

Not only for stock recommendations (when and what to buy or sell, and at what prices), but also to automatically manage a well-balanced portfolio that matches your goals: retirement, fast growth, and so on.

This hands-on workshop is for developers and AI professionals, featuring state-of-the-art technology, case studies, code-share, and live demos. Recording and GitHub material will be available to registrants who cannot attend the free 60-min session.

You’ll learn:

- The fundamentals of Large Language Models (LLMs) and their application in financial services.
- How to build personalized portfolio recommendation engines using LLMs.
- Strategies for integrating LLMs into existing financial services infrastructure.
- Insights into the benefits and challenges of using LLMs for financial recommendations.
- A live demonstration of creating a personalized portfolio recommendation engine.

Register at https://mltblog.com/3AroPjv

NoGAN: Ultrafast Data Synthesizer – My Talk & Workshop at ODSC San Francisco
16/08/2024

My talk at the ODSC Conference, San Francisco, October 2023. Includes Notebook demonstration, using our open-source Python libraries. View or download the PowerPoint presentation, here. I discuss NoGAN, an alternative to standard tabular data synthetization. It runs 1000x faster than GAN, consistent...

Book: Understanding Deep Learning
16/08/2024

By Simon Prince, computer science professor at the University of Alberta. To be published by MIT Press, Dec 2023. The author shares the associated Jupyter notebooks on his website, here. Very popular, it received over 5,000 likes when the author announced the upcoming book on LinkedIn. I pre-ordered my c...

How to Speed-up Full-stack Apps by Factor 100x: Case Study
15/08/2024

Discover how SingleStore Kai revolutionizes application performance, enabling you to build and deploy lightning-fast full-stack apps with ease. Learn from industry experts through practical insights, live demonstrations, and real-world use cases that highlight the speed and efficiency gains achievable with SingleStore Kai.

You will Learn:

- Techniques for optimizing full-stack applications to achieve 100X faster performance.
- Hands-on examples of a full-stack NextJS starter app.
- JSON query comparisons over ecommerce data.
- Running vector queries over JSON data.
- Insights into the benefits of using SingleStore Kai for full-stack development.
- The fundamentals of integrating MongoDB and JSON data with SingleStore Kai.
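To give a flavor of what "running vector queries over JSON data" means, here is a hedged, plain-Python sketch (hypothetical records and embeddings of my own; this is not the SingleStore Kai API): each JSON product record carries an embedding, and a query vector retrieves the nearest records by cosine similarity.

```python
import json
import numpy as np

# hypothetical JSON e-commerce records, each with a small embedding
records = [
    {"id": 1, "name": "laptop",  "embedding": [0.9, 0.1, 0.0]},
    {"id": 2, "name": "charger", "embedding": [0.8, 0.2, 0.1]},
    {"id": 3, "name": "novel",   "embedding": [0.0, 0.1, 0.9]},
]
docs = [json.loads(json.dumps(r)) for r in records]   # round-trip through JSON

def vector_query(query_vec, docs, k=2):
    """Return the k records whose embeddings are closest to query_vec (cosine)."""
    q = np.asarray(query_vec, dtype=float)
    q /= np.linalg.norm(q)
    sims = []
    for d in docs:
        v = np.asarray(d["embedding"], dtype=float)
        sims.append(float(v @ q / np.linalg.norm(v)))
    order = np.argsort(sims)[::-1][:k]                # highest similarity first
    return [docs[i] for i in order]

top = vector_query([1.0, 0.0, 0.0], docs)
print([d["name"] for d in top])   # → ['laptop', 'charger']
```

A production vector database performs the same nearest-neighbor logic, but with indexing and approximate search so it scales far beyond a Python loop.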

This hands-on workshop is for developers and AI professionals, featuring state-of-the-art technology, case studies, code-share, and live demos. Recording and GitHub material will be available to registrants who cannot attend the free 60-min session.

Register at https://mltblog.com/3M4jgtQ

The Short Lifecycle of Tech Careers
15/08/2024

Strategies to help you prepare for the transition to the second part of your career, at 40, when getting a tech job becomes almost impossible

Book: Stochastic Processes and Simulations - A Machine Learning Perspective
15/08/2024

The book covers supervised classification, including fractal classification, as well as unsupervised clustering, using an innovative approach. Datasets are first mapped onto an image, then processed using image filtering techniques. I discuss the analogy with neural networks, comparing very deep but...
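Here is a minimal toy sketch of the idea of mapping a dataset onto an image and then applying an image filter (my own illustration under assumed parameters, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(5000, 2))     # toy 2-D dataset

# map the dataset onto an image: a 2-D histogram of point counts
img, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=50)

# simple 3x3 mean filter (image smoothing), no external imaging library
padded = np.pad(img, 1)
smooth = sum(padded[i:i+50, j:j+50] for i in range(3) for j in range(3)) / 9.0
print(img.shape, smooth.shape)
```

Once the data lives on a pixel grid, any image-processing technique (smoothing, edge detection, and so on) becomes available as a clustering or classification tool.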

Gentle Introduction To Chaotic Dynamical Systems
14/08/2024

New Book: Gentle Introduction to Chaotic Dynamical Systems - Data Science Central

Detecting Subtle Departures from Randomness
14/08/2024

Entitled “Detecting Subtle Departures from Randomness”, the full version in PDF format is accessible in the “Free Books and Articles” section, here. Also discussed in detail in chapter 9 of my book “Intuitive Machine Learning and Explainable AI”, available here. Figure 1 below shows two...

Cool AI Platform Writes Code, Manages Teamwork, Does GitHub Integration, Summarizes, and More
13/08/2024

From EDA to production, the Zerve platform enables end-to-end AI development that is stable, secure, scalable, and cost-effective.
I recently got early access to their new GitHub integration. Zerve puts a large emphasis on enhancing collaboration and tracking among data teams, and their new GitHub integration is a big step in that direction!

Key highlights:

- Seamless GitHub connection within Zerve's canvas, with standard operations such as commits and source control
- Syncs with issue tracking tools for better project management (I don’t see many data teams operating like this today!)
- Through branching, data teams can start the CI/CD process, staying aligned with engineering teams

Now that this feature is available to the general public, I’ll be very interested to see how data teams and management start to use this, so let me know in the comments what your experience is like! For access, you can request a free trial at https://mltblog.com/4cu6j7c

Book on Poisson-binomial Stochastic Processes and Simulations
13/08/2024

The book covers supervised classification, including fractal classification, as well as unsupervised clustering, using an innovative approach. Datasets are first mapped onto an image, then processed using image filtering techniques. I discuss the analogy with neural networks, comparing very deep but...

GenAI: How to Synthesize Data 1000x Faster with Better Results and Lower Costs
13/08/2024

Here's a rundown of issues with synthesizing data, and how GenAI can exponentially improve speed and lower costs.

8 Ways to Massively Speed Up Database Queries for Faster AI
12/08/2024

These days, it is possible to switch to a different platform while keeping your data and slow queries unchanged, even if they are written in traditional SQL or deal with JSON. See how it is done at https://mltblog.com/3M4jgtQ

The new platform will optimize your database transparently. Here are 8 strategies used to achieve this goal.

• Switch to a different architecture with a better query engine, for instance from JSON or SQL to a vector DB. The new engine may also optimize configuration parameters.

• Efficiently encode your fields, with minimum or no loss, especially for long text elements. This is done automatically when switching to a high performance database.

• Eliminate features or rows that are never used. Work with smaller vectors.

• Leverage the cloud, distributed architecture, and GPU.

• Optimize queries to avoid expensive operations. This can be done automatically with AI, transparently to the user. For instance, when switching to this platform: https://mltblog.com/3M4jgtQ

• Use cache for common queries or rows/columns most frequently accessed.

• Load parts of the database in memory and perform in-memory queries. That's how I get queries running at least 100 times faster in my LLM app, compared to vendors.

• Use techniques such as approximate nearest neighbor search for faster retrieval, especially in RAG apps. This is done automatically when switching to a high performance platform.
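The caching and in-memory bullets above can be made concrete in a few lines. The sketch below is my own (a hypothetical items table, SQLite's in-memory mode, and Python's built-in LRU cache; not any specific vendor platform): repeated queries skip the database entirely.

```python
from functools import lru_cache
import sqlite3

# in-memory copy of a (hypothetical) table, queried repeatedly
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(1000)])

@lru_cache(maxsize=128)          # cache results of common queries
def lookup(name: str) -> int:
    cur = conn.execute("SELECT id FROM items WHERE name = ?", (name,))
    return cur.fetchone()[0]

print(lookup("item42"))   # first call hits the in-memory database
print(lookup("item42"))   # repeat call is served from the cache
```

The same two ideas, an in-memory working set plus a result cache for hot queries, account for much of the speedup described above.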

➡️ To learn more, see https://mltblog.com/3M4jgtQ. This hands-on workshop is for developers and AI professionals, featuring state-of-the-art technology, case studies, code-share, and live demos. Recording and GitHub material will be available to registrants who cannot attend the free 60-min session.

How the New Breed of LLMs is Replacing OpenAI and the Likes
12/08/2024

The new breed of Large Language Models: why and how it will replace OpenAI and the likes, and why the current startup funding model is flawed

The Riemann Hypothesis in One Picture
12/08/2024

With a visual, simple, intuitive method for supervised classification

New Random Generators for Large-Scale Reproducible AI - https://mltblog.com/4fGDLu0
12/08/2024

Modern GenAI apps rely on billions if not trillions of pseudo-random numbers. You find them in the construction of latent variables in nearly all deep neural networks and almost all applications: computer vision, synthetization, and LLMs. Yet few AI systems offer reproducibility, though those described in my recent book do.

When producing that many random numbers, or for strong encryption, you need top-grade generators. The most popular one, adopted by NumPy and other libraries, is the Mersenne twister. It is known for its flaws; I discovered new ones during my research and share them with you.

This paper has its origins in the development of a new foundational framework to prove the conjectured randomness and other statistical properties of the digits of infinitely many simple math constants, such as e or π. Here, I focus on three main areas. First, how to efficiently compute the digits of the mathematical constants in question, to use them at scale. Then, new tests to compare two types of random numbers, those generated by Python versus those from the math constants investigated here, to help decide which system is best. Finally, I propose a new type of strongly random digits based on an incredibly simple formula (one small line of code) leading to fast computations.
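As an illustration of computing the digits of a math constant at scale, here is a short sketch of my own using exact integer arithmetic (this is not the paper's one-line formula, which is in the article itself):

```python
from math import isqrt

def sqrt2_bits(n: int) -> str:
    """First n binary digits of sqrt(2) after the point, via exact integers."""
    x = isqrt(2 << (2 * n))   # floor(sqrt(2) * 2**n), no floating point involved
    return bin(x)[3:]         # drop '0b1': the integer part of sqrt(2) is 1

print(sqrt2_bits(32))
```

Because everything is integer arithmetic, the digits are exact at any scale and do not depend on any library's floating-point or random-generator internals.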

One of the benefits of my proposed random bit sequences, besides stronger randomness and fast implementation at scale, is that they do not rely on external libraries that may change over time. These libraries may get updated and render your results non-replicable in the long term if (say) NumPy decides to modify the internal parameters of its random generator. When security matters, combining billions of constants, each with its own seed, with billions of digits from each constant makes it impossible to guess which formula you used to generate your digits.

Some of my randomness tests involve predicting the value of a string given the values of previous strings in a sequence, a topic at the core of many large language models (LLMs). Methods based on neural networks, mine being an exception, are notorious for hiding the seeds used in the various random generators involved. This leads to non-replicable results. It is my hope that this article will raise awareness about this issue, while offering better generators that do not depend on which library version you use.

Last but not least, the datasets used here are infinite, giving you the opportunity to work on truly big data and infinite numerical precision. At the same time, you get a glimpse of deep number theory results and concepts, explained in simple English.

➡️ To access the full article with code, download paper #44 at https://mltblog.com/3EQd2cA

Why and How I Created my Own LLM from Scratch
11/08/2024

XLLM: new approach to OpenAI / GPT with fast, customized search, simple architecture and better results, based on extreme LLM

New Book: Practical GenAI & Machine Learning - Projects and Datasets
11/08/2024

New Book: Statistical Optimization for GenAI and ML. State-of-the-art on RAG, xLLM, embeddings, fast vector search, NoGAN, synthetization, and more.

New Book: State of the Art in GenAI & LLMs — Creative Projects, with Solutions
10/08/2024

With 23 top projects, 96 subprojects, and 6000 lines of Python code, this vendor-neutral coursebook is a goldmine for any analytic professional or AI/ML engineer interested in developing superior GenAI or LLM enterprise apps using ground-breaking technology. This is not another book discussing the s...

Dynamic Clouds and Landscape Generation: Morphing and Evolutionary Processes. With full Python code, details about the model with connection to random walks and Brownian motions.
10/08/2024

Entitled “Dynamic Clouds and Landscape Generation: Morphing and Evolutionary Processes”, the full version in PDF format is accessible in the “Free Books and Articles” section, here. Also discussed in detail, with Python code, in my book “Synthetic Data”, available here. My previous articl...

10 Types of AI Databases Explained in One Sentence
09/08/2024

There are more database architectures out there than most people think. In this quick overview, I present a taxonomy of the current DB ecosystem.

Vector and graph databases are among the most popular these days, especially for GenAI and LLM apps. Some can handle tasks performed by traditional databases and understand SQL and other languages (NoSQL, NewSQL). Some are optimized for fast search and real time.

➡️ For one of the most efficient and versatile, see https://mltblog.com/3AhZqbP

In vector DBs, features (the columns in a tabular dataset) are processed jointly and encoded, rather than column by column. Graph DBs store information as nodes and node connections. For instance, knowledge graphs and taxonomies with related categories and sub-categories. JSON and bubble databases deal with unstructured data such as text and web content. In my case, I use key-value schemas, also known as hash tables or dictionaries in Python.
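The key-value schema mentioned above can be sketched in a few lines of Python (a toy illustration of my own; real key-value stores add persistence, indexing, and concurrency on top of the same idea):

```python
import json

kv = {}  # the schema is just a hash table: key -> serialized value

def put(key: str, value) -> None:
    kv[key] = json.dumps(value)       # store unstructured data as JSON text

def get(key: str):
    return json.loads(kv[key])

put("doc:42", {"title": "Vector DBs", "tags": ["GenAI", "LLM"]})
print(get("doc:42")["title"])
```

This is exactly the dictionary pattern I use in my own apps: lookups are O(1), and the values can hold arbitrary unstructured content.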

Some DBs are column-oriented while the standard is based on rows. Some fit in memory: they are called in-memory databases, achieving faster execution. Another way to increase performance is via distributed architecture, for instance Hadoop.

In object-oriented databases, data is stored as objects, similar to object-oriented programming languages, which allows for direct mapping of objects in code to objects in the database.

Hierarchical databases are good at representing tree structures, a special kind of graph. Network databases go one step further, allowing more complex relationships than hierarchical databases, in particular multiple parent-child relationships.

For special needs, consider time series, geospatial and multimodel databases (not to be confused with multimodal). Multimodel DBs support multiple data models (document, graph, key-value) within a single engine. Image and soundtrack repositories can also be organized as databases.

➡️ For an example featuring multimodel, see https://mltblog.com/3AhZqbP

In a future article, I will discuss how to write code that writes queries to automatically retrieve information from databases, automating a lot of mundane work.

More Machine Learning Tricks, Recipes, and Statistical Models
09/08/2024

Source for picture: here. The first part of this list was published here. These are articles that I wrote in the last few years. The whole series will feature articles related to the following aspects of machine learning: mathematics, simulations, benchmarking algorithms based on synthetic data (in s...

Number Theory: Longest Runs of Zeros in Binary Digits of Square Root of 2
09/08/2024

Studying the longest head runs in coin tossing has a very long history, starting in gaming and probability theory. Today, it has applications in cryptography and insurance. For random sequences or Bernoulli trials, the associated statistical properties and distributions have been studied in detail...
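As a quick experiment in the spirit of this post, here is a sketch of my own (parameters assumed): compute the first binary digits of sqrt(2) exactly, then measure the longest run of zeros. If the digits behave like random bits, the longest run should be close to log2(n).

```python
from math import isqrt
import re

n = 20_000                          # number of binary digits to examine
x = isqrt(2 << (2 * n))             # floor(sqrt(2) * 2**n), exact integer math
bits = bin(x)[3:]                   # binary digits of sqrt(2) after the point

longest_zero_run = max(len(run) for run in re.findall(r"0+", bits))
print(f"longest run of zeros in first {n} binary digits: {longest_zero_run}")
# for truly random bits, expect roughly log2(n), about 14.3 here
```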
