Title: Thinking Data Science: A Data Science Practitioner's Guide
Author: Poornachandra Sarang
Publisher: Springer
Series: The Springer Series in Applied Machine Learning
Year: 2023
Pages: 366
Language: English
Format: pdf (true), epub
Size: 77.2 MB
This definitive guide to Machine Learning projects answers the questions an aspiring or experienced data scientist frequently faces: Which technology should I use for my ML development? Should I use GOFAI, ANN/DNN, or Transfer Learning? Can I rely on AutoML for model development? What if the client provides me gigabytes or terabytes of data for developing analytic models? How do I handle high-frequency dynamic datasets? This book provides the practitioner with a consolidation of the entire data science process in a single "Cheat Sheet". To understand the data, several libraries are available for data visualization. The most commonly used Python libraries are Matplotlib and Seaborn, which provide a very advanced level of visualization.
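As a quick illustration (not taken from the book), here is a minimal Matplotlib sketch of the kind of exploratory plots such libraries produce; the data is synthetic and the plot choices are only examples. Seaborn builds on the same Matplotlib foundation.

```python
# A minimal sketch of exploratory visualization with Matplotlib.
# The dataset here is synthetic; in practice you would plot columns
# of your own DataFrame.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=20)      # distribution of a single feature
ax1.set_title("Feature distribution")
ax2.scatter(x, y, s=10)   # relationship between feature and target
ax2.set_title("Feature vs. target")
fig.tight_layout()
fig.savefig("eda.png")
```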
CatBoost is yet another open-source library that provides a gradient boosting framework. Originally developed at Yandex, it was open-sourced in July 2017 and is still in active development by Yandex and the community. With their novel gradient boosting scheme, the designers were able to reduce overfitting. It also provides excellent GPU support for faster training and inference. Besides classification and regression tasks, it also supports ranking. It works on Linux, Windows, and macOS and is available in Python and R. Models built with CatBoost can be used for predictions in C++, Java, C#, Rust, and other languages. In practice, it is used in many applications developed by Yandex and by companies such as CERN and Cloudflare, in areas including search, recommendation systems, personal assistants, self-driving cars, and weather forecasting. I will now show you how to use CatBoost in your projects.
Chapter 1 (Data Science Process) introduces you to the data science process that a modern data scientist follows in developing those highly acclaimed AI applications. It describes both the traditional and the modern approach to model building. Today, a data scientist has to deal not just with numeric data but also with text and image datasets. High-frequency datasets pose another major challenge. Because a very large number of machine learning algorithms can be applied to a dataset, the model development process becomes time-consuming and resource-intensive. The chapter introduces AutoML, which eases model development and hyperparameter tuning for the selected algorithm.
Chapter 2 (Dimensionality Reduction) teaches you several techniques for bringing the dimensions of your dataset down to a manageable level. The chapter gives you exhaustive coverage of the dimensionality reduction techniques a data scientist relies on.
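To make this concrete, here is a hedged sketch of one widely used technique, PCA, using scikit-learn; the choice of two components is purely illustrative.

```python
# A sketch of dimensionality reduction with PCA (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)          # 150 samples, 4 features
X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print(X_2d.shape)                           # (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```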
Chapter 3 (Regression Analysis) discusses several regression algorithms, from simple linear regression to Ridge, Lasso, ElasticNet, Bayesian, and Logistic regression. You will learn their practical implementations and how to evaluate which one best fits a given dataset.
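A sketch of the comparison idea, fitting three of the named regressors on synthetic data and scoring each on a held-out split; the alpha values are illustrative, not recommendations from the book.

```python
# Comparing linear, Ridge, and Lasso regression on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=20, noise=10.0,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X_tr, y_tr)
    scores[type(model).__name__] = model.score(X_te, y_te)  # R^2
    print(f"{type(model).__name__}: R^2 = {scores[type(model).__name__]:.3f}")
```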
Chapter 4 (Decision Trees) deals with decision trees—a fundamental block for many Machine Learning algorithms.
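As a taste of that fundamental block, a minimal decision-tree sketch with scikit-learn; the depth limit is illustrative.

```python
# A minimal decision-tree classifier sketch (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_tr, y_tr)
acc = tree.score(X_te, y_te)   # held-out accuracy
print(acc)
print(export_text(tree))       # the learned rules, human-readable
```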
Chapter 5 (Ensemble: Bagging and Boosting) talks about the statistical ensemble methods used to improve the performance of decision trees. You will learn several algorithms in this chapter, such as Random Forest, ExtraTrees, BaggingRegressor, and BaggingClassifier. Under boosting, you will learn AdaBoost, Gradient Boosting, XGBoost, CatBoost, and LightGBM.
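A hedged sketch contrasting one bagging ensemble (Random Forest) with one boosting ensemble (Gradient Boosting) via cross-validation; the settings are illustrative.

```python
# Bagging vs. boosting on synthetic data (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

results = {}
for name, model in [
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=1)),
    ("GradientBoosting", GradientBoostingClassifier(n_estimators=100,
                                                    random_state=1)),
]:
    cv = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    results[name] = cv.mean()
    print(f"{name}: mean CV accuracy = {cv.mean():.3f}")
```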
Chapter 6 (K-Nearest Neighbors) describes K-Nearest Neighbors, also called KNN, one of the simplest classification algorithms and a common starting point.
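A minimal KNN sketch; k=5 is a common default, not a recommendation from the book, and scaling is included because KNN relies on distances.

```python
# KNN classification with feature scaling (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Scale first: distance-based methods are sensitive to feature units.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
print(acc)
```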
Chapter 7 (Naive Bayes) describes the Naive Bayes classifier, which is based on Bayes' theorem, along with its advantages and disadvantages. I also discuss the various types, such as Multinomial, Bernoulli, Gaussian, Complement, and Categorical Naive Bayes. Naive Bayes is useful for classifying huge datasets.
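A sketch of two of the named variants: Gaussian NB for continuous features and Multinomial NB for count data. The toy word-count matrix is invented for illustration.

```python
# Two Naive Bayes variants (scikit-learn assumed).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements
X, y = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X, y)
print(gnb.score(X, y))

# Multinomial NB: toy word-count vectors (hypothetical data)
counts = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 3, 5]])
labels = np.array([0, 1, 0, 1])
mnb = MultinomialNB().fit(counts, labels)
print(mnb.predict([[1, 0, 0]]))  # → [0]: word 0 dominates class 0
```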
Chapter 8 (Support Vector Machines) gives in-depth coverage of this algorithm.
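A minimal SVM sketch with an RBF kernel; C (and gamma) are the usual knobs to tune, and the values here are just library defaults.

```python
# SVM classification with an RBF kernel (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

# Scaling matters for SVMs, since kernels operate on distances.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_tr, y_tr)
acc = svm.score(X_te, y_te)
print(acc)
```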
Now comes the next challenge for a data scientist: clustering a dataset without labeled data points. We call this unsupervised learning. Part II, comprising Chaps. 9 through 16, is devoted to clustering and gives you in-depth coverage of several clustering techniques.
Chapter 9 (Centroid-Based Clustering) discusses centroid-based clustering algorithms, which are among the simplest and a natural starting point for clustering huge spatial datasets. The chapter covers both the K-Means and K-Medoids clustering algorithms.
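A K-Means sketch on synthetic blobs; the "true" number of clusters is known here only because the data is synthetic, and the silhouette score is one common way to judge the result.

```python
# K-Means clustering on synthetic blobs (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
sil = silhouette_score(X, labels)   # closer to 1 = better separated
print(km.cluster_centers_.shape)    # (3, 2)
print(sil)
```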
Part III (ANN: Overview) provides an overview of this technology. In Chap. 17 (Artificial Neural Networks), I introduce you to ANN/DNN technology.
Chapter 18 (ANN-Based Applications) deals with two practical examples of using ANN/DNN: one with text data and the other with image data.
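As a small stand-in for the image example (the book itself may use a different framework such as TensorFlow/Keras), here is a sketch using scikit-learn's MLPClassifier on the 8x8 digits dataset.

```python
# A small neural network on image data (scikit-learn's MLPClassifier).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)   # 8x8 grayscale digit images
X = X / 16.0                          # scale pixel values to [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
mlp.fit(X_tr, y_tr)
acc = mlp.score(X_te, y_te)
print(acc)
```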
Chapter 19 (Automated Tools) talks about the automated tools for developing machine learning applications.
The last chapter, Chap. 20 (Data Scientist's Ultimate Workflow), is the most important one. It brings together everything you have learned. In this chapter, I provide a definite path and guidelines for developing those highly acclaimed AI applications and becoming a modern data scientist.
Download Thinking Data Science: A Data Science Practitioner's Guide