
Big Data with MATLAB

Posted November 1, 2018 ⁄ General

How to work with huge and fast data sets

Big data refers to the dramatic increase in the amount and rate of data being created and made available for analysis.

A primary driver of this trend is the ever-increasing digitization of information. The number and variety of acquisition devices and other data-generation mechanisms keep growing.

Big data sources include streaming data from instrumentation sensors, satellite and medical imagery, video from security cameras, as well as data derived from financial markets and retail operations. Big data sets from these sources can contain gigabytes or terabytes of data, and may grow on the order of megabytes or gigabytes per day.

Big data represents an opportunity for analysts and data scientists to gain greater insight and to make more informed decisions, but it also presents a number of challenges. Big data sets may not fit into available memory, may take too long to process, or may stream too quickly to store. Standard algorithms are usually not designed to process big data sets in reasonable amounts of time or memory. There is no single approach to big data. Therefore, MATLAB provides a number of tools to tackle these challenges.

Working with Big Data in MATLAB

  1. 64-bit Computing. The 64-bit version of MATLAB drastically increases the amount of data you can hold in memory, typically up to 2000 times more than any 32-bit program. While 32-bit programs are typically limited to addressing about 2 GB of memory, 64-bit MATLAB lets you address up to the physical memory limits of the OS. For Windows 8, that’s up to 512 GB for desktop editions, and 4 TB for Windows Server 2012.
  2. Memory Mapped Variables. The memmapfile function in MATLAB lets you map a file, or a portion of a file, to a MATLAB variable in memory. This allows you to efficiently access big data sets on disk that are too large to hold in memory or that take too long to load.
  3. Disk Variables. The matfile function lets you access MATLAB variables directly from MAT-files on disk, using MATLAB indexing commands, without loading the full variables into memory. This allows you to do block processing on big data sets that are otherwise too large to fit in memory.
  4. Intrinsic Multicore Math. Many of the built-in mathematical functions in MATLAB, such as fft, inv, and eig, are multithreaded. By running in parallel, these functions take full advantage of the multiple cores of your computer, providing high-performance computation on big data sets.
  5. GPU Computing. If you’re working with GPUs, GPU-optimized mathematical functions in Parallel Computing Toolbox provide even higher performance for big data sets.
  6. Parallel Computing. Parallel Computing Toolbox provides a parallel for-loop that runs your MATLAB code and algorithms in parallel on multicore computers. If you use MATLAB Distributed Computing Server, you can execute in parallel on clusters of machines that can scale up to thousands of computers.
  7. Cloud Computing. You can run MATLAB computations in parallel using MATLAB Distributed Computing Server on Amazon’s Elastic Compute Cloud (EC2) for on-demand parallel processing on hundreds or thousands of computers. Cloud computing lets you process big data without having to buy or maintain your own cluster or data center.
  8. Distributed Arrays. Using Parallel Computing Toolbox and MATLAB Distributed Computing Server, you can work with matrices and multidimensional arrays that are distributed across the memory of a cluster of computers. Using this approach, you can store and perform computations on big data sets that are too large to fit in a single computer’s memory.
  9. Streaming Algorithms. Using System objects, you can perform stream processing on incoming streams of data that are too large or too fast to hold in memory. In addition, you can generate embedded C/C++ code from your MATLAB algorithms using MATLAB Coder, and run the resulting code on high-performance real-time systems.
  10. Image Block Processing. The blockproc function in Image Processing Toolbox lets you work with really big images by processing them efficiently a block at a time. Computations run in parallel on multiple cores and GPUs when used with Parallel Computing Toolbox.
  11. Machine Learning. Machine learning is helpful for extracting insights and developing predictive models with big data sets. A wide variety of machine learning algorithms, including boosted and bagged decision trees, K-means and hierarchical clustering, K-nearest neighbor search, Gaussian mixtures, the expectation maximization algorithm, hidden Markov models, and neural networks, are available in Statistics Toolbox and Neural Network Toolbox.
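
The memory-mapped and disk-variable approaches in the list (memmapfile and matfile) can be sketched together as follows. This is a minimal illustration rather than MathWorks sample code: the file names, array sizes, and block size are made up for the demo, and the script writes small scratch files to tempdir so that it can run end to end. It assumes only base MATLAB. Partial reads through matfile are efficient only for Version 7.3 MAT-files, which is why the sketch saves with the -v7.3 flag.

```matlab
% Sketch: out-of-core access with memmapfile and matfile.
% All file names and sizes here are hypothetical, chosen for the demo.

% --- Memory-mapped file ---
binFile = fullfile(tempdir, 'bigdata_demo.bin');    % hypothetical scratch file
n = 1e6;
fid = fopen(binFile, 'w');
assert(fid ~= -1, 'Could not open %s for writing.', binFile);
fwrite(fid, 1:n, 'double');                         % write n doubles to disk
fclose(fid);

m = memmapfile(binFile, 'Format', 'double');        % map the file; nothing is bulk-read
chunk = m.Data(500001:500010);                      % only the touched region is paged in
disp(chunk.');

clear m                                             % release the mapping before deleting
delete(binFile);

% --- Disk variable in a MAT-file ---
matFile = fullfile(tempdir, 'bigdata_demo.mat');    % hypothetical scratch file
A = rand(2000, 500);
save(matFile, 'A', '-v7.3');                        % v7.3 enables efficient partial loading
clear A

mf = matfile(matFile);
[nRows, nCols] = size(mf, 'A');                     % query the size without loading A
blockSize = 500;                                    % rows per block (arbitrary choice)
colSum = zeros(1, nCols);
for r = 1:blockSize:nRows
    rows = r:min(r + blockSize - 1, nRows);
    colSum = colSum + sum(mf.A(rows, :), 1);        % read one block of rows from disk
end
colMean = colSum / nRows;                           % column means, computed block by block

clear mf
delete(matFile);
```

On a real data set you would skip the write steps and point memmapfile or matfile at an existing file. The block size is a trade-off: larger blocks mean fewer disk reads but more memory in use at any one time.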

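The parallel for-loop mentioned in the list is parfor. A small sketch, under stated assumptions: the matrix size and trial count are arbitrary, and the loop body is deliberately independent from one iteration to the next, which parfor requires. With Parallel Computing Toolbox and a pool of workers available, the iterations are divided among the workers; without the toolbox, MATLAB runs the same loop serially, so the script still executes.

```matlab
% Sketch: independent iterations spread across workers with parfor.
% nTrials and the 50-by-50 matrix size are arbitrary demo values.
nTrials = 200;
maxEig = zeros(1, nTrials);            % sliced output: iteration k writes only element k

parfor k = 1:nTrials
    A = rand(50);                      % temporary variable, private to each iteration
    maxEig(k) = max(abs(eig(A)));      % largest eigenvalue magnitude of this matrix
end

fprintf('Mean over %d trials: %.3f\n', nTrials, mean(maxEig));
```

Each iteration here calls eig, which is itself one of the multithreaded built-ins, so on a single machine the gain from parfor depends on how the work per iteration compares with the overhead of distributing it.
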